Apache NiFi Data Flow Installation

Apache NiFi is an open-source data flow management system that provides a web-based visual interface for designing, controlling, and monitoring data pipelines between systems. This guide covers installing NiFi on Linux, configuring the web UI, building data flows with processors, securing the instance, and common ETL patterns for data ingestion.

Prerequisites

  • Ubuntu 20.04+ or CentOS 8+ / Rocky Linux 8+
  • Java 11 or Java 17 (OpenJDK recommended)
  • At least 4 GB RAM (8 GB recommended for production)
  • 20 GB disk space for the flowfile, content, and provenance repositories
  • Root or sudo access

Installing Apache NiFi

# Install Java 17
# Ubuntu/Debian:
sudo apt-get update
sudo apt-get install -y openjdk-17-jre-headless

# CentOS/Rocky:
sudo dnf install -y java-17-openjdk-headless

# Verify Java
java -version

# Download NiFi (check https://nifi.apache.org/download/ for latest version)
NIFI_VERSION="2.0.0"
wget https://downloads.apache.org/nifi/${NIFI_VERSION}/nifi-${NIFI_VERSION}-bin.zip

# Verify the checksum (the Apache .sha512 file contains only the digest,
# so build the "digest  filename" line that sha512sum -c expects)
wget https://downloads.apache.org/nifi/${NIFI_VERSION}/nifi-${NIFI_VERSION}-bin.zip.sha512
echo "$(cat nifi-${NIFI_VERSION}-bin.zip.sha512)  nifi-${NIFI_VERSION}-bin.zip" | sha512sum -c

# Extract (install unzip first if it is missing)
sudo mkdir -p /opt/nifi
sudo unzip -q nifi-${NIFI_VERSION}-bin.zip -d /opt/nifi
sudo ln -s /opt/nifi/nifi-${NIFI_VERSION} /opt/nifi/current

# Create a dedicated user
sudo useradd -r -s /sbin/nologin nifi
sudo chown -R nifi:nifi /opt/nifi

# Create a data directory
sudo mkdir -p /data/nifi/{content,flowfile,provenance,database}
sudo chown -R nifi:nifi /data/nifi

Starting and Accessing NiFi

# Configure NiFi to use the data directory
sudo -u nifi nano /opt/nifi/current/conf/nifi.properties

# Key properties to change:
# nifi.content.repository.directory.default=/data/nifi/content
# nifi.flowfile.repository.directory=/data/nifi/flowfile
# nifi.provenance.repository.directory.default=/data/nifi/provenance
# nifi.database.directory=/data/nifi/database
# nifi.web.https.host=0.0.0.0
# nifi.web.https.port=8443

# Start NiFi
sudo -u nifi /opt/nifi/current/bin/nifi.sh start

# Or create a systemd service
sudo tee /etc/systemd/system/nifi.service > /dev/null << 'EOF'
[Unit]
Description=Apache NiFi
After=network.target

[Service]
Type=forking
User=nifi
Group=nifi
Environment=JAVA_HOME=/usr/lib/jvm/java-17-openjdk-amd64
ExecStart=/opt/nifi/current/bin/nifi.sh start
ExecStop=/opt/nifi/current/bin/nifi.sh stop
ExecReload=/opt/nifi/current/bin/nifi.sh restart
PIDFile=/opt/nifi/current/run/nifi.pid
SuccessExitStatus=143
Restart=on-failure

[Install]
WantedBy=multi-user.target
EOF

sudo systemctl daemon-reload
sudo systemctl enable --now nifi

# Monitor startup (takes 1-2 minutes)
tail -f /opt/nifi/current/logs/nifi-app.log | grep -E "Started|ERROR|WARN"

Get the auto-generated admin credentials:

# NiFi 2.x generates a random username and password on first start
grep -A2 "Generated Username" /opt/nifi/current/logs/nifi-app.log
# Replace them with credentials of your own (password must be at least 12 characters):
# sudo -u nifi /opt/nifi/current/bin/nifi.sh set-single-user-credentials admin 'YourStrongPassword'
# Access at https://your-server:8443/nifi

Core Concepts: Processors and Flows

NiFi's main building blocks:

  • FlowFile: a unit of data (content + attributes like filename, size, UUID)
  • Processor: performs an action on FlowFiles (GetFile, PutDatabaseRecord, RouteOnAttribute, etc.)
  • Connection: queues linking processor outputs to processor inputs
  • Process Group: a container for organizing processors (like a function)
  • Controller Service: shared resources like database connection pools

Key processor categories:

  • Ingest: GetFile, ListenHTTP, InvokeHTTP, ConsumeKafka, QueryDatabaseTable
  • Route/Decide: RouteOnAttribute, RouteOnContent
  • Transform: UpdateAttribute, ReplaceText, ConvertRecord, SplitJson, ExecuteScript
  • Egress: PutFile, PutDatabaseRecord, PublishKafka, InvokeHTTP

Note that GetHTTP and PostHTTP were removed in NiFi 2.x; InvokeHTTP covers both fetching from and posting to HTTP endpoints.

Building a Data Flow

Example: CSV file → PostgreSQL

  1. In the NiFi UI, drag a Processor onto the canvas

  2. Search for and add GetFile → configure:

    • Input Directory: /data/ingest
    • File Filter: .*\.csv
    • Keep Source File: false
  3. Add a CSVReader Controller Service:

    • Settings → Controller Services → Add → CSVReader
    • Enable it
  4. Add a DBCPConnectionPool Controller Service:

    • Add → DBCPConnectionPool
    • Database Connection URL: jdbc:postgresql://localhost:5432/mydb
    • Database Driver Class Name: org.postgresql.Driver
    • Database Driver Location(s): /opt/nifi/drivers/postgresql.jar (keep driver JARs outside NiFi's own lib directory so upgrades don't remove them)
    • Database User: myuser / Password: mypass
    • Enable it
  5. Add a PutDatabaseRecord processor → configure:

    • Record Reader: CSVReader
    • Database Connection Pooling Service: DBCPConnectionPool
    • Statement Type: INSERT
    • Table Name: my_table
  6. Connect GetFile → PutDatabaseRecord (drag from success relationship)

  7. Add a LogAttribute processor on the failure relationship to capture errors

Start processors by right-clicking → Start, or use the top toolbar to start all.
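The same flow can also be assembled programmatically through the NiFi REST API rather than the UI. A minimal sketch of step 2: the payload below describes the GetFile processor, and the commented curl call (which assumes a running instance at localhost:8443 and an access token in $TOKEN) would place it on the top-level canvas (`root` is an alias for the root process group):

```shell
# JSON payload describing the GetFile processor from step 2
PAYLOAD='{
  "revision": {"version": 0},
  "component": {
    "type": "org.apache.nifi.processors.standard.GetFile",
    "name": "GetFile",
    "config": {"properties": {"Input Directory": "/data/ingest"}}
  }
}'
printf '%s\n' "$PAYLOAD" > /tmp/getfile-processor.json
cat /tmp/getfile-processor.json

# Against a live instance, POST it onto the canvas:
# curl -sk -H "Authorization: Bearer $TOKEN" -H "Content-Type: application/json" \
#     -X POST -d "$PAYLOAD" https://localhost:8443/nifi-api/process-groups/root/processors
```

Every action available in the UI has a matching REST endpoint, which is useful for promoting flows between environments with scripts.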

Common ETL Patterns

Incremental database extraction:

QueryDatabaseTable → ConvertRecord → MergeContent → PutS3Object

Configure QueryDatabaseTable:

  • Table Name: orders
  • Maximum-value Columns: updated_at (NiFi tracks the max value automatically)
  • Columns to Return: id, customer_id, amount, updated_at

HTTP API to Kafka:

InvokeHTTP → SplitJson → UpdateAttribute → PublishKafkaRecord

Configure InvokeHTTP:

  • HTTP Method: GET
  • Remote URL: https://api.example.com/events?since=${timestamp}
  • Schedule: Run every 60 seconds

Configure SplitJson to split the response array:

  • JsonPath Expression: $.events[*]
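The split can be previewed locally before wiring it into the flow. A small sketch with a sample payload (`/tmp/events.json` is an invented example) that mirrors what SplitJson does with `$.events[*]`:

```shell
# Sample response shaped like the API output InvokeHTTP would fetch
cat > /tmp/events.json << 'EOF'
{"events": [{"id": 1, "type": "login"}, {"id": 2, "type": "purchase"}]}
EOF

# SplitJson with $.events[*] emits one FlowFile per array element;
# this prints one JSON document per line the same way
python3 - << 'EOF'
import json

with open("/tmp/events.json") as f:
    payload = json.load(f)
for event in payload["events"]:
    print(json.dumps(event))
EOF
```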

File format conversion:

GetFile → ConvertRecord → PutFile

ConvertRecord with a JsonTreeReader and a CSVRecordSetWriter converts JSON logs to CSV; set the reader's Schema Access Strategy to Infer Schema and no explicit schema is needed.
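As a sanity check, the conversion can be reproduced outside NiFi. A sketch with sample data (`/tmp/logs.json` is an invented example) that infers the CSV header from the first record's keys, much as schema inference does:

```shell
# Two JSON log records, like those a JsonTreeReader would parse
cat > /tmp/logs.json << 'EOF'
[{"ts": "2024-01-01T00:00:00Z", "level": "INFO", "msg": "started"},
 {"ts": "2024-01-01T00:01:00Z", "level": "WARN", "msg": "disk 80% full"}]
EOF

# Emit a header row from the first record's keys, then one CSV row per record
python3 - << 'EOF'
import csv
import json
import sys

with open("/tmp/logs.json") as f:
    records = json.load(f)
writer = csv.DictWriter(sys.stdout, fieldnames=list(records[0]))
writer.writeheader()
writer.writerows(records)
EOF
```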

Security Configuration

NiFi generates a self-signed certificate on first start. For production, build JKS keystores from your own certificate. (The TLS Toolkit that shipped alongside NiFi 1.x was removed in NiFi 2.x, so use openssl and keytool directly.)

# Bundle your certificate and private key into a PKCS12 keystore,
# then convert it to JKS to match the nifi.properties settings below
openssl pkcs12 -export -in server.crt -inkey server.key \
    -name nifi -out keystore.p12 -password pass:your_keystore_password
keytool -importkeystore -srckeystore keystore.p12 -srcstoretype PKCS12 \
    -srcstorepass your_keystore_password \
    -destkeystore keystore.jks -deststoretype JKS \
    -deststorepass your_keystore_password

# Build the truststore from your CA certificate
keytool -importcert -alias ca -file ca.crt \
    -keystore truststore.jks -storetype JKS \
    -storepass your_truststore_password -noprompt

# Copy to NiFi conf and fix ownership
sudo cp keystore.jks truststore.jks /opt/nifi/current/conf/
sudo chown nifi:nifi /opt/nifi/current/conf/*.jks

# Update nifi.properties
sudo -u nifi nano /opt/nifi/current/conf/nifi.properties

Key TLS settings in nifi.properties:

nifi.security.keystore=/opt/nifi/current/conf/keystore.jks
nifi.security.keystoreType=JKS
nifi.security.keystorePasswd=your_keystore_password
nifi.security.truststore=/opt/nifi/current/conf/truststore.jks
nifi.security.truststoreType=JKS
nifi.security.truststorePasswd=your_truststore_password

Configure LDAP or OIDC for user authentication in login-identity-providers.xml and authorizers.xml.
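For LDAP, this means adding a provider entry to login-identity-providers.xml and pointing nifi.properties at it. A sketch (the DNs, host, and password are placeholders for your directory):

```xml
<provider>
    <identifier>ldap-provider</identifier>
    <class>org.apache.nifi.ldap.LdapProvider</class>
    <property name="Authentication Strategy">SIMPLE</property>
    <property name="Manager DN">cn=admin,dc=example,dc=com</property>
    <property name="Manager Password">secret</property>
    <property name="Url">ldap://ldap.example.com:389</property>
    <property name="User Search Base">ou=users,dc=example,dc=com</property>
    <property name="User Search Filter">uid={0}</property>
    <property name="Identity Strategy">USE_USERNAME</property>
    <property name="Authentication Expiration">12 hours</property>
</provider>
```

Then set nifi.security.user.login.identity.provider=ldap-provider in nifi.properties and restart NiFi. Users authenticated this way still need authorization policies granted in authorizers.xml or via the UI.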

Clustering NiFi

For a 3-node NiFi cluster with ZooKeeper:

# Install ZooKeeper (or use NiFi's embedded ZooKeeper)
# Edit nifi.properties on all nodes:

# Node 1: 192.168.1.10
nifi.cluster.is.node=true
nifi.cluster.node.address=192.168.1.10
nifi.cluster.node.protocol.port=11443
nifi.state.management.provider.cluster=zk-provider
nifi.zookeeper.connect.string=192.168.1.10:2181,192.168.1.11:2181,192.168.1.12:2181

# Enable embedded ZooKeeper on each node participating in the quorum
# (use an odd number of ZooKeeper servers; here, all three nodes)
nifi.state.management.embedded.zookeeper.start=true

Edit the zk-provider Connect String in state-management.xml to match the ZooKeeper ensemble, then start all nodes. One node is elected Cluster Coordinator (and one Primary Node) automatically.
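When using the embedded ZooKeeper, each participating node also needs the ensemble listed in conf/zookeeper.properties plus a myid file. A sketch for the three addresses above (the `;2181` suffix is the client port in ZooKeeper 3.5+ notation):

```
# conf/zookeeper.properties (identical on all three nodes)
initLimit=10
syncLimit=5
tickTime=2000
dataDir=./state/zookeeper
server.1=192.168.1.10:2888:3888;2181
server.2=192.168.1.11:2888:3888;2181
server.3=192.168.1.12:2888:3888;2181
```

Then on each node write its server number into the myid file, e.g. on node 1: echo 1 | sudo -u nifi tee /opt/nifi/current/state/zookeeper/myid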

Troubleshooting

NiFi won't start:

tail -100 /opt/nifi/current/logs/nifi-app.log
# Check for "java.lang.OutOfMemoryError" or port conflicts

Increase JVM heap for large flows:

# Edit /opt/nifi/current/conf/jvm.options
-Xms2g
-Xmx4g

Processor shows "Invalid" state:

  • Check the processor configuration for missing required properties
  • Verify Controller Services are enabled (not just created)

Back pressure / queue full:

# Each connection has back pressure thresholds
# Right-click connection → Configure → Back Pressure
# Increase "Back Pressure Object Threshold" or "Data Size Threshold"

Check data provenance:

  • Right-click any processor → View data provenance
  • Shows full lineage of every FlowFile processed

FlowFile stuck in queue:

# List queue contents
# Right-click connection → List queue
# Download or empty queue as needed

Conclusion

Apache NiFi provides a visual, drag-and-drop approach to building data pipelines that handles back pressure, data provenance, and reliable delivery out of the box. Its extensive processor library covers most ingestion and transformation needs without custom code, and the web UI makes it easy for data engineers to monitor flow health in real time. For production deployments, use a 3-node cluster with external ZooKeeper and configure TLS with certificate-based authentication.