Apache NiFi Data Flow Installation
Apache NiFi is an open-source data flow management system that provides a web-based visual interface for designing, controlling, and monitoring data pipelines between systems. This guide covers installing NiFi on Linux, configuring the web UI, building data flows with processors, securing the instance, and common ETL patterns for data ingestion.
Prerequisites
- Ubuntu 20.04+ or CentOS 8+ / Rocky Linux 8+
- Java 11 or Java 17 (OpenJDK recommended)
- At least 4 GB RAM (8 GB recommended for production)
- 20 GB disk space for data flow and content repository
- Root or sudo access
Installing Apache NiFi
# Install Java 17
# Ubuntu/Debian:
sudo apt-get update
sudo apt-get install -y openjdk-17-jre-headless
# CentOS/Rocky:
sudo dnf install -y java-17-openjdk-headless
# Verify Java
java -version
# Download NiFi (check https://nifi.apache.org/download/ for latest version)
NIFI_VERSION="2.0.0"
wget https://downloads.apache.org/nifi/${NIFI_VERSION}/nifi-${NIFI_VERSION}-bin.zip
# Verify checksum
wget https://downloads.apache.org/nifi/${NIFI_VERSION}/nifi-${NIFI_VERSION}-bin.zip.sha256
sha256sum -c nifi-${NIFI_VERSION}-bin.zip.sha256
# Extract
sudo mkdir -p /opt/nifi
sudo unzip nifi-${NIFI_VERSION}-bin.zip -d /opt/nifi
sudo ln -s /opt/nifi/nifi-${NIFI_VERSION} /opt/nifi/current
# Create a dedicated user
sudo useradd -r -s /sbin/nologin nifi
sudo chown -R nifi:nifi /opt/nifi
# Create a data directory
sudo mkdir -p /data/nifi/{content,flowfile,provenance,database}
sudo chown -R nifi:nifi /data/nifi
Starting and Accessing NiFi
# Configure NiFi to use the data directory
sudo -u nifi nano /opt/nifi/current/conf/nifi.properties
# Key properties to change:
# nifi.content.repository.directory.default=/data/nifi/content
# nifi.flowfile.repository.directory=/data/nifi/flowfile
# nifi.provenance.repository.directory.default=/data/nifi/provenance
# nifi.database.directory=/data/nifi/database
# nifi.web.https.host=0.0.0.0
# nifi.web.https.port=8443
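The repository-path edits above can also be scripted rather than made in an editor. A minimal sed sketch (shown against a seeded scratch copy so it is self-contained; set NIFI_CONF to /opt/nifi/current/conf/nifi.properties on a real install):

```shell
# Non-interactive edit of nifi.properties (NIFI_CONF defaults to a scratch file)
NIFI_CONF="${NIFI_CONF:-$(mktemp)}"

# Seed the scratch copy with the stock default values
cat > "$NIFI_CONF" << 'EOF'
nifi.content.repository.directory.default=./content_repository
nifi.flowfile.repository.directory=./flowfile_repository
nifi.provenance.repository.directory.default=./provenance_repository
nifi.database.directory=./database_repository
EOF

# Rewrite each repository path to the dedicated data directory
sed -i \
  -e 's|^nifi.content.repository.directory.default=.*|nifi.content.repository.directory.default=/data/nifi/content|' \
  -e 's|^nifi.flowfile.repository.directory=.*|nifi.flowfile.repository.directory=/data/nifi/flowfile|' \
  -e 's|^nifi.provenance.repository.directory.default=.*|nifi.provenance.repository.directory.default=/data/nifi/provenance|' \
  -e 's|^nifi.database.directory=.*|nifi.database.directory=/data/nifi/database|' \
  "$NIFI_CONF"

grep '^nifi\.' "$NIFI_CONF"
```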
# Start NiFi
sudo -u nifi /opt/nifi/current/bin/nifi.sh start
# Or create a systemd service
sudo tee /etc/systemd/system/nifi.service > /dev/null << 'EOF'
[Unit]
Description=Apache NiFi
After=network.target
[Service]
Type=forking
User=nifi
Group=nifi
# Adjust JAVA_HOME to your distribution's JDK path
Environment=JAVA_HOME=/usr/lib/jvm/java-17-openjdk-amd64
ExecStart=/opt/nifi/current/bin/nifi.sh start
ExecStop=/opt/nifi/current/bin/nifi.sh stop
ExecReload=/opt/nifi/current/bin/nifi.sh restart
PIDFile=/opt/nifi/current/run/nifi.pid
SuccessExitStatus=143
Restart=on-failure
[Install]
WantedBy=multi-user.target
EOF
sudo systemctl daemon-reload
sudo systemctl enable --now nifi
# Monitor startup (takes 1-2 minutes)
tail -f /opt/nifi/current/logs/nifi-app.log | grep -E "Started|ERROR|WARN"
Get the auto-generated admin credentials:
# NiFi 2.x generates credentials on first start
grep -A2 "Generated Username" /opt/nifi/current/logs/nifi-app.log
# Access at https://your-server:8443/nifi
Core Concepts: Processors and Flows
NiFi's main building blocks:
- FlowFile: a unit of data (content + attributes like filename, size, UUID)
- Processor: performs an action on FlowFiles (GetFile, PutDatabaseRecord, RouteOnAttribute, etc.)
- Connection: queues linking processor outputs to processor inputs
- Process Group: a container for organizing processors (like a function)
- Controller Service: shared resources like database connection pools
Key processor categories:
- Ingest: GetFile, InvokeHTTP, ListenHTTP, ConsumeKafka, QueryDatabaseTable
- Route/Decide: RouteOnAttribute, RouteOnContent, SplitJson
- Transform: UpdateAttribute, ReplaceText, ConvertRecord, ExecuteScript
- Egress: PutFile, PutDatabaseRecord, PublishKafka, InvokeHTTP (GetHTTP/PostHTTP were removed in NiFi 2.x)
Building a Data Flow
Example: CSV file → PostgreSQL
1. In the NiFi UI, drag a Processor onto the canvas.
2. Search for and add GetFile, then configure:
   - Input Directory: /data/ingest
   - File Filter: .*\.csv
   - Keep Source File: false
3. Add a CSVReader Controller Service:
   - Settings → Controller Services → Add → CSVReader
   - Enable it
4. Add a DBCPConnectionPool Controller Service:
   - Add → DBCPConnectionPool
   - Database Connection URL: jdbc:postgresql://localhost:5432/mydb
   - Database Driver Class Name: org.postgresql.Driver
   - Database Driver Location(s): /opt/nifi/current/lib/postgresql.jar
   - Database User: myuser / Password: mypass
   - Enable it
5. Add a PutDatabaseRecord processor, then configure:
   - Record Reader: CSVReader
   - Database Connection Pooling Service: DBCPConnectionPool
   - Statement Type: INSERT
   - Table Name: my_table
6. Connect GetFile → PutDatabaseRecord (drag from the success relationship).
7. Add a LogAttribute processor on the failure relationship to capture errors.
Start processors by right-clicking → Start, or use the top toolbar to start all.
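Before starting the processors, it helps to stage a test file whose header matches the target table's columns. A minimal smoke-test sketch (the id/name/amount columns are made up for illustration; it writes to a scratch directory so it runs anywhere, but point INGEST_DIR at /data/ingest on a real install):

```shell
# Sample CSV for a hypothetical my_table(id, name, amount) schema
INGEST_DIR="${INGEST_DIR:-$(mktemp -d)}"   # use /data/ingest on a real install

cat > "$INGEST_DIR/sample.csv" << 'EOF'
id,name,amount
1,alpha,10.50
2,beta,7.25
EOF

# GetFile picks this up on its next scheduling interval;
# PutDatabaseRecord maps the CSV header to column names for the INSERT
wc -l "$INGEST_DIR/sample.csv"
```

If the rows never reach the table, check the failure relationship's LogAttribute output for type-mismatch or constraint errors.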
Common ETL Patterns
Incremental database extraction:
QueryDatabaseTable → ConvertRecord → MergeContent → PutS3Object
Configure QueryDatabaseTable:
- Table Name: orders
- Maximum-value Columns: updated_at (NiFi tracks the highest value seen automatically)
- Columns to Return: id, customer_id, amount, updated_at
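The incremental behavior boils down to "select rows where updated_at exceeds the last maximum seen, then remember the new maximum." A rough shell analogue over a CSV export (illustration only; the file paths and data are made up):

```shell
# Demo data standing in for the orders table
cat > /tmp/orders.csv << 'EOF'
id,customer_id,amount,updated_at
1,100,25.00,2024-01-01
2,101,40.00,2024-01-03
3,100,12.50,2024-01-05
EOF

LAST_MAX="2024-01-02"   # the state NiFi persists between runs

# Emit only rows newer than the stored maximum (skip the header row)
awk -F, -v last="$LAST_MAX" 'NR > 1 && $4 > last' /tmp/orders.csv > /tmp/orders_delta.csv

# Record the new maximum for the next run
NEW_MAX=$(awk -F, 'NR > 1 { if ($4 > m) m = $4 } END { print m }' /tmp/orders.csv)
echo "new max: $NEW_MAX"
```

NiFi stores this state internally (locally, or in ZooKeeper when clustered), so the processor survives restarts without re-extracting old rows.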
HTTP API to Kafka:
InvokeHTTP → SplitJson → UpdateAttribute → PublishKafkaRecord
Configure InvokeHTTP:
- HTTP Method: GET
- Remote URL: https://api.example.com/events?since=${timestamp}
- Scheduling: run every 60 seconds
Configure SplitJson to split the response array:
- JsonPath Expression: $.events[*]
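What SplitJson does with that expression is emit one FlowFile per array element. A small analogue using python3 from the shell (the response shape is a made-up example):

```shell
# Stand-in for the InvokeHTTP response body
echo '{"events":[{"id":1},{"id":2},{"id":3}]}' > /tmp/response.json

python3 - << 'EOF' > /tmp/split.out
import json

with open("/tmp/response.json") as f:
    doc = json.load(f)

for event in doc["events"]:       # equivalent of JsonPath $.events[*]
    print(json.dumps(event))      # each line = one split FlowFile's content
EOF

cat /tmp/split.out
```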
File format conversion:
GetFile → ConvertRecord → PutFile
ConvertRecord with a JsonTreeReader (Schema Access Strategy: Infer Schema) and a CSVRecordSetWriter converts JSON logs to CSV without any hand-written schema.
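A rough equivalent of that reader/writer pair, for NDJSON logs (the ts/level/msg field names are made up for illustration):

```shell
# Newline-delimited JSON log sample
cat > /tmp/logs.ndjson << 'EOF'
{"ts":"2024-01-01T00:00:00Z","level":"INFO","msg":"started"}
{"ts":"2024-01-01T00:00:05Z","level":"WARN","msg":"slow disk"}
EOF

python3 - << 'EOF' > /tmp/logs.csv
import csv, json, sys

with open("/tmp/logs.ndjson") as f:
    rows = [json.loads(line) for line in f]

# "Infer the schema" from the first record's keys, then write CSV
writer = csv.DictWriter(sys.stdout, fieldnames=list(rows[0].keys()))
writer.writeheader()
writer.writerows(rows)
EOF

cat /tmp/logs.csv
```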
Security Configuration
NiFi generates a self-signed TLS certificate by default. For production, use your own certificate:
# Generate a keystore using the TLS Toolkit, which ships with the separate
# NiFi Toolkit distribution (1.x only; it was removed from the 2.x toolkit,
# where you should generate the keystore with your CA or keytool instead)
/path/to/nifi-toolkit/bin/tls-toolkit.sh standalone \
  -n your-server.example.com \
  -C "CN=admin,OU=NIFI" \
  -o /opt/nifi/certs
# Copy generated files to NiFi conf
sudo cp /opt/nifi/certs/your-server.example.com/keystore.jks /opt/nifi/current/conf/
sudo cp /opt/nifi/certs/your-server.example.com/truststore.jks /opt/nifi/current/conf/
sudo chown nifi:nifi /opt/nifi/current/conf/*.jks
# Update nifi.properties
sudo -u nifi nano /opt/nifi/current/conf/nifi.properties
Key TLS settings in nifi.properties:
nifi.security.keystore=/opt/nifi/current/conf/keystore.jks
nifi.security.keystoreType=JKS
nifi.security.keystorePasswd=your_keystore_password
nifi.security.truststore=/opt/nifi/current/conf/truststore.jks
nifi.security.truststoreType=JKS
nifi.security.truststorePasswd=your_truststore_password
Configure LDAP or OIDC for user authentication in login-identity-providers.xml and authorizers.xml.
Clustering NiFi
For a 3-node NiFi cluster with ZooKeeper:
# Install ZooKeeper (or use NiFi's embedded ZooKeeper)
# Edit nifi.properties on all nodes:
# Node 1: 192.168.1.10
nifi.cluster.is.node=true
nifi.cluster.node.address=192.168.1.10
nifi.cluster.node.protocol.port=11443
nifi.state.management.provider.cluster=zk-provider
nifi.zookeeper.connect.string=192.168.1.10:2181,192.168.1.11:2181,192.168.1.12:2181
# To use NiFi's embedded ZooKeeper instead, enable it on each node listed in
# the connect string and define the server list in conf/zookeeper.properties
nifi.state.management.embedded.zookeeper.start=true
Edit state-management.xml to point to the ZooKeeper cluster, then start all nodes. One node becomes the Primary Node and Cluster Coordinator automatically.
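The zk-provider entry in state-management.xml typically looks like the following (property names as in the stock file shipped with NiFi; adjust the connect string to your quorum):

```xml
<cluster-provider>
    <id>zk-provider</id>
    <class>org.apache.nifi.controller.state.providers.zookeeper.ZooKeeperStateProvider</class>
    <property name="Connect String">192.168.1.10:2181,192.168.1.11:2181,192.168.1.12:2181</property>
    <property name="Root Node">/nifi</property>
    <property name="Session Timeout">10 seconds</property>
    <property name="Access Control">Open</property>
</cluster-provider>
```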
Troubleshooting
NiFi won't start:
tail -100 /opt/nifi/current/logs/nifi-app.log
# Check for "java.lang.OutOfMemoryError" or port conflicts
Increase JVM heap for large flows:
# Edit /opt/nifi/current/conf/jvm.options
-Xms2g
-Xmx4g
Processor shows "Invalid" state:
- Check the processor configuration for missing required properties
- Verify Controller Services are enabled (not just created)
Back pressure / queue full:
# Each connection has back pressure thresholds
# Right-click connection → Configure → Back Pressure
# Increase "Back Pressure Object Threshold" or "Data Size Threshold"
Check data provenance:
- Right-click any processor → View data provenance
- Shows full lineage of every FlowFile processed
Flow file stuck in queue:
# List queue contents
# Right-click connection → List queue
# Download or empty queue as needed
Conclusion
Apache NiFi provides a visual, drag-and-drop approach to building data pipelines that handles back pressure, data provenance, and reliable delivery out of the box. Its extensive processor library covers most ingestion and transformation needs without custom code, and the web UI makes it easy for data engineers to monitor flow health in real time. For production deployments, use a 3-node cluster with external ZooKeeper and configure TLS with certificate-based authentication.