Apache NiFi Data Flow Installation

Apache NiFi is an open-source data flow management system that provides a web-based visual interface for designing, controlling, and monitoring data pipelines between systems. This guide covers installing NiFi on Linux, configuring the web UI, building data flows with processors, securing the instance, and common ETL patterns for data ingestion.

Prerequisites

  • Ubuntu 20.04+ or CentOS 8+ / Rocky Linux 8+
  • Java 11 or Java 17 (OpenJDK recommended)
  • At least 4 GB RAM (8 GB recommended for production)
  • 20 GB disk space for the flowfile, content, and provenance repositories
  • Root or sudo access

Installing Apache NiFi

# Install Java 17
# Ubuntu/Debian:
sudo apt-get update
sudo apt-get install -y openjdk-17-jre-headless

# CentOS/Rocky:
sudo dnf install -y java-17-openjdk-headless

# Verify Java
java -version

# Download NiFi (check https://nifi.apache.org/download/ for latest version)
NIFI_VERSION="2.0.0"
wget https://downloads.apache.org/nifi/${NIFI_VERSION}/nifi-${NIFI_VERSION}-bin.zip

# Verify the checksum (the Apache .sha512 file contains only the digest,
# so build the "digest  filename" line that sha512sum -c expects)
wget https://downloads.apache.org/nifi/${NIFI_VERSION}/nifi-${NIFI_VERSION}-bin.zip.sha512
echo "$(cat nifi-${NIFI_VERSION}-bin.zip.sha512)  nifi-${NIFI_VERSION}-bin.zip" | sha512sum -c

# Extract (install unzip first if it is missing)
sudo mkdir -p /opt/nifi
sudo unzip -q nifi-${NIFI_VERSION}-bin.zip -d /opt/nifi
sudo ln -s /opt/nifi/nifi-${NIFI_VERSION} /opt/nifi/current

# Create a dedicated user
sudo useradd -r -s /sbin/nologin nifi
sudo chown -R nifi:nifi /opt/nifi

# Create a data directory
sudo mkdir -p /data/nifi/{content,flowfile,provenance,database}
sudo chown -R nifi:nifi /data/nifi

Starting and Accessing NiFi

# Configure NiFi to use the data directory
sudo -u nifi nano /opt/nifi/current/conf/nifi.properties

# Key properties to change:
# nifi.content.repository.directory.default=/data/nifi/content
# nifi.flowfile.repository.directory=/data/nifi/flowfile
# nifi.provenance.repository.directory.default=/data/nifi/provenance
# nifi.database.directory=/data/nifi/database
# nifi.web.https.host=0.0.0.0
# nifi.web.https.port=8443

# Start NiFi
sudo -u nifi /opt/nifi/current/bin/nifi.sh start

# Or create a systemd service
sudo tee /etc/systemd/system/nifi.service > /dev/null << 'EOF'
[Unit]
Description=Apache NiFi
After=network.target

[Service]
Type=forking
User=nifi
Group=nifi
Environment=JAVA_HOME=/usr/lib/jvm/java-17-openjdk-amd64
ExecStart=/opt/nifi/current/bin/nifi.sh start
ExecStop=/opt/nifi/current/bin/nifi.sh stop
ExecReload=/opt/nifi/current/bin/nifi.sh restart
PIDFile=/opt/nifi/current/run/nifi.pid
SuccessExitStatus=143
Restart=on-failure

[Install]
WantedBy=multi-user.target
EOF

sudo systemctl daemon-reload
sudo systemctl enable --now nifi

# Monitor startup (takes 1-2 minutes)
tail -f /opt/nifi/current/logs/nifi-app.log | grep -E "Started|ERROR|WARN"

Get the auto-generated admin credentials:

# NiFi 2.x generates a random username and password on first start
grep -A2 "Generated Username" /opt/nifi/current/logs/nifi-app.log
# Replace them with credentials of your own (password must be at least 12 characters):
# sudo -u nifi /opt/nifi/current/bin/nifi.sh set-single-user-credentials admin 'YourStrongPassword'
# Access at https://your-server:8443/nifi

Core Concepts: Processors and Flows

NiFi's main building blocks:

  • FlowFile: a unit of data (content + attributes like filename, size, UUID)
  • Processor: performs an action on FlowFiles (GetFile, PutDatabaseRecord, RouteOnAttribute, etc.)
  • Connection: queues linking processor outputs to processor inputs
  • Process Group: a container for organizing processors (like a function)
  • Controller Service: shared resources like database connection pools

Key processor categories:

  • Ingest: GetFile, ListenHTTP, InvokeHTTP, ConsumeKafka, QueryDatabaseTable
  • Route/Decide: RouteOnAttribute, RouteOnContent
  • Transform: UpdateAttribute, ReplaceText, ConvertRecord, SplitJson, ExecuteScript
  • Egress: PutFile, PutDatabaseRecord, PublishKafka, InvokeHTTP

Note that GetHTTP and PostHTTP were removed in NiFi 2.x; InvokeHTTP covers both fetching from and posting to HTTP endpoints.

Building a Data Flow

Example: CSV file → PostgreSQL

  1. In the NiFi UI, drag a Processor onto the canvas

  2. Search for and add GetFile → configure:

    • Input Directory: /data/ingest
    • File Filter: .*\.csv
    • Keep Source File: false
  3. Add a CSVReader Controller Service:

    • Settings → Controller Services → Add → CSVReader
    • Enable it
  4. Add a DBCPConnectionPool Controller Service:

    • Add → DBCPConnectionPool
    • Database Connection URL: jdbc:postgresql://localhost:5432/mydb
    • Database Driver Class Name: org.postgresql.Driver
    • Database Driver Location(s): /opt/nifi/drivers/postgresql.jar (keep driver JARs outside NiFi's own lib directory so upgrades don't remove them)
    • Database User: myuser / Password: mypass
    • Enable it
  5. Add a PutDatabaseRecord processor → configure:

    • Record Reader: CSVReader
    • Database Connection Pooling Service: DBCPConnectionPool
    • Statement Type: INSERT
    • Table Name: my_table
  6. Connect GetFile → PutDatabaseRecord (drag from success relationship)

  7. Add a LogAttribute processor on the failure relationship to capture errors

Start processors by right-clicking → Start, or use the top toolbar to start all.
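The same flow can also be assembled programmatically through the NiFi REST API rather than the UI. A minimal sketch of step 2: the payload below describes the GetFile processor, and the commented curl call (which assumes a running instance at localhost:8443 and an access token in $TOKEN) would place it on the top-level canvas (`root` is an alias for the root process group):

```shell
# JSON payload describing the GetFile processor from step 2
PAYLOAD='{
  "revision": {"version": 0},
  "component": {
    "type": "org.apache.nifi.processors.standard.GetFile",
    "name": "GetFile",
    "config": {"properties": {"Input Directory": "/data/ingest"}}
  }
}'
printf '%s\n' "$PAYLOAD" > /tmp/getfile-processor.json
cat /tmp/getfile-processor.json

# Against a live instance, POST it onto the canvas:
# curl -sk -H "Authorization: Bearer $TOKEN" -H "Content-Type: application/json" \
#     -X POST -d "$PAYLOAD" https://localhost:8443/nifi-api/process-groups/root/processors
```

Every action available in the UI has a matching REST endpoint, which is useful for promoting flows between environments with scripts.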

Common ETL Patterns

Incremental database extraction:

QueryDatabaseTable → ConvertRecord → MergeContent → PutS3Object

Configure QueryDatabaseTable:

  • Table Name: orders
  • Maximum-value Columns: updated_at (NiFi tracks the max value automatically)
  • Columns to Return: id, customer_id, amount, updated_at

HTTP API to Kafka:

InvokeHTTP → SplitJson → UpdateAttribute → PublishKafkaRecord

Configure InvokeHTTP:

  • HTTP Method: GET
  • Remote URL: https://api.example.com/events?since=${timestamp}
  • Schedule: Run every 60 seconds

Configure SplitJson to split the response array:

  • JsonPath Expression: $.events[*]
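The split can be previewed locally before wiring it into the flow. A small sketch with a sample payload (`/tmp/events.json` is an invented example) that mirrors what SplitJson does with `$.events[*]`:

```shell
# Sample response shaped like the API output InvokeHTTP would fetch
cat > /tmp/events.json << 'EOF'
{"events": [{"id": 1, "type": "login"}, {"id": 2, "type": "purchase"}]}
EOF

# SplitJson with $.events[*] emits one FlowFile per array element;
# this prints one JSON document per line the same way
python3 - << 'EOF'
import json

with open("/tmp/events.json") as f:
    payload = json.load(f)
for event in payload["events"]:
    print(json.dumps(event))
EOF
```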

File format conversion:

GetFile → ConvertRecord → PutFile

ConvertRecord with a JsonTreeReader and a CSVRecordSetWriter converts JSON logs to CSV; set the reader's Schema Access Strategy to Infer Schema and no explicit schema is needed.
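As a sanity check, the conversion can be reproduced outside NiFi. A sketch with sample data (`/tmp/logs.json` is an invented example) that infers the CSV header from the first record's keys, much as schema inference does:

```shell
# Two JSON log records, like those a JsonTreeReader would parse
cat > /tmp/logs.json << 'EOF'
[{"ts": "2024-01-01T00:00:00Z", "level": "INFO", "msg": "started"},
 {"ts": "2024-01-01T00:01:00Z", "level": "WARN", "msg": "disk 80% full"}]
EOF

# Emit a header row from the first record's keys, then one CSV row per record
python3 - << 'EOF'
import csv
import json
import sys

with open("/tmp/logs.json") as f:
    records = json.load(f)
writer = csv.DictWriter(sys.stdout, fieldnames=list(records[0]))
writer.writeheader()
writer.writerows(records)
EOF
```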

Security Configuration

NiFi generates a self-signed certificate on first start. For production, build JKS keystores from your own certificate. (The TLS Toolkit that shipped alongside NiFi 1.x was removed in NiFi 2.x, so use openssl and keytool directly.)

# Bundle your certificate and private key into a PKCS12 keystore,
# then convert it to JKS to match the nifi.properties settings below
openssl pkcs12 -export -in server.crt -inkey server.key \
    -name nifi -out keystore.p12 -password pass:your_keystore_password
keytool -importkeystore -srckeystore keystore.p12 -srcstoretype PKCS12 \
    -srcstorepass your_keystore_password \
    -destkeystore keystore.jks -deststoretype JKS \
    -deststorepass your_keystore_password

# Build the truststore from your CA certificate
keytool -importcert -alias ca -file ca.crt \
    -keystore truststore.jks -storetype JKS \
    -storepass your_truststore_password -noprompt

# Copy to NiFi conf and fix ownership
sudo cp keystore.jks truststore.jks /opt/nifi/current/conf/
sudo chown nifi:nifi /opt/nifi/current/conf/*.jks

# Update nifi.properties
sudo -u nifi nano /opt/nifi/current/conf/nifi.properties

Key TLS settings in nifi.properties:

nifi.security.keystore=/opt/nifi/current/conf/keystore.jks
nifi.security.keystoreType=JKS
nifi.security.keystorePasswd=your_keystore_password
nifi.security.truststore=/opt/nifi/current/conf/truststore.jks
nifi.security.truststoreType=JKS
nifi.security.truststorePasswd=your_truststore_password

Configure LDAP or OIDC for user authentication in login-identity-providers.xml and authorizers.xml.
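For LDAP, this means adding a provider entry to login-identity-providers.xml and pointing nifi.properties at it. A sketch (the DNs, host, and password are placeholders for your directory):

```xml
<provider>
    <identifier>ldap-provider</identifier>
    <class>org.apache.nifi.ldap.LdapProvider</class>
    <property name="Authentication Strategy">SIMPLE</property>
    <property name="Manager DN">cn=admin,dc=example,dc=com</property>
    <property name="Manager Password">secret</property>
    <property name="Url">ldap://ldap.example.com:389</property>
    <property name="User Search Base">ou=users,dc=example,dc=com</property>
    <property name="User Search Filter">uid={0}</property>
    <property name="Identity Strategy">USE_USERNAME</property>
    <property name="Authentication Expiration">12 hours</property>
</provider>
```

Then set nifi.security.user.login.identity.provider=ldap-provider in nifi.properties and restart NiFi. Users authenticated this way still need authorization policies granted in authorizers.xml or via the UI.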

Clustering NiFi

For a 3-node NiFi cluster with ZooKeeper:

# Install ZooKeeper (or use NiFi's embedded ZooKeeper)
# Edit nifi.properties on all nodes:

# Node 1: 192.168.1.10
nifi.cluster.is.node=true
nifi.cluster.node.address=192.168.1.10
nifi.cluster.node.protocol.port=11443
nifi.state.management.provider.cluster=zk-provider
nifi.zookeeper.connect.string=192.168.1.10:2181,192.168.1.11:2181,192.168.1.12:2181

# Enable embedded ZooKeeper on each node participating in the quorum
# (use an odd number of ZooKeeper servers; here, all three nodes)
nifi.state.management.embedded.zookeeper.start=true

Edit the zk-provider Connect String in state-management.xml to match the ZooKeeper ensemble, then start all nodes. One node is elected Cluster Coordinator (and one Primary Node) automatically.
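When using the embedded ZooKeeper, each participating node also needs the ensemble listed in conf/zookeeper.properties plus a myid file. A sketch for the three addresses above (the `;2181` suffix is the client port in ZooKeeper 3.5+ notation):

```
# conf/zookeeper.properties (identical on all three nodes)
initLimit=10
syncLimit=5
tickTime=2000
dataDir=./state/zookeeper
server.1=192.168.1.10:2888:3888;2181
server.2=192.168.1.11:2888:3888;2181
server.3=192.168.1.12:2888:3888;2181
```

Then on each node write its server number into the myid file, e.g. on node 1: echo 1 | sudo -u nifi tee /opt/nifi/current/state/zookeeper/myid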

Troubleshooting

NiFi won't start:

tail -100 /opt/nifi/current/logs/nifi-app.log
# Check for "java.lang.OutOfMemoryError" or port conflicts

Increase JVM heap for large flows:

# Edit /opt/nifi/current/conf/jvm.options
-Xms2g
-Xmx4g

Processor shows "Invalid" state:

  • Check the processor configuration for missing required properties
  • Verify Controller Services are enabled (not just created)

Back pressure / queue full:

# Each connection has back pressure thresholds
# Right-click connection → Configure → Back Pressure
# Increase "Back Pressure Object Threshold" or "Data Size Threshold"

Check data provenance:

  • Right-click any processor → View data provenance
  • Shows full lineage of every FlowFile processed

FlowFile stuck in queue:

# List queue contents
# Right-click connection → List queue
# Download or empty queue as needed

Conclusion

Apache NiFi provides a visual, drag-and-drop approach to building data pipelines that handles back pressure, data provenance, and reliable delivery out of the box. Its extensive processor library covers most ingestion and transformation needs without custom code, and the web UI makes it easy for data engineers to monitor flow health in real time. For production deployments, use a 3-node cluster with external ZooKeeper and configure TLS with certificate-based authentication.