Rundeck Job Automation and Incident Response

Rundeck is an open-source runbook automation platform that enables teams to create self-service operations, schedule jobs, and coordinate incident response across nodes. This guide covers installing Rundeck on Linux, defining jobs, managing nodes, configuring ACL policies, and building incident response runbooks.

Prerequisites

  • Ubuntu 20.04/22.04 or CentOS 8/Rocky Linux 8+
  • Java 11+ (OpenJDK)
  • At least 2 GB RAM (4 GB recommended)
  • SSH access to target nodes
  • sudo privileges on the Rundeck server

Install Rundeck on Linux

Ubuntu/Debian:

# Install Java
sudo apt update
sudo apt install -y openjdk-11-jdk-headless

# Add Rundeck repository
curl -s https://packagecloud.io/install/repositories/pagerduty/rundeck/script.deb.sh | sudo bash

# Install Rundeck
sudo apt install -y rundeck

# Enable and start the service
sudo systemctl enable rundeck
sudo systemctl start rundeck

# Check status
sudo systemctl status rundeck

CentOS/Rocky Linux:

# Install Java
sudo dnf install -y java-11-openjdk-headless

# Add Rundeck repository
curl -s https://packagecloud.io/install/repositories/pagerduty/rundeck/script.rpm.sh | sudo bash

# Install Rundeck
sudo dnf install -y rundeck

sudo systemctl enable --now rundeck

Rundeck listens on port 4440 by default. Open it in your firewall:

# Ubuntu (UFW)
sudo ufw allow 4440/tcp

# CentOS (firewalld)
sudo firewall-cmd --permanent --add-port=4440/tcp
sudo firewall-cmd --reload

Access the web UI at http://your-server:4440. Default credentials are admin / admin.
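
Rundeck can take a minute or more to finish booting after the service starts, so the UI may not answer right away. The sketch below polls the URL until it responds. `wait_for_rundeck` is a hypothetical helper name, not part of Rundeck, and the attempt count and delay are arbitrary defaults.

```shell
# Poll a URL until it answers, or give up after a number of attempts.
# wait_for_rundeck is a hypothetical helper, not a Rundeck command.
wait_for_rundeck() {
  url=$1
  attempts=${2:-30}   # how many times to try
  delay=${3:-5}       # seconds to wait between tries
  i=1
  while [ "$i" -le "$attempts" ]; do
    # -f: treat HTTP 4xx/5xx as failure, -s: silent, -o /dev/null: discard the body
    if curl -fs -o /dev/null "$url"; then
      echo "Rundeck is answering at $url"
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  echo "No response from $url after $attempts attempts" >&2
  return 1
}
```

For example, `wait_for_rundeck http://localhost:4440 30 5` waits up to roughly two and a half minutes before giving up.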

Initial Setup and Projects

Rundeck organizes work into projects. Create your first project:

# Via CLI (rd tool)
# Install the rd CLI (on CentOS/Rocky Linux: sudo dnf install -y rundeck-cli)
sudo apt install -y rundeck-cli

# Configure rd CLI
export RD_URL=http://localhost:4440
export RD_USER=admin
export RD_PASSWORD=admin

# Create a project
rd projects create --project myops -- \
  --project.name=myops \
  --project.description="Operations Project"

# List projects
rd projects list

Via the web UI: New Project > Enter name > Save.

Change the admin password immediately after first login:

# Edit the realm.properties file
sudo nano /etc/rundeck/realm.properties
# Change: admin:admin,user,admin,architect,deploy,build
# To: admin:NewSecurePassword,user,admin,architect,deploy,build

sudo systemctl restart rundeck
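
The edit above leaves the password in plain text. Jetty-style realm files also accept hashed credentials with an `MD5:` prefix, so one option is to store a hash instead. The sketch below builds such a line. `realm_md5_entry` is a hypothetical helper, it assumes `md5sum` is installed, and it assumes your Rundeck version accepts the `MD5:` form. MD5 is weak, so prefer a stronger scheme if your version supports one.

```shell
# Print a realm.properties line with an MD5-hashed password.
# realm_md5_entry is a hypothetical helper; it requires md5sum (GNU coreutils).
realm_md5_entry() {
  user=$1
  password=$2
  roles=$3
  # printf avoids the trailing newline that echo would add to the hash input
  hash=$(printf '%s' "$password" | md5sum | cut -d' ' -f1)
  printf '%s:MD5:%s,%s\n' "$user" "$hash" "$roles"
}
```

Calling `realm_md5_entry admin 'NewSecurePassword' user,admin,architect,deploy,build` prints a line that can replace the existing admin entry in /etc/rundeck/realm.properties.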

Node Management

Nodes are the targets where Rundeck executes commands. The local server is already a node.

Add remote nodes by editing the project's resources.xml or resources.yaml:

# Create a YAML node resource file
sudo mkdir -p /var/rundeck/projects/myops/etc
sudo tee /var/rundeck/projects/myops/etc/resources.yaml > /dev/null << 'EOF'
web01:
  nodename: web01
  hostname: 192.168.1.10
  username: deploy
  description: Web server 1
  tags: web,production
  ssh-keypath: /var/lib/rundeck/.ssh/id_rsa
  osFamily: unix

db01:
  nodename: db01
  hostname: 192.168.1.20
  username: deploy
  description: Database server
  tags: db,production
  ssh-keypath: /var/lib/rundeck/.ssh/id_rsa
  osFamily: unix
EOF

# Configure project to use this resource file
# In the project settings, set Resource Model Source to File
# Path: /var/rundeck/projects/myops/etc/resources.yaml

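# Generate an SSH key pair for the rundeck user
# (the key type is ed25519; the file keeps the name id_rsa to match ssh-keypath in the node file)
sudo -u rundeck mkdir -p /var/lib/rundeck/.ssh
sudo -u rundeck ssh-keygen -t ed25519 -f /var/lib/rundeck/.ssh/id_rsa -N ""
sudo cat /var/lib/rundeck/.ssh/id_rsa.pub
# Copy this public key to every target node, for example:
# sudo -u rundeck ssh-copy-id -i /var/lib/rundeck/.ssh/id_rsa.pub deploy@192.168.1.10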

Test node connectivity from the dashboard: Nodes > web01 > Run Command.
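
Connectivity can also be checked from the command line before running anything through Rundeck. The sketch below tries a non-interactive SSH login to each host. `check_nodes` is a hypothetical helper, and the key path, user, and addresses in the usage note are the ones from the node file above.

```shell
# Try a non-interactive SSH login to each host and report the result.
# check_nodes is a hypothetical helper, not a Rundeck command.
check_nodes() {
  key=$1
  user=$2
  shift 2
  failed=0
  for host in "$@"; do
    # BatchMode=yes makes ssh fail instead of prompting for a password;
    # -n keeps ssh from reading stdin inside the loop
    if ssh -n -i "$key" -o BatchMode=yes -o ConnectTimeout=5 "$user@$host" true 2>/dev/null; then
      echo "OK   $host"
    else
      echo "FAIL $host"
      failed=1
    fi
  done
  return "$failed"
}
```

Run it from a shell owned by the rundeck user so the same key and known_hosts file are used: `check_nodes /var/lib/rundeck/.ssh/id_rsa deploy 192.168.1.10 192.168.1.20`.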

Job Definitions

Jobs define the sequence of steps to execute on the matched nodes. Create a job from a YAML job definition:

# Create a job definition file
cat > /tmp/deploy-job.yaml << 'EOF'
- name: Deploy Application
  id: deploy-app
  description: Pull latest code and restart service
  project: myops
  loglevel: INFO
  nodefilters:
    filter: "tags: web"
  sequence:
    keepgoing: false
    strategy: node-first
    commands:
      - script: |
          #!/bin/bash
          set -e
          echo "Pulling latest code..."
          cd /var/www/app
          git pull origin main
          
          echo "Installing dependencies..."
          composer install --no-dev --quiet
          
          echo "Running migrations..."
          php artisan migrate --force
          
          echo "Restarting PHP-FPM..."
          sudo systemctl restart php8.1-fpm
          echo "Deploy complete on $(hostname)"
  notification:
    onfailure:
      email:
        recipients: ops@example.com  # placeholder; use your team's address
        subject: "Deploy FAILED on ${node.name}"
    onsuccess:
      email:
        recipients: ops@example.com  # placeholder; use your team's address
        subject: "Deploy succeeded"
  scheduleEnabled: true
  schedule:
    time:
      hour: '2'
      minute: '0'
    month: '*'
    weekday:
      day: '*'
EOF

# Import the job
rd jobs load --project myops --file /tmp/deploy-job.yaml --format yaml

# List jobs
rd jobs list --project myops

# Run a job immediately
rd run --project myops --job "Deploy Application"

# Follow job execution
rd executions follow --eid <execution-id>
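
For scripted use, such as a CI pipeline, it can be handy to start a job and wait for its result in one call. The sketch below wraps `rd run` with its `--follow` flag. `run_and_report` is a hypothetical wrapper, and it assumes `rd` exits non-zero when the followed execution fails.

```shell
# Run a job, stream its log, and report the outcome.
# run_and_report is a hypothetical wrapper around the rd CLI.
run_and_report() {
  project=$1
  job=$2
  # --follow keeps rd attached until the execution finishes
  if rd run --project "$project" --job "$job" --follow; then
    echo "Job '$job' succeeded"
  else
    echo "Job '$job' failed; check the execution log in Rundeck" >&2
    return 1
  fi
}
```

With RD_URL and credentials exported as shown earlier, `run_and_report myops "Deploy Application"` returns a status that a calling script can act on.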

ACL Policies

ACL policies control who can do what in Rundeck. Create role-based policies:

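# Create an ACL policy for a developer role that can view and run jobs
# Note: stock installs read *.aclpolicy files from /etc/rundeck itself;
# confirm that your setup also loads the acl/ subdirectory
sudo mkdir -p /etc/rundeck/acl
sudo tee /etc/rundeck/acl/developer.aclpolicy > /dev/null << 'EOF'
description: Developer read and run access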
context:
  project: myops
for:
  resource:
    - allow: [read]
  adhoc:
    - allow: [read]
  job:
    - allow: [read, run]
  node:
    - allow: [read, run]
by:
  group: developers

---
description: Developer system access
context:
  application: rundeck
for:
  resource:
    - equals:
        kind: project
      allow: [read]
    - equals:
        kind: system
      allow: [read]
  project:
    - match:
        name: myops
      allow: [read]
by:
  group: developers
EOF

# Create an ops engineer policy
sudo tee /etc/rundeck/acl/ops.aclpolicy > /dev/null << 'EOF'
description: Ops full access
context:
  project: myops
for:
  resource:
    - allow: [read, create, update, delete]
  adhoc:
    - allow: [read, run, kill]
  job:
    - allow: [read, create, update, delete, run, kill]
  node:
    - allow: [read, run]
by:
  group: ops

---
description: Ops system access
context:
  application: rundeck
for:
  resource:
    - allow: [read, create, update, delete]
  project:
    - allow: [read, configure, delete, import, export]
by:
  group: ops
EOF

sudo systemctl restart rundeck
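
A syntax error in a policy file can silently lock users out, so it helps to validate every policy before restarting. The sketch below loops over a directory and runs the `rd acl validate` command used in the Troubleshooting section. `validate_acls` is a hypothetical helper, and it assumes your `rd` installation provides the `acl` subcommand.

```shell
# Validate every *.aclpolicy file in a directory.
# validate_acls is a hypothetical helper around "rd acl validate".
validate_acls() {
  dir=$1
  bad=0
  for policy in "$dir"/*.aclpolicy; do
    [ -e "$policy" ] || continue   # the directory contains no policy files
    if rd acl validate --file "$policy" >/dev/null 2>&1; then
      echo "valid:   $policy"
    else
      echo "invalid: $policy"
      bad=1
    fi
  done
  return "$bad"
}
```

A guarded restart could then look like `validate_acls /etc/rundeck/acl && sudo systemctl restart rundeck`.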

Webhook Triggers

Trigger jobs via webhooks from external systems (GitHub, monitoring alerts, etc.):

# Create a webhook in the Rundeck UI:
# Project > Webhooks > Add Webhook
# Name: deploy-on-push
# Event Handler: Run Job
# Job: Deploy Application

# The webhook URL format:
# http://your-server:4440/api/45/webhook/<token>

# Test the webhook
curl -X POST "http://localhost:4440/api/45/webhook/YourWebhookToken" \
  -H "Content-Type: application/json" \
  -d '{"event": "push", "branch": "main"}'

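# Pass payload fields to the job as options:
# in the webhook's Run Job handler, set Options to: -branch ${data.branch}
# then define a job option named "branch" and reference it in steps as ${option.branch}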
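
External systems usually send the webhook from a script rather than by hand. The sketch below wraps the test call above in a function. `trigger_webhook` is a hypothetical helper, and the `event` and `branch` fields simply mirror the sample payload; your handler may expect different fields.

```shell
# POST a small JSON event to a Rundeck webhook URL.
# trigger_webhook is a hypothetical helper; the payload fields mirror the sample call.
trigger_webhook() {
  url=$1
  event=$2
  branch=$3
  # Values are inserted verbatim, so they must not contain quotes or backslashes
  payload=$(printf '{"event": "%s", "branch": "%s"}' "$event" "$branch")
  # --fail turns HTTP 4xx/5xx responses into a non-zero exit status
  curl --fail -sS -X POST "$url" \
    -H "Content-Type: application/json" \
    -d "$payload"
}
```

For example: `trigger_webhook "http://localhost:4440/api/45/webhook/YourWebhookToken" push main`.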

Incident Response Runbooks

Create structured incident response runbooks as Rundeck jobs:

# Create a database high-CPU incident runbook
cat > /tmp/db-incident-runbook.yaml << 'EOF'
- name: DB High CPU - Incident Response
  description: Automated steps for database high CPU incidents
  project: myops
  loglevel: INFO
  nodefilters:
    filter: "tags: db"
  sequence:
    keepgoing: true
    strategy: sequential
    commands:
      - description: "Step 1: Capture current process list"
        script: |
          #!/bin/bash
          # Rundeck replaces @option.NAME@ tokens in inline scripts before they run
          echo "=== Active Queries (top 20) ==="
          mysql -u root -p'@option.DB_PASS@' -e "
            SELECT id, user, host, db, command, time, state, info
            FROM information_schema.processlist
            WHERE command != 'Sleep'
            ORDER BY time DESC LIMIT 20;
          " 2>/dev/null || echo "Could not query MySQL"
          
          echo "=== System load ==="
          uptime
          echo "=== Top CPU processes ==="
          ps aux --sort=-%cpu | head -20

      - description: "Step 2: Check slow query log"
        script: |
          #!/bin/bash
          echo "=== Recent slow queries ==="
          tail -n 50 /var/log/mysql/slow-query.log 2>/dev/null || \
            echo "Slow query log not found"

      - description: "Step 3: Kill long-running queries (>300s)"
        script: |
          #!/bin/bash
          # Emit one KILL statement per long-running query, then feed them back to MySQL.
          # CONCAT yields no rows when nothing matches, so nothing is executed in that case.
          mysql -u root -p'@option.DB_PASS@' -e "
            SELECT CONCAT('KILL ', id, ';')
            FROM information_schema.processlist
            WHERE command = 'Query' AND time > 300
          " -s -N 2>/dev/null | mysql -u root -p'@option.DB_PASS@' 2>/dev/null || true
          echo "Long-running query kill attempt complete"

      - description: "Step 4: Notify team"
        script: |
          #!/bin/bash
          # $(hostname) is left unescaped so the shell expands it on the node
          curl -s -X POST "@option.SLACK_WEBHOOK@" \
            -H 'Content-Type: application/json' \
            -d "{\"text\": \"DB incident runbook executed on $(hostname). Check Rundeck logs for details.\"}"
  options:
    - name: DB_PASS
      required: true
      secure: true
      valueExposed: true  # must be true for the value to reach script steps
    - name: SLACK_WEBHOOK
      required: true
EOF

rd jobs load --project myops --file /tmp/db-incident-runbook.yaml --format yaml
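
Killing queries automatically is risky, so it can be worth previewing what Step 3 would do before enabling it. The sketch below turns a list of query ids and run times into KILL statements without executing them. `kill_statements` is a hypothetical helper, and the default 300-second threshold matches the runbook.

```shell
# Read "id seconds" pairs on stdin and print a KILL statement for each
# query that has been running longer than the threshold.
# kill_statements is a hypothetical helper for previewing Step 3.
kill_statements() {
  threshold=${1:-300}
  # "+ 0" forces a numeric comparison; awk splits on spaces or tabs
  awk -v limit="$threshold" '$2 + 0 > limit + 0 { printf "KILL %s;\n", $1 }'
}
```

On a database node, the input could come from `mysql -u root -p -s -N -e "SELECT id, time FROM information_schema.processlist WHERE command = 'Query'" | kill_statements 300`, which prints the statements for review.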

Troubleshooting

Rundeck won't start:

# Check Java version
java -version  # Must be 11+

# Check logs
sudo journalctl -u rundeck -n 50
sudo tail -n 50 /var/log/rundeck/service.log

# Verify port 4440 is not in use
ss -tlnp | grep 4440
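
The Java check can be scripted, which helps when several servers need verifying. The sketch below extracts the major version from the `java -version` banner. `java_major` is a hypothetical helper, and it assumes the banner contains a quoted version string such as "11.0.20" or "1.8.0_382".

```shell
# Extract the major Java version from a `java -version` banner.
# java_major is a hypothetical helper; it handles "1.8.0_x" and "11.0.x" styles.
java_major() {
  # Pull the quoted version string, e.g. 11.0.20 or 1.8.0_382
  version=$(printf '%s\n' "$1" | sed -n 's/.*version "\([^"]*\)".*/\1/p' | head -n 1)
  case "$version" in
    1.*) echo "$version" | cut -d. -f2 ;;   # legacy scheme: 1.8 means Java 8
    *)   echo "$version" | cut -d. -f1 ;;
  esac
}
```

Because `java -version` writes to stderr, call it as `java_major "$(java -version 2>&1)"` and compare the result against 11.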

SSH connection to nodes fails:

# Test SSH manually as rundeck user
sudo -u rundeck ssh -i /var/lib/rundeck/.ssh/id_rsa deploy@192.168.1.10

# Check node definition (username and key path)
# On the target node, verify that Rundeck's public key is in authorized_keys
grep "rundeck" /home/deploy/.ssh/authorized_keys

Job execution permission denied:

# Check ACL policy syntax
rd acl validate --file /etc/rundeck/acl/developer.aclpolicy

# Reload ACL policies
sudo systemctl restart rundeck

Out of memory errors:

# Increase Java heap size
sudo nano /etc/rundeck/profile
# Add or change: RDECK_JVM_SETTINGS="-Xmx2g -Xms512m"

sudo systemctl restart rundeck
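
Choosing a heap size depends on how much memory the server has. The sketch below prints half of total RAM in MB as a starting point for -Xmx. `suggest_heap_mb` is a hypothetical helper, and the 50% figure is only a rule of thumb that assumes a server dedicated to Rundeck.

```shell
# Print half of the machine's RAM in MB, as a starting point for -Xmx.
# suggest_heap_mb is a hypothetical helper; the 50% figure is a rule of thumb.
suggest_heap_mb() {
  meminfo=${1:-/proc/meminfo}
  # MemTotal is reported in kB; divide by 1024 for MB, then by 2
  awk '/^MemTotal:/ { printf "%d\n", $2 / 1024 / 2 }' "$meminfo"
}
```

On a 4 GB server this prints a value near 2048, which lines up with the -Xmx2g setting shown above.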

Conclusion

Rundeck brings runbook automation and self-service operations to your infrastructure, reducing mean time to resolution for incidents by providing structured, auditable workflows. Use ACL policies to safely delegate job execution to developers, configure webhooks to trigger automated responses to monitoring alerts, and build incident runbooks to codify your team's institutional knowledge.