Vector Log Collection and Transformation

Vector is a high-performance, open-source observability data pipeline built in Rust that collects, transforms, and routes logs, metrics, and traces with minimal resource overhead. This guide covers installing Vector on Linux, configuring sources and sinks, writing VRL (Vector Remap Language) transformations, designing reliable pipelines, and integrating with common monitoring stacks.

Prerequisites

  • Ubuntu 20.04+ / Debian 11+ or CentOS 8+ / Rocky Linux 8+
  • 512 MB RAM minimum (Vector is very lightweight)
  • Root or sudo access
  • Target log sources (files, syslog, Kubernetes, etc.)
  • Destination: Elasticsearch, Datadog, Loki, S3, etc.

Installing Vector

# Ubuntu/Debian or CentOS/Rocky Linux - official package repositories
# (the old repositories.timber.io endpoints are deprecated; setup.vector.dev
# detects the distro and configures apt or yum accordingly)
bash -c "$(curl -L https://setup.vector.dev)"

# Then install with the distro's package manager
sudo apt-get install -y vector   # Debian/Ubuntu
sudo dnf install -y vector       # CentOS/Rocky

# Or install via script (any distro)
curl --proto '=https' --tlsv1.2 -sSf https://sh.vector.dev | bash

# Verify installation
vector --version

# Enable and start the service
sudo systemctl enable --now vector
sudo systemctl status vector

The default config lives at /etc/vector/vector.yaml. The service runs as the vector user.

Core Concepts: Sources, Transforms, Sinks

Vector pipelines follow a simple DAG (directed acyclic graph) model:

  • Sources: where data comes in (files, syslog, Kafka, HTTP, Kubernetes logs)
  • Transforms: process and reshape data (parse, filter, enrich, route)
  • Sinks: where data goes out (Elasticsearch, Loki, S3, Datadog, stdout)

Every component has a unique id. Transforms and sinks declare their inputs to wire components together.
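As a minimal sketch of this wiring (assuming the built-in demo_logs source, which emits synthetic sample events), a complete three-stage pipeline looks like:

```yaml
# Minimal end-to-end pipeline: source -> transform -> sink
sources:
  demo:
    type: demo_logs      # generates synthetic log events
    format: json
    interval: 1          # roughly one event per second

transforms:
  tag_events:
    type: remap
    inputs:
      - demo             # wired to the source by its id
    source: |
      .pipeline = "demo"

sinks:
  out:
    type: console
    inputs:
      - tag_events       # wired to the transform by its id
    encoding:
      codec: json
```

Running vector --config with this file prints one tagged JSON event per second to stdout, which makes it a handy scaffold for trying out transforms.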

Basic Log Collection Configuration

Replace /etc/vector/vector.yaml with a working configuration:

# /etc/vector/vector.yaml

# Data directory for Vector's on-disk buffers
data_dir: /var/lib/vector

# Source: tail log files
sources:
  nginx_logs:
    type: file
    include:
      - /var/log/nginx/access.log
      - /var/log/nginx/error.log
    read_from: end  # only new lines (use "beginning" for backfill)

  syslog_input:
    type: syslog
    address: "0.0.0.0:514"  # privileged port; the vector service user needs CAP_NET_BIND_SERVICE (or use a port above 1024)
    mode: udp

# Transform: parse nginx access log format
transforms:
  parse_nginx:
    type: remap
    inputs:
      - nginx_logs
    source: |
      # Parse combined log format
      . = parse_nginx_log!(.message, "combined")
      # Add a hostname field
      .host = get_hostname!()
      # Convert status code to integer
      .status = to_int!(.status)

# Sink: send to Elasticsearch
sinks:
  elasticsearch_out:
    type: elasticsearch
    inputs:
      - parse_nginx
      - syslog_input
    endpoints:
      - http://localhost:9200
    bulk:
      action: index
      index: "logs-%Y-%m-%d"  # newer Vector versions nest the index under bulk

  # Also output to stdout for debugging
  console_debug:
    type: console
    inputs:
      - parse_nginx
    encoding:
      codec: json
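One operational caveat with the syslog source above: port 514 is privileged, and the service runs as the unprivileged vector user. A common fix (assuming the stock systemd unit) is a drop-in granting the bind capability:

```ini
# /etc/systemd/system/vector.service.d/caps.conf
[Service]
AmbientCapabilities=CAP_NET_BIND_SERVICE
```

Apply it with sudo systemctl daemon-reload && sudo systemctl restart vector, or simply move the source to a port above 1024.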

Validate and reload the config:

# Validate configuration syntax
vector validate /etc/vector/vector.yaml

# Run unit tests declared under the config's tests key
vector test /etc/vector/vector.yaml

# Reload after changes
sudo systemctl reload vector
# or
sudo kill -HUP $(pidof vector)
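The vector test command exercises unit tests declared inside the config itself. A sketch of one such test for the parse_nginx transform above, using Vector's unit-testing schema:

```yaml
# Appended to /etc/vector/vector.yaml
tests:
  - name: parse_nginx extracts the status code
    inputs:
      - insert_at: parse_nginx
        type: log
        log_fields:
          message: '127.0.0.1 - - [01/Jan/2024:12:00:00 +0000] "GET / HTTP/1.1" 200 1234'
    outputs:
      - extract_from: parse_nginx
        conditions:
          - type: vrl
            source: |
              assert!(.status == 200)
```

Tests inject a synthetic event at a named transform and assert on what comes out the other side, so transforms can be verified without touching real traffic.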

VRL Scripting for Transformations

VRL (Vector Remap Language) is a purpose-built scripting language for log transformation. It is safe (no infinite loops), fast, and returns compile-time errors for many mistakes.

transforms:
  enrich_logs:
    type: remap
    inputs:
      - raw_logs
    source: |
      # Parse JSON log message
      structured, err = parse_json(.message)
      if err == null {
        . = merge(., structured)
      }

      # Parse timestamp
      .timestamp = parse_timestamp!(.timestamp, format: "%Y-%m-%dT%H:%M:%S%.fZ")

      # Normalize log level to uppercase
      if exists(.level) {
        .level = upcase(string!(.level))
      }

      # Extract user agent details
      if exists(.user_agent) {
        ua = parse_user_agent(string!(.user_agent))  # infallible; unrecognized parts come back null
        .browser = ua.browser.family
        .os = ua.os.family
      }

      # Drop sensitive fields
      del(.password)
      del(.credit_card)

      # Add processing metadata
      .pipeline_version = "1.0"
      .processed_at = now()

Key VRL functions:

  • parse_json(), parse_nginx_log(), parse_syslog() - structured parsing
  • parse_timestamp(), format_timestamp() - time handling
  • get_env_var() - read environment variables
  • get_enrichment_table_record() - lookups against enrichment tables (e.g. GeoIP)
  • del(), merge(), set() - field manipulation
  • ! suffix - infallible version (aborts on error instead of returning an error)
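VRL's three error-handling styles can be seen side by side in a short remap sketch (the field names are illustrative):

```vrl
# 1. Capture the error explicitly and branch on it
parsed, err = parse_json(.message)
if err == null {
  . = merge(., object!(parsed))
}

# 2. Coalesce: fall back to a default when the call fails
.status = to_int(.status) ?? 0

# 3. Infallible form: abort processing this event if the call fails
.level = upcase(string!(.level))
```

The compiler forces every fallible call into one of these forms, which is how VRL guarantees a program cannot crash at runtime on unexpected input.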

Routing and Filtering

Use the route transform to send events to different sinks based on conditions:

transforms:
  log_router:
    type: route
    inputs:
      - enrich_logs
    route:
      errors: '.level == "ERROR" || .level == "CRITICAL"'
      slow_requests: '.duration_ms > 1000'
      auth_events: 'starts_with(string!(.path), "/auth")'

sinks:
  errors_to_pagerduty:
    type: http
    inputs:
      - log_router.errors
    uri: https://events.pagerduty.com/v2/enqueue
    method: post
    encoding:
      codec: json

  slow_requests_to_loki:
    inputs:
      - log_router.slow_requests
    # ... Loki config

  # Catch all unmatched routes
  default_sink:
    inputs:
      - log_router._unmatched
    # ...

# Use filter transform to drop unwanted events entirely
transforms:
  drop_health_checks:
    type: filter
    inputs:
      - nginx_logs
    condition: '.path != "/health" && .path != "/ping"'

Metrics Collection

Vector can collect host metrics and convert logs to metrics:

sources:
  host_metrics:
    type: host_metrics
    scrape_interval_secs: 30
    collectors:
      - cpu
      - disk
      - filesystem
      - load
      - memory
      - network

transforms:
  # Extract request duration as a metric from logs
  log_to_metric:
    type: log_to_metric
    inputs:
      - parse_nginx
    metrics:
      - type: histogram
        field: duration_ms
        name: http_request_duration_ms
        namespace: nginx
        tags:
          status: "{{status}}"
          method: "{{method}}"

sinks:
  prometheus_exporter:
    type: prometheus_exporter
    inputs:
      - host_metrics
      - log_to_metric
    address: "0.0.0.0:9598"  # Prometheus scrapes this endpoint
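On the Prometheus side, a matching scrape job would look like the following (vector-host is a placeholder for wherever Vector runs):

```yaml
# prometheus.yml on the Prometheus server
scrape_configs:
  - job_name: vector
    scrape_interval: 30s
    static_configs:
      - targets: ["vector-host:9598"]
```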

Integration with Monitoring Stacks

Send to Grafana Loki:

sinks:
  loki:
    type: loki
    inputs:
      - enrich_logs
    endpoint: http://loki:3100
    encoding:
      codec: json
    labels:
      app: "{{app}}"
      env: production
      host: "{{host}}"

Send to Datadog:

sinks:
  datadog_logs:
    type: datadog_logs
    inputs:
      - enrich_logs
    default_api_key: "${DATADOG_API_KEY}"
    site: datadoghq.com

  datadog_metrics:
    type: datadog_metrics
    inputs:
      - host_metrics
    default_api_key: "${DATADOG_API_KEY}"

Send to S3 for long-term storage:

sinks:
  s3_archive:
    type: aws_s3
    inputs:
      - enrich_logs
    region: us-east-1
    bucket: my-log-archive
    key_prefix: "logs/year=%Y/month=%m/day=%d/"
    compression: gzip
    encoding:
      codec: ndjson
    batch:
      max_bytes: 10485760  # 10 MB files
      timeout_secs: 300
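For stronger delivery guarantees on a sink like this, Vector supports disk buffers and end-to-end acknowledgements. A sketch of the extra keys on the s3_archive sink (268435488 bytes is roughly the minimum size Vector accepts for a disk buffer):

```yaml
sinks:
  s3_archive:
    # ... sink options as above ...
    buffer:
      type: disk
      max_size: 268435488   # bytes of on-disk buffer, survives restarts
      when_full: block      # apply backpressure rather than drop events
    acknowledgements:
      enabled: true         # ack upstream only after S3 confirms the write
```

With acknowledgements enabled, sources that support them (e.g. file, kafka) will not advance their position until the sink has durably accepted the event.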

Troubleshooting

Check Vector's internal metrics:

# Vector's internal metrics are exposed only when an internal_metrics source
# is wired to a prometheus_exporter sink in the config
curl http://localhost:9598/metrics | grep vector_

# Or use the Vector top command (requires api.enabled: true)
vector top
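Wiring Vector's own telemetry to that /metrics endpoint takes an internal_metrics source plus a prometheus_exporter sink; a minimal sketch:

```yaml
sources:
  vector_internal:
    type: internal_metrics

sinks:
  vector_metrics_out:
    type: prometheus_exporter
    inputs:
      - vector_internal
    address: "0.0.0.0:9598"
```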

Enable debug logging:

# Run Vector in foreground with debug output
VECTOR_LOG=debug vector --config /etc/vector/vector.yaml

# Enable the GraphQL API, which vector top (and vector tap) rely on
api:
  enabled: true
  address: "127.0.0.1:8686"

Test VRL scripts interactively:

# Vector ships with an interactive VRL REPL
vector vrl

# Inside the REPL, set a sample event and run a program against it:
# . = {"message": "127.0.0.1 - - [01/Jan/2024:12:00:00 +0000] \"GET / HTTP/1.1\" 200 1234"}
# . = parse_nginx_log!(.message, "combined")

Events not arriving at sink:

# Check for dropped events
vector top  # look for component_errors_total
journalctl -u vector -f

File source not picking up new logs:

# Vector tracks file read positions with checkpoints stored under data_dir,
# in a subdirectory per file source
# Remove a source's checkpoint state to re-read its files from the beginning
sudo rm -r /var/lib/vector/nginx_logs  # path assumes the nginx_logs source id from the example above
sudo systemctl restart vector

Conclusion

Vector is an exceptionally efficient log and metrics pipeline that can replace multiple agents (Filebeat, Fluentd, Telegraf) with a single binary consuming minimal CPU and memory. VRL provides safe, fast in-line transformations, and the routing transform makes it straightforward to send different event types to different backends. With native support for dozens of sources and sinks, Vector integrates cleanly with any modern observability stack.