Vector Log Collection and Transformation
Vector is a high-performance, open-source observability data pipeline built in Rust that collects, transforms, and routes logs, metrics, and traces with minimal resource overhead. This guide covers installing Vector on Linux, configuring sources and sinks, writing VRL (Vector Remap Language) transformations, designing reliable pipelines, and integrating with common monitoring stacks.
Prerequisites
- Ubuntu 20.04+ / Debian 11+ or CentOS 8+ / Rocky Linux 8+
- 512 MB RAM minimum (Vector is very lightweight)
- Root or sudo access
- Target log sources (files, syslog, Kubernetes, etc.)
- Destination: Elasticsearch, Datadog, Loki, S3, etc.
Installing Vector
# Ubuntu/Debian - official apt repository
curl -1sLf 'https://repositories.timber.io/public/vector/cfg/setup/bash.deb.sh' | sudo bash
sudo apt-get install -y vector
# CentOS/Rocky Linux - official rpm repository
curl -1sLf 'https://repositories.timber.io/public/vector/cfg/setup/bash.rpm.sh' | sudo bash
sudo dnf install -y vector
# Or install via script (any distro)
curl --proto '=https' --tlsv1.2 -sSf https://sh.vector.dev | bash
# Verify installation
vector --version
# Enable and start the service
sudo systemctl enable --now vector
sudo systemctl status vector
The default config lives at /etc/vector/vector.yaml. The service runs as the vector user.
Core Concepts: Sources, Transforms, Sinks
Vector pipelines follow a simple DAG (directed acyclic graph) model:
- Sources: where data comes in (files, syslog, Kafka, HTTP, Kubernetes logs)
- Transforms: process and reshape data (parse, filter, enrich, route)
- Sinks: where data goes out (Elasticsearch, Loki, S3, Datadog, stdout)
Every component has a unique id. Transforms and sinks declare their inputs to wire components together.
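To make the wiring concrete, here is a minimal end-to-end pipeline using the built-in demo_logs source; the component ids demo, tag, and out are arbitrary:

```yaml
# minimal.yaml - one source, one transform, one sink
sources:
  demo:
    type: demo_logs      # built-in generator of sample log events
    format: json
    interval: 1.0        # emit one event per second

transforms:
  tag:
    type: remap
    inputs:
      - demo             # wired to the source by id
    source: |
      .tagged_by = "vector"

sinks:
  out:
    type: console
    inputs:
      - tag              # wired to the transform by id
    encoding:
      codec: json
```

Run it in the foreground with `vector --config minimal.yaml` to watch events flow through the DAG.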
Basic Log Collection Configuration
Replace /etc/vector/vector.yaml with a working configuration:
# /etc/vector/vector.yaml

# Data directory for Vector's on-disk buffers
data_dir: /var/lib/vector

sources:
  # Source: tail nginx log files
  nginx_logs:
    type: file
    include:
      - /var/log/nginx/access.log
      - /var/log/nginx/error.log
    read_from: end  # only new lines (use "beginning" for backfill)

  # Source: receive syslog over UDP
  syslog_input:
    type: syslog
    # Ports below 1024 require root or CAP_NET_BIND_SERVICE;
    # the packaged service runs as the unprivileged vector user
    address: "0.0.0.0:514"
    mode: udp

transforms:
  # Transform: parse nginx access log format
  parse_nginx:
    type: remap
    inputs:
      - nginx_logs
    source: |
      # Parse combined log format
      . = parse_nginx_log!(.message, "combined")
      # Add a hostname field
      .host = get_hostname!()
      # Ensure the status code is an integer
      .status = to_int!(.status)

sinks:
  # Sink: send to Elasticsearch
  elasticsearch_out:
    type: elasticsearch
    inputs:
      - parse_nginx
      - syslog_input
    endpoints:
      - http://localhost:9200
    bulk:
      action: index
      index: "logs-%Y-%m-%d"

  # Also output to stdout for debugging
  console_debug:
    type: console
    inputs:
      - parse_nginx
    encoding:
      codec: json
Validate and reload the config:
# Validate configuration syntax
vector validate /etc/vector/vector.yaml
# Run unit tests declared under the config's `tests` key
vector test /etc/vector/vector.yaml
# Reload after changes
sudo systemctl reload vector
# or
sudo kill -HUP $(pidof vector)
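`vector test` exercises unit tests declared under a top-level `tests` key in the config. A minimal sketch for the parse_nginx transform above (the test name and sample log line are illustrative):

```yaml
# Appended to /etc/vector/vector.yaml
tests:
  - name: parse_nginx_extracts_status
    inputs:
      - insert_at: parse_nginx   # inject the event into this transform
        type: log
        log_fields:
          message: '127.0.0.1 - - [01/Jan/2024:12:00:00 +0000] "GET / HTTP/1.1" 200 1234'
    outputs:
      - extract_from: parse_nginx
        conditions:
          - type: vrl
            source: assert!(.status == 200)
```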
VRL Scripting for Transformations
VRL (Vector Remap Language) is a purpose-built scripting language for log transformation. It is safe (no infinite loops, no I/O), fast, and catches many mistakes at compile time.
transforms:
  enrich_logs:
    type: remap
    inputs:
      - raw_logs
    source: |
      # Parse JSON log message; merge only if parsing yielded an object
      structured, err = parse_json(.message)
      if err == null && is_object(structured) {
        . = merge(., object!(structured))
      }

      # Parse timestamp
      .timestamp = parse_timestamp!(.timestamp, format: "%Y-%m-%dT%H:%M:%S%.fZ")

      # Normalize log level to uppercase
      if exists(.level) {
        .level = upcase(string!(.level))
      }

      # Extract user agent details
      if exists(.user_agent) {
        ua = parse_user_agent(.user_agent) ?? {}
        .browser = ua.browser.family
        .os = ua.os.family
      }

      # Drop sensitive fields
      del(.password)
      del(.credit_card)

      # Add processing metadata
      .pipeline_version = "1.0"
      .processed_at = now()
Key VRL functions:
- parse_json(), parse_nginx_log(), parse_syslog() - structured parsing
- parse_timestamp(), format_timestamp() - time handling
- get_env_var() - read environment variables
- get_enrichment_table_record() - enrichment lookups (e.g. GeoIP via enrichment tables)
- del(), merge(), set() - field manipulation
- ! suffix - aborting variant of a fallible function (drops the event on error instead of returning an error value)
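The error-handling forms come up constantly in VRL; a short sketch of each (the field names are illustrative):

```vrl
# 1. Capture the error explicitly and keep the event
parsed, err = parse_json(.message)
if err != null {
  .parse_error = err
}

# 2. Coalesce to a fallback value on failure
.status = to_int(.status) ?? 0

# 3. Abort processing this event if the function fails (the ! suffix)
. = parse_syslog!(.message)
```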
Routing and Filtering
Use the route transform to send events to different sinks based on conditions:
transforms:
  log_router:
    type: route
    inputs:
      - enrich_logs
    route:
      errors: '.level == "ERROR" || .level == "CRITICAL"'
      slow_requests: '.duration_ms > 1000'
      auth_events: 'starts_with(string!(.path), "/auth")'

sinks:
  errors_to_pagerduty:
    type: http
    inputs:
      - log_router.errors
    uri: https://events.pagerduty.com/v2/enqueue
    method: post
    encoding:
      codec: json

  slow_requests_to_loki:
    type: loki
    inputs:
      - log_router.slow_requests
    # ... Loki config

  # Catch all unmatched routes
  default_sink:
    inputs:
      - log_router._unmatched
    # ...

# Use a filter transform to drop unwanted events entirely
transforms:
  drop_health_checks:
    type: filter
    inputs:
      - nginx_logs
    condition: '.path != "/health" && .path != "/ping"'
Metrics Collection
Vector can collect host metrics and convert logs to metrics:
sources:
  host_metrics:
    type: host_metrics
    scrape_interval_secs: 30
    collectors:
      - cpu
      - disk
      - filesystem
      - load
      - memory
      - network

transforms:
  # Extract request duration as a metric from logs
  log_to_metric:
    type: log_to_metric
    inputs:
      - parse_nginx
    metrics:
      - type: histogram
        field: duration_ms
        name: http_request_duration_ms
        namespace: nginx
        tags:
          status: "{{status}}"
          method: "{{method}}"

sinks:
  prometheus_exporter:
    type: prometheus_exporter
    inputs:
      - host_metrics
      - log_to_metric
    address: "0.0.0.0:9598"  # Prometheus scrapes this endpoint
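On the Prometheus side, a scrape job pointed at that exporter could look like this (the job name and target host are assumptions):

```yaml
# prometheus.yml fragment
scrape_configs:
  - job_name: "vector"
    static_configs:
      - targets: ["vector-host:9598"]  # address of the prometheus_exporter sink
```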
Integration with Monitoring Stacks
Send to Grafana Loki:
sinks:
  loki:
    type: loki
    inputs:
      - enrich_logs
    endpoint: http://loki:3100
    encoding:
      codec: json
    labels:
      app: "{{app}}"
      env: production
      host: "{{host}}"
Send to Datadog:
sinks:
  datadog_logs:
    type: datadog_logs
    inputs:
      - enrich_logs
    default_api_key: "${DATADOG_API_KEY}"
    site: datadoghq.com

  datadog_metrics:
    type: datadog_metrics
    inputs:
      - host_metrics
    default_api_key: "${DATADOG_API_KEY}"
Send to S3 for long-term storage:
sinks:
  s3_archive:
    type: aws_s3
    inputs:
      - enrich_logs
    region: us-east-1
    bucket: my-log-archive
    key_prefix: "logs/year=%Y/month=%m/day=%d/"
    compression: gzip
    encoding:
      codec: ndjson
    batch:
      max_bytes: 10485760  # 10 MB files
      timeout_secs: 300
Troubleshooting
Check Vector's internal metrics:
# Vector's own metrics are available when an internal_metrics source
# feeds a prometheus_exporter sink (port 9598 in the config above)
curl http://localhost:9598/metrics | grep vector_
# Or use the vector top command (requires the API, see below)
vector top
Enable debug logging:
# Run Vector in foreground with debug output
VECTOR_LOG=debug vector --config /etc/vector/vector.yaml
# Or enable the API, which powers `vector top` and `vector tap`
api:
  enabled: true
  address: "127.0.0.1:8686"
Test VRL scripts interactively:
# Install vrl CLI
cargo install vrl-cli # or download from GitHub releases
# Test a VRL script against sample data
echo '{"message": "127.0.0.1 - - [01/Jan/2024:12:00:00 +0000] \"GET / HTTP/1.1\" 200 1234"}' \
| vrl --program '. = parse_nginx_log!(.message, "combined")'
Events not arriving at sink:
# Check for dropped events
vector top # look for component_errors_total
journalctl -u vector -f
File source not picking up new logs:
# Vector uses checkpoints under data_dir to track file positions
# (depending on version, checkpoints.json may sit in a per-source subdirectory)
# Reset checkpoints if needed
sudo rm /var/lib/vector/checkpoints.json
sudo systemctl restart vector
Conclusion
Vector is an exceptionally efficient log and metrics pipeline that can replace multiple agents (Filebeat, Fluentd, Telegraf) with a single binary consuming minimal CPU and memory. VRL provides safe, fast in-line transformations, and the routing transform makes it straightforward to send different event types to different backends. With native support for dozens of sources and sinks, Vector integrates cleanly with any modern observability stack.


