ClickHouse Installation for Analytics
ClickHouse is a high-performance, open-source OLAP (Online Analytical Processing) database designed for real-time analytics on massive datasets. It provides exceptional compression, fast aggregation queries, and distributed processing capabilities while using a fraction of the storage of traditional data warehouses. This guide covers installation, table engine selection, data loading, query optimization, and best practices for building analytics infrastructure with ClickHouse.
Table of Contents
- Architecture and Engine Types
- Installation
- Configuration and Storage
- Table Design and Engines
- Data Loading and Ingestion
- Materialized Views
- SQL Queries and Performance
- Distributed Queries
- Monitoring and Maintenance
- Best Practices
- Conclusion
Architecture and Engine Types
ClickHouse uses column-oriented storage, storing data by column rather than by row. This architecture enables exceptional compression ratios, fast aggregation queries, and efficient use of CPU cache. Data is organized into parts for efficient storage and querying, with automatic merging of parts in the background.
ClickHouse supports multiple table engines optimized for different use cases. MergeTree family engines provide the best performance for analytics. ReplacingMergeTree handles updates. SummingMergeTree automatically aggregates numeric columns. AggregatingMergeTree pre-computes aggregates. Distributed tables enable querying across multiple nodes transparently.
Installation
Install ClickHouse server on Ubuntu/Debian systems using the official repository:
# Install prerequisites
apt-get install -y apt-transport-https ca-certificates curl gnupg
# Add the ClickHouse signing key (apt-key is deprecated; use a keyring file)
curl -fsSL https://packages.clickhouse.com/rpm/lts/repodata/repomd.xml.key | gpg --dearmor -o /usr/share/keyrings/clickhouse-keyring.gpg
echo "deb [signed-by=/usr/share/keyrings/clickhouse-keyring.gpg] https://packages.clickhouse.com/deb stable main" | tee /etc/apt/sources.list.d/clickhouse.list
# Update package list
apt-get update
# Install ClickHouse server and client
apt-get install -y clickhouse-server clickhouse-client clickhouse-common-static
# Verify installation
clickhouse-server --version
clickhouse-client --version
On CentOS/RHEL systems:
# Add repository
yum install -y yum-utils
yum-config-manager --add-repo https://packages.clickhouse.com/rpm/clickhouse.repo
# Install packages
yum install -y clickhouse-server clickhouse-client
# Verify installation
clickhouse-server --version
Enable and start ClickHouse service:
# Enable automatic startup
systemctl enable clickhouse-server
# Start the service
systemctl start clickhouse-server
# Verify service is running
systemctl status clickhouse-server
# Test connection
echo "SELECT 1" | clickhouse-client
Create dedicated directories for data and logs with proper permissions:
# Create directories
mkdir -p /var/lib/clickhouse /var/log/clickhouse-server
# Set ownership
chown -R clickhouse:clickhouse /var/lib/clickhouse /var/log/clickhouse-server
# Set permissions
chmod 755 /var/lib/clickhouse /var/log/clickhouse-server
Configuration and Storage
ClickHouse configuration is stored in XML files. The main configuration file is /etc/clickhouse-server/config.xml; avoid editing it directly. Instead, place server overrides in /etc/clickhouse-server/config.d/, which ClickHouse merges at startup, and put users, profiles, and quotas in /etc/clickhouse-server/users.d/. Create a server override file:
nano /etc/clickhouse-server/config.d/custom.xml
Add this configuration for a single-node setup. The root element is <clickhouse> (the legacy <yandex> tag still works but is deprecated):
<clickhouse>
    <!-- Listen on all interfaces; restrict this in production -->
    <listen_host>0.0.0.0</listen_host>
    <!-- HTTP port for web UI and API -->
    <http_port>8123</http_port>
    <!-- Native TCP port for client connections -->
    <tcp_port>9000</tcp_port>
    <!-- Storage paths -->
    <path>/var/lib/clickhouse/</path>
    <tmp_path>/var/lib/clickhouse/tmp/</tmp_path>
    <!-- Concurrency limits -->
    <max_concurrent_queries>100</max_concurrent_queries>
    <max_connections>100</max_connections>
    <!-- Query logging -->
    <query_log>
        <database>system</database>
        <table>query_log</table>
        <flush_interval_milliseconds>7500</flush_interval_milliseconds>
    </query_log>
    <!-- Part log for tracking data merges -->
    <part_log>
        <database>system</database>
        <table>part_log</table>
    </part_log>
</clickhouse>
Memory and thread limits are per-user settings, so they belong in a settings profile rather than in the server config. Create a users override with accounts, profiles, and quotas:
nano /etc/clickhouse-server/users.d/analytics.xml
<clickhouse>
    <!-- Profiles hold per-query settings such as memory and thread limits -->
    <profiles>
        <default>
            <max_memory_usage>10000000000</max_memory_usage>
            <max_threads>16</max_threads>
        </default>
    </profiles>
    <!-- User configuration -->
    <users>
        <default>
            <password>default_password</password>
            <profile>default</profile>
            <networks>
                <!-- ::/0 allows connections from anywhere; tighten in production -->
                <ip>::/0</ip>
            </networks>
        </default>
        <analytics>
            <password>analytics_password</password>
            <profile>default</profile>
            <quota>default</quota>
            <networks>
                <ip>::/0</ip>
            </networks>
        </analytics>
    </users>
    <!-- Quota for limiting resource usage -->
    <quotas>
        <default>
            <interval>
                <duration>3600</duration>
                <queries>1000</queries>
                <errors>100</errors>
                <result_rows>1000000000</result_rows>
                <read_rows>10000000000</read_rows>
            </interval>
        </default>
    </quotas>
</clickhouse>
Restart ClickHouse to apply configuration:
systemctl restart clickhouse-server
# Monitor logs for errors
tail -f /var/log/clickhouse-server/clickhouse-server.log
Verify configuration:
# Connect to ClickHouse
clickhouse-client
-- Inspect session settings
SELECT name, value FROM system.settings LIMIT 10;
-- Check which settings differ from their defaults
SELECT name, value FROM system.settings WHERE changed;
Table Design and Engines
Choose the appropriate table engine for your use case. MergeTree is the recommended engine for analytics:
-- Create analytics database
CREATE DATABASE IF NOT EXISTS analytics;
-- MergeTree table for events
CREATE TABLE analytics.events (
event_id UInt64,
event_time DateTime,
user_id UInt64,
event_type String,
event_value Float32,
event_properties Map(String, String)
) ENGINE = MergeTree()
ORDER BY (event_type, event_time)
PARTITION BY toYYYYMM(event_time);
-- Create table with ReplacingMergeTree for updates
CREATE TABLE analytics.user_profiles (
user_id UInt64,
name String,
email String,
signup_date Date,
last_updated DateTime
) ENGINE = ReplacingMergeTree(last_updated)
ORDER BY user_id;
-- SummingMergeTree for pre-aggregated metrics
CREATE TABLE analytics.event_metrics (
event_date Date,
event_type String,
event_count UInt64,
total_value Float64
) ENGINE = SummingMergeTree((event_count, total_value))
ORDER BY (event_date, event_type)
PARTITION BY toYYYYMM(event_date);
-- AggregatingMergeTree for complex aggregates
CREATE TABLE analytics.metric_aggregates (
event_date Date,
metric_name String,
sum_value AggregateFunction(sum, Float64),
avg_value AggregateFunction(avg, Float64),
max_value AggregateFunction(max, Float64)
) ENGINE = AggregatingMergeTree()
ORDER BY (metric_name, event_date)
PARTITION BY toYYYYMM(event_date);
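To build intuition for how these engines behave, here is a small Python sketch (illustrative only, not ClickHouse internals) of the two collapse rules: ReplacingMergeTree keeps the row with the highest version per sorting key, while SummingMergeTree sums numeric columns per sorting key.

```python
from collections import defaultdict

def replacing_merge(rows, key, version):
    """Keep the row with the highest version value per sorting key,
    mimicking how ReplacingMergeTree collapses duplicates at merge time."""
    best = {}
    for row in rows:
        k = tuple(row[c] for c in key)
        if k not in best or row[version] > best[k][version]:
            best[k] = row
    return list(best.values())

def summing_merge(rows, key, numeric_cols):
    """Sum numeric columns per sorting key, mimicking SummingMergeTree."""
    acc = defaultdict(lambda: defaultdict(float))
    for row in rows:
        k = tuple(row[c] for c in key)
        for c in numeric_cols:
            acc[k][c] += row[c]
    return [dict(zip(key, k)) | dict(sums) for k, sums in acc.items()]

rows = [
    {"user_id": 1, "name": "a", "last_updated": 1},
    {"user_id": 1, "name": "b", "last_updated": 2},
]
# Only the row with last_updated == 2 survives the merge
print(replacing_merge(rows, key=["user_id"], version="last_updated"))
```

Note that ClickHouse applies these rules only when parts are merged in the background, so duplicates may exist until a merge happens (or until a query uses FINAL).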
Create a view scoped to the current month's partition, and a table that deduplicates on insert key:
-- Create a view combining multiple partition tables
CREATE VIEW analytics.events_current_month AS
SELECT * FROM analytics.events
WHERE toYYYYMM(event_time) = toYYYYMM(now());
-- Create a table with deduplication
CREATE TABLE analytics.events_deduplicated (
event_id UInt64,
event_time DateTime,
user_id UInt64,
event_type String,
event_value Float32
) ENGINE = ReplacingMergeTree()
ORDER BY (event_id, event_time)
PARTITION BY toYYYYMM(event_time);
Data Loading and Ingestion
Load data into ClickHouse using various methods. Load from CSV file:
# Create CSV file
cat > /tmp/events.csv << 'EOF'
1000,2024-01-15 10:30:00,100,page_view,0.5
1001,2024-01-15 10:31:00,101,click,1.0
1002,2024-01-15 10:32:00,102,form_submit,2.5
EOF
# Load into ClickHouse, naming the columns explicitly: the CSV has no
# event_properties field, so the Map column falls back to its default
clickhouse-client --query="INSERT INTO analytics.events (event_id, event_time, user_id, event_type, event_value) FORMAT CSV" < /tmp/events.csv
# Verify data insertion
clickhouse-client --query="SELECT COUNT() FROM analytics.events"
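For larger or programmatic loads, the CSV file can be generated with Python's standard csv module. This sketch writes the same /tmp/events.csv used above (the rows and path are illustrative):

```python
import csv
from datetime import datetime

# Rows matching the first five columns of analytics.events:
# (event_id, event_time, user_id, event_type, event_value)
rows = [
    (1000, datetime(2024, 1, 15, 10, 30), 100, "page_view", 0.5),
    (1001, datetime(2024, 1, 15, 10, 31), 101, "click", 1.0),
]

with open("/tmp/events.csv", "w", newline="") as f:
    writer = csv.writer(f)
    for event_id, event_time, user_id, event_type, value in rows:
        # ClickHouse parses DateTime from 'YYYY-MM-DD HH:MM:SS'
        writer.writerow([event_id, event_time.strftime("%Y-%m-%d %H:%M:%S"),
                         user_id, event_type, value])
```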
Insert data programmatically using INSERT queries:
-- Insert single row (Map columns accept {'key': 'value'} literals)
INSERT INTO analytics.events VALUES (1003, now(), 103, 'purchase', 100.0, {'product_id': '123', 'category': 'electronics'});
-- Insert multiple rows (map() produces an empty Map)
INSERT INTO analytics.events VALUES
(1004, now(), 104, 'page_view', 0.5, map()),
(1005, now(), 105, 'click', 1.0, map()),
(1006, now(), 106, 'form_submit', 2.5, map());
-- Load from another table
INSERT INTO analytics.event_metrics
SELECT
toDate(event_time) AS event_date,
event_type,
COUNT() AS event_count,
SUM(event_value) AS total_value
FROM analytics.events
GROUP BY event_date, event_type;
Stream data continuously using HTTP:
# Send data via HTTP POST. JSONEachRow expects one JSON object per line,
# not a JSON array
curl -X POST 'http://localhost:8123/?query=INSERT%20INTO%20analytics.events%20FORMAT%20JSONEachRow' \
--data-binary '
{"event_id":1007,"event_time":"2024-01-15 10:33:00","user_id":107,"event_type":"purchase","event_value":150.0}
{"event_id":1008,"event_time":"2024-01-15 10:34:00","user_id":108,"event_type":"click","event_value":1.5}'
# Verify insertion
curl -X GET 'http://localhost:8123/?query=SELECT%20COUNT()%20FROM%20analytics.events'
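The two moving parts of the curl call above, the URL-encoded query string and the newline-delimited body, can be assembled programmatically. A Python sketch using only the standard library (the helper names are hypothetical):

```python
import json
from urllib.parse import quote

def jsoneachrow(rows):
    """Serialize rows as JSONEachRow: one JSON object per line, no array wrapper."""
    return "\n".join(json.dumps(r) for r in rows)

def insert_url(base, table):
    """Build the HTTP endpoint URL with the INSERT query URL-encoded."""
    query = f"INSERT INTO {table} FORMAT JSONEachRow"
    return f"{base}/?query={quote(query)}"

rows = [{"event_id": 1007, "event_time": "2024-01-15 10:33:00",
         "user_id": 107, "event_type": "purchase", "event_value": 150.0}]
print(insert_url("http://localhost:8123", "analytics.events"))
print(jsoneachrow(rows))
```

The resulting URL and body can then be POSTed with any HTTP client (e.g. urllib.request).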
Load from JSON files:
# Create JSON file
cat > /tmp/events.json << 'EOF'
{"event_id":2000,"event_time":"2024-01-15 10:35:00","user_id":200,"event_type":"page_view","event_value":0.5}
{"event_id":2001,"event_time":"2024-01-15 10:36:00","user_id":201,"event_type":"click","event_value":1.0}
EOF
# Load JSON data
clickhouse-client --query="INSERT INTO analytics.events FORMAT JSONEachRow" < /tmp/events.json
Materialized Views
Materialized views automatically populate a target table based on data inserted into a source table:
-- Create target table for aggregated metrics
CREATE TABLE analytics.hourly_metrics (
event_date Date,
event_hour DateTime,
event_type String,
event_count UInt64,
total_value Float64
) ENGINE = SummingMergeTree((event_count, total_value))
ORDER BY (event_date, event_hour, event_type);
-- Create materialized view that populates hourly_metrics
CREATE MATERIALIZED VIEW analytics.events_to_hourly TO analytics.hourly_metrics AS
SELECT
toDate(event_time) AS event_date,
toStartOfHour(event_time) AS event_hour,
event_type,
COUNT() AS event_count,
SUM(event_value) AS total_value
FROM analytics.events
GROUP BY event_date, event_hour, event_type;
-- Data inserted into events will automatically update hourly_metrics
INSERT INTO analytics.events VALUES (3000, now(), 300, 'purchase', 500.0, map());
-- Query the materialized view results
SELECT * FROM analytics.hourly_metrics ORDER BY event_hour DESC LIMIT 10;
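Conceptually, the materialized view runs its GROUP BY over each inserted block and appends the partial results to hourly_metrics, where SummingMergeTree completes the aggregation at merge time. A small Python sketch (illustrative only) of the per-block rollup:

```python
from collections import defaultdict
from datetime import datetime

def to_start_of_hour(ts: datetime) -> datetime:
    """Python analogue of ClickHouse toStartOfHour()."""
    return ts.replace(minute=0, second=0, microsecond=0)

def hourly_rollup(events):
    """What the materialized view computes for one insert block:
    (date, hour, type) -> [event_count, total_value]."""
    acc = defaultdict(lambda: [0, 0.0])
    for ts, etype, value in events:
        key = (ts.date(), to_start_of_hour(ts), etype)
        acc[key][0] += 1
        acc[key][1] += value
    return dict(acc)

events = [
    (datetime(2024, 1, 15, 10, 5), "click", 1.0),
    (datetime(2024, 1, 15, 10, 45), "click", 2.0),
    (datetime(2024, 1, 15, 11, 0), "click", 3.0),
]
rollup = hourly_rollup(events)  # two hourly buckets for 10:00 and 11:00
```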
Create multiple materialized views for different aggregation levels:
-- Daily aggregation. Store sums and counts rather than averages:
-- averages are not additive, so SummingMergeTree would corrupt them at
-- merge time. Compute avg at query time as total_value / event_count.
CREATE TABLE analytics.daily_metrics (
event_date Date,
event_type String,
event_count UInt64,
total_value Float64
) ENGINE = SummingMergeTree()
ORDER BY (event_date, event_type);
CREATE MATERIALIZED VIEW analytics.events_to_daily TO analytics.daily_metrics AS
SELECT
toDate(event_time) AS event_date,
event_type,
COUNT() AS event_count,
SUM(event_value) AS total_value
FROM analytics.events
GROUP BY event_date, event_type;
-- User activity summary
CREATE TABLE analytics.user_activity_summary (
user_id UInt64,
activity_date Date,
event_count UInt64,
total_spent Float64
) ENGINE = SummingMergeTree()
ORDER BY (activity_date, user_id);
CREATE MATERIALIZED VIEW analytics.events_to_user_activity TO analytics.user_activity_summary AS
SELECT
user_id,
toDate(event_time) AS activity_date,
COUNT() AS event_count,
SUM(event_value) AS total_spent
FROM analytics.events
GROUP BY user_id, activity_date;
SQL Queries and Performance
Write efficient ClickHouse queries leveraging its strengths:
-- Time-series aggregation
SELECT
toStartOfHour(event_time) AS hour,
event_type,
COUNT() AS count,
SUM(event_value) AS total,
AVG(event_value) AS average
FROM analytics.events
WHERE event_time >= now() - INTERVAL 7 DAY
GROUP BY hour, event_type
ORDER BY hour DESC, count DESC;
-- Top-N query with pruning
SELECT
user_id,
SUM(event_value) AS total_spent
FROM analytics.events
WHERE event_time >= today() - INTERVAL 30 DAY
GROUP BY user_id
ORDER BY total_spent DESC
LIMIT 100;
-- Histogram analysis
SELECT
event_type,
quantiles(0, 0.25, 0.5, 0.75, 0.95, 0.99, 1.0)(event_value) AS percentiles,
COUNT() AS count
FROM analytics.events
WHERE event_time >= now() - INTERVAL 1 DAY
GROUP BY event_type;
-- Join with dimension table
SELECT
e.event_type,
e.user_id,
COUNT(*) AS event_count,
SUM(e.event_value) AS total_value
FROM analytics.events e
INNER JOIN analytics.user_profiles u ON e.user_id = u.user_id
WHERE e.event_time >= now() - INTERVAL 7 DAY
GROUP BY e.event_type, e.user_id
HAVING event_count > 10
ORDER BY total_value DESC
LIMIT 50;
-- String matching with multiple conditions
SELECT
event_type,
COUNT() AS count,
COUNT(DISTINCT user_id) AS unique_users
FROM analytics.events
WHERE event_time >= today()
AND event_type LIKE '%purchase%'
AND event_value > 10
GROUP BY event_type;
Optimize query performance:
-- Check query plan
EXPLAIN SELECT * FROM analytics.events WHERE event_type = 'purchase' LIMIT 100;
-- Use PREWHERE to read the filter column first, fetching the remaining
-- columns only for rows that match
SELECT
user_id,
COUNT() as event_count
FROM analytics.events
PREWHERE event_value > 100
WHERE event_type = 'purchase'
GROUP BY user_id;
-- Use FINAL only when necessary (slower, affects query performance)
SELECT * FROM analytics.events_deduplicated
FINAL
WHERE user_id = 123;
Distributed Queries
Set up distributed tables for querying across multiple nodes:
-- On each node, create local tables (shown above)
-- On a query node, create a distributed table; cluster_name must be
-- defined in the remote_servers section of the server configuration
CREATE TABLE analytics.events_distributed AS analytics.events
ENGINE = Distributed(cluster_name, analytics, events, user_id);
-- Query distributed table
SELECT
event_type,
COUNT() as count
FROM analytics.events_distributed
WHERE event_time >= now() - INTERVAL 7 DAY
GROUP BY event_type;
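With an integer sharding key such as user_id, the Distributed engine routes each row by the remainder of the key divided by the total shard weight. A Python sketch of that routing rule (the shard layout here is illustrative):

```python
def pick_shard(sharding_key: int, shard_weights) -> int:
    """Route a row the way the Distributed engine does for an integer
    sharding expression: the remainder of the key divided by the total
    weight selects a slot, and slots are assigned to shards by weight."""
    total = sum(shard_weights)
    slot = sharding_key % total
    bound = 0
    for shard_idx, weight in enumerate(shard_weights):
        bound += weight
        if slot < bound:
            return shard_idx
    raise AssertionError("unreachable: slot is always < total")

# Three equally weighted shards: user_id 7 lands on shard 7 % 3 == 1
print(pick_shard(7, [1, 1, 1]))
```

Rows for the same user always land on the same shard, which keeps per-user aggregations local to one node.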
Monitoring and Maintenance
Monitor ClickHouse performance and health:
-- Check running queries
SELECT
query_id,
query,
user,
elapsed,
memory_usage
FROM system.processes
WHERE is_initial_query = 1
ORDER BY elapsed DESC;
-- View query history
SELECT
query_id,
query_duration_ms,
read_rows,
written_rows,
result_rows
FROM system.query_log
WHERE type = 'QueryFinish'
ORDER BY event_time DESC
LIMIT 20;
-- Check table sizes
SELECT
database,
table,
sum(bytes) as size
FROM system.parts
WHERE active = 1
GROUP BY database, table
ORDER BY size DESC;
-- Monitor memory usage
SELECT
user,
sum(memory_usage) as total_memory
FROM system.processes
GROUP BY user;
-- Check replication status
SELECT
database,
table,
is_leader,
absolute_delay
FROM system.replicas;
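The sizes in system.parts come back as raw bytes. ClickHouse itself offers formatReadableSize() for this, but when post-processing results outside the database, a small Python helper (an assumed analogue, not ClickHouse code) does the same job:

```python
def format_readable_size(n: float) -> str:
    """Render a byte count in binary units, similar to ClickHouse's
    formatReadableSize() function."""
    for unit in ("B", "KiB", "MiB", "GiB", "TiB", "PiB"):
        if abs(n) < 1024 or unit == "PiB":
            return f"{n:.2f} {unit}"
        n /= 1024

print(format_readable_size(123456789))  # 117.74 MiB
```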
Perform maintenance operations:
# Force a merge of all parts (expensive on large tables; use sparingly)
clickhouse-client --query="OPTIMIZE TABLE analytics.events FINAL"
# Remove all rows from a table (irreversible)
clickhouse-client --query="TRUNCATE TABLE analytics.events"
# Backup database (requires the third-party clickhouse-backup tool)
clickhouse-backup create --tables="analytics.*"
# Check database integrity
clickhouse-client --query="CHECK TABLE analytics.events"
Best Practices
Follow these practices for optimal ClickHouse performance:
- Choose appropriate table engines (MergeTree for analytics is standard)
- Use primary keys that match your query patterns
- Partition by date or time for easier data management
- Use materialized views for common aggregations
- Compress data with appropriate codecs
- Monitor table part counts and optimize when needed
- Use sampling for large exploratory queries
- Implement retention policies for old data
- Set appropriate quotas to prevent resource exhaustion
- Test queries on sample data before running on full datasets
-- Example: Implement compression and TTL
CREATE TABLE analytics.events_compressed (
event_id UInt64,
event_time DateTime,
user_id UInt64,
event_type String CODEC(ZSTD(3)),
event_value Float32 CODEC(ZSTD(3))
) ENGINE = MergeTree()
ORDER BY event_time
PARTITION BY toYYYYMM(event_time)
TTL event_time + INTERVAL 90 DAY DELETE;
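To reason about what this table does with a given row, here is a Python sketch (illustrative, not ClickHouse code) of the monthly partition key and the TTL expiry rule:

```python
from datetime import datetime, timedelta

def partition_id(event_time: datetime) -> str:
    """toYYYYMM(event_time): the monthly partition a row lands in."""
    return f"{event_time.year}{event_time.month:02d}"

def is_expired(event_time: datetime, now: datetime, ttl_days: int = 90) -> bool:
    """A row becomes eligible for TTL deletion once
    event_time + INTERVAL ttl_days DAY has passed."""
    return now >= event_time + timedelta(days=ttl_days)

t = datetime(2024, 1, 15, 10, 30)
print(partition_id(t))                      # 202401
print(is_expired(t, datetime(2024, 5, 1)))  # True: more than 90 days later
```

Because the partition key is monthly, ClickHouse can also drop whole expired partitions cheaply instead of rewriting parts row by row.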
Conclusion
ClickHouse provides an exceptional platform for real-time analytics on massive datasets. Its column-oriented architecture, powerful compression, and SQL capabilities enable fast queries and cost-effective storage. By understanding table engines, implementing effective partitioning strategies, and leveraging materialized views for pre-computed aggregates, you can build scalable analytics infrastructure. Whether analyzing event streams, metrics, or logs, ClickHouse delivers the performance needed for modern data analytics workloads.