ClickHouse Data Warehouse 101 - The First Billion Rows
Understands SQL
Runs on bare metal to cloud
Is WAY fast!
Tables are split into indexed, sorted parts for fast queries
[Diagram: a table is split into parts; each part is sorted and indexed by the index columns]
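How this looks in DDL, as a minimal sketch (the trips table and its columns are hypothetical): PARTITION BY controls how rows are split into partitions of parts, and ORDER BY sets the sort order and the sparse primary index inside each part.

-- Hypothetical example table
CREATE TABLE trips (
    pickup_date Date,
    pickup_location_id UInt32,
    passenger_count UInt8
) ENGINE = MergeTree
PARTITION BY toYYYYMM(pickup_date)  -- one partition of parts per month
ORDER BY (pickup_location_id, pickup_date);  -- sort order; primary index follows this key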
If one server is not enough -- ClickHouse can scale out easily
[Diagram: three ClickHouse servers, each holding tripdata_dist (Distributed) over a local tripdata (MergeTree) table; a SELECT ... FROM tripdata_dist sent to any server fans out to all of them and returns a single result set]
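A minimal sketch of the two-table pattern, assuming a cluster named ch_cluster defined in the server config (the cluster name is hypothetical):

-- Local storage table, created on every node
CREATE TABLE tripdata ON CLUSTER ch_cluster (
    pickup_date Date,
    passenger_count UInt8
) ENGINE = MergeTree
PARTITION BY toYYYYMM(pickup_date)
ORDER BY pickup_date;

-- Thin proxy table that fans queries out to all shards
CREATE TABLE tripdata_dist ON CLUSTER ch_cluster AS tripdata
ENGINE = Distributed(ch_cluster, default, tripdata, rand());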
Getting Started: Data Loading
Installation: Use packages on Linux host
$ sudo apt -y install clickhouse-client=19.6.2 \
clickhouse-server=19.6.2 \
clickhouse-common-static=19.6.2
...
$ sudo systemctl start clickhouse-server
...
$ clickhouse-client
11e99303c78e :) select version()
...
┌─version()─┐
│ 19.6.2.11 │
└───────────┘
Decision tree for ClickHouse basic schema design
[Decision tree: if the data does not fit a fixed tabular structure, use scalar columns with String type or array columns to store key-value pairs; otherwise select a partition key and sort order]
Tabular data structure typically gives the best results
See https://github.com/Altinity/altinity-datasets
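As a sketch of the key-value branch of the decision tree (the events table and attribute columns are hypothetical), paired arrays can hold attributes whose names are not known in advance:

-- Hypothetical example: ad-hoc key/value attributes stored in paired arrays
CREATE TABLE events (
    event_date Date,
    event_type String,
    attr_keys Array(String),
    attr_values Array(String)
) ENGINE = MergeTree
PARTITION BY toYYYYMM(event_date)
ORDER BY (event_type, event_date);

-- Fetch one attribute by key; indexOf returns the 1-based position (0 if absent,
-- which yields the empty default value)
SELECT attr_values[indexOf(attr_keys, 'browser')] AS browser
FROM events
WHERE event_type = 'page_view';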
How long does it take to load 1.3B rows?
$ time ad-cli dataset load nyc_taxi_rides --repo_path=/data1/sample-data
Creating database if it does not exist: nyc_timed
Executing DDL: /data1/sample-data/nyc_taxi_rides/ddl/taxi_zones.sql
. . .
Loading data: table=tripdata, file=data-200901.csv.gz
. . .
Operation summary: succeeded=193, failed=0
real 11m4.827s
user 63m32.854s
sys 2m41.235s
(Amazon m5d.2xlarge: Xeon(R) Platinum 8175M, 8 vCPU, 30GB RAM, NVMe SSD)
Do we really have a 1B+ row table?
:) select count() from tripdata;
SELECT count()
FROM tripdata
┌────count()─┐
│ 1310903963 │
└────────────┘
1 rows in set. Elapsed: 0.324 sec. Processed 1.31 billion rows, 1.31 GB (4.05
billion rows/s., 4.05 GB/s.)
Now we try with the real data
SELECT avg(passenger_count)
FROM tripdata
┌─avg(passenger_count)─┐
│ 1.6817462943317076 │
└──────────────────────┘
1 rows in set. Elapsed: 3.420 sec. Processed 1.31 billion rows, 10.49 GB (383.29
million rows/s., 3.07 GB/s.)
The same query after shrinking passenger_count to a 1-byte type reads 1.31 GB instead of 10.49 GB:
┌─avg(passenger_count)─┐
│ 1.6817462943317076 │
└──────────────────────┘
1 rows in set. Elapsed: 1.084 sec. Processed 1.31 billion rows, 1.31 GB (1.21
billion rows/s., 1.21 GB/s.)
Even faster!!!!
Data type and cardinality matter
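To see which columns are oversized and shrink them, one approach (tripdata is the deck's table; the UInt8 choice assumes the values always fit in one byte):

-- Per-column on-disk footprint for the tripdata table
SELECT name, type,
    formatReadableSize(data_compressed_bytes) AS compressed,
    formatReadableSize(data_uncompressed_bytes) AS uncompressed
FROM system.columns
WHERE table = 'tripdata'
ORDER BY data_compressed_bytes DESC;

-- Narrow the column type; assumes passenger_count fits in UInt8
ALTER TABLE tripdata MODIFY COLUMN passenger_count UInt8;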
What if we add a filter?
SELECT avg(passenger_count)
FROM tripdata
WHERE toYear(pickup_date) = 2016
┌─avg(passenger_count)─┐
│ 1.6571129913837774 │
└──────────────────────┘
1 rows in set. Elapsed: 0.162 sec. Processed 131.17 million rows, 393.50 MB (811.05
million rows/s., 2.43 GB/s.)
Only 131 million of 1.31 billion rows are read: the filter on pickup_date lets ClickHouse prune parts via the partition key.
What if we add a group by?
SELECT
pickup_location_id AS location_id,
avg(passenger_count),
count()
FROM tripdata
WHERE toYear(pickup_date) = 2016
GROUP BY location_id LIMIT 10
...
10 rows in set. Elapsed: 0.251 sec. Processed 131.17 million rows, 655.83 MB
(522.62 million rows/s., 2.61 GB/s.)
What if we add a join?
SELECT
zone,
avg(passenger_count),
count()
FROM tripdata
INNER JOIN taxi_zones ON taxi_zones.location_id = pickup_location_id
WHERE toYear(pickup_date) = 2016
GROUP BY zone
LIMIT 10
10 rows in set. Elapsed: 0.803 sec. Processed 131.17 million rows, 655.83 MB (163.29
million rows/s., 816.44 MB/s.)
Yes, ClickHouse is FAST!
https://tech.marksblogg.com/benchmarks.html
Optimization Techniques
How to make ClickHouse even faster
You can optimize
Server settings
Schema
Column storage
Queries
You can optimize server settings
SELECT avg(passenger_count)
FROM tripdata
SETTINGS max_threads = 1
...
1 rows in set. Elapsed: 4.855 sec. Processed 1.31 billion rows, 1.31 GB
(270.04 million rows/s., 270.04 MB/s.)
(The default max_threads is half of the available cores -- usually good enough.)
SELECT avg(passenger_count)
FROM tripdata
SETTINGS max_threads = 8
...
1 rows in set. Elapsed: 1.092 sec. Processed 1.31 billion rows, 1.31 GB (1.20
billion rows/s., 1.20 GB/s.)
Schema optimizations
Data types
Index
Dictionaries
Arrays
https://www.percona.com/blog/2019/02/15/clickhouse-performance-uint32-vs-uint64-vs-float32-vs-float64/
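Of these, dictionaries deserve a quick sketch: in newer ClickHouse releases they can be declared in SQL and replace a join at query time (the dictionary name, layout, and lifetime below are illustrative; taxi_zones is the deck's lookup table):

-- Illustrative dictionary over the taxi_zones table
CREATE DICTIONARY taxi_zones_dict (
    location_id UInt64,
    zone String
)
PRIMARY KEY location_id
SOURCE(CLICKHOUSE(TABLE 'taxi_zones'))
LAYOUT(HASHED())
LIFETIME(MIN 300 MAX 600);

-- dictGet replaces the earlier INNER JOIN
SELECT
    dictGet('taxi_zones_dict', 'zone', toUInt64(pickup_location_id)) AS zone,
    avg(passenger_count),
    count()
FROM tripdata
WHERE toYear(pickup_date) = 2016
GROUP BY zone
LIMIT 10;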
MaterializedView with SummingMergeTree
CREATE MATERIALIZED VIEW tripdata_mv
ENGINE = SummingMergeTree
PARTITION BY toYYYYMM(pickup_date)
ORDER BY (pickup_location_id, dropoff_location_id, vendor_id) AS
SELECT
    pickup_date,
    vendor_id,
    pickup_location_id,
    dropoff_location_id,
    sum(passenger_count) AS passenger_count_sum,
    sum(trip_distance) AS trip_distance_sum,
    sum(fare_amount) AS fare_amount_sum,
    sum(tip_amount) AS tip_amount_sum,
    sum(tolls_amount) AS tolls_amount_sum,
    sum(total_amount) AS total_amount_sum,
    count() AS trips_count
FROM tripdata
GROUP BY
    pickup_date,
    vendor_id,
    pickup_location_id,
    dropoff_location_id
A MaterializedView works as an INSERT trigger; SummingMergeTree automatically aggregates data in the background.
MaterializedView with SummingMergeTree
INSERT INTO tripdata_mv SELECT
    pickup_date,
    vendor_id,
    pickup_location_id,
    dropoff_location_id,
    passenger_count,
    trip_distance,
    fare_amount,
    tip_amount,
    tolls_amount,
    total_amount,
    1
FROM tripdata;
Ok.
Note, no GROUP BY! SummingMergeTree automatically aggregates the data in the background.
:) select count() from tripdata_mv;
┌──count()─┐
│ 20742525 │
└──────────┘
1 rows in set. Elapsed: 0.015 sec. Processed 20.74 million rows, 41.49 MB (1.39 billion
rows/s., 2.78 GB/s.)
SELECT
zone,
sum(passenger_count_sum)/sum(trips_count),
sum(trips_count)
FROM tripdata_mv
INNER JOIN taxi_zones ON taxi_zones.location_id = pickup_location_id
WHERE toYear(pickup_date) = 2016
GROUP BY zone
LIMIT 10
10 rows in set. Elapsed: 0.036 sec. Processed 3.23 million rows, 64.57 MB (89.14 million
rows/s., 1.78 GB/s.)
The same join and group by that took 0.803 sec against raw tripdata runs in 0.036 sec against the pre-aggregated view.
Realtime Aggregation with Materialized Views
[Diagram: INSERTs flow into the raw data table and are propagated to several SummingMergeTree materialized views in real time]
Column storage optimizations
Compression
LowCardinality
Column encodings
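A minimal sketch combining the three (the metrics table is hypothetical; the Delta and Gorilla codecs are available in recent ClickHouse releases):

-- Column encodings and compression declared per column
CREATE TABLE metrics (
    ts DateTime CODEC(Delta, LZ4),    -- delta encoding, then LZ4 compression
    host LowCardinality(String),      -- dictionary-encoded strings
    value Float64 CODEC(Gorilla)      -- codec designed for time-series floats
) ENGINE = MergeTree
ORDER BY (host, ts);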
LowCardinality example. Another 1B rows.
LowCardinality encodes a column with dictionary encoding.
:) create table test_lc (
     a String,
     a_lc LowCardinality(String) DEFAULT a)
   Engine = MergeTree
   PARTITION BY tuple() ORDER BY tuple();

:) INSERT INTO test_lc (a) SELECT
   concat('openconfig-interfaces:interfaces/interface/subinterfaces/subinterface/state/index',
   toString(rand() % 1000))
   FROM numbers(1000000000);
LowCardinality example. Another 1B rows
:) select a a, count(*) from test_lc group by a order by count(*) desc limit 10;
┌─a────────────────────────────────────────────────────────────────────────────────────┬─count()─┐
│ openconfig-interfaces:interfaces/interface/subinterfaces/subinterface/state/index396 │ 1002761 │
...
│ openconfig-interfaces:interfaces/interface/subinterfaces/subinterface/state/index5 │ 1002203 │
└──────────────────────────────────────────────────────────────────────────────────────┴─────────┘
10 rows in set. Elapsed: 11.627 sec. Processed 1.00 billion rows, 92.89 GB (86.00 million
rows/s., 7.99 GB/s.)
Faster
:) select a_lc a, count(*) from test_lc group by a order by count(*) desc limit 10;
...
10 rows in set. Elapsed: 1.569 sec. Processed 1.00 billion rows, 3.42 GB (637.50 million
rows/s., 2.18 GB/s.)
Array example. Another 1B rows
Arrays efficiently model a 1-to-N relationship.
create table test_array (
    s String,
    a Array(LowCardinality(String)) default arrayDistinct(splitByChar(',', s))
) Engine = MergeTree PARTITION BY tuple() ORDER BY tuple();
Storage (column sizes in bytes):
┌─table──────┬─name─┬─type──────────────────────────┬────────comp─┬──────uncomp─┐
│ test_array │ s │ String │ 11239860686 │ 31200058000 │
│ test_array │ a │ Array(LowCardinality(String)) │ 4275679420 │ 11440948123 │
└────────────┴──────┴───────────────────────────────┴─────────────┴─────────────┘
Array example. Another 1B rows
:) select count() from test_array where s like '%ClickHouse%';
┌───count()─┐
│ 343877409 │
└───────────┘
1 rows in set. Elapsed: 7.363 sec. Processed 1.00 billion rows, 39.20 GB (135.81 million
rows/s., 5.32 GB/s.)
Counting via the array column reads far less data (11.44 GB vs 39.20 GB):
┌───count()─┐
│ 343877409 │
└───────────┘
1 rows in set. Elapsed: 8.428 sec. Processed 1.00 billion rows, 11.44 GB (118.66 million
rows/s., 1.36 GB/s.)
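The slide does not show the second query itself; a hedged reconstruction of what a count over the array column could look like:

-- Hypothetical reconstruction: has() tests array membership, reading only column a
SELECT count() FROM test_array WHERE has(a, 'ClickHouse');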
ClickHouse Integrations
...And a nice set of supporting ecosystem tools
https://github.com/Altinity/clickhouse-operator
Where to get more information
Visit us at: https://www.altinity.com
Thank you!