ClickHouse is a columnar database management system optimized for online analytical processing (OLAP).
ClickHouse was originally developed to support the Yandex.Metrica web analytics platform and was later spun off as a separate open-source project. Yandex's own figures give a sense of its capabilities: the company states that the database handles 13 trillion records and 20 billion events per day while generating customized reports on the fly.
Data in ClickHouse is organized by column: the values of a single attribute are stored together. This makes it efficient to fetch large sets of values for specific attributes and to analyze their patterns and their influence on one another. Queries that need only a few columns (attributes) can therefore run very quickly.
In a traditional row-based database, by contrast, data is stored row by row, and each record groups all attribute values of a particular object. This layout works well for operations that need all attributes of one object at once, but it is less effective for analytical queries that scan large volumes of data for a few individual attributes.
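The difference between the two layouts can be shown with a minimal Python sketch. The records, attribute names, and helper functions below are hypothetical and only illustrate the idea; they say nothing about how ClickHouse stores data internally.

```python
# Hypothetical toy data: three page-view records with three attributes each.
rows = [
    {"user_id": 1, "url": "/home", "duration_ms": 120},
    {"user_id": 2, "url": "/docs", "duration_ms": 340},
    {"user_id": 1, "url": "/docs", "duration_ms": 95},
]

# Column-oriented layout: the values of each attribute are stored together.
columns = {
    "user_id": [r["user_id"] for r in rows],
    "url": [r["url"] for r in rows],
    "duration_ms": [r["duration_ms"] for r in rows],
}


def total_duration_rows(records):
    # Row layout: every record is visited as a whole,
    # even though only one attribute is needed.
    return sum(record["duration_ms"] for record in records)


def total_duration_columns(cols):
    # Column layout: only the "duration_ms" column is read;
    # the "user_id" and "url" columns are never touched.
    return sum(cols["duration_ms"])


print(total_duration_rows(rows), total_duration_columns(columns))
```

Both functions return the same total. The point of the sketch is that the columnar version reads one contiguous list, whereas the row version has to walk through every record.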
Along with columnar data organization, ClickHouse implements a number of measures aimed at improving performance:
- Compact data storage
ClickHouse supports fixed-length values, so numeric values do not have to be stored alongside their length.
- Support for data compression
ClickHouse compresses stored data, which reduces the amount of data read from disk and plays an important role in its performance.
- Storing data on a regular hard disk drive
Many columnar DBMSs can operate only in RAM. ClickHouse can store its data on ordinary hard disks.
- Parallel query processing
ClickHouse parallelizes query execution efficiently, making full use of the resources available on the server.
- Distributed query processing
In ClickHouse, a query can be executed on all distributed shards (database segments) in parallel.
- SQL support
ClickHouse has its own SQL-based query language, and in many cases its syntax is the same as standard SQL.
- Vector processing
Data in ClickHouse is processed in vectors, that is, in fragments of columns. This results in high processing efficiency.
- Primary key and continuous data insertion
ClickHouse tables can have a primary key, which speeds up range queries on that key. Data can be added to a table continuously, without locking.
Because data is physically sorted by primary key, rows for specific key values or key ranges can be retrieved with low latency.
- Suitable for online queries
Low latency makes it possible to answer queries online, as they arrive, instead of deferring their execution.
- Support for approximate calculations
ClickHouse offers several ways to trade precision for performance when an exact result is not needed.
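One general way to trade precision for speed is to aggregate over a sample of the data and scale the result. The Python sketch below illustrates that idea only. The function names and the hash-based selection rule are assumptions made for this example, not ClickHouse's implementation.

```python
import hashlib


def in_sample(user_id: int, sample_every: int) -> bool:
    """Deterministically keep roughly 1 out of `sample_every` users."""
    digest = hashlib.sha256(str(user_id).encode()).digest()
    return int.from_bytes(digest[:8], "big") % sample_every == 0


def exact_count(events: list[int]) -> int:
    # Exact answer: every event is inspected.
    return len(events)


def approx_count(events: list[int], sample_every: int = 10) -> int:
    # Approximate answer: count only the sampled events, then scale up.
    sampled = sum(1 for user_id in events if in_sample(user_id, sample_every))
    return sampled * sample_every


# Hypothetical workload: one event per user, identified by user id.
events = list(range(50_000))
print("exact:", exact_count(events), "approximate:", approx_count(events))
```

The estimate is close to the exact count but not identical. In a real system the saving comes from reading only the sampled fraction of the data from storage; this sketch still iterates over everything and shows only the arithmetic.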
The disadvantages, or peculiarities, of ClickHouse include:
- No full-fledged transactions.
- Deleting and modifying individual records has high latency, although there are efficient means for bulk deletion and modification of data.
- The sparse index makes ClickHouse ill-suited for point reads of single rows.
- Does not fully conform to the ANSI SQL:2008 standard and is not compatible with the PostgreSQL dialect.
- Local and distributed JOINs are supported, but they can be expensive on large tables, so queries that rely heavily on JOINs are a weak spot.
ClickHouse is not suitable for key-value workloads: such operations can be performed, but with high latency and low performance. It is, however, a good option for time-series data, where it provides high query execution speed. ClickHouse is intended primarily for analytics, and for other purposes another DBMS is probably a better choice.
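The following minimal Python sketch shows why a sparse index over sorted data suits range reads better than single-row lookups. It assumes that the index keeps one entry (a "mark") for each block of rows (a "granule"). The granule size of 4 and all function names are hypothetical and chosen for readability; real granules are far larger.

```python
from bisect import bisect_left, bisect_right

GRANULE_SIZE = 4  # hypothetical value for the example


def build_sparse_index(sorted_keys, granule_size=GRANULE_SIZE):
    # Keep only the first key of every granule instead of indexing each row.
    return [sorted_keys[i] for i in range(0, len(sorted_keys), granule_size)]


def granules_for_range(index, low, high):
    # The granule just before the first mark >= low may still hold `low`.
    first = max(bisect_left(index, low) - 1, 0)
    # The last granule whose first key is <= high may still hold `high`.
    last = bisect_right(index, high) - 1
    return range(first, last + 1)


def range_query(sorted_keys, index, low, high, granule_size=GRANULE_SIZE):
    rows_read = 0
    matches = []
    for g in granules_for_range(index, low, high):
        # Whole granules are read, then filtered row by row.
        granule = sorted_keys[g * granule_size:(g + 1) * granule_size]
        rows_read += len(granule)
        matches.extend(k for k in granule if low <= k <= high)
    return matches, rows_read


keys = list(range(32))            # data physically sorted by primary key
index = build_sparse_index(keys)  # 8 marks for 32 rows

print(range_query(keys, index, 10, 13))  # key range
print(range_query(keys, index, 10, 10))  # single-row lookup
```

In this sketch the single-row lookup still reads a whole granule to return one row, while the range read spreads the same per-granule cost over several matching rows. With realistic granule sizes, the overhead of a point read grows accordingly.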
Typical tasks for which ClickHouse is used are:
- Online real-time analytics
ClickHouse lets you run analytical queries in real time with low response latency. It also offers powerful aggregation, grouping, filtering and sorting capabilities, which make it effective for complex analytical queries, including multivariate analysis, data segmentation, calculation of statistical indicators, web traffic analytics, financial analysis, etc.
- Analysis of large data volumes
ClickHouse is capable of processing and analyzing huge amounts of data. It handles terabyte-sized datasets efficiently and delivers high performance for queries that scan large portions of them.
- Trend detection and behavior prediction
ClickHouse is widely used to process event logs, audit records and other event-driven data. It enables real-time analysis of this data, identifying trends and system issues, predicting system behavior, etc.
- IIoT (Industrial Internet of Things) analytics
ClickHouse is used to process and analyze data generated by IIoT devices, sensors and controllers. It is capable of processing streaming data, performing real-time aggregation and analytics, and storing historical data for later analysis. For example, it can be used for production planning, assessing equipment performance, identifying bottlenecks and predicting faults.
- Marketing performance analysis
ClickHouse can be a useful tool for analyzing marketing performance. It can be used to track impressions, clicks, conversions, etc., segment data, calculate marketing performance indicators based on them, and create reports.
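As an illustration of the kind of calculation involved, the Python sketch below segments a handful of events by channel and derives two common indicators: click-through rate and conversion rate. The event data, channel names and function are hypothetical. In practice such aggregation would run as a query inside the database rather than in application code.

```python
from collections import defaultdict

# Hypothetical event stream: (channel, event type).
events = [
    ("email", "impression"), ("email", "impression"),
    ("email", "impression"), ("email", "impression"),
    ("email", "click"), ("email", "click"),
    ("email", "conversion"),
    ("search", "impression"), ("search", "impression"),
    ("search", "click"),
]


def campaign_metrics(events):
    # Group events by channel and count each event type.
    counts = defaultdict(lambda: {"impression": 0, "click": 0, "conversion": 0})
    for channel, kind in events:
        counts[channel][kind] += 1

    report = {}
    for channel, c in counts.items():
        # Click-through rate: clicks per impression.
        ctr = c["click"] / c["impression"] if c["impression"] else 0.0
        # Conversion rate: conversions per click.
        cvr = c["conversion"] / c["click"] if c["click"] else 0.0
        report[channel] = {"ctr": ctr, "conversion_rate": cvr}
    return report


for channel, metrics in campaign_metrics(events).items():
    print(channel, metrics)
```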