This blog is about ClickHouse, a blazing fast column-oriented DBMS, and how to configure it to consume data from Kafka in a simple way.
As per the documentation, you can easily create a specific Kafka table that will, behind the scenes, generate 1 (or more) consumer(s) for the Kafka topic(s) you chose.
Let’s say we are consuming from a metrics
topic, messages with the following structure:
{
"name": "<string>",
"date": <epoch_unix>,
"message": "<string>"
}
The following table will allow to consume messages from this topic:
CREATE TABLE IF NOT EXISTS metrics
(
`name` String,
`date` DateTime,
`message` String
) ENGINE = Kafka()
SETTINGS
kafka_broker_list = 'kafka:9092',
kafka_topic_list = 'metrics',
kafka_group_name = 'metrics-group',
kafka_format = 'JSONEachRow'
The use of JSONEachRow
assumes that all your messages have this structure (with the three keys name
, date
and message
). This might break if some of them dont have this structure but it is OK if you can ensure that this will not happen.
Be careful, this table only temporary stores messages. As soon as the message are requested at least one time, they will be lost if not stored (and linked offset will be commited to Kafka). To actually store the message in a table, you will need to stream them to a MATERIALIZED VIEW
that will then put it to the final table.
Lets keep it simple for now and lets just use a classic MergeTree()
engine table:
CREATE TABLE final_metrics
(
`name` String,
`date` DateTime,
`message` String
) ENGINE = MergeTree()
PARTITION BY toYYYYMM(date)
ORDER BY name
MATERIALIZED VIEW
The following table will receive the rows from the metrics
table and put it in the final_metrics
table:
CREATE MATERIALIZED VIEW metrics_consumer TO final_metrics
AS SELECT
name,
date,
message
FROM metrics
If wanted, we would be able to use this MATERIALIZED VIEW
to filter the kind of data we want. For instance, filtering on the name:
CREATE MATERIALIZED VIEW metrics_consumer TO final_metrics
AS SELECT
name,
date,
message
FROM metrics
WHERE name = 'magic-metric'