Kafka Ebook SoftwareMill
SoftwareMill Team
Before we dive into Kafka, let’s start with a quick recap of what
publish / subscribe messaging is.
It’s a messaging pattern where the sender doesn’t send data directly to
a specific receiver. Instead, the publisher classifies messages without
knowing whether there are any subscribers interested in particular types
of messages, and the receiver subscribes to a certain type of message
without knowing whether there are any senders producing them. All
messages are published to a broker, which ensures that messages are
delivered to the correct subscribers.
[Diagram: a sender publishes messages (e.g. metrics, users) to the broker; a receiver subscribes for messages on a given topic.]
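The flow above can be sketched with a minimal in-memory broker. The `Broker` class and its method names are purely illustrative, not a real Kafka API:

```python
# Minimal in-memory sketch of the publish/subscribe pattern.
# Names (Broker, subscribe, publish) are illustrative, not Kafka's API.
from collections import defaultdict

class Broker:
    def __init__(self):
        self._subscribers = defaultdict(list)  # topic -> list of callbacks

    def subscribe(self, topic, callback):
        # The receiver registers interest in a topic, not in a specific sender.
        self._subscribers[topic].append(callback)

    def publish(self, topic, message):
        # The sender classifies the message by topic; the broker routes it
        # to every current subscriber of that topic (possibly none).
        for callback in self._subscribers[topic]:
            callback(message)

received = []
broker = Broker()
broker.subscribe("metrics", received.append)
broker.publish("metrics", {"cpu": 0.42})
broker.publish("users", {"id": 1})  # no subscribers: nothing happens
print(received)  # [{'cpu': 0.42}]
```

Note that the publisher never learns who, if anyone, received the message — exactly the decoupling described above.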
Kafka is used heavily in the big data space as a reliable way to ingest
and move large amounts of data very quickly. It allows us to build
modern and scalable ETL (extract, transform, load), CDC (change data
capture), and big data ingest systems.
Currently there are 28k companies using Apache Kafka. Among them are
Uber, Netflix, Activision, Spotify, Slack, Pinterest, Coursera and LinkedIn.
Event-driven Microservices
Application architecture is shifting from monolithic
enterprise systems to flexible, event-driven approaches.
Apache Kafka is a scalable, highly available, and fault-tolerant
asynchronous communication backbone for microservices
architectures. Services publish events to Kafka while
downstream services react to those events instead of being
called directly. In this fashion, event-producing services are
decoupled from event-consuming services.
Stream processing
By definition, Apache Kafka is a distributed streaming
platform for building real-time data pipelines and real-time
streaming applications. Adopting stream processing significantly
reduces the time between when an event is recorded and when
the system reacts to it, enabling real-time reactions.
Metrics
Kafka is often used to build operational monitoring data pipelines
and enables alerting and reporting on operational metrics.
It aggregates statistics from distributed applications and
produces centralized feeds of operational data.
Log aggregation
Kafka can be used across an organization to collect logs
from multiple services and make them available in standard
format to multiple consumers. It provides low-latency
processing and easier support for multiple data sources and
distributed data consumption.
Messaging
Kafka works well as a replacement for a more traditional
message broker. In comparison to most messaging systems
Kafka has better throughput, built-in partitioning, replication,
and fault-tolerance which makes it a good solution for large
scale message processing applications.
Schema Registry
Messages sent to Kafka usually have predefined schemas.
Schemas can be managed and versioned using the Schema
Registry project. In that case, a message contains binary-serialized
data and a reference to the specific schema version. What is more,
formats such as Apache Avro allow for schema migrations,
supporting changes in the data format.
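As a rough sketch of how a message can combine binary data with a schema reference, the layout below follows the Confluent wire format (a magic byte, a 4-byte schema id, then the serialized payload); the helper functions themselves are hypothetical:

```python
# Sketch of a message carrying serialized data plus a schema reference.
# Layout mirrors the Confluent wire format: 1 magic byte, 4-byte schema id,
# then the payload. Treat this as an illustration, not a specification.
import struct

MAGIC_BYTE = 0

def encode(schema_id: int, payload: bytes) -> bytes:
    # ">bI" = big-endian: 1-byte magic, 4-byte unsigned schema id
    return struct.pack(">bI", MAGIC_BYTE, schema_id) + payload

def decode(message: bytes) -> tuple[int, bytes]:
    magic, schema_id = struct.unpack(">bI", message[:5])
    assert magic == MAGIC_BYTE, "unknown serialization format"
    return schema_id, message[5:]

msg = encode(42, b"\x02Bob")   # payload would really be Avro-encoded data
schema_id, payload = decode(msg)
print(schema_id, payload)      # 42 b'\x02Bob'
```

A consumer uses the embedded schema id to fetch the exact schema version from the registry before deserializing the payload.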
Kafka Connect
Integration with external systems is a very important topic.
That’s the role of Kafka Connect. Various connectors allow
you to read data from different sources, such as databases
and send them to Kafka. Other types allow you to write Kafka
messages to special sinks, which similarly can be a database
or other type of external system you want to integrate with.
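For illustration, a connector is typically configured with a small set of properties. The sketch below assumes the Confluent JDBC source connector and its property names; check the connector’s own documentation for the exact keys:

```json
{
  "name": "orders-db-source",
  "config": {
    "connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
    "tasks.max": "1",
    "connection.url": "jdbc:postgresql://db:5432/shop",
    "mode": "incrementing",
    "incrementing.column.name": "id",
    "topic.prefix": "db-"
  }
}
```

Posting such a JSON document to the Connect REST API starts a connector that polls the database and publishes new rows to Kafka topics prefixed with `db-`.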
Messages sent to a topic are pretty simple, but they are built
of two important parts: a key and a value. The value is the data
you are passing, while the key determines the partition the
message will be put into. By default, Kafka calculates the partition
for a given message based on the message key. If the key is not
specified, then messages are distributed among partitions using
round-robin.
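The rule above can be sketched as follows. Note that the real Java client hashes keys with murmur2; this sketch substitutes a stand-in hash (md5) to stay short, so the computed partition numbers will differ from Kafka’s:

```python
# Simplified sketch of Kafka's default partitioning rule.
# The real Java client uses murmur2; md5 here is only a stand-in.
import hashlib
from itertools import count

_round_robin = count()

def partition_for(key, num_partitions: int) -> int:
    if key is None:
        # No key: spread messages across partitions round-robin.
        return next(_round_robin) % num_partitions
    # Keyed message: hashing the key guarantees that equal keys always
    # land on the same partition, preserving their relative order.
    digest = hashlib.md5(key).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions

p1 = partition_for(b"user-42", 6)
p2 = partition_for(b"user-42", 6)
print(p1 == p2)  # True: same key, same partition
```

This is why choosing a good key matters: all messages sharing a key go to one partition, which gives per-key ordering but can also skew load if a few keys dominate.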
To learn more about the Schema Registry, take a look at the blog
post "Dog ate my schema".
Kubernetes Operator
Before diving into the typical cloud-specific offerings, let’s start with
Kubernetes. Your company may already be leveraging it. If that is the
case, a reasonable option may be to simply set up a Kafka cluster on K8s
using one of the available Kubernetes Operators. It may be the cheapest
option; however, keep in mind that in this approach maintenance &
monitoring are on your side, and this adds up to the final costs.
Strimzi
Strimzi is probably the best-known Kafka operator. It not only allows
running the brokers, but also offers support for Kafka Connect, MirrorMaker,
the Kafka Exporter (for metrics) and Kafka Bridge (an HTTP API). It is even
possible to manage topics automatically. We have used Strimzi in some of
our projects and it worked quite well.
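To give a feel for how such an operator is used, a minimal Strimzi `Kafka` custom resource might look roughly like this (field names follow the Strimzi API at the time of writing; verify against the version you deploy):

```yaml
apiVersion: kafka.strimzi.io/v1beta2
kind: Kafka
metadata:
  name: my-cluster
spec:
  kafka:
    replicas: 3                 # three broker pods
    listeners:
      - name: plain
        port: 9092
        type: internal
        tls: false
    storage:
      type: persistent-claim    # durable volumes for broker data
      size: 100Gi
  zookeeper:
    replicas: 3
    storage:
      type: persistent-claim
      size: 100Gi
  entityOperator:
    topicOperator: {}           # enables automatic topic management
```

Applying this manifest with `kubectl apply` lets the operator create and reconcile the whole cluster for you.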
Confluent operator
As an alternative, Confluent has released its own Kubernetes Operator.
It supports only the commercial Confluent Platform. However, on the
other hand, it supports additional products from the Kafka family that
Strimzi does not, such as Schema Registry and ksqlDB.
Koperator
There is one more operator, less known but promising. The Koperator
was formerly known as the Banzai Cloud Kafka Operator and is now part
of Cisco. It allows you to set up a Kafka cluster together with Cruise
Control and Prometheus. The open-source version has a limited set of
features, but in the enterprise variant it is possible to use Kafka Connect
and ksqlDB.
MSK Connect
Amazon MSK Connect provides a managed Kafka Connect
service. It is charged hourly for connector usage, depending on
the number of workers. You can use it together with MSK, but not
only: other Apache Kafka clusters are compatible as well.
Azure
Apache Kafka on HDInsight architecture
Azure HDInsight is a product that allows running Apache
Hadoop, Apache Spark and other big data systems. It is possible
to leverage it for Apache Kafka as well. You are billed for the
provisioned cluster (pricing depends on the number of nodes
and their types). Data is encrypted at rest and the service has
99.9% uptime. You have to monitor it using Azure Monitor, quite
similarly to Amazon MSK.
Externally Hosted Services
Apart from cloud-specific services, external companies offer
running and managing Kafka and related products in the cloud
of your choice, sometimes even on your own cloud account.
Confluent Cloud
Confluent Cloud is the product that offers the biggest number
of well-known Kafka-related services. It supports Connect (including
Confluent connectors) and Schema Registry, but also ksqlDB. It has
a few custom solutions as well, like Stream Designer or Stream
Governance, which can be a nice base for an event-streaming
platform.
Aiven
Aiven is quite similar to Instaclustr. It offers Kafka, Kafka Connect and
Karapace. It can be used with AWS, Azure, GCP, DigitalOcean and
UpCloud. They offer 3 pricing plans: Startup, Business and Premium.
Only Premium can be run under your own cloud account. Plans have
different maximum storage, and under each of them different variants
of CPU & RAM per VM are available.
Summary
Apache Kafka has become a de facto standard for event-driven and
data streaming architectures, hence every major cloud provider offers
Kafka-related services. They offer different pricing models, SLA terms
and features. Due to licensing limitations, some of the projects can’t
be offered as SaaS by anyone apart from Confluent. That is why
alternatives are being created, or integrations with other similar,
already existing services are offered.
What to choose for your next project? That is not an easy question.
It depends! It depends on the SLA, pricing, and security requirements:
what certifications are needed, and whether you need a 100%
automatically managed environment or whether a bit of maintenance is
OK for you. It depends on the status of the project and what the
performance requirements are. There are various factors that you need
to consider before making a decision.
Good luck!
What about the upper bound? When do we know that there are
too many partitions? Well, a large number of partitions influences
various processes. Producers buffer batches of messages sent to
Kafka per partition. A bigger number of partitions means a lower
probability that messages with different keys will land on the same
partition, and therefore a lower probability of filling larger batches.
Additionally, more partitions means more separate buffers, which
means more memory. As a result, too big a partition number can
hurt producer throughput and increase memory usage.
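The trade-off can be made concrete with a back-of-the-envelope calculation based on the Java producer’s settings; the values below are the client defaults at the time of writing (`batch.size` 16 KB, `buffer.memory` 32 MB):

```python
# Back-of-the-envelope: per-partition batch buffers vs. total producer memory.
# Property names and defaults are those of the Java producer client.
producer_config = {
    "batch.size": 16384,        # max bytes per in-progress partition batch
    "linger.ms": 5,             # wait up to 5 ms to fill a batch before sending
    "buffer.memory": 33554432,  # total memory shared by all batches (32 MB)
}

# Worst case: every partition currently has a full batch being built.
num_partitions = 1000
worst_case = num_partitions * producer_config["batch.size"]
print(worst_case // (1024 * 1024), "MB")  # 15 MB

# With enough partitions, batch buffers alone approach buffer.memory,
# at which point sends start blocking and throughput drops.
```

In other words, at 1000 partitions roughly half of the default `buffer.memory` can be consumed by batch buffers alone, before accounting for in-flight requests.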
• Documentation
• Kafka: The Definitive Guide: Real-Time Data and Stream
Processing at Scale
• Streaming Architecture: New Designs Using Apache
Kafka and MapR Streams
• Confluent Slack
• SoftwareMill Tech Blog on Kafka topics
• Kafka Visualization Tool
Start with Apache Kafka 2.0 — 24
Our Apache Kafka Experts
Grzegorz Kocur
Senior DevOps with hands-on experience
operating Kafka clusters - on-premise as
well as in cloud computing environments,
including Kubernetes
Krzysztof Atłasik
Seasoned developer with Kafka
experience, certified since 2021
Krzysztof Grajek
Senior Software Engineer, 15 years of
experience, working with PubSub-based
production systems for 5 years
Krzysztof Ciesielski
Senior Software Engineer, 15 years of
experience, working with Kafka-based
production systems for 5 years
Michał Matłoka
Software Architect, certified for Apache
Kafka in 2019. Over 10 years of software development
experience, including big data technologies
Adam Warski
Scala & Distributed Systems Expert, CTO at
SoftwareMill, OSS Developer. Specialises in
developing high-performance, clustered,
fault-tolerant software
Maria Wąchal
CMO and Technology Evangelist driven
by a passion for modern technology and
a commitment to making complex
concepts understandable