Understanding Apache Kafka
White Paper
Introduction
Apache Kafka is a hot technology amongst application
developers and architects looking to build the latest
generation of real-time and web-scale applications.
According to the official Apache Kafka website, “Kafka
is used for building real-time data pipelines and
streaming apps. It is horizontally scalable, fault-tolerant,
wicked fast, and runs in production in thousands of
companies.”
These newer technologies break through the scalability and performance limitations of traditional
solutions while meeting similar needs. Apache Kafka can also be compared to proprietary solutions
offered by the big cloud providers, such as AWS Kinesis, Google Cloud Dataflow and Azure Stream
Analytics.
The wealth of very popular options in this family of technologies is clear evidence of real and
widespread need. However, it may not be immediately obvious what role these technologies play in
an architecture. Why would I want to stick some other complicated thing in between the source of my
events and the consumers that use the events?
www.instaclustr.com
To illustrate this, consider an architecture where you initially have a web front end that captures new
customer details and some backend process that stores these details in a database. By putting a queue
in the middle and posting “new customer” events to that queue I can, without changing existing code,
do things like:
• add a new API application that accepts customer registrations from a new partner and posts
them to the queue; or
• add a new consumer application that registers the customer in a CRM system.
Instaclustr’s Kongo series of blog posts provides detailed examples of, and considerations for,
architecting an application this way.
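The decoupling described above can be sketched in a few lines of Python. The `Topic` class and the two subscriber functions here are illustrative stand-ins for a queue and its consumer applications, not real Kafka APIs:

```python
# Sketch of the decoupling a queue/topic provides: producers post
# "new customer" events to a topic, and every registered consumer
# processes every event. All names here are illustrative.

class Topic:
    def __init__(self):
        self.log = []          # append-only list of events
        self.consumers = []    # callbacks registered by consumer apps

    def subscribe(self, consumer):
        self.consumers.append(consumer)

    def post(self, event):
        self.log.append(event)
        for consumer in self.consumers:
            consumer(event)

db_rows = []            # stands in for the backend database
crm_registrations = []  # stands in for the CRM system

topic = Topic()
# Existing backend process: store customer details in the database.
topic.subscribe(lambda e: db_rows.append(e))
# Added later, without changing existing code: register the customer in a CRM.
topic.subscribe(lambda e: crm_registrations.append(e))

# Both the web front end and a new partner API post to the same topic.
topic.post({"name": "Alice", "source": "web"})
topic.post({"name": "Bob", "source": "partner-api"})
```

The point of the sketch is that new producers and new consumers attach to the topic without either side knowing about the other.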
These properties (and others) make Kafka suitable for additional architectural functions compared
to the broader family of queuing and streaming engines.
Looking under the hood
Let’s take a look at how Kafka achieves all this:
We’ll start with PRODUCERS - producers are the applications that generate events and publish
them to Kafka. Of course, they don’t randomly generate events - they create them based on
interactions with people, things or systems. For example, a mobile app could generate an event when
someone clicks on a button, an IoT device could generate an event when a reading occurs, or an API
application could generate an event when called by another application (in fact, it is likely an API
application would sit between a mobile app or IoT device and Kafka). These producer applications use
a Kafka producer library (similar in concept to a database driver) to send events to Kafka, with
libraries available for Java, C/C++, Python, Go and .NET.
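As a rough sketch of what a producer library does on the application’s behalf, consider the following; `MockProducer` and its `send()` method are hypothetical stand-ins for illustration, not the real client API:

```python
# Illustrative sketch of what a Kafka producer library does conceptually:
# serialize the key and value to bytes and hand a record off to the cluster.
# MockProducer is a stand-in; real clients also handle batching, retries
# and acknowledgements.
import json

class MockProducer:
    def __init__(self):
        self.sent = []  # records that would be sent to the brokers

    def send(self, topic, key, value):
        # Real client libraries serialize keys and values to bytes.
        record = (topic, key.encode("utf-8"), json.dumps(value).encode("utf-8"))
        self.sent.append(record)
        return record

producer = MockProducer()
# e.g. a backend API producing an event when a user clicks a button
producer.send("clicks", key="user-42", value={"button": "signup", "ts": 1700000000})
```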
The next component to understand is the CONSUMERS. Consumers are applications that read the
events from Kafka and perform some processing on them. Like producers, they can be written in
various languages using the Kafka client libraries.
The core of the system is the Kafka BROKERS. When people talk about a Kafka cluster, they are
typically talking about the cluster of brokers. The brokers receive events from the producers and
reliably store them so they can be read by consumers.
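Conceptually, the storage a broker provides for each partition behaves like an append-only log addressed by offset. A minimal sketch, with illustrative class and method names:

```python
# Sketch of how a broker stores events: each topic partition is an
# append-only log, and consumers read records back by offset.

class PartitionLog:
    def __init__(self):
        self._records = []

    def append(self, event):
        """Append an event and return the offset it was stored at."""
        self._records.append(event)
        return len(self._records) - 1

    def read(self, offset):
        """Return the event stored at the given offset."""
        return self._records[offset]

log = PartitionLog()
first = log.append("customer-created: alice")
second = log.append("customer-created: bob")
```

Because records are only ever appended, consumers at different offsets can read the same partition independently.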
The brokers are configured with TOPICS. Topics are a bit like tables in a database, separating different
types of data. Each topic is split into PARTITIONS. When an event is received, a record is appended
to the log file for the topic and partition that the event belongs to (as determined by the metadata
provided by the producer). Each of the partitions that make up a topic is allocated to one of the brokers
in the cluster. This allows the brokers to share the processing of a topic. When a topic is created, it can
be configured to be replicated multiple times across the cluster so that the data is still available
even if a server fails. For each partition, there is a single leader broker at any point in time that serves
all reads and writes. The leader is responsible for synchronising with the replicas. If the leader fails,
Kafka will automatically transfer leader responsibility for its partitions to one of the replicas.
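The mapping from an event to a partition can be sketched as a hash of the record key taken modulo the partition count. Note that Kafka’s default partitioner uses a murmur2 hash of the key bytes; the CRC32 below is a simplified stand-in:

```python
# Sketch of key-based partitioning: hash the record key and take it modulo
# the number of partitions. Kafka's default partitioner uses murmur2;
# CRC32 here is just an illustrative substitute.
import zlib

def partition_for(key: bytes, num_partitions: int) -> int:
    return zlib.crc32(key) % num_partitions

# The same key always lands in the same partition (which is what preserves
# per-key ordering), while different keys spread across the brokers.
p1 = partition_for(b"customer-42", 6)
p2 = partition_for(b"customer-42", 6)
```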
As well as reliability, this topic and partition scheme has implications for scalability. There can be as
many active brokers receiving and providing events as there are partitions in the topic so, provided
sufficient partitions are configured, Kafka clusters can be scaled out to provide increased processing
throughput.
In some instances, guaranteed ordering of message delivery is important so that events are consumed
in the same order they are produced. Kafka supports this guarantee within each partition of a topic. To
facilitate this, consumer applications are placed in CONSUMER GROUPS, and within a consumer group
each partition is assigned to only a single consumer instance.
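The consumer group rule (each partition served by exactly one consumer in the group) can be sketched as a simple round-robin assignment; Kafka itself ships several assignor strategies, so this is an illustrative simplification:

```python
# Sketch of consumer-group partition assignment: every partition of a topic
# is handed to exactly one consumer in the group. Round-robin here;
# Kafka offers range, round-robin and sticky assignors among others.

def assign(partitions, consumers):
    assignment = {c: [] for c in consumers}
    for i, p in enumerate(partitions):
        assignment[consumers[i % len(consumers)]].append(p)
    return assignment

groups = assign(partitions=[0, 1, 2, 3, 4, 5], consumers=["c1", "c2", "c3"])
```

Since no partition is shared between consumers in a group, each partition’s events are processed in order by a single instance, while the group as a whole processes the topic in parallel.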
The following diagram illustrates all these Kafka concepts and their relationships:
Operating Kafka
A Kafka cluster is a complex distributed system with many configuration properties and possible
interactions between components in the system. Operated well, Kafka can achieve the highest
levels of reliability even in relatively unreliable infrastructure environments such as the cloud.
At a high level, the principles for successfully operating Kafka are the same as other distributed server
systems:
• choose hardware and an operating system configuration appropriate for the characteristics
of the system;
• have a monitoring system in place, and understand and alert on the key metrics that indicate
the health of the system;
• have documented and tested procedures (or better yet, automated processes) for dealing with
failures; and
• consider, test and monitor security of your configuration.
Specifically for Kafka, you need to consider factors such as the appropriate choice of topics and
partitions, placement of brokers into racks aligned with failure domains, and the placement and
configuration of ZooKeeper. Our white paper on Ten Rules for Managing Kafka provides a great
primer on the key considerations; visit our Resources section to download it.
• Kafka, like Cassandra and Spark, is used when you need to build applications that support
the highest levels of reliability and scale. The three technologies are often used together in a
single application. The applications demand the same mission-critical levels of service from a
managed service provider.
• Kafka is Apache Software Foundation open source software with a massive user community - the
software is maintained under a robust governance model ensuring it is not overly influenced
by commercial interests and that users can freely use the software as they need to. There are
no licensing fees and no vendor lock-in.
• Kafka has many architectural similarities to Cassandra and Spark allowing us to leverage
our operational experience such as tuning and troubleshooting JVMs, dealing with public cloud
environments and their idiosyncrasies and operating according to SOC2 principles for a secure
and robust environment.
Customer Testimonials
We see Apache Kafka as a core capability for our architectural strategy as we scale our
business. Getting set up with Instaclustr’s Kafka service was easy and significantly accelerated
our timelines. Instaclustr consulting services were also instrumental in helping us understand
how to properly use Kafka in our architecture.
As very happy users of Instaclustr’s Cassandra and Spark managed services, we’re excited
about the new Apache Kafka managed service. Instaclustr quickly got us up and running with
Kafka and provided the support we needed throughout the process.
About
Instaclustr
Instaclustr delivers reliability at scale through our integrated data platform of open source
technologies such as Apache Cassandra, Apache Kafka, Apache Spark and Elassandra.
Our expertise stems from delivering almost 20 million node hours under management, allowing
us to run the world’s most powerful NoSQL distributed database effortlessly.
We provide a range of managed, consulting and support services to help our customers
develop and deploy solutions around open source technologies. Our integrated data platform,
built on open source technologies, powers mission-critical, highly available applications for our
customers and helps them achieve scalability, reliability and performance for their applications.