Apache Kafka Essentials

CONTENTS

∙ Introduction
∙ About Apache Kafka
∙ Quickstart for Apache Kafka
∙ Kafka Connect
∙ KStreams
∙ Extending Apache Kafka
∙ Conclusion
∙ Additional Resources
INTRODUCTION

Two trends have emerged in the information technology space. First, the diversity and velocity of the data that an enterprise wants to collect for decision-making continue to grow. Second, there is a growing need for an enterprise to make decisions in real time based on that collected data. For example, financial institutions want to not only detect fraud immediately but also offer a better banking experience through features like real-time alerting, real-time product recommendations, more effective customer service, and more.

Apache Kafka is a streaming engine for collecting, caching, and processing high volumes of data in real time. As illustrated in Figure 1, Kafka typically serves as part of a central data hub in which data within an enterprise is collected. The data can then be used for continuous processing or fed into other systems and applications in real time. Kafka is used by more than 40% of Fortune 500 companies across all industries.

The main benefits of Kafka are:

1. High throughput: Each server is capable of handling hundreds of MB of data per second.
2. High availability: Data can be stored redundantly in multiple servers and can survive individual server failures.
3. High scalability: New servers can be added over time to scale out the system.
4. Easy integration with external data sources or data sinks.
5. A built-in real-time processing layer.
ABOUT APACHE KAFKA

Figure 1: Apache Kafka as a central real-time hub

The core concepts in Kafka are:

Topic: Defines a logical name for producing and consuming records.
Partition: Defines a non-overlapping subset of the records within a topic.
Offset: A unique sequential number assigned to each record within a topic partition.
Record: A record contains a key, a value, a timestamp, and a list of headers.
Broker: A server where records are stored. Multiple brokers can be used to form a cluster.

Figure 2 depicts a topic with two partitions. Partition 0 has 5 records, with offsets from 0 to 4, and partition 1 has 4 records, with offsets from 0 to 3.

QUICKSTART FOR APACHE KAFKA

It's easy to get started with Kafka. The following are the steps to get Kafka running in your environment:

1. Download the latest Apache Kafka binary distribution from http://kafka.apache.org/downloads and untar it.
2. Start the ZooKeeper server.
3. Start the Kafka broker.
4. Create a topic (a programmatic sketch follows this list).
5. Produce and consume data.
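Steps 2 through 4 use the scripts shipped with the Kafka distribution. As a sketch of what step 4 can look like programmatically, the following uses the Java AdminClient to create the test topic used later in this card; the partition count and replication factor are assumptions:

import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateTopicExample {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        try (AdminClient admin = AdminClient.create(props)) {
            // One partition, replication factor 1 (assumed values for a local test).
            NewTopic topic = new NewTopic("test", 1, (short) 1);
            // Block until the broker has created the topic.
            admin.createTopics(Collections.singleton(topic)).all().get();
        }
    }
}

The same result can be achieved with the bin/kafka-topics.sh tool that ships with the distribution.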
props.put("value.serializer",
Records within a partition are always delivered to the consumer
in
"org.apache.kafka.common.serializ ation.
offset order. By saving the offset of the last consumed record
StringSerializer");
from each partition, the consumer can resume from where it left
Producer<String, String> producer = new off after a restart. In the example above, we use the
c o m m i t S y n c ( ) API to save the offsets explicitly after consuming
KafkaProducer<>(props
); a batch of records. One can also save the offsets automatically by
setting the property e n a b l e . a u t o . commit to true.
p r o d u c e r. s e n d ( A record in Kafka is not removed from the broker immediately
after
new ProducerRecord<String, String>("test", "key", it is consumed. Instead, it is retained according to a
"value"));
configured retention policy. The following table summarizes
the two common policies:
In the above example, both the key and value are strings, so we Retention Policy Meaning
are
The number of hours to keep a record
using a StringSerializer . It’s possible to customize the log.retention.hours
on the broker.
serializer when types become more complex.
The maximum size of records retained
The following code snippet shows how to consume records log.retention.bytes
in a partition.
with
string key and value in Java.
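As an illustration, the following sketch (not from the original card) implements the Serializer interface for a hypothetical User type and encodes it as JSON:

import java.nio.charset.StandardCharsets;
import org.apache.kafka.common.serialization.Serializer;

public class UserSerializer implements Serializer<UserSerializer.User> {

    // Hypothetical value type; any richer domain object would work the same way.
    public static class User {
        public final String name;
        public final int age;
        public User(String name, int age) { this.name = name; this.age = age; }
    }

    @Override
    public byte[] serialize(String topic, User user) {
        if (user == null) {
            return null;
        }
        // Encode the value as a small JSON string.
        String json = "{\"name\":\"" + user.name + "\",\"age\":" + user.age + "}";
        return json.getBytes(StandardCharsets.UTF_8);
    }
}

The producer would then reference the fully qualified class name of UserSerializer in its value.serializer property, with a matching Deserializer implemented for the consumer side.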
The following code snippet shows how to consume records with a string key and value in Java:

Properties props = new Properties();
props.put("bootstrap.servers", "localhost:9092");
props.put("group.id", "test-group");   // consumer group ID, required for subscribe()
props.put("key.deserializer",
    "org.apache.kafka.common.serialization.StringDeserializer");
props.put("value.deserializer",
    "org.apache.kafka.common.serialization.StringDeserializer");

KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
consumer.subscribe(Arrays.asList("test"));

while (true) {
    ConsumerRecords<String, String> records =
        consumer.poll(Duration.ofMillis(100));
    for (ConsumerRecord<String, String> record : records)
        System.out.println(record.key() + ": " + record.value());
    // Save the offsets explicitly after consuming a batch of records.
    consumer.commitSync();
}

Records within a partition are always delivered to the consumer in offset order. By saving the offset of the last consumed record from each partition, the consumer can resume from where it left off after a restart. In the example above, we use the commitSync() API to save the offsets explicitly after consuming a batch of records. One can also save the offsets automatically by setting the property enable.auto.commit to true.

A record in Kafka is not removed from the broker immediately after it is consumed. Instead, it is retained according to a configured retention policy. The two most common policies are:

log.retention.hours: The number of hours to keep a record on the broker.
log.retention.bytes: The maximum size of records retained in a partition.
KAFKA CONNECT

The second component in Kafka is Kafka Connect, which is a framework that makes it easy to stream data between Kafka and other systems. As shown in Figure 3, one can deploy a Connect cluster and run various connectors to import data from sources like MySQL, TIBCO Messaging, or Splunk into Kafka (Source Connectors) and to export data from Kafka (Sink Connectors) to systems such as HDFS, S3, and Elasticsearch.
"optional":false},
The following steps show how to run the existing file
connector
"payload":"hello"}
in standalone mode to copy the content from a source file
to a destination file via Kafka:
{"schema":{"type":"string",
1. Prepare some data in a source
file:
"optional":false},
> echo -e \"hello\nworld\" > test.txt
"payload":"world"}
2. Start a file source and a file sink
connector:
> bin/connect-standalone.sh In the example above, the data in the source file test.txt is
first
config/connect-file-source.properties streamed into a Kafka topic connect-test through a file
source connector. The records in connect-test are then
config/connect-file-sink.properties
streamed into
the destination file test.sink.txt . If a new line is added to
3. Verify the data in the destination file: test.
Kafka Connect also supports lightweight, per-record transformations. Adding the following lines to the file source connector configuration wraps each line into a map with the field line and adds a data_source field identifying the source connector:

transforms=MakeMap,InsertSource
transforms.MakeMap.type=org.apache.kafka.connect.transforms.HoistField$Value
transforms.MakeMap.field=line
transforms.InsertSource.type=org.apache.kafka.connect.transforms.InsertField$Value
transforms.InsertSource.static.field=data_source
transforms.InsertSource.static.value=test-file-source

With these transformations in place, the records in connect-test look like the following:

{"line":"hello","data_source":"test-file-source"}
{"line":"world","data_source":"test-file-source"}
KSTREAMS

The third component in Kafka is KStreams, a client library for processing records stored in Kafka in real time. The following snippet uses the KStreams DSL to count words from an input topic:

// build a stream from an input topic
KStream<String, String> source = builder.stream(
    "streams-plaintext-input",
    Consumed.with(stringSerde, stringSerde));

KTable<String, Long> counts = source
    .flatMapValues(value -> Arrays.asList(value.toLowerCase().split(" ")))
    .groupBy((key, value) -> value)
    .count();

The data in the output topic streams-wordcount-output can be verified with the console consumer:

> bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 \
    --topic streams-wordcount-output --from-beginning \
    --formatter kafka.tools.DefaultMessageFormatter
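The snippet above assumes a StreamsBuilder named builder and string serdes defined elsewhere, and it does not show how the counts reach the output topic. A minimal, self-contained version is sketched below; the application ID, the default serde settings, and the final to() step are assumptions consistent with the topics used in this example:

import java.util.Arrays;
import java.util.Properties;
import org.apache.kafka.common.serialization.Serde;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KTable;
import org.apache.kafka.streams.kstream.Produced;

public class WordCountApp {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "streams-wordcount");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        Serde<String> stringSerde = Serdes.String();
        StreamsBuilder builder = new StreamsBuilder();

        // build a stream from an input topic
        KStream<String, String> source = builder.stream(
            "streams-plaintext-input",
            Consumed.with(stringSerde, stringSerde));

        // split each line into lowercase words and count occurrences per word
        KTable<String, Long> counts = source
            .flatMapValues(value -> Arrays.asList(value.toLowerCase().split(" ")))
            .groupBy((key, value) -> value)
            .count();

        // write the running counts to the output topic
        counts.toStream().to("streams-wordcount-output",
            Produced.with(stringSerde, Serdes.Long()));

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
    }
}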
A Kafka topic can be viewed in KStreams as either a KStream or a KTable, and the two views aggregate differently. For example, suppose a topic contains two records with the same key, ("k1", 2) followed by ("k1", 5). If the topic is viewed as a KStream, both are treated as independent records, and thus the sum of the values is 7. On the other hand, if the topic is viewed as a KTable, the second record is treated as an update to the first record since they have the same key "k1". Therefore, only the second record is retained, and the sum is 5 instead.
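To make the two views concrete, the following sketch (not from the card; the topic name and serdes are assumptions) builds one topology that reads the topic as a KStream and sums the values per key, and another that reads it as a KTable:

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.Topology;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.Grouped;
import org.apache.kafka.streams.kstream.KTable;

public class StreamVsTableExample {

    // Viewed as a KStream: ("k1", 2) and ("k1", 5) are independent records,
    // so summing the values per key yields ("k1", 7).
    static Topology asStream() {
        StreamsBuilder builder = new StreamsBuilder();
        KTable<String, Integer> sums = builder
            .stream("input-topic", Consumed.with(Serdes.String(), Serdes.Integer()))
            .groupByKey(Grouped.with(Serdes.String(), Serdes.Integer()))
            .reduce(Integer::sum);
        return builder.build();
    }

    // Viewed as a KTable: ("k1", 5) is an update that overwrites ("k1", 2),
    // so the table retains only ("k1", 5).
    static Topology asTable() {
        StreamsBuilder builder = new StreamsBuilder();
        KTable<String, Integer> table = builder
            .table("input-topic", Consumed.with(Serdes.String(), Serdes.Integer()));
        return builder.build();
    }
}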
KSTREAMS DSL

COMMONLY USED OPERATIONS IN KGROUPEDSTREAM

count(): Count the number of records in this stream by the grouped key and return it as a KTable.
Example: kt = kgs.count();
  kgs: ("k1", (("k1", 1), ("k1", 3))) ("k2", (("k2", 2)))
  kt: ("k1", 2) ("k2", 1)

reduce(Reducer): Combine the values of records in this stream by the grouped key and return it as a KTable.
Example: kt = kgs.reduce((aggValue, newValue) -> aggValue + newValue);
  kgs: ("k1", (("k1", 1), ("k1", 3))) ("k2", (("k2", 2)))
  kt: ("k1", 4) ("k2", 2)

windowedBy(Windows): Further group the records by their timestamp and return it as a TimeWindowedKStream.
Example: twks = kgs.windowedBy(TimeWindows.of(100));
  kgs: ("k1", (("k1", 1, 100t), ("k1", 3, 150t))) ("k2", (("k2", 2, 100t), ("k2", 4, 250t))) (t indicates a timestamp)
  twks: ("k1", 100t -- 200t, (("k1", 1, 100t), ("k1", 3, 150t))) ("k2", 100t -- 200t, (("k2", 2, 100t))) ("k2", 200t -- 300t, (("k2", 4, 250t)))
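As a rough illustration of these operations, the sketch below builds a KGroupedStream and applies count(), reduce(), and windowedBy(); the topic name, the serdes, and the use of the newer Kafka 3.x TimeWindows API are assumptions rather than part of the card:

import java.time.Duration;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.Grouped;
import org.apache.kafka.streams.kstream.KGroupedStream;
import org.apache.kafka.streams.kstream.KTable;
import org.apache.kafka.streams.kstream.TimeWindows;
import org.apache.kafka.streams.kstream.Windowed;

public class GroupedStreamExamples {
    public static void main(String[] args) {
        StreamsBuilder builder = new StreamsBuilder();

        // Group the records of the input stream by key.
        KGroupedStream<String, Long> kgs = builder
            .stream("events", Consumed.with(Serdes.String(), Serdes.Long()))
            .groupByKey(Grouped.with(Serdes.String(), Serdes.Long()));

        // count(): number of records per key.
        KTable<String, Long> countsByKey = kgs.count();

        // reduce(Reducer): combine the values per key, here by summing them.
        KTable<String, Long> sumsByKey =
            kgs.reduce((aggValue, newValue) -> aggValue + newValue);

        // windowedBy(Windows): further group records into 100 ms windows before counting.
        KTable<Windowed<String>, Long> windowedCounts = kgs
            .windowedBy(TimeWindows.ofSizeWithNoGrace(Duration.ofMillis(100)))
            .count();
    }
}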
Aggregation results such as the word counts above are maintained as state in local state stores. Those states can be queried interactively through an API described in the Interactive Queries section of the Kafka documentation. This avoids the need for an external data store for exporting and serving those states.
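For instance, if the word-count aggregation were materialized under a store name such as "counts-store" (an assumed name, not from the card), the local state could be queried roughly as follows:

import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StoreQueryParameters;
import org.apache.kafka.streams.state.QueryableStoreTypes;
import org.apache.kafka.streams.state.ReadOnlyKeyValueStore;

public class InteractiveQueryExample {
    static Long lookupCount(KafkaStreams streams, String word) {
        // Obtain a read-only view of the local "counts-store" state store.
        ReadOnlyKeyValueStore<String, Long> store = streams.store(
            StoreQueryParameters.fromNameAndType(
                "counts-store", QueryableStoreTypes.keyValueStore()));
        // Return the current count for the given word, or null if absent.
        return store.get(word);
    }
}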
EXACTLY-ONCE PROCESSING IN KSTREAMS

Failures in the brokers or the clients may introduce duplicates during the processing of records. KStreams provides the capability of processing records exactly once, even under failures. This can be achieved by simply setting the property processing.guarantee to exactly_once in KStreams.
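For example, the property can be added to the Properties object built in the word-count sketch earlier; only the processing.guarantee line below comes from the card, while the other settings mirror that sketch:

props.put(StreamsConfig.APPLICATION_ID_CONFIG, "streams-wordcount");
props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
// Process each record exactly once, even if a broker or client fails.
props.put(StreamsConfig.PROCESSING_GUARANTEE_CONFIG, "exactly_once");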
EXTENDING APACHE KAFKA

While processing data directly in an application via KStreams is functional, many client applications are looking for a lighter-weight interface to Apache Kafka Streams via low-code environments or continuous queries using SQL-like commands. In many cases, developers looking to leverage continuous-query functionality want a low-code environment where stream processing can be dynamically accessed, modified, and scaled in real time.

Accessing Apache Kafka data streams via SQL is just one approach to low-code stream processing, and many commercial vendors (TIBCO Software, Confluent) as well as open-source solutions (Apache Spark) offer SQL access to unlock Apache Kafka data streams for stream processing.

Apache Kafka is a data distribution platform; it's what you do with the data that is important. Once data is available via Kafka, it can be distributed to many different processing engines, from integration services, event streaming, and AI/ML functions to data analytics. For more information, including SQL access to Apache Kafka, please see the Additional Resources section below.

ONE SIZE DOES NOT FIT ALL

With the increasing popularity of real-time stream processing and the rise of event-driven architectures, a number of alternatives have started to gain traction for real-time data distribution. Apache Kafka is the flavor of choice for distributed, high-volume data streaming; however, many implementations have begun to struggle with building solutions at scale when the application's requirements go beyond a single data center or single location. So, while Apache Kafka is purpose-built for real-time data distribution and stream processing, it will not fit all the requirements of every enterprise application. Alternatives like Apache Pulsar, Eclipse Mosquitto, and many others may be worth evaluating. For more information on these alternatives and other data distribution solutions, please see the Additional Resources section below.
CONCLUSION

Apache Kafka has become the de facto standard for high-performance, distributed data streaming. It has a large and growing community of developers, corporations, and applications that are supporting, maintaining, and leveraging it. If you are building an event-driven architecture or looking for a way to stream data in real time, Apache Kafka is a clear leader in providing a proven, robust platform for enabling stream processing and enterprise communications.

ADDITIONAL RESOURCES

∙ Article on real-time stock processing with Apache NiFi and Apache Kafka
∙ Apache Kafka Summit website
∙ Apache Kafka Mirroring and Replication
∙ Apache Pulsar vs. Apache Kafka O'Reilly eBook
∙ Apache Pulsar website
With over 20 years of experience building, architecting, and designing large-scale messaging infrastructure, William McLane is one of the thought leaders for global data distribution. William and TIBCO have a history of building mission-critical, real-world data distribution architectures that power everything from some of the largest financial services institutions to global-scale transportation and logistics tracking operations. From pub/sub, to point-to-point, to real-time data streaming, William has experience designing, building, and leveraging the right tools for building a nervous system that can connect, augment, and unify your enterprise.