
Module 11 – Kafka

After completing this module, the student should be able to produce and consume Kafka messages and be able to describe:
• What is Kafka
• Kafka core concepts, including:
• What are: Topics, Producers, Brokers, and Consumers
• How to install/stop/start Kafka
• How to create a new Topic
• How to consume messages from a Topic

http://spark.apache.org
https://www.youtube.com/watch?v=VWeWViFCzzg&list=PLTPXxbhUt-YWSgAUhrnkyphnh0oKIT8-j

Table Of Contents
What is Kafka?
Kafka vs Flume
Kafka Use Cases and Comparisons
Terminology – Producers, Brokers, Consumers
Terminology – Topics and Partitions (1 of 2)
Terminology – Topics and Partitions (2 of 2)
Lab01: Start Kafka
Generic ‘Create Topic’ Example
Lab02: Create Topic
Lab03: Create Producer
Lab04: Consuming messages
Lab06: To stop Kafka
In Review – Kafka

What is Kafka?
Apache Kafka is publish-subscribe messaging rethought as a distributed commit log.

Fast
A single Kafka broker can handle hundreds of megabytes of reads and writes per
second from thousands of clients.

Scalable
Kafka is designed to allow a single cluster to serve as the central data backbone for a
large organization. It can be elastically and transparently expanded without downtime.
Data streams are partitioned and spread over a cluster of machines to allow data
streams larger than the capability of any single machine and to allow clusters of
coordinated consumers.

Durable
Messages are persisted on disk and replicated within the cluster to prevent data loss.
Each broker can handle terabytes of messages without performance impact.

Distributed by Design
Kafka has a modern cluster-centric design that offers strong durability and fault-tolerance guarantees.

What is Kafka?

• Is a publish-subscribe messaging system that handles real-time data feeds


• Must haves:
• High throughput to support feeds
• Support real-time processing to create new derived feeds
• Support low-latency delivery to handle traditional message use cases
• Fault-tolerant
• Originated at LinkedIn and implemented in Scala
• Used for real-time monitoring, website tracking, log collection
• Why is Kafka so fast?
• While Kafka persists all data to disk, essentially all writes go to the
page cache (RAM)
• Efficient data transfer from page cache to network socket
• Up to 2 million writes/second on 3 commodity machines
• Used with Storm and Spark Streaming

Kafka vs Flume

• Kafka is very much a general-purpose system. You can have many producers
and many consumers sharing multiple topics. In contrast, Flume is a special-
purpose tool designed to send data to HDFS and HBase. It has specific
optimizations for HDFS and it integrates with Hadoop’s security. As a result,
Cloudera recommends using Kafka if the data will be consumed by multiple
applications, and Flume if the data is designated for Hadoop.

• Those of you familiar with Flume know that Flume has many built-in sources and
sinks. Kafka, however, has a significantly smaller producer and consumer
ecosystem, and it is not well supported by the Kafka community. Hopefully this
situation will improve in the future, but for now: Use Kafka if you are prepared to
code your own producers and consumers. Use Flume if the existing Flume
sources and sinks match your requirements and you prefer a system that can be
set up without any development.

• Flume can process data in-flight using interceptors. These can be very useful for
data masking or filtering. Kafka requires an external stream processing system
for that.

• Both Kafka and Flume are reliable systems that with proper configuration can
guarantee zero data loss. However, Flume does not replicate events. As a result,
even when using the reliable file channel, if a node with a Flume agent crashes,
you will lose access to the events in the channel until you recover the disks. Use
Kafka if you need an ingest pipeline with very high availability.

• Flume and Kafka can work quite well together. If your design requires streaming
data from Kafka to Hadoop, using a Flume agent with Kafka source to read the
data makes sense: You don’t have to implement your own consumer, you get all
the benefits of Flume’s integration with HDFS and HBase, you have Cloudera
Manager monitoring the consumer and you can even add an interceptor and do
some stream processing on the way.

Kafka vs Flume

• Use Kafka if the data will be consumed by multiple applications, and Flume if the data is designated for Hadoop
• Use Kafka if you are prepared to code your own producers and
consumers. Use Flume if the existing Flume sources and sinks match
your requirements and you prefer a system that can be set up without
any development
• Flume can process data in-flight using interceptors. These can be very
useful for data masking or filtering. Kafka requires an external stream
processing system for that
• Flume does not replicate events. As a result, even when using the
reliable file channel, if a node with a Flume agent crashes, you will lose
access to the events in the channel until you recover the disks. Use
Kafka if you need an ingest pipeline with very high availability
• Flume and Kafka can work quite well together. If your design requires
streaming data from Kafka to Hadoop, using a Flume agent with Kafka
source to read the data makes sense

Kafka Use Cases and Comparisons

Here is a description of a few of the popular use cases for Apache Kafka.

Messaging
Kafka works well as a replacement for a more traditional message broker. Message brokers are
used for a variety of reasons (to decouple processing from data producers, to buffer
unprocessed messages, etc). In comparison to most messaging systems Kafka has better
throughput, built-in partitioning, replication, and fault-tolerance which makes it a good solution
for large scale message processing applications.

In our experience messaging uses are often comparatively low-throughput, but may require low
end-to-end latency and often depend on the strong durability guarantees Kafka provides.

Website Activity Tracking


The original use case for Kafka was to be able to rebuild a user activity tracking pipeline as a
set of real-time publish-subscribe feeds. This means site activity (page views, searches, or other
actions users may take) is published to central topics with one topic per activity type. These
feeds are available for subscription for a range of use cases including real-time processing, real-
time monitoring, and loading into Hadoop or offline data warehousing systems for offline
processing and reporting.

Metrics
Kafka is often used for operational monitoring data pipelines. This involves aggregating statistics
from distributed applications to produce centralized feeds of operational data.

Log Aggregation
Many people use Kafka as a replacement for a log aggregation solution. Log aggregation
typically collects physical log files off servers and puts them in a central place (a file server or
HDFS perhaps) for processing. Kafka abstracts away the details of files and gives a cleaner
abstraction of log or event data as a stream of messages. This allows for lower-latency
processing and easier support for multiple data sources and distributed data consumption. In
comparison to log-centric systems like Scribe or Flume, Kafka offers equally good performance,
stronger durability guarantees due to replication, and much lower end-to-end latency.

Stream Processing
Many users end up doing stage-wise processing of data where data is consumed from topics of
raw data and then aggregated, enriched, or otherwise transformed into new Kafka topics for
further consumption. For example a processing flow for article recommendation might crawl
article content from RSS feeds and publish it to an "articles" topic; further processing might help
normalize or deduplicate this content to a topic of cleaned article content; a final stage might
attempt to match this content to users. This creates a graph of real-time data flow out of the
individual topics. The Storm framework is one popular way for implementing some of these
transformations.

Kafka Use Cases and Comparisons

• Kafka Use Cases


• Website activity tracking, Metrics (monitoring), stream processing
• Many people use Kafka as a replacement for a log aggregation solution.
Log aggregation typically collects physical log files off servers and
puts them in a central place (a file server or HDFS perhaps) for
processing. Kafka abstracts away the details of files and gives a
cleaner abstraction of log or event data as a stream of messages. This
allows for lower-latency processing and easier support for multiple data
sources and distributed data consumption
• Kafka comparisons
• In comparison to log-centric systems like Scribe or Flume, Kafka offers
equally good performance, stronger durability guarantees due to
replication, and much lower end-to-end latency (Flume pushes data;
Kafka consumers pull data from the Broker)
• In comparison to most messaging systems Kafka has better
throughput, built-in partitioning, replication, and fault-tolerance which
makes it a good solution for large scale message processing
applications

Terminology – Producers, Brokers, Consumers
In Kafka, a Topic is a user-defined category to which messages are published.

Producers are applications that create Messages and publish them to the Kafka broker
for further consumption.

In the bigger picture, Kafka Producers publish messages to one or more topics
and Consumers subscribe to topics and process the published messages. So, at a high
level, producers send messages over the network to the Kafka cluster which in turn
serves them up to consumers. Finally, a Kafka cluster consists of one or more servers,
called Brokers that manage the persistence and replication of message data (i.e. the
commit log).

Terminology – Producers, Brokers, Consumers
• Producers write data (publish message to partition within Topic) to Brokers
• Brokers manage persistence (read/writes) and replication of message data
• Consumers read data (subscribe to a Topic) from Brokers by polling for messages
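
A minimal way to see this flow end to end is with the console clients that ship with Kafka. This is a sketch, not part of the lab: the topic name 'test' is a placeholder, and the broker port is 6667 on the HDP sandbox (a default Apache Kafka install listens on 9092).

bin/kafka-console-producer.sh --broker-list sandbox:6667 --topic test
    (each line you type is published to the Brokers as one message)

bin/kafka-console-consumer.sh --zookeeper localhost:2181 --topic test --from-beginning
    (run in a second terminal; it subscribes to the Topic and prints every message, oldest first)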

Terminology – Topics and Partitions (1 of 2)

• The data is stored in Topics (feed name) to which messages are published
• Topics are split into Partitions which are replicated
• In Kafka, Topics consist of one or more Partitions that are ordered, immutable
sequences of messages. Since writes to a partition are sequential, this design
greatly reduces the number of hard disk seeks
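
To see how an existing Topic is split into Partitions on a running cluster, the bundled kafka-topics.sh script can describe it; the output lists the leader, replica set, and in-sync replicas for each partition. The topic name below is a placeholder.

bin/kafka-topics.sh --describe --zookeeper localhost:2181 --topic my-topic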

Terminology – Topics and Partitions (2 of 2)

• Each message in the Topic's Partitions is assigned a sequential Offset (unique ID)
• After a configurable amount of time (Default = 7 days), the published message
is discarded to free up space
• Consumers keep track of which messages have been consumed via Offset

[Diagram: messages in a partition, ordered by Offset from oldest ("Old") to newest ("New")]
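
The retention period can also be overridden per Topic rather than relying on the 7-day default. A minimal sketch, assuming a Kafka version that supports topic-level configuration overrides via kafka-topics.sh --alter (the topic name and the 1-hour value are placeholders):

bin/kafka-topics.sh --alter --zookeeper localhost:2181 --topic my-topic --config retention.ms=3600000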

Lab01: Start Kafka

cd /usr/hdp/2.2.0.0-2041/kafka/bin

kafka status -- check whether Kafka is running
kafka start  -- if not running, start Kafka
jps          -- to confirm it started

kafka stop   -- to stop Kafka (don't run this yet)
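
For reference only: on a plain Apache Kafka download (outside the HDP sandbox) there is no wrapper script, and the start/stop steps typically use the stock scripts in the distribution's bin directory. This is a sketch assuming the default config file layout.

bin/zookeeper-server-start.sh config/zookeeper.properties  -- start ZooKeeper first
bin/kafka-server-start.sh config/server.properties         -- then start the Kafka broker
bin/kafka-server-stop.sh                                    -- stop the broker
bin/zookeeper-server-stop.sh                                -- then stop ZooKeeper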

Generic 'Create Topic' example

• The Replication Factor controls how many servers will replicate each message
that is written. If you have a replication factor of 3 then up to 2 servers can fail
before you will lose access to your data. We recommend you use a replication
factor of 2 or 3 so that you can transparently bounce machines without
interrupting data consumption
• The Partition count controls how many logs the topic will be sharded into.
There are several impacts of the partition count. First, each partition must fit
entirely on a single server. So if you have 20 partitions, the full data set (and
read and write load) will be handled by no more than 20 servers (not counting
replicas). Finally, the partition count impacts the maximum parallelism of your
consumers
• The Configurations added (--config) on the command line override the default
settings the server has for things like length of time data should be retained

http://kafka.apache.org/documentation.html#introduction
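
The command itself does not survive in this text version of the slide. A generic sketch of it is shown below; the topic name, the counts, and the retention value are placeholders to adjust for your cluster, and the --config override assumes a Kafka version with topic-level settings:

bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 3 --partitions 20 --config retention.ms=86400000 --topic my-topic

With --replication-factor 3, up to 2 brokers can fail without losing access to the topic, and the 20 partitions cap consumer parallelism at 20, as described above.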

Lab02: Create Topic

cd /usr/hdp/2.2.0.0-2041/kafka

Execute this command to create the Topic 'truckevent':

bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic truckevent

Check if topic 'truckevent' was created successfully:

bin/kafka-topics.sh --list --zookeeper localhost:2181

Describe the topic 'truckevent':

bin/kafka-topics.sh --zookeeper localhost:2181 --describe --topic truckevent

Lab03: Create Producer
To start the Kafka Producer, we execute the following command and watch the messages it prints to the console.

We have now successfully compiled and had the Kafka producer publish some
messages to the Kafka cluster.

Lab03: Create Producer

We will use a custom Java application that runs the Producer and streams in
messages about the trucks' current driving patterns. Note: if the command below
fails, you'll have to rebuild the project with Maven (see next slides).

cd /opt/TruckEvents/Tutorials-master

To start the Producer:

java -cp target/Tutorial-1.0-SNAPSHOT.jar com.hortonworks.tutorials.tutorial1.TruckEventsProducer sandbox:6667 sandbox:2181 &
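
If the Java producer cannot be built, test messages can also be published by hand with the console producer bundled with Kafka. This is a sketch, not part of the lab; the path and the sandbox's 6667 broker port match this environment:

/usr/hdp/2.2.0.0-2041/kafka/bin/kafka-console-producer.sh --broker-list sandbox:6667 --topic truckevent

Each line typed at the prompt is published as one message; press Ctrl+C to exit.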

Lab04: Consuming messages
Now let’s consume the messages by viewing them in the console.

Lab04: Consuming messages

Open a new Hadoop PuTTY prompt, paste the command below, then press Enter:

/usr/hdp/2.2.0.0-2041/kafka/bin/kafka-console-consumer.sh --zookeeper localhost:2181 --topic truckevent --from-beginning

Success. The client is reading the messages from the queue.
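
The producer from Lab03 was started in the background with '&'. When you have finished watching the consumer, the producer can be stopped with standard shell job control (a sketch; the job number may differ):

jobs      -- list background jobs and their numbers
kill %1   -- stop the producer job (or fg %1, then press Ctrl+C)

Press Ctrl+C in the consumer window to stop the console consumer itself.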

Lab06: To stop Kafka
If you wanted to stop the Kafka service, you would use these commands

Lab06: To stop Kafka

Do NOT do this lab. You will need Kafka for the next Module!

If you wanted to stop Kafka, you would do the following:
Close any PuTTY prompts, then open a new PuTTY prompt

First delete the 'truckevent' topic we created earlier, then stop Kafka:

cd /usr/hdp/2.2.0.0-2041/kafka/bin

./kafka-run-class.sh kafka.admin.DeleteTopicCommand --zookeeper localhost:2181 --topic truckevent

kafka stop
kafka status
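
For reference: on newer Kafka releases the DeleteTopicCommand class was removed, and the same cleanup is done through kafka-topics.sh. A sketch, assuming delete.topic.enable=true is set on the broker:

./kafka-topics.sh --delete --zookeeper localhost:2181 --topic truckevent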

In Review – Kafka

After completing this module, the student should be able to produce and consume Kafka messages and be able to describe:
• What is Kafka
• Kafka core concepts, including:
• What are: Topics, Producers, Brokers, and Consumers
• How to install/stop/start Kafka
• How to create a new Topic
• How to consume messages from a Topic
