HD Mod011 Kafka
Module 11 – Kafka
http://spark.apache.org
https://www.youtube.com/watch?v=VWeWViFCzzg&list=PLTPXxbhUt-YWSgAUhrnkyphnh0oKIT8-j
Table of Contents
What is Kafka?
Kafka vs Flume
Kafka Use Cases and Comparisons
Terminology – Producers, Brokers, Consumers
Terminology – Topics and Partitions (1 of 2)
Terminology – Topics and Partitions (2 of 2)
Lab01: Start Kafka
Generic ‘Create Topic’ Example
Lab02: Create Topic
Lab03: Create Producer
Lab04: Consuming messages
Lab06: To stop Kafka
In Review – Kafka
What is Kafka?
Apache Kafka is publish-subscribe messaging rethought as a distributed commit log.
Fast
A single Kafka broker can handle hundreds of megabytes of reads and writes per
second from thousands of clients.
Scalable
Kafka is designed to allow a single cluster to serve as the central data backbone for a
large organization. It can be elastically and transparently expanded without downtime.
Data streams are partitioned and spread over a cluster of machines to allow data
streams larger than the capability of any single machine and to allow clusters of
coordinated consumers.
Durable
Messages are persisted on disk and replicated within the cluster to prevent data loss.
Each broker can handle terabytes of messages without performance impact.
Distributed by Design
Kafka has a modern cluster-centric design that offers strong durability and fault-tolerance guarantees.
Kafka vs Flume
• Kafka is very much a general-purpose system. You can have many producers
and many consumers sharing multiple topics. In contrast, Flume is a special-
purpose tool designed to send data to HDFS and HBase. It has specific
optimizations for HDFS and it integrates with Hadoop’s security. As a result,
Cloudera recommends using Kafka if the data will be consumed by multiple
applications, and Flume if the data is designated for Hadoop.
• Those of you familiar with Flume know that Flume has many built-in sources and
sinks. Kafka, however, has a significantly smaller producer and consumer
ecosystem, and it is not well supported by the Kafka community. Hopefully this
situation will improve in the future, but for now: Use Kafka if you are prepared to
code your own producers and consumers. Use Flume if the existing Flume
sources and sinks match your requirements and you prefer a system that can be
set up without any development.
• Flume can process data in-flight using interceptors. These can be very useful for
data masking or filtering. Kafka requires an external stream processing system
for that.
• Both Kafka and Flume are reliable systems that, with proper configuration, can
guarantee zero data loss. However, Flume does not replicate events. As a result,
even when using the reliable file channel, if a node running a Flume agent crashes,
you will lose access to the events in the channel until you recover the disks. Use
Kafka if you need an ingest pipeline with very high availability.
• Flume and Kafka can work quite well together. If your design requires streaming
data from Kafka to Hadoop, using a Flume agent with a Kafka source to read the
data makes sense: you don’t have to implement your own consumer, you get all
the benefits of Flume’s integration with HDFS and HBase, you have Cloudera
Manager monitoring the consumer, and you can even add an interceptor and do
some stream processing on the way (a configuration sketch follows).
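A minimal sketch of that pattern: a Flume agent that reads from a Kafka topic and writes to HDFS. The property names follow the Flume Kafka source; the agent name, topic, and HDFS path here are illustrative, not from the labs.

agent1.sources = kafka-source
agent1.channels = mem-channel
agent1.sinks = hdfs-sink

# Kafka source: consumes the 'truckevent' topic via ZooKeeper
agent1.sources.kafka-source.type = org.apache.flume.source.kafka.KafkaSource
agent1.sources.kafka-source.zookeeperConnect = localhost:2181
agent1.sources.kafka-source.topic = truckevent
agent1.sources.kafka-source.channels = mem-channel

# In-memory channel buffering events between source and sink
agent1.channels.mem-channel.type = memory

# HDFS sink: lands the events in a Hadoop directory
agent1.sinks.hdfs-sink.type = hdfs
agent1.sinks.hdfs-sink.hdfs.path = /tmp/kafka/truckevent
agent1.sinks.hdfs-sink.channel = mem-channel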
Kafka Use Cases and Comparisons
Here is a description of a few of the popular use cases for Apache Kafka.
Messaging
Kafka works well as a replacement for a more traditional message broker. Message brokers are
used for a variety of reasons (to decouple processing from data producers, to buffer
unprocessed messages, etc.). In comparison to most messaging systems, Kafka has better
throughput, built-in partitioning, replication, and fault tolerance, which makes it a good solution
for large-scale message processing applications.
In our experience, messaging uses are often comparatively low-throughput, but they may require low
end-to-end latency and often depend on the strong durability guarantees Kafka provides.
Metrics
Kafka is often used for operational monitoring data pipelines. This involves aggregating statistics
from distributed applications to produce centralized feeds of operational data.
Log Aggregation
Many people use Kafka as a replacement for a log aggregation solution. Log aggregation
typically collects physical log files off servers and puts them in a central place (a file server or
HDFS perhaps) for processing. Kafka abstracts away the details of files and gives a cleaner
abstraction of log or event data as a stream of messages. This allows for lower-latency
processing and easier support for multiple data sources and distributed data consumption. In
comparison to log-centric systems like Scribe or Flume, Kafka offers equally good performance,
stronger durability guarantees due to replication, and much lower end-to-end latency.
Stream Processing
Many users end up doing stage-wise processing of data, where data is consumed from topics of
raw data and then aggregated, enriched, or otherwise transformed into new Kafka topics for
further consumption. For example, a processing flow for article recommendation might crawl
article content from RSS feeds and publish it to an "articles" topic; further processing might
normalize or deduplicate this content into a topic of cleaned article content; a final stage might
attempt to match this content to users. This creates a graph of real-time data flow out of the
individual topics. The Storm framework is one popular way of implementing some of these
transformations.
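As a toy illustration of one such stage, the console clients that ship with Kafka can be chained in a shell pipeline: consume raw events, transform them in flight, and publish them to a cleaned topic. This sketch assumes 'articles' and 'articles-clean' topics already exist and a broker is listening at localhost:6667.

bin/kafka-console-consumer.sh --zookeeper localhost:2181 --topic articles | tr '[:upper:]' '[:lower:]' | bin/kafka-console-producer.sh --broker-list localhost:6667 --topic articles-clean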
Terminology – Producers, Brokers, Consumers
In Kafka, a Topic is a user-defined category to which messages are published.
Producers are applications that create Messages and publish them to the Kafka broker
for further consumption.
In the bigger picture, Kafka Producers publish messages to one or more topics
and Consumers subscribe to topics and process the published messages. So, at a high
level, producers send messages over the network to the Kafka cluster which in turn
serves them up to consumers. Finally, a Kafka cluster consists of one or more servers,
called Brokers that manage the persistence and replication of message data (i.e. the
commit log).
• Producers write data to Brokers (publish messages to a partition within a Topic)
• Brokers manage persistence (reads/writes) and replication of message data
• Consumers read data from Brokers (subscribe to a Topic and poll for messages; see the example below)
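A minimal end-to-end illustration of these three roles, using the console clients that ship with Kafka. This assumes a broker at sandbox:6667, ZooKeeper at localhost:2181, and an existing topic named 'test' (all illustrative values).

echo "Hello Kafka" | bin/kafka-console-producer.sh --broker-list sandbox:6667 --topic test
bin/kafka-console-consumer.sh --zookeeper localhost:2181 --topic test --from-beginning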
Terminology – Topics and Partitions (1 of 2)
• The data is stored in Topics (feed names) to which messages are published
• Topics are split into Partitions, which are replicated across brokers
• In Kafka, Topics consist of one or more Partitions that are ordered, immutable
sequences of messages. Since writes to a partition are sequential, this design
greatly reduces the number of hard disk seeks (see the --describe example below)
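To see how a topic is divided into partitions and where each partition's replicas live, kafka-topics.sh can describe it. This quick check assumes a topic named 'test' already exists:

bin/kafka-topics.sh --describe --zookeeper localhost:2181 --topic test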
Terminology – Topics and Partitions (2 of 2)
[Slide diagram (not reproduced): a partition as an ordered message log, with old messages at one end and new messages appended at the other]
Lab01: Start Kafka
cd /usr/hdp/2.2.0.0-2041/kafka/bin
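The slide itself does not show the start command; on the HDP 2.2 sandbox, the wrapper script in this bin directory (the same one Lab06 uses with 'stop' and 'status') is presumably how the service is started and checked. A hedged sketch:

./kafka start
./kafka status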
Generic ‘Create Topic’ Example
• The Replication Factor controls how many servers will replicate each message
that is written. If you have a replication factor of 3, then up to 2 servers can fail
before you lose access to your data. We recommend a replication factor of 2 or 3
so that you can transparently bounce machines without interrupting data
consumption
• The Partition count controls how many logs the topic will be sharded into.
The partition count has several impacts. First, each partition must fit
entirely on a single server, so if you have 20 partitions, the full data set (and
read and write load) will be handled by no more than 20 servers (not counting
replicas). Finally, the partition count caps the maximum parallelism of your
consumers
• Configurations added on the command line (--config) override the default
settings the server has, for things like the length of time data should be retained
http://kafka.apache.org/documentation.html#introduction
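Putting those options together, a generic create-topic command looks like the following. The topic name and values here are illustrative, not from the lab; retention.ms is a topic-level override for how long messages are kept.

bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 3 --partitions 20 --topic my-topic --config retention.ms=172800000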
Lab02: Create Topic
cd /usr/hdp/2.2.0.0-2041/kafka
Execute this command to create the Topic 'truckevent':
bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic truckevent
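To confirm the topic was created, you can list all topics (a quick check, not part of the original lab steps):

bin/kafka-topics.sh --list --zookeeper localhost:2181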
Lab03: Create Producer
To start the Kafka Producer, execute the command shown below. Once it is running,
the producer has been successfully compiled and is publishing messages to the Kafka
cluster.
We will use a custom Java program that runs the Producer and streams in
messages about the trucks' current driving patterns. Note: if the command below
fails, you will have to use Maven to rebuild a clean package (see next slides)
cd /opt/TruckEvents/Tutorials-master
To start the Producer:
java -cp target/Tutorial-1.0-SNAPSHOT.jar com.hortonworks.tutorials.tutorial1.TruckEventsProducer sandbox:6667 sandbox:2181 &
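The two trailing arguments are the broker list (sandbox:6667, the default Kafka broker port on HDP) and the ZooKeeper connect string (sandbox:2181). If you only want to publish a few test messages by hand instead of running the Java producer, the console producer is an alternative (a sketch, assuming the same broker and topic):

bin/kafka-console-producer.sh --broker-list sandbox:6667 --topic truckevent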
Lab04: Consuming messages
Now let’s consume the messages by viewing them in the console.
Open a new PuTTY session to the Hadoop sandbox, paste the command below, then press Enter:
/usr/hdp/2.2.0.0-2041/kafka/bin/kafka-console-consumer.sh --zookeeper localhost:2181 --topic truckevent --from-beginning
Success! The client is now reading the messages from the topic
Lab06: To stop Kafka
If you wanted to stop the Kafka service, you would use these commands
First delete the 'truckevent' topic we created earlier, then stop Kafka:
cd /usr/hdp/2.2.0.0-2041/kafka/bin
./kafka-run-class.sh kafka.admin.DeleteTopicCommand --zookeeper localhost:2181 --topic truckevent
./kafka stop
./kafka status
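On later Kafka releases, topic deletion is done through kafka-topics.sh instead; the equivalent command (shown here only for reference, not part of this lab) would be:

bin/kafka-topics.sh --delete --zookeeper localhost:2181 --topic truckevent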
In Review – Kafka