
Kafka

Kafka is a publish-subscribe messaging system that allows data from many sources to be streamed in real-time to a cluster for processing. Kafka servers store incoming messages from publishers in topics, and consumers subscribe to topics to receive the streaming data. Kafka can scale horizontally by distributing processes and storage across multiple servers, and consumers can also be distributed so that messages are load balanced among consumer groups.

Uploaded by

Nouhaila

STREAMING WITH KAFKA
Publish/Subscribe Messaging with Kafka
What is streaming?

■ So far we’ve really just talked about processing historical, existing big data
– Sitting on HDFS
– Sitting in a database
■ But how does new data get into your cluster? Especially if it’s “Big data”?
– New log entries from your web servers
– New sensor data from your IoT system
– New stock trades
■ Streaming lets you publish this data, in real time, to your cluster.
– And you can even process it in real time as it comes in!
Two problems

■ How to get data from many different sources flowing into your cluster
■ Processing it when it gets there
■ We’ll start with the first problem
Enter Kafka

■ Kafka is a general-purpose publish/subscribe messaging system
■ Kafka servers store all incoming messages from publishers for some period of
time, and publish them to a stream of data called a topic
■ Kafka consumers subscribe to one or more topics, and receive data as it’s
published
■ A stream / topic can have many different consumers, each with its own
position in the stream
■ It’s not just for Hadoop
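
The idea of a topic with many consumers, each keeping its own position, can be sketched with a toy in-memory model. This is not Kafka's API: the `Topic` and `Consumer` classes and the names `weblogs`, `dashboard`, and `archiver` are hypothetical, made up for illustration, and the sketch ignores retention periods and networking entirely.

```python
from dataclasses import dataclass, field


@dataclass
class Topic:
    """Toy stand-in for a Kafka topic: an append-only list of messages."""
    name: str
    log: list = field(default_factory=list)

    def publish(self, message):
        # Publishers only ever append; stored messages are never modified.
        self.log.append(message)


@dataclass
class Consumer:
    """Toy consumer that remembers its own position (offset) in a topic."""
    topic: Topic
    offset: int = 0

    def poll(self):
        # Return everything published since this consumer's last poll,
        # then advance this consumer's position to the end of the log.
        new_messages = self.topic.log[self.offset:]
        self.offset = len(self.topic.log)
        return new_messages


# Hypothetical example: two independent consumers of one topic.
weblogs = Topic("weblogs")
dashboard = Consumer(weblogs)
archiver = Consumer(weblogs)

weblogs.publish("GET /index.html")
weblogs.publish("GET /about.html")
first_batch = dashboard.poll()     # dashboard reads both messages

weblogs.publish("POST /login")
second_batch = dashboard.poll()    # dashboard sees only the new message
archiver_batch = archiver.poll()   # archiver has not read yet, so it gets all three
```

Because the topic keeps the messages and each consumer only tracks an offset, a slow or late consumer does not affect what the others see.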
Kafka architecture
[Diagram: a central Kafka Cluster connected to four kinds of clients: Producers (apps), Consumers (apps), Connectors (databases), and Stream Processors (apps)]
How Kafka scales

■ Kafka itself may be distributed among many processes on many servers
– Will distribute the storage of stream data as well
■ Consumers may also be distributed
– Consumers of the same group will have messages distributed amongst them
– Consumers of different groups will get their own copy of each message

Image: kafka.apache.org
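
The consumer group behaviour above can be illustrated with a small toy function. It is only a sketch under simplifying assumptions: the `deliver` function and the group and consumer names are hypothetical, and messages are handed out round-robin one at a time, whereas real Kafka spreads load by assigning partitions to group members.

```python
from collections import defaultdict
from itertools import cycle


def deliver(messages, groups):
    """Deliver every message once per group, spread across that group's members.

    `groups` maps a group name to a list of consumer names (hypothetical names).
    Returns {consumer_name: [messages that consumer received]}.
    """
    received = defaultdict(list)
    for members in groups.values():
        # Each group walks the full message list independently,
        # so every group gets its own copy of each message.
        # Within a group, messages are shared out round-robin (a simplification).
        for message, member in zip(messages, cycle(members)):
            received[member].append(message)
    return dict(received)


messages = ["m0", "m1", "m2", "m3"]
groups = {
    "analytics": ["analytics-1", "analytics-2"],  # two consumers share the load
    "audit": ["audit-1"],                         # a lone consumer gets everything
}
received = deliver(messages, groups)
```

Adding a third consumer to the `analytics` group would spread the same four messages over three consumers, while `audit-1` would still receive all of them.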
Let’s play

■ Start Kafka on our sandbox
■ Set up a topic
– Publish some data to it, and watch it get consumed
■ Set up a file connector
– Monitor a log file and publish additions to it
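
The "monitor a log file and publish additions" idea can be sketched in plain Python. This is not how Kafka's file connector is used (that is set up through configuration in Kafka Connect rather than written as code); the `read_additions` helper, the `access.log` file name, and the list standing in for a topic are all hypothetical.

```python
import os
import tempfile


def read_additions(path, position):
    """Return (new_lines, new_position) for complete lines appended after `position`."""
    with open(path, "rb") as f:
        f.seek(position)
        chunk = f.read()
    # Only consume complete lines; a partial trailing line waits for the next call.
    end = chunk.rfind(b"\n") + 1
    lines = chunk[:end].decode("utf-8").splitlines()
    return lines, position + end


topic = []  # toy stand-in for a Kafka topic, not a real one

with tempfile.TemporaryDirectory() as tmp:
    log_path = os.path.join(tmp, "access.log")  # hypothetical log file

    with open(log_path, "w", encoding="utf-8") as f:
        f.write("line one\n")
    position = 0
    lines, position = read_additions(log_path, position)
    topic.extend(lines)  # "publish" the first line

    with open(log_path, "a", encoding="utf-8") as f:
        f.write("line two\nline three\n")
    lines, position = read_additions(log_path, position)
    topic.extend(lines)  # only the two newly appended lines are published
```

Remembering the byte position between calls is what keeps already-published lines from being sent again.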
