Lecture Intro Kafka

This document introduces Apache Kafka. It defines Kafka as a high-performance, real-time messaging system that is distributed, fault-tolerant and highly scalable. It describes Kafka's data model involving messages and topics, as well as its architecture which uses brokers to distribute messages from producers to consumers. Key aspects of Kafka like partitions, producers, consumers and guarantees of message ordering and fault tolerance are also summarized.

Introduction to Kafka

Dr. Rajiv Misra


Associate Professor
Dept. of Computer Science & Engg.
Indian Institute of Technology Patna
Cloud Computing and Distributed Systems | Vu Pham | Introduction to Kafka
Preface
Content of this Lecture:

Define Kafka

Describe some use cases for Kafka

Describe the Kafka data model

Describe Kafka architecture

List the types of messaging systems

Explain the importance of brokers


Batch vs. Streaming

[Figure: batch processing versus stream processing]

Introduction: Apache Kafka
Kafka is a high-performance, real-time messaging
system. It is open source and is developed as an Apache
Software Foundation project.

The characteristics of Kafka are:

1. It is a distributed and partitioned messaging system.
2. It is highly fault-tolerant.
3. It is highly scalable.
4. It can process and send millions of messages per second
to several receivers.

Kafka History
Apache Kafka was originally developed at LinkedIn and was
later handed over to the open source community in early
2011.

It graduated from the Apache Incubator to become a top-level
Apache project in October 2012.

A stable Apache Kafka version, 0.8.2.0, was released in
February 2015.
A further stable version, 0.8.2.1, was released in May 2015;
it was the latest version at the time this history was written.

Kafka Use Cases
Kafka can be used for various purposes in an organization,
such as:

Apache Kafka: a Streaming Data Platform
Most of what a business does can be thought of as event
streams. Event streams occur in a:
• Retail system: orders, shipments, returns, …
• Financial system: stock ticks, orders, …
• Web site: page views, clicks, searches, …
• IoT: sensor readings, …
and so on.

Enter Kafka
Adopted at thousands of companies worldwide

Aggregating User Activity Using Kafka: Example

Kafka can be used to aggregate user activity data such as clicks,
navigation, and searches from an organization's different websites.
These user activities can then be sent to a real-time
monitoring system and to a Hadoop system for offline processing.

Kafka Data Model
The Kafka data model consists of messages and topics.
Messages represent information such as lines in a log file, a row of stock
market data, or an error message from a system.
Messages are grouped into categories called topics.
Example topics: LogMessage and StockMessage.
The processes that publish messages to a topic in Kafka are known as
producers.
The processes that receive messages from a topic in Kafka are known as
consumers.
The server processes within Kafka that handle the messages are known as
brokers.
A Kafka cluster consists of a set of brokers that process the messages.

Topics
A topic is a category of messages in Kafka.
The producers publish messages to topics.
The consumers read messages from topics.
A topic is divided into one or more partitions.
A partition is also known as a commit log.
Each partition contains an ordered sequence of messages.
Each message is identified by its offset in the partition.
Messages are appended at the tail of the partition and read
sequentially by offset; reading a message does not remove it.
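The idea of a partition as a commit log with offsets can be sketched in a few lines. This is a toy in-memory model, not Kafka code: the `Partition` class and its `append` and `read` methods are hypothetical names chosen for illustration.

```python
class Partition:
    """A toy, in-memory stand-in for one Kafka partition (an append-only log)."""

    def __init__(self):
        self._log = []  # position in this list == message offset

    def append(self, message):
        """Append a message at the tail and return the offset it was assigned."""
        self._log.append(message)
        return len(self._log) - 1

    def read(self, offset, max_messages=10):
        """Return (offset, message) pairs starting at `offset`; nothing is removed."""
        end = min(offset + max_messages, len(self._log))
        return [(i, self._log[i]) for i in range(offset, end)]


partition = Partition()
for line in ["login", "click", "logout"]:
    partition.append(line)

first_read = partition.read(0)                   # read everything from offset 0
second_read = partition.read(1, max_messages=1)  # re-read a single message at offset 1
print(first_read)
print(second_read)
```

Because reading only takes an offset, two readers can work through the same log independently, each remembering its own position.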

Partitions
Topics are divided into partitions, which are the unit of
parallelism in Kafka.

Partitions allow the messages in a topic to be distributed across
multiple servers.
A topic can have any number of partitions.
Each partition must fit on a single Kafka server.
The number of partitions determines the parallelism of the topic.

Partition Distribution
Partitions can be distributed across the Kafka cluster.
Each Kafka server may handle one or more partitions.
A partition can be replicated across several servers for fault tolerance.
One server is marked as the leader for the partition and the others are
marked as followers.
The leader handles the reads and writes for the partition, whereas the
followers replicate the data.
If the leader fails, one of the followers automatically becomes the leader.
ZooKeeper is used for leader election.
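A minimal sketch of the failover idea, assuming a deliberately simple rule: the first replica in the list that is still alive becomes the leader. The function name `elect_leader` and the broker ids are hypothetical, and the rule is a simplification for illustration rather than a description of Kafka's actual election logic.

```python
def elect_leader(replicas, live_brokers):
    """Return the first replica that is still alive, or None if none is.

    Assumption: "first live replica in the list" is a simplification chosen
    for this sketch.
    """
    for broker in replicas:
        if broker in live_brokers:
            return broker
    return None


replicas = [1, 2, 3]   # hypothetical broker ids holding copies of one partition
live = {1, 2, 3}

leader_before = elect_leader(replicas, live)  # all brokers are up
live.discard(1)                               # the current leader fails
leader_after = elect_leader(replicas, live)   # a follower takes over
live.clear()                                  # every replica is down
leader_none = elect_leader(replicas, live)    # nobody can lead

print(leader_before, leader_after, leader_none)
```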

Producers
The producer is the creator of the message in Kafka.

The producers publish each message to a particular topic.
The producers also decide which partition to place the message into.
A topic should already exist before a producer publishes a message to it,
unless the brokers are configured to create topics automatically.
Messages are appended at one end of the partition.
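One common way for a producer to decide on a partition is to hash a message key, so that all messages with the same key land in the same partition and keep their relative order. The sketch below assumes CRC32 as the hash only because it is deterministic and in the Python standard library; Kafka's own default partitioner uses a different hash function. The function `choose_partition`, the partition count, and the sample events are all hypothetical.

```python
import zlib


def choose_partition(key, num_partitions):
    """Map a message key to a partition index with a stable hash.

    Assumption: CRC32 stands in for the real hash function.
    """
    return zlib.crc32(key.encode("utf-8")) % num_partitions


NUM_PARTITIONS = 3  # hypothetical partition count
partitions = {p: [] for p in range(NUM_PARTITIONS)}

events = [
    ("user-1", "login"),
    ("user-2", "login"),
    ("user-1", "click"),
    ("user-1", "logout"),
]
for key, value in events:
    partitions[choose_partition(key, NUM_PARTITIONS)].append((key, value))

# All events for user-1 sit in one partition, in the order they were sent.
user1_partition = choose_partition("user-1", NUM_PARTITIONS)
user1_events = [v for k, v in partitions[user1_partition] if k == "user-1"]
print(user1_events)
```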

Consumers
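The consumer is the receiver of the message in Kafka.

Each consumer belongs to a consumer group.
A consumer group may have one or more consumers.
The consumers specify which topics they want to listen to.
Within a consumer group, each message is delivered to only one consumer;
every subscribed consumer group receives its own copy of the message.
Consumer groups therefore control the messaging model: consumers in the
same group share the work, whereas separate groups each see every message.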
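Inside a group, work is usually shared by giving each consumer a subset of a topic's partitions. The sketch below assumes a plain round-robin split; the function `assign_partitions` and the consumer names are hypothetical, and Kafka's real assignment strategies are more involved.

```python
def assign_partitions(partitions, consumers):
    """Spread partitions over the consumers of one group, round-robin."""
    assignment = {consumer: [] for consumer in consumers}
    for i, partition in enumerate(partitions):
        assignment[consumers[i % len(consumers)]].append(partition)
    return assignment


topic_partitions = [0, 1, 2, 3]

two_consumers = assign_partitions(topic_partitions, ["c1", "c2"])
five_consumers = assign_partitions(topic_partitions, ["c1", "c2", "c3", "c4", "c5"])

print(two_consumers)
print(five_consumers)
```

With five consumers and only four partitions, one consumer receives nothing, which illustrates why the number of partitions bounds the useful parallelism of a group.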

Kafka Architecture
The Kafka architecture consists of brokers that take messages from the
producers and append them to a partition of a topic. Brokers serve the
messages to the consumers from the partitions.
• A topic is divided into multiple partitions.
• The messages are appended to a partition at one end and consumed in
the same order.
• Each partition acts as a message queue.
• Consumers are divided into consumer groups.
• Each message is delivered to one consumer in each consumer group.
• ZooKeeper is used for coordination.

Types of Messaging Systems
Kafka supports both the publish-subscribe and the queue messaging models.
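Both models fall out of one rule: each message goes to exactly one consumer in every consumer group. The sketch below assumes consumers within a group simply take turns, which is a simplification; the `deliver` function and the group and consumer names are hypothetical.

```python
from collections import defaultdict


def deliver(messages, groups):
    """Deliver each message to exactly one consumer in every group.

    `groups` maps a group name to its list of consumer names. Within a
    group, consumers take turns (an assumption made for this sketch).
    Returns {consumer_name: [messages received]}.
    """
    received = defaultdict(list)
    for i, message in enumerate(messages):
        for consumers in groups.values():
            received[consumers[i % len(consumers)]].append(message)
    return dict(received)


messages = ["m0", "m1", "m2", "m3"]

# Queue behaviour: one group, several consumers sharing the load.
queue_style = deliver(messages, {"workers": ["w1", "w2"]})

# Publish-subscribe behaviour: several groups, each seeing every message.
pubsub_style = deliver(messages, {"audit": ["a1"], "billing": ["b1"]})

print(queue_style)
print(pubsub_style)
```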

Example: Queue System

Example: Publish-Subscribe System

Brokers
Brokers are the Kafka processes that process the messages in Kafka.

• Each machine in the cluster typically runs one broker.
• The brokers coordinate with each other using ZooKeeper.
• One broker acts as the leader for a partition and handles the
delivery and persistence, whereas the others act as followers.

Kafka Guarantees
Kafka guarantees the following:

1. Messages sent by a producer to a particular topic partition
are appended in the order in which they are sent.
2. A consumer instance sees the messages in the order in which
they are stored in the partition.
3. A topic with replication factor N tolerates up to N-1
server failures without losing committed messages.

Replication in Kafka
Kafka uses the primary-backup method of replication.
One machine (one replica) is called the leader and is chosen
as the primary; the remaining machines (replicas) are
chosen as the followers and act as backups.
The leader propagates the writes to the followers.
The leader waits until the writes are completed on all the
in-sync replicas.
If a replica is down, it is skipped for the write until it
comes back.
If the leader fails, one of the followers will be chosen as
the new leader; this mechanism can tolerate n-1 failures if
the replication factor is n.
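The primary-backup behaviour described above can be sketched with a toy model. Everything here is hypothetical and in-memory: the `ReplicatedPartition` class, its method names, and the rule for picking a new leader (lowest surviving id, used only to keep the example deterministic) are assumptions for illustration, not Kafka internals.

```python
class ReplicatedPartition:
    """Toy primary-backup replication for a single partition."""

    def __init__(self, replica_ids):
        self.logs = {r: [] for r in replica_ids}  # one log per replica
        self.up = set(replica_ids)                # replicas currently alive
        self.leader = replica_ids[0]

    def write(self, message):
        """Append on the leader and on every live follower; skip down replicas."""
        if self.leader is None:
            raise RuntimeError("no live replica to act as leader")
        for replica in self.up:
            self.logs[replica].append(message)

    def fail(self, replica):
        """Mark a replica as down; promote a survivor if it was the leader."""
        self.up.discard(replica)
        if replica == self.leader:
            self.leader = min(self.up) if self.up else None

    def recover(self, replica):
        """Bring a replica back and let it catch up from the leader's log."""
        if self.leader is None:
            self.leader = replica  # it keeps whatever it had logged
        else:
            self.logs[replica] = list(self.logs[self.leader])
        self.up.add(replica)


p = ReplicatedPartition([1, 2, 3])  # replication factor 3
p.write("a")
p.fail(2)        # a follower goes down and is skipped for later writes
p.write("b")
p.fail(1)        # the leader goes down; a follower is promoted
p.write("c")     # still writable after 2 of 3 replicas have failed
p.recover(2)     # the returning replica catches up from the leader

print(p.leader, p.logs)
```

With a replication factor of 3, the partition stays writable through two failures, matching the n-1 bound.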
Persistence in Kafka
Kafka uses the Linux file system for persistence of messages.
Persistence ensures that no messages are lost.
Kafka relies on the file system page cache for fast reads
and writes.
All data is immediately written to a file in the file system,
without necessarily being flushed to disk.
Messages are grouped into message sets for more efficient
writes.
Message sets can be compressed to reduce network
bandwidth.
A standardized binary message format is shared by
producers, brokers, and consumers to minimize data
modification.
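To see why batching messages into a set and compressing the set as a whole saves bandwidth, consider the sketch below. JSON and zlib are stand-ins chosen because they ship with Python; Kafka uses its own binary format and its own choice of compression codecs. The function names and the sample log line are hypothetical.

```python
import json
import zlib


def pack_message_set(messages):
    """Serialize a batch of messages and compress it as one unit."""
    raw = json.dumps(messages).encode("utf-8")
    return zlib.compress(raw)


def unpack_message_set(blob):
    """Reverse pack_message_set: decompress, then deserialize."""
    return json.loads(zlib.decompress(blob).decode("utf-8"))


messages = ["GET /index.html 200"] * 100  # repetitive, log-like payloads
blob = pack_message_set(messages)

raw_size = sum(len(m.encode("utf-8")) for m in messages)
print("uncompressed bytes:", raw_size)
print("compressed set bytes:", len(blob))
```

Compressing the whole set lets the codec exploit repetition across messages, which per-message compression could not do.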
Apache Kafka: a Streaming Data Platform
Apache Kafka is an open source streaming data platform (a new
category of software!) with 3 major components:
1. Kafka Core: A central hub to transport and store event
streams in real time.
2. Kafka Connect: A framework to import event streams from
other source data systems into Kafka and export event
streams from Kafka to destination data systems.
3. Kafka Streams: A Java library to process event streams live as
they occur.

Further Learning
o Kafka Streams code examples
   o Apache Kafka
     https://github.com/apache/kafka/tree/trunk/streams/examples/src/main/java/org/apache/kafka/streams/examples
   o Confluent https://github.com/confluentinc/examples/tree/master/kafka-streams
o Source code https://github.com/apache/kafka/tree/trunk/streams
o Kafka Streams Java docs
   http://docs.confluent.io/current/streams/javadocs/index.html
o First book on Kafka Streams (MEAP)
   o Kafka Streams in Action https://www.manning.com/books/kafka-streams-in-action
o Kafka Streams download
   o Apache Kafka https://kafka.apache.org/downloads
   o Confluent Platform http://www.confluent.io/download

Conclusion
Kafka is a high-performance, real-time messaging system.

Kafka can be used as an external commit log for distributed
systems.

The Kafka data model consists of messages and topics.

The Kafka architecture consists of brokers that take messages from the
producers and append them to a partition of a topic.

The Kafka architecture supports two types of messaging system:
publish-subscribe and queue.

Brokers are the Kafka processes that process the messages in Kafka.