This document introduces Apache Kafka. It defines Kafka as a high-performance, real-time messaging system that is distributed, fault-tolerant and highly scalable. It describes Kafka's data model involving messages and topics, as well as its architecture which uses brokers to distribute messages from producers to consumers. Key aspects of Kafka like partitions, producers, consumers and guarantees of message ordering and fault tolerance are also summarized.
Associate Professor, Dept. of Computer Science & Engg., Indian Institute of Technology Patna [email protected]
Cloud Computing and Distributed Systems
Vu Pham
Introduction to Kafka
Preface
Content of this Lecture:
Define Kafka
Describe some use cases for Kafka
Describe the Kafka data model
Describe Kafka architecture
List the types of messaging systems
Explain the importance of brokers
Batch vs. Streaming
[Figure: Batch | Streaming]
Introduction: Apache Kafka
Kafka is a high-performance, real-time messaging system. It is an open-source tool and an Apache Software Foundation project.
The characteristics of Kafka are:
1. It is a distributed and partitioned messaging system.
2. It is highly fault-tolerant.
3. It is highly scalable.
4. It can process and send millions of messages per second to several receivers.
Kafka History
Apache Kafka was originally developed at LinkedIn and was handed over to the open-source community in early 2011.
It became a top-level Apache project in October 2012.
The stable Apache Kafka version 0.8.2.0 was released in Feb 2015. The stable version 0.8.2.1 followed in May 2015 and was the latest version when these slides were prepared.
Kafka Use Cases
Kafka can be used for various purposes in an organization, such as:
Apache Kafka: a Streaming Data Platform
Most of what a business does can be thought of as event streams. They occur in:
• Retail systems: orders, shipments, returns, …
• Financial systems: stock ticks, orders, …
• Web sites: page views, clicks, searches, …
• IoT: sensor readings, …
and so on.
Enter Kafka
Adopted at thousands of companies worldwide.
Aggregating User Activity Using Kafka: Example
Kafka can be used to aggregate user activity data such as clicks, navigation, and searches from an organization's different websites. These user activities can be sent to a real-time monitoring system and to a Hadoop system for offline processing.
Kafka Data Model
The Kafka data model consists of messages and topics.
Messages represent information such as lines in a log file, a row of stock market data, or an error message from a system.
Messages are grouped into categories called topics. Example: LogMessage and StockMessage.
The processes that publish messages into a topic in Kafka are known as producers.
The processes that receive the messages from a topic in Kafka are known as consumers.
The processes or servers within Kafka that process the messages are known as brokers.
A Kafka cluster consists of a set of brokers that process the messages.
Topics
A topic is a category of messages in Kafka.
The producers publish the messages into topics.
The consumers read the messages from topics.
A topic is divided into one or more partitions.
A partition is also known as a commit log.
Each partition contains an ordered set of messages.
Each message is identified by its offset in the partition.
Messages are appended at one end of the partition and read in order from the other.
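The idea of a partition as a commit log with offsets can be pictured with a minimal sketch. This is a toy in-memory model, not Kafka's storage code; the class name PartitionLog and its methods are hypothetical names chosen for illustration.

```python
class PartitionLog:
    """Toy append-only log standing in for one topic partition (hypothetical)."""

    def __init__(self):
        self._messages = []

    def append(self, message):
        """Add a message at the tail of the log and return its offset."""
        self._messages.append(message)
        return len(self._messages) - 1

    def read(self, offset, max_messages=10):
        """Return up to max_messages (offset, message) pairs starting at offset."""
        end = min(offset + max_messages, len(self._messages))
        return [(i, self._messages[i]) for i in range(offset, end)]


log = PartitionLog()
offsets = [log.append(m) for m in ["a", "b", "c"]]
print(offsets)      # [0, 1, 2]
print(log.read(1))  # [(1, 'b'), (2, 'c')]
```

Reading never removes anything from the list, so a second reader can start again from offset 0 and see the same messages in the same order.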
Partitions
Topics are divided into partitions, which are the unit of parallelism in Kafka.
Partitions allow messages in a topic to be distributed to multiple servers.
A topic can have any number of partitions.
Each partition must fit on a single Kafka server.
The number of partitions determines the parallelism of the topic.
Partition Distribution
Partitions can be distributed across the Kafka cluster.
Each Kafka server may handle one or more partitions.
A partition can be replicated across several servers for fault tolerance.
One server is marked as the leader for the partition and the others are marked as followers.
The leader handles the reads and writes for the partition, whereas the followers replicate the data.
If a leader fails, one of the followers automatically becomes the leader.
ZooKeeper is used for leader election.
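The leader and follower roles described above can be imitated with a small in-memory model. This is only a sketch under stated assumptions: the class ReplicatedPartition is hypothetical, the election rule (lowest surviving broker id) is invented for the example, and real Kafka coordinates election through ZooKeeper rather than inside the partition object.

```python
class ReplicatedPartition:
    """Toy model of one partition replicated across several brokers (hypothetical)."""

    def __init__(self, broker_ids):
        # broker id -> that broker's local copy of the partition log
        self.replicas = {b: [] for b in broker_ids}
        self.alive = set(broker_ids)
        self.leader = broker_ids[0]

    def write(self, message):
        """Accept a write on the leader and copy it to every replica that is up."""
        if self.leader is None:
            raise RuntimeError("no live replica left for this partition")
        for broker in self.alive:
            self.replicas[broker].append(message)

    def fail(self, broker):
        """Mark a broker as down; pick a new leader if it was leading."""
        self.alive.discard(broker)
        if broker == self.leader:
            # Assumed rule for this sketch: lowest surviving broker id wins.
            self.leader = min(self.alive) if self.alive else None


p = ReplicatedPartition([1, 2, 3])
p.write("m0")
p.fail(1)            # the leader goes down, a follower takes over
p.write("m1")
p.fail(2)
print(p.leader)      # 3
print(p.replicas[3]) # ['m0', 'm1']
```

With three replicas, the partition in this toy run stays writable and keeps both messages after two broker failures.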
Producers
The producer is the creator of the message in Kafka.
The producers place the message into a particular topic.
The producers also decide which partition to place the message into.
Topics should already exist before a message is placed by the producer.
Messages are added at one end of the partition.
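One common way for a producer to decide on a partition is to hash a message key, so that all messages with the same key land in the same partition. The sketch below illustrates that idea only; choose_partition is a hypothetical helper, CRC32 is a stand-in hash, and the real Kafka client uses its own hash function and its own strategy for messages without a key.

```python
import zlib
from itertools import count

# Shared counter used to rotate unkeyed messages across partitions.
_round_robin = count()


def choose_partition(key, num_partitions):
    """Pick a partition index for a message (hypothetical helper).

    Keyed messages: hash of the key modulo the partition count.
    Unkeyed messages (key is None): simple round-robin.
    """
    if key is None:
        return next(_round_robin) % num_partitions
    return zlib.crc32(key.encode("utf-8")) % num_partitions


first = choose_partition("user-42", 4)
second = choose_partition("user-42", 4)
print(first == second)  # True: the same key always maps to the same partition
```

Because ordering is kept per partition, routing all messages of one key to one partition is what preserves the order of that key's messages.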
Consumers
The consumer is the receiver of the message in Kafka.
Each consumer belongs to a consumer group.
A consumer group may have one or more consumers.
The consumers specify what topics they want to listen to.
Within a consumer group, each message is delivered to one consumer; every group subscribed to the topic receives its own copy of the message.
Consumer groups are thus used to control how the messaging system delivers messages.
Kafka Architecture
Kafka architecture consists of brokers that take messages from the producers and add them to a partition of a topic. Brokers provide the messages to the consumers from the partitions.
• A topic is divided into multiple partitions.
• The messages are added to the partitions at one end and consumed in the same order.
• Each partition acts as a message queue.
• Consumers are divided into consumer groups.
• Each message is delivered to one consumer in each consumer group.
• ZooKeeper is used for coordination.
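Delivering each message to exactly one consumer per group follows from giving each partition a single owner inside the group. A minimal sketch, assuming a plain round-robin rule; assign_partitions is a hypothetical function and is not one of Kafka's built-in assignment strategies.

```python
def assign_partitions(partitions, consumers):
    """Give every partition exactly one owner within a consumer group.

    Hypothetical round-robin rule: partition i goes to consumer i mod len(consumers).
    """
    assignment = {c: [] for c in consumers}
    for i, partition in enumerate(sorted(partitions)):
        owner = consumers[i % len(consumers)]
        assignment[owner].append(partition)
    return assignment


print(assign_partitions([0, 1, 2, 3], ["c1", "c2"]))
# {'c1': [0, 2], 'c2': [1, 3]}
print(assign_partitions([0, 1], ["c1", "c2", "c3"]))
# {'c1': [0], 'c2': [1], 'c3': []}
```

The second call shows why the number of partitions bounds the parallelism of a topic: a third consumer in the group has nothing to read when there are only two partitions.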
Types of Messaging Systems
Kafka architecture supports both the publish-subscribe and the queue messaging systems.
Example: Queue System
Example: Publish-Subscribe System
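The two messaging styles can be imitated with consumer groups alone: one group with several consumers behaves like a queue, while several groups each receive every message, as in publish-subscribe. The sketch below is a simplification under stated assumptions: deliver is a hypothetical function, and it rotates individual messages among a group's consumers, whereas Kafka distributes whole partitions.

```python
from collections import defaultdict


def deliver(messages, groups):
    """Deliver each message once per group (hypothetical helper).

    groups maps a group name to its list of consumer names. Inside a group,
    messages are rotated among the consumers.
    Returns a dict: consumer name -> list of messages received.
    """
    received = defaultdict(list)
    for i, message in enumerate(messages):
        for consumers in groups.values():
            received[consumers[i % len(consumers)]].append(message)
    return dict(received)


msgs = ["m0", "m1", "m2", "m3"]

# Queue: a single group, so the work is split between its consumers.
queue = deliver(msgs, {"workers": ["w1", "w2"]})
print(queue)   # {'w1': ['m0', 'm2'], 'w2': ['m1', 'm3']}

# Publish-subscribe: two groups, so each group sees every message.
pubsub = deliver(msgs, {"g1": ["a"], "g2": ["b"]})
print(pubsub)  # {'a': ['m0', 'm1', 'm2', 'm3'], 'b': ['m0', 'm1', 'm2', 'm3']}
```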
Brokers
Brokers are the Kafka processes that process the messages in Kafka.
• Each machine in the cluster can run one broker.
• They coordinate with each other using ZooKeeper.
• One broker acts as the leader for a partition and handles the delivery and persistence, whereas the others act as followers.
Kafka Guarantees
Kafka guarantees the following:
1. Messages sent by a producer to a given topic partition are appended in the order in which they are sent.
2. A consumer instance receives the messages in the order in which they are stored in the partition.
3. A topic with replication factor N tolerates up to N-1 server failures.
Replication in Kafka
Kafka uses the primary-backup method of replication.
One machine (one replica) is called the leader and is chosen as the primary; the remaining machines (replicas) are chosen as the followers and act as backups.
The leader propagates the writes to the followers.
The leader waits until the writes are completed on all the replicas.
If a replica is down, it is skipped for the write until it comes back.
If the leader fails, one of the followers will be chosen as the new leader; this mechanism can tolerate n-1 failures if the replication factor is n.
Persistence in Kafka
Kafka uses the Linux file system for persistence of messages.
Persistence ensures no messages are lost.
Kafka relies on the file system page cache for fast reads and writes.
All the data is immediately written to a file in the file system.
Messages are grouped as message sets for more efficient writes.
Message sets can be compressed to reduce network bandwidth.
A standardized binary message format is used among producers, brokers, and consumers to minimize data modification.
Apache Kafka: a Streaming Data Platform
Apache Kafka is an open-source streaming data platform (a new category of software!) with 3 major components:
1. Kafka Core: A central hub to transport and store event streams in real time.
2. Kafka Connect: A framework to import event streams from other source data systems into Kafka and export event streams from Kafka to destination data systems.
3. Kafka Streams: A Java library to process event streams live as they occur.
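Processing event streams "live as they occur" means emitting an updated result for every incoming event instead of waiting for a complete batch. The following is a plain-Python illustration of that idea only; it does not use the Kafka Streams API (which is a Java library), and running_counts and the sample click events are hypothetical.

```python
from collections import Counter


def running_counts(events):
    """Yield (event, count so far) as soon as each event arrives."""
    counts = Counter()
    for event in events:
        counts[event] += 1
        yield event, counts[event]


# Hypothetical stream of page-view events, consumed one at a time.
clicks = iter(["home", "search", "home", "cart", "home"])
for page, seen in running_counts(clicks):
    print(page, seen)
# home 1
# search 1
# home 2
# cart 1
# home 3
```

Because running_counts is a generator, it never needs the whole input in memory and would work the same way on an unbounded stream.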
Further Learning
o Kafka Streams code examples
  o Apache Kafka https://github.com/apache/kafka/tree/trunk/streams/examples/src/main/java/org/apache/kafka/streams/examples
  o Confluent https://github.com/confluentinc/examples/tree/master/kafka-streams
o Source Code https://github.com/apache/kafka/tree/trunk/streams
o Kafka Streams Java docs http://docs.confluent.io/current/streams/javadocs/index.html
o First book on Kafka Streams (MEAP)
  o Kafka Streams in Action https://www.manning.com/books/kafka-streams-in-action
o Kafka Streams download
  o Apache Kafka https://kafka.apache.org/downloads
  o Confluent Platform http://www.confluent.io/download
Conclusion
Kafka is a high-performance, real-time messaging system.
Kafka can be used as an external commit log for distributed systems.
The Kafka data model consists of messages and topics.
Kafka architecture consists of brokers that take messages from the producers and add them to a partition of a topic.
Kafka architecture supports two types of messaging systems: publish-subscribe and queue.
Brokers are the Kafka processes that process the messages in Kafka.