Airbnb
Jay Kreps
Messaging
JMS, AMQP, etc.
Logs
Syslog, log4j, etc.
Search
Social Graph
Recommendations
Live Storage
Hadoop
Data Warehouse
Monitoring Systems
Point-to-Point Pipelines
Centralized Pipeline
Problems
My Experience
LinkedIn's Pipeline
Problems
Fragility
Multi-hour delay
Coverage
Labor intensive
Slow
Does it work?
Four Ideas
1
2
3
4
Very confused
Messaging (JMS, AMQP, ...)
Log aggregation
CEP, Streaming
First Attempt:
Don't reinvent the wheel!
Persistence is an afterthought
Ad hoc distribution
Odd semantics
Featuritis
Second Attempt:
Reinvent the wheel!
Data Flow
Apache Kafka
Some Terminology
Producers send messages to Brokers
Consumers read messages from Brokers
Messages are sent to a Topic
Each topic is broken into one or more ordered partitions of messages
APIs
send(String topic, String key, Message message)
Iterator<Message>
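The terminology and the two API shapes above can be sketched with a toy in-memory broker. This is not Kafka's client API: the class name `ToyBroker`, the `consume` method, the use of `String` as the message type, and the choice of partition by hashing the key are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

/** Toy in-memory stand-in for a broker (hypothetical, not Kafka's API). */
public class ToyBroker {
    private final int partitionsPerTopic;
    // topic -> partitions; each partition is an ordered, append-only list
    private final Map<String, List<List<String>>> topics = new HashMap<>();

    public ToyBroker(int partitionsPerTopic) {
        this.partitionsPerTopic = partitionsPerTopic;
    }

    private List<List<String>> partitionsOf(String topic) {
        return topics.computeIfAbsent(topic, t -> {
            List<List<String>> parts = new ArrayList<>();
            for (int i = 0; i < partitionsPerTopic; i++) {
                parts.add(new ArrayList<>());
            }
            return parts;
        });
    }

    /** Appends to the partition chosen by the key (assumed rule); returns that partition's index. */
    public int send(String topic, String key, String message) {
        int partition = Math.floorMod(key.hashCode(), partitionsPerTopic);
        partitionsOf(topic).get(partition).add(message);
        return partition;
    }

    /** Iterates one partition in write order, starting at the given offset. */
    public Iterator<String> consume(String topic, int partition, int offset) {
        List<String> log = partitionsOf(topic).get(partition);
        return log.subList(offset, log.size()).iterator();
    }
}
```

Under this assumed rule, messages sent with the same key land in the same partition, so a consumer iterating that partition sees them in the order they were written. Ordering across partitions is not defined in this sketch.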
Distribution
Performance
50 MB/sec writes
110 MB/sec reads
Performance Tricks
Batching
Producer
Broker
Consumer
Batch Compression
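A small experiment suggests why compressing a whole batch can beat compressing each message alone: similar messages share structure that the compressor can only exploit when it sees them together. The sketch below uses gzip from the Java standard library; the class name, the sample messages, and the newline framing are assumptions, and it does not reproduce Kafka's actual wire format or codec.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.GZIPOutputStream;

public class BatchCompression {
    /** Gzips a byte array and returns the compressed size in bytes. */
    public static int gzipSize(byte[] data) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(out)) {
            gz.write(data);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.size();
    }

    /** Total size when every message is compressed on its own. */
    public static int sizeCompressedIndividually(List<String> messages) {
        int total = 0;
        for (String m : messages) {
            total += gzipSize(m.getBytes(StandardCharsets.UTF_8));
        }
        return total;
    }

    /** Size when all messages are concatenated (newline-framed) and compressed once. */
    public static int sizeCompressedAsBatch(List<String> messages) {
        ByteArrayOutputStream batch = new ByteArrayOutputStream();
        for (String m : messages) {
            byte[] bytes = m.getBytes(StandardCharsets.UTF_8);
            batch.write(bytes, 0, bytes.length);
            batch.write('\n');
        }
        return gzipSize(batch.toByteArray());
    }

    /** Hypothetical, highly repetitive event messages. */
    public static List<String> sampleMessages(int n) {
        List<String> messages = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            messages.add("{\"event\":\"page_view\",\"member\":" + i + "}");
        }
        return messages;
    }
}
```

Comparing `sizeCompressedAsBatch(sampleMessages(200))` with `sizeCompressedIndividually(sampleMessages(200))` shows the batch form is smaller, partly because each individual gzip stream pays its own header and trailer.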
Kafka Replication
In 0.8 release
Messages are highly available
No centralized master
Kafka Info
http://incubator.apache.org/kafka
Usage at LinkedIn
10 billion messages/day
Sustained peak:
172,000 messages/second written
950,000 messages/second read
367 topics
40 real-time consumers
Many ad hoc consumers
10k connections/colo
9.5TB log retained
End-to-end delivery time: 10 seconds (avg)
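As a quick sanity check on these figures, the daily volume can be converted to an average rate and compared with the peak numbers. The sketch assumes the 10 billion messages/day figure counts writes; the class and method names are illustrative.

```java
public class ThroughputMath {
    public static final long MESSAGES_PER_DAY = 10_000_000_000L; // from the slide
    public static final long SECONDS_PER_DAY = 24L * 60 * 60;    // 86,400

    /** Average write rate implied by the daily volume (integer division). */
    public static long averageWritesPerSecond() {
        return MESSAGES_PER_DAY / SECONDS_PER_DAY;
    }

    /** Ratio of peak write rate to the average write rate. */
    public static double peakToAverage() {
        return 172_000.0 / ((double) MESSAGES_PER_DAY / SECONDS_PER_DAY);
    }

    /** Ratio of peak reads to peak writes, i.e. how many times each message is read. */
    public static double readFanOut() {
        return 950_000.0 / 172_000.0;
    }
}
```

Under that assumption the average is about 115,740 messages/second, the sustained peak is roughly 1.5 times the average, and each written message is read about 5.5 times.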
Datacenters
Four Ideas
1
2
3
4
Problem
Schema free?
LOAD 'student' USING PigStorage()
AS (name:chararray, age:int, gpa:float)
Schemas
Structure can be exploited
Performance
Size
Compatibility
Need a formal contract
Avro Schema
Avro data definition and schema
Central repository of all schemas
Reader always uses same schema as writer
Programmatic compatibility model
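One way to picture a programmatic compatibility model is a check that runs when a new schema version is proposed. The sketch below is a deliberately simplified version of the field-level rules in Avro's schema resolution: a field the reader expects must either exist in the written data with the same type or carry a default. It does not use the Avro library, it ignores type promotions such as int to long, and the names `SchemaCompat`, `Field`, and `problems` are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class SchemaCompat {
    /** A record field: its type name and whether the schema declares a default. */
    public static final class Field {
        final String type;
        final boolean hasDefault;

        public Field(String type, boolean hasDefault) {
            this.type = type;
            this.hasDefault = hasDefault;
        }
    }

    /**
     * Lists what would stop a reader schema from decoding data written with
     * the writer schema. An empty list means compatible under these
     * simplified rules. Fields only the writer has are ignored.
     */
    public static List<String> problems(Map<String, Field> writer, Map<String, Field> reader) {
        List<String> problems = new ArrayList<>();
        for (Map.Entry<String, Field> entry : reader.entrySet()) {
            String name = entry.getKey();
            Field expected = entry.getValue();
            Field written = writer.get(name);
            if (written == null) {
                if (!expected.hasDefault) {
                    problems.add("missing field without default: " + name);
                }
            } else if (!written.type.equals(expected.type)) {
                problems.add("type changed: " + name);
            }
        }
        return problems;
    }
}
```

Adding an `age` field with a default to a schema that only had `name` passes this check; adding a `gpa` field without a default does not, because older data has no value to supply.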
Workflow
1. Check in schema
2. Code review
3. Ship
Four Ideas
1
2
3
4
Does it work?
Audit Trail
Each producer, broker, and consumer periodically reports how many messages it saw
Reconcile these counts every few minutes
Graph and alert
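The reconciliation step can be sketched as a comparison of per-window counts across tiers. The class below is an illustration only: the tier names, the window keyed by its start time in seconds, and the rule that broker and consumer counts must equal the producer count are assumptions, not a description of LinkedIn's audit system.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

public class AuditTrail {
    // tier name -> (window start -> messages seen in that window)
    private final Map<String, Map<Long, Long>> counts = new HashMap<>();

    /** Records a periodic report; repeated reports for a window are summed. */
    public void report(String tier, long windowStart, long messagesSeen) {
        counts.computeIfAbsent(tier, t -> new HashMap<>())
              .merge(windowStart, messagesSeen, Long::sum);
    }

    /** Returns, in time order, the windows where a downstream tier disagrees with the producers. */
    public List<Long> mismatchedWindows() {
        Map<Long, Long> produced = counts.getOrDefault("producer", Collections.emptyMap());
        List<Long> mismatched = new ArrayList<>();
        for (Long window : new TreeSet<>(produced.keySet())) {
            long expected = produced.get(window);
            for (String tier : new String[] {"broker", "consumer"}) {
                Map<Long, Long> tierCounts = counts.getOrDefault(tier, Collections.emptyMap());
                long seen = tierCounts.getOrDefault(window, 0L);
                if (seen != expected) {
                    mismatched.add(window); // candidate for graphing and alerting
                    break;
                }
            }
        }
        return mismatched;
    }
}
```

If producers report 50 messages for a window and consumers report 48, that window shows up in `mismatchedWindows()`, which is the kind of signal that could feed a graph or an alert.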
Questions?