Vineet Gupta - GM - Software Engineering - Directi: Intelligent People. Uncommon Ideas

Intelligent People. Uncommon Ideas.
Vineet Gupta | GM Software Engineering | Directi http://vineetgupta.spaces.live.com

Licensed under Creative Commons Attribution Sharealike Noncommercial
Characteristics
App Tier Scaling
Replication
Partitioning
Consistency
Normalization
Caching
Data Engine Types
Offline Processing (Batching / Queuing)
Distributed Processing
Map Reduce
Non-blocking IO
Fault Detection, Tolerance and Recovery
22M+ users
Dozens of DB servers
Dozens of Web servers
Six specialized graph database servers to run recommendations engine
Source: http://highscalability.com/digg-architecture

1 TB / day
100 M blogs indexed / day
10 B objects indexed / day
0.5 B photos and videos
Data doubles in 6 months
Users double in 6 months
Source: http://www.royans.net/arch/2007/10/25/scaling-technorati-100-millionblogs-indexed-everyday/

2 PB raw storage
470 M photos, 4-5 sizes each
400 k photos added / day
35 M photos in Squid cache (total)
2 M photos in Squid RAM
38k reqs / sec to Memcached
4 B queries / day
Source: http://mysqldba.blogspot.com/2008/04/mysql-uc-2007-presentation-file.html

Virtualized database spans 600 production instances residing in 100+ server clusters distributed over 8 datacenters
2 PB of data
26 B SQL queries / day
1 B page views / day
3 B API calls / month
15,000 App servers
Source: http://highscalability.com/ebay-architecture/

450,000 low cost commodity servers in 2006
Indexed 8 B web-pages in 2005
200 GFS clusters (1 cluster = 1,000 to 5,000 machines)
Read / write throughput = 40 GB / sec across a cluster
Map-Reduce
100k jobs / day
20 PB of data processed / day
10k MapReduce programs
Source: http://highscalability.com/google-architecture/

Data size: ~ PB
Data growth: ~ TB / day
No. of servers: 10s to 10,000
No. of datacenters: 1 to 10
Queries: B+ / day
Specialized needs: more / other than RDBMS
[Figure: a single host running both the App Server and the DB Server, with CPUs and RAM added to it]
Sunfire E20k 36x 1.8GHz processors $450,000 - $2,500,000
PowerEdge SC1435 Dualcore 1.8 GHz processor Around $1,500
Increasing the hardware resources on a host
Pros
Simple to implement
Fast turnaround time
Cons
Finite limit
Hardware does not scale linearly (diminishing returns for each incremental unit)
Requires downtime
Increases downtime impact
Incremental costs increase exponentially
[Figure: App Server and DB Server, each on its own host]
Split services on separate nodes
Each node performs different tasks
Pros
Increases per-application availability
Task-based specialization, optimization and tuning possible
Reduces context switching
Simple to implement for out of band processes
No changes to App required
Flexibility increases
Cons
Sub-optimal resource utilization
May not increase overall availability
Finite scalability
[Figure: a load balancer in front of three web servers that share one DB server]
Add more nodes for the same service
Identical, doing the same task
Load Balancing
Hardware balancers are faster
Software balancers are more customizable
[Figure: User 1 and User 2 reach the web servers through a load balancer; the web servers share one DB server]
Asymmetrical load distribution
Downtime
[Figure: User 1 and User 2, load balancer, three web servers, one DB server]
SPOF
Reads and writes generate network + disk IO
[Figure: User 1 and User 2, load balancer, web servers, Session Store]
Pros
No SPOF
Easier to set up
Fast reads
Cons
n x writes
Increase in network IO with increase in nodes
Stale data (rare)
[Figure: User 1 and User 2, load balancer, three web servers, one DB server]
No Sessions
Stuff state in a cookie and sign it!
Cookie is sent with every request / response
Super Slim Sessions
Keep a small amount of frequently used data in the cookie
Pull the rest from the DB (or central session store)
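The "stuff state in a cookie and sign it" idea can be sketched roughly as below. This is a minimal illustration, not a production session scheme: `SECRET_KEY`, `sign_state`, and `verify_state` are hypothetical names, HMAC-SHA256 is an assumed choice of signature, and the state is signed but not encrypted, so the user can still read it.

```python
import base64
import hashlib
import hmac
import json

SECRET_KEY = b"replace-with-a-real-secret"  # hypothetical server-side secret


def sign_state(state: dict) -> str:
    """Serialize the session state and append an HMAC-SHA256 signature."""
    payload = base64.urlsafe_b64encode(json.dumps(state, sort_keys=True).encode())
    signature = hmac.new(SECRET_KEY, payload, hashlib.sha256).hexdigest()
    return payload.decode() + "." + signature


def verify_state(cookie: str):
    """Return the state if the signature matches, otherwise None."""
    payload, _, signature = cookie.rpartition(".")
    expected = hmac.new(SECRET_KEY, payload.encode(), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(signature, expected):
        return None  # tampered or malformed cookie
    return json.loads(base64.urlsafe_b64decode(payload))
```

Any web server holding the secret can verify the cookie, so no server-side session lookup is needed and the load balancer is free to send each request anywhere.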
Bad
Sticky sessions
Good
Clustered sessions for small number of nodes and / or small write volume
Central sessions for large number of nodes or large write volume
Great
No Sessions!
HTTP Accelerators / Reverse Proxy
Static content caching, redirect to lighter HTTP
Async NIO on user-side, keep-alive connection pool
CDN
Get closer to your user
Akamai, Limelight
IP Anycasting
Async NIO
App-Layer
Add more nodes and load balance!
Avoid sticky sessions
Avoid sessions!!
Data Store
Tricky! Very tricky!!!
[Figure: App Layer in front of several nodes, each holding tables T1, T2, T3, T4]
Each node has its own copy of data
Shared Nothing Cluster
Read : Write = 4:1

Scale reads at cost of writes!
Duplicate data: each node has its own copy
Master Slave
Writes sent to one node, cascaded to others
Multi-Master
Writes can be sent to multiple nodes
Can lead to deadlocks
Requires conflict management
[Figure: App Layer, one Master, four Slaves]
n x writes
Async vs. Sync
SPOF
Async: critical reads from Master!
[Figure: App Layer, two Masters, three Slaves]
n x writes
Async vs. Sync
No SPOF
Conflicts!
Asynchronous
Guaranteed, but out-of-band replication from Master to Slave
Master updates its own db and returns a response to client
Replication from Master to Slave takes place asynchronously
Faster response to a client
Slave data is marginally behind the Master
Requires modification to App to send critical reads and writes to master, and load balance all other reads
Synchronous
Guaranteed, in-band replication from Master to Slave
Master updates its own db, and confirms all slaves have updated their db before returning a response to client
Slower response to a client
Slaves have the same data as the Master at all times
Requires modification to App to send writes to master and load balance all reads
Replication at RDBMS level
Support may exist in the RDBMS or through a 3rd party tool
Faster and more reliable
App must send writes to Master, reads to any db and critical reads to Master
Replication at Driver / DAO level
Driver / DAO layer ensures
Writes are performed on all connected DBs
Reads are load balanced
Critical reads are sent to a Master
In most cases RDBMS agnostic
Slower and in some cases less reliable
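The Driver / DAO behaviour listed above can be sketched as follows. This is an illustrative toy under stated assumptions: `FakeDB` is a hypothetical stand-in for a real connection, `ReplicatingDAO` is an invented name, and round robin is just one possible way to load balance reads.

```python
import itertools


class FakeDB:
    """Hypothetical stand-in for a real database connection."""

    def __init__(self, name):
        self.name = name
        self.rows = {}

    def write(self, key, value):
        self.rows[key] = value

    def read(self, key):
        return self.rows.get(key)


class ReplicatingDAO:
    def __init__(self, master, slaves):
        self.master = master
        self.slaves = slaves
        self._next_reader = itertools.cycle([master] + slaves)

    def write(self, key, value):
        # Driver-level replication: the write is applied to every connected DB.
        for db in [self.master] + self.slaves:
            db.write(key, value)

    def read(self, key, critical=False):
        # Critical reads go to the master; all others are spread round robin.
        db = self.master if critical else next(self._next_reader)
        return db.read(key)
```

A real implementation also has to decide what happens when one of the writes fails part-way through, which is one reason this approach can be less reliable than replication inside the RDBMS.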
[Figure: Read and Write arrows into each server as servers are added]
Per server: 4R, 1W; 2R, 1W; 1R, 1W
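The per-server figures can be reproduced with a little arithmetic, assuming (this is a reading of the slide, not something it states) that they correspond to 1, 2 and 4 replicas under a 4:1 read-to-write workload. `per_server_load` is a hypothetical helper.

```python
def per_server_load(reads, writes, servers):
    """Reads are spread across replicas; every replica applies every write."""
    return reads / servers, writes


for n in (1, 2, 4):
    r, w = per_server_load(reads=4, writes=1, servers=n)
    print(f"{n} server(s): {r:g}R, {w}W per server")
```

The read load per server falls as replicas are added, but the write load does not, which is why replication scales reads at the cost of writes.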
Vertical Partitioning
Divide data on tables / columns
Scale to as many boxes as there are tables or columns
Finite
Horizontal Partitioning
Divide data on rows
Scale to as many boxes as there are rows!
Limitless scaling
[Figure: App Layer in front of one node holding T1, T2, T3, T4, T5; then App Layer in front of five nodes holding T1, T2, T3, T4 and T5 respectively]
Note: A node here typically represents a shared nothing cluster
Facebook - User table, posts table can be on separate nodes
Joins need to be done in code (Why have them?)
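A join "done in code" across two vertically partitioned tables might look like the sketch below. The row data and the name `posts_with_author` are hypothetical; in practice each collection would be fetched from its own node.

```python
# Hypothetical rows, as if fetched from two separate nodes.
users_node = {1: {"name": "alice"}, 2: {"name": "bob"}}
posts_node = [
    {"post_id": 10, "user_id": 1, "text": "hello"},
    {"post_id": 11, "user_id": 2, "text": "hi"},
    {"post_id": 12, "user_id": 1, "text": "again"},
]


def posts_with_author(posts, users):
    """Join posts to users in application code (a simple hash join)."""
    return [
        {"post_id": p["post_id"], "author": users[p["user_id"]]["name"], "text": p["text"]}
        for p in posts
        if p["user_id"] in users  # skip posts whose author is unknown
    ]
```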
[Figure: App Layer in front of three nodes; each node holds tables T1 to T5 for the first, second and third million rows respectively]
Value Based
Split on timestamp of posts
Split on first letter of user name
Hash Based
Use a hash function to determine cluster
Lookup Map
First Come First Serve
Round Robin
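The three ways of choosing a cluster can be sketched side by side. Everything here is illustrative: the cluster count, the letter boundaries, the use of MD5 as the hash, and the names `value_based`, `hash_based` and `LookupMap` are all assumptions.

```python
import bisect
import hashlib

NUM_CLUSTERS = 4                     # hypothetical cluster count
LETTER_BOUNDARIES = ["g", "n", "t"]  # a-f, g-m, n-s, t-z


def value_based(user_name: str) -> int:
    """Split on the first letter of the user name."""
    return bisect.bisect_right(LETTER_BOUNDARIES, user_name[0].lower())


def hash_based(key: str) -> int:
    """Use a hash function to determine the cluster."""
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % NUM_CLUSTERS


class LookupMap:
    """Assign each new key to a cluster round robin and remember the choice."""

    def __init__(self):
        self.assignments = {}
        self._next = 0

    def cluster_for(self, key):
        if key not in self.assignments:
            self.assignments[key] = self._next
            self._next = (self._next + 1) % NUM_CLUSTERS
        return self.assignments[key]
```

The lookup map costs an extra lookup per request but lets rows be moved between clusters by editing the map, which the value-based and hash-based schemes do not allow as easily.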
Consistency
Availability
Partition Tolerance
Source: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.20.1495
Transactions make you feel alone

No one else manipulates the data when you are
Transactional serializability
The behavior is as if a serial order exists
[Figure: Transaction Serializability. Relative to a transaction Ti, some transactions precede Ti, some follow Ti, and Ti doesn't know about the rest and they don't know about Ti.]
Source: http://blogs.msdn.com/pathelland/
Transactions live in the now inside services
Time marches forward
Transactions commit
Advancing time
Transactions see the committed transactions
A service's biz-logic lives in the now
Each Transaction Only Sees a Simple Advancing of Time with a Clear Set of Preceding Transactions
Messages contain unlocked data

Assume no shared transactions
Unlocked data may change

Unlocking it allows change
Messages are not from the now

They are from the past
There is no simultaneity at a distance!
Similar to speed of light
Knowledge travels at speed of light
By the time you see a distant object it may have changed!
By the time you see a message, the data may have changed!
Services, transactions, and locks bound simultaneity!
Inside a transaction, things appear simultaneous (to others)
Simultaneity only inside a transaction!
Simultaneity only inside a service!
Source: http://blogs.msdn.com/pathelland/
All data from distant stars is from the past
10 light years away; 10 year old knowledge
The sun may have blown up 5 minutes ago
We won't know for 3 minutes more
All data seen from a distant service is from the past
By the time you see it, it has been unlocked and may change
Each service has its own perspective
Inside data is now; outside data is past
My inside is not your inside; my outside is not your outside
This is like going from Newtonian to Einsteinian physics
Newton's time marched forward uniformly
Instant knowledge
Classic distributed computing: many systems look like one
RPC, 2-phase commit, remote method calls
In Einstein's world, everything is relative to one's perspective
Today: No attempt to blur the boundary
Source: http://blogs.msdn.com/pathelland/
Can't have the same data at many locations
Unless it is a snapshot
Changing distributed data needs versions
Creates a snapshot
[Figure: a Data Owning Service with Wednesday's Price-List; Listening Partner Services (1, 5, 7, 8) with copies of Tuesday's and Monday's Price-List]
Given what I know here and now, make a decision
Remember the versions of all the data used to make this decision
Record the decision as being predicated on these versions
Other copies of the object may make divergent decisions
Try to sort out conflicts within the family
If necessary, programmatically apologize
Very rarely, whine and fuss for human help
Subjective Consistency
Given the information I have at hand, make a decision and act on it!
Remember the information at hand!
Ambassadors Had Authority

Back before radio, it could be months between communication with the king.
Ambassadors would make treaties and much more...
They had binding authority.
The mess was sorted out later!
Source: http://blogs.msdn.com/pathelland/
Eventually, all the copies of the object share their changes
I'll show you mine if you show me yours!
Now, apply subjective consistency:
Given the information I have at hand, make a decision and act on it!
Everyone has the same information, everyone comes to the same conclusion about the decisions to take
Eventual Consistency
Given the same knowledge, produce the same result!
Everyone sharing their knowledge leads to the same result...
This is NOT magic; it is a design requirement!
Idempotence, commutativity, and associativity of the operations (decisions made) are all implied by this requirement
Source: http://blogs.msdn.com/pathelland/
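One way to meet this design requirement is to model each replica's knowledge as a set of observations and to share knowledge by set union, which is idempotent, commutative and associative. The sketch below is an illustration only; `merge`, `decide` and the `(op_id, amount)` observation format are assumptions, not part of the original talk.

```python
def merge(a: frozenset, b: frozenset) -> frozenset:
    """Combine two replicas' knowledge; set union gives the three properties."""
    return a | b


def decide(knowledge: frozenset) -> int:
    """A deterministic decision: the same knowledge always gives the same result."""
    return sum(amount for _op_id, amount in knowledge)


# Two replicas that have each seen a different subset of operations.
x = frozenset({("op1", 10), ("op2", -3)})
y = frozenset({("op2", -3), ("op3", 5)})

# Whichever order they exchange knowledge in, both reach the same decision.
print(decide(merge(x, y)), decide(merge(y, x)))
```

Because every observation carries a unique id, receiving the same observation twice changes nothing, so replicas can gossip freely without double counting.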
Normalization's Goal Is Eliminating Update Anomalies
Can Be Changed Without Funny Behavior
Each Data Item Lives in One Place
De-normalization is OK if you aren't going to update!

Emp #  Emp Name  Emp Phone  Mgr #  Mgr Name  Mgr Phone
47     Joe       5-1234     13     Sam       6-9876
18     Sally     3-3123     38     Harry     5-6782
91     Pete      2-1112     13     Sam       6-9876
66     Mary      5-7349     02     Betty     4-0101

Classic problem with de-normalization
Can't update Sam's phone # since there are many copies
Source: http://blogs.msdn.com/pathelland/
affiliations table
affiliation_id  description   member_count
42              Microsoft     18,656
598             Georgia Tech  23,488

user table
user_id  first_name  last_name  sex   hometown     relationship_status  interested_in  religious_views  political_views
12345    John        Doe        Male  Atlanta, GA  married              women          (null)           (null)

user_affiliations table
user_id (foreign_key)  affiliation_id (foreign_key)
12345                  42
12345                  598

user_phone_numbers table
user_id (foreign_key)  phone_number  phone_type
12345                  425-555-1203  Home
12345                  425-555-6161  Work
12345                  206-555-0932  Cell

user_screen_names table
user_id (foreign_key)  screen_name              im_service
12345                  geeknproud@example.com   AIM
12345                  voip4life@example.org    Skype

user_work_history table
user_id (foreign key)  company_affiliation_id (foreign_key)  company_name     job_title
12345                  42                                    Microsoft        Program Manager
12345                  78                                    i2 Technologies  Quality Assurance Engineer
6 joins for 1 query!
Do you think FB would do this?
And how would you do joins with partitioned data?
De-normalization removes joins
But increases data volume
But disk is cheap and getting cheaper
And can lead to inconsistent data
If you are lazy
However this is not really an issue
Many Kinds of Computing are Append-Only
Lots of observations are made about the world
Debits, credits, Purchase-Orders, Customer-Change-Requests, etc.
As time moves on, more observations are added
You can't change the history but you can add new observations
Derived Results May Be Calculated
Estimate of the current inventory
Frequently inaccurate
Historic Rollups Are Calculated
Monthly bank statements
Transaction Logs Are the Truth
High-performance & write-only
Describe ALL the changes to the data
Data-Base: the Current Opinion
Describes the latest value of the data as perceived by the application
The Database Is a Caching of the Transaction Log!
It is the subset of the latest committed values represented in the transaction log
[Figure: Log and DB]
Source: http://blogs.msdn.com/pathelland/
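The claim that the database is a cache of the transaction log can be illustrated by replaying an append-only log into a dictionary. This is a toy sketch: the in-memory list standing in for the log, and the names `append` and `current_opinion`, are hypothetical.

```python
def append(log, key, value):
    """Record a change; existing entries are never modified."""
    log.append((key, value))


def current_opinion(log):
    """Replay the log in order; the last committed value for each key wins."""
    db = {}
    for key, value in log:
        db[key] = value
    return db


log = []
append(log, "price", 10)
append(log, "stock", 5)
append(log, "price", 12)
print(current_opinion(log))
```

The derived dictionary can be thrown away and rebuilt at any time, while the log still describes every change that was made.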
[Figure: Data Owning Service with its Price-List; cached copies of Tuesday's and Monday's Price-List held elsewhere]
Makes scaling easier (cheaper)
Core Idea
Read data from persistent store into memory
Store in a hash-table
Read first from cache, if not, load from persistent store
[Figure: three App Servers, each with a Cache]
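The core idea (read from the cache first, fall back to the persistent store on a miss) can be sketched as below. `CachedStore` and the `load_from_store` callback are hypothetical names, and the sketch leaves out eviction and invalidation.

```python
class CachedStore:
    def __init__(self, load_from_store):
        self._load = load_from_store  # hypothetical function that hits the persistent store
        self._cache = {}              # the in-memory hash table
        self.misses = 0

    def get(self, key):
        if key not in self._cache:
            # Not in cache: load from the persistent store and remember it.
            self.misses += 1
            self._cache[key] = self._load(key)
        return self._cache[key]
```

For example, `CachedStore(lambda key: expensive_query(key))` would call the hypothetical `expensive_query` only once per key.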
In-memory Distributed Hash Table
Memcached instance manifests as a process (often on the same machine as web-server)
Memcached Client maintains a hash table
Which item is stored on which instance
Memcached Server maintains a hash table
Which item is stored in which memory location
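The two levels of hashing can be modelled in a few lines. This is not the API of any real Memcached client: `TinyCacheClient` is a hypothetical class, each "instance" is just a Python dict standing in for a server's own hash table, and CRC32 with modulo is an assumed key-to-instance mapping.

```python
import zlib


class TinyCacheClient:
    """Client-side hash: decides which instance holds a given key."""

    def __init__(self, instances):
        # Each instance is a dict, standing in for the server-side hash table.
        self.instances = instances

    def _instance_for(self, key):
        index = zlib.crc32(key.encode()) % len(self.instances)
        return self.instances[index]

    def set(self, key, value):
        self._instance_for(key)[key] = value

    def get(self, key):
        return self._instance_for(key).get(key)
```

With plain modulo hashing, adding or removing an instance remaps most keys, which is one motivation for consistent hashing.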
Amazon - S3, SimpleDb, Dynamo
Google - App Engine Datastore, BigTable
Microsoft - SQL Data Services, Azure Storages
Facebook - Cassandra
LinkedIn - Project Voldemort
Ringo, Scalaris, Kai, Dynomite, MemcacheDB, ThruDB, CouchDB, Hbase, Hypertable
Basic Concepts
No tables - Containers-Entity
No schema - each tuple has its own set of properties
Amazon SimpleDB
Strings only
Microsoft Azure SQL Data Services
Strings, blob, datetime, bool, int, double, etc.
No x-container joins as of now
Google App Engine Datastore
Strings, blob, datetime, bool, int, double, etc.
Google BigTable
Sparse, distributed, multi-dimensional sorted map
Indexed by row key, column key, timestamp
Each value is an un-interpreted array of bytes
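The BigTable data model described above can be mimicked with an ordinary dictionary keyed by (row key, column key, timestamp). This is only a mental model with hypothetical helper names (`put`, `get_latest`, `scan`); it says nothing about how BigTable is implemented or what its API looks like.

```python
table = {}  # (row_key, column_key, timestamp) -> un-interpreted bytes


def put(row, column, timestamp, value: bytes):
    table[(row, column, timestamp)] = value


def get_latest(row, column):
    """Return the value with the highest timestamp for a cell, or None."""
    versions = [(ts, v) for (r, c, ts), v in table.items() if r == row and c == column]
    return max(versions)[1] if versions else None


def scan():
    """Iterate cells in sorted key order, as a sorted map would."""
    return sorted(table.items())


put("row-b", "contents", 3, b"old")
put("row-b", "contents", 5, b"new")
put("row-a", "title", 1, b"hello")
```

The map is sparse because only the cells that were written occupy an entry; no row needs to carry every column.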
Amazon Dynamo
Data partitioned and replicated using consistent hashing
Decentralized replica sync protocol
Consistency through versioning
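Partitioning and replicating with consistent hashing can be sketched as a ring of hashed node names. This is a simplified illustration, not Dynamo's implementation: `HashRing` and `preference_list` are names chosen here, MD5 is an assumed hash, and virtual nodes are left out.

```python
import bisect
import hashlib


def _hash(value: str) -> int:
    return int(hashlib.md5(value.encode()).hexdigest(), 16)


class HashRing:
    def __init__(self, nodes, replicas=3):
        self.replicas = replicas
        # Each node sits at the ring position given by the hash of its name.
        self._ring = sorted((_hash(n), n) for n in nodes)

    def preference_list(self, key):
        """Walk clockwise from the key's position and return `replicas` nodes."""
        start = bisect.bisect(self._ring, (_hash(key), ""))
        count = min(self.replicas, len(self._ring))
        return [self._ring[(start + i) % len(self._ring)][1] for i in range(count)]
```

When a node leaves the ring, only the keys that node was responsible for move to a neighbour; every other key keeps its first node.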
Facebook Cassandra
Used for Inbox search
Open Source
Scalaris
Keys stored in lexicographical order
Improved Paxos to provide ACID
Memory resident, no persistence
Real Life Scaling requires trade offs
No Silver Bullet
Need to learn new things
Need to un-learn
Balance!