Vineet Gupta - GM - Software Engineering - Directi: Intelligent People. Uncommon Ideas


Intelligent People. Uncommon Ideas.

Vineet Gupta | GM Software Engineering | Directi http://vineetgupta.spaces.live.com


Licensed under Creative Commons Attribution Sharealike Noncommercial

Characteristics App Tier Scaling Replication Partitioning Consistency Normalization Caching Data Engine Types

Offline Processing (Batching / Queuing) Distributed Processing Map Reduce Non-blocking IO Fault Detection, Tolerance and Recovery


22M+ users
Dozens of DB servers
Dozens of Web servers
Six specialized graph database servers to run the recommendations engine

Source: http://highscalability.com/digg-architecture

1 TB / day
100 M blogs indexed / day
10 B objects indexed / day
0.5 B photos and videos
Data doubles in 6 months
Users double in 6 months

Source: http://www.royans.net/arch/2007/10/25/scaling-technorati-100-millionblogs-indexed-everyday/

2 PB raw storage
470 M photos, 4-5 sizes each
400 k photos added / day
35 M photos in Squid cache (total)
2 M photos in Squid RAM
38k reqs / sec to Memcached
4 B queries / day

Source: http://mysqldba.blogspot.com/2008/04/mysql-uc-2007-presentation-file.html

Virtualized database spans 600 production instances residing in 100+ server clusters distributed over 8 datacenters
2 PB of data
26 B SQL queries / day
1 B page views / day
3 B API calls / month
15,000 App servers
Source: http://highscalability.com/ebay-architecture/

450,000 low cost commodity servers in 2006
Indexed 8 B web-pages in 2005
200 GFS clusters (1 cluster = 1,000 to 5,000 machines)
Read / write throughput = 40 GB / sec across a cluster
Map-Reduce
100k jobs / day
20 PB of data processed / day
10k MapReduce programs
Source: http://highscalability.com/google-architecture/

Data size: ~ PB
Data growth: ~ TB / day
Number of servers: 10s to 10,000
Number of datacenters: 1 to 10
Queries: B+ / day
Specialized needs: more than / other than an RDBMS


[Diagram: a single Host running the App Server and the DB Server, with CPU and RAM added to that host]

Sunfire E20k 36x 1.8GHz processors $450,000 - $2,500,000

PowerEdge SC1435 Dualcore 1.8 GHz processor Around $1,500

Increasing the hardware resources on a host

Pros
Simple to implement
Fast turnaround time

Cons
Finite limit
Hardware does not scale linearly (diminishing returns for each incremental unit)
Requires downtime
Increases downtime impact
Incremental costs increase exponentially

[Diagram: the App Server and the DB Server, each on its own Host]

Split services on separate nodes
Each node performs different tasks

Pros
Increases per-application availability
Task-based specialization, optimization and tuning possible
Reduces context switching
Simple to implement for out-of-band processes
No changes to App required
Flexibility increases

Cons
Sub-optimal resource utilization
May not increase overall availability
Finite scalability

[Diagram: a Load Balancer in front of three Web Servers that share one DB Server]

Add more nodes for the same service
Identical, doing the same task

Load Balancing
Hardware balancers are faster
Software balancers are more customizable
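To make "identical nodes behind a load balancer" concrete, here is a minimal sketch of one possible software balancing policy, round robin. The class name `RoundRobinBalancer` and the backend names are hypothetical, and the slides do not prescribe a particular policy.

```python
from itertools import cycle


class RoundRobinBalancer:
    """Hands out identical backends in a fixed rotation (illustrative only)."""

    def __init__(self, backends):
        if not backends:
            raise ValueError("need at least one backend")
        self._rotation = cycle(list(backends))

    def pick(self):
        # Every call returns the next backend; after the last one it wraps around.
        return next(self._rotation)


balancer = RoundRobinBalancer(["web1", "web2", "web3"])  # hypothetical node names
picks = [balancer.pick() for _ in range(6)]
```

Because the nodes are identical, any of them can serve any request, which is what lets the balancer ignore who the user is.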

[Diagram: User 1 and User 2 reach three Web Servers through a Load Balancer; the Web Servers share one DB Server]

[Diagram: User 1 and User 2 reach three Web Servers through a Load Balancer; the Web Servers share one DB Server]
Asymmetrical load distribution
Downtime

[Diagram: User 1 and User 2 reach three Web Servers through a Load Balancer; the Web Servers share one Session Store]
SPOF
Reads and writes generate network + disk IO

[Diagram: User 1 and User 2 reach three Web Servers through a Load Balancer]

Pros
No SPOF
Easier to set up
Fast reads

Cons
n x writes
Increase in network IO with increase in nodes
Stale data (rare)

[Diagram: User 1 and User 2 reach three Web Servers through a Load Balancer; the Web Servers share one DB Server]

No Sessions
Stuff state in a cookie and sign it!
Cookie is sent with every request / response

Super Slim Sessions
Keep a small amount of frequently used data in the cookie
Pull the rest from the DB (or central session store)
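"Stuff state in a cookie and sign it" can be sketched with an HMAC signature, so that any web server holding the shared secret can verify the cookie without a session store. This is only an illustration: `SECRET_KEY`, `encode_cookie` and `decode_cookie` are hypothetical names, and the state is signed but not encrypted, so it stays readable by the user.

```python
import base64
import hashlib
import hmac
import json

# Assumption: every web server is configured with the same secret.
SECRET_KEY = b"replace-with-a-real-secret"


def encode_cookie(state: dict) -> str:
    """Serialize the state and append an HMAC-SHA256 signature."""
    payload = base64.urlsafe_b64encode(json.dumps(state, sort_keys=True).encode("utf-8"))
    signature = hmac.new(SECRET_KEY, payload, hashlib.sha256).hexdigest()
    return payload.decode("ascii") + "." + signature


def decode_cookie(cookie: str):
    """Return the state if the signature matches, otherwise None."""
    payload, _, signature = cookie.rpartition(".")
    expected = hmac.new(SECRET_KEY, payload.encode("ascii"), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(signature, expected):
        return None  # tampered with, or signed with a different key
    return json.loads(base64.urlsafe_b64decode(payload.encode("ascii")))


cookie = encode_cookie({"user_id": 42, "theme": "dark"})
```

For the "super slim" variant, the cookie would carry only the few hot fields (for example the user id) and the rest would still be read from the DB or the central session store.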

Bad
Sticky sessions

Good
Clustered sessions for small number of nodes and / or small write volume
Central sessions for large number of nodes or large write volume

Great
No Sessions!

HTTP Accelerators / Reverse Proxy
Static content caching, redirect to lighter HTTP
Async NIO on user-side, keep-alive connection pool

CDN
Get closer to your user
Akamai, Limelight

IP Anycasting
Async NIO

App-Layer
Add more nodes and load balance!
Avoid sticky sessions
Avoid sessions!!

Data Store
Tricky! Very tricky!!!


[Diagram: an App Layer over one node holding T1, T2, T3, T4; then an App Layer over five nodes, each holding T1, T2, T3, T4]

Each node has its own copy of data
Shared Nothing Cluster

Read : Write = 4:1
Scale reads at cost of writes!

Duplicate data: each node has its own copy

Master Slave
Writes sent to one node, cascaded to others

Multi-Master
Writes can be sent to multiple nodes
Can lead to deadlocks
Requires conflict management

[Diagram: an App Layer over one Master and four Slaves]
n x writes
Async vs. Sync
SPOF
Async: critical reads from Master!

[Diagram: an App Layer over two Masters and three Slaves]
n x writes
Async vs. Sync
No SPOF
Conflicts!

Asynchronous
Guaranteed, but out-of-band replication from Master to Slave
Master updates its own db and returns a response to client
Replication from Master to Slave takes place asynchronously
Faster response to a client
Slave data is marginally behind the Master
Requires modification to App to send critical reads and writes to master, and load balance all other reads

Synchronous
Guaranteed, in-band replication from Master to Slave
Master updates its own db, and confirms all slaves have updated their db before returning a response to client
Slower response to a client
Slaves have the same data as the Master at all times
Requires modification to App to send writes to master and load balance all reads
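The App modification described for asynchronous replication could look roughly like the router below: writes and critical reads go to the master, all other reads are spread over the slaves. `ReplicatedRouter` and the node names are hypothetical, and round robin is just one assumed way to load balance the reads.

```python
import itertools


class ReplicatedRouter:
    """Chooses a database node for each operation (illustrative sketch)."""

    def __init__(self, master, slaves):
        self.master = master
        # With no slaves configured, every read falls back to the master.
        self._slaves = itertools.cycle(list(slaves) or [master])

    def route(self, operation, critical=False):
        # Slaves may lag behind under async replication, so anything that
        # must see the latest data is sent to the master.
        if operation == "write" or critical:
            return self.master
        return next(self._slaves)


router = ReplicatedRouter("master", ["slave1", "slave2"])
routes = [
    router.route("write"),
    router.route("read", critical=True),
    router.route("read"),
    router.route("read"),
    router.route("read"),
]
```

Under synchronous replication the `critical` flag would be unnecessary, since the slaves hold the same data as the master at all times.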

Replication at RDBMS level
Support may exist in the RDBMS or through a 3rd party tool
Faster and more reliable
App must send writes to Master, reads to any db and critical reads to Master

Replication at Driver / DAO level
Driver / DAO layer ensures
writes are performed on all connected DBs
reads are load balanced
critical reads are sent to a Master
In most cases RDBMS agnostic
Slower and in some cases less reliable

[Diagram: reads and writes arriving at replicated servers; every server receives both reads and writes]
Per Server: 4R, 1W | 2R, 1W | 1R, 1W
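A rough worked example of the per-server figures, assuming they describe the 4:1 read:write mix spread over 1, 2 and 4 replicas (the replica counts are an assumption; the slide only gives the per-server numbers). Reads are divided among the replicas, but every replica must still apply every write.

```python
def per_server_load(reads, writes, replicas):
    """Return (reads, writes) handled by each replica."""
    # Reads can be load balanced; writes are applied on every copy.
    return reads / replicas, writes


loads = {n: per_server_load(reads=4, writes=1, replicas=n) for n in (1, 2, 4)}
```

The write column never shrinks, which is why replication alone scales reads at the cost of writes and why partitioning is needed once writes dominate.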


Vertical Partitioning
Divide data on tables / columns
Scale to as many boxes as there are tables or columns
Finite

Horizontal Partitioning
Divide data on rows
Scale to as many boxes as there are rows!
Limitless scaling

[Diagram: an App Layer over one node holding T1, T2, T3, T4, T5; then an App Layer over five nodes holding T1, T2, T3, T4 and T5 respectively]
Note: A node here typically represents a shared nothing cluster

Facebook - User table, posts table can be on separate nodes
Joins need to be done in code (Why have them?)
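A minimal sketch of "joins done in code" when the user table and the posts table live on separate nodes. The two dictionaries stand in for the two database nodes; all names and rows are hypothetical.

```python
# Hypothetical in-memory stand-ins for two separate database nodes.
USER_NODE = {
    1: {"user_id": 1, "name": "alice"},
    2: {"user_id": 2, "name": "bob"},
}
POST_NODE = [
    {"post_id": 10, "user_id": 1, "text": "hello"},
    {"post_id": 11, "user_id": 2, "text": "world"},
    {"post_id": 12, "user_id": 1, "text": "again"},
]


def fetch_posts():
    """Query 1: goes to the node that holds the posts table."""
    return list(POST_NODE)


def fetch_users(user_ids):
    """Query 2: goes to the node that holds the user table."""
    return {uid: USER_NODE[uid] for uid in user_ids if uid in USER_NODE}


def posts_with_authors():
    """Join the two result sets in application code instead of in SQL."""
    posts = fetch_posts()
    users = fetch_users({post["user_id"] for post in posts})
    return [
        {"text": post["text"], "author": users[post["user_id"]]["name"]}
        for post in posts
        if post["user_id"] in users
    ]


joined = posts_with_authors()
```

The cost is two round trips instead of one query, in exchange for letting each table scale on its own node.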

[Diagram: an App Layer over three nodes, each holding tables T1 to T5; one node holds the first million rows, one the second million rows, one the third million rows]

Value Based
Split on timestamp of posts
Split on first alphabet of user name

Hash Based
Use a hash function to determine cluster

Lookup Map
First Come First Serve
Round Robin
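The three ways of choosing a cluster for a row can be sketched as follows. Function and class names are hypothetical, the shard count is arbitrary, and the lookup map here assigns new keys round robin as one of the two policies the slide mentions.

```python
import hashlib
import itertools


def shard_by_first_letter(user_name, num_shards):
    """Value based: split the alphabet a..z into contiguous ranges."""
    index = ord(user_name[0].lower()) - ord("a")
    index = min(max(index, 0), 25)  # clamp anything outside a..z
    return index * num_shards // 26


def shard_by_hash(key, num_shards):
    """Hash based: a stable hash of the key picks the cluster."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


class LookupMap:
    """Lookup map: remember an explicit key -> cluster assignment."""

    def __init__(self, num_shards):
        self._next_shard = itertools.cycle(range(num_shards))
        self._assignments = {}

    def shard_for(self, key):
        if key not in self._assignments:
            # New keys are placed round robin; existing keys never move.
            self._assignments[key] = next(self._next_shard)
        return self._assignments[key]


lookup = LookupMap(num_shards=2)
placements = [lookup.shard_for(k) for k in ["alice", "bob", "alice", "carol"]]
```

Value-based splits are easy to reason about but can be uneven; hashing spreads keys evenly but makes range queries harder; a lookup map is the most flexible but adds a lookup on every access.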


Consistency

Availability

Partition Tolerance

Source: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.20.1495

Transactions make you feel alone
No one else manipulates the data when you are

Transactional serializability
The behavior is as if a serial order exists

[Diagram: Transaction Serializability. Some transactions precede Ti and some follow Ti; Ti doesn't know about the remaining transactions and they don't know about Ti]



Source: http://blogs.msdn.com/pathelland/

Transactions live in the now inside services
Time marches forward
Transactions commit
Advancing time
Transactions see the committed transactions

A service's biz-logic lives in the now

Each transaction only sees a simple advancing of time with a clear set of preceding transactions

Source: http://blogs.msdn.com/pathelland/


Messages contain unlocked data
Assume no shared transactions

Unlocked data may change
Unlocking it allows change

Messages are not from the now
They are from the past

There is no simultaneity at a distance!
Similar to speed of light
Knowledge travels at speed of light
By the time you see a distant object it may have changed!
By the time you see a message, the data may have changed!

Services, transactions, and locks bound simultaneity!
Inside a transaction, things appear simultaneous (to others)
Simultaneity only inside a transaction!
Simultaneity only inside a service!
Source: http://blogs.msdn.com/pathelland/

All data from distant stars is from the past
10 light years away; 10 year old knowledge
The sun may have blown up 5 minutes ago
We won't know for 3 minutes more

All data seen from a distant service is from the past
By the time you see it, it has been unlocked and may change

Each service has its own perspective
Inside data is now; outside data is past
My inside is not your inside; my outside is not your outside

This is like going from Newtonian to Einsteinian physics
Newton's time marched forward uniformly
Instant knowledge
Classic distributed computing: many systems look like one
RPC, 2-phase commit, remote method calls
In Einstein's world, everything is relative to one's perspective
Today: No attempt to blur the boundary
Source: http://blogs.msdn.com/pathelland/

Can't have the same data at many locations
Unless it is a snapshot
Changing distributed data needs versions
Creates a snapshot

[Diagram: a Data Owning Service publishes versioned Price-Lists; Listening Partner Services 1, 5, 7 and 8 each hold copies of Monday's, Tuesday's or Wednesday's Price-List]

Source: http://blogs.msdn.com/pathelland/

Given what I know here and now, make a decision
Remember the versions of all the data used to make this decision
Record the decision as being predicated on these versions
Other copies of the object may make divergent decisions
Try to sort out conflicts within the family
If necessary, programmatically apologize
Very rarely, whine and fuss for human help

Subjective Consistency
Given the information I have at hand, make a decision and act on it!
Remember the information at hand!

Ambassadors Had Authority
Back before radio, it could be months between communication with the king.
Ambassadors would make treaties and much more...
They had binding authority.
The mess was sorted out later!
Source: http://blogs.msdn.com/pathelland/

Eventually, all the copies of the object share their changes
I'll show you mine if you show me yours!
Now, apply subjective consistency:
Given the information I have at hand, make a decision and act on it!
Everyone has the same information, everyone comes to the same conclusion about the decisions to take

Eventual Consistency
Given the same knowledge, produce the same result!
Everyone sharing their knowledge leads to the same result...

This is NOT magic; it is a design requirement!
Idempotence, commutativity, and associativity of the operations (decisions made) are all implied by this requirement
Source: http://blogs.msdn.com/pathelland/
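One way to see why idempotence, commutativity and associativity matter is a merge function that has all three properties. The sketch below uses a grow-only counter (each replica tracks a count per node and merging takes the per-node maximum). This structure is an illustration chosen here, not something the slides prescribe, and the replica contents are made up.

```python
def merge(a: dict, b: dict) -> dict:
    """Merge two replicas' per-node counts by taking the maximum per node."""
    return {node: max(a.get(node, 0), b.get(node, 0)) for node in a.keys() | b.keys()}


def total(replica: dict) -> int:
    """The value every replica agrees on once it has seen the same knowledge."""
    return sum(replica.values())


# Three replicas that have each seen different updates.
r1 = {"n1": 3}
r2 = {"n2": 5}
r3 = {"n1": 1, "n3": 2}

# Replicas exchange knowledge in different orders...
view_a = merge(merge(r1, r2), r3)
view_b = merge(r3, merge(r2, r1))
# ...and receiving the same message twice changes nothing.
view_c = merge(view_a, r2)
```

Because the order and repetition of merges do not matter, every replica that eventually hears from every other replica ends up with the same state, which is the design requirement the slide describes.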


Normalization's Goal Is Eliminating Update Anomalies
Can Be Changed Without Funny Behavior
Each Data Item Lives in One Place
De-normalization is OK if you aren't going to update!

Emp #  Emp Name  Emp Phone  Mgr #  Mgr Name  Mgr Phone
47     Joe       5-1234     13     Sam       6-9876
18     Sally     3-3123     38     Harry     5-6782
91     Pete      2-1112     13     Sam       6-9876
66     Mary      5-7349     02     Betty     4-0101

Classic problem with de-normalization
Can't update Sam's phone # since there are many copies
Source: http://blogs.msdn.com/pathelland/

affiliations table
affiliation_id  description   member_count
42              Microsoft     18,656
598             Georgia Tech  23,488

user_affiliations table
user_id (foreign_key)  affiliation_id (foreign_key)
12345                  42
12345                  598

user table
user_id  first_name  last_name  sex   hometown     relationship_status  interested_in  religious_views  political_views
12345    John        Doe        Male  Atlanta, GA  married              women          (null)           (null)

user_phone_numbers table
user_id (foreign_key)  phone_number  phone_type
12345                  425-555-1203  Home
12345                  425-555-6161  Work
12345                  206-555-0932  Cell

user_screen_names table
user_id (foreign_key)  screen_name             im_service
12345                  geeknproud@example.com  AIM
12345                  voip4life@example.org   Skype

user_work_history table
user_id (foreign_key)  company_affiliation_id (foreign_key)  company_name     job_title
12345                  42                                    Microsoft        Program Manager
12345                  78                                    i2 Technologies  Quality Assurance Engineer

6 joins for 1 query!


Do you think FB would do this? And how would you do joins with partitioned data?

De-normalization removes joins

But increases data volume
But disk is cheap and getting cheaper

And can lead to inconsistent data
If you are lazy
However this is not really an issue

Many Kinds of Computing are Append-Only
Lots of observations are made about the world
Debits, credits, Purchase-Orders, Customer-Change-Requests, etc
As time moves on, more observations are added
You can't change the history but you can add new observations

Derived Results May Be Calculated
Estimate of the current inventory
Frequently inaccurate

Historic Rollups Are Calculated
Monthly bank statements

Transaction Logs Are the Truth
High-performance & write-only
Describe ALL the changes to the data

Data-Base: the Current Opinion
Describes the latest value of the data as perceived by the application

The Database Is a Caching of the Transaction Log!
It is the subset of the latest committed values represented in the transaction log
Source: http://blogs.msdn.com/pathelland/
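The idea that the database is a cache of the transaction log can be illustrated by replaying an append-only log to rebuild the current values. The log format, the operation names and the sample entries are assumptions made for this sketch.

```python
def replay(log):
    """Rebuild the 'current opinion' by applying every logged change in order."""
    state = {}
    for op, key, value in log:
        if op == "set":
            state[key] = value
        elif op == "delete":
            state.pop(key, None)
        else:
            raise ValueError(f"unknown operation: {op}")
    return state


# The log is append-only: history is never edited, new observations are added.
log = [
    ("set", "price:widget", 10),
    ("set", "price:gadget", 25),
    ("set", "price:widget", 12),
    ("delete", "price:gadget", None),
]
current = replay(log)
yesterday = replay(log[:2])  # any prefix of the log gives an older snapshot
```

The dictionary holds only the latest committed values; the log still describes all the changes that led to them.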



Makes scaling easier (cheaper)

Core Idea
Read data from persistent store into memory
Store in a hash-table
Read first from cache, if not, load from persistent store
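The core idea can be sketched in a few lines: look in an in-memory hash table first and fall back to the persistent store on a miss. `ReadThroughCache` and the `STORE` dictionary are hypothetical; the dictionary stands in for a database, and eviction and invalidation are deliberately left out.

```python
class ReadThroughCache:
    """Read from memory first; on a miss, load from the persistent store."""

    def __init__(self, load_from_store):
        self._load = load_from_store  # function: key -> value
        self._table = {}              # the in-memory hash table
        self.misses = 0

    def get(self, key):
        if key in self._table:
            return self._table[key]
        self.misses += 1
        value = self._load(key)       # slow path: hit the persistent store
        self._table[key] = value
        return value


STORE = {"user:1": "alice", "user:2": "bob"}  # stand-in for the database
cache = ReadThroughCache(STORE.__getitem__)
first = cache.get("user:1")   # miss: loaded from the store
second = cache.get("user:1")  # hit: served from memory
```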

[Diagram: three App Servers, each paired with a Cache]

In-memory Distributed Hash Table
Memcached instance manifests as a process (often on the same machine as web-server)

Memcached Client maintains a hash table
Which item is stored on which instance

Memcached Server maintains a hash table
Which item is stored in which memory location
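The client-side half of this two-level hashing can be sketched as a function from key to instance. The server addresses are hypothetical, and hashing with CRC32 modulo the server count is only one simple scheme assumed here; real client libraries differ in how they map keys to instances.

```python
import zlib

# Hypothetical list of Memcached instances known to the client.
SERVERS = ["cache1:11211", "cache2:11211", "cache3:11211"]


def server_for(key, servers=SERVERS):
    """Client side: decide which instance holds this key."""
    return servers[zlib.crc32(key.encode("utf-8")) % len(servers)]


# Every client that uses the same server list and hash picks the same instance,
# so no coordination between clients or between servers is needed.
chosen = server_for("user:42")
```

The chosen instance then uses its own internal hash table to find the memory location of the item, which is the server-side half.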


Amazon - S3, SimpleDB, Dynamo
Google - App Engine Datastore, BigTable
Microsoft - SQL Data Services, Azure Storages
Facebook - Cassandra
LinkedIn - Project Voldemort
Ringo, Scalaris, Kai, Dynomite, MemcacheDB, ThruDB, CouchDB, HBase, Hypertable

Basic Concepts
No tables - Containers-Entity
No schema - each tuple has its own set of properties

Amazon SimpleDB
Strings only

Microsoft Azure SQL Data Services
Strings, blob, datetime, bool, int, double, etc.
No x-container joins as of now

Google App Engine Datastore
Strings, blob, datetime, bool, int, double, etc.

Google BigTable
Sparse, distributed, multi-dimensional sorted map
Indexed by row key, column key, timestamp
Each value is an un-interpreted array of bytes
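A toy, single-process model of that data model: a map keyed by (row key, column key, timestamp) whose values are raw bytes. `SparseSortedMap` and the sample keys are hypothetical, and nothing here reflects how BigTable distributes or stores the data.

```python
class SparseSortedMap:
    """(row key, column key, timestamp) -> bytes, scanned in sorted key order."""

    def __init__(self):
        self._cells = {}  # only cells that were written exist: the map is sparse

    def put(self, row_key, column_key, timestamp, value: bytes):
        self._cells[(row_key, column_key, timestamp)] = value

    def get(self, row_key, column_key):
        """Return the value with the highest timestamp, or None if the cell is empty."""
        versions = [
            (ts, value)
            for (row, col, ts), value in self._cells.items()
            if row == row_key and col == column_key
        ]
        if not versions:
            return None
        return max(versions, key=lambda version: version[0])[1]

    def scan(self):
        """Yield cells ordered by row key, then column key, then timestamp."""
        for key in sorted(self._cells):
            yield key, self._cells[key]


table = SparseSortedMap()
table.put("row2", "contents", 1, b"x")
table.put("row1", "contents", 1, b"old")
table.put("row1", "contents", 2, b"new")
```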

Amazon Dynamo
Data partitioned and replicated using consistent hashing
Decentralized replica sync protocol
Consistency through versioning
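A minimal sketch of partitioning and replicating with consistent hashing: nodes are placed at many points on a hash ring, and a key is stored on the first distinct nodes found walking clockwise from the key's position. `HashRing`, the node names and the number of virtual points are assumptions for illustration and do not describe Dynamo's actual implementation.

```python
import bisect
import hashlib


def _hash(value: str) -> int:
    return int.from_bytes(hashlib.sha256(value.encode("utf-8")).digest()[:8], "big")


class HashRing:
    def __init__(self, nodes, vnodes=64):
        # Each node owns several points on the ring to even out the load.
        self._ring = sorted(
            (_hash(f"{node}#{i}"), node) for node in nodes for i in range(vnodes)
        )
        self._points = [point for point, _ in self._ring]

    def preference_list(self, key, replicas=2):
        """Walk clockwise from the key's position, collecting distinct nodes."""
        start = bisect.bisect_right(self._points, _hash(key))
        chosen = []
        for offset in range(len(self._ring)):
            node = self._ring[(start + offset) % len(self._ring)][1]
            if node not in chosen:
                chosen.append(node)
            if len(chosen) == replicas:
                break
        return chosen


ring = HashRing(["a", "b", "c"])
smaller_ring = HashRing(["a", "b"])  # the same ring after node "c" leaves
keys = [f"key{i}" for i in range(200)]
```

The property that makes this attractive for partitioning: when a node leaves, only the keys it owned move; every other key keeps its primary node.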

Facebook Cassandra
Used for Inbox search
Open Source

Scalaris
Keys stored in lexicographical order
Improved Paxos to provide ACID
Memory resident, no persistence

Real Life Scaling requires trade-offs
No Silver Bullet
Need to learn new things
Need to un-learn
Balance!
