All 1-7 For Midterm (PDF)
1. The most reliable systems are built using cheap, unreliable components.
2. The techniques that Google uses to scale to billions of users follow the same patterns you can use to scale a system that handles
hundreds of users.
3. The more risky a procedure is, the more you should do it.
4. Some of the most important software features are the ones that users never see.
5. You should pick random machines and power them off.
6. The code for every feature Facebook will announce in the next six months is probably in your browser already.
7. Updating software multiple times a day requires little human effort.
8. Being oncall doesn’t have to be a stressful, painful experience.
9. You shouldn’t monitor whether machines are up.
10. Operations and management can be conducted using the scientific principles of experimentation and evidence.
11. Google has rehearsed what it would do in case of a zombie attack.
The CAP Principle
• CAP stands for consistency, availability, and partition tolerance (sometimes called partition resistance).
• The CAP Principle states that it is not possible to build a distributed system that guarantees all three at the same time: consistency, availability, and tolerance to partitions.
• Consistency means that all nodes see the same data at the same time.
• Availability is a guarantee that every request receives a response about whether it was successful or failed.
• Partition tolerance means the system continues to operate despite arbitrary message loss or failure of part of the system.
Trouble with a Naive Least Loaded Algorithm
• Without slow start, load balancers have been known to cause many problems. One famous example is what
happened to the CNN.com web site on the day of the September 11, 2001, terrorist attacks. So many people
tried to access CNN.com that the backends became overloaded. One crashed, and then crashed again after it
came back up, because the naive least loaded algorithm sent all traffic to it. While it was down, the other
backends became overloaded and crashed. One at a time, each backend would come back up, receive all
of the traffic because it reported the least load, become overloaded, and crash again.
• As a result, the service was essentially unavailable as the system administrators rushed to figure out what
was going on. In their defense, the web was new enough that no one had experience with handling sudden
traffic surges like the one encountered on September 11.
• The solution CNN used was to halt all the backends and boot them at the same time so they would all show
zero load and receive equal amounts of traffic.
• The CNN team later discovered that a few days prior, a software upgrade for their load balancer had arrived
but had not yet been installed. The upgrade added a slow start mechanism.
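To make the mechanism concrete, here is a minimal sketch (not CNN's or any vendor's actual implementation) of a least-loaded backend picker with and without a slow-start ramp. The class, the 60-second window, and the 1 percent starting capacity are illustrative assumptions.

    import time

    SLOW_START_SECONDS = 60  # assumed ramp-up window; real load balancers make this tunable

    class Backend:
        def __init__(self, name):
            self.name = name
            self.active_requests = 0       # in-flight requests the balancer has sent here
            self.started_at = time.time()  # reset whenever the backend (re)starts

    def pick_naive(backends):
        # Naive least loaded: a freshly rebooted backend reports zero load,
        # so it immediately receives all new traffic and is crushed again.
        return min(backends, key=lambda b: b.active_requests)

    def pick_with_slow_start(backends):
        # Weighted least loaded: a backend's usable capacity ramps from 1%
        # to 100% over SLOW_START_SECONDS, so a rebooted backend earns only
        # a trickle of traffic at first instead of the entire load.
        def score(b):
            age = time.time() - b.started_at
            capacity = max(min(age / SLOW_START_SECONDS, 1.0), 0.01)
            return b.active_requests / capacity
        return min(backends, key=score)

With the naive picker, the rebooted backend is the minimum by definition and absorbs the whole surge; with the ramp, its effective load stays high until it has been running for a while.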
Chapter 2. Designing for Operations
• The best strategy for providing a highly available service is to build
features into the software that enhance one’s ability to perform and
automate operational tasks.
• When we design for operations, we take into account the normal
functions of an infrastructure life cycle.
Operational Requirements
They include the following:
1. Configuration
2. Startup and shutdown
3. Queue draining
4. Software upgrades
5. Backups and restores
6. Redundancy
7. Replicated databases
8. Hot swaps
9. Toggles for individual features (see the sketch after this list)
10. Graceful degradation
11. Access controls and rate limits
12. Data import controls
13. Monitoring
14. Auditing
15. Debug instrumentation
16. Exception collection
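As a sketch of items 9 and 10 above (feature toggles and graceful degradation), here is a minimal, hypothetical example; the flag name, helper functions, and fallback behavior are assumptions for illustration, not any particular product's API.

    # Hypothetical feature-toggle sketch: a runtime flag lets operators turn off
    # an expensive feature without rolling back a release, and the code path
    # degrades gracefully instead of failing outright.
    FEATURE_FLAGS = {
        "personalized_recommendations": True,   # operations can flip this at runtime
    }

    def is_enabled(flag):
        return FEATURE_FLAGS.get(flag, False)

    def homepage_recommendations(user_id):
        if is_enabled("personalized_recommendations"):
            try:
                return fetch_personalized(user_id)   # slow call to a dependent service
            except TimeoutError:
                pass                                 # degrade rather than fail the page
        return fetch_top_sellers()                   # cheap, cached fallback

    def fetch_personalized(user_id):
        # Stand-in for a call to a recommendation service (assumed).
        raise TimeoutError("recommendation service unavailable")

    def fetch_top_sellers():
        # Stand-in for a cheap cached query (assumed).
        return ["item-1", "item-2", "item-3"]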
Implementing Design for Operations
There are four main ways to get these features into software:
1. Build them in from the beginning.
2. Request features as they are identified.
3. Write the features yourself.
4. Work with a third-party vendor.
Implementation priorities for design for
operations
CBD-3354 Distributed Systems
and Cloud Computing (Class 3)
Selection Strategies
1. Default to Virtual
2. Make a Cost-Based Decision
3. Leverage Provider Expertise
4. Get Started Quickly
5. Implement Ephemeral Computing
6. Use the Cloud for Overflow Capacity
7. Leverage Superior Infrastructure
8. Develop an In-House Service Provider
9. Contract for an On-Premises, Externally Run Service
10. Maximize Hardware Output
11. Implement a Bare Metal Cloud
Challenging Questions
1. Compare IaaS, PaaS, and SaaS on the basis of cost, configurability,
and control.
2. What warnings should you consider when adopting Software as a Service?
3. List the key advantages of virtual machines.
4. Why might you choose physical over virtual machines?
5. Which factors might make you choose private over public cloud
services?
CBD-3354 Distributed Systems and
Cloud Computing (Chapter 4)
Distributed systems must be built to be scalable from the start because growth is
expected.
The initial design must be engineered to scale to meet the requirements of the
service, but it also must include features that create options for future growth.
Once the system is in operation, we will continually optimize it to help it scale
better.
General Strategy
The basic strategy for building a scalable system is to design it with scalability in
mind from the start and to avoid design elements that will prevent additional
scaling in the future.
Once the system is running, performance limits will be discovered. This is where
the design features that enable further scaling come into play.
The additional design and coding effort that will help deal with future potential
scaling issues is lower priority than writing code to fix the immediate issues of the
day.
General Strategy
Some recommendations are:
1. Identify Bottlenecks
2. Measure Results
3. Be Proactive
Scaling Up
The simplest methodology for scaling a system is to use bigger, faster equipment.
A system that runs too slowly can be moved to a machine with a faster CPU, more
CPUs, more RAM, faster disks, faster network interfaces, and so on.
Often an existing computer can have one of those attributes improved without
replacing the entire machine.
z-Axis Scaling
z-axis scaling is similar to y-axis scaling except that it divides the data instead of the processing.
• By Hash Prefix
• By Customer Functionality
• By Utilization
• By Organizational Division
• Hierarchically
• By Arbitrary Group
Combinations
Many scaling techniques combine multiple axes of the AKF Scaling Cube.
• Dynamic Replicas
• Architectural Change
Caching
A cache is a small data store built from fast (and expensive) media, used to speed up access to a larger,
slower, cheaper data store.
1. Cache Effectiveness
2. Cache Placement
3. Cache Persistence
4. Cache Size
Cache Placement
Not all caches are found in RAM.
The cache medium simply must be faster than the main medium.
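To make cache effectiveness and placement concrete, here is a minimal sketch of a read-through LRU cache in front of a slower store; the backing lookup function, the size limit, and the hit-rate bookkeeping are assumptions for illustration.

    from collections import OrderedDict

    class ReadThroughCache:
        # Tiny LRU read-through cache. 'slow_lookup' stands in for a database
        # query or remote call and is assumed for illustration.
        def __init__(self, slow_lookup, max_entries=1024):
            self.slow_lookup = slow_lookup
            self.max_entries = max_entries
            self.entries = OrderedDict()
            self.hits = 0
            self.misses = 0

        def get(self, key):
            if key in self.entries:
                self.hits += 1
                self.entries.move_to_end(key)     # mark as most recently used
                return self.entries[key]
            self.misses += 1
            value = self.slow_lookup(key)         # fall through to the slow store
            self.entries[key] = value
            if len(self.entries) > self.max_entries:
                self.entries.popitem(last=False)  # evict the least recently used entry
            return value

        def hit_rate(self):
            total = self.hits + self.misses
            return self.hits / total if total else 0.0

A cache like this only pays off when the hit rate stays high enough that the lookups it saves outweigh the memory it consumes.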
Data Sharding
Sharding is a way to segment a database (z-axis) that is flexible, scalable, and
resilient.
It divides the database based on the hash value of the database keys.
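A minimal sketch of hash-based sharding, assuming a fixed shard count and in-memory shards so the example is self-contained (a real deployment would map each shard to its own server or replica set):

    import hashlib

    NUM_SHARDS = 8  # assumed shard count for illustration

    def shard_for(key):
        # Hashing the key spreads records evenly across shards regardless of
        # how the keys themselves are distributed.
        digest = hashlib.md5(key.encode("utf-8")).hexdigest()
        return int(digest, 16) % NUM_SHARDS

    shards = [dict() for _ in range(NUM_SHARDS)]  # stand-ins for per-server stores

    def put(key, value):
        shards[shard_for(key)][key] = value

    def get(key):
        return shards[shard_for(key)].get(key)

One design note: taking the hash modulo a fixed shard count means that changing NUM_SHARDS forces data to move between shards; schemes such as consistent hashing reduce that movement.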
Threading
There are limits to the number of threads a machine can handle, based on RAM
and CPU core limits.
Queueing
Another way to process data differently to achieve better scale is called
queueing.
A queue is a data structure that holds requests until the software is ready to
process them.
Most queues release elements in the order that they were received, called first in,
first out (FIFO) processing.
In fair queueing, the algorithm prevents a low-priority item from being “starved” by
a flood of high-priority items.
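Here is a small sketch of the queueing pattern using an in-process FIFO queue and a fixed worker pool; the worker count, task names, and handler are illustrative assumptions. A fair or priority queue would swap the data structure while keeping the same overall shape.

    import queue
    import threading

    task_queue = queue.Queue()  # thread-safe FIFO: requests wait here until a worker is free

    def handle(request):
        # Stand-in for the real processing (assumed).
        print("processed", request)

    def worker():
        while True:
            request = task_queue.get()   # blocks until a request is available
            if request is None:          # sentinel used to shut the worker down
                break
            handle(request)
            task_queue.task_done()

    # Producers enqueue work as fast as it arrives; a fixed pool of workers
    # drains the queue at the rate the machine can actually sustain.
    workers = [threading.Thread(target=worker) for _ in range(4)]
    for t in workers:
        t.start()
    for i in range(10):
        task_queue.put("request-%d" % i)
    task_queue.join()                    # wait until every queued request is handled
    for _ in workers:
        task_queue.put(None)             # stop the workers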
Queueing Variations
Variations of the queueing model can optimize performance.
Content Delivery Networks (CDNs)
CDNs have extremely large, fast connections to the internet. They have more
bandwidth to the internet than most web sites.
Resiliency
Manufacturers calculate their components' reliability and publish their mean time
between failure (MTBF) ratings.
Resiliency techniques can be grouped into four categories, by the kind of problem
they address: physical failures, attacks, human errors, and unexpected load.
Software Resiliency Beats Hardware Reliability
Relying on better hardware means special-purpose CPUs, components, and storage
systems, and it leaves the hardware systems engineer with the impossible task of
delivering hardware that never fails. We fake it by using redundant array of
independent disks (RAID) systems that let the software go on pretending that
disks never fail.
Failures are detected and those units are removed from service.
The total capacity of the system is reduced but the system is still able to run.
This means that systems must be built with spare capacity to begin with.
Failure of any one replica is detected and that replica is taken out of service
automatically.
The term N + 1 redundancy is used when we wish to indicate that there is enough
spare capacity for one failure. For example, if a workload requires three servers,
deploying four provides N + 1 redundancy; adding a fifth server would let the
system survive two simultaneous failures, described as N + 2 redundancy.
How Much Spare Capacity
Selecting the granularity of our unit of capacity (many small units versus a few
large ones) enables us to manage how efficiently the spare capacity is used.
The other factors in selecting the amount of redundancy are how quickly we can
bring up additional capacity and how likely it is that a second failure will happen
during that time.
The time it takes to repair or replace the down capacity is called the mean time to
repair (MTTR).
The rate at which failures occur is the reciprocal of the mean time between
failures. The probability that a second failure will happen during the repair
window is therefore approximately MTTR/MTBF, or MTTR/MTBF × 100 as a percentage.
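As an illustrative example with assumed figures: if a replica's MTBF is 10,000 hours and its MTTR is 10 hours, the chance that a second replica fails during the repair window is roughly 10 / 10,000 = 0.001, or 0.1 percent; the longer the repair takes, the larger that risk grows.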
How Much Spare Capacity
MTTR is a function of a number of factors.
A process that dies and needs to be restarted has a very fast MTTR.
In another approach, the primary replica receives the entire workload while a
secondary replica stands ready to take over at any time.
This is sometimes called the hot spare or “hot standby” strategy since the spare is
connected to the system, running (hot), and can be switched into operation
instantly. It is also known as an active–passive or master–slave pair.
Software Crashes
1. A regular crash occurs when the software does something prohibited by the
operating system. Due to a software bug, the program may try to write to
memory that is marked read-only by the operating system. The OS detects
this and kills the process.
2. A panic occurs when the software itself detects something is wrong and
decides the best course is to exit. The software may detect a situation that
shouldn’t exist and cannot be corrected. If internal data structures are
corrupted and there is no safe way to rectify them, it is best to stop work
immediately rather than continue with bad data. A panic is an intentional
crash.
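A minimal sketch of an intentional panic, where the invariant check, the toy ledger, and the exit message are assumptions for illustration:

    import sys

    def apply_transaction(ledger, debit, credit):
        ledger["debits"] += debit
        ledger["credits"] += credit
        # Invariant: in this toy ledger, total debits and credits must balance.
        # If they ever do not, the data is corrupt and there is no safe way to
        # continue, so the process panics (exits immediately) rather than keep
        # working with bad data.
        if ledger["debits"] != ledger["credits"]:
            sys.exit("PANIC: ledger out of balance; refusing to continue")

    ledger = {"debits": 0, "credits": 0}
    apply_transaction(ledger, 100, 100)  # balanced: fine
    apply_transaction(ledger, 50, 60)    # corrupt update: triggers the intentional crash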
Software Hangs
Sometimes when software has a problem it does not crash, but instead hangs or
gets caught in an infinite loop.
A strategy for detecting hangs is to monitor the server and detect if it has stopped
processing requests.
The monitoring system can send its own test requests. These active requests,
which are called pings, are designed to be lightweight, simply verifying basic functionality.
If pings are sent at a specific, periodic rate and are used to detect hangs as well
as crashes, they are called heartbeat requests.
Another technique for dealing with software hangs is called a watchdog timer.
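A minimal sketch of heartbeat-style hang detection; the health URL, interval, timeout, miss threshold, and restart callback are all assumptions for illustration (a real watchdog timer usually lives in a supervisor process, the kernel, or hardware):

    import time
    import urllib.request

    PING_URL = "http://localhost:8080/healthz"   # assumed lightweight health endpoint
    PING_INTERVAL = 5      # seconds between heartbeats
    PING_TIMEOUT = 2       # how long to wait before declaring the ping failed
    MAX_MISSED = 3         # consecutive failures before the server is considered hung

    def ping_once():
        try:
            with urllib.request.urlopen(PING_URL, timeout=PING_TIMEOUT) as resp:
                return resp.status == 200
        except OSError:
            return False    # connection refused, timed out, and similar failures

    def monitor(restart_server):
        missed = 0
        while True:
            if ping_once():
                missed = 0                    # heartbeat received; reset the counter
            else:
                missed += 1
                if missed >= MAX_MISSED:      # hang or crash detected
                    restart_server()
                    missed = 0
            time.sleep(PING_INTERVAL)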
Physical Failures
Distributed systems also need to be resilient when faced with physical failures.
Providing resiliency through the use of redundancy at every level is expensive and
difficult to scale.
Many components of a computer can fail. The parts whose utilization you monitor
can fail, such as the CPU, the RAM, the disks, and the network interfaces.
Supporting components can also fail, such as fans, power supplies, batteries, and
motherboards.
Clos Networking
It is reasonable to expect that eventually there will be network products on the
open market that provide non-blocking, full-speed connectivity between any two
machines in an entire datacenter. We’ve known how to do this since 1953 (Clos
1953). When this product introduction happens, it will change how we design
services.
Overload Failures
Distributed systems need to be resilient when faced with high levels of load that
can happen as the result of a temporary surge in traffic, an intentional attack, or
automated systems querying the system at a high rate, possibly for malicious
reasons.
Traffic Surges
Scraping Attacks
Chapter 7. Operations in a Distributed World
Operations is the work done to keep a system running in a way that meets or
exceeds operating parameters specified by a service level agreement (SLA).
Operations includes all aspects of a service’s life cycle: from initial launch to the
final decommissioning and everything in between.
“The rate at which organizations learn may soon become the only sustainable
source of competitive advantage.” (Peter Senge)
Change versus Stability
A system starts at a baseline of stability. A change is then made. All changes have
some kind of a destabilizing effect. Eventually the system becomes stable again,
usually through some kind of intervention. This is called the change-instability
cycle.
There is a tension between the operations team’s desire for stability and the
developers’ desire to get new code into production. There are many ways to reach
a balance. Most ways involve aligning goals by sharing responsibility for both
uptime and velocity of new features.
Operations at Scale
Operations in distributed computing is done at a large scale. Processes that have
to be done manually do not scale. Constant process improvement and automation
are essential.
Each phase of the service life cycle has unique requirements, so you'll need a
strategy for managing each phase differently.
Launches, decommissioning of services, and other tasks that are done infrequently
require an attention to detail that is best assured by the use of checklists.
Checklists ensure that lessons learned in the past are carried forward.
The stages of the life cycle
1. Service Launch: Launching a service the first time. The service is brought to
life, initial customers use it, and problems that were not discovered prior to the
launch are discovered and remedied.
2. Emergency Tasks: Handling exceptional or unexpected events. This includes
handling outages and detecting and fixing conditions that precipitate outages.
3. Nonemergency Tasks: Performing all manual work required as part of the
normally functioning system. This may include periodic (weekly or monthly)
maintenance tasks (for example, preparation for monthly billing events) as
well as processing requests from users (for example, requests to enable the
service for use by another internal service or team).
The stages of the life cycle
4. Upgrades: Deploying new software releases and hardware platforms.
Each new software release is built and tested before deployment. Tests include
system tests, done by developers, as well as user acceptance tests (UAT), done
by operations. UAT might include tests to verify there are no performance
regressions (unexpected declines in performance). Vulnerability assessments are
done to detect security issues. New hardware must go through a hardware
qualification to test for compatibility, performance regressions, and any changes in
operational processes.
The stages of the life cycle
5. Decommissioning: Turning off a service.
It is the opposite of a service launch: removing the remaining users, turning off the
service, removing references to the service from any related service
configurations, giving back any resources, archiving old data, and erasing or
scrubbing data from any hardware before it is repurposed, sold, or disposed of.
1. Removal of users
2. Deallocation of resources
3. Disposal of resources
Case Study: Self-Service Launches at Google
Google launches so many services that it needed a streamlined launch process
that teams could initiate on their own. In addition to providing APIs and portals
for the technical parts, the Launch Readiness Review (LRR) made the launch
process itself self-service.
The LRR included a checklist and instructions on how to achieve each item.
Some checklist items were technical, for example, making sure that the Google
load balancing system was used properly.
Case Study: Self-Service Launches at Google
Other items were cautionary, to prevent a launch team from repeating other
teams’ past mistakes. For example, one team had a failed launch because it
received 10 times more users than expected. There was no plan for how to handle
this situation. The LRR checklist required teams to create a plan to handle this
situation and demonstrate that it had been tested ahead of time.
Other checklist items were business related. Marketing, legal, and other
departments were required to sign off on the launch. Each department had its own
checklist.
Other things to consider
The most productive use of time for operational staff is time spent automating and
optimizing processes. This should be their primary responsibility.
When team members take turns covering routine operational responsibilities,
those tasks receive the dedicated attention required to be done correctly, the
responsibility is shared across the team, and people avoid burning out.
Virtual Office
Operations teams generally work far from the actual machines that run their
services. Since they operate the service remotely, they can work from anywhere
there is a network connection.
Teams often work from different places, collaborating and communicating in a chat
room or other virtual office.
Many tools are available to enable this type of organizational structure. It becomes
important to change the communication medium based on the type of
communication required. Chat rooms are sufficient for general communication but
voice and video are more appropriate for more intense discussions. Email is more
appropriate when a record of the communication is required, or if it is important to
reach people who are not currently online.