
CBD-3354 Distributed Systems

and Cloud Computing (Class 2)

Cloud Computing for Big Data


Lambton College in Toronto
Hugo Bosch
[email protected]
Business Objectives
Well-defined business objectives are measurable, and such
measurements can be collected in an automated fashion.
Some sample business objectives are:
1. Sell our products via a web site
2. Provide service 99.99 percent of the time
3. Process x million purchases per month, growing 10 percent monthly
4. Introduce new features twice a week
5. Fix major bugs within 24 hours
Design: Building a Cloud-scale Service
• The architecture includes redundancy and resiliency features that
work around failures.
• Components fail but the system survives.
• All subsystems are programmable via an application programming
interface (API).
• The system is based on a service-oriented architecture (SOA).
• All of the services can be independently scaled, upgraded, or
replaced.
Chapter 1. Designing in a Distributed World
• How does Google Search work?
• How does your Facebook Timeline stay updated around the clock?
• How does Amazon scan an ever-growing catalog of items to tell you
that people who bought this item also bought socks?
• Distributed computing is the art of building large systems that divide
the work over many machines.
Chapter 1. Designing in a Distributed World
• Server: Software that provides a function or application program interface (API).
(Not a piece of hardware.)
• Service: A user-visible system or product composed of many servers.
• Machine: A virtual or physical machine.
• QPS: Queries per second. Usually how many web hits or API calls received per
second.
• Traffic: A generic term for queries, API calls, or other requests sent to a server.
• Performant: A system whose performance conforms to (meets or exceeds) the
design requirements. A neologism from merging “performance” and
“conformant.”
• Application Programming Interface (API): A protocol that governs how one server
talks to another.
Practice 1
Which of the following statements are true?

1. The most reliable systems are built using cheap, unreliable components.
2. The techniques that Google uses to scale to billions of users follow the same patterns you can use to scale a system that handles
hundreds of users.
3. The more risky a procedure is, the more you should do it.
4. Some of the most important software features are the ones that users never see.
5. You should pick random machines and power them off.
6. The code for every feature Facebook will announce in the next six months is probably in your browser already.
7. Updating software multiple times a day requires little human effort.
8. Being oncall doesn’t have to be a stressful, painful experience.
9. You shouldn’t monitor whether machines are up.
10. Operations and management can be conducted using the scientific principles of experimentation and evidence.
11. Google has rehearsed what it would do in case of a zombie attack.
The CAP Principle
• CAP stands for consistency, availability, and partition resistance.
• The CAP Principle states that it is not possible to build a distributed
system that simultaneously guarantees consistency, availability, and
resistance to partitioning; at most two of the three can be guaranteed at
once.
• Consistency means that all nodes see the same data at the same
time.
• Availability is a guarantee that every request receives a response
about whether it was successful or failed.
• Partition resistance means the system continues to operate despite
arbitrary message loss or failure of part of the system.
The CAP Principle
Trouble with a Naive Least Loaded Algorithm
• Without slow start, load balancers have been known to cause many problems. One famous example is what
happened to the CNN.com web site on the day of the September 11, 2001, terrorist attacks. So many people
tried to access CNN.com that the backends became overloaded. One crashed, and then crashed again after it
came back up, because the naive least loaded algorithm sent all traffic to it. When it was down, the other
backends became overloaded and crashed. One at a time, each backend would get overloaded, crash, come
back up, receive all the traffic again, and crash once more.

• As a result the service was essentially unavailable as the system administrators rushed to figure out what
was going on. In their defense, the web was new enough that no one had experience with handling sudden
traffic surges like the one encountered on September 11.

• The solution CNN used was to halt all the backends and boot them at the same time so they would all show
zero load and receive equal amounts of traffic.

• The CNN team later discovered that a few days prior, a software upgrade for their load balancer had arrived
but had not yet been installed. The upgrade added a slow start mechanism.
Chapter 2. Designing for Operations
• The best strategy for providing a highly available service is to build
features into the software that enhance one’s ability to perform and
automate operational tasks.
• When we design for operations, we take into account the normal
functions of an infrastructure life cycle.
Operational Requirements
They include the following:
1. Configuration
2. Startup and shutdown
3. Queue draining
4. Software upgrades
5. Backups and restores
6. Redundancy
7. Replicated databases
8. Hot swaps
9. Toggles for individual features
10. Graceful degradation
11. Access controls and rate limits
12. Data import controls
13. Monitoring
14. Auditing
15. Debug instrumentation
16. Exception collection
Implementing Design for Operations
There are four main ways that you can get features into software:
1. Build them in from the beginning.
2. Request features as they are identified.
3. Write the features yourself.
4. Work with a third-party vendor.
Implementation priorities for design for
operations
CBD-3354 Distributed Systems
and Cloud Computing (Class 3)

Cloud Computing for Big Data


Lambton College in Toronto
Hugo Bosch
[email protected]
Chapter 2. Designing for Operations
• Designing for operations means making sure all the normal operational
functions can be done well. Normal operational functions include tasks
such as periodic maintenance, updates, and monitoring.
• If a service provides an API, that API should include an Access Control
List (ACL) mechanism that determines which users are permitted or
denied access, and also determines rate-limiting settings (a minimal
sketch of such a check follows below).
• An example of access controls is a public key infrastructure (PKI) that
uses digital certificates to prove identity, as used, for example, by
clients such as PuTTY.
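
A minimal sketch of what such an ACL plus rate-limit check might look like. All of the
names here (ALLOWED_USERS, MAX_QPS, check_access) are hypothetical and for
illustration only; they are not from the text or any particular product.

import time
from collections import defaultdict, deque

# Hypothetical ACL: which callers may use this API at all.
ALLOWED_USERS = {"billing-service", "frontend", "ops-dashboard"}
MAX_QPS = 5  # hypothetical per-user rate limit (requests per second)

_recent = defaultdict(deque)  # user -> timestamps of recent requests

def check_access(user):
    """Return (allowed, reason). Deny unknown users and users over their rate limit."""
    if user not in ALLOWED_USERS:
        return False, "denied by ACL"
    now = time.time()
    window = _recent[user]
    # Drop timestamps older than one second, then apply the rate limit.
    while window and now - window[0] > 1.0:
        window.popleft()
    if len(window) >= MAX_QPS:
        return False, "rate limit exceeded"
    window.append(now)
    return True, "ok"

if __name__ == "__main__":
    for user in ["frontend", "unknown-team"]:
        print(user, check_access(user))
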
Challenging Questions
1. Why is design for operations so important?
2. How is automated configuration typically supported?
Chapter 3. Selecting a Service Platform
• Infrastructure as a Service (IaaS): Computer and network hardware,
real or virtual, ready for you to use.
• Platform as a Service (PaaS): Your software running in a vendor-
provided framework or stack.
• Software as a Service (SaaS): An application provided as a web site.
The consumers of SaaS, PaaS, and IaaS
Chapter 3. Selecting a Service Platform
A platform may be described along three axes:

1. Level of service abstraction: IaaS, PaaS, SaaS


2. Type of machine: Physical, virtual, or process container
3. Level of resource sharing: Shared or private
Level of service abstraction: IaaS, PaaS, SaaS
• Abstraction is how far users are kept from the details of the raw
machine itself.
• The closer you are to the raw machine (low abstraction), the more
control you have.
• The higher the level of abstraction, the less you have to concern
yourself with technical details of building infrastructure and the more
you can focus on the application (high abstraction).
Infrastructure as a Service (IaaS)
• IaaS provides bare machines, networked and ready for you to install the
operating system and your own software.
• The provider takes care of the infrastructure: the machines themselves, power,
cooling, and networking, providing internet access, and all datacenter operations.
• A datacenter example is:
• https://www.youtube.com/watch?v=zXsoygN_v7A

• Server: Software that provides a function or API. (Not a piece of hardware.)


• Service: A user-visible system or product composed of many servers.
• Machine: A virtual or physical machine.
• Oversubscribed: A system that provides capacity X is used in a place where Y
capacity is needed, when X < Y. Used to describe a potential or actual need.
• Undersubscribed: The opposite of oversubscribed.
Platform as a Service (PaaS)
• PaaS enables you to run your applications from a vendor-provided
framework.
• An example is Google AppEngine.
• PaaS providers charge for their services based on how much CPU,
bandwidth, and storage are used.
• PaaS provides many high-level services including storage services,
database services, and many of the same services available in IaaS
offerings.
Software as a Service (SaaS)
• SaaS is a web-accessible application.
• The application is the service, and you interact with it as you would any
web site.
• The provider handles all the details of hardware, operating system, and
platform.
• Some common examples include:
1. Salesforce.com
2. Google Apps
3. Basecamp
• The service is fully managed, upgraded, and maintained by the provider.
Type of machine: Physical, virtual, or process
container
• A physical machine is a traditional computer with one or more CPUs,
and subsystems for memory, disk, and network.
• Virtual machines are created when a physical machine is partitioned
to run a separate operating system for each partition.
• Virtual machines permit isolation at the OS level.
• Virtual machines are allocated a fixed amount of disk space, memory,
and CPU from the physical machine.
• Virtual machines are relatively heavyweight because each one runs a full operating system.
Process Container
• A container is a group of processes running on an operating system that are
isolated from other such groups of processes.
• Each container has an environment with its own process name space,
network configuration, and other resources.
• Containers are very lightweight because they do not require an entire OS.
Example
• “Docker is an open-source project that automates the deployment of
applications inside software containers.”
• https://en.wikipedia.org/wiki/Docker_(software)
• https://www.docker.com/
Level of resource sharing: Shared or private
• In a “public cloud,” a third party owns the infrastructure and uses it to
provide service for many customers.
• In a “private cloud,” a company runs its own computing infrastructure
on its own premises.
• Hybrids may also be created, such as private clouds run in rented
datacenter space.
• The choice between private or public use of a platform is a business
decision based on four factors: compliance, privacy, cost, and control.
Compliance, privacy, cost, and control
• Using a public cloud for certain data or services may cause a company to fail a compliance audit.
• Even moving the data into the public cloud temporarily, such as in a failover scenario, could cause a company to fail an audit.
• Due to software bugs, employee mistakes, or other issues, your data could be exposed to other
customers or the entire world.
• The cost of using a public cloud may or may not be less than the cost of building the necessary
infrastructure yourself.
• Amortizing the expense over many customers reduces cost.
• Calculating the total cost of ownership (TCO) and return on investment (ROI) will help determine
which is the best option.
• A private cloud affords you more control.
• In a public cloud you have less control.
• Letting the vendor take care of all hardware selection means losing the ability to specify low-level
hardware requirements (specific CPU types or storage products).
Colocation
• Colocation is a useful way to provide services.
• It occurs when a datacenter owner rents space to other people, called
tenants.
• Over time, any rental of datacenter space has come to be called a colocation service.
• Using a colocation facility can get you up and running quickly.
• Tenants can take advantage of the facility’s economies of scale rather
than managing their own ISP connections and relationships.
Selection Strategies
There are many strategies one may use to choose between IaaS, PaaS, and SaaS.

1. Default to Virtual
2. Make a Cost-Based Decision
3. Leverage Provider Expertise
4. Get Started Quickly
5. Implement Ephemeral Computing
6. Use the Cloud for Overflow Capacity
7. Leverage Superior Infrastructure
8. Develop an In-House Service Provider
9. Contract for an On-Premises, Externally Run Service
10. Maximize Hardware Output
11. Implement a Bare Metal Cloud
Challenging Questions
1. Compare IaaS, PaaS, and SaaS on the basis of cost, configurability,
and control.
2. What are the warnings to consider in adopting Software as a
Service?
3. List the key advantages of virtual machines.
4. Why might you choose physical over virtual machines?
5. Which factors might make you choose private over public cloud
services?
CBD-3354 Distributed Systems and
Cloud Computing (Chapter 4)

Cloud Computing for Big Data


Lambton College in Toronto
Hugo Bosch
[email protected]
Application Architectures
• This chapter examines the building blocks used when designing
applications and other services.
• The first design pattern we examine is a single self-sufficient machine
used to provide web service.
• The machine runs software that speaks the HTTP protocol, receiving
requests, processing them, generating a result, and sending the reply.
Many typical small web sites and web-based applications use this
architecture.
Single-Machine Web Service Architecture
Single-Machine Web Service Architecture
The web server generates web pages from three different sources:
• Static Content: Files are read from local storage and sent to the user
unchanged. These may be HTML pages, images, and other content like
music, video, or downloadable software.
• Dynamic Content: Programs running on the web server generate HTML
and possibly other output that is sent to the user. They may do so
independently or based on input received from the user.
• Database-Driven Dynamic Content: This is a special case of dynamic
content where the programs running on the web server consult a database
for information and use that to generate the web page. In this architecture,
the database software and its data are on the same machine as the web
server.
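
As a rough illustration of those three sources, here is a minimal single-machine web
server sketch using Python’s standard library. The paths, the in-memory SQLite table,
and the port are all made up for this example; a real site would read static content from
local storage rather than inlining it.

import sqlite3
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical in-memory database standing in for database-driven content.
db = sqlite3.connect(":memory:", check_same_thread=False)
db.execute("CREATE TABLE products (name TEXT)")
db.execute("INSERT INTO products VALUES ('socks')")

class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/static":
            # Static content: would normally be read unchanged from local storage.
            body = "<h1>Hello</h1>"
        elif self.path == "/dynamic":
            # Dynamic content: generated by a program on each request.
            from datetime import datetime
            body = f"<p>The time is {datetime.now()}</p>"
        else:
            # Database-driven dynamic content: the page is built from a query.
            rows = db.execute("SELECT name FROM products").fetchall()
            body = "<ul>" + "".join(f"<li>{r[0]}</li>" for r in rows) + "</ul>"
        data = body.encode()
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        self.send_header("Content-Length", str(len(data)))
        self.end_headers()
        self.wfile.write(data)

if __name__ == "__main__":
    HTTPServer(("localhost", 8080), Handler).serve_forever()
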
Three-Tier Web Service Architecture
• The three-tier web service is a pattern built from three layers: the load
balancer layer, the web server layer, and the data service layer.
• The web servers all rely on a common backend data server, often an SQL
database. Requests enter the system by going to the load balancer. The
load balancer picks one of the machines in the middle layer and relays the
request to that web server. The web server processes the request, possibly
querying the database to aid it in doing so. The reply is generated and sent
back via the load balancer.
• A load balancer works by receiving requests and forwarding them to one of
many replicas—that is, web servers that are configured such that they can
all service the same URLs. Users talk to the load balancer as if it is a web
server; they do not realize it is a frontend for many replicas.
Three-Tier Web Service Architecture
Load Balancing Methods
For each request, a load balancer has to decide which backend to send it to.
There are different algorithms for making this decision:
• Round Robin (RR): The machines are rotated in a loop. If there are three
replicas, the rotation would look something like A-B-C-A-B-C. Down
machines are skipped.
• Weighted RR: This scheme is similar to RR but gives more queries to the
back-ends with more capacity. Usually a manually configured weight is
assigned to each backend. For example, if there are three backends, two of
equal capacity but a third that is huge and can handle twice as much traffic,
the rotation would be A-C-B-C.
• Least Loaded (LL): The load balancer receives information from each
backend indicating how loaded it is. Incoming requests always go to the
least loaded backend.
Load Balancing Methods
• Least Loaded with Slow Start: This scheme is similar to LL, but when a new
backend comes online it is not immediately flooded with queries. Instead, it starts
receiving a low rate of traffic that slowly builds until it is receiving an appropriate
amount of traffic. This fixes the problems with LL.
• Utilization Limit: Each server estimates how many more QPS it can handle and
communicates this to the load balancer. The estimates may be based on current
throughput or data gathered from synthetic load tests.
• Latency: The load balancer stops forwarding requests to a backend based on the
latency of recent requests. For example, when requests are taking more than 100
ms, the load balancer assumes this backend is overloaded. This technique
manages bursts of slow requests or pathologically overloaded situations.
• Cascade: The first replica receives all requests until it is at capacity. Any overflow
is directed to the next replica, and so on. In this case the load balancer must
know precisely how much traffic each replica can handle, usually by static
configuration based on synthetic load tests.
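
A toy sketch of two of these decisions, round robin and least loaded with slow start,
assuming hypothetical Backend objects that report their own load. The warm-up
weighting shown here is one possible way to model slow start, not how any particular
load balancer implements it.

import itertools

class Backend:
    def __init__(self, name):
        self.name = name
        self.active_requests = 0   # load reported by this backend
        self.warmup = 1.0          # 1.0 = fully warmed up; lower = just came online

def round_robin(backends):
    """Rotate through backends in a fixed loop (A-B-C-A-B-C); down machines not modeled here."""
    return itertools.cycle(backends)

def least_loaded_with_slow_start(backends):
    """Pick the least loaded backend, but inflate the apparent load of a backend that is
    still warming up toward the pool average, so it is not flooded with all the traffic."""
    avg = sum(b.active_requests for b in backends) / len(backends)
    def effective(b):
        return b.warmup * b.active_requests + (1 - b.warmup) * avg
    return min(backends, key=effective)

if __name__ == "__main__":
    pool = [Backend("A"), Backend("B"), Backend("C")]
    rr = round_robin(pool)
    print([next(rr).name for _ in range(6)])          # A B C A B C

    pool[0].active_requests = 10
    pool[1].active_requests = 2
    pool[2].active_requests = 0
    pool[2].warmup = 0.1   # C just restarted; give it only a trickle at first
    print(least_loaded_with_slow_start(pool).name)    # B, not the freshly restarted C
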
Four-Tier Web Service Architecture
• A four-tier web service is used when there are many individual
applications with a common frontend infrastructure.
• In this pattern, web requests come in as usual to the load balancer,
which divides the traffic among the various frontends.
• The frontends handle interactions with the users, and communicate
to the application servers for content. The application servers access
shared data sources in the final layer.
Four-Tier Web Service Architecture
Four-Tier Web Service Architecture
• The difference between the three-tier and four-tier designs is that the
application and the web servers run on different machines.
• The benefits of using the latter design pattern are that we decouple
the customer-facing interaction, protocols, and security issues from
the applications.
• The downside is that it takes a certain amount of trust for application
service teams to rely on a centralized frontend platform team. It also
takes management discipline to not allow exceptions.
Cloud-Scale Web Service Architecture
• Cloud-scale services are globally distributed. The service
infrastructure uses one of the previously discussed architectures,
which is then replicated in many places around the world.
• A global load balancer (GLB) is a DNS server that directs traffic to the
nearest datacenter.
• Geolocation is the process of determining the physical location of a
machine on the internet. Unlike a phone number, whose country
code and area code give a fairly accurate indication of where the user
is, an IP address has no concrete geographic meaning. There is a small
industry consisting of companies that use various means (and a lot of
guessing) to determine where each IP subnet is physically located.
Cloud-Scale Web Service Architecture
Global Load Balancing Methods
A GLB maintains a list of replicas, their locations, and their IP addresses.
When a GLB is asked to translate a domain name to an IP address, it takes
into account the geolocation of the requester when determining which IP
address to send in the reply.
GLBs use many different techniques:
• Nearest: Strictly selects the nearest datacenter to the requester.
• Nearest with Limits: The nearest datacenter is selected until that site is full.
At that point, the next nearest datacenter is selected. Slow start, as
described previously, is included for the same reasons as on local load
balancers.
• Nearest by Other Metric: The best location may be determined not by
distance but rather by another metric such as latency or cost.
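
A rough sketch of the “nearest with limits” idea, with made-up datacenter coordinates,
capacities, and loads. Real GLBs rely on geolocation databases, measured latency, and
slow start rather than the straight-line distance used in this illustration.

import math

# Hypothetical datacenters: coordinates, capacity in QPS, and current load.
DATACENTERS = {
    "us-east":  {"lat": 39.0, "lon": -77.5, "capacity": 1000, "load": 1000},
    "eu-west":  {"lat": 53.3, "lon": -6.3,  "capacity": 800,  "load": 200},
    "ap-south": {"lat": 19.1, "lon": 72.9,  "capacity": 600,  "load": 100},
}

def distance(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers (haversine formula)."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 6371 * 2 * math.asin(math.sqrt(a))

def nearest_with_limits(client_lat, client_lon):
    """Return the nearest datacenter that still has spare capacity."""
    ordered = sorted(
        DATACENTERS.items(),
        key=lambda item: distance(client_lat, client_lon, item[1]["lat"], item[1]["lon"]),
    )
    for name, dc in ordered:
        if dc["load"] < dc["capacity"]:
            return name
    return ordered[0][0]   # everything is full; fall back to the nearest anyway

if __name__ == "__main__":
    # A client geolocated to New York: us-east is nearest but full in this example.
    print(nearest_with_limits(40.7, -74.0))   # eu-west
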
Message Bus Architectures
• A message bus is a many-to-many communication mechanism between
servers.
• A message bus is a mechanism whereby servers send messages to
“channels” (like a radio channel) and other servers listen to the channels
they need.
• A server that sends messages is a publisher and the receivers are
subscribers.
• A central authority, or master, manages which servers are connected to
which channels.
• Message bus technology goes by many names, including message queue,
queue service, or pubsub service.
Message Bus Architectures
• A message bus system is efficient in that clients receive a message
only if they are subscribed.
• This approach is more efficient than a broadcast system that sends all
messages to all machines and lets the receiving machines filter out
the messages they aren’t interested in.
Some examples are:
• Simple Queue Service (SQS) by Amazon
• MCollective as a publish subscribe middleware
• RabbitMQ as a message broker
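
A toy in-process sketch of the publish/subscribe relationship described above. Real
message buses such as SQS or RabbitMQ run as separate networked services with
persistence and delivery guarantees; this only shows how channels, publishers, and
subscribers relate.

from collections import defaultdict

class MessageBus:
    """Minimal in-process pub/sub: publishers send to named channels,
    and only the subscribers of a channel receive its messages."""
    def __init__(self):
        self.subscribers = defaultdict(list)   # channel -> list of callbacks

    def subscribe(self, channel, callback):
        self.subscribers[channel].append(callback)

    def publish(self, channel, message):
        for callback in self.subscribers[channel]:
            callback(message)

if __name__ == "__main__":
    bus = MessageBus()
    bus.subscribe("new-orders", lambda m: print("billing saw:", m))
    bus.subscribe("new-orders", lambda m: print("shipping saw:", m))
    bus.subscribe("logins", lambda m: print("security saw:", m))

    bus.publish("new-orders", {"order_id": 42})   # delivered only to its subscribers
    bus.publish("logins", {"user": "alice"})
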
Service-Oriented Architecture (SOA)
• Service-oriented architecture (SOA) enables large services to be
managed more easily.
• With this architecture, each subsystem is a self-contained service
providing its functionality as a consumable service via an API.
• The various services communicate with one another by making API
calls.
• Some benefits of SOA are flexibility, ease of upgrade and
replacement.
SOA Real Life Example
Splitting Teams by Functionality
• At Google, Gmail was originally maintained by one group of Google
site reliability engineers (SREs). As the system grew, subteams split off
to focus on subsystems such as the storage layer, the anti-spam
layer, the message receiving system, the message delivery system,
and so on. This was possible because of the SOA design of the system.
SOA Best Practices
Following are some best practices for running a SOA:
• Use the same underlying RPC protocol to implement the APIs on all services. This
way any tool related to the RPC mechanism is leveraged for all services.
• Have a consistent monitoring mechanism. All services should expose
measurements to the monitoring system the same way.
• Use the same techniques with each service as much as possible. Use the same
load balancing system, management techniques, coding standards, and so on. As
services move between teams, it will be easier for people to get up to speed if
these things are consistent.
• Adopt some form of API governance. When so many APIs are being designed, it
becomes important to maintain standards for how they work. These standards
often impart knowledge learned through painful failures in the past that the
organization does not want to see repeated.
Example: a private network backbone connecting
many datacenters (DCs) and points of presence (POPs)
on the Internet
• The private WAN links that connect datacenters form an internal
backbone. An internal backbone is not visible to the internet at large.
It is a private network.
• A point of presence (POP) is a small, remote facility used for
connection to local ISPs.
Challenging Questions
1. What are the services that a four-tier architecture provides in the
first tier?
2. What is a message bus architecture and how might one be used?
Chapter 5. Design Patterns for Scaling
A system’s ability to scale is its ability to process a growing workload, usually
measured in transactions per second, amount of data, or number of users.

Distributed systems must be built to be scalable from the start because growth is
expected.

A distributed system is not automatically scalable.

The initial design must be engineered to scale to meet the requirements of the
service, but it also must include features that create options for future growth.

Once the system is in operation, we will always be optimizing the system to help it
scale better.
General Strategy
The basic strategy for building a scalable system is to design it with scalability in
mind from the start and to avoid design elements that will prevent additional
scaling in the future.

Once the system is running, performance limits will be discovered. This is where
the design features that enable further scaling come into play.

The additional design and coding effort that will help deal with future potential
scaling issues is lower priority than writing code to fix the immediate issues of the
day.
General Strategy
Some recommendations are:

1. Identify Bottlenecks

2. Reengineer Components (rewriting parts of a system)

3. Measure Results

4. Be Proactive
Scaling Up
The simplest methodology for scaling a system is to use bigger, faster equipment.

A system that runs too slowly can be moved to a machine with a faster CPU, more
CPUs, more RAM, faster disks, faster network interfaces, and so on.

Often an existing computer can have one of those attributes improved without
replacing the entire machine.

This is called scaling up because the system is increasing in size.


The AKF Scaling Cube
Methodologies for scaling to massive proportions boil down to three basic options:

1. Replicate the entire system (horizontal duplication)

2. Split the system into individual functions, services, or resources (functional or service splits)

3. Split the system into individual chunks (lookup or formulaic splits).


The AKF Scaling Cube
Horizontal Duplication (X)
Horizontal duplication increases throughput by replicating the service. It is also
known as horizontal scaling or scaling out.

This is related to the CAP Principle.

Techniques that involve x-axis scaling include the following:

• Adding more machines or replicas

• Adding more disk spindles

• Adding more network connections


Functional or Service Splits (Y)
A functional or service split means scaling a system by splitting out each individual
function so that it can be allocated additional resources.

Techniques that involve y-axis scaling include the following:

• Splitting by function, with each function on its own machine

• Splitting by function, with each function on its own pool of machines

• Splitting by transaction type

• Splitting by type of user


Lookup-Oriented Split (Z)
A lookup-oriented split scales a system by splitting the data into identifiable segments, each of which is given dedicated
resources.

z-axis scaling is similar to y-axis scaling except that it divides the data instead of the processing.

Additional ways to segment data include the following:

• By Hash Prefix

• By Customer Functionality

• By Utilization

• By Organizational Division

• Hierarchically

• By Arbitrary Group
Combinations
Many scaling techniques combine multiple axes of the AKF Scaling Cube.

Some examples include the following:

• Segment plus Replicas

• Dynamic Replicas

• Architectural Change
Caching
A cache is a small data store built from fast, expensive media, used to speed up access to a larger, slower,
cheaper data store.

Some things to consider are:

1. Cache Effectiveness

2. Cache Placement

3. Cache Persistence

4. Cache Replacement Algorithms

5. Cache Entry Invalidation

6. Cache Size
Cache Placement
Not all caches are found in RAM.

The cache medium simply must be faster than the main medium.
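
One common cache replacement algorithm is least recently used (LRU). Below is a
minimal LRU cache sketch placed in front of a stand-in for the slower data store; the
capacity, key names, and slow_lookup function are hypothetical.

from collections import OrderedDict

class LRUCache:
    """Keep at most `capacity` entries; evict the least recently used one when full."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.data = OrderedDict()

    def get(self, key):
        if key not in self.data:
            return None
        self.data.move_to_end(key)         # mark as most recently used
        return self.data[key]

    def put(self, key, value):
        self.data[key] = value
        self.data.move_to_end(key)
        if len(self.data) > self.capacity:
            self.data.popitem(last=False)  # evict the least recently used entry

def slow_lookup(key):
    # Stand-in for the slow/cheap bigger data store (disk, database, remote service).
    return f"value-for-{key}"

cache = LRUCache(capacity=2)

def cached_lookup(key):
    value = cache.get(key)
    if value is None:                      # cache miss: go to the slow store
        value = slow_lookup(key)
        cache.put(key, value)
    return value

if __name__ == "__main__":
    for k in ["a", "b", "a", "c", "b"]:    # "b" is evicted when "c" arrives, then refetched
        print(k, cached_lookup(k))
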
Data Sharding
Sharding is a way to segment a database (z-axis) that is flexible, scalable, and
resilient.

It divides the database based on the hash value of the database keys.

A hash function is an algorithm that maps data of varying lengths to a fixed-length value.

We use a power of 2 for the number of shards to optimize the hash-to-shard mapping process.
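
A minimal sketch of hash-based sharding with a power-of-2 shard count, so the
hash-to-shard mapping can be done with a cheap bitmask. The shard count and key
names are made up for this illustration.

import hashlib

NUM_SHARDS = 8               # a power of 2, so "% NUM_SHARDS" reduces to a bitmask
SHARD_MASK = NUM_SHARDS - 1  # 0b111

def shard_for(key):
    """Map a database key of any length to a fixed shard number."""
    digest = hashlib.md5(key.encode()).digest()
    h = int.from_bytes(digest[:8], "big")  # hash: arbitrary-length key -> fixed-length value
    return h & SHARD_MASK                  # same as h % NUM_SHARDS for a power-of-2 shard count

if __name__ == "__main__":
    for key in ["user:alice", "user:bob", "order:1234"]:
        print(key, "-> shard", shard_for(key))
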


Threading
Data can be processed in different ways to achieve better scale.

Simply processing one request at a time has its limits.

Threading is a technique that can be used to improve system throughput by
processing many requests at the same time.

Threading is a technique used by modern operating systems to allow sequences
of instructions to execute independently.

There are limits to the number of threads a machine can handle, based on RAM
and CPU core limits.
Queueing
Another way that data can be processed differently to achieve better scale is
called queueing.

A queue is a data structure that holds requests until the software is ready to
process them.

Most queues release elements in the order that they were received, called first in,
first out (FIFO) processing.

Queueing is similar to multithreading in that there is a master thread and worker
threads.
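
A minimal sketch of that model: a master thread places requests on a FIFO queue and
a fixed pool of worker threads drains it. The request names and worker count are
illustrative only.

import queue
import threading

NUM_WORKERS = 3
requests = queue.Queue()          # FIFO: released in the order received

def worker(worker_id):
    while True:
        item = requests.get()     # block until a request is available
        if item is None:          # sentinel value: time to shut down
            break
        print(f"worker {worker_id} handled {item}")
        requests.task_done()

# A fixed pool of worker threads, so bursts of traffic do not overload the machine.
threads = [threading.Thread(target=worker, args=(i,)) for i in range(NUM_WORKERS)]
for t in threads:
    t.start()

# The "master" simply enqueues incoming requests.
for i in range(10):
    requests.put(f"request-{i}")

requests.join()                   # wait until every queued request has been processed
for _ in threads:
    requests.put(None)            # one shutdown sentinel per worker
for t in threads:
    t.join()
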
Queueing Benefits
With queueing, you are less likely to overload the machine since the number of
worker threads is fixed and remains constant.

There is also an advantage in retaining the same threads to service multiple
requests.

Another benefit of the queueing model is that it is easier to implement a priority
scheme.

In fair queueing, the algorithm prevents a low-priority item from being “starved” by
a flood of high-priority items.
Queueing Variations
Variations of the queueing model can optimize performance.

One variation is for threads to kill and re-create themselves periodically so
that they remain “fresh.”

Finally, it is common practice to use processes instead of threads.

An example of queueing implemented with processes is the Prefork processing
module for the Apache web server.

The number of subprocesses used can be adjusted dynamically.


Content Delivery Networks
A content delivery network (CDN) is a web-acceleration service that delivers
content (web pages, images, video) more efficiently on behalf of your service.

CDNs cache content on servers all over the world.

CDNs have extremely large, fast connections to the internet. They have more
bandwidth to the internet than most web sites.

CDNs are great choices for small sites.

CDNs now compete on price, geographic coverage, and an ever-growing list of
new features.
Chapter 6. Design Patterns for Resiliency
"Success is not final, failure is not fatal: it is the courage to continue that counts."
Winston Churchill

Resiliency is a system’s ability to constructively deal with failures.

Manufacturers calculate their components’ reliability and publish their mean time
between failure (MTBF) ratings.

The techniques are grouped into four categories: physical failures, attacks,
human errors, and unexpected load.
Software Resiliency Beats Hardware Reliability
Better hardware means special-purpose CPUs, components, and storage
systems.

Better software means adding intelligence to a system so that it detects failures
and works around them.

Software is also more malleable than hardware.

The Traditional Approach
Traditional software assumes a perfect, malfunction-free world.

This leaves the hardware systems engineer with the impossible task of delivering
hardware that never fails. We fake it by using redundant array of independent
disks (RAID) systems that let the software go on pretending that disks never fail.

Sheltered from the reality of a world full of malfunctions, we enable software
developers to continue writing software that assumes a perfect, malfunction-free
world.
The Distributed Computing Approach
Distributed computing, in contrast to the traditional approach, embraces
components’ failures and malfunctions.

Traditional computing goes to great lengths to achieve reliability through hardware;
distributed computing instead accepts a small number of failures as “normal” and
adds intelligence to detect and route around failures.
Resiliency through Spare Capacity
The general strategy used to gain resiliency is to have redundant units of
capacity that can fail independently of each other.

Failures are detected and those units are removed from service.

The total capacity of the system is reduced but the system is still able to run.

This means that systems must be built with spare capacity to begin with.

Failure of any one replica is detected and that replica is taken out of service
automatically.

We call this N + M redundancy. Such systems require N units to provide capacity
and have M units of extra capacity.
Resiliency through Spare Capacity
Units are the smallest discrete system that provides the service.

The term N + 1 redundancy is used when we wish to indicate that there is enough
spare capacity for one failure.

If we added a fifth server, the system would be able to survive two simultaneous
failures and would be described as N + 2 redundancy.
How Much Spare Capacity
Selecting the granularity of our unit of capacity enables us to manage the
efficiency.

The other factors in selecting the amount of redundancy are how quickly we can
bring up additional capacity and how likely it is that a second failure will happen
during that time.

The time it takes to repair or replace the down capacity is called the mean time to
repair (MTTR).

The probability an outage will happen during that time is the reciprocal of the
mean time between failures. The percent probability that a second failure will
happen during the repair window is MTTR/MTBF × 100.
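
As a made-up worked example of that formula: if a failed unit takes 6 hours to repair or
replace (MTTR) and units fail on average once every 6,000 hours (MTBF), the chance of
a second failure landing inside the repair window is about 0.1 percent. The numbers
below are hypothetical.

mttr_hours = 6       # hypothetical time to repair or replace the failed unit
mtbf_hours = 6000    # hypothetical mean time between failures
print(mttr_hours / mtbf_hours * 100, "% chance of a second failure during the repair window")
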
How Much Spare Capacity
MTTR is a function of a number of factors.

A process that dies and needs to be restarted has a very fast MTTR.

N + 1 is a minimum for a service; N + 2 is needed if a second outage is likely while
you are fixing the first one.
Load Sharing versus Hot Spares
Another strategy is to have primary and secondary replicas.

In this approach, the primary replica receives the entire workload but the
secondary replica is ready to take over at any time.

This is sometimes called the hot spare or “hot standby” strategy since the spare is
connected to the system, running (hot), and can be switched into operation
instantly. It is also known as an active–passive or master–slave pair.

Because there is only one master, these configurations are 1 + M configurations.


Failure Domains
A failure domain is the bounded area beyond which failure has no impact.

A failure domain may be prescriptive—that is, a design goal or requirement.

Alternatively, a failure domain may be descriptive.

Determining a failure domain is done within a particular scope or assumptions
about how large an outage we are willing to consider.

We commonly hear of datacenters that have perfect alignment of power,
networking, and other factors but in which an unexpected misalignment results in
a major outage.
Software Failures
Software needs resilience features, and there are two categories of crashes:

1. A regular crash occurs when the software does something prohibited by the
operating system. Due to a software bug, the program may try to write to
memory that is marked read-only by the operating system. The OS detects
this and kills the process.

2. A panic occurs when the software itself detects something is wrong and
decides the best course is to exit. The software may detect a situation that
shouldn’t exist and cannot be corrected. If internal data structures are
corrupted and there is no safe way to rectify them, it is best to stop work
immediately rather than continue with bad data. A panic is an intentional
crash.
Software Hangs
Sometimes when software has a problem it does not crash, but instead hangs or
gets caught in an infinite loop.

A strategy for detecting hangs is to monitor the server by sending it requests and
detecting whether it has stopped processing them.

These active requests, which are called pings, are designed to be lightweight,
simply verifying basic functionality.

If pings are sent at a specific, periodic rate and are used to detect hangs as well
as crashes, they are called heartbeat requests.

Another technique for dealing with software hangs is called a watchdog timer.
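
A toy sketch of the heartbeat idea: send a lightweight ping at a fixed rate and flag the
server if several heartbeats in a row go unanswered. The health-check URL, intervals,
and thresholds are made up, and a real monitor would page someone or restart the
process rather than print a message.

import time
import urllib.request

PING_URL = "http://localhost:8080/healthz"   # hypothetical lightweight health endpoint
PING_INTERVAL = 10      # seconds between heartbeats
PING_TIMEOUT = 2        # a hung server will not answer within this window

def ping_once():
    try:
        with urllib.request.urlopen(PING_URL, timeout=PING_TIMEOUT) as resp:
            return resp.status == 200
    except Exception:
        return False     # connection refused, timeout, etc. all count as a failed ping

def heartbeat_loop():
    failures = 0
    while True:
        if ping_once():
            failures = 0
        else:
            failures += 1
            if failures >= 3:   # crashed or hung: several heartbeats missed in a row
                print("server appears down or hung; restart it or page someone")
        time.sleep(PING_INTERVAL)

if __name__ == "__main__":
    heartbeat_loop()
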
Physical Failures
Distributed systems also need to be resilient when faced with physical failures.

Providing resiliency through the use of redundancy at every level is expensive and
difficult to scale.

Many components of a computer can fail. The parts whose utilization you monitor
can fail, such as the CPU, the RAM, the disks, and the network interfaces.
Supporting components can also fail, such as fans, power supplies, batteries, and
motherboards.
Clos Networking
It is reasonable to expect that eventually there will be network products on the
open market that provide non-blocking, full-speed connectivity between any two
machines in an entire datacenter. We’ve known how to do this since 1953 (Clos
1953). When this product introduction happens, it will change how we design
services.
Overload Failures
Distributed systems need to be resilient when faced with high levels of load that
can happen as the result of a temporary surge in traffic, an intentional attack, or
automated systems querying the system at a high rate, possibly for malicious
reasons.

Some examples are:

Traffic Surges

DoS and DDoS Attacks

Scraping Attacks
Chapter 7. Operations in a Distributed World
Operations is the work done to keep a system running in a way that meets or
exceeds operating parameters specified by a service level agreement (SLA).

Operations includes all aspects of a service’s life cycle: from initial launch to the
final decommissioning and everything in between.

“The rate at which organizations learn may soon become the only sustainable
source of competitive advantage.” (Peter Senge)
Change versus Stability
A system starts at a baseline of stability. A change is then made. All changes have
some kind of a destabilizing effect. Eventually the system becomes stable again,
usually through some kind of intervention. This is called the change-instability
cycle.

There is a tension between the operations team’s desire for stability and the
developers’ desire to get new code into production. There are many ways to reach
a balance. Most ways involve aligning goals by sharing responsibility for both
uptime and velocity of new features.
Operations at Scale
Operations in distributed computing is done at a large scale. Processes that have
to be done manually do not scale. Constant process improvement and automation
are essential.

Distributed computing involves hundreds and often thousands of computers
working together.

Operations is different from traditional computing administration because it is
focused on a particular service or group of services and because it has more
demanding uptime requirements.
Service Life Cycle
Operations is responsible for the entire life cycle of a service: launch, maintenance
(both regular and emergency), upgrades, and decommissioning.

Each phase has unique requirements, so you’ll need a strategy for managing each
phase differently.

Launches, decommissioning of services, and other tasks that are done infrequently
require an attention to detail that is best assured by the use of checklists. Checklists
ensure that lessons learned in the past are carried forward.
The stages of the life cycle
1. Service Launch: Launching a service the first time. The service is brought to
life, initial customers use it, and problems that were not discovered prior to the
launch are discovered and remedied.
2. Emergency Tasks: Handling exceptional or unexpected events. This includes
handling outages and detecting and fixing conditions that precipitate outages.
3. Nonemergency Tasks: Performing all manual work required as part of the
normally functioning system. This may include periodic (weekly or monthly)
maintenance tasks (for example, preparation for monthly billing events) as
well as processing requests from users (for example, requests to enable the
service for use by another internal service or team).
The stages of the life cycle
4. Upgrades: Deploying new software releases and hardware platforms.

Each new software release is built and tested before deployment. Tests include
system tests, done by developers, as well as user acceptance tests (UAT), done
by operations. UAT might include tests to verify there are no performance
regressions (unexpected declines in performance). Vulnerability assessments are
done to detect security issues. New hardware must go through a hardware
qualification to test for compatibility, performance regressions, and any changes in
operational processes.
The stages of the life cycle
5. Decommissioning: Turning off a service.

It is the opposite of a service launch: removing the remaining users, turning off the
service, removing references to the service from any related service
configurations, giving back any resources, archiving old data, and erasing or
scrubbing data from any hardware before it is repurposed, sold, or disposed of.

6. Project Work: Performing tasks large enough to require the allocation of
dedicated resources and planning. Along the way tasks will arise that are larger
than others. Examples include fixing a repeating but intermittent failure, working
with stakeholders on roadmaps and plans for the product’s future, moving the
service to a new datacenter, and scaling the service in new ways.

The Agile methodology is an effective way to organize project work.


Service Launches
If we launch new services frequently, then there are probably many people doing
the launches. Some will be less experienced than others.

We should maintain a checklist to share our experience. Every addition increases
our organizational memory, the collection of knowledge within our organization,
thereby making the organization smarter.
Service Decommissioning
Decommissioning (decomm), or turning off a service, involves three major phases:

1. Removal of users
2. Deallocation of resources
3. Disposal of resources
Case Study: Self-Service Launches at Google
Google launches so many services that it needed a way to make the launch
process streamlined and able to be initiated independently by a team. In addition
to providing APIs and portals for the technical parts, the Launch Readiness
Review (LRR) made the launch process itself self-service.

The LRR included a checklist and instructions on how to achieve each item.

Some checklist items were technical, for example, making sure that the Google
load balancing system was used properly.
Case Study: Self-Service Launches at Google
Other items were cautionary, to prevent a launch team from repeating other
teams’ past mistakes. For example, one team had a failed launch because it
received 10 times more users than expected. There was no plan for how to handle
this situation. The LRR checklist required teams to create a plan to handle this
situation and demonstrate that it had been tested ahead of time.

Other checklist items were business related. Marketing, legal, and other
departments were required to sign off on the launch. Each department had its own
checklist.
Other things to consider
The most productive use of time for operational staff is time spent automating and
optimizing processes. This should be their primary responsibility.

Emergency tasks need fast response. Nonemergency requests need to be
managed such that they are prioritized and worked in a timely manner. To make
sure all these things happen, at any given time one person on the operations team
should be focused on responding to emergencies; another should be assigned to
prioritizing and working on nonemergency requests.

When team members take turns addressing these responsibilities, the
responsibilities receive the dedicated attention required to assure they are handled
correctly, and the load is shared across the team. People also avoid burning out.
Virtual Office
Operations teams generally work far from the actual machines that run their
services. Since they operate the service remotely, they can work from anywhere
there is a network connection.

Teams often work from different places, collaborating and communicating in a chat
room or other virtual office.

Many tools are available to enable this type of organizational structure. It becomes
important to change the communication medium based on the type of
communication required. Chat rooms are sufficient for general communication but
voice and video are more appropriate for more intense discussions. Email is more
appropriate when a record of the communication is required, or if it is important to
reach people who are not currently online.
