Cloud Computing - Theory and Practice (2019)
Mark Grechanik, Ph.D.
September 2, 2019
Copyright © 2019 by Mark Grechanik, Ph.D.
All rights reserved. No part of this publication may be reproduced, distributed, or
transmitted in any form or by any means, including photocopying, recording, or other
electronic or mechanical methods, without the prior written permission of the author,
except in the case of brief quotations embodied in critical reviews and certain other
noncommercial uses permitted by copyright law. For permission requests, write to the
author, addressed Attention: Permissions Coordinator, at the address below.
Contents

1 Overview
1.1 New Computing Metaphor
1.2 The State of the Art and Practice
1.3 Obstacles For Cloud Deployment
1.4 What This Book Is About
2 Diffusing 2 Cloud
2.1 Diffusing Computation
2.2 Peer-To-Peer (P2P) Computing
2.2.1 Backing up Files With Pastiche
2.2.2 <Project Name Here>@Home
2.3 Grid Computing
2.4 All Roads Lead to Cloud Computing
2.5 Summary
4 RPC
4.1 Local Procedure Calls
4.2 Calling Remote Procedures
4.3 The RPC Process
4.4 Interface Definition Language (IDL)
5 Map/Reduce Model
5.1 Computing Tasks For Big Data Problems
5.2 Datacenters As Distributed Objects
5.3 Map and Reduce Operation Primitives
5.4 Map/Reduce Architecture and Process
5.5 Failures and Recovery
5.6 Google File System
5.7 Apache Hadoop: A Case Study
5.8 Summary
6 RPC Galore
6.1 Java Remote Method Invocation (RMI)
6.1.1 The Java RMI Process
6.1.2 Parameter Passing
6.1.3 Lazy Activation of Remote Objects
6.2 <your data serialization format here>-RPC
6.2.1 XML-RPC
6.2.2 JSON-RPC
6.2.3 GWT-RPC
6.3 Facebook/Apache Thrift
6.4 Google RPC (gRPC)
6.5 Twitter Finagle
6.6 Facebook Wangle
6.7 Summary
7 Cloud Virtualization
7.1 Abstracting Resources
7.2 Resource Virtualization
7.3 Virtual Machines
7.4 Hypervisors
7.5 Interceptors, Interrupts, Hypercalls, Hyper-V
7.6 Lock Holder Preemption
7.7 Virtual Networks
7.8 Programming VMs as Distributed Objects
7.9 Summary
8 Appliances
8.1 Unikernels and Just Enough Operating System
8.2 OSv
8.3 Mirage OS
8.4 Ubuntu JeOS
8.5 Open Virtualization Format
8.6 Summary
11 Infrastructure
11.1 Key Characteristics of Distributed Applications
11.2 Latencies
11.3 Cloud Computing With Graphics Processing Units
11.4 RAID Architectures
11.5 MAID Architectures
11.6 Networking Servers in Datacenters
11.7 Summary
17 Conclusions
List of Figures
10.1 Scala pseudocode example of a RESTful web service using Twitter Finch.
10.2 The skeleton C program of a container management system.
10.3 The dependencies of the program /bin/sh are obtained using the utility ldd.
10.4 An example of Dockerfile for creating a container with a Java program.
10.5 The skeleton C program of a container management system.
11.1 C program fragment for iterating through the values of the two-dimensional array.
11.2 A high-level view of the GPU architecture and interactions with the CPU.
12.1 A Java-like pseudocode for using a load balancer with Amazon Cloud API.
13.1 A Spark program for computing the average value for the stream of integers.
13.2 An SQL statement that retrieves information about employees.
13.3 A Spark SQL statement that retrieves information about employees.
15.1 A sales ACID transaction that updates data in three database tables.
15.2 A transformed sales transaction that sends messages to update data.
15.3 Pseudocode shows possible synchronization.
15.4 A transformed sales transaction that sends messages to update data.
1.1 New Computing Metaphor
sioned and released with minimal management effort [117]. Since performance and
availability are very important non-functional characteristics of large-scale software
applications, the promise of cloud computing is to enable stakeholders to economically
achieve these characteristics via cloud elasticity and economy of scale. Essentially,
a main benefit of deploying large-scale applications in the cloud is to significantly
reduce costs for enterprises due to cheap hardware and software in the centralized
cloud platforms. Thus, a main reason for enterprises to move their applications to the cloud is to reduce the increasing cost of maintenance, which involves paying for hardware and software platforms and for the technical specialists who maintain them [13, 23, 24, 93, 98, 123].
Seven-league boots are a metaphor for the operational efficiency with which the cloud allocates hardware and software services to deployed applications, an efficiency that in theory improves with increasing scale. This operational efficiency is often referred to as elasticity: the ability to provision resources rapidly and automatically, i.e., to quickly scale out, and to release these resources just as rapidly, i.e., to quickly scale in. Using our analogy, one
can imagine a magical mini-robot who lives inside computer servers and who adds the
CPUs and other resources when applications need them and removes these resources
when they are no longer needed. This magical mini-robot is implemented in software
that is a part of a cloud computing platform.
The difference between the words allocation and provisioning of resources is subtle
but important. Allocating resources means reserving some quantity of these resources
for the use of some client or a group of clients. For example, an organization can pur-
chase 1,000 virtual machines (i.e., simulated computers) with some predefined char-
acteristics from a cloud provider for 24x7 use for three years and the cloud provider
will allocate these resources. The organization may have many users who will execute
various applications in the cloud, and each of these applications will demand different
resources. Provisioning resources to applications means activating the subset of the allocated resources to meet the applications' demands. Ideally, stakeholders want to allocate exactly as many resources as will be provisioned to their applications; however, this is often difficult to achieve.
To stakeholders, the capabilities available for provisioning often appear to be un-
limited, but unlike seven league boots, stakeholders purchase certain quantities of these
capabilities [117]. A message from cloud providers (e.g., Google, Microsoft, Amazon
EC2) to application owners is the following: develop and test your software application
the same way you have been doing, and when you are ready, put on your application
our magical seven league boots, i.e., deploy your application on our cloud platform,
pay fees for the usage of our resources by your application, and we will take care of
the performance and availability of your application. Despite this simple and attractive
marketing message, companies and organizations are very slow to deploy their appli-
cations on the cloud; in fact, many companies that attempted to move their applications
to cloud provider platforms moved back to internal deployment.
Not only is it widely documented that companies and organizations go slow with
cloud computing, but they are also advised to hold back and thoroughly evaluate all
pros and cons. A survey of 1,300 information technology professionals shows that it
takes at least six months to deploy an application in the cloud and 60% of respondents
said that cloud application performance is one of the top three reasons to move applications to the cloud, whereas 46% cited the cost of moving to the cloud as the major
barrier [36]. Interestingly, almost two in five respondents said they would rather get a root canal, dig a ditch, or do their own taxes than address the serious challenges associated with cloud deployments. Over 31% said they could train for a marathon or grow a mullet in less time than it takes to migrate their applications to the
cloud. There is a fundamental disconnect between the software development process and deploying applications in the cloud, where properties of the cloud environment affect not only the performance of deployed applications but also their functional correctness. In addition, more than a quarter of the respondents suggested that they knew more about how to play Angry Birds than about how to migrate their company's network and applications to the cloud.
Shortly after this deployment, a disturbing pattern emerged: the customer support center of the corporation received an increasing number of calls from upset customers who were denied boarding on their flights because other customers were already in their seats. Even though denied boarding is a more humane solution than having passengers dragged off their seats by security officers, the affected passengers could not appreciate their luck, and they complained loudly. As it turned out, the flight application issued multiple tickets for the same seats. Further investigation showed that the increased seat double-booking happened on flights that offered last-minute special deals. When the application was moved back to the corporation and deployed on a single powerful server, this pattern of double-booking disappeared. Within the corporation, a consensus formed that was succinctly expressed as "cloud computing is terrific, but it is not for us yet."
A problem with this deployment was uncovered later during a code review. The Java keyword synchronized guarantees that only one client can access the shared state as long as the threads run within a single process. However, once the load increases, the cloud scales out by adding more VMs that run threads in separate processes. Thus, two or more threads are prevented from reserving the same seat within the same process; however, the state of the application is not shared among two or more processes. Hence, double-booking occurred frequently when special flight deals were announced, leading to peaks in user load, which caused the cloud to scale out and create new VMs. Preventing this situation would require changes to the source code, so that multiple VMs would synchronize their accesses to shared resources (i.e., seats), which is often an expensive and error-prone procedure.
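The failure mode just described can be reproduced in a few lines. The sketch below is ours, not the corporation's code; the class and seat names are hypothetical. Each SeatInventory instance stands in for the application state inside one VM:

```java
// synchronized serializes access only within one JVM process. When the cloud
// scales out, each VM runs its own copy of this object with its own seat map,
// so two VMs can both "reserve" the same seat.
import java.util.HashSet;
import java.util.Set;

class SeatInventory {
    private final Set<String> reservedSeats = new HashSet<>();

    // Safe against concurrent threads in THIS process only.
    public synchronized boolean reserve(String seat) {
        return reservedSeats.add(seat); // false if the seat is already taken
    }
}

public class DoubleBookingDemo {
    public static void main(String[] args) {
        // Two instances stand in for two VMs created by a scale-out;
        // neither sees the other's state.
        SeatInventory vm1 = new SeatInventory();
        SeatInventory vm2 = new SeatInventory();
        boolean first  = vm1.reserve("12A");
        boolean second = vm2.reserve("12A");
        System.out.println(first && second); // both succeed: double-booking
    }
}
```

Within one instance the guard works (a second reserve of "12A" on vm1 returns false); across instances it cannot, which is exactly why the fix requires cross-VM coordination in the source code.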
An overarching concern was that most software applications have performance problems, i.e., situations in which applications unexpectedly exhibit worsened characteristics for certain combinations of input values and configuration parameters. Performance problems impact scalability, thus affecting a large number of customers who use
a software application. To ensure good performance, the cloud is supposed to elas-
tically allocate additional resources to the application to cope with the performance
problems and the increased load. Predicting performance demands is a very difficult
problem, and since applications are deployed as black-boxes, it is difficult for the cloud
to allocate resources precisely when demand goes up and down. Unfortunately, the
cloud has no visibility into the internals of the applications and their models, and re-
sources are under- or over-provisioned resulting in reduced quality of services or sig-
nificant cost of resources that are not used by these applications. Application owners
ultimately foot the bill for these additional resources, which is considerable for large-
scale applications with performance problems.
As a result, the company puts the application in the cloud and starts running multiple instances of this application in parallel in many VMs. Let us assume that the cost of a VM is one cent per hour and the average elapsed execution time of the application is one minute. Suppose that we run 1,000,000 VMs in parallel at a cost of $3.36 million, performing over 20 billion test executions of this application. Let us also assume that the total input space contains 10^20 combinations. Yet, despite this cost, only a small fraction of the input space, 2 × 10^-10, is explored. Of course, depending on the structure of the application's source code and its functionality, this may be enough; however, it is also highly probable that after spending two weeks of computing time and millions of dollars, the application's coverage may still be in single digits with few bugs reported. As a result, the testing effort in the cloud would be declared a failure, since management can argue that purchasing computers for the internal information technology (IT) division is more cost-effective than using cloud computing. And they may be right – these computers can be used for other applications, whereas the money paid for the cloud computing time used for the testing effort cannot be amortized over other applications.
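The arithmetic above can be checked with a short back-of-the-envelope calculation (the class name is ours):

```java
// A quick check of the numbers in the testing example: 20 billion runs of
// one minute each, on VMs costing one cent per hour, against an input space
// of 10^20 combinations.
public class CloudTestingCost {
    public static void main(String[] args) {
        double vmCostPerHour = 0.01;   // one cent per hour
        double runsTotal = 20e9;       // 20 billion test executions
        double minutesPerRun = 1.0;    // average elapsed time per run

        double vmHours = runsTotal * minutesPerRun / 60.0;
        double cost = vmHours * vmCostPerHour;    // ~ $3.33 million
        double coverage = runsTotal / 1e20;       // fraction of input space

        System.out.printf("cost = $%.2f million%n", cost / 1e6);
        System.out.printf("coverage = %.1e%n", coverage); // 2.0e-10
    }
}
```

The computed cost (about $3.33 million) and the explored fraction (2 × 10^-10) agree with the figures quoted in the text.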
This example illustrates complex issues that accompany cloud computing. Owning
computing infrastructure is expensive, however, its cost is amortized over many appli-
cations, whereas the cost of poorly executed cloud deployment cannot be recouped.
Moreover, depending on how confidential data is, i.e., public, partly or fully sensitive,
it may not be possible to move it outside the organization that owns this data to the
cloud infrastructure owned by a different company. Even though we do not discuss se-
curity and privacy of cloud deployment in this book, this is one of the biggest obstacles
in cloud deployment. Thus, it is important to understand how to build an application as a set of distributed objects and deploy them in the cloud environment in a way that is economical and guarantees the desired performance and functionality.
2 Diffusing 2 Cloud
In this chapter, we describe the ideas of diffusing computation, peer-to-peer (P2P) computing, and grid computing as the most notable precursors to cloud computing. All these types of computing are instances of distributed computing, which can be defined as software processes that are located in different memory address spaces and that communicate by passing messages, which are sequences of bytes with defined boundaries. These messages can be exchanged synchronously, when the message-sending component blocks during the message exchange, or asynchronously, when the function call to send a message returns immediately after passing the parameter values to the underlying message-passing library and the component continues execution. A synchronous call may block during the entire call or a part of the call, whereas an asynchronous call is always nonblocking. If the call returns before the result is computed, the caller must perform additional operations to obtain the result of the computation when it becomes available. Distributing computations is important for various reasons that include, but are not limited to, better utilizing resources and improving the response time of the application.
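The synchronous/asynchronous distinction can be illustrated with a short sketch using Java's standard ExecutorService; the choice of library and the names are ours, not prescribed by the text:

```java
// Asynchronous send: submit() returns immediately with a Future while the
// message is processed elsewhere. Synchronous point: get() blocks until the
// result is available - the "additional operation" needed to obtain it.
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class SyncAsyncDemo {
    // Stands in for a remote component that processes a message.
    static int process(int message) {
        return message * 2;
    }

    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newSingleThreadExecutor();

        // Asynchronous: the call returns at once; the caller keeps running.
        Future<Integer> reply = pool.submit(() -> process(21));

        // ... the caller can do other work here ...

        // Blocking until the computed result becomes available.
        System.out.println(reply.get()); // prints 42
        pool.shutdown();
    }
}
```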
A stricter definition of synchronicity in distributed applications involves the no-
tion of the global clock. In an idealized distributed application, all of its objects have
access to this clock, which keeps the time for these objects. All algorithm executions are bounded in time and speed, and the global clock measures this time and speed.
All message transmissions have bounded time delays (i.e., latencies). To understand
these constraints, let us view the execution of some algorithm in phases – each phase
groups one or more operations and there is a precisely defined phase transition that
specifies when one phase finishes and the next one starts [132]. Suppose that we have
a specification that defines at what phases to stop the executions of algorithms in some
distributed objects and at what phases to allow the executions in some other distributed
objects to proceed. Since partitioning executions in phases bounds their speeds and
having the global clock bounds their times, it is possible to implement specifications
that determine how a distributed application is executed (e.g., how resources are shared
among its distributed objects). However, this is not easy to realize in an asynchronous application where there is no global clock and no bounds exist on the execution time and speed of its constituent distributed objects.
Opposite to engineering distributed applications is building a monolithic applica-
tion, whose entire functionality is implemented as instructions and data that are em-
bedded in a single process address space. Since a single address space is often as-
sociated with a single computing unit, there is a limit on how many applications can
be built as monolithic. Due to geographic distribution of users and data and govern-
ment regulations, many applications can only be built as distributed. A key element in
building a distributed application is to organize the objects that constitute these applications into patterns, and then use these patterns to develop the underlying framework
for certain types of distributed applications. In this chapter, we analyze several salient
developments on the road from building single monolithic applications to creating and
deploying applications in the cloud from distributed objects.
the deficits of incoming edges will be zero. Since the sum of the deficits of all outgoing edges is equal to the sum of the deficits of the incoming edges, after a bounded number of steps the computation will return to its neutral state and terminate.
An important contribution of the concept of diffusing computation is that it intro-
duced an abstraction of cooperating distributed objects that receive service requests
from other nodes. Diffusing computation is abstract, since no concrete details of the
message formats, interfaces of computing units, or specific resources assigned to units
are given. As such, diffusing computation offers a simple model that can be instantiated
in a variety of concrete settings. For example, consider a simplified idea of a computa-
tion, which we later call map/reduce, where objects are assigned to different computing
nodes and the dataset is distributed across a subset of these nodes. The objects that run on these nodes preprocess their assigned datasets by mapping the data to smaller datasets, and send the smaller datasets to children nodes that continue the computation until the data is reduced to some value, after which these nodes return the values to their parents as signals of completed computations. The parent nodes aggregate these values and send them as signals to their own parents until the computation terminates.
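This simplified map-then-reduce idea can be sketched in a few lines of Java; the dataset, the slice sizes, and the names are illustrative assumptions:

```java
// Each "node" maps its slice of the dataset to a smaller value (a partial
// sum), and the parent aggregates ("reduces") the children's results.
import java.util.Arrays;
import java.util.List;

public class MapReduceSketch {
    // "map" phase run on one node: reduce a slice to a partial result
    static int mapSlice(List<Integer> slice) {
        return slice.stream().mapToInt(Integer::intValue).sum();
    }

    public static void main(String[] args) {
        // the dataset distributed across three nodes
        List<List<Integer>> slices = Arrays.asList(
            Arrays.asList(1, 2, 3),
            Arrays.asList(4, 5),
            Arrays.asList(6, 7, 8, 9));

        // each node returns its value to the parent as a completion signal;
        // the parent aggregates the children's values
        int total = slices.stream().mapToInt(MapReduceSketch::mapSlice).sum();
        System.out.println(total); // 45
    }
}
```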
A simple form of the diffusing computation is a graph of two nodes where one node
called a client sends a message to the other node called a server. This communication
may be synchronous, where the client will wait, i.e., will block until the server responds
to the message or until some event is produced during the execution of the call by
the server, or asynchronous, where the client will continue its computation while the
server is processing the message. This model is called client/server and it is widely
used to engineer distributed software applications. Technically speaking, this model is
also valid for monolithic applications, where a client object invokes methods of some
server objects. This invocation may be thought of as the client object sending a message to
the server object, even though in reality the invocation is accomplished by loading the
instruction pointer with the address of the code for the invoked method, since the code
is located in the same address space with the caller object. However, if the address
spaces are disjoint, which is the case even when the client and the server are run in
separate process spaces on the same computer, then the invocation is performed using
some kind of inter-process communication mechanism (e.g., shared memory, pipes,
sockets), which can be modeled using message passing.
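A minimal client/server exchange over sockets can be sketched as follows; the port selection, the message format, and the names are our illustrative choices:

```java
// A single-request client/server exchange over a local socket. In a real
// deployment the client and server would run in separate processes, possibly
// on different machines; here a thread stands in for the server process.
import java.io.*;
import java.net.ServerSocket;
import java.net.Socket;

public class ClientServerDemo {
    public static void main(String[] args) throws Exception {
        try (ServerSocket server = new ServerSocket(0)) { // pick a free port
            Thread serverThread = new Thread(() -> {
                try (Socket s = server.accept();
                     BufferedReader in = new BufferedReader(
                             new InputStreamReader(s.getInputStream()));
                     PrintWriter out = new PrintWriter(s.getOutputStream(), true)) {
                    // the server responds to the client's message
                    out.println("echo: " + in.readLine());
                } catch (IOException e) { throw new UncheckedIOException(e); }
            });
            serverThread.start();

            // The client blocks on readLine() until the server responds:
            // a synchronous exchange.
            try (Socket s = new Socket("localhost", server.getLocalPort());
                 PrintWriter out = new PrintWriter(s.getOutputStream(), true);
                 BufferedReader in = new BufferedReader(
                         new InputStreamReader(s.getInputStream()))) {
                out.println("hello");
                System.out.println(in.readLine()); // prints "echo: hello"
            }
            serverThread.join();
        }
    }
}
```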
Finally, using the model of diffusing computation, one can reason about splitting a monolithic computation into a distributed one, so that it can meet various objectives.
Computing nodes in the graph can be implemented as commodity server computers;
edges represent network connectivity among these computers; messages designate the
input data that are split among computing nodes, and signals are output data that the
servers compute. Adding constraints to the model of diffusing computation, we can
create interesting designs, for example, for a web store, where purchasing an item
triggers messages to various distributed objects, which are responsible for calculating
sales tax, shipment and handling, retrieving coupons, among other things. Each of these
objects may send messages to its children objects to trigger subsequent operations, for
example, the shipment and handling component may send messages to components
that compute an optimal shipping route and determine discounts for different shipping
options. Once these computations are completed, children components send results of
their computations with signal messages to their parents, who in turn do the same for
their parent objects until the computation terminates. Thus, diffusing computation is an
important model for understanding how objects interact in a distributed environment.
another using a networking protocol like TCP/IP, whereas at the peer layer the objects
communicate with one another using the high-level communication graph, which is
called an overlay network.
Question 3: Discuss a model of P2P with a central computing unit that allows
P2P nodes to organize into an overlay network by sharing information about their
services with the central computing unit.
Second, the issue of security is important, since peer computers are exposed to
different clients, some of which have malicious goals. A main idea behind many attacks is for a malicious user to masquerade as a peer in order to take over as many computers on the P2P network as possible, for purposes like stealing personal data or money or mounting a bigger attack. Denial of service is one example of an attack in which a malicious
computer floods other computers with messages, so that they allocate an increasing
chunk of their computational resources to processing these messages. Interestingly, the
peer who initiates this process may not even be malicious, since the process can start
with broadcasting a message across a P2P network to which the increasing number
of peers send response messages. Thus, the diffusing computation is aggravated into
a broadcast storm, where the network traffic is saturated with response messages and
new messages are blocked. As a result, the entire P2P network becomes nonresponsive,
a situation that is referred to as network meltdown. Detecting and preventing security
threats is a big problem with P2P computing.
The third category of important questions that P2P computing raises and that is
highly relevant to cloud computing is how to guarantee certain properties of computations in the presence of failures. Unlike client/server networks, where the failure of the server stops computations, there is no single point of failure in P2P networks, because handling peer failures is engineered in by design. In the worst case, when only two peers are present, P2P degenerates into client/server computing; however, as long as a large number of peers join the network and do not all leave suddenly at the same time, single peer failures will not jeopardize the entire network.
Question 4: Suppose that a music player has a security vulnerability that can be exploited by sending a music file with a carefully crafted sequence of bytes. Explain how you would design a P2P music file distribution system to mitigate the effect of such a security vulnerability.
P2P applications are diverse, and music sharing is what gave P2P its widespread cultural significance. We consider two case studies that apply P2P to the more concrete problems of backing up files and finding patterns of extraterrestrial life in radio telescope data. These case studies are
interesting for two reasons. First, Pastiche can be viewed as a precursor to a prolifera-
tion of file backup and sharing cloud applications, where users pay for storing their files
with certain guarantees of fault tolerance and data recovery. Second, Pastiche shows
how difficult it is to obtain guarantees of security and fast access to data in P2P com-
puting. Finally, with SETI@Home, the idea of obtaining slices of CPU and memory
resources of the other participants of the network is realized in the P2P setting.
SETI@Home was a radical departure from algorithm-centric computing, where a small amount of data is processed by complex algorithms, to data-centric computing, where a large amount of data is split across multiple processing nodes to run some algorithm over the data in parallel. In both cases, we see a radical departure both from monolithic programming and from the theoretical model of diffusing computation. The differences with monolithic programming are obvious, since Pastiche and SETI@Home are implemented using distributed objects that communicate by exchanging messages over the Internet. When it comes to the basic distributed computing models, we see that these approaches impose additional constraints related to the allocation of resources (i.e., storage space and CPU/RAM) on demand based on their availability to distributed objects. In addition, we see that these approaches deal with very large data sets with complex structures, often referred to as big data. Processing big data requires clever allocation of resources and high levels of parallelism.
• What happens when peer computers that store backed-up data become unavailable? Recall that there is no centralized control in P2P networks, so computing nodes come and leave at will. Not being able to obtain backed-up data when the main storage has failed renders such a backup system unusable.
• Next, how does a peer find other peers to store backup data? How many peers should be chosen for the backup?
• Since a P2P network may have tens of thousands of connected peers, how can peers that store backup data be located so that the data can be retrieved efficiently without flooding the network?
As the reader can see, just because an idea is plain and can be stated in simple language does not mean that it can be easily implemented. All of the issues raised are serious enough to render naive implementations completely unusable.
collision. Many hash functions provide theoretical guarantees that collisions are very
rare events.
Once fingerprints for files are computed, lower bits of these fingerprint values are
used to determine offsets in the data to break it into chunks. Before these chunks are
distributed across some peers in the overlay network, these chunks are encrypted using
convergent encryption, where encryption keys are derived from the content of these
data chunks using hash functions. Convergent encryption gained popularity in cloud
computing, since it enables automatic detection and removal of duplicate data while
maintaining data confidentiality.
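The idea of deriving the encryption key from the chunk's own content can be sketched as follows; the hash, key size, and cipher mode are our assumptions rather than Pastiche's exact parameters, and ECB mode is used here only to make the determinism visible (it is not a secure choice in general):

```java
// Convergent encryption: the AES key is derived from a hash of the chunk
// itself, so identical chunks encrypt to identical ciphertexts and
// duplicates can be detected without reading the plaintext.
import java.security.MessageDigest;
import java.util.Arrays;
import javax.crypto.Cipher;
import javax.crypto.spec.SecretKeySpec;

public class ConvergentEncryption {
    static byte[] encryptChunk(byte[] chunk) throws Exception {
        // key = hash of the content (truncated to 128 bits for AES-128)
        byte[] digest = MessageDigest.getInstance("SHA-256").digest(chunk);
        SecretKeySpec key = new SecretKeySpec(Arrays.copyOf(digest, 16), "AES");
        Cipher cipher = Cipher.getInstance("AES/ECB/PKCS5Padding"); // demo only
        cipher.init(Cipher.ENCRYPT_MODE, key);
        return cipher.doFinal(chunk);
    }

    public static void main(String[] args) throws Exception {
        byte[] c1 = encryptChunk("same chunk".getBytes());
        byte[] c2 = encryptChunk("same chunk".getBytes());
        // identical plaintext chunks yield identical ciphertexts,
        // which is what enables deduplication across users
        System.out.println(Arrays.equals(c1, c2)); // true
    }
}
```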
SETI@Home is not the only project that uses this idea to distribute computations to computers owned by volunteers. Other projects, named Milkyway@Home, Folding@Home, and Einstein@Home, also use this computing model, with a common theme in which the data items in the data sets are largely independent from one another, so the data sets can be easily split and distributed among different computing nodes. As in Pastiche, available resources of computing nodes are borrowed to achieve some computing goal. However, @Home applications operate on big data that does not have to be stored across multiple nodes. On the contrary, this big data is mapped to computing nodes and processed by algorithms that run on these computing nodes, which reduce the input data to much smaller results that are sent to other nodes for further processing.
grid or will be repurposed to solve some other problem. Whereas a user is presented with a view of a single computer, this view is virtual: the user's computer runs a grid computing application that simulates and supports this view using resources from computing devices of volunteers who connect to this grid computing application. Applications of grid computing are computationally intensive, and they process large amounts of data to determine certain patterns, for example, in genome discovery, weather forecasting, and business decision support.
difficult to guarantee. Distributing sensitive data to computers that are not owned or controlled by companies may violate data confidentiality policies. Finally, allocating enough resources for computational tasks is a problem in a context where tens of thousands of tasks compete for resources. Since computing resources are limited, balancing these resources among different tasks is a big and important problem.
Several trends emerged in the mid-2000s. Companies and organizations
started looking to outsource the business of creating and owning datacenters. At the
same time, prices for computers were dropping precipitously, and multicore computers
with clock rates of several gigahertz were becoming a commodity. Large data processing
companies like Google, Amazon, and Facebook built datacenters to store and process big
data and to support millions of customers using web interfaces. That is, each of these
companies presented its abstraction as a distributed object with well-defined interfaces
that its customers can access via browsers. As a natural progression, the idea emerged
to build commercial
datacenters that host tens of thousands of commodity computers. Users can create
accounts with companies that own these datacenters and buy computing services at
predefined rates. To the users, the datacenter looks like a big cloud that contains com-
puting resources. The users deploy their software applications and send requests to
clouds, which process these requests and send response messages. Hence, cloud com-
puting was born.
A few definitions of cloud computing exist. One definition by a group that includes
a founder of grid computing states that cloud computing is a large-scale distributed
computing paradigm that is driven by economies of scale, in which a pool of abstracted,
virtualized, dynamically scalable, managed computing power, storage, platforms, and
services are delivered on demand to external customers over the Internet [53]. A pop-
ular definition is given by the National Institute of Standards and Technology (NIST)
that states that cloud computing is a model for enabling ubiquitous, convenient, on-
demand network access to a shared pool of configurable computing resources (e.g.,
networks, servers, storage, applications, and services) that can be rapidly provisioned
and released with minimal management effort or service provider interaction [97].
To understand these definitions, let us analyze their key elements. A large-scale
distributed computing paradigm implies the departure from small distributed appli-
cations where a handful of clients interact with a single object located in a separate
address space. The emphasis is on large-scale, where millions of clients access hun-
dreds of thousands of distributed objects that interact asynchronously and for which
specific availability and reliability guarantees are provided with Service Level Agree-
ments (SLAs) that include constraints on the protocols between clients and server ob-
jects, guarantees, and penalties for violating these guarantees. This paradigm is driven
by economies of scale, a concept in economics that with the increase of production
of certain units in an enterprise, the production cost of each unit drops. For example,
moving from hand-crafting each car to their mass production using automated manu-
facturing lines reduced the cost of cars drastically. The idea of applying economies of
scale to cloud computing is that by aggregating commodity computers in a datacenter
and selling slices of computing resources to customers, the cost of solving computing
tasks is drastically reduced: customers do not have to own this cloud computing
infrastructure, they rent it, and its cost is amortized across many customers.
Question 10: Discuss how resource (de)provisioning can affect the functional
correctness of an application.
Question 11: Discuss how software bugs in the implementations of the virtual
CPU can affect the correctness of cloud-based applications that run on this vCPU.
Recall that a key element of cloud computing is that users pay for provisioned
resources. Unlike grid computing, users of cloud computing enter payment information
when they create their accounts. Once the account is created and a payment source
is verified, all services that the user requests from the cloud are reflected in a bill that
applies charges for provisioned resources. In this respect, cloud computing makes a significant
departure not only from grid computing and other forms of P2P computing, but also
from a custom-built company-owned datacenter. Once the capital expenditure is made
in the latter, there is little need to dynamically (de)provision resources, since these
resources are already paid for and applications grab as many resources as possible, even
if they do not fully utilize them. However, having to pay for underutilized resources
in the cloud changes the nature of application deployment, since the cost of computing
may be so high that it may not be cost-effective to move applications to the cloud.
Thus, it is important to take into account how to engineer applications from distributed
objects, so that their deployment in the cloud is cost-effective. This is a major
topic of this book.
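To see why underutilized resources matter in this pay-per-use model, consider a toy billing calculation; the hourly rate, VM count, and duration below are made-up numbers, not any provider's actual pricing.

```java
// Sketch of usage-based billing: every provisioned resource accrues charges
// at a predefined rate for as long as it stays provisioned, whether or not
// the application actually uses it.
public class UsageBilling {
    // Total charge = rate per unit-hour * number of units * hours provisioned.
    static double charge(double hourlyRate, int units, double hours) {
        return hourlyRate * units * hours;
    }

    public static void main(String[] args) {
        // Five VMs at a hypothetical $0.10/hour, provisioned for 24 hours,
        // cost $12.00 even if the application leaves them mostly idle.
        System.out.println(charge(0.10, 5, 24.0));
    }
}
```

This is the economic pressure that makes engineering applications for cost-effective cloud deployment worthwhile: in a company-owned datacenter the idle resources are already paid for, whereas in the cloud they show up on every bill.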
2.5 Summary
In this chapter, we trace the path from early ideas of distributed computing to cloud
computing. We start off by describing the abstract model of diffusing computation,
where computing units are organized in a tree, and then we move on to the peer-to-peer
and grid computing implementations that were introduced decades later. Along the way we
introduce key milestones and seminal ideas and concepts that we will use throughout
this book. Even though peer-to-peer and grid computing with the various @Home projects
deserve a separate book, we extract only the most representative ideas that are essential to
understanding how we arrived at cloud computing. We conclude with a short discussion
of virtualization and cloud datacenters to give readers a convergent view of cloud
computing as a natural progression of early ideas and implementations of research in
distributed computing.
Chapter 3

Model and Operations
Large complex software applications require sophisticated hardware and software in-
frastructures in order to deliver high quality services to their users. As we already
discussed, historically, many organizations have chosen to own, manage, and oper-
ate their infrastructures, bearing the responsibility for guaranteeing that they are suf-
ficient and available. As a result, such organizations must periodically evaluate their
infrastructures and forecast their resource needs (i.e., CPUs, memory, storage). Armed
with these forecasts they then purchase and maintain the resources required to meet
those needs. Since it is difficult, if not impossible, to exactly predict resource needs,
especially over a long period of time, these organizations often end up grossly over-
provisioning or grossly under-provisioning their infrastructures. Consequently, they
have found it difficult to cost-effectively provide for their long-term computing needs.
In this chapter, we describe a model of cloud computing and how it is realized using
different operational environments.
32 CHAPTER 3. MODEL AND OPERATIONS
Question 1: Explain how Pastiche-related data hashing can be used for load
balancing in the cloud.
In our model, physical hardware, the LB, VMs and communication interfaces com-
prise the key elements of the cloud computing platform and are the key means through
which service costs and resource allocation are controlled and manipulated. Client re-
quests (shown with incoming arrows in Figure 3.1) arrive at the LB, which then distributes
these client requests to specific VMs. Some VMs (e.g., VM2 ) may not receive requests
directly from the load balancer but from other VMs. Resources are allocated to VMs to
speed up the execution of the components that process incoming requests. A simplified
sequence of steps for cloud processing is the following.
1. The LB receives a set of requests {R} from different clients of the hosted software.
2. The LB forwards these requests to specific VMs.
3. The VMs process the requests and produce responses.
4. Performance counters are collected and analyzed to determine how the provisioned resources are utilized.
scaling up and scaling out. For example, since VM1 contains more components that in-
volve computationally intensive operations, it is scaled up by assigning three CPUs and
memory units. Alternatively, the cloud could scale out this application by replicating
VMs (e.g., VM4 and VM5 ), thus enabling multiple requests to be processed in paral-
lel. That is, the cloud uses two main scaling operators: sres(r, a, i) and sinst(a, i),
where r is the type of a resource (e.g., memory, CPU), a is the amount, and i is the VM
identifier. The scaling operator sres (de)allocates the resource, r, in the amount a to
the VM, i, and the scaling operator sinst (de)allocates a instances of the VM, i. In
theory, elastic clouds “know” when to apply these operators to (de)allocate resources
with high precision.
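The semantics of the two scaling operators can be captured with simple bookkeeping, as in the following Java sketch. The class is illustrative, not a real cloud API; it only records the effects of sres and sinst as defined above.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the two scaling operators from the text: sres(r, a, i)
// (de)allocates an amount a of resource r to VM i (scaling up/down), and
// sinst(a, i) (de)allocates a instances of VM i (scaling out/in).
public class ScalingOperators {
    // resources.get(i).get(r) = amount of resource r held by VM i
    final Map<Integer, Map<String, Integer>> resources = new HashMap<>();
    // instances.get(i) = number of replicas of VM i
    final Map<Integer, Integer> instances = new HashMap<>();

    // sres: a negative amount a deallocates the resource.
    void sres(String r, int a, int i) {
        resources.computeIfAbsent(i, k -> new HashMap<>())
                 .merge(r, a, Integer::sum);
    }

    // sinst: a negative a removes replicas of the VM.
    void sinst(int a, int i) {
        instances.merge(i, a, Integer::sum);
    }
}
```

For example, sres("CPU", 3, 1) scales VM 1 up by three CPUs, and sinst(2, 4) scales the application out by two replicas of VM 4.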
Unfortunately, the gap between the theory and practice is large. Consider a situa-
tion when an application load increases, e.g., numerous customers flock to a web-store
application to buy a newly released electronic gadget. Once the LB receives an increased
number of requests in step 1, the cloud should provision resources to VMs in
anticipation of the increased load, so that by the time the LB forwards client requests
in step 2, the VMs are ready to process them in step 3. In general, the cloud
“does not know” what scaling operator to apply and what parameters to use for these
operators. It is only in step 4, after analyzing performance counters, that the cloud enables
stakeholders to apply scaling operators. The delay between steps 2 and 4 leads to
over- and under-provisioning, which is typical for existing cloud infrastructures [62].
Naturally, if the cloud “knew” the time and extent to which resource demands would
change then it could precisely and proactively (de)allocate resources, thereby improv-
ing its elasticity and resulting in Δr → 0, Σg → 0.
The attraction of elastic clouds is that stakeholders pay only for what they use,
when they use it, rather than paying up-front and continuing costs to own the hard-
ware/software infrastructures and to employ the technical staff that supports them [13,
23,24,93,98,123]. Of course, in practice, even the most elastic clouds are not perfectly
elastic [77]. Understanding when and how to reallocate resources is a thorny problem.
Allocating those resources takes time, and it is generally impossible to quickly and
accurately match resources to applications’ needs over extended periods of time. An
article by the Google Cloud team underscores this point as it describes a state of the art
supervisory system that lets applications monitor various black box metrics and then
direct the cloud to initiate scaling operations based on that data [62]. However, the very
existence of the system confirms the large gap between the promise and the reality of
elastic cloud computing. For example, a documented limitation of this system is that its
polling approach substantially lags changes in resource usage, rather than proactively
anticipating changes. As a result, the main problem of cloud computing elasticity is either
under-provisioning applications, where they lack the resources to provide an appropriate
quality of service, or over-provisioning, where stakeholders pay for resources that their
applications do not fully use.
Provisioning resources for applications deployed in the cloud is a manual, intellectually
intensive, and laborious effort, and the resulting provisioning strategies are often not as
effective as they need to be. It is a symptom of a bigger
problem where software engineering is viewed orthogonal to cloud computing – with
current approaches, stakeholders design and build software applications as if a fixed,
even if sometimes abundant, set of resources were always available. Since no soft-
ware engineering models are used to deploy applications in the cloud, resources are
frequently either over- or under-provisioned, thereby increasing the applications’ cost
and degrading their quality of service. With global spending on public cloud services
estimated to reach $110.3 billion [12, 58] by 2016, and with an estimated $20 billion of US
federal government spending on cloud computing [84], the cost of not having truly elastic
clouds will be measured in the billions of dollars [14].
Most current software development approaches pre-date the cloud, and as a result
they effectively treat cloud applications no differently than traditional distributed sys-
tems. If anything, some might argue that separating the software system from the cloud
execution platform should allow engineers to worry much less about performance, as it
will, after all, be boosted by the cloud. While it is true that developing software for pub-
lic cloud computing differs little from developing for in-house infrastructures, there are
at least two fundamental differences between these environments. First, cloud computing
depends on time-varying resource allocation and, second, it apportions costs based
on runtime usage. In fact, these features are necessary to enable and manage cloud elas-
ticity [117]. In addition, cloud providers rely on slow, reactive, black-box approaches
to dynamically provision resources, which leads to higher cost and degraded perfor-
mance [29, 33, 68, 110, 140].
Thus, there is a problematic disconnect between existing software engineering ap-
proaches and the needs of applications that are deployed in the cloud. With current
approaches, stakeholders design and build software applications as if a fixed and abun-
dant set of resources would always be available. This, however, is simply not true for
cloud computing. Resources are dynamically allocated and deallocated to different
components, which has important cost and performance implications that can even
violate assumptions made in requirements specifications.
Developers need such support throughout the software lifecycle to answer ques-
tions, such as: how to assign features to components and then how to assign resources
to those components in ways that reduce overall cost and maximize performance; how
to ensure that performance problems are found during performance testing, so that
these problems will not result in the excessive costs when the application runs in the
cloud; how to ensure functional correctness when scaling out the application automat-
ically; and how to ensure that scaling strategies are appropriate for a given application.
Providing support to software engineers who create, deploy, and maintain software
applications for the cloud requires the creation of a paradigm that addresses multiple
cloud-relevant engineering concerns throughout the software engineering lifecycle. For
instance, both the engineering approaches that developers use to create applications and
the cloud computing infrastructures that ultimately run these applications must take
resource variability and cost models into account. For example, with usage-based cost
models, system performance problems should map directly to quantifiable economic
losses. These considerations can be used both to reason about and direct a system’s
up-front design, and to direct resource provisioning.
Figure 3.3: An illustrative example of the CUVE for a cloud-based application. The timeline
of the operations is shown with the horizontal block arrow in the middle. The process starts
with the customer who defines elasticity rules on the left, and the events that lead to the
CUVE on the right are shown in a fishbone presentation sequence. (With main contributions
from Messrs. Abdullah Allourani and Md Abu Naser Bikas)
In such cases, stakeholders pay for resources that are not used by their applications for some period of time.
Creating an automatic approach for generating rules that (de)provision resources
optimally based on an application’s behavior is an undecidable problem, because it is
impossible to determine in advance how an application will use available resources
unless its executions are analyzed with all combinations of input values, which is often
a huge effort. Currently, many rules are created manually to approximate a very
small subset of an application’s behavior, and clouds often (de)provision resources
inefficiently in general, thus resulting in the loss of customers’ time and money [71].
Automatically creating test input workloads that detect situations when customers pay
for resources that are not fully used by their applications while, at the same time, some
performance characteristics of these applications are not met, i.e., the Cost-Utility
Violations of Elasticity (CUVE), is a big problem.
utilization. The outer measurements on the vertical axis indicate the number of the
provisioned VMs and the response time in seconds, and the solid red line shows the
threshold of a service level agreement (SLA) that indicates a desired performance level
(i.e., the response time).
We show that a rapid change from the workload 2 to the workload 3 results in a
situation where the cloud allocates resources according to the rule based on workload
2, whereas different resources are needed to maintain a desired level of performance
for workload 3, a short moment after the provisioning is made for workload 2. Find-
ing such black swan workloads that lead to the CUVE is very important during stress
testing, where the SLA is violated and the cost of deployment is high because of the
provisioned resources. The cost and the performance move in opposite directions.
Once known, these black swan workloads and rules can be reviewed by developers and
performance engineers, who optimize the rules to achieve a better performance of the
corresponding application. We show how the interactions between workloads and rules
lead to the CUVE problem.
Consider what happens in the illustrative example with the commonly recom-
mended rule that specifies that the cloud infrastructure should allocate one more VM if
the utilization of the CPUs in already provisioned VMs exceeds 80%. As an example,
we choose the initial configuration of five VMs at the cost of $2 at the time t1 . We
rounded off the cost for the ease of calculations and based it on the pricing of vari-
ous cloud computing platforms [9, 63, 103]. Then, a CPU-intensive workload triggers
the rule at the time t2 . A new VM will be provisioned after some startup time while
the owner of this application is charged an additional $2 at the time t2 . The VM will
become available to the application at t2 + tVMs, where tVMs is the VM startup time.
Suppose that allocating one more VM in this example decreases the CPU utilization to
35% whereas the memory utilization remains the same at 30%. The new workload 2
leads to a significantly increased CPU utilization, and another VM is allocated at the
time t3 . This is in a nutshell how an elastic cloud works.
Suppose that the response time for the application should be kept under two sec-
onds according to the SLA that is specified by the applications’ owners, and a goal
of the elastic rules is to provision resources to the application to maintain the SLA.
The response time is kept below the SLA threshold until the time t4, when the workload rapidly
changes. The new workload 3 leads to a significant burst in the memory usage whereas
the utilization of the CPUs in already provisioned VMs remains low at 40%. The mem-
ory utilization increases to 90%, and there is no rule that can be triggered in response,
thus, subsequently, there is no action taken by the cloud to alleviate this problem. The
CPUs wait for data to be swapped in and out of memory, and they spend less time ex-
ecuting the instructions of the application. As a result, the application’s response time
increases, thus eventually breaking the SLA threshold. Furthermore, at a 40% higher
cost, the SLA is violated and the performance of the application worsens significantly,
while the application’s owner pays for resources that are under-utilized.
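The interaction just described can be condensed into a few lines of Java. The rule below is the commonly recommended 80% CPU threshold from the example; the absence of any memory rule is exactly what lets the memory burst go unhandled. This is a sketch of rule evaluation, not a real autoscaler.

```java
// Sketch of the threshold rule from the example: allocate one more VM when
// CPU utilization of the provisioned VMs exceeds 80%. A memory burst
// triggers nothing, illustrating the CUVE situation described in the text.
public class ElasticityRule {
    int vms; // number of currently provisioned VMs

    ElasticityRule(int initialVms) { this.vms = initialVms; }

    // Returns true if the rule fired and a VM was provisioned.
    boolean evaluate(double cpuUtil, double memUtil) {
        if (cpuUtil > 0.80) {
            vms++; // scale out by one VM
            return true;
        }
        // No rule watches memUtil, so even 90% memory utilization leaves
        // the configuration unchanged while the response time degrades.
        return false;
    }
}
```

A CPU-intensive workload (e.g., 85% CPU) fires the rule and adds a VM, whereas workload 3's memory burst (40% CPU, 90% memory) fires nothing, so the owner keeps paying for the previously added VMs while the SLA is violated.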
An approach for measuring cloud elasticity involves the following steps: 1) submit
a workload that follows some pattern; 2) measure the demand that the workload puts on the
cloud platform; 3) measure the supply of the resources made available by the platform; 4)
measure the latency and some other performance aspects; 5) calculate penalties for
under- and over-provisioning of the resources and add up the cumulative penalties [77].
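Step 5 of this approach can be sketched as follows; the penalty weights are illustrative placeholders rather than values prescribed by [77], and demand and supply are given here as per-interval VM counts.

```java
// Sketch of the penalty calculation for elasticity measurement: given the
// measured demand and supply per time interval, accumulate a penalty when
// demand exceeds supply (under-provisioning) and when supply exceeds
// demand (over-provisioning).
public class ElasticityPenalty {
    static double penalty(int[] demand, int[] supply,
                          double underWeight, double overWeight) {
        double total = 0;
        for (int t = 0; t < demand.length; t++) {
            int gap = demand[t] - supply[t];
            if (gap > 0) total += underWeight * gap;  // under-provisioned
            else total += overWeight * (-gap);        // over-provisioned
        }
        return total;
    }
}
```

Weighting under-provisioning more heavily than over-provisioning (as in the usage below) reflects that SLA violations are usually costlier than paying for a few idle VMs, but the weights are a policy choice.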
Question 8: Design and implement a software stack for web page indexing
applications that collect information about web pages on the Internet.
There are currently many software stacks, and their number grows every month.
One of the popular software stacks that we will discuss in this book is Map/Reduce,
where a specialized file system, the Google File System (GFS), together with a set of
specialized libraries, is built for large-scale parallel processing of data using
commodity hardware.
Some popular software stacks include LAMP that consists of the Linux OS, Apache
HTTP server, the MySQL relational database, and the PHP programming language;
LYME that consists of the Linux OS, Yaws web server, the Mnesia or the CouchDB
databases, and the Erlang functional programming language; and OpenStack, which is
an open-source cloud environment implementation that runs on Linux. A variant of
LAMP that substitutes Windows OS for Linux is called WAMP.
A common thread for different software stacks is hiding some services within the
stack and exposing a limited set of interfaces to the external users of these stacks. Natu-
rally, doing so enables users to concentrate on specific tasks without having to manage
unrelated components of the stack. More importantly, depending on the organization of
the software stack, the user may not have any access to certain settings of the compo-
nents of the stack. For example, allowing some users to configure the virtual memory
settings of the underlying OS may affect other users who run applications on top of the
VM that hosts this OS. Thus, it is important to understand not only what interfaces are
exposed by different components of a software stack, but also what types of users can
control what resources.
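The benefit of hiding parts of the stack behind a narrow interface can be illustrated in Java. The DatabaseService interface and the virtual-memory setting below are hypothetical names chosen only to show how a stack component exposes a limited interface to its users while keeping platform-level settings out of their reach.

```java
// Sketch of interface exposure in a software stack: users see only the
// narrow DatabaseService interface, while platform-level settings stay
// accessible only to the stack's own management code.
public class StackExample {
    // Public, user-facing interface of the stack component.
    interface DatabaseService {
        String query(String sql);
    }

    // Internal implementation that also owns a platform-level setting which
    // the stack deliberately does not expose through DatabaseService.
    static class ManagedDatabase implements DatabaseService {
        private int virtualMemoryMb = 4096; // hidden OS-level setting

        public String query(String sql) { return "result of: " + sql; }

        // Visible only to the provider's management code, not to users, so
        // one user cannot change a setting that affects other tenants.
        void setVirtualMemoryMb(int mb) { this.virtualMemoryMb = mb; }
    }

    // Users obtain the service only through its restricted interface.
    public static DatabaseService connect() { return new ManagedDatabase(); }
}
```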
Finally, and probably most importantly, public clouds make it difficult for compa-
nies to comply with legal and government regulations, e.g., Sarbanes-Oxley and the
Health Insurance Portability and Accountability Act of 1996 (HIPAA). The latter spec-
ifies physical and technical safeguards to ensure the privacy and the confidentiality of
a patient’s electronic protected health information (ePHI). HIPAA specifies strict fa-
cility access and control, where all users must be authorized to access computers that
store data, which is very difficult to accomplish with public clouds. Technical safe-
guards ensure that only authorized users can access ePHI, with unique user identifiers,
activity auditing, automatic log-off, and encryption and decryption. Public clouds make
it difficult to balance two goals: multitenancy and strict control of accesses to resources.
Since public and private clouds have benefits and drawbacks and they complement
one another to a certain degree, hybrid clouds have become popular, where a company or
an organization creates and integrates its private cloud with its account at a public cloud
provider. For example, vCloud Connector from VMware links private and public clouds
in a unified view, so that users can operate a hybrid cloud without dealing separately
with particular aspects of the constituent private and public clouds. A common scenario
is to establish a secure virtual private network (VPN) connection to VMs in the
public cloud, so that data can be shared across these clouds as if the VMs that are
hosted in the public cloud were part of the private cloud. Finally, community clouds
are built and controlled by groups of non-competing companies and organizations that
share goals and resources. Non-competitiveness is important, since participants of the
community cloud should trust one another. For example, an educational community
cloud can provide access to learning resources to users who want to learn about the
state-of-the-art technologies created and sold by the companies and organizations that
are the founding members of the community cloud.
Figure 3.4: Java-like pseudocode for a client that uses Google SaaS spreadsheet service.
The pseudocode in Figure 3.4 can be converted into a working Java program (we leave this as a homework exercise). The
class GoogleSpreadsheetClient contains the method main that creates an instance of
the class SpreadsheetService in line 9. In order to link this service with a specific
account, the user must authenticate herself as it is done in lines 10–11 using the class
OAuthParameters that implements the open-source web authentication protocol called
OAuth. The discussion of how OAuth works is beyond the scope of this chapter; readers
can consult the appropriate standards available on the Internet2.
Once the client authenticates itself with the Google Docs SaaS, the url object is
created in line 12 where the client provides a valid URL to the location of a spreadsheet.
This object is used as the first parameter in the method call getFeed for the object
service to obtain the object feed of the type SpreadsheetFeed in line 13. The purpose
of this class is to obtain a list of spreadsheets at this URL in line 14. Then in lines
15–26 the for loop is executed where each spreadsheet entry is obtained by the index in
line 16, its title is printed in line 17, and then for the first worksheet of each spreadsheet
its list of cells is obtained in lines 24–25 and then their values are printed in line 26.
Many other Google Docs SaaS applications have a similar API call structure and the
conceptual structure of the clients is similar to this example.
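Since the code of Figure 3.4 is not reproduced here, the following self-contained sketch mimics its structure with hypothetical stub types in place of the real gdata library classes; only the names SpreadsheetService and SpreadsheetFeed are taken from the discussion above, and the stub methods are assumptions made purely to show the client's call structure.

```java
import java.util.List;

// Hypothetical stubs standing in for the gdata classes discussed above;
// the goal is only to show the client's flow: authenticate, obtain a
// feed of spreadsheets, and iterate over the entries.
public class GoogleSpreadsheetClientSketch {
    interface SpreadsheetEntry { String getTitle(); }
    interface SpreadsheetFeed { List<SpreadsheetEntry> getEntries(); }
    interface SpreadsheetService {
        void setOAuthCredentials(String token); // stands in for OAuthParameters
        SpreadsheetFeed getFeed(String url);
    }

    // Mirrors the flow described for Figure 3.4: authenticate, fetch, iterate.
    static void listSpreadsheets(SpreadsheetService service, String url) {
        service.setOAuthCredentials("user-token");   // authenticate the client
        SpreadsheetFeed feed = service.getFeed(url); // obtain the feed object
        for (SpreadsheetEntry entry : feed.getEntries())
            System.out.println(entry.getTitle());    // print each title
    }

    // A tiny in-memory implementation so the sketch runs without the real API.
    static SpreadsheetService demoService() {
        return new SpreadsheetService() {
            public void setOAuthCredentials(String token) { /* no-op in the demo */ }
            public SpreadsheetFeed getFeed(String url) {
                return () -> List.<SpreadsheetEntry>of(() -> "Budget", () -> "Grades");
            }
        };
    }

    public static void main(String[] args) {
        listSpreadsheets(demoService(), "https://example.com/spreadsheets");
    }
}
```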
2 Specifications of the OAuth authorization framework that enables third-party applications to obtain ac-
A key point of this example is to illustrate that the users of SaaS do not have any ac-
cess to the underlying hardware or the operating system or other low-level components
of the application platform. SaaS applications expose interfaces whose methods enable clients
to obtain access to the application objects, to create and destroy them, and to change their
states. However, clients do not have a choice of the platform on which SaaS applications are
deployed; they are often not even given interfaces to obtain information about the
platform. On the one hand, this is a strength of SaaS, since clients do not have to worry
about the underlying platforms (i.e., the stack); on the other hand, clients are limited in
their ability to control and administer the entire stack.
Question 11: Argue pros and cons of a SaaS application exposing the under-
lying hardware or OS interface.
A key characteristic difference between SaaS, PaaS, and IaaS is the level of control
that clients can exercise over the server stack. As we can see from the code fragment in
Figure 3.4, the client has the access only to application-level objects (e.g., spreadsheets
and their properties and values) using well-defined interfaces that the application ex-
poses, and not to the VM and OS objects and definitely not to hardware components.
In PaaS and IaaS, the client can access and manipulate objects that are defined in the
scopes other than the application-level scope.
NIST introduces two terms that define capabilities of clients in the cloud: control
and visibility. Controlling resources means that a client can determine who can access
and invoke methods of objects that the client owns in the cloud environment, and having
the visibility means that a client can obtain statuses of objects that are controlled by
other clients [97]. In SaaS, clients can have the visibility and control over application-
level objects. A cloud provider controls the hardware, the operating system settings,
and other components of the software stacks installed in the VMs. Of course, when
accessing SaaS via browsers, it is important to ensure cross-browser compatibility, so
that the users of different browsers can have the same experience working with SaaS
applications. Security is always a concern, since using SaaS interfaces may expose
security vulnerabilities of the SaaS implementations.
A limited level of control and visibility carries its own benefits. Since SaaS is in-
stalled once in the cloud, it does not have to be installed on clients’ computers, and there
are no installation and distribution costs. Licensing is improved – clients do not need
to purchase multiple licenses and engage in complicated license verification protocols
when accessing SaaS objects from multiple computers, since the SaaS server handles
license verification automatically when clients access SaaS objects. Finally, clients do
not have the visibility and control of the underlying hardware, operating systems, and
various software packages that are used to host SaaS applications. On the one hand, the
inability to control the underlying software and hardware stack may negatively affect
the ability of the clients to achieve better performance of the SaaS application; on the
other hand, the clients do not have to be burdened with the complexity of the underlying
layers of software and hardware, which is exactly a main selling point of SaaS.
Engineering SaaS applications is a critical issue, since many clients share the same
applications. Client objects must be effectively isolated from one another, since each
client has sensitive data that is processed by the same SaaS possibly on the same phys-
ical computers. If a SaaS application has bugs that expose data structures created by
one client to other clients, then sensitive data in one application may be inadvertently
exposed to other applications that run on the same physical computer. Needless to say,
security and privacy problems take a long time to resolve, and users of
SaaS should be aware of the potential implications of selecting a specific cloud model.
Key differences between SaaS and PaaS are in the availability of programming
access to the stack services below the applications and the level of control over the
application and the underlying stack. Unlike in SaaS, in PaaS the cloud provider has no control
over the applications that customers choose to build, deploy, and ultimately control in the
cloud. On the other hand, in PaaS, just like in SaaS, the cloud customer has no control
over the operating system and the hardware. Therefore, the key transfer of control in
PaaS from the cloud provider to its customers is in the application layer and in the
programming layers. Choosing the PaaS model over SaaS makes sense when cloud
customers want to build their own applications and deploy them in the cloud.
The success of specific PaaS implementations depends upon the availability of a
broad range of programming languages, database management software, various tools
and frameworks for application development. Some of these tools and frameworks
may be optimized for specific operating system/hardware configurations. For example,
a cloud provider may enable interfaces for data caching, where cache storages are al-
located on the same hardware where the application runs. That is, whereas the cloud
customers have no control over hardware and how the cloud provisions it to applica-
tions, using cache interfaces in PaaS will tell the cloud to position the application data
3 https://cloud.google.com/appengine/
caches in close proximity to the CPUs that execute the instructions of the corresponding
applications. Cloud providers strive to make their PaaS offerings attractive
to their customers, so that they can develop applications that will deliver services to
their clients more effectively and efficiently than other competing cloud providers.
Different PaaS offerings by many cloud providers create a situation that is called
customer lock-in (also known as vendor lock-in or proprietary lock-in). Since different
cloud providers offer different programming interfaces and software stack services, an
application that is developed using AWS PaaS is not likely to run on VMware PaaS.
To avoid this situation, customers can build their applications using general-purpose
programming languages and standard development libraries; however, they will most
likely not be able to take advantage of the optimized cloud PaaS interface offerings.
In addition, since security and privacy are big concerns, many PaaS interfaces use
platform-specific authentication and protection mechanisms, and if an application uses
these interfaces, it may be locked into the specific PaaS offering from the moment that
the application is conceived. Thus, inter-cloud portability and customer lock-in are big
problems and there is no solution that can address all aspects of these problems.
Figure 3.5: Example of HTTP POST request to create a VM at a cloud hosted by Oracle.
1 "uri" : "/em/cloud/iaas/server/byrequest/1" ,
2 "name" : "VDOSI VM Creation 1345391921407" ,
3 "resource_state" : {
4 "state" : "INITIATED" ,
5 "messages" :
6 [ {
7 "text" : "The Request with ID '1' is scheduled",
8 "date" : "2016-02-24T13:12:31"
9 } ]} ,
10 "context_id" : "101" ,
11 "media_type" : "application/oracle.com.cloud.common.VM+json" ,
12 "service_family_type" : "iaas" ,
13 "created" : "2016-02-24T13:12:31"
Figure 3.6: Example of a response to the HTTP POST request to create a VM that is shown in
Figure 3.5.
VM and the second number, 1000, specifies the speed of each CPU in MHz. In line
3, the value for memory is 4000 MB, and the list of parameters in lines 4–6 is used by
the Oracle IaaS cloud manager to configure the VM, including naming its clusters and
specifying root passwords. The latter may be a security risk if the HTTP request
is sent as unencrypted cleartext. Using IaaS cloud interfaces, it is possible to create a
program that scales out the application deployment based on factors that are beyond
the control of the cloud provider.
An example of the HTTP response message is shown in Figure 3.6. Lines 1 and 2
give the path to the created VM and its allocated name, respectively. The key uri
stands for Uniform Resource Identifier, a sequence of characters that uniquely identifies a
resource on a network. The reader is already familiar with the Uniform Resource Locator
(URL) notation, which, in addition to serving as a URI for a resource on the World-Wide Web (WWW),
specifies how this resource is accessed (e.g., via HTTP when the URL starts with http://
or its Secure Socket Layer (SSL) version https://) and where it is located on the network.
For example, the URL http://www.cs.uic.edu/Lab/Service.html specifies
that the access mechanism is the HTTP protocol, which is used to access the server
cs.uic.edu located on the WWW, and that the resource /Lab/Service.html is accessed
on this server. Finally, a Uniform Resource Name (URN) specifies a
namespace for accessing a resource (e.g., to access a map data object, a URN may
look like the following: urn:gmap:latitude:41.8752499:longitude:-87.6613012). Both
URLs and URNs are subsets of URIs.
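The decomposition of a URL into its URI components can be illustrated with Python's standard urllib.parse module; the URL is the one from the text, and the use of Python here is purely illustrative:

```python
from urllib.parse import urlparse

# Decompose the example URL into its components: the scheme tells the
# client how the resource is accessed, the network location names the
# server, and the path names the resource on that server.
url = "http://www.cs.uic.edu/Lab/Service.html"
parts = urlparse(url)

scheme = parts.scheme   # access mechanism, "http"
server = parts.netloc   # server on the network, "www.cs.uic.edu"
resource = parts.path   # resource on this server, "/Lab/Service.html"
```

Changing the scheme to https would select the SSL-protected access mechanism while leaving the server and resource parts unchanged.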
In lines 3–9 the state of the created resource is given along with additional information,
such as the date of the request and the creation date. The context of the created
VM is given in line 10; it is a number that the client will use to send requests to
this specific VM. The protocol that the client should use to communicate with the cloud
manager about the created VM is given in line 11, the service type, IaaS, is
specified in line 12, and the time of creation of the VM is given in line 13. Using this
information, the client can submit further requests to the cloud manager to control the
lifecycle of the VM.
The IaaS model offers great flexibility in creating, controlling, and maintaining
software applications in the cloud. Of course, with this greater level of control, it is the
responsibility of the cloud customers to ensure that they develop and test software applications
in a way that allows them to extract significant benefits from the large number of
available resources. Since cloud providers charge for the usage of hardware resources,
inefficient use of IaaS or errors in allocating VMs may lead to significant
charges while the performance of the application remains poor. In addition, customers
are responsible for securing their data, which is in itself a daunting task. Cloud
providers, on the other hand, must ensure that they provide a fast network and appropriate
hardware on demand. For example, one large financial high-speed trading company
purchased an extra service at a high premium from a cloud provider to ensure reliable
and fast network communication. The cloud provider enabled two separate fail-over
networks for this customer; however, the network service failed during an important
trading session. Further examination revealed that these two separate networks from
two independent providers used the same underlying physical cable without
knowing about it. This cable was the weak link that led to the disruption of the service.
Needless to say, it is not enough simply to rely on verbal assurances;
every aspect of the underlying hardware should be examined when using IaaS.
sumptions at a large scale. Smart meters are electronic devices that record consumption
of electric energy at designated intervals and send this information to a power
provider for monitoring and billing [38]. In the city of Chicago, over four million
households, companies, and organizations will eventually have smart meters installed
to monitor their power consumption. Each meter sends hundreds of bytes of power
data to the cloud for processing every 30 minutes or so, which means that smart meter clients
send approximately 50Gb of data every day to the cloud for processing. As the data
arrives, it should be checked for correctness before it is submitted to various applications
for processing. For instance, a power consumption measurement should
not be negative, nor should it be an implausibly large or small number. If a measurement falls within
a suspicious range of values, then it should be excluded from further processing and a
technician should receive a notification to check whether the smart meter functions
correctly. In some cases, the diagnostics can be run remotely and automatically.
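A minimal sketch of such a correctness check in Python; the bounds MIN_WATTS and MAX_WATTS are illustrative assumptions rather than values prescribed by any provider:

```python
# Illustrative plausibility bounds for a single reading (assumptions,
# not values from the text): a reading must be strictly positive and
# below an arbitrary 100 kW ceiling.
MIN_WATTS = 0.0
MAX_WATTS = 100_000.0

def is_plausible(reading):
    """Return True if the reading falls outside the suspicious range."""
    return MIN_WATTS < reading < MAX_WATTS

def filter_readings(readings):
    """Split readings into accepted values and values flagged for a technician."""
    accepted, flagged = [], []
    for r in readings:
        (accepted if is_plausible(r) else flagged).append(r)
    return accepted, flagged

accepted, flagged = filter_readings([350.0, -1.0, 42.5, 9e9])
```

The flagged list corresponds to the measurements that would trigger a notification to a technician rather than being passed on for processing.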
One way is to run servers in many VMs that receive the smart meter data, check its
correctness, and send the data to other servers for processing. Since the data may come
in short bursts followed by periods of lull, the servers may stay idle for some time. Yet,
the energy provider who owns this application will pay for all VMs continuously, even
though they may be idle more than 50% of the time. Starting a VM takes anywhere
from seconds to minutes, and once the instantiation starts, cloud providers routinely
charge VM owners for at least one hour. Therefore, it does not make sense to passivate
VMs; moreover, processing a unit of data from a smart meter may take only 100
milliseconds, whereas activation of a VM takes much longer, resulting in poor
performance of the application.
To address this problem, the FaaS model was proposed. It is rooted in functional programming,
where functions are immutable first-class objects that can be passed as parameters
to other functions and returned as values from them. Suppose that the incoming
smart meter data is defined as a list in Scala: val l = List(1,-1,2,-1,1). We
define a function that checks the correctness of the values as follows: def f(v:
Int) = if (v <= 0) SendDiagnosticsMsg(). Then, checking the data is realized
with the function map as follows: l map(v => f(v)), where map takes each
item, v, from the list l and applies the function f to this item. No server is needed to do
that. FaaS is implemented as Lambda in Amazon Web Services (AWS) and as Functions
in the Microsoft Azure Cloud and in the Google Computing Engine (GCE). The
idea is that the cloud infrastructure can “react” to data items that arrive from smart meters
and invoke the function map that applies the function f to the data without a
heavyweight VM initialization. Cloud providers usually run Lambdas and Functions
in specialized lightweight VMs that take microseconds to start, and customers
are charged in increments of 100 milliseconds of execution time. Cloud providers
typically limit the execution times of functions to five minutes or to less than ten
minutes. Pricing models are somewhat complicated in general, with the first hundreds
of thousands of invocations free, then a few cents per hundred thousand requests and
a millionth of a cent per few Gb of data transferred per second.
Programmers create functions in languages like Java, Scala, F#, JavaScript, Python,
or C# and upload them to the cloud in compressed files, where these functions are
registered with specific event triggers. For example, cron jobs are triggered at specific
times, and a function may be supplied as a job. Functions may also be supplied as part of a
workflow, a graph whose nodes designate specific computations and whose edges specify the
directions in which the results of the computations are sent. For example, a financial
workflow at a large company maintains financial information that is received from
various sources and invokes functions that may keep track of work hours or compute
taxes on each purchase. These functions can be created and updated by developers who
then upload them to the cloud to plug into the existing financial workflow.
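A workflow of this kind can be sketched as a small graph of functions; the node names, the record fields, and the tax rate below are hypothetical, chosen only to illustrate how results flow along the edges from one computation to the next:

```python
# Hypothetical node functions of a financial workflow: each takes a
# record and returns an enriched record that flows along an edge.
def track_hours(record):
    return {**record, "hours": record.get("hours", 0)}

def compute_tax(record):
    # Illustrative flat 10% tax on the purchase amount.
    return {**record, "tax": round(record["amount"] * 0.1, 2)}

# The workflow graph: node name -> (function, name of the next node).
# A next node of None marks the end of the workflow.
workflow = {
    "ingest": (track_hours, "tax"),
    "tax": (compute_tax, None),
}

def run(workflow, start, record):
    """Execute the workflow by following edges from the start node."""
    node = start
    while node is not None:
        fn, node = workflow[node]
        record = fn(record)
    return record

result = run(workflow, "ingest", {"amount": 100.0})
```

Replacing a function in the workflow dictionary models how a developer uploads an updated function that plugs into the existing workflow without changing the rest of the graph.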
Of course, invocations of tens of thousands of functions for a dozen milliseconds
each create a debugging nightmare, since adding logging statements to functions
adds significant overhead that negates the purpose of FaaS. To address this issue, AWS
introduced the notion of step functions6, each of which performs a specific limited
service in a workflow. The AWS documentation states: “Step Functions automatically
triggers and tracks each step, and retries when there are errors, so your application
executes in order and as expected. Step Functions logs the state of each step, so when
things do go wrong, you can diagnose and debug problems quickly.” More information
can be found on AWS7.
Finally, not only is FaaS used to create applications from functions, but it is also
used within cloud infrastructures such as Amazon cloud services to trigger events for applications,
to test whether resource configurations comply with some predefined rules,
and to respond to certain system events, e.g., AWS CodeCommit and CloudWatch
Logs. That is, with FaaS, cloud computing service providers eat their own food, so
to speak, by using the functional services provided by their platforms to further improve the
performance and the functionality of their cloud computing platforms.
3.6 Summary
In this chapter, we reviewed different cloud models and their operational environments.
We explained the basic model of cloud computing, in which two fundamental properties (i.e.,
pay-as-you-go and resource provisioning on demand) separate cloud computing from
other types of distributed computing. We described issues with the utilization of resources
in the cloud and showed how different cloud providers address the issues of over- and
underprovisioning resources for applications. After introducing the notion of a software
stack, we explained four cloud deployment models, SaaS, PaaS, IaaS, and FaaS, in
terms of accessing and controlling software layers. In addition, we gave examples of
client programming to access services offered by the cloud software stacks.
6 https://aws.amazon.com/step-functions
7 https://aws.amazon.com/serverless
Chapter 4

RPC
Deploying software applications in VMs in the cloud means that the users of these
applications access them via the Internet to perform computations by supplying some
input data to procedures that are located in these VMs and obtaining the results of these
computations. In general, a procedure or a function is a basic abstraction by parameter-
ization that is supported by a majority of programming languages. Instead of copying
and pasting code fragments and assigning values to their input variables, the procedural
abstraction allows programmers to specify a code block once in a named procedure, pa-
rameterize it by designating input variables, and invoke this code by writing the name
of a procedure with values that correspond to the input parameters. Despite the seeming
simplicity of invoking a procedure, it is a complicated process that involves the operating
system interrupting the execution flow of the application, locating the code for the
invoked procedure, creating a special structure called a frame to store various values
and instructions, executing the code of the procedure, and, once it is finished, removing
the frame from memory and returning to the next instruction in the execution flow.
In this chapter, we will analyze how to change the process of local procedure invocation
to make it work in a distributed setting, so that clients can call procedures that are
located in VMs in the cloud from their local client programs.
52 CHAPTER 4. RPC
physical dials and switches. It did not allow the first programmers to create procedures and
call them from programs.
Economically, it did not make sense to build computers where instructions could
be changed only by physical rewiring. In 1944, ENIAC inventors John Mauchly and
J. Presper Eckert proposed the design for the next-generation computer called the Electronic
Discrete Variable Automatic Computer (EDVAC), and they were joined by John
von Neumann, who formulated the logical design of the stored-program computer that
later became known as the von Neumann architecture. The key idea of the architecture
was to separate the CPU and memory: programs would be written separately as
sequences of instructions, which would be loaded and executed by the CPU, and the data
would be stored in the computer memory and manipulated by the instructions. Doing
so represented a radical departure from hard-wired computers, since many different
programs could be executed on general-purpose computers.
With the separation of the CPU and the memory, the latter is divided into five segments:
the code segment, which contains program instructions and in which the program instruction
register points to the next instruction to be executed by the CPU; the data segment,
where initialized and static values of the program variables are stored; the Block
Started by Symbol (BSS) segment, which stores uninitialized program variables; and, most
importantly, the heap and the stack segments. The heap stores program variables that
can grow and shrink in size, whereas the stack is the temporary storage for the program
execution context during a procedure call. The stack is controlled by a special
operating system subroutine called the calling sequence or the activation record.
An illustration of the local procedure call is shown in Figure 4.1. We base our
description on the chapter of the Intel documentation that describes procedure calls,
4.1. LOCAL PROCEDURE CALLS 53
interrupts, and exceptions. Like many other computer architectures, Intel uses the procedure
stack to reflect the Last In First Out (LIFO) order of procedure calls, where the
last called procedure finishes first and returns control to the calling procedure. In the
upper left corner, the source pseudocode is shown, where the procedure function
is called from the procedure main. Underneath it, in the lower left corner, the stack
is shown with the calling procedure main on the bottom and the called procedure
function on top of the stack. In the Intel 64-bit architecture, the stack can be located
anywhere in the program memory space, and it can be as large as a single segment,
which is 4GB.
Recall that a stack is a LIFO data structure with two operations, push and pop.
Once a call is made to the procedure main, its frame is pushed onto the stack, and the processor
decrements the stack pointer (SP), since the stack grows from higher to lower
addresses. When the procedure function is called, the operating system allocates a
new frame on the stack for this procedure on top of the frame for the procedure main.
Only the top frame of the stack, i.e., the frame of the procedure function in our example, is active
at any given time. When the last instruction of the top-frame procedure is executed,
its frame is popped and the processor returns to the execution of the next instruction
in the previous stack-frame procedure, which becomes the top stack frame. Thus, the
configuration of the stack is dictated by the control flow of the program instructions
that call procedures.
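The push/pop discipline described above can be mimicked with a toy model of the procedure stack; this is a sketch in Python, whereas real stacks live in memory managed by the hardware and the operating system:

```python
# A toy model of the LIFO procedure stack: calling a procedure pushes a
# frame, returning pops it, and only the top frame is active.
stack = []

def call(name):
    """Push a frame for the called procedure onto the stack."""
    stack.append({"procedure": name})

def ret():
    """Pop the top frame and return the name of the finished procedure."""
    return stack.pop()["procedure"]

call("main")                       # main starts executing
call("function")                   # main calls function
top = stack[-1]["procedure"]       # "function" is now the active frame
returned = ret()                   # function finishes; its frame is popped
active = stack[-1]["procedure"]    # control returns to "main"
```

The model shows why the stack configuration mirrors the control flow: the order of push and pop operations is exactly the order of calls and returns.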
To ensure that the processor returns to the last instruction before calling a procedure,
and to provide the context for a called procedure, a frame should keep certain
information in addition to the general bookkeeping information for the stack. To link
the calling and the called procedures, the processor keeps track of the SP and the return
instruction pointer. The processor stores the SP value in its base pointer register, so that
the processor can reference data and instructions using an offset from the SP value in
the current memory segment, making address computations very fast. To obtain the return
instruction pointer, the call instruction to a procedure pushes the current instruction
address in the Instruction Pointer register onto the current stack frame. This is the return
instruction address to which control should be passed after the called procedure is
executed. If the return instruction pointer is damaged or lost, the processor will not be
able to locate the next instruction that should be executed after the called procedure is
finished. It is interesting, though, that the Intel documentation states the following:
“the return instruction pointer can be manipulated in software to point to any address in
the current code segment (near return) or another code segment (far return). Performing
such an operation, however, should be undertaken very cautiously, using only well
defined code entry points.”
Question 2: What is the relationship between the stack overflow problem and
storing the return instruction pointer?
The expansion of the stack frame for the procedure function is shown on the
right side of Figure 4.1. The layout of the stack frame differs among computer
architectures, so we concentrate on the main principles in our description. Parameters to
procedures can be passed through registers, as a pointer to an argument list that is
placed in a data segment of the memory, or by placing these parameters on the stack.
The called procedure can return results in exactly the same ways. In our example, the
arguments (i.e., the variables a and b) are placed on the stack, followed by the return
address, the BP register value, and the local variable (i.e., the array of five integers,
array). The calling context, i.e., the state of the calling procedure and the content
of values in hardware stores (i.e., registers and some memory locations), should be
saved and restored after the called procedure finishes its execution. To complicate
things further, procedures can be located at different privilege levels; in the Intel
architecture, level 3 is for applications, levels 2 and 1 are for operating system services,
and level 0 is for the kernel. Handling those procedure calls involves checking the access
level by the processor and performing somewhat expensive operations for creating a
new stack and copying values from the previous stack to the new one, in addition to
complicated manipulations of values across multiple registers and memory locations.
Moreover, interrupts and exceptions can happen when executing a procedure, where
an interrupt is an asynchronous event that is triggered by an Input/Output (I/O) device
and an exception is a synchronous event that is generated by the processor when some
condition occurs during the execution of an instruction (e.g., division by zero). By
saying that an event (e.g., the creation of a data structure) is asynchronous, we mean that
this event is generated at a different rate or not together with the instructions that the
processor is executing at some moment, and the process that generated this event does
not block. A synchronous event, however, occurs at some predefined time relative to some
other event (e.g., a phone call is answered synchronously after the phone rings at least
once). When an interrupt or an exception occurs, the processor halts the execution of
the current procedure and invokes a special interrupt/exception handler procedure that
is located at a predefined address in the computer memory. Depending on the type of
the event, the processor saves the content of hardware registers and some memory
fragments, obtains a privilege level for the handler, and saves the return address of the
current procedure before executing the handler. Once finished, the processor returns
to the previously executing procedure on the stack. Those who are interested in more
detailed information can find it in the Intel 64 Architectures Software Developer’s
Manual1 or in the documents for the corresponding architectures.
So far, we have used the terms synchronous, asynchronous, and (non)blocking to
describe function calls. Let us refine the distinction from the perspective of executing
I/O operations, e.g., reading from or writing to a file or a network connection. A
synchronous I/O function call to retrieve data blocks until some amount of the data
is retrieved (or returns whatever data is waiting in some buffer, for a nonblocking call), or an
exception is returned informing the caller that the data is not available. This function
call is also blocking, since the client waits some time for the completion of the function
call before proceeding to the next operation. Conversely, an asynchronous call is a
non-blocking function call that returns immediately. For example, a reference to the called
1 http://www.intel.com/Assets/ja_JP/PDF/manual/253665.pdf
4.2. CALLING REMOTE PROCEDURES 55
function can be put in an internal queue, and some component will eventually retrieve
the information about the called function and invoke it. Regardless of the subsequent
actions, the client proceeds to execute the next operations and commands without waiting
for any results from the called function. This function call is asynchronous, since the
execution of this function does not affect the execution of the immediately subsequent
operations by the client. Once the function call is executed, its results will be sent to
the caller via some callback function that the client provides when making the asynchronous
call.
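A minimal sketch of such an asynchronous, non-blocking call in Python: the request is queued, the caller returns immediately, and a worker thread later invokes the function and delivers the result through the callback the caller provided:

```python
import queue
import threading

tasks = queue.Queue()           # internal queue of requested calls
results = []
done = threading.Event()

def worker():
    # Some component eventually retrieves the queued call and invokes it,
    # delivering the result through the caller-supplied callback.
    fn, arg, callback = tasks.get()
    callback(fn(arg))
    done.set()

threading.Thread(target=worker, daemon=True).start()

def async_call(fn, arg, callback):
    """Non-blocking: queue the call and return to the caller immediately."""
    tasks.put((fn, arg, callback))

async_call(lambda x: x * x, 7, results.append)
# The caller is free to execute further operations here while the call runs;
# we merely wait at the end so the sketch can observe the delivered result.
done.wait(timeout=5)
```

The caller never waits on the function itself; it only learns of the result when the callback fires, which is exactly the asynchronous semantics described above.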
Confusion often arises because a synchronous function call is assumed to be
blocking; however, that is not the case. A synchronous I/O function call to read data
from some file may return with a subset of the available data, hence it does not block to
obtain the entire set of the available data. As a result, the client must check the status
of the I/O handle or the return value of the I/O function call, and if more data is still
available, the client will keep calling this function in a loop. Thus, such a function call is
both synchronous, since the client waits to get some result back, and nonblocking, since
the caller is not blocked for the entire duration needed to get all results back. By definition,
asynchronous function calls are always nonblocking.
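The synchronous-yet-nonblocking read loop described above can be sketched in Python; io.BytesIO stands in for a file or socket whose read calls return at most a few bytes at a time:

```python
import io

# A stand-in data source; a real nonblocking file or socket behaves the
# same way: each read returns only the data currently available.
source = io.BytesIO(b"smart meter payload")

chunks = []
while True:
    chunk = source.read(4)   # synchronous: waits for at most 4 bytes
    if not chunk:            # empty result: no more data is available
        break
    chunks.append(chunk)     # partial data; keep calling in a loop
data = b"".join(chunks)
```

Each call is synchronous (the client gets some result back before proceeding) but nonblocking (no single call waits for the entire data set), which is precisely the combination the text distinguishes.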
Figure 4.2: A model of the client/server interactions when calling remote procedures.
To overcome this problem, messages are passed between the process where the call
to a remote procedure is made and the process that hosts the called remote procedure. A
message is a sequence of bits in a predefined format that contains information about the
caller and the callee as well as the values of the input parameters and the results of the
computation. Consider the client/server model of remote procedure calls that is shown in
Figure 4.2. The client is the process in which a remote procedure call is made, and the
server is the process where the actual procedure is hosted. Communication between
the client and the server is accomplished using messages, which are sequences of bits
formed according to some protocol (e.g., TCP/IP). The message from the client to the
server is called a request, and the message that the server sends to the client is called a
reply. These messages are shown with block arrows in Figure 4.2, and the explanations
of the functions of the client and the server are given below their corresponding boxes.
The server creates a listener to receive requests from its clients, and it waits for these
requests in the WHILE processing loop. The client process connects to the server and
runs independently from the server process until a remote procedure call is made.
At this point, the call is translated into a request for a service that the client sends to
the server. The service is to invoke a procedure with the specified values of the input
parameters. Once this request is sent, the client waits for a reply from the server that
contains the result of the computation, or the status of the finished execution if the
function does not return any values. This is the simplest semantics of the client/server
model of the remote procedure call.
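The request/reply exchange of this client/server model can be sketched with TCP sockets and JSON messages; the procedure name add and the wire format below are illustrative choices, not a standard:

```python
import json
import socket
import threading

# The procedure hosted by the server process (illustrative).
PROCEDURES = {"add": lambda a, b: a + b}

def server(listener):
    conn, _ = listener.accept()               # wait for a client request
    request = json.loads(conn.recv(1024))     # decode the request message
    result = PROCEDURES[request["procedure"]](*request["args"])
    conn.sendall(json.dumps({"result": result}).encode())  # send the reply
    conn.close()

listener = socket.socket()
listener.bind(("127.0.0.1", 0))               # any free local port
listener.listen(1)
port = listener.getsockname()[1]
threading.Thread(target=server, args=(listener,), daemon=True).start()

# The client translates a call into a request, sends it, and waits for
# the reply that carries the result of the computation.
client = socket.socket()
client.connect(("127.0.0.1", port))
client.sendall(json.dumps({"procedure": "add", "args": [2, 3]}).encode())
reply = json.loads(client.recv(1024))
client.close()
```

The blocking recv on the client side realizes the simplest, synchronous semantics: the client waits for the reply before proceeding.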
Question 4: How do we handle exceptions that may be thrown during the execution
of the remote procedure on the server side?
We can view exceptions as objects that are created by the execution runtime in
response to instructions that result in some incorrect state of the application or the
runtime environment itself (e.g., a disk error that causes the filesystem to issue an exception).
Once an exception occurs, control is passed to exception handlers, which
are procedures that the runtime invokes automatically in response to certain exceptions.
Suppose that an exception handler is defined in the client and the exception is thrown
in the server. It is unreasonable to expect programmers to write complicated code that
detects the creation of exception objects in the server program and transfers the information
about them to the client program. Clearly, this process should be automated
so that programmers can reason about the program as a whole.
Question 5: Since this RPC model is based on message passing, one immediate
question is how it differs from using some interprocess communication
(IPC) mechanism to make a remote procedure call?
In general, significant manual effort is required to create programs that use IPC to
send data between components that run in separate process spaces. Not only must
programmers define the message format and write the code that sends and receives
messages and that locates and executes the remote procedures, but they must also handle
various error conditions, e.g., resending messages that are lost due to a network outage.
In addition, an IPC mechanism on a Linux platform may have different interfaces
and some differences in its semantics from Windows, and this will result in multiple
implementations of IPC-based RPC for different platforms. As an exercise, the
reader can implement RPC using sockets or named pipes for the Linux and Windows
platforms to understand the degree of manual effort, which we call low-level implementation.
The opposite of low-level coding is high-level coding, where programmers reason
in terms of procedure calls rather than establishing physical connections between the
client and the server and sending messages over these connections.
A software tool is language-dependent if its use differs from language to language.
Constructing a message according to some format is language-dependent, since
the code that (un)marshals bits in a message uses grammatical constructs of the language
in which the code is written. Written in one language, the code fragment cannot
be easily integrated into a program written in some other language without changing
the code. In contrast, a language-independent implementation of RPC does not
require the programmer to write RPC-specific code in some language. We will discuss
how to achieve language independence for RPC in more detail in Section 4.5.
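The difference can be illustrated in Python: struct packs fields in a fixed, machine-oriented binary layout, while JSON produces a self-describing text message that an RPC peer written in any language can parse; the field names here are hypothetical:

```python
import json
import struct

# A hypothetical request: the identifier of the remote procedure and
# one integer argument.
proc_id, arg = 7, 42

# Binary marshalling: the sender and receiver must agree on field order,
# sizes, and byte order ("!" selects network byte order) out of band.
packed = struct.pack("!ii", proc_id, arg)
unpacked = struct.unpack("!ii", packed)

# Language-independent marshalling: the message carries its own field
# names, so any JSON-capable peer can decode it.
text = json.dumps({"proc_id": proc_id, "arg": arg})
decoded = json.loads(text)
```

The binary form is compact but couples both ends to one layout; the textual form trades some space for portability across languages and platforms, which is the trade-off behind language-independent RPC.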
Question 7: What information should be put in clients and servers, and what
is the exact content of the messages that they exchange?
Clearly, when the client requests a service, it should specify the address of the
server that responds to requests from its clients. The server address could be a unique
Figure 4.3: The RPC process. Circles with numbers in them designate the order of the steps of
the process with arrows showing the direction of the requests.
the request latency. Suppose that a different user issued a read request to the same
DSM memory location after the previous write request, and this read request is mapped
to some other computing node, Np, where k ≠ p. Since the write request latency is
non-zero, the memory of the node Np will not match the memory of the node Nk for
some time. It means that the memory will not be coherent and the user’s read will not
return the value that was written to the same memory location prior to the read. DSM
algorithms attempt various trade-offs between the level of memory consistency and
performance. As we shall see in later chapters of this book, the issues of latency and
data consistency are key to engineering cloud applications.
to a call from the RPC runtime library, which contains various functions for data conversion,
exception handling, and remote connectivity using various IPC mechanisms.
In some cases, the RPC runtime library can batch multiple remote procedure calls into
batched futures to send multiple procedure calls in one batch to reduce the communication
overhead between clients and servers [27]. The transport layer (3) receives the
message and transmits it to its counterpart on the server side (4), which ensures that the
message is received and assembled correctly if it was broken into packets on the client
side. Once the transport layer on the server side receives the message, (5) it passes it
as the parameter to a call from the runtime library, where the message is decoded, the
destination remote procedure is identified, and (6) the server stub for this procedure is
invoked, which contains the implementation of the invocation of the actual procedure and
of how to pass arguments to it. Thus, (7) the server stub invokes the actual procedure that
contains the code that a programmer wrote to implement some functionality. Doing so
completes the client-to-server call of the RPC process.
Once the procedure finishes, the results of this execution should be passed back
to the client. Recall that the client can wait until the remote procedure finishes its
execution to proceed with its own (i.e., the synchronous RPC), or it can proceed with
its execution and the results of the execution of the remote procedure will be delivered
independently to the client’s address space and written into the designated memory
locations (i.e., the asynchronous RPC). In both cases, the RPC process is the same. Once
the procedure terminates, (8) its return value is passed to the server stub that marshals
the values (9) to pass them to the RPC runtime library. Even if the remote procedure does not
return any results, the return is considered an empty value of the type void and it is
passed to the client. The runtime library (10) passes the return to the transport layer (11),
which passes it to the client side (12), and in turn (13) → (14) back to the location in the source
code where the remote procedure call was invoked, where it is treated as if it came from a
local call. This completes the RPC process.
Question 8: Argue the pros and cons of adding definitions to the methods of interfaces
in IDL.
4.4. INTERFACE DEFINITION LANGUAGE (IDL) 61
[uuid(906B0CE0-C70B-1067-B51C-00DD010662DA), version(1.0)]
interface Authenticate {
    int AuthenticateUser([in, string] char *Name);
    void Logout(int);
}
Figure 4.4: IDL code example of the authentication interface and its remote procedures.
A conceptual outline of the solution that addresses these questions is to abstract the
specification of the interactions between the client and the server in the client/server
RPC model via the concept of the interface. Instead of revealing all details about
the implementation of the remote procedure, the client has information only about the
interfaces that the server exposes; these interfaces include the names of the remote
procedures and their signatures, that is, the types and order of their arguments and the
types of the return values. Using interface definitions, developers can create clients
independently of the remote procedures and in parallel with their creation.
This concept was realized in the Interface Definition Language (IDL)2, whose main
idea is to create a language- and platform-independent way of creating RPC-based
software while enabling programmers to use the type-checking facilities of the target
language compiler to catch errors at compile time. A key idea behind the IDL is to
introduce a standard for creating interface specifications and then to use the created
specification to generate the implementation of the stubs for the desired platform and
target language, thus burying the complexity of writing custom code in the tool that
generates this code and allowing programmers to concentrate only on the main logic of
the application they develop and avoid reasoning about the low-level code that realizes
the RPC protocol.
Consider the RPC development process that is shown in Figure 4.5, while an
example of IDL code is shown in Figure 4.4. The annotation in the square brackets
specifies the unique identifier (i.e., uuid stands for universally unique identifier) of the
interface and its version. The identifier can be viewed as a reference to the RPC server.
The uniqueness of the identifier is guaranteed by the algorithm that generates it using
random seeds, information from the computer hardware on which the algorithm runs,
the time, and other values that the algorithm combines into an identifier. Using the same
algorithm, it is highly unlikely that the same identifier will be generated twice unless the
algorithm runs nonstop for a couple of billion years. Assuming that the combination of
the uuid and the version uniquely defines an interface, locating it becomes a technical
matter for an endpoint mapper.
The rest of the code in the IDL example defines the signatures of two procedures:
AuthenticateUser and Logout. An important point is that no implementation
of a procedure is allowed in IDL, only its definition. In fact, it is up to the
application programmer to define and implement the algorithms that realize the semantics
of the defined methods. The IDL file marks the beginning of the RPC development
process, where it is (1) processed by the IDL compiler as shown in Figure 4.5, which (2)
outputs the header file, .h. We can add two more inputs to the IDL compiler: the target
language and the destination platform, which we tacitly assumed to be C and Unix.
The output header file can be viewed as a conversion of the IDL interface definition
into the target language, and it is used in the subsequent steps of the development process.
2 http://www.omg.org/gettingstarted/omg_idl.htm
The IDL compiler also (3) generates the client and the server stubs that are shown
in Figure 4.5 with the c.c and s.c extensions of the IDL file name. Recall that
stubs contain calls to the underlying RPC runtime library as well as the implementation
of argument (un)marshalling and exception handling, among other things. Here the
benefit of the IDL concept becomes clearer: it allows programmers to define interfaces
in a platform- and language-independent way, and then the IDL compiler generates
a low-level implementation of the RPC support, so that the programmers do not need to
worry about implementing this support manually. That is, the IDL abstraction allows
programmers to concentrate on high-level definitions of the interfaces between the client
and the server and to invoke the remote procedures the same way as if they were local.
At this point, programmers create the ClientImp.c and ServerImp.c files that,
as their names suggest, contain the implementations of the client and the server
programs. Some programmers may independently implement the bodies of the remote
procedures in the server modules while other programmers implement the client programs
with invocations of these remote procedures. The generated header file is (4) included
in the client and the server programs. Doing so enforces the type-checking procedure
that verifies that remote procedure calls adhere to the signatures defined in the
IDL specification. With this type checking, the language compiler can detect errors in
the incorrect usage of the remote procedures statically.
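To make the role of the generated stub concrete, the following is a minimal sketch, in Java, of what a client stub conceptually does: marshal the arguments into a message, hand the message to the runtime and transport layers, and unmarshal the reply. All class and method names here are hypothetical, and the "transport" is simulated locally; a real IDL compiler emits different code that calls the platform's RPC runtime library.

```java
import java.nio.charset.StandardCharsets;

// A hypothetical client stub for the Authenticate interface from Figure 4.4.
// A real IDL compiler emits code that calls the RPC runtime library; here the
// transport and the server stub are simulated by a local method.
public class AuthenticateClientStub {
    // Marshal the argument into a flat message: "procedure-name|argument".
    static byte[] marshal(String procedure, String arg) {
        return (procedure + "|" + arg).getBytes(StandardCharsets.UTF_8);
    }

    // Unmarshal the integer result returned from the server side.
    static int unmarshalInt(byte[] reply) {
        return Integer.parseInt(new String(reply, StandardCharsets.UTF_8));
    }

    // The stub presents AuthenticateUser as if it were a local call.
    public static int authenticateUser(String name) {
        byte[] request = marshal("AuthenticateUser", name);
        byte[] reply = transportSend(request);  // stands in for the RPC runtime
        return unmarshalInt(reply);
    }

    // Simulated transport plus server stub: decode the message, dispatch to
    // the "actual" procedure, and marshal its return value.
    static byte[] transportSend(byte[] request) {
        String[] parts = new String(request, StandardCharsets.UTF_8).split("\\|", 2);
        if (parts[0].equals("AuthenticateUser")) {
            int userId = parts[1].hashCode() & 0x7fffffff;  // fake user id
            return Integer.toString(userId).getBytes(StandardCharsets.UTF_8);
        }
        throw new IllegalArgumentException("unknown procedure " + parts[0]);
    }
}
```

The point of the sketch is that authenticateUser presents the same calling convention as a local procedure, while all message handling is hidden inside the stub.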
Once the client and the server programs are written, the target language compiler
and linker (5) generate the executable code that is (6) linked with the stub
implementations and the runtime RPC libraries that are provided by the given platform. The
client and server programs are (7) generated and can be deployed independently, thus
completing the RPC development process.
Figure 4.5: The RPC development process with IDL. Circles with numbers in them designate
the order of the steps of the process with arrows showing the direction of the requests. Dashed
arrows show the interactions between the client and the server.
and modularized in scope that is assigned a unique name. A function call is essentially
an invocation of the code fragment that the function name designates, with parameter
values replacing the formal parameter names in the code (i.e., we do not distinguish
passing by value vs. by reference vs. by name at this point). Once the function
finishes its execution, it is replaced by the return value that it computes. This simple
semantics is preserved in RPC, making it easy to reason about the source code.
Next, over many years, RPC implementations became very efficient. In a way, one
can think of an IDL compiler choosing the right code transformation techniques and
algorithms to generate the client and the server stubs that carry the brunt of the low-level
communications that realize RPC calls. There are many ways to optimize RPC:
using different low-level IPC primitives to suit the locality of the client and the server
(e.g., shared memory can be used to pass messages if both the client and the server
are located on the same computer), batching procedure calls, or lazily passing parameter
values by waiting until the remote procedure actually needs these values during
its execution. Moreover, in cloud datacenters with hundreds of thousands of commodity
computers, highly efficient RPC is in major demand. For example, joint
work between Microsoft and Cornell on the RPC chain introduces a primitive operation
that chains multiple RPC invocations in a way that the results of computation are
transferred automatically between RPC servers without the need for clients to make explicit
remote procedure calls [145]. The implementation uses the concept of chaining functions,
where the result of computing the inner chained function is used as an input
parameter to the next outer function that is chained to the inner function. Using the RPC
chain enables programmers to create highly efficient workflows within cloud
datacenters. In other work, a company called CoreOS, Inc. released a product called etcd, a
distributed key-value data store that works with clusters of computers3. Its documentation
states: “The base server interface uses gRPC instead of JSON for increased efficiency.”
That is, RPC implementations such as Google RPC (i.e., gRPC)
are used as core components of commercial and open-source cloud solutions nowadays.
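The chaining idea can be illustrated by ordinary function composition: the result of the inner function flows directly to the next function in the chain, so the client issues one call instead of one call per server. The sketch below is only an analogy in plain Java; it is not the RPC chain API from [145], and the three "servers" are stand-in arithmetic steps.

```java
import java.util.List;
import java.util.function.Function;

// An analogy for RPC chaining: each "server" is modeled as a function, and a
// chain forwards the result of one step directly to the next step, so the
// "client" issues a single call instead of one call per server.
public class RpcChainAnalogy {
    // Compose a list of steps into a single chained computation.
    static <T> Function<T, T> chain(List<Function<T, T>> steps) {
        Function<T, T> composed = Function.identity();
        for (Function<T, T> step : steps) {
            composed = composed.andThen(step);  // inner result feeds the next step
        }
        return composed;
    }

    public static int runChain(int input) {
        // Three hypothetical servers, modeled as simple arithmetic steps.
        Function<Integer, Integer> chained = chain(List.of(
                x -> x + 1,     // server 1
                x -> x * 10,    // server 2
                x -> x - 3));   // server 3
        return chained.apply(input);
    }
}
```

In a real RPC chain, each step would run on a different server and the intermediate results would never return to the client; the composition above only mimics that data flow within one process.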
4.6 Summary
In this chapter, we introduced the very important concept of remote procedure calls,
a cornerstone of creating distributed objects in cloud computing specifically and in
distributed computing in general. We started off by explaining how local procedure
calls work and showed that the model of the local procedure call does not carry over
to the distributed environment. Next, we explained the basic client/server model of RPC
and introduced the requirements that make it powerful and easy to use. We showed how
the RPC process works and asked how to make it effective and how to fit it into the
general software development process. We described the issues with using low-level
interprocess communication mechanisms to implement RPC and showed how adhering
to the basic function invocation model without using low-level calls makes source code
simpler and easier to maintain and evolve. Then, we introduced the notion of the
Interface Definition Language (IDL), explained it using a simple example, and described
the RPC development process with IDL. We concluded by summarizing the main points
that made RPC a pervasive and successful concept and implementation.
3 https://coreos.com/etcd
Chapter 5
Map/Reduce Model
The author of this book worked at a large furniture factory in the mid-1980s in
the Soviet Union, where he wrote software in PL/1 that computed reports for executives
about raw timber delivery and furniture production at the factory. The data were
collected manually during the day from various devices and the people who operated them,
as well as from warehouse agents and logistics personnel. The amount of daily data ranged
approximately between 50Kb and 200Kb, very small by today's standards,
and these data were often manually reviewed, mistakes were identified and the data
were adjusted accordingly, and the batch report algorithms ran overnight to produce the
report summary by 9AM the next morning for executives to review. The
author's first job in the USA in the early 1990s was to write software to collect
data from infrared sensors installed on various items in a plant to track these items as
they were moved around the plant. The amount of data was larger, close to 5Mb a day
from a large plant, and the processing was done in real time. Yet the amount of data was
still very small compared to the avalanche of data that came after the Internet took off
in the second half of the 1990s.
Between the 1990s and 2000s, a few economic and technical factors changed the nature
of computing. The Internet enabled the production and distribution of large amounts of
data. Different types of sensors, cameras, and radio-frequency identifiers (RFID)
produced gigabytes of data first weekly, then daily, then hourly. Average users without
computer science backgrounds created content on the Internet as web pages and social
network nodes. News media started posting articles online. The data sizes were measured
first in gigabytes, then terabytes, and then petabytes and exabytes. This dramatic increase
in the available data came together with advances in computer architecture that decreased
the cost of storage to cents per gigabyte and the cost of a commodity computer to a few
hundred dollars, down from thousands in the early 1990s. A new term, big data, was
coined to collectively describe very large and complex datasets that are difficult to
process using a few servers even with many multicore CPUs1.
1 https://en.wikipedia.org/wiki/Big_data
than nine hours. This example conveys the essence of the idea of splitting a computing
task that is technically impossible to accomplish as a whole into manageable subtasks
and executing them in parallel, thus dramatically decreasing the overall computing time.
Question 1: How may dependencies among arbitrary data items in a set
complicate the division of a task that uses these data into subtasks?
Consider a different simple and very common task of searching text files for some
words. More generally, this task is known as grepping, named after a Unix command,
grep (Global Regular Expression Print), that programmers frequently use to search
text files for the occurrence of a sequence of characters that matches a specified pattern.
For example, the command “grep '\<f..k\>' *.txt”, executed without
the external double quotes, prints a list of all four-character words found in your
files with the extension txt starting with “f” and ending in “k”, like fork, fink, and
fisk. For an average Unix user, this operation will terminate in less than a minute
searching through a few hundred text files, depending on the complexity of the
pattern, the number of matches, and the performance of the hardware. However,
executing an operation like this on 4.6 billion web pages may take thousands of years to
complete on a single computer.
Nowadays, many machine learning algorithms operate on big data to learn patterns
that can be used for automating various tasks. These algorithms are very computationally
expensive; they take significant amounts of RAM and many iterations to complete.
For example, a movie recommendation engine at Netflix analyzes rankings of more
than 10,000 movies and TV shows provided by close to 100 million customers and
computes recommendations to these customers using its proprietary machine learning
algorithms. As with the previous examples, the problem is to accomplish this task
within a reasonable amount of time and under a fixed cost. Even though these and many
other such tasks are different, the common theme is that they operate on big data
that contain many independent data objects. Computing results using disjoint subsets
of these data objects can be done in parallel, since these data objects are independent
from one another; that is, operating on one data object does not require obtaining
information from some other data object. A clue to a solution lies in the idea of dividing
and conquering the problem by splitting it into smaller manageable computing tasks
and then merging the results of these tasks, which are computed on cheap and easily
replaceable commodity computers.
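The divide-and-merge idea can be sketched in a few lines of Java: split the data into disjoint shards, sum each shard in parallel, and merge the partial sums. The shard sizing and the thread pool below are illustrative stand-ins for assigning subtasks to commodity computers.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Divide-and-conquer summation: disjoint shards are summed in parallel and the
// partial sums are merged, mirroring how independent data objects can be
// processed on separate commodity computers.
public class ShardedSum {
    public static long parallelSum(long[] data, int shards) {
        ExecutorService pool = Executors.newFixedThreadPool(shards);
        try {
            int shardSize = (data.length + shards - 1) / shards;
            List<Future<Long>> partials = new ArrayList<>();
            for (int s = 0; s < shards; s++) {
                final int from = Math.min(data.length, s * shardSize);
                final int to = Math.min(data.length, from + shardSize);
                // Shards are disjoint, so the tasks need no synchronization.
                partials.add(pool.submit(() -> Arrays.stream(data, from, to).sum()));
            }
            long total = 0;
            for (Future<Long> partial : partials) {
                total += partial.get();  // the merge step
            }
            return total;
        } catch (Exception e) {
            throw new RuntimeException(e);
        } finally {
            pool.shutdown();
        }
    }
}
```

Because no shard depends on another, the only coordination point is the final merge, which is exactly the property that makes such tasks easy to spread across a datacenter.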
cooling systems, how to supply power uninterruptedly, how to find and fix hardware
errors (this one is simple: the whole computer is replaced with a new one), and how
to provide fault-tolerant Internet connectivity to these computers? An answer to these
questions lies in a new organizational entity called a datacenter.
Elastic parallelism. Big data sets are assumed to be independent from one another,
and the programs that process them have no control or data dependencies that
require the explicit use of synchronization mechanisms (e.g., semaphores or distributed
locks). In this setting, many programs can operate on different datasets in parallel,
and adding new programs and datasets can be handled by assigning processing
units in the datacenter on demand to run these programs to process the
datasets. Whereas programmers view the datacenter as a single distributed object
that runs a single task (e.g., grep or PageRank), in reality this task is split
automatically into many small subtasks that are assigned to many processing
units for parallel execution. Then the results are assembled and returned to the
programmer, from whom the parallel processing is hidden behind the facade of a
single object with well-defined interfaces in the RPC style.
Workload optimization. Whereas a user of a standalone desktop has full control
over its resources, assigning them to different tasks optimally remains a serious
problem. Even in the case of a single desktop, the operating system does not always
allocate resources optimally to executing processes (e.g., the thread affinity
problem). In the case of a datacenter, load balancing, resource response time, and
resource locality are used to optimize the allocation of resources to different workloads.
Homogeneous platform. The virtual object abstraction presented to users hides the
possible differences in hardware and software platforms that may co-exist in a
datacenter. Unlike writing a program for a desktop that runs a specific operating
system and that has limitations on RAM, CPUs, and disk space, in a datacenter
all differences are abstracted away and all users are given a single view of the
distributed object with well-defined interfaces. All communications with this object
are performed via RPC, which enables users/clients to invoke remote methods of
the object's interfaces. Within the actual datacenter, all computing units can also
be the same, with the same software installed on them, adding to the homogeneity
of the computing platform in the datacenter.
Fault tolerance. When hardware fails in a desktop, its user runs the diagnostics,
determines what component failed, purchases a new component of a type that is
compatible with the failed component, and replaces it. Desktop failures are
infrequent, whereas in a datacenter with 100,000 commodity computers, failures
happen every day. Given the low price per commodity computer, many datacenters
simply replace one when its hardware fails. In his presentation in 2008, a
principal Google engineer said: “In each cluster's first year, it's typical that 1,000
individual machine failures will occur; thousands of hard drive failures will occur;
one power distribution unit will fail, bringing down 500 to 1,000 machines
for about 6 hours; 20 racks will fail, each time causing 40 to 80 machines to
vanish from the network; 5 racks will go wonky, with half their network packets
missing in action; and the cluster will have to be rewired once, affecting 5
percent of the machines at any given moment over a 2-day span. And there is
about a 50 percent chance that the cluster will overheat, taking down most of the
servers in less than 5 minutes and taking 1 to 2 days to recover.”2
Since hardware failures are an everyday fact in datacenters, it is imperative that
these failures be handled without requiring programmers to take specific actions
in response to them, except in some extreme cases where the majority of
computers in a datacenter are affected and cannot run programs. If a hard
drive fails on a computer that contains a dataset while a process is operating on
this dataset, the fault-tolerant infrastructure should detect the drive failure, notify
a technician to replace the computer, and move the dataset and the program to
some other available computer automatically to re-run the computation. This is
one example of how failures can be handled in a datacenter while presenting a
fail-free mode of operation to the clients of the abstract distributed object.
failure-rates-in-google-data-centers/
Once the sum is computed for each block, the values of these sums will be added to
obtain the resulting value.
Let us consider a different example that looks more complex: grepping hundreds
of millions of text files to find words in them that match a specified pattern. This
operation is more complex than adding integers, but we will apply the same parallelizing
solution. First, we partition the entire file set into shards, each of which contains a subset
of the files, and we assign each shard to a separate computer in a datacenter. Then we
search for the pattern in each subset of files in parallel and output a map where a key
specifies a word that matches the pattern and its values contain linked lists of pointers
to specific locations in the files that contain this word. Thus, the result of each parallel
grepping will be such a map. Next, we need to merge these maps so that we can compute
a single result. If a word is contained in more than one map, the resulting map will
contain an entry for this word with the values concatenated from the corresponding
entries in the different maps computed in parallel.
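The merge step for the grepping example can be sketched as follows: each parallel search produces a map from a matched word to a list of its locations, and merging concatenates the location lists of words that occur in more than one map. Representing a location as a file-name-plus-offset string is a simplification for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Merging per-shard grep results: each map sends a matched word to a list of
// locations; merging concatenates the lists for words present in several maps.
public class GrepMerge {
    public static Map<String, List<String>> merge(List<Map<String, List<String>>> shardResults) {
        Map<String, List<String>> merged = new TreeMap<>();
        for (Map<String, List<String>> shard : shardResults) {
            for (Map.Entry<String, List<String>> entry : shard.entrySet()) {
                merged.computeIfAbsent(entry.getKey(), k -> new ArrayList<>())
                      .addAll(entry.getValue());  // concatenate the location lists
            }
        }
        return merged;
    }
}
```

A word that appears in only one shard map passes through unchanged, while a word that appears in several ends up with one entry holding all of its locations, which is exactly the reduction described above.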
Even though these two examples are completely different, there is a common pattern
buried in the solutions. First, the data from shards are mapped to some values
that are computed from these data. As a result, multiple maps are created from shards.
Second, these maps are merged, or reduced, to the final result, and this reduction is done
by eliminating multiple entries in these disjoint maps by combining them into single
entries with multiple values that can be further reduced. One interesting insight is that
arithmetic operations are reduction, or folding, operations. Summing all integers in a
list reduces it to a single value. The same explanation goes for multiplication. Merging
many maps whose key sets intersect in a nonempty set reduces these maps into a single
one by removing redundant repetitions of the keys in the separate maps. This concept
is illustrated in Figure 5.1, where the flow of key-value pairs comes from the left, first
into the map primitives and then into the reduce primitive.
Let us give formal definitions of the map and reduce primitive operations. A
mapper is a primitive operator that accepts a key-value pair (key, value) and produces a
multiset of key-value pairs, {(k1, v1), . . ., (kp, vp)}. A reducer is a primitive operator that
accepts a pair (k, {value}), where a key is mapped to a multiset of some values, and it
produces a key-set of values pair with the same key, (k, {v1, . . ., vs}), where the set of
values may be different from the one in the input. Applying these definitions
to the summation example above, the mapper takes pairs of integers, one as a key and
the other as a value, and outputs a key-value pair where the key is the concatenation
of these two integers with a comma in between, as a string, and the value is the sum of
the integers. The reducer takes the pairs output by the mapper and adds up their
values to the total that is returned as the result of this map/reduce operation.
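Following these definitions, the summation mapper and reducer can be written down directly. The pair representation via Map.Entry is merely a convenience here and is not part of any map/reduce framework.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.List;
import java.util.Map;

// The mapper and the reducer for the summation example: the mapper emits the
// concatenated integers as the key and their sum as the value; the reducer
// adds up all mapped values into the total.
public class SumMapReduce {
    // (key, value) -> ("key,value", key + value)
    public static Map.Entry<String, Integer> mapper(int key, int value) {
        return new SimpleEntry<>(key + "," + value, key + value);
    }

    // Reduce all mapper outputs to a single total.
    public static int reducer(List<Map.Entry<String, Integer>> mapped) {
        int total = 0;
        for (Map.Entry<String, Integer> pair : mapped) {
            total += pair.getValue();
        }
        return total;
    }
}
```

Note that the mapper is stateless and can run on any number of shards in parallel, while the reducer only needs the multiset of mapped values, not the original input.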
Question 4: Design the mapper and the reducer for the grepping problem.
Question 5: Design the mapper and the reducer for computing PageRank.
The input to the map/reduce architecture comprises the mapper and the reducer
programs that are written by programmers and the input dataset on which these programs
operate. The programmers' job is intellectually intensive: it involves understanding
the problem, mapping it onto the mapper and reducer primitives, and implementing the
solution. Once the mapper and the reducer implementations are ready, the programmers
submit them to a computing node called the master, which distributes the
mapper and the reducer programs to worker nodes that belong to a set of designated
commodity computers in the datacenter. This completes the first step of the map/reduce
process.
Figure 5.2: The architecture and the workflow of the map/reduce model.
cially considering that shards may be of different sizes and there is high variability in
processing time between different mappers. This way, reducers do not stay idle waiting
for all mappers to finish, and the load is distributed more evenly across mappers and
reducers. It also means that more than one mapper and reducer must be used to take
advantage of pipelining the workflow.
Question 7: What can be said about the ratio of the total number of mappers
and reducers to the number of the commodity computers that can be used to host
them? What is the complexity of map/reduce?
detects that the master process does not run any more and it will restart it. The restarted
master program checks whether there is a checkpoint file, reads the last saved state, and
resumes the map/reduce computation from that checkpoint.
If a worker fails, it stops responding to the heartbeat, a control
message sent periodically from the master to all workers. If the failed worker was
running a mapper when it failed, the master will relocate this mapper task with its
shards to some other worker. This operation is transparent, since it is hidden from users,
who continue to see the computation running and eventually producing the desired
result as if there were no failures. However, if the worker failed after its mapper had
completed the operation, then the resulting intermediate files can be copied to a new
worker. If these files are not accessible because of a disk or network failure, then
the mapper should be re-executed on a new worker. Recovering from a failure that
affects a reducer is even simpler, since a different reducer will pick up the input for
the failed reducer and produce the output.
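The master's bookkeeping for this failure detection can be sketched as a table of heartbeat timestamps: a worker whose last reply is older than a timeout is declared failed and its mapper task must be reassigned. The class, the method names, and the timeout below are illustrative, not taken from any actual map/reduce implementation.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Heartbeat-based failure detection: the master records the time of the last
// heartbeat reply from each worker; workers whose last reply is older than a
// timeout are considered failed, and their tasks must be reassigned.
public class HeartbeatMonitor {
    private final Map<String, Long> lastHeartbeat = new HashMap<>();
    private final long timeoutMillis;

    public HeartbeatMonitor(long timeoutMillis) {
        this.timeoutMillis = timeoutMillis;
    }

    public void recordHeartbeat(String worker, long now) {
        lastHeartbeat.put(worker, now);
    }

    // Workers that missed the timeout; their mapper tasks get rescheduled.
    public List<String> failedWorkers(long now) {
        List<String> failed = new ArrayList<>();
        for (Map.Entry<String, Long> entry : lastHeartbeat.entrySet()) {
            if (now - entry.getValue() > timeoutMillis) {
                failed.add(entry.getKey());
            }
        }
        Collections.sort(failed);
        return failed;
    }
}
```

Passing the clock value in as a parameter keeps the sketch deterministic; a real master would read the system clock and run this check on a timer.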
Question 8: What features of the standard Unix file system are redundant for
the map/reduce computations?
Big data computations heavily influenced a new set of requirements on the filesystem.
Files are big and are frequently changed by appending data – think of appending
new key-value pairs to the input file for the reducer. Then, the pattern of file usage is
different from that of human users – after the files are modified, they are mostly
read, rather than written to. This pattern is known as Write Once Read Many (WORM).
With the WORM pattern, files are rarely modified, if at all. Also, concurrent usage is
different – when human users share a file, one of them locks the file and the others wait
until the lock is released. However, when distributed objects make RPCs to append
data to a file, there is no reason to lock this file, since the RPCs can be asynchronous
and the data will be appended to the file in the order the requests come. Finally, the
filesystem must ensure high bandwidth rather than small latency, i.e., higher bits-per-second
transfer rates rather than smaller delays in responding to requests.
A standard example to describe the difference between latency and bandwidth
involves vehicles of different speeds and capacities. A sedan with a capacity of 5
and a speed of 60 miles/hour travels faster but carries fewer passengers than a bus
with a capacity of 60 and a speed of 20 miles/hour. The latencies for traveling 10 miles
are the following: car = 10 min and bus = 30 min, with the throughput: car = 15 people
per hour (PPH) and bus = 60 PPH. Thus, latency-wise, the car is three times faster than
the bus, but throughput-wise, the bus is four times faster than the car.
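The arithmetic behind the car-and-bus example can be written out explicitly. The throughput figures assume that each vehicle makes round trips, returning empty for the next load; this is the implicit assumption that yields 15 and 60 passengers per hour.

```java
// Latency vs. throughput for the car-and-bus example: latency is the one-way
// travel time, while throughput counts passengers delivered per hour assuming
// the vehicle makes a round trip for every load.
public class LatencyVsThroughput {
    // One-way latency in minutes for a given distance (miles) and speed (mph).
    public static double latencyMinutes(double miles, double mph) {
        return miles / mph * 60.0;
    }

    // Passengers per hour with round trips: one load per 2 * latency minutes.
    public static double throughputPerHour(int capacity, double miles, double mph) {
        double roundTripMinutes = 2.0 * latencyMinutes(miles, mph);
        return capacity * 60.0 / roundTripMinutes;
    }
}
```

Under these assumptions the car yields a latency of 10 minutes and a throughput of 15 PPH, and the bus a latency of 30 minutes and a throughput of 60 PPH, matching the figures in the example.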
The authors of the GFS paper say: “Our goals are to fully utilize each machine's
network bandwidth, avoid network bottlenecks and high-latency links, and minimize
the latency to push through all the data.” Therefore, when making a function call, some
time may be spent determining which network links to use to send the data. Doing so
increases the latency. Once the network path is determined, the data is sent.
The latter point is important, since bandwidth and latency are connected. If the rate
of sending bits is close to or higher than the bandwidth, then the network is saturated,
the data will be queued, and the latency will increase. However, if the bandwidth is
already high and the network is not saturated, increasing the bandwidth by adding
more network capacity will not decrease the latency. For example, satellite latency
is about 20 times higher than that of ground-based networks, at over 600 ms, due to
the distance between the Earth and the satellite in space. In many cloud computing tasks, it
turns out that the data is received and processed in large batches instead of in small
and frequent RPC requests. This is why bandwidth takes precedence over latency.
Question 9: Discuss pros and cons of the solution where small data blocks are
batched into large ones before they are submitted to servers for processing.
Google File System (GFS) is designed to satisfy these constraints and it uses the
client/server model that we discussed in Section 4.2. At the logical level, a single
filesystem interface is presented as a tree with a root and branches as directory subtrees
that contain other directories and files, the same as in most other ordinary filesystems.
At the physical level in GFS, files are divided into fixed-size units called chunks, and
they are handled by chunkservers. That is, a file is not mapped to a single server; its
data chunks are spread across many servers. The single master server assigns each
chunk an immutable and Globally Unique IDentifier (GUID) and maintains all
filesystem metadata. Clients access files by sending requests that include chunk GUIDs to
chunkservers, which run on top of Linux as user-level processes. At first, a client asks
the master to provide it with a chunkserver, and once the master replies with the chunkserver
address, the client communicates with this chunkserver to service its requests. Similarly
to the map/reduce architecture, the master communicates with chunkservers using
the heartbeat control message, and chunkservers report their states to the master. The
master provides location and migration transparencies by using the metadata, which
includes name resolution by mapping from file names to chunk GUIDs, access control,
and the physical location of chunks, each of which is replicated on multiple computers
for fault tolerance and recovery. The GFS differs from other popular filesystems in that
it does not maintain metadata on the directory level. The namespace in GFS is
maintained as a dictionary that maps absolute paths to metadata. The metadata is stored in
the master's main memory.
Question 10: How do chunk sizes affect the performance of the GFS?
Question 11: How would you modify the GFS if chunks had priorities
associated with the urgency of their being processed by map/reduce applications?
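The flat namespace described above can be sketched as a dictionary from absolute path names to per-file metadata, with a second dictionary resolving chunk GUIDs to replica locations. The data structures and names below are simplified stand-ins for illustration, not the GFS's actual representations.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// A simplified GFS-style namespace: a flat dictionary maps absolute path names
// to the GUIDs of the file's chunks, and a second map resolves each chunk GUID
// to the chunkservers holding its replicas.
public class GfsNamespace {
    private final Map<String, List<String>> pathToChunks = new HashMap<>();
    private final Map<String, List<String>> chunkReplicas = new HashMap<>();

    public void addFile(String path, List<String> chunkGuids) {
        pathToChunks.put(path, new ArrayList<>(chunkGuids));
    }

    public void placeReplicas(String chunkGuid, List<String> chunkservers) {
        chunkReplicas.put(chunkGuid, new ArrayList<>(chunkservers));
    }

    // Name resolution: path -> chunk GUIDs -> chunkserver addresses.
    public List<String> serversFor(String path) {
        List<String> servers = new ArrayList<>();
        for (String guid : pathToChunks.getOrDefault(path, List.of())) {
            servers.addAll(chunkReplicas.getOrDefault(guid, List.of()));
        }
        return servers;
    }
}
```

Because the namespace is a flat dictionary rather than a directory tree, resolving a path is a single lookup, which reflects the GFS decision not to keep per-directory metadata.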
Recall that the main reason for inventing the GFS was to make operations on big data
efficient. In map/reduce, the output of mappers is appended to the files that keep
intermediate results as key-value pairs, and the output of reducers is also appended to
the resulting file where the data is aggregated into the final result. Making file
modifications efficient is a priority for improving the overall efficiency of the map/reduce
architecture. In addition, since many clients read and append data, which are replicated
on many chunkservers, there is a potential for saturating the network with a flood of
requests and data.
These issues are addressed in the GFS by using atomic data appends and linear
network connectivity between chunkservers. The atomicity of an operation is defined
as a sequence of indivisible action steps such that either all these steps occur or none
does [60]. It means that appending data to a file atomically cannot be viewed as a
sequence of separate steps where one step searches for the offset from the file head,
another positions the reference to the location where the data will be written, and the
data is written in the third step. If appending data is not an atomic operation, then two
concurrent data append operations will interfere with each other when their steps are
executed in parallel, resetting the references and corrupting the file in the end. Explicitly
synchronizing non-atomic appends requires programmers to create synchronization
mechanisms that lock the file. These synchronization mechanisms are difficult to reason
about and they slow down the execution, since one operation must wait until the other
releases the lock.
Instead, the GFS stores data chunks in parallel without requiring programmers
to use synchronization mechanisms and then automatically updates references in the
metadata to designate appended data chunks to a given file. The reader can obtain
in-depth information about write operations, as well as about the use of leases for file
management and replication, from the original GFS paper [60].
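The guarantee that atomic appends give to clients can be illustrated with a small sketch in which the synchronization lives inside the append operation itself, as it does inside the GFS, so client code performs plain appends and still no record is lost or interleaved. This illustrates the guarantee only, not the GFS implementation.

```java
import java.util.ArrayList;
import java.util.List;

// An atomic record append: the synchronized method makes "position at the end
// of the log and write the record" a single indivisible step, so concurrent
// appends cannot interleave and corrupt each other's records.
public class AtomicAppendLog {
    private final List<String> records = new ArrayList<>();

    public synchronized void append(String record) {
        records.add(record);  // locate-end and write happen as one atomic step
    }

    public synchronized int size() {
        return records.size();
    }

    // Several clients append concurrently; with atomic appends, no record is
    // lost, and no caller had to write any locking code of its own.
    public static int concurrentAppend(int threads, int perThread) {
        AtomicAppendLog log = new AtomicAppendLog();
        List<Thread> workers = new ArrayList<>();
        for (int t = 0; t < threads; t++) {
            Thread worker = new Thread(() -> {
                for (int i = 0; i < perThread; i++) log.append("record");
            });
            workers.add(worker);
            worker.start();
        }
        try {
            for (Thread worker : workers) worker.join();
        } catch (InterruptedException e) {
            throw new RuntimeException(e);
        }
        return log.size();
    }
}
```

With a plain unsynchronized list, the same concurrent workload could silently drop appends; moving the locking inside the append operation is what lets clients stay lock-free.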
When it comes to replication, GFS places requirements on the connectivity between chunkservers to reduce the amount of data exchanged between them. Chunkservers are connected in a chain, forming a line where each chunkserver has two closest neighbors: one that sends data to it and one that receives data from it. The distance between chunkservers is thus measured by the number of chunkservers through which data from the source chunkserver must travel to reach the destination chunkserver. Allowing chunkservers to communicate only with their closest neighbors reduces the communication overhead, which would be much higher for a tree or a clique-connected network. In addition, TCP-based data transfers are pipelined: a chunkserver does not wait to receive an entire data chunk, but forwards partial packets as they arrive, thereby reducing the overall transfer time. Small gains on each data chunk transfer add up to large gains over a long map/reduce computation.
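A back-of-the-envelope model shows why pipelining pays off. The pipelined estimate below follows the one given in the GFS paper (elapsed time of roughly B/T + R·L for B bytes, R replicas, per-link throughput T, and per-hop latency L); the numbers are illustrative, not measurements.

```java
public class PipelineTransfer {
    // Pipelined chain: each chunkserver forwards packets as soon as they arrive,
    // so the payload cost B/T is paid once, plus one hop latency per replica.
    static double pipelined(double bytes, int replicas, double throughput, double latency) {
        return bytes / throughput + replicas * latency;
    }

    // Store-and-forward: every chunkserver waits for the whole chunk before forwarding.
    static double storeAndForward(double bytes, int replicas, double throughput, double latency) {
        return replicas * (bytes / throughput + latency);
    }

    public static void main(String[] args) {
        double mb = 1 << 20; // one megabyte in bytes
        // A 64 MB chunk, 3 replicas, ~12.5 MB/s (100 Mbps) links, 1 ms hops.
        System.out.printf("pipelined:         %.3f s%n", pipelined(64 * mb, 3, 12.5 * mb, 0.001));
        System.out.printf("store-and-forward: %.3f s%n", storeAndForward(64 * mb, 3, 12.5 * mb, 0.001));
    }
}
```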
Question 12: Analyze the current documentation on the GFS and the HDFS and create a list of differences between these filesystems.
node is often viewed as a VM running daemon processes of the HDFS and YARN in
the background. One or more master nodes in a Hadoop cluster coordinate the work in
the cluster and worker nodes receive jobs from the master nodes and perform mapper
and reducer computations. The HDFS contains three main services. The NameNode service runs on the master node, maintains all metadata, and services clients' requests about the locations of files. The secondary and standby NameNode services checkpoint the metadata, and the DataNode service corresponds to the GFS chunkservers: it stores chunks of data and updates the NameNode service about the states of the local files. The HDFS uses the WORM (write once, read many) model, where files cannot be overwritten, only moved, deleted, or renamed. As the reader can see, the basic architectures of the HDFS and the GFS are quite similar.
Question 13: How many times are data chunks replicated in the HDFS?
the computer that the ResourceManager assigns to the task to run the container.
Responding to this request, the NodeManager creates and starts the container (one
per task) and the task runs. Once completed, the ApplicationMaster reports the
task completion to the ResourceManager, deregisters itself, and exits.
Consider an example of the Java map/reduce skeleton program for the Hadoop
framework that is shown in Figure 5.3. Some imports are omitted and error handling
code is removed for brevity. Each program line is numbered on the left margin.
1 import org.apache.hadoop.*;
2 public class YourImplementationClassName {
3 public static class YourMapperName extends MapReduceBase implements
4 Mapper<LongWritable,Text,Text,IntWritable>{
5 public void map(LongWritable key, Text value,
6 OutputCollector<Text, IntWritable> output,
7 Reporter reporter) throws IOException {
8 //here goes your implementation of the mapper
9 reporter.progress();
10 output.collect(value, new IntWritable(1));} }
11 public static class YourReducerName extends MapReduceBase
12 implements Reducer<Text, IntWritable, Text, IntWritable> {
13 public void reduce( Text key, Iterator <IntWritable> values,
14 OutputCollector<Text, IntWritable> output, Reporter reporter)
15 throws IOException {
16 int val = 0; //here goes your implementation of the reducer
17 output.collect(key, new IntWritable(val));} }
18 public static void main(String args[])throws Exception {
19 JobConf conf = new JobConf(YourImplementationClassName.class);
20 conf.setJobName("whatever name you choose");
21 conf.setOutputKeyClass(Text.class);
22 conf.setOutputValueClass(IntWritable.class);
23 conf.setMapperClass(YourMapperName.class);
24 conf.setReducerClass(YourReducerName.class);
25 conf.setInputFormat(TextInputFormat.class);
26 conf.setOutputFormat(TextOutputFormat.class);
27 FileInputFormat.setInputPaths(conf, new Path(args[0]));
28 FileOutputFormat.setOutputPath(conf, new Path(args[1]));
29 JobClient.runJob(conf);}}
Figure 5.3: Example of the Java skeleton code for the map/reduce in Hadoop framework.
There are two key elements that make this program skeleton specific to the Hadoop
framework. The program’s classes are derived from the Hadoop framework class
MapReduceBase and they implement the parameterized interfaces Mapper and
Reducer. Also, the reader who is familiar with the Java types can see that the
program contains new types such as LongWritable, Text, TextInputFormat,
TextOutputFormat, and IntWritable. These types are introduced by the Hadoop
framework as an alternative to Java types like String and long for a number of reasons. Objects of these types represent data chunks, and they are serialized to a byte stream so that they can be distributed over the network or persisted to permanent storage.
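To see what these Writable-style types buy, consider a minimal stdlib-only sketch of the contract behind them. Hadoop's Writable interface declares write(DataOutput) and readFields(DataInput); the IntBox class below is a hypothetical stand-in that implements the same idea for a single integer.

```java
import java.io.*;

public class WritableSketch {
    // A minimal Writable-like type (IntBox is illustrative, not a Hadoop class):
    // it serializes itself to a byte stream and restores itself from one.
    static class IntBox {
        int value;
        void write(DataOutput out) throws IOException { out.writeInt(value); }
        void readFields(DataInput in) throws IOException { value = in.readInt(); }
    }

    public static void main(String[] args) throws IOException {
        IntBox original = new IntBox();
        original.value = 42;
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        original.write(new DataOutputStream(bytes)); // ready to ship over the network

        IntBox restored = new IntBox();
        restored.readFields(new DataInputStream(new ByteArrayInputStream(bytes.toByteArray())));
        System.out.println(restored.value); // prints 42
    }
}
```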
5.8 Summary
In this chapter, we introduce the concept of big data and give examples of how implementing software that processes big data differs fundamentally from implementing applications that process small amounts of data. We discuss the concept of the datacenter and show how it can be abstracted as a distributed object. Next, we show how the diffusing computation model that we reviewed earlier can be modified into the map/reduce model to split one task on a big data file into many thousands of map and reduce subtasks on much smaller files, and how to implement and execute this model in a datacenter with failure management that allows users to view the execution in the datacenter as fault-free. We review the Google implementation of the map/reduce model, analyze the new requirements it places on the filesystem, and then study how Google implemented these requirements in its filesystem, the GFS. We conclude by reviewing a case study of the map/reduce implementation in Apache Hadoop.
Chapter 6
RPC Galore
6.1 Java Remote Method Invocation (RMI)
erate the bytecode that runs on its own platform called the Java Virtual Machine (JVM),
which sits on top of the operating system but below the Java applications. That is, once
written in Java, a program can run on any platform that hosts the JVM on top of which
this program will run. In a way, Java becomes a common language denominator, like
the IDL, except it allows programmers to create definitions of procedures in addition
to their specifications or declarations. Therefore, there is no need to introduce a new
IDL for Java RMI, since specifications of remote procedures can be created in Java. Of
course, in real-world software projects Java is used alongside many other programming languages, and to enable interoperability between the JVM and other languages and platforms, Java IDL is used1. We will discuss it later in the book.
Recall that programmers create named interfaces in IDL with versions and GUIDs
and the declarations of the procedures that they export, which include the names of the
procedures, the sequences of the input and output parameters and their types. In Java,
the notion of interface is a part of the Java Language Specification (JLS): “an interface
declaration introduces a new reference type whose members are classes, interfaces,
constants, and methods. This type has no instance variables, and typically declares
one or more abstract methods; otherwise unrelated classes can implement the interface
by providing implementations for its abstract methods. Interfaces may not be directly
instantiated2 .” This definition allows programmers to create declarations of the remote
procedures and share them between clients and servers to enforce typing.
Until now we have not differentiated between procedural and object-oriented lan-
guages. In fact, we used the terms function and procedure interchangeably. One can
think of a function as a mapping from the input values to the output values, whereas a
procedure defines an ordered sequence of operations or commands that may not result
in any output value. In Java, an object is a class instance or an array, where a class
specifies a reference type (i.e., it is not one of the primitive types: boolean, byte, short,
int, long, char, float, and double), where “types limit the values that a variable can hold
or that an expression can produce, limit the operations supported on those values, and
determine the meaning of the operations. Strong static typing helps detect errors at
compile time3 .” Classes contain method definitions, which collectively define the be-
havior of the class instances. Invoking a method on an object is written as o.m(p1, ..., pn), where o is the object name and the parameters p1, ..., pn are of types t1, ..., tn, respectively. A simple trick allows us to convert an object's method call into a template for the RPC: we extend the parameter space with the value of the object as the first parameter to the method call, i.e., m(o_ref_value, p1, ..., pn). Doing so
enables us to reason about object-oriented (OO) remote method calls as the standard
RPC, where the first parameter is used to resolve object references on the server side.
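A toy sketch of this trick (the names are hypothetical): the server keeps a table from object references to live instances, and the first parameter of every incoming call selects the target object.

```java
import java.util.HashMap;
import java.util.Map;

public class ObjectDispatchSketch {
    interface Calculator { int add(int a, int b); }

    // Server-side table mapping remote object references to live instances.
    static final Map<String, Calculator> objects = new HashMap<>();

    // m(o_ref, p1, ..., pn): the first "parameter" names the target object,
    // turning an object-oriented method call into a plain RPC-style invocation.
    static int invokeAdd(String objectRef, int a, int b) {
        Calculator target = objects.get(objectRef); // resolve the reference server-side
        return target.add(a, b);
    }

    public static void main(String[] args) {
        objects.put("calc-1", (a, b) -> a + b);
        System.out.println(invokeAdd("calc-1", 3, 7)); // prints 10
    }
}
```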
The behavior of Java RMI is defined in its specification4. It uses the client/server model, and one of the key elements that distinguishes Java RMI from a plain vanilla RPC is the concept of a registry, a remote object that maps names to other remote objects5.
1 http://docs.oracle.com/javase/7/docs/technotes/guides/idl/
2 https://docs.oracle.com/javase/specs/jls/se8/html/jls-9.html
3 https://docs.oracle.com/javase/specs/jls/se8/html/jls-4.html#jls-4.
3.1
4 https://docs.oracle.com/javase/7/docs/platform/rmi/spec/rmiTOC.html
5 https://docs.oracle.com/javase/7/docs/platform/rmi/spec/
Registries enable transparency: the idea that a distributed system should hide its distributed nature from its users. The RPC provides access transparency as its basic benefit, where the invocation of a procedure is the same regardless of whether it is local or remote; the physical separation is hidden from the client. Registries add location (or name) transparency, where the actual location of the remote procedure is unimportant and even unknown to the client, which refers to remote objects by the logical names under which they are mapped in the registry. Later we will see how migration transparency allows remote objects to move while clients continue to access them during and after the move without changing the runtime configuration.
Since there are many tutorials, the reader can use them to learn how to program Java RMI on her own6. Here we will discuss briefly how using the JVM addresses some issues with the RPC, such as parameter passing, lazily activating remote objects, and dynamically loading class definitions. But first, we will review the basic steps of the Java RMI process.
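Those steps can be sketched end-to-end in a single JVM using only the JDK's java.rmi package: define a Remote interface, export an implementation, bind it to a logical name in a registry, then look it up and call it. The Greeter interface and the port below are illustrative choices, not part of the RMI API.

```java
import java.rmi.Remote;
import java.rmi.RemoteException;
import java.rmi.registry.LocateRegistry;
import java.rmi.registry.Registry;
import java.rmi.server.UnicastRemoteObject;

public class RmiSketch {
    // The shared interface plays the role of the IDL: it extends Remote,
    // and every remote method declares RemoteException.
    public interface Greeter extends Remote {
        String greet(String name) throws RemoteException;
    }

    static String demo() throws Exception {
        Greeter impl = name -> "Hello, " + name;                        // server-side implementation
        Greeter stub = (Greeter) UnicastRemoteObject.exportObject(impl, 0);
        Registry registry = LocateRegistry.createRegistry(1099);       // in-process registry
        registry.rebind("greeter", stub);                               // logical name -> remote object

        // A client in another JVM would call LocateRegistry.getRegistry(host, port) first.
        Greeter remote = (Greeter) registry.lookup("greeter");
        String reply = remote.greet("world");                           // the call goes through the stub

        UnicastRemoteObject.unexportObject(impl, true);                 // clean up so the JVM can exit
        UnicastRemoteObject.unexportObject(registry, true);
        return reply;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(demo()); // prints Hello, world
    }
}
```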
Question 1: How does Java RMI framework guarantee that a single call to a
remote method in the client program results in the corresponding single method
call in the server program?
Question 2: How do you make a remote method call using Java RMI frame-
work to pass a lambda function as a parameter to this remote method?
Question 3: How do you exclude certain fields from an object that you pass
as a parameter to the remote method? That is, these fields can be used by the
client program, but the server program will not “see” these fields when the object
is received.
Consider a situation where there are multiple aliases in the client program to the same object, i.e., multiple variables are bound to the same object location. If the client and server programs run in the same address space, an identity comparison will establish that all these variables reference the same object. However, if these variables are used as parameters to a remote method, the RMI framework will serialize and transmit these objects and then deserialize them into server-bound objects that may have completely different identities even though they contain exactly the same data.
Question 4: What Java classes and methods are used to transmit parameters according to the Java RMI specification?
To prevent this situation, the Java RMI specification establishes a referential integrity requirement, which states that if two references to an object are passed from the client address space to a server address space, in parameters or in the return value of a single remote method call, and those references refer to the same object in the sending JVM, then they will refer to a single copy of the object in the receiving JVM.
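Java's own object serialization already provides this guarantee within a single stream, which is what the marshaling of one remote call can rely on. The sketch below shows aliasing preserved across a serialization round trip: two aliases go in, two aliases of a single copy come out.

```java
import java.io.*;
import java.util.ArrayList;
import java.util.List;

public class ReferentialIntegritySketch {
    // Serialize and deserialize an array of (possibly aliased) objects in one pass.
    static Object[] roundTrip(Object[] aliases) throws Exception {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(aliases); // one stream, one serialization pass
        }
        try (ObjectInputStream in = new ObjectInputStream(
                new ByteArrayInputStream(bytes.toByteArray()))) {
            return (Object[]) in.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        List<String> shared = new ArrayList<>();
        shared.add("data");
        Object[] received = roundTrip(new Object[] { shared, shared });
        // Within one stream, Java serialization preserves aliasing:
        System.out.println(received[0] == received[1]); // prints true
        // But the received copy has a different identity than the sender's object:
        System.out.println(received[0] == shared);      // prints false
    }
}
```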
Question 5: Assuming that you use only Java object serialization to pass ob-
jects between JVMs, how would you implement the referential integrity?
exported in a JVM (i.e., passive) until the first request arrives from a client. Doing so
enables an automatic conservation of computing resources until they are required.
A key idea behind the inner workings of lazy activation is to encapsulate a reference to a remote object's stub in a faulting object that contains an activation identifier and the actual reference to the remote object, which may be null while the remote object is passive. The first time a client invokes a method of the passive remote object, the fault is triggered, and the RMI framework loads and initializes the remote object and performs the method call. The reader should consult the RMI specification for an in-depth discussion of the activation protocol.
6.2.1 XML-RPC
Since its introduction in 1996, the eXtensible Markup Language (XML) has been used widely to encode documents and define their schemas in a human- and machine-readable format; its performance is acceptable for many applications and it is easy to use [8]. XML has become a de facto lingua franca of data interchange in electronic commerce. The success of XML owes much to the pervasiveness of its predecessor, the HyperText Markup Language (HTML), a markup language that is used to create web pages for display in Internet browsers. For example, an HTML page that lists components that exist on a computer is shown in Figure 6.1.
1 <html> <head>
2 <title>Computer with IP address 192.168.1.1</title>
3 </head>
4 <body><h1 align="center"><font size="7">Components</font></h1>
5 <ol><li><p align="left"><b>Realplayer</b></li>
6 <li><p align="left"><b>IE Web Browser</b></li></ol>
7 </body></html>
HTML contains tags whose job is to tell a web browser how to display the web page; that is, HTML offers a set of tags that are instructions given to web browsers on how to format text. For example, the tag <h1 align="center"> specifies that a browser should use the top heading style and align the displayed text to the center of the browser display area. Unfortunately, this format does not carry any semantic meaning about the data: only when the page is displayed in the browser can a human reader extract the semantics of the names and map them to the underlying concepts.
The solution of this problem lies in XML’s markup tags that allow developers to
design custom types that map to predefined concepts and structure data semantically.
For example, the HTML data shown in the previous example are given in XML format
in Figure 6.3.
1 <components> <computer Address="192.168.1.1">
2 <component>Realplayer</component>
3 <component>IE Web Browser</component>
4 </computer></components>
XML documents are organized hierarchically as trees, where the root node specifies the topmost parent XML element and child elements are defined under their respective parents. The basic unit in XML is the element, and XML enables the declaration of data using custom types that are instantiated as elements. Each element has zero or more attributes, which are key-value pairs that provide additional information. For example, the element computer has the attribute Address with the value 192.168.1.1. There is a plethora of XML parsers on the market, some of them embedded in language runtimes and incorporated as part of the type system, e.g., in Scala.
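Because the markup is structured, any XML parser can recover the data programmatically. A minimal sketch with the JDK's bundled DOM parser, applied to the components document shown above:

```java
import java.io.ByteArrayInputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

public class XmlComponentsSketch {
    static final String XML =
        "<components><computer Address=\"192.168.1.1\">"
      + "<component>Realplayer</component>"
      + "<component>IE Web Browser</component>"
      + "</computer></components>";

    // Parse an XML string into a DOM tree using the JDK's built-in parser.
    static Document parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
            .parse(new ByteArrayInputStream(xml.getBytes("UTF-8")));
    }

    public static void main(String[] args) throws Exception {
        Document doc = parse(XML);
        Element computer = (Element) doc.getElementsByTagName("computer").item(0);
        System.out.println(computer.getAttribute("Address"));                    // prints 192.168.1.1
        System.out.println(doc.getElementsByTagName("component").getLength());   // prints 2
    }
}
```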
6.2. <YOUR DATA SERIALIZATION FORMAT HERE>-RPC 89
Just as a language grammar serves as a specification that dictates the rules of writing programs, XML uses schemas and Document Type Definitions (DTDs) to enforce a certain structure in XML documents. DTDs specify which elements are permitted in XML documents that use these DTDs. For example, a DTD for the XML document that describes components is given below.
1 <!ELEMENT components (computer*)>
2 <!ELEMENT computer (component+)>
3 <!ATTLIST computer Address CDATA #REQUIRED>
4 <!ELEMENT component (#PCDATA)>
This DTD declares that the element components may have zero or more child elements called computer. Each computer element may have one or more elements called component. As in regular expressions, an expression followed by the sign + means one or more repetitions, and one followed by the sign *, i.e., the Kleene star12, means zero or more repetitions. The element computer has the attribute Address, which is declared as character data (CDATA), and this attribute is required. Finally, the element component is declared as parsed character data (PCDATA), which means that this element contains some text. This DTD can be included in the XML document by adding the following line: <!DOCTYPE components SYSTEM "components.dtd">. DOCTYPE is markup telling the XML parser that a DTD is referenced, components is the name of the top-level element, and the quoted name identifies the DTD file. SYSTEM introduces the system identifier, a URI that tells the parser where the DTD is located.
Programmers can easily create custom types, which map to semantic concepts of
the system that they model. We give a small example of the implementation of a
calculator with one method add that returns the sum of two integer parameters.
The Java-like pseudocode example is shown in Figure 6.4 of the XML-RPC-based
implementation of the calculator with the single method add. Lines 1–11 show an
example of the actual XML request sent by the client over the network to the server.
The XML header in lines 2–7 defines the protocol with which this request will be trans-
mitted (i.e., HyperText Transfer Protocol (HTTP)), the method POST with which this
request will be applied to the destination web server www.somehost.com, among
other things, and the payload that includes the root element methodCall that con-
tains the child elements methodName and params, which define the class and method names and the parameter values and types. The client code that emits this request is
shown in lines 22–31. In the method main of the client class CalcClient the object
of the XML-RPC framework class XmlRpcClient is instantiated whose constructor
takes the address of the webserver that hosts the XML-RPC server. The parameters
are added to the container class Vector in lines 26–28. Finally, the call execute is invoked on the object in line 29, where it takes the name of the server class and the method and its parameters and returns a hash table that contains the result, which is retrieved in line 31.
The XML-RPC server class is shown in lines 13–20. It contains the implementation
of the method add in lines 14–17 that puts the sum of two integers into the hash table
12 https://en.wikipedia.org/wiki/Kleene_star
90 CHAPTER 6. RPC GALORE
1 //XML-RPC request
2 POST /XMLRPC HTTP/1.1
3 User-Agent: UserName/15.2.1
4 Host: www.somehost.com
5 Content-Type: text/xml
6 Content-length: 500
7 <?xml version="1.0"?>
8 <methodCall> <methodName>Calculator.add</methodName>
9 <params><param> <value><i4>3</i4></value> </param>
10 <param> <value><i4>7</i4></value> </param></params>
11 </methodCall>
12 //XML-RPC Server
13 public class XMLRPCServer {
14 public Hashtable add(int x, int y) {
15 Hashtable result = new Hashtable();
16 result.put("add", new Integer(x + y));
17 return( result );}
18 public static void main( String [] args ) {
19 WebServer server = new WebServer(8080);
20 server.addHandler("Calculator", new XMLRPCServer()); server.start();}
21 //XML-RPC Client
22 public class CalcClient {
23 private static String server = "http://www.someserver.com";
24 public static void main (String [] args) {
25 XmlRpcClient client = new XmlRpcClient( server );
26 Vector params = new Vector();
27 params.addElement(new Integer(3));
28 params.addElement(new Integer(7));
29 Hashtable result = (Hashtable) client.execute (
30 "Calculator.add", params );
31 int sum = ((Integer)result.get("add")).intValue(); } }
Figure 6.4: Java-like pseudocode example of the XML-RPC-based implementation of the calcu-
lator with the single method add.
object result and returns this table object. In the method main in lines 18–20, the webserver is started and an instance of the class XMLRPCServer is registered with the webserver under the key Calculator, thus activating it and making it available to clients that call its method remotely.
Last, one can think of combining XML with Java RMI to access remote Java objects over the Internet. This was the idea behind the Java Application Programming Interface (API) for XML-based RPC (JAX-RPC). Eventually, it evolved into the Java API for XML Web Services (JAX-WS), which we will review later in the book. The JAX-RPC service is based on W3C (World Wide Web Consortium) standards like the Web Service Description Language (WSDL)13.
13 https://github.com/javaee/jax-rpc-ri
6.2.2 JSON-RPC
Recall the JavaScript Object Notation (JSON), a widely used lightweight and language-independent data interchange format that was introduced just before the end of the last millennium14. A key element of the JSON representation is a text-based key-value pair, where the key names an attribute and its value follows the colon separator, e.g., "jsonrpc": "2.0", "method": "add", "params": [3, 7]. In this example, we have comma-separated key-value pairs, where jsonrpc is the key that designates an RPC call, with the value 2.0 designating the version of the JSON-RPC framework; the key method specifies the name of the remote method; and the key params specifies a value that is an array, designated with square brackets, containing two parameter values. Submitting this JSON data object to a destination webserver that hosts the JSON-RPC framework will cause the invocation of the remote method, and the result of this invocation will be passed back to the client as a JSON object, e.g., "jsonrpc": "2.0", "result": 10.
From these examples the reader can see that JSON-RPC is quite similar to XML-RPC: it is stateless, and it can use multiple message-passing mechanisms to transmit messages. The JSON-RPC specification allows batching several request objects to reduce the number of round-trip communications between the clients and the server16. To date, there are hundreds of implementations of JSON-RPC with various degrees of use.
Since remote RPC-based objects can be deployed on top of webservers, one may submit RPC calls to the remote methods from the command line. A popular utility for doing that is cURL, a command line tool and library for transferring data with URL syntax that supports multiple protocols and various encoding and security options17. Consider the following command: curl --user drmark --data-binary '{"jsonrpc": "1.0", "id":"curltest", "method": "getinfo", "params": [] }' -H 'content-type: text/plain;' http://bitserver, which submits a JSON request to a bitcoin server to execute the remote method getinfo18. There are multiple implementations on the Internet where the cURL library is used to implement client API calls for various RPC frameworks.
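The same kind of request can be issued programmatically. The sketch below uses only the JDK (com.sun.net.httpserver for a toy endpoint and java.net.http for the client, Java 11+); the server returns a canned JSON-RPC result rather than evaluating the call, so this illustrates the transport, not a real JSON-RPC implementation.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class JsonRpcSketch {
    static String roundTrip() throws Exception {
        // A toy endpoint that answers every request with a canned JSON-RPC result.
        HttpServer server = HttpServer.create(new InetSocketAddress(0), 0);
        server.createContext("/rpc", exchange -> {
            byte[] reply = "{\"jsonrpc\": \"2.0\", \"result\": 10}".getBytes();
            exchange.sendResponseHeaders(200, reply.length);
            try (OutputStream body = exchange.getResponseBody()) { body.write(reply); }
        });
        server.start();
        int port = server.getAddress().getPort();

        // The same kind of POST the curl command performs, expressed with java.net.http.
        String request = "{\"jsonrpc\": \"2.0\", \"method\": \"add\", \"params\": [3, 7], \"id\": 1}";
        HttpResponse<String> response = HttpClient.newHttpClient().send(
            HttpRequest.newBuilder(URI.create("http://localhost:" + port + "/rpc"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(request))
                .build(),
            HttpResponse.BodyHandlers.ofString());
        server.stop(0);
        return response.body();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(roundTrip()); // prints the canned JSON-RPC response
    }
}
```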
6.2.3 GWT-RPC
Google Web Toolkit (GWT) is an Apache-licensed open source set of tools that allows
web programmers to create and maintain Asynchronous JavaScript And XML (AJAX)
14 http://www.json.org
16 http://www.jsonrpc.org/specification
17 https://github.com/curl/curl
18 https://en.bitcoin.it/wiki
applications using the Java language1920. GWT applications are based on the client/server model, where the client is a Graphical User Interface (GUI) that allows users to interact with the back end of the GWT application, which is usually hosted in the cloud. The GUI may be hosted in a browser or run as a desktop application on a user's computer, and it communicates with the server using some networking protocol. The user interacts with the GUI by sending events to its GUI objects (e.g., entering data into a text box or clicking on a button), and the GUI client responds to these events by executing code that communicates with the server, which performs computations and may generate and return a new GUI (e.g., a HyperText Markup Language (HTML) page generated from a Java-based distributed object). In GUI applications, user interactions often lead to multiple invocations of remote methods in the back end.
Since GWT applications frequently fetch data from the server, the RPC framework
is integrated in GWT to exchange Java objects over HTTP between the client GUIs and
the back-end servers21 . The RPC server component is implemented as a servlet, which
is a Java class that exposes well-defined interfaces according to the Java Servlet Specifi-
cation, and servlets are deployed in web containers, i.e., webservers that receive HTTP-
based client requests to invoke remote methods exposed by the deployed servlets 22 .
The GWT framework exports classes and interfaces similarly to the Java RMI, where
the interface RemoteService must be extended by a created remote interface that
contains declarations of the exported remote methods. A service implementation must
extend the GWT framework class RemoteServiceServlet and must implement
the service’s interface. Once compiled, it can be deployed with the webserver as a
servlet, making it available to clients. A client uses the method create of the class GWT to obtain a reference to the remote object, and binds the reference to a specific endpoint (i.e., the URL) where the remote object is deployed by casting the reference to the interface ServiceDefTarget and invoking its method setServiceEntryPoint with the URL as its parameter. This simple procedure underscores the ubiquity of the RPC, which is used in many different frameworks. Interested readers may look into Errai, a
GWT-based framework for building rich web applications based on the RPC infras-
tructure with uniform, asynchronous messaging across the client and server23 .
RemoteProcedureCalls.html
22 https://jcp.org/en/jsr/detail?id=315
23 http://docs.jboss.org/errai/latest/errai/reference/html_single/
#sid-5931313
24 http://www.facebook.com
6.3 Facebook/Apache Thrift
not exceed one second of response time25. Facebook uses hundreds of thousands of commodity computers in its datacenters26, as reflected in its financial filings, which report that it spent over $3.63 billion on equipment in 2016. As for software, Facebook uses dozens
of open-source frameworks and programming languages including MySQL, Hadoop,
Cassandra, Hive, FlashCache, Varnish, PHP, Python, C++, Java, and
functional languages Erlang, OCaml, and Haskell. A reason that Facebook en-
gineers created their own RPC design is that other existing RPC systems had mul-
tiple drawbacks: heavyweight with significant runtime overhead (e.g., XML-based,
CORBA), proprietary (e.g., gRPC), platform-specific (Microsoft RPC), missing cer-
tain features (e.g., Pillars misses versioning), or having a less than elegant abstraction.
Facebook Thrift is an RPC framework conceived, designed, and implemented at Facebook and released as open source to the Apache Software Foundation [141]. In 2010, Thrift became an Apache Top Level Project (TLP). Thrift's goal is to provide reliable communication and data serialization across multiple languages and platforms as efficiently and seamlessly as possible. Multiple Thrift tutorials are available on the Internet27. Here we discuss the main elements of Thrift's design and implementation.
Thrift’s design revolves around five concepts: to enable programmers to use native
types, to decouple the transport layer from the code generation, to separate data struc-
tures from their representation when transported between clients and servers, and to
simplify the versioning when rolling out changes to clients and servers frequently.
Types. Thrift IDL offers a type system that allows programmers to annotate variables
with types that map to the native types of the destination languages and plat-
forms. Base types include bool, byte, and bit-designated types like i16–i64
for integers whose representation lengths are defined by the corresponding num-
ber of bits. Types double and string are correspondingly a 64-bit floating
point number and a text/binary sequence of bytes.
Composite types in Thrift IDL include structures, whose fields are annotated with unique integer identifiers that keep track of them across multiple versions; three types of containers: list, set, and map; exceptions, which are equivalent to structures; and services, which contain method declarations with annotated parameters and return types built from the base types, exceptions, and structures.
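These pieces come together in a .thrift definition like the hypothetical one below (the names are illustrative); the Thrift compiler generates client and server stubs from such a file for each target language.

```thrift
struct AddRequest {
  1: i32 left,
  2: i32 right,
  3: optional string comment  // added in a later version; old messages simply omit field 3
}

exception CalcError {
  1: string message
}

service Calculator {
  i32 add(1: AddRequest req) throws (1: CalcError err)
}
```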
Transport. In general, programmers should not think about how objects are exchanged
between clients and servers. Once they are serialized, the choice of the commu-
nication mechanism should not matter and it should not affect how they are con-
verted into an internal representation as sequences of bytes. Thrift provides the
interface TTransport that exports abstract methods open, close, read,
write, flush, and isOpen – they designate generic operations that can be
25 http://www.datacenterknowledge.com/the-facebook-data-center-faq/
26 https://techcrunch.com/gallery/a-look-inside-facebooks-data-center
27 https://thrift.apache.org/tutorial
exports methods for all protobuf messages and their fields; messages can be constructed by instantiating this class and calling methods for each field with concrete values to build the entire message and emit it in the internal wire-specific format. Conversely, a constructor takes a serialized binary protobuf and parses it to populate the object fields, which programmers can access directly. Multiple converters exist to translate protobuf messages into JSON, XML, and many other formats, and vice versa.
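For comparison, a protobuf message is declared in a .proto file like the hypothetical fragment below; as in Thrift, the integer field numbers, not the field names, identify fields on the wire, which is what keeps old and new versions interoperable.

```protobuf
syntax = "proto3";

// Field numbers identify fields on the wire; names can change freely.
message AddRequest {
  int32 left = 1;
  int32 right = 2;
}
```

From this, the protobuf compiler generates (for Java) an AddRequest class with a nested builder, so a message is assembled as AddRequest.newBuilder().setLeft(3).setRight(7).build().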
Similar to Thrift, gRPC is built to be highly performant in the distributed datacenter environment. Since there are many clients requesting services from gRPC remote objects, a load balancer, which runs in the process grpclb, distributes requests from gRPC clients to balance the load across multiple available remote objects. The workflow starts with a gRPC client that issues a name resolution request as part of making a remote call. A response indicates either the Internet Protocol (IP) address of the load balancer process or the address of a configuration service that describes what policy to use to distribute clients' requests (e.g., the first available address or round-robin selection)29. If the returned address belongs to the load balancer, the gRPC framework opens a stream connection to a remote object server that is selected based on an algorithm that calculates a projected distribution of workloads.
Futures are automatic waits on values: a future is created when the reference to a value is unknown at the time of binding, and this reference is filled in automatically by some later computation. That is, a future is an abstraction for an object that holds a value that is not available when the object is created but becomes available after some later computation. Before the computation occurs, the corresponding future is empty; if the computation results in an exception, the future fails; and if the computation produces a value, the future succeeds.
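The same notion exists in the JDK as CompletableFuture; a small sketch (the names are illustrative) shows the empty, succeeded, and failed states and how stages compose into a pipeline:

```java
import java.util.concurrent.CompletableFuture;

public class FutureSketch {
    // A pipeline over a value that may not exist yet: each stage runs
    // automatically once the previous one produces its result.
    static CompletableFuture<String> pipeline(CompletableFuture<Integer> followers) {
        return followers
            .thenApply(n -> n / 2)                    // e.g., keep only personal accounts
            .thenApply(n -> "sent " + n + " messages");
    }

    public static void main(String[] args) {
        // Empty at creation; filled in later by an asynchronous computation.
        CompletableFuture<Integer> followers = CompletableFuture.supplyAsync(() -> 1000);
        System.out.println(pipeline(followers).join()); // prints: sent 500 messages

        // A computation that fails yields a failed future instead of a value.
        CompletableFuture<Integer> failed = new CompletableFuture<>();
        failed.completeExceptionally(new IllegalStateException("connection lost"));
        System.out.println(failed.isCompletedExceptionally()); // prints true
    }
}
```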
Consider the Scala pseudocode for sending a message to all followers of some Twitter account that is shown in Figure 6.5. The variable numberOfFollowers
29 https://github.com/grpc/grpc/blob/master/doc/service_config.md
30 https://twitter.github.io/finagle
96 CHAPTER 6. RPC GALORE
1 val numberOfFollowers: Future[List[String]] = Future {
2   connect2TwitterAccount(credentials).getFollowers().toList }
3 val numSentMsgs = numberOfFollowers.map(_.filter(_.status == PERSONAL)
4   .map(followerObject => SendMessage(followerObject, someMessage)))
5 numSentMsgs onSuccess { case _ => println("sent " + numSentMsgs.toString) }
Figure 6.5: Scala pseudocode example of using futures to send messages to Twitter followers.
declared in line 1 receives the list of the Twitter account names of all followers of the account defined by credentials. Needless to say, the method connect2TwitterAccount may take some time to execute, and the client program does not have to block and wait until this method finishes. It continues to execute the code to line 3, where it uses the variable numberOfFollowers to filter out all accounts that are not personal (e.g., corporate) and then sends messages by mapping each account object to the method SendMessage in line 4. Finally, in line 5, the method onSuccess is defined to print out the list of accounts to which messages were sent successfully. Thus the purpose of futures is to enable programmers to define asynchronous calls concisely, without writing additional code that specifies how these asynchronous calls are composed into sophisticated pipelines with error handling.
An example of an asynchronous uniform service is shown in Figure 6.6. Importing classes from the Finagle framework is done in line 1 and the server class is declared in line 2. The service object is created in line 3 as an instance of the parameterized class Service, whose method apply, declared in line 4, takes a single HTTP request parameter and returns a future of the HTTP response. The implementation of the method apply returns an HTTP response with the version number and the status Ok. The service is bound and exported in line 6.
1 import com.twitter.finagle._, com.twitter.util._
2 object Server extends App {
3   val service = new Service[http.Request, http.Response] {
4     def apply(req: http.Request): Future[http.Response] =
5       Future.value(http.Response(req.version, http.Status.Ok))}
6   Await.ready( Http.serve(":8080", service) ) }
This example comes from Finagle's tutorial31 and we use it to illustrate how easy it is to create asynchronous services and export their remote methods with Finagle.
In fact, Finagle can be used with other RPC services like Thrift for which Twitter
developed and released Scrooge32 , a Thrift code generator written in Scala that
makes it easy to integrate Thrift and Finagle RPCs33 .
Filters enable chaining of Finagle services into a composite service; they are defined in Scala as functions of type Filter[Req, Resp] = (Req,
31 https://twitter.github.io/finagle/guide/Quickstart.html
32 https://twitter.github.io/scrooge
33 https://twitter.github.io/scrooge/Finagle.html
6.7 Summary
In this chapter, we reviewed some of the most popular and most used RPC frameworks.
The powerful idea of invoking remote methods over the network has become even more
important in the age of cloud computing. Modern implementations of RPC use more
powerful abstractions like futures and pipelining to make it easy for programmers to
create highly performant RPC services without writing much code. Making a single
remote call or a sequence of synchronous remote calls is easy and well-understood;
constructing a large pipeline of asynchronous remote calls, where remote objects hosted on different platforms are accessed by hundreds of millions of clients billions of times a day, presents a significant challenge. The main idea of this chapter is to build a picture of
the complex world of software constructed from distributed objects using the mosaic
of the RPC frameworks.
Chapter 7
Cloud Virtualization
The meaning of the word “virtual” is having the essence or effect but not the appearance
or form of, according to the British Dictionary. With respect to computers, virtual <X>
means that <X> is not physically existing as such but made by software to appear to
do so. Virtualization of computer resources is easy to illustrate using the concept of virtual memory, which presents a much larger volatile storage than the available physical memory. Of course, if a computer has only 4GB of physical memory and the virtual memory is configured to be 10GB, it does not mean that 10GB of data can be loaded and manipulated in memory at once. Instead, the OS memory manager, together with the hardware memory management unit (MMU), relies on the assumption that not all 10GB of data will be needed at the same time; only a few smaller chunks of the data are needed at any moment. Therefore, the user has a view of a contiguous address space with a capacity of 10GB, whereas in reality pages of data are loaded and removed on demand, a technique known as paging. This example points to an important property of virtualization in general: the presence of some resource in a virtual world is an abstraction, which is realized by processes that execute instructions implementing the specified behavior of this resource.
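The on-demand loading and eviction just described can be sketched in a few lines. The page counts and the least-recently-used eviction policy below are simplifying assumptions; real systems use hardware page tables and OS-chosen replacement policies.

```python
from collections import OrderedDict

class PagedMemory:
    """A toy 10-page virtual address space backed by 4 physical frames.
    Pages are loaded on demand; the least recently used page is evicted."""
    def __init__(self, virtual_pages=10, physical_frames=4):
        self.virtual_pages = virtual_pages
        self.capacity = physical_frames
        self.frames = OrderedDict()   # page number -> resident contents
        self.page_faults = 0

    def access(self, page):
        if page in self.frames:
            self.frames.move_to_end(page)     # mark most recently used
            return self.frames[page]
        self.page_faults += 1                 # page fault: load on demand
        if len(self.frames) >= self.capacity:
            self.frames.popitem(last=False)   # evict least recently used
        self.frames[page] = f"data-{page}"    # pretend to read from disk
        return self.frames[page]

mem = PagedMemory()
for p in [0, 1, 2, 3, 0, 4, 1]:   # the last two accesses fault again
    mem.access(p)
print(mem.page_faults)   # 6
```

The user-visible address space stays contiguous and larger than the physical frames, while the monitor transparently swaps pages underneath, which is exactly the abstraction-versus-implementation split described above.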
In this chapter, we will discuss how the concept of resource virtualization enables
streamlined deployment of applications in the cloud. After examining the benefits of
the virtualization abstraction for enabling automatic (de)provisioning of resources to
applications, we will review the processes of simulation of computing environments
and emulation of the execution of applications compiled for one platform on different
platforms. Finally, we will study the organization of virtual machines and analyze
popular virtualization solutions used in various cloud computing environments.
7.1 Abstracting Resources
the remote object. A client can then invoke the exposed methods of the passed objects,
which will be executed by the remote object that simulates the computing environment.
Going even further, we can take one remote object that simulates a computer and pass it using the RPC as a parameter to a method of some other remote object that also simulates a computer, so that one simulated computer will run inside some other simulated computer. And we can continue this composition of remote objects ad infinitum.
Question 1: Please discuss how quickly a running application can use a newly
provisioned vCPU.
To illustrate the last point, consider a datacenter with 100 physical CPUs – we chose
this number simply for convenience. By monitoring the execution of the applications,
suppose we observe that many CPUs are underutilized for different reasons: applications block on I/O, sleep on synchronization, or simply run fewer threads than there are CPUs. Now, suppose we implement and instantiate 1,000 virtual CPUs (vCPUs) as the software objects described above, which take the instructions of the applications and execute them. Of course, implementing the instruction loop as shown in Figure 2.2 is not enough: a physical CPU is needed to execute both the vCPU loop and the instructions. However, placing a vCPU monitor between the actual CPUs and the vCPUs allows us to schedule instructions from vCPUs onto the actual CPUs in a way that satisfies some load objectives. For example, we can allow a customer to acquire 200 vCPUs for a single application even though the physical system has only 100 actual CPUs. If the vCPU monitor does a good job of scheduling unused CPU time and assigning it to the customer, the customer may never know that more vCPUs are allocated to her applications than are physically available. And in that lies the power of abstracting resources.
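The overcommit scenario above can be sketched with a toy vCPU monitor. The round-robin policy and the slice counts are illustrative assumptions; real monitors use far more sophisticated schedulers.

```python
def schedule(num_vcpus, num_cpus, slices):
    """Toy vCPU monitor: in every time slice each physical CPU runs the
    next runnable vCPU in round-robin order, so all vCPUs make progress
    even when vCPUs outnumber physical CPUs (overcommit)."""
    progress = [0] * num_vcpus   # slices of work each vCPU has received
    next_vcpu = 0
    for _ in range(slices):
        for _cpu in range(num_cpus):
            progress[next_vcpu] += 1
            next_vcpu = (next_vcpu + 1) % num_vcpus
    return progress

# 200 vCPUs multiplexed onto 100 physical CPUs for 10 time slices:
p = schedule(200, 100, 10)
print(min(p), max(p))   # 5 5 -- every vCPU received the same share
```

Each vCPU receives half a physical CPU's worth of time, yet from inside a VM every vCPU appears to run continuously, which is why the customer need not notice the 2:1 overcommit.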
operating systems, so the VMs can be implemented on top of computer hardware ma-
nipulating it directly. However, the specifications do not state how hardware resources
are accessed and manipulated, e.g., how they are shared across multiple programs that
use these resources. These VMs are often referred to as high-level language VMs,
which are a subset of the category of process VMs, which translate a set of OS and
user-level instructions written for one platform to another [142].
Alternatively, a system VM provides a complete computing environment with vir-
tual resources that emulate the underlying hardware and a hosted guest OS. In some
cases, a system VM can obtain information about the underlying hardware and create
virtual resources that mimic the computing environment it is installed on. Doing so
allows multiple VMs to run in the cloud environment efficiently and the underlying
VM monitor (VMM) can schedule VMs to run on different commodity computers even
migrating them to improve load balancing as the VMs continue to execute applications
hosted inside of them.
Question 4: Can a process VM host a system VM? What about the other way
around? Can a system VM host another system VM?
Different types of virtualization are used to create VMs. In QEMU, Bochs, and PearPC, emulation is used to translate the instruction set of the VM into that of the underlying host platform. Fully virtualized machines host operating systems that run applications as
if these OSes are directly installed on top of the underlying computer hardware. In
the extreme, a fully virtualized machine can be installed on top of the guest OS that is
already hosted inside some other fully virtualized machine – however, it is unlikely to
be useful from a practical point of view.
In two other types of virtualization, the operating system is adjusted to work with
virtual resources to deliver VMs [144]. In paravirtualization, the virtual hardware resources differ from their actual counterparts and the VM is designed for these modified virtual resources; the OS, too, is modified to run as a guest OS inside the VM and to address the changed virtual resources. This is why paravirtualization is also called OS-assisted virtualization. For example, suppose that a vCPU does not export some
instruction that the OS relies on. This instruction will be replaced in the OS by calls to
an exported method of the VMM called a hypercall, which will simulate the behavior
of the CPU that the OS expects. Doing so often results in higher performance of the
VM, since the overhead of emulation can be avoided.
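The replacement of a missing instruction with a hypercall can be sketched as follows. The instruction name, the hypercall table, and the class names are illustrative assumptions for the sketch, not any real hypervisor's interface.

```python
class VMM:
    """The hypervisor exports hypercalls that simulate CPU behavior the
    paravirtualized guest OS expects but the vCPU does not provide."""
    def __init__(self):
        self.hypercalls = {"read_tsc": self._read_tsc}
        self._ticks = 0
    def _read_tsc(self):
        self._ticks += 1000   # simulate advancing a time-stamp counter
        return self._ticks
    def hypercall(self, name, *args):
        return self.hypercalls[name](*args)

class ParavirtGuestOS:
    """Guest OS modified at the points where it used the missing
    instruction: each such point now issues a hypercall instead."""
    def __init__(self, vmm):
        self.vmm = vmm
    def get_time(self):
        # original kernel code: value = RDTSC  (not exported by the vCPU)
        return self.vmm.hypercall("read_tsc")

guest = ParavirtGuestOS(VMM())
print(guest.get_time(), guest.get_time())   # 1000 2000
```

The guest's call site changes (hence "OS-assisted"), but the call goes straight to a VMM routine instead of being trapped and emulated instruction by instruction, which is where the performance advantage comes from.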
An example of using paravirtualization is how Xen handled the problem with the
x86 architecture that requires that programs that need to control peripheral devices
write a control data sequence to a specially allocated I/O memory space. Each device is
mapped to a specific address range in this I/O memory space. The hardware controller
reads the data sequence from the I/O memory space and sends signals to the physical
device via the bus. Of course, the I/O memory space is treated by the OS differently
from other memory locations, e.g., it is not paged or cached. And this memory space
is linked directly to the hardware via a controller, making it more difficult to virtualize.
Question 5: Explain how you would virtualize peripheral resources in the x86
architecture with full virtualization and emulation.
The other interesting aspect of the x86 architecture is that it has four privilege levels, known as ring 0 to ring 3, where user applications run in ring 3 and the OS must run in ring 0, since only the latter allows instructions to directly access the memory and hardware. Since the VM runs in the user space and the guest OS runs within the VM, the guest OS runs in ring 3, where it cannot function as an OS for the user applications that run on top of it. Applying full virtualization requires the emulation of all OS instructions by the VMM to translate them into their semantic equivalents accepted by the ring-based x86 architecture. Full virtualization may reduce the performance of the OS by 20% or more, whereas paravirtualization, as intrusive as it is, results in better performance.
On a side note, hardware manufacturers took notice of a rapid expansion of virtu-
alization and offered enhancements to the underlying hardware architectures to enable
seamless access to virtual resources to improve performance and reduce the amount of
work that VM producers must accomplish to implement full virtualization. The idea
of hardware-assisted virtualization or accelerated virtualization rests on extensions to
baseline processor architectures, so that the VMM can use instructions from these extensions directly, thereby avoiding modifications to the guest OSes. Consider Intel's hardware virtualization technology (Intel VT), where, among other things, additional instructions are introduced that start with the letters "VM"1 . For example, with the instruction VMLAUNCH the VMM can run a specific VM, and with the instruction VMRESUME it can resume the execution of a VM. Using hardware-assisted virtualization yields the benefit of not having to change the OS kernel; however, it costs some additional complexity in the CPU design and runtime overhead, leading to a hybrid approach where paravirtualization is combined with accelerated virtualization.
In paenevirtualization, the internal VMs are created by the OS and they run within
the OS as containerized user processes where containers are isolated from one another
by the OS kernel. A prominent example of paenevirtualization is Linux-VServer where
the user-space environment is partitioned into disjoint address spaces called Virtual
Private Servers (VPS), so that each VPS is an isolated process that behaves as the
single OS kernel to the user-level processes that it hosts 2 . Each VPS creates its own
context to abstract away all OS entities (e.g., processes, resources) outside of its scope
and strictly control and monitor interactions between contexts and processes that run
within these contexts 3 . Different levels of isolation are applied to resources, where
files in shared directories are less isolated than network sockets, shared memory, and
other IPC mechanisms.
Paenevirtualization is often referred to as a variant of the chroot jail, a runtime environment for a process and its children with a changed root directory, thus isolating the
processes and resources that are visible to them. Consider FreeBSD, an open-source
1 https://www.intel.com/content/www/us/en/virtualization/
virtualization-technology/intel-virtualization-technology.html
2 http://linux-vserver.org
3 http://linux-vserver.org/Paper
popular OS that was released in 1993 [96]. As part of its distribution, FreeBSD con-
tains a VMM called bhyve that can host guest OSes that include Windows, BSD
distributions, and Linux flavors. Moreover, VirtualBox and QEMU emulator run on
top of FreeBSD. In addition, FreeBSD implements paenevirtualization in the form of lightweight jail virtualization, where a jail is a process space that contains a group of processes and has its own root administration with full isolation from the rest of the OS. Access control is used in jails to prevent access to the address spaces of other jails, namespaces are created for each jail to give the impression of a fully global namespace environment where name collisions with other jails are avoided, and chroot is used to constrain the jail to a subset of the filesystem. Processes that run within jails are confined to operations that access only the addresses and resources bound to their respective jails. For example, jailed processes cannot reboot
the system or change network configurations, and these and other restrictions may limit
the use of paenevirtualization depending on the needs of the hosted applications.
Despite all restrictions, paenevirtualization offers a number of benefits: low overhead, since neither emulation nor software-based resource virtualization is used; no need to install virtualization software that requires special virtualization images for simulated computing environments; and no need to install and configure guest OSes.
These benefits are rooted in the resource isolation principle that is also a drawback,
since resources are not virtualized, and it means that it is not possible to move VMs or
cluster them to utilize resources in the datacenter efficiently. In addition, the OS kernel
must be modified, and this is an intellectually intensive and error-prone exercise.
7.4 Hypervisors
A hypervisor is a VMM that controls the execution of VMs and supplies them with virtual resources [122]. As such, a hypervisor is a software process that runs between
the underlying platform and the actual VM that it hosts. Native (or type-1 or bare-
metal) hypervisors run directly on top of computer hardware, whereas hosted or type-2
hypervisors run on top of OSes. Hybrid hypervisors combine the elements of type-1
and type-2 hypervisors, e.g., the Kernel Virtual Machine (KVM) that is embedded in
the Linux kernel and it runs on top of accelerated virtualization hardware thus acting
both as an OS kernel and the type-1 hypervisor that runs KVM guests that run applica-
tions on top of embedded vCPUs and device drivers.
The theory of hypervisors was formulated in 1974 as the Popek and Goldberg virtualization requirements [122], which are based on the notion of the VMM map that is shown in Figure 7.1. The entire state space is partitioned into two sets: the states that are under the VMM control, S_VMM, on the left of the VMM map, and all other states, S_O, on the right. A program, P, is a sequence of instructions, each of which maps a state to some other state. Suppose that there is a program, P, that transforms some state, S_i, into the state, S_j, in the partition S_O, as shown in Figure 7.1 with solid arrows. A VMM map is a structure-preserving map, f: S_O → S_VMM, between the partitions S_O and S_VMM that establishes the following condition: if there is a set of instructions, P, in the state partition S_O that maps some state, S_i, to some other state, S_j, then there must be a corresponding set of instructions, P′, in the state partition S_VMM that maps the corresponding state, S′_i = f(S_i), to some other corresponding state, S′_j = f(S_j); the corresponding states are linked by dashed arrows.

Figure 7.1: VMM map.
The VMM map leads to the following fundamental Popek-Goldberg properties that determine the relations between states and instructions.

Equivalence property requires that a program executing inside a VM under the hypervisor exhibit behavior essentially identical to its execution directly on the underlying hardware, except possibly for timing and resource availability.

Performance property dictates that all instructions that do not require special treatment by the hypervisor must be executed by the hardware directly. This property stipulates that the performance of the VMs that the hypervisor hosts should approach, in the limit, the performance of the native execution environment without the hypervisor by allowing a range of instructions to bypass emulation.

Safety property prohibits any set of instructions that is not mediated by the hypervisor from controlling resources. This property establishes a level of protection between the VM and the hypervisor that guarantees that no rogue instruction can take control of the resources that the hypervisor manages.
Given that a hypervisor should satisfy these properties, what is required to create a hypervisor for a given hardware platform? This question is answered by the following theorem formulated by Popek and Goldberg: a hypervisor can be constructed if and only if the union of the control- and behavior-sensitive instructions is a subset of the set of all privileged instructions. The proof of the theorem is by construction, i.e., we will show how to create a hypervisor for a hardware platform.
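The theorem amounts to a set-inclusion check, which can be stated executably. The instruction names below are placeholders, except POPF and SGDT, which are classic examples of x86 instructions that are sensitive but do not trap in user mode.

```python
def virtualizable(control_sensitive, behavior_sensitive, privileged):
    """Popek-Goldberg condition: a trap-and-emulate hypervisor can be
    built iff every sensitive instruction is also privileged, i.e. it
    traps in user mode so the VMM gets a chance to emulate it."""
    return (control_sensitive | behavior_sensitive) <= privileged

# A hypothetical clean architecture: all sensitive instructions trap.
clean = virtualizable({"SET_MODE"}, {"LOAD_PHYS_ADDR"},
                      {"SET_MODE", "LOAD_PHYS_ADDR", "IO_OUT"})
# Classic (pre-VT) x86: POPF and SGDT are sensitive but do not trap.
x86_classic = virtualizable({"POPF"}, {"SGDT"}, {"IO_OUT"})
print(clean, x86_classic)   # True False
```

The second call captures why classic x86 was not virtualizable by pure trap-and-emulate, which is exactly the gap that full-emulation, paravirtualization, and later the hardware VT extensions were designed to close.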
In the user mode, the CPU executes non-privileged instructions in programs, e.g.,
arithmetic instructions. Once a privileged instruction is detected, e.g., an I/O operation
or resetting the CPU or memory address mapping, the CPU switches to the kernel
mode, where the CPU saves the information necessary to continue the execution of the
user process and it passes the control to the OS, a process that must run in the kernel
mode as dictated by the underlying hardware architecture. Assuming that memory
accesses do not result in any traps, executing privileged instructions in the user mode
results in non-memory traps whereas executing them in the kernel mode will not result
in the same traps. Nonprivileged instructions can execute in both modes without traps.
A simplified model of the OS for the VMM is based on only two concepts: the supervisor/user mode and the notion of virtual memory, the schema of which is shown in Figure 7.2.

Figure 7.2: Virtual memory schematic organization.
Executing a privileged instruction results in a trap if the program is in the user mode
and no trap is triggered if the program is in the kernel mode. When it comes to the
virtual memory, the OS manipulates the base and bound registers to store the physical
address of the memory segment assigned to a program and the size of this memory
segment, respectively. Traps are triggered when a memory location accessed by the program violates the bound. We are interested in instructions that can manipulate the
state of the CPU or the registers for the virtual memory in the user mode without traps.
Here comes a dilemma – an OS is designed to run in the kernel mode, but it runs
in the VM in the user mode, and when it attempts to execute privileged instructions,
the execution results in a trap. It means that to handle this case, the hypervisor must
receive the privileged instructions from the OS and to process them. This is the essence
of the trap-and-emulate architecture where the hypervisor runs the VMs directly and
the VMM handles traps and emulates privileged instructions.
The next issue is the use of the virtual memory, where the OS maps the virtual address
space to the CPU’s physical address space using the MMU. A hardware mechanism
to implement virtual memory is by using specialized hardware called relocation and
bound registers. In a crudely simplified description, the CPU generates some address, A, that is compared with the address in the bound register, B. If A ≥ B, then the system traps; otherwise, the address in the relocation register, R, which keeps the location of the program segment, is used to compute the physical address, A + R. Both the bound and relocation registers are read and written only by the OS using special instructions.
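The relocation-and-bound translation just described is small enough to state directly; the concrete register values are illustrative assumptions.

```python
def translate(addr, bound, relocation):
    """Relocation-and-bound address translation: trap if the generated
    address A reaches the bound register B, otherwise add the relocation
    register R to form the physical address A + R."""
    if addr >= bound:
        raise MemoryError("bound violation trap")   # hardware trap
    return addr + relocation

# A 4096-byte program segment placed at physical address 100000:
print(translate(10, 4096, 100000))   # 100010
try:
    translate(5000, 4096, 100000)    # out of bounds
except MemoryError as trap:
    print("trap:", trap)
```

A hypervisor virtualizes exactly this pair of registers: the guest OS manipulates virtual copies, while the hypervisor composes them with the real registers it controls.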
We call an instruction control-sensitive if it can read or write the control state of the architecture, specifically the kernel/user mode bit and the virtual memory registers. Thus, executing a control-sensitive instruction gives information about the environment in which the program is running or changes that environment, affecting the execution of all programs in it. An example of a control-sensitive instruction is one that returns the processor to the user mode without triggering a trap. Behavior-sensitive instructions are those whose semantics vary depending on the settings of the control registers without accessing them directly (e.g., returning the physical memory address for some virtual address, which depends on the value of the relocation register). Examples of behavior-sensitive instructions include loading a physical memory address or performing actions that depend on the current mode. Regular, or innocuous, instructions are neither control nor behavior sensitive.
Now we have enough information to show how to construct a hypervisor. If a control-sensitive instruction can read the kernel/user mode bit and this instruction is not privileged, then running this instruction in the guest OS within the VM will produce a different answer compared to running it outside the VMM, thus violating the equivalence property. For a behavior-sensitive non-privileged instruction that returns the physical address for a given virtual address, the relocation register seen by the guest OS differs from the one in the hardware, also leading to a violation of the equivalence property.
A hypervisor is a triple, H = <D, A, Ξ>, where D is its top control module, called a dispatcher, that decides which other modules to load. D should be placed in the location where the hardware traps, meaning that whenever a privileged instruction traps, D will respond. After D determines what resources are requested, it passes control to the module A that allocates resources for the VM that requested them. Finally, Ξ is the set of interpreter routines that are invoked by A in response to privileged instructions. This triple defines a hypervisor; let us consider the cases.
Suppose that the hypervisor is in kernel mode, and all VMs it hosts are in the user
mode. The hypervisor handles all traps and all transitions from the kernel to the user
mode using its module D . When an application makes an OS system call, the call is a
trap that is intercepted by the hypervisor that decodes this call and transfers the control
to the OS running in the VM under the application that made the system call.
The hypervisor virtualizes the physical memory. Specifically, the VMs occupy
the physical memory that is virtualized by the OS as a virtual memory for the appli-
cations that run inside the VM, whereas the hypervisor controls the actual physical
hardware memory. The memory in the VM is called the guest physical memory and
the hardware memory is called the host physical memory. VM’s relocation and bound
registers are virtualized and controlled in the VM by the OS, whereas their physical
counterparts are controlled by the hypervisor, whose module A maps the registers in
the implementation using offsets and values. In summary, the hypervisor H handles all privileged instructions, and if the control- and behavior-sensitive instructions are a subset of the privileged instructions, then H is a hypervisor program that conforms to the Popek-Goldberg properties.
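The triple <D, A, Ξ> can be sketched as follows. The instruction names, the bookkeeping structure, and the VM identifier are illustrative assumptions, not the construction from the original paper.

```python
class Hypervisor:
    """The triple <D, A, Xi>: the dispatcher D is installed at the trap
    location, the allocator A hands out resources, and Xi is the set of
    interpreter routines that emulate privileged instructions."""
    def __init__(self):
        self.allocated = {}                        # A's bookkeeping
        self.interpreters = {                      # Xi
            "SET_BOUND": lambda vm, v: self.allocate(vm, "bound", v),
            "SET_RELOC": lambda vm, v: self.allocate(vm, "reloc", v),
        }
    def allocate(self, vm, resource, value):       # module A
        self.allocated.setdefault(vm, {})[resource] = value
        return value
    def dispatch(self, vm, instruction, operand):  # module D: trap entry
        return self.interpreters[instruction](vm, operand)

h = Hypervisor()
# A guest OS in vm1 executes privileged instructions; each one traps
# into D, which routes it through A to the interpreter routine in Xi.
h.dispatch("vm1", "SET_BOUND", 4096)
h.dispatch("vm1", "SET_RELOC", 100000)
print(h.allocated["vm1"])   # {'bound': 4096, 'reloc': 100000}
```

Note how the guest never touches the physical registers: it only reaches the per-VM copies that A maintains, which the hypervisor later composes with the hardware state.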
the context of the instruction call. In the context of VMs, when an application that
resides within a VM makes a call to access some privileged guest OS instructions, the
hypervisor intercepts this call and performs other calls on the behalf of the OS kernels.
These interceptors located in the hypervisor are hypercalls. In general, one can think
of an interceptor as a wrapper around some code fragment (e.g., a library, a method,
or an instruction) that presents the interfaces, methods, and instructions with the same
signatures as the wrapped code fragment. Within the interceptor/wrapper, additional
functionality may monitor the interactions between the client and the wrapped code
fragment or it can completely replace the wrapped instructions with some new code.
Interceptors can be added at runtime, without any disruption to clients' interactions with the wrapped server code, using various binary rewriting techniques.
Interrupts are signals that are produced by hardware or software components to
inform the CPU that events happened with some level of importance. The CPU may
discard the signal or respond to it by interrupting the execution of the current process
and executing a function called interrupt handler in response to this signal. When a
physical device (e.g., a keyboard) raises an interrupt, the signal is delivered to the CPU
using some architecture-specific bus and the CPU saves the existing execution con-
text and switches to executing the corresponding interrupt handler. Software interrupts
(e.g., division by zero) may be handled by the OS or the interrupt can be propagated
to the CPU. Some interrupts must be processed extremely fast with low overhead (e.g., clock interrupts) and some software and hardware components generate many interrupts (e.g., a network card).
Consider the virtualized platform where a guest OS runs in a VM that is executed
by vCPUs. If the guest OS or a virtualized hardware component raise an interrupt, it
should be delivered to the vCPU, however, a vCPU is also a software component that is
scheduled by the hypervisor to execute on the physical CPU. Since the hypervisor multiplexes virtual components onto the physical CPU, when a virtual device raises an interrupt the hypervisor may have switched to a different virtual component on the CPU; because no relevant vCPU is running, interrupt processing is deferred until that vCPU executes again. This deferral may have a serious negative effect on the responsiveness of the virtualized interrupt handling mechanism.
Consider how the TCP/IP networking stack processes network card interrupts that require the execution of a large number of CPU instructions. For data-intensive ap-
plications that process network I/O from external data sources (e.g., sensors, mobile
phones), it is important that the processing time is minimal. However, virtualized net-
working devices and vCPUs introduce delays due to waiting for the hypervisor to as-
sign these virtualized devices to the physical CPU. As a result, guest VMs that handle
a large amount of network traffic show significantly worsened performance when com-
pared to the native hardware devices. One solution is to attach network devices directly
to the guest VMs to improve the performance, however, doing so defeats the purpose
of sharing physical devices among multiple applications.
In this section, we discuss how the challenges of interrupt handling are addressed in the architecture of Hyper-V, a hypervisor created by Microsoft that runs virtual machines on x86 CPUs running Windows. Its architecture is shown in Figure 7.3. The dashed line shows the separation between the kernel and the user spaces; Hyper-V is a type-1 hypervisor that runs on top of bare hardware.
A key notion in Hyper-V is a partition, defined as an address space with isolation enforced by the hypervisor to allow independent execution of guest OSes. The hypervisor creates partitions and assigns a guest OS to run in each partition except for the root partition, which controls all work partitions. Work partitions do not allow the guest OSes to access physical hardware and process interrupts; instead, these work partitions present virtual devices to processes that run on the guest OSes.

Figure 7.3: Hyper-V architecture.
Inside the root partition, Virtualization Service Providers (VSPs) receive and pro-
cess requests from virtual devices. During processing, the VSPs translate requests into
instructions that are performed on physical devices via device drivers that are running
within the root partition. The VSPs communicate with the VM worker processes that
act as proxies for the virtual hardware that is running within the guest OS. Interrupts
and virtual device messages are transfered using the VMBus between work partitions
and the root partition. VMBus can be viewed as a special shared memory between the
root and work partitions to provide a high bandwidth and low latency transfer path for
the guest VMs to send requests and receive interrupts and data items. Specific VM-
Bus messaging protocols enable fast data transfer between the root and work partition
memories, so that data in work partitions can be referenced directly from the root parti-
tion. Using these techniques improves the response time between physical devices and
virtual machines.
for some periods of time called time slices. It works the same way as scheduling threads
to execute user programs, where the code of the vCPU is executed during a time slice
in a thread, then the thread is preempted, its context is saved, the context of the other
thread is restored and the execution continues.
Consider what happens when a vCPU executes a thread that obtained a spinlock. When its time slice expires, the thread that executes the vCPU, which in turn executes the thread holding the spinlock, is preempted. The spinlock is not released, since the hypervisor cannot force vCPUs to release spinlocks arbitrarily; otherwise, it may cause all kinds of concurrency problems. Since the spinlock is held by the preempted thread, other running threads waste their cycles waiting on the spinlock until the context switch happens again and the code that the vCPU runs finally releases the spinlock. This is the essence of the LHPP.
The performance impact of the LHPP varies; it was shown that in some cases the
vCPUs spend over 99% of their time spinning, thereby wasting physical CPU cycles [54].
One solution to the LHPP is to replace spinlocks with blocking locks, as was done in
OSv, which we review in Section 8.2. Another solution uses I-Spinlock, where a thread
is allowed to acquire a lock if and only if the remaining time slice of its vCPU is
sufficient to enter and leave the critical section [148]. There are also many ways to
eliminate spinlocks by using lock-free algorithms that depend on hardware atomic
instructions, or to modify the hypervisor to exchange information with the guest VM so
that the vCPU holding a spinlock is allowed to finish its computation and release the
lock before its thread is preempted, avoiding wasted spinlock cycles. Creating algorithms
and techniques that solve the LHPP is still an area of ongoing research.
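To make the problem concrete, a minimal user-level spinlock can be sketched in Java as follows. This is an illustration only, not kernel code; kernel spinlocks rely on the same hardware atomic instructions but run in privileged mode. If the thread holding the lock (or the vCPU running it) is preempted between lock() and unlock(), every waiter burns its entire time slice in the while loop:

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Minimal spinlock: lock() busy-waits on an atomic flag. If the vCPU running
// the lock holder is preempted before unlock(), all waiters spin uselessly,
// wasting physical CPU cycles -- the essence of the LHPP.
class SpinLock {
    private final AtomicBoolean held = new AtomicBoolean(false);

    void lock() {
        // hardware atomic compare-and-set; spins until the flag flips
        while (!held.compareAndSet(false, true)) {
            Thread.onSpinWait(); // hint to the CPU that we are busy-waiting
        }
    }

    void unlock() {
        held.set(false);
    }
}
```

A blocking lock such as java.util.concurrent.locks.ReentrantLock avoids the wasted cycles by descheduling waiters instead of spinning them, which is the same trade-off as the blocking-lock solution mentioned above.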
L7 is where programs exchange data using some protocol (e.g., emailing or transferring files using HTTP, FTP, SSH);
L6 handles data translation from some program format into a network format (e.g.,
encrypting messages or representing images as text);
Throughout this book we will refer to these layers to discuss various abstractions
and solutions. For example, when we discussed the RPC client/server model, we re-
ferred to L7 as the application layer where a client program sent requests to the server
programs without discussing the details of networking. When we expanded the dis-
cussion to stubs, we referred to as L6 to discuss networking formats such as NDR and
XDR for remote call encoding. When discussing pipelining in various RPC implemen-
tations, we moved to L5 to discuss session control among collections of services on the
network. L5-L7 are often referred to as application layers, L4 as the TCP/IP protocol,
L3 as a message routing layer that also involves TCP/IP protocol, L2 as switching, and
L1 as physical signal transfer. L1-L2 are also referred to as Ethernet, a term that en-
capsulates a range of networking technologies that include both hardware and message
protocols defined in IEEE standard 802.3.
In many ways, these layers represent different abstractions of various services by
concentrating only on their essential elements. However, one common thing that these
layers have is that the control and the data are bundled together, where bits in messages
can represent both application data (e.g., parameters of the remote methods) and control
actions (e.g., the IP address of a VM to route a message to). Separating data and control
planes is a key idea in the virtual or software defined network (SDN).
Question 8: Discuss pros and cons of the “intelligent routing” where L3 col-
lects information about all possible routes and selects an optimal one to send a
message to the destination node.
Before we discuss SDNs, let us recall the elements of network organization. The
Internet backbone is the collection of many large WANs that are owned and controlled
by different commercial companies, educational institutions, and various organizations,
which are connected by high-speed cables, frequently, fiber optic trunk lines, where
each trunk line comprises many optic cables for redundancy. A failure of many WANs
in the Internet backbone will likely not lead to the failure of the Internet, since other
remaining WANs will reroute messages that would otherwise be lost. Other networks
include smaller WANs and LANs: residential, enterprise, cellular, small business net-
works. These networks are connected to the Internet backbone using routers, hardware
devices that control network traffic by receiving and forwarding messages. Bridges are
hardware devices that connect network segments at levels L1-L2.
112 CHAPTER 7. CLOUD VIRTUALIZATION
Whereas routers allow connected networks to remain independent, using internally
different protocols while sending messages to one another, bridges aggregate two or
more networks that use the same protocol, relying on the hardware addresses of the
devices on the network. Hubs forward the network traffic indiscriminately to all
connected nodes; to manage the flow of data across a network, switches are used to
forward a received message only to the network nodes to which the message is directed.
A graph where nodes represent networking devices and WANs/LANs and edges represent
the physical wires connecting these nodes is called a network topology.
An SDN is a collection of virtual networking resources that allow stakeholders to
create and configure network topologies without changing existing hardware network-
ing resources by separating the control and the data planes. The SDN controller reads
and monitors the network state and events using programmatic if-then rules, where an-
tecedents define predicates (e.g., FTPport == 21) and consequents define actions
when antecedent predicates evaluate to true (e.g., discard(HTTP POST message)).
A typical firewall rule can discard messages that originate from IP addresses outside
the LAN. Subsets of the network traffic can be regulated by separate custom-written
controllers.
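The if-then structure of such controller rules can be sketched as follows. The types below are invented for illustration; real SDN controllers (e.g., OpenFlow-based ones) install match-action entries in switch flow tables rather than evaluating Java predicates:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;
import java.util.function.Predicate;

// A message carrying only the header fields the rules below inspect.
class Msg {
    final String srcIp; final int dstPort;
    Msg(String srcIp, int dstPort) { this.srcIp = srcIp; this.dstPort = dstPort; }
}

// An if-then rule: when the antecedent predicate holds, run the consequent action.
class Rule {
    final Predicate<Msg> antecedent; final Consumer<Msg> consequent;
    Rule(Predicate<Msg> a, Consumer<Msg> c) { antecedent = a; consequent = c; }
}

class Controller {
    private final List<Rule> rules = new ArrayList<>();
    final List<Msg> discarded = new ArrayList<>();

    void addRule(Rule r) { rules.add(r); }

    // Evaluate every installed rule against an incoming message.
    void onMessage(Msg m) {
        for (Rule r : rules)
            if (r.antecedent.test(m)) r.consequent.accept(m);
    }
}
```

For example, a firewall-like rule that discards FTP control traffic (port 21) originating outside a 192.168.0.0/16 LAN could be installed as `c.addRule(new Rule(m -> m.dstPort == 21 && !m.srcIp.startsWith("192.168."), c.discarded::add));`.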
General design goals for SDNs include flexibility of changing network topologies,
manageability for defining separate network policies and mechanisms for enforcing
them, scalability for maximizing the number of co-existing SDNs, security and isola-
tion of the networks and virtual resources, programmability, and heterogeneity of net-
works. An SDN resides on top of different routers, switches, and bridges that connect
various LANs and WANs, and a programmer can define virtual devices and connec-
tions among them to redefine the underlying network into a new topology with rules
for processing messages based on the content.
In a multitenant cloud datacenter, each computer hosts multiple VMs, and uneven
changing workloads require different network topologies. A network hypervisor for
an SDN is used to create abstractions for VM tenants: with the control abstraction,
tenants can define a set of logical data plane elements that they can control; with the
packet abstraction, data sent by endpoints should receive the same service as in a native
network. To implement these abstractions, the SDN hypervisor sets up tunnels between
host hypervisors. The physical network sees only IP messages, and the SDN controller
configures the hosts' virtual paths. Logical data paths are implemented on the sending
hosts, where tunnel endpoints are virtual switches, and the controller modifies the flow
table entries and sets up channels.
The data plane contains streaming algorithms that act on messages. Routers receive
data messages and check their headers to determine the destination. Then the router
looks up the output interface in the forwarding table, modifies the message header if
needed (e.g., TTL, IP checksum), and passes the message to the appropriate output
interface.
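This lookup can be sketched with longest-prefix matching; string prefixes stand in for binary address masks to keep the sketch short:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of a router's forwarding table: the destination address is matched
// against the stored prefixes, and the longest matching prefix wins.
class ForwardingTable {
    private final Map<String, String> routes = new HashMap<>();

    void addRoute(String prefix, String outputInterface) {
        routes.put(prefix, outputInterface);
    }

    String lookup(String destination) {
        String bestIface = null;
        int bestLen = -1;
        for (Map.Entry<String, String> e : routes.entrySet()) {
            if (destination.startsWith(e.getKey()) && e.getKey().length() > bestLen) {
                bestLen = e.getKey().length();
                bestIface = e.getValue();
            }
        }
        return bestIface; // null: no route, so the message is dropped
    }
}
```

With routes for "10." on eth0 and "10.1." on eth1, a message to 10.1.2.3 leaves through eth1 because the longer prefix wins; real routers implement the same idea over bit masks with specialized data structures such as tries.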
7.8. PROGRAMMING VMS AS DISTRIBUTED OBJECTS 113
Figure 7.4: Java pseudocode for moving a hard disk from one VM to another VM in VirtualBox.
Consider the Java pseudocode example for moving a hard disk from one VM to
another VM in VirtualBox as it is shown in Figure 7.4. The pseudocode does not in-
clude any exception handling and error management, and its goal is to illustrate what
can be accomplished with the basic VirtualBox SDK6. The framework package for
VirtualBox version 5.1 is imported in line 1. In line 2, an object of the class
VirtualBoxManager is created; it is an abstraction for the hypervisor that hides
4 http://download.virtualbox.org/virtualbox/SDKRef.pdf
5 https://www.virtualbox.org/svn/vbox/trunk/
6 https://www.virtualbox.org/sdkref
low-level virtualization details. In line 3, the client connects to a specific computer that
hosts the hypervisor and in line 4 it obtains the object of the interface IVirtualBox.
In line 5 and line 8, references are obtained for the VM from which the hard drive is
taken and the VM to which this drive is attached. In lines 6-7, a session object is
created to obtain a lock on the VM from which the drive is taken, and the same is
accomplished in lines 9-10 for the VM to which this drive is attached. Next, an object
of the interface IMedium is obtained in line 11; it serves as a reference for the hard
drive and is used to move the drive between the VMs in lines 12-16. The VMs are unlocked in
lines 17-18. Overall, this procedure is referred to as hot swap, since the VMs can ex-
ecute their guest OSes that run user programs. The reader can use her imagination to
construct various computing environments automatically according to some declarative
specifications that can serve as inputs to clients of VMs.
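Since the numbered lines of Figure 7.4 are referenced above, the described steps can be rendered as Java-style pseudocode along the following lines. The identifiers only approximate the VirtualBox 5.1 SDK; exact class names, method signatures, and device slot numbers may differ, and all error handling is omitted as in the original figure:

```java
 1 import org.virtualbox_5_1.*;
 2 VirtualBoxManager mgr = VirtualBoxManager.createInstance(null);
 3 mgr.connect("http://hypervisor-host:18083", user, password);
 4 IVirtualBox vbox = mgr.getVBox();
 5 IMachine src = vbox.findMachine("SourceVM");
 6 ISession srcSession = mgr.getSessionObject();
 7 src.lockMachine(srcSession, LockType.Write);
 8 IMachine dst = vbox.findMachine("TargetVM");
 9 ISession dstSession = mgr.getSessionObject();
10 dst.lockMachine(dstSession, LockType.Write);
11 IMedium disk = src.getMedium("SATA", 1, 0);
12 srcSession.getMachine().detachDevice("SATA", 1, 0);
13 srcSession.getMachine().saveSettings();
14 dstSession.getMachine().attachDevice("SATA", 1, 0,
15     DeviceType.HardDisk, disk);
16 dstSession.getMachine().saveSettings();
17 srcSession.unlockMachine();
18 dstSession.unlockMachine();
```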
7.9 Summary
In this chapter, we described the theory of virtualization in the cloud. We explained
the basic abstractions that enable us to represent hardware resources as abstract entities
with interfaces that can be implemented in software. After reviewing the concepts
of simulation and emulation of computer environments, we discussed process and
system virtual machines (VMs) and explained how they are hosted by the VM monitor
that exports virtualized hardware resources. Next, we presented the Popek-Goldberg
theory of hypervisors, explained the fundamental properties of virtualization, and gave
the Popek-Goldberg theorem with its proof that directs how to build a hypervisor. Then,
we extended the concept of virtualization to computer networks and briefly discussed
virtual or software-defined networks. We concluded this chapter by showing that VMs
can be manipulated programmatically by clients as distributed objects, where clients
can manipulate virtual resources, start and shut down VMs and the programs that run in
them, and combine them in a pipelined execution.
Chapter 8
Appliances
When we mention configuring a VAP, one component that we have treated as im-
mutable up to this point is the guest OS. However, the hypervisor assumes a few func-
tions of an OS such as context switching between running processes, (de)provisioning
memory, and handling filesystems. One can wonder why a fully loaded OS is required
for a VAP that runs a distributed object, which may need only a fraction of OS services.
In fact, why should a VM emulate the entire computing environment? In many cases,
the sound cards, PCI busses, floppy drives, keyboard, bluetooth and other hardware
are not needed to deploy distributed objects. Moreover, the more virtual hardware is
enabled in the VM, the higher the risk of some security attack that may exploit this
hardware. For example, in the VENOM attack, a flaw in the virtual floppy disk controller
of the QEMU emulator enabled attackers to use out-of-bounds writes to crash the guest,
mount a denial-of-service attack, and even execute arbitrary code3. Thus, by optimizing
the OS to include only those components that are needed to run a software application,
it is possible to reduce its size, make it faster by removing redundant instructions
that consume CPU time, make it more efficient, since it will consume fewer resources,
and make it more secure. Combining this Just Enough OS (JeOS) – pronounced as the word
“juice” – as a guest OS in a VM with a software application for which this JeOS is con-
figured is called creating a software appliance, a concept that supersedes the notion of
VAP. Software appliances can be connected in a virtual cluster, where each VAP
communicates with other applications hosted in VMs over the WAN via virtual routers
and a virtual Network Interface Controller (vNIC), a device that connects a computing
node to the network by receiving and transmitting messages from and to routers at OSI
layers L1-L4. Virtual clusters, in turn, can be connected in virtual datacenters, which
we will discuss later in the book.
ns-vpx-overview-wrapper-con.html
2 https://marketplace.vmware.com/vsx
3 http://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2015-3456
8.1. UNIKERNELS AND JUST ENOUGH OPERATING SYSTEM 117
Question 2: What OS services are required only for sending and receiving
messages as sequences of bytes?
Three important insights show that there is a way to slim down the guest OS image.
The first insight is that there may not be any need for multitasking in a guest OS. If a
VM is instantiated to run only one application, then the OS can be slimmed down by
removing the code that keeps track of processes and switches between them. Since a
single process is executed by the guest OS, there is no need for protected disjoint
address spaces or for keeping track of offsets of the running process to ensure that it
does not access an address space that belongs to other processes. These simplifications
remove unnecessary code from the guest OS, make its image smaller, and let the OS
run faster.
The second insight is that the guest OS does not control physical resources like hard
drives, network cards, and monitors, and therefore, many drivers and algorithms can
be removed from the guest OS. Optimizing space allocation on hard drives makes
little sense, especially since the hypervisor's underlying filesystem may be HDFS. And
it is highly unlikely that physical monitors will be allocated to VMs, even if they run
components of gaming applications. Moreover, many OSes include legacy drivers
and protocol support plugins, which makes sense when their users run outdated
hardware; in a datacenter, however, all hardware specifications can be homogenized,
thereby removing the need for legacy support in guest OSes.
The final, third insight is that the selection of process VMs and supporting libraries
installed on top of the guest OS depends on the application's requirements. Installing
the .NET framework as part of a Windows guest OS will simply bloat the VM if it runs
a Java application. Some applications may need specialized services that do not
necessarily run in the same VM. For example, an application may send its data via an
RPC to some other VM that hosts an interface to a relational database, which may itself
be a VAP. Thus, configuring the entire software stack in addition to the guest OS to
match the computing needs of the hosted application results in smaller and faster VMs
that are easy to control and manipulate in the datacenter.
4 https://www.vmware.com/pdf/vmotion_datasheet.pdf
5 https://www.microsoft.com/en-us/windows/windows-10-specifications
8.2 OSv
OSv is a Linux-based JeOS designed to run in a guest VM on the cloud infrastructure
and it supports different hypervisors with a minimal amount of architecture-specific
code [83]. OSv and its VAP images are publicly available6.
A key idea of OSv is to delegate many functions of the OS kernel to the type-1
hypervisor and run OSv as a single application in VMs that are hosted by this type-1
hypervisor. The address space is not partitioned in OSv – all threads including the OSv
kernel run in the same address space and they have a direct access to the system calls
that traditionally resulted in traps passing the control to the OS. With the direct access,
the overhead is removed of copying system call parameters from the user space to the
kernel space. OSv uses the open-source Z File System (ZFS) manager that can maintain
128-bit logical volumes storing up to one billion TB (i.e., a zettabyte) of data. File
integrity is ensured by using checksums. Interested readers can download and study
the source code of ZFS7 .
MMU, memory references can be read and updated without duplicating the bookkeeping
structures, leading to improved performance and a smaller memory footprint.
OSv introduced a new channel to allow user programs to obtain information from
the OS about memory usage and also (de)allocate it on demand. This channel is im-
plemented in shrinking and ballooning interfaces of OSv . Consider the allocation of
memory when running a Java program. Using the command-line option to allocate the
maximum size of the heap memory often results in either under- or over-utilization.
Alternatively, the shrinker interface allows the program to register its callback methods
with OSv, which OSv calls when certain thresholds are reached (e.g., the amount
of available memory is too low), and the program provides the functionality in the
callback to release unused memory. The reader can view memory as a fluid that flows
from one program to another, where the flow is regulated by OSv calling the registered
callback methods.
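The shrinker mechanism can be sketched as follows. The interface below is hypothetical; it mimics only the idea of registering memory-release callbacks, not the actual OSv API names or signatures:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.LongUnaryOperator;

// Sketch of the shrinker idea: programs register callbacks, and when memory
// runs low the OS invokes them; each callback returns how many bytes it freed.
class Shrinker {
    private final List<LongUnaryOperator> callbacks = new ArrayList<>();

    // The callback receives the number of bytes the OS would like reclaimed.
    void register(LongUnaryOperator releaseUnusedMemory) {
        callbacks.add(releaseUnusedMemory);
    }

    // Called by the OS when available memory drops below a threshold.
    long onLowMemory(long bytesNeeded) {
        long reclaimed = 0;
        for (LongUnaryOperator cb : callbacks) {
            reclaimed += cb.applyAsLong(bytesNeeded - reclaimed);
            if (reclaimed >= bytesNeeded) break; // enough memory flowed back
        }
        return reclaimed;
    }
}
```

A Java runtime, for instance, could register a callback that triggers garbage collection and returns the amount of heap it handed back, so memory "flows" to whichever program currently needs it.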
Question 4: Describe how you would use the shrinking interface in your
map/reduce implementation.
Question 5: Explain what happens to the balloon when the GC kicks in.
To summarize, OSv does not limit the language in which a program must be written
to run on it. Creating a VM image with OSv requires between 10-20MB of overhead
space, and it takes less than ten seconds to build on a reasonably modern workstation.
The OSv website contains a number of VAPs with software applications
optimized for OSv . Prebuilt OSv images can run on Amazon Elastic Cloud (EC2) and
Google Computing Engine (GCE) as well as on local computers using Capstan, a tool
for building and running applications on OSv . Moreover, given a small footprint of
OSv and its impressive performance, it seems a natural fit for the idea of serverless
computing to run short stateless functions written in high-level languages like JavaScript
or Scala. Since OSv can boot up and shut down very quickly, it is suitable for hosting
serverless functions that take a fraction of a second to execute.
Providing Function-as-a-Service (FaaS) is important for controlling the cost of cloud
deployment, and OSv is a natural fit for FaaS. Finally, the programming interfaces for
administering deployed OSv images are available, so clients can make remote calls to
obtain the statuses of deployed OSv -based VMs and provide control actions.
8.3 Mirage OS
The approach of Mirage OS is to compile applications with supporting libraries into
unikernel appliances thus avoiding a considerable effort to (re)configure VAPs. Mirage
unikernel images are compiled to run on top of Xen, a popular open-source virtualization
platform that is commercially supported by Citrix Corporation. The Xen hypervisor is
type-1: it is installed on bare metal and is booted directly using the computer's basic
input/output system (BIOS), the non-volatile preinstalled firmware located on the system
board that contains hardware initialization routines invoked during power-on startup and
whose runtime services are used by OSes and programs. Interestingly, the Xen
management toolstack is written in OCaml, a strongly typed functional programming
language8. The account of building Xen and choosing a non-mainstream language is
given by the core engineers who built the original version, and it includes the following
objectives: performance can be delivered by code written in OCaml, integration with
various Unix programs is facilitated by OCaml's simple and efficient foreign-function
interface, the strong type safety of OCaml makes it easier to build a system with a very
long mean time to failure (MTTF), and the strong optimization mechanisms of the
OCaml compiler make it easy to produce compact code [136]. At the time of
writing this book, Xen is used by tens of thousands of companies and organizations all
over the world.
Mirage OS has a number of interesting features, some of which we discuss briefly
in this section9 . One is zero-copy device I/O that uses a feature of Xen where two VMs
may communicate with one another via the hypervisor that grants one VM the right to
access memory pages that belong to the other VM. The table that maps pages within
VMs to integer offsets is called a grant table and it is updated by the hypervisor. Since
the notion of user space does not exist in Mirage, applications can obtain direct access
to memory pages, thus avoiding copying data from the pages into the user space. For
example, an application writes an HTTP GET request into an I/O page that the network
stack passes to the driver; when a response arrives and is written into a page, the
waiting thread receives a notification to access the page and collect the data. This
model is also used for storage, and it enables applications to use custom caching
policies, which may make more sense when the application's data accesses have
patterns that can be exploited for effective caching, rather than delegating caching to
generic OS policies such as least recently used (LRU) pages. As we can see, the low levels of the
8 http://ocaml.org
9 https://mirage.io/wiki/overview-of-mirage
8.4. UBUNTU JEOS 121
OS and device drivers are not isolated from the application level as we would expect in
the OSI seven layer model.
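The grant table can be pictured with a deliberately simplified sketch, in which byte arrays stand in for memory pages and an in-process map stands in for the hypervisor-maintained table:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of a grant table: the grantor VM registers a page and obtains an
// integer grant reference; the grantee VM maps the reference and accesses the
// very same page -- no copy of the data is ever made.
class GrantTable {
    private final Map<Integer, byte[]> grants = new HashMap<>();
    private int nextRef = 0;

    int grant(byte[] page) {       // called by the VM that owns the page
        grants.put(nextRef, page);
        return nextRef++;
    }

    byte[] map(int ref) {          // called by the VM granted access
        return grants.get(ref);
    }

    void revoke(int ref) {         // the grantor withdraws access
        grants.remove(ref);
    }
}
```

The key property is that map() returns a reference to the original page rather than a copy, so writes by the grantee are immediately visible to the grantor, which is what makes the I/O zero-copy.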
Question 6: What are the key differences between OSv and Mirage OS?
A number of VAPs have been implemented and evaluated under Mirage. Consider
the OpenFlow Controller Appliance, an SDN VAP, where controllers manipulate flow
tables (i.e., datapaths) in Ethernet routers. The Mirage library contains modules for
OpenFlow parsers, controller, routers, and other networking equipment and a Mirage
application can use these libraries to extend the functionalities of these basic virtual-
ized hardware elements. Because the Mirage network stack allows its applications to
access low-level components, Mirage VAPs can offer significant extensions of the basic
hardware elements with powerful functionalities. Interested readers should check out
the Mirage website and follow instructions in the tutorial to create and deploy Mirage
unikernel VAPs10 .
VAP. With manual updates, the VAP owner places tar files containing approved updates
on a designated server, and a cron job accesses these files at predefined periods of
time. The last step is to test this setup, reset the VM for the first user login, and put the
VAP on the Internet for download.
Question 7: Create a VAP for a firewall and explain how you would deploy it.
An example of an OVF descriptor is shown in Figure 8.1; it is a modified version of an
example descriptor available in VMware documentation13.
The OVF descriptor is written in the XML format with the tag Envelope as the root
12 http://www.dmtf.org/standards/ovf
13 https://www.vmware.com/pdf/ovf_whitepaper_specification.pdf
element of the descriptor in line 1 with the reference to the OVF schema specified
in line 2. References to external files and resources are specified with the top tag
References and they are given between lines 3–5, where only one virtual disk file is
specified for brevity. Specific resources are designated with the tag Section between
lines 6–42, where descriptions of the virtual CPUs, the network, the VAP with the
assigned IP address, and the JeOS are given.
Of course, OVF files do not have to be written manually; virtualization platforms
provide tools to generate these files automatically. Consider the Java pseudocode
example that uses VMware API calls to export the OVF of some VM, shown in
Figure 8.2. VMware is one of the leading providers of virtualization platforms at the
time of writing, and it offers a powerful SDK to access and manipulate entities in
the datacenters that deploy the VMware platform. To write Java programs using the
VMware Java API calls, one must import the package com.vmware as it is done in
line 1. The next step is to obtain a reference to an object of the class
ServiceInstance in line 2, the singleton root object that represents the inventory
of the VMware vCenter Server, which provides a centralized platform for managing
VMware virtual environments unified under the name vSphere. The reader can access
VMware documentation and study the syntax and semantics of various vSphere API
calls in greater detail14.
The hierarchical organization has the root object of the class ServiceInstance
that contains zero or more Folders, each of which is a container for holding other
Folders or Datacenters, which are container objects for hosts, virtual machines,
networks, and datastores. Each Datacenter object contains zero or more Folder
objects, which in turn contain zero or more VirtualMachine,
DistributedVirtualSwitch, or ComputeResource objects; the latter contain
HostSystem and ResourcePool objects, which in turn contain zero or more
VirtualApp objects, which contain VirtualMachine objects. vSphere API
calls are designed to enable programmers to navigate the inventory hierarchy
programmatically. In line 3 we obtain a reference to the VM object using the method
obtainInstanceReference; this method does not exist, we invented it for
brevity to circumvent navigating the inventory hierarchy. In line 4 we obtain the object
that represents an OVF file, and in line 5 we create an object that designates the OVF
descriptor parameters. We connect these objects in line 6, and then in line 7 the VM is
accessed and its configuration is retrieved and put into the OVF descriptor object,
which is returned in line 9. Samples of code in different languages that use the VMware
SDK are publicly available15.
8.6 Summary
In this chapter, we presented the concepts of virtual and software appliances. We
showed how useful they are for forming units of deployment in the cloud, where
they can be combined into virtual clusters and datacenters. We reviewed Just Enough OSes
14 https://www.vmware.com/support/developer/vc-sdk/visdk400pubs/
ReferenceGuide/vim.ServiceInstance.html
15 https://code.vmware.com/samples
1 <ovf:Envelope xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
2 xmlns:ovf="http://schemas.dmtf.org/ovf/1/envelope" ovf:version="0.9">
3 <References>
4 <File ovf:id="file1" ovf:href="vmdisk1.vmdk" ovf:size="180114671"/>
5 </References>
6 <Section xsi:type="ovf:DiskSection_Type">
7 <Info>All virtual disks</Info>
8 <Disk ovf:diskId="vmdisk1" ovf:fileRef="file1" ovf:capacity="10000000"
9 ovf:format="http://www.vmware.com/specifications/vmdk.html#sparse"/>
10 </Section>
11 <Section xsi:type="ovf:NetworkSection_Type">
12 <Info>List of logical networks used in the package</Info>
13 <Network ovf:name="VM Network">
14 <Description>The network for the email service</Description>
15 </Network>
16 </Section>
17 <Content xsi:type="ovf:VirtualSystem_Type" ovf:id="Email Appliance">
18 <Info>The Email appliance</Info>
19 <Section xsi:type="ovf:ProductSection_Type">
20 <Info>This appliance serves as an email server</Info>
21 <Product>Email Appliance</Product>
22 <Vendor>Lone Star Consulting</Vendor>
23 <Version>1.0a</Version>
24 <ProductUrl>http://www.url.com</ProductUrl>
25 </Section>
26 <Property ovf:key="emailvap.ip" ovf:defaultValue="192.168.3.101">
27 <Description>The IP address of email appliance</Description>
28 </Property>
29 </Section>
30 <Section xsi:type="ovf:VirtualHardwareSection_Type">
31 <Info>1000Mb, 2 vCPUs, 1 disk, 1 nic</Info>
32 <Item>
33 <rasd:Caption>2 vCPUs</rasd:Caption>
34 <rasd:Description>Number of vCPUs</rasd:Description>
35 <rasd:ResourceType>3</rasd:ResourceType>
36 <rasd:VirtualQuantity>1</rasd:VirtualQuantity>
37 </Item>
38 </Section>
39 <Section xsi:type="ovf:OperatingSystemSection_Type">
40 <Info>Guest Operating System</Info>
41 <Description>Ubuntu Server JeOS</Description>
42 </Section>
43 </Content>
44 </ovf:Envelope>
Figure 8.1: An example of an OVF descriptor for an email appliance.
(JeOSes) and unikernels as mechanisms for creating slimmed down versions of OSes
for virtual appliances. After studying OSv , Mirage OS, and Ubuntu Server JeOS, we
1 import com.vmware.*;
2 ServiceInstance inst = new ServiceInstance(url,uname,pswd,true);
3 ManagedEntity vm = inst.obtainInstanceReference("VirtualMachineName");
4 OvfFile[] ovfData = new OvfFile[0];
5 OvfCreateDescriptorParams ovfDescParams = new OvfCreateDescriptorParams();
6 ovfDescParams.setOvfFiles(ovfData);
7 OvfCreateDescriptorResult ovfDesc = inst.getOvfManager().
8 createDescriptor(vm, ovfDescParams);
9 ovfDesc.getOvfDescriptor();
Figure 8.2: Java pseudocode for using VMware API calls to export the OVF of some VM.
learned about the Open Virtualization Format and how virtual appliances can be
distributed with OVF files. We concluded by showing how an OVF descriptor can be
created automatically from an existing VM using the VMware SDK.
Chapter 9
Web Services
Web services are software components that interact with other software components
using document-based messages that are exchanged via Internet-based protocols [49].
The word service in the distributed computing context was coined by Gartner analysts
W. Roy Schulte and Yefim V. Natis [108] to mean a discrete unit of functionality that
can be accessed remotely and deployed independently from the rest of a distributed
application [107]. The W3C Working Group defines a service1 as “an abstract resource
that represents a capability of performing tasks that form a coherent functionality from
the point of view of providers entities and requesters entities. To be used, a service must
be realized by a concrete provider agent.” To map these definitions to the RPC domain,
a web service can be viewed as an RPC server object that responds to clients' calls
made to the web service using the Internet, the global interconnected computer network
that uses general protocols to link computing devices. As such,
a cloud datacenter can be viewed as a service whose interfaces expose methods for
accessing and manipulating other services (e.g., VMs) using the Internet.
In the bigger picture, web services are components on the World Wide Web (WWW),
a global graph where nodes designate resources or documents and edges specify
hyperlinks. Nodes in the WWW are usually accessed using the HyperText Transfer
Protocol (HTTP), which is one of the Internet's protocols2. Each node can be a client or
a server or both in the client/server model. Thus, web services are distributed software
components deployed on the WWW whose interface methods can be accessed by clients
using HTTP. Method call parameters and return values are embedded in HTTP
messages. Essentially, engineering distributed applications with web services can be
viewed as routing text documents (i.e., messages) between nodes in the WWW, where
each node is a web service with endpoints described in the specifications of the web
service interfaces that accept and process HTTP messages. Thus, creating applications
from web services results in a composite web service that orchestrates message
exchanges between the composed web services.
1 https://www.w3.org/2002/ws/arch/
2 https://www.w3.org/Help/webinternet
9.1. SIMPLE OBJECT ACCESS PROTOCOL (SOAP) 127
Figure 9.1: An illustrative example of a SOAP request to call the method Add of the class WSrv
that takes values of two integer parameters x and y.
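Since the figure itself cannot be reproduced here, the request it describes might look roughly as follows. This is only a sketch: the exact paths, namespaces, and parameter values are invented for illustration, with the HTTP header in the first four lines and the SOAP envelope wrapping the call to Add with its parameters x and y:

```http
POST /WSrv HTTP/1.1
Host: www.cs.uic.edu
Content-Type: text/xml; charset=utf-8
SOAPAction: "http://www.cs.uic.edu/WSrv#Add"

<?xml version="1.0"?>
<soap:Envelope xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/">
  <soap:Body>
    <Add xmlns="http://www.cs.uic.edu/WSrv">
      <x>5</x>
      <y>3</y>
    </Add>
  </soap:Body>
</soap:Envelope>
```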
The output of some methods of a web service can be used as the input to some other
methods of the same web service. Doing so may significantly improve the performance
of applications, since the
number of roundtrip request/return messages between clients and the web service can
be reduced significantly. Unfortunately, the basic RPC model cannot be easily extended
to the document-based model because its core abstraction targets the complexity of a
single method call in a distributed environment. Therefore, web services and HTTP-based
document protocols, while sharing the same client/server model with the RPC,
represent a significant conceptual departure from the single distributed method call
abstraction.
A set of rules for conveying messages between nodes is called a binding and the
SOAP protocol binding framework defines bindings and how nodes can implement
them. Many SOAP binding implementations exist and XML provides a flexible way
to encode binding rules and constraints. For example, the HTTP binding defines how
SOAP calls are embedded into an HTTP message that contains the SOAP envelope,
whose SOAP body element describes the call made over the network.
Let us convert the example of XML-RPC request shown in Figure 6.4 into a sim-
plified SOAP request that is shown in Figure 9.1. We assume that the actual web
service is registered with some web server at a known Uniform Resource Locator
(URL). Recall that a URL is a subset of a URI that is an instance of the following spec-
ification: protocol:[//authority]path[?query][#fragment], where a
protocol is one of the allowed communication request/response schemes usually
registered with the Internet Assigned Numbers Authority (IANA) like http or ftp;
authority is an optional component of the specification defined as [username@
]hostname[:port] that may contain the user name, the name of the host or its
IP address and the port it is bound to (separated from the rest of the information in
authority with the colon); path is a sequence of names separated with the forward
slash; query is an optional element that defines a sequence of key-value pairs that
can be viewed as named parameters to a function call and their values; and finally the
element fragment that specifies a tagged resource within the resource defined by the
path. For more information, we refer the reader to the corresponding Requests for
Comments (RFCs) and other Internet standards.
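The URI components described above can be observed directly with Python's standard `urllib.parse` module; the URL below is a hypothetical one chosen to exercise every optional component of the specification.

```python
from urllib.parse import urlparse, parse_qs

# A hypothetical URL exercising each component of
# protocol:[//authority]path[?query][#fragment].
url = "http://alice@www.cs.uic.edu:8080/DOSOAP/object?x=3&y=4#Add"

parts = urlparse(url)
print(parts.scheme)           # protocol: http
print(parts.username)         # authority, user name: alice
print(parts.hostname)         # authority, host: www.cs.uic.edu
print(parts.port)             # authority, port: 8080
print(parts.path)             # path: /DOSOAP/object
print(parse_qs(parts.query))  # query as key/value pairs: {'x': ['3'], 'y': ['4']}
print(parts.fragment)         # fragment: Add
```

Note how the query decomposes into named parameters and values, exactly as the specification describes.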
In the SOAP request shown in Figure 9.1, the web server that hosts the web ser-
vice WSrv is located at http://www.cs.uic.edu. The web service exports the
method Add that takes two parameters named x and y. Lines 1-4 contain the HTTP
header that specifies that it is a POST request. According to RFC 2616⁴, the request
POST tells the web server to “accept the entity enclosed in the request as a new sub-
ordinate of the resource identified by the Request-URI in the Request-Line.” That is,
the action of the web server depends on the type of the request. As part of the HTTP
header, it is specified that the object of interest is /DOSOAP/object=WSrv and it is
located on the UIC server. The SOAP action is defined as WSrv#Add, where the hash
sign designates the fragment of the resource WSrv that is its method Add. Thus, we
can see that the resource abstraction where it is described by a URI is concretized in
the HTTP request with specific bindings to the SOAP message that is described using
the XML notation.
The SOAP message is embedded in the SOAP envelope in lines 5–13. The at-
tribute xmlns specifies the URL to the namespace data that define XML elements and
attributes for the SOAP message as well as the encoding format. The actual document
payload for the RPC call is contained in lines 7–12 where the SOAP body is given. In
line 8 the method Add is specified and in line 9 and line 10 the values of its parameters
x and y are specified respectively. Once the SOAP message is delivered to the web
server, its SOAP runtime identifies the service that processes the incoming requests.
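A SOAP envelope like the one in Figure 9.1 can be assembled as plain text and parsed back with standard XML tools. The sketch below is illustrative: the namespace URLs and element names are assumptions approximating the figure, not the exact document.

```python
import xml.etree.ElementTree as ET

SOAP_NS = "http://schemas.xmlsoap.org/soap/envelope/"

def make_add_request(x: int, y: int) -> str:
    """Build a SOAP envelope that calls method Add of WSrv with parameters x and y."""
    return (
        f'<soap:Envelope xmlns:soap="{SOAP_NS}">'
        "<soap:Body>"
        '<m:Add xmlns:m="http://www.cs.uic.edu/WSrv">'
        f"<x>{x}</x><y>{y}</y>"
        "</m:Add>"
        "</soap:Body>"
        "</soap:Envelope>"
    )

# The envelope is well-formed XML; a SOAP runtime parses it to recover
# the method name and the parameter values.
root = ET.fromstring(make_add_request(3, 4))
body = root.find(f"{{{SOAP_NS}}}Body")
call = body.find("{http://www.cs.uic.edu/WSrv}Add")
print(call.find("x").text, call.find("y").text)  # → 3 4
```

This mirrors what the server-side SOAP runtime does: locate the body element, read the method name from its child element, and extract the parameter values.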
Question 2: Describe how you would build an RPC protocol around SOAP.
To use SOAP, one does not have to create an application as was done in Fig-
ure 6.4; instead, one can use a popular command-line utility named cURL for accessing
resources and transferring data using a wide variety of protocols on the many platforms
to which cURL is ported5 . The template for this call is the following: “curl --header
"Content-Type:text/xml;charset=UTF-8" --header "SOAPAction:
SOMEACTION" --data @FILENAME WSENDPOINT” without the outer quotes,
where SOMEACTION is the name of the web service and its method to call, FILENAME
is the name of the local XML file that contains the body of the SOAP message, and
WSENDPOINT is substituted with the URL to the web server that hosts this web ser-
vice. The utility cURL constructs a SOAP document and submits it, waits for a reply
and outputs it to the console.
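To make the transport concrete, the sketch below assembles the raw HTTP message that such an invocation would produce; the endpoint path and SOAPAction value are the hypothetical ones used in this section, and no network call is made.

```python
# Assemble the raw HTTP request carrying a SOAP envelope; this only
# builds the text of the message, it does not send it anywhere.
def build_soap_http_request(host: str, path: str, soap_action: str, body: str) -> str:
    headers = [
        f"POST {path} HTTP/1.1",
        f"Host: {host}",
        "Content-Type: text/xml;charset=UTF-8",
        f"SOAPAction: {soap_action}",
        f"Content-Length: {len(body.encode('utf-8'))}",
        "",  # blank line separates the headers from the entity body
    ]
    return "\r\n".join(headers) + "\r\n" + body

req = build_soap_http_request("www.cs.uic.edu", "/DOSOAP/object=WSrv",
                              "WSrv#Add", "<soap:Envelope>...</soap:Envelope>")
print(req.splitlines()[0])  # → POST /DOSOAP/object=WSrv HTTP/1.1
```

The resulting text matches the structure of the HTTP header and SOAP payload discussed for Figure 9.1.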
Since SOAP is implemented over stateless transport protocols, it is inherently state-
less. This means that after a method of a web server is invoked and the results are returned
4 https://www.ietf.org/rfc/rfc2616.txt
5 https://curl.haxx.se
to a SOAP client, the web server object does not keep any information related to this
method invocation. Implementing a calculator application using a stateless protocol is
a good choice if the calculator does not have a memory function. However, implementing
stateful applications where the information of the previous request/response
operations is kept requires using some persistence mechanisms to save the interme-
diate state of the web server and the previous document exchanges with each of its
clients. Doing so imposes a significant overhead on the implementation and the run-
time of stateful distributed applications that consist of web services. We will revisit the
question of statefulness in the rest of the book.
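The calculator contrast above can be made concrete with a toy sketch: each call to the stateless calculator is self-contained, whereas a memory feature forces the server to keep (or externally persist) per-client state between requests. The class and method names are illustrative, not part of any web service framework.

```python
class StatelessCalculator:
    def add(self, x: int, y: int) -> int:
        # Everything needed to compute the result arrives in the request.
        return x + y

class StatefulCalculator:
    def __init__(self):
        self._memory: dict[str, int] = {}  # per-client state kept between calls

    def memory_add(self, client_id: str, x: int) -> int:
        # The result depends on what this client stored earlier, so the
        # server must persist state across request/response operations.
        self._memory[client_id] = self._memory.get(client_id, 0) + x
        return self._memory[client_id]

calc = StatelessCalculator()
print(calc.add(2, 3))  # → 5

stateful = StatefulCalculator()
print(stateful.memory_add("alice", 2))  # → 2
print(stateful.memory_add("alice", 3))  # → 5 (depends on the previous call)
```

The stateful variant shows why statefulness imposes overhead: the `_memory` dictionary must survive across invocations, which in a distributed deployment means a persistence mechanism.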
Recall the Interface Definition Language (IDL) that we discussed in Section 4.5.
One of its main functions is to ensure type correctness between the client and the server
procedures. Invoking an IDL compiler on an IDL specification results in generating
client and server stubs in a given programming language with the procedure declara-
tions in the generated header files that force client and server programmers to adhere
to the same signature of the procedure. If the client programmer mistakenly passes a
string variable instead of the integer as a parameter to the procedure call, the compiler
will detect this error statically and inform the programmer. However, this is not the case
with web services, since they are built and deployed autonomously and independently
from client programs.
Web Services Description Language (WSDL) is a W3C specification for an IDL for
web services, which are defined as collections of ports and messages as abstract defini-
tions of data that are sent to these ports 6 . Port types specify collections of operations
that can be performed on these data. Just like in IDL, there are no concrete definitions
of how the underlying network transfer of data is performed or how the data of different
types are represented during the wire transfer. Specific protocols that govern data
6 https://www.w3.org/TR/2001/NOTE-wsdl-20010315
9.2. WEB SERVICES DESCRIPTION LANGUAGE (WSDL) 131
transfers from and to ports can be defined separately and linked to WSDL documents
using its binding mechanism.
Consider an illustrative example of a WSDL document for a web service in Fig-
ure 9.2. On the left side of this figure, the web service for configuring a font on a printer
is represented as a Java class named Printer that has one public method SetFont.
Its parameter takes the value of the name of the new font and returns the name of the
old font. Assuming that this class is deployed as a web service, we want to show how
WSDL can describe its structure.
Figure 9.2: A simplified example of mapping between WSDL elements and the source code of
the Java class web service.
A WSDL document is a text file that contains a set of definitions in the XML format
using the names that are defined in the WSDL specification. The document has the root
element definitions whose attribute name is assigned a semantically meaningful
name of the web service, i.e., DocumentPrinting. It also defines the attribute
xmlns that points to the WSDL schema that defines the namespace. Child element
service contains the definition of the web service PrinterService and there
can be one or more web service definitions in a WSDL document. In our case we name
this service PrinterService and it is mapped to class Printer.
A web service is viewed as an endpoint that is assigned to a specific port that
receives messages and responds with messages using a binding to some specific pro-
tocol. In our case, the web service PrinterService has one port that we name
PrinterPort and it corresponds to the invocation of the method SetFont. There
are two messages that are associated with this port: the message Printer.SetFont
whose part is fontName of the type string and the message Printer.OldFont
whose part is named Result of the type string. The port is given the binding to
the SOAP, i.e., the messages are delivered and transmitted encapsulated in the SOAP
envelope using the HTTP. To interact with a web service, its clients perform opera-
tions, which are abstract descriptions of actions supported by the web service. In our
example, the operation is transmitting the SOAP message Printer.SetFont to the
port PrinterPort and receiving the reply message Printer.OldFont. As you
can see, these descriptions are abstract because they do not contain implementation
details of how the web service infrastructure translates these messages into low-level
operations on remote objects and their procedures.
Thus, WSDL allows stakeholders to obtain an abstract description of the interfaces
of web services. Each web service makes computations in response to input messages
and produces output messages, where the collection of input and output messages for a
service is called an operation. Port type is an abstract collection of operations supported
by one or more endpoints, e.g., a printing port type that collectively describes a range
of services offered by multiple endpoints. A port is a single endpoint defined as a
combination of a binding and a network address. Each port is associated with some
binding that defines a concrete protocol that specifies how operations are accessed. The
reader may visualize a WSDL document as a virtual dashboard on the Internet where
web services represent grouped sockets, each of which represents some port, clients
plug in virtual wires with connectors into these sockets and predefined messages are
transmitted and received via these wires. Whereas the source code of the physical web
services may be modified on demand, the WSDL interfaces remain immutable thereby
allowing clients to access web services on demand.
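The mapping of Figure 9.2 can be explored programmatically: a client can parse a WSDL document and enumerate its services, ports, and messages. The document below is a minimal WSDL-like sketch approximating the figure; the element names follow the WSDL vocabulary, but the content is an illustrative guess, not the book's exact document.

```python
import xml.etree.ElementTree as ET

WSDL = """
<definitions name="DocumentPrinting" xmlns="http://schemas.xmlsoap.org/wsdl/">
  <message name="Printer.SetFont">
    <part name="fontName" type="string"/>
  </message>
  <message name="Printer.OldFont">
    <part name="Result" type="string"/>
  </message>
  <portType name="PrinterPortType">
    <operation name="SetFont">
      <input message="Printer.SetFont"/>
      <output message="Printer.OldFont"/>
    </operation>
  </portType>
  <service name="PrinterService">
    <port name="PrinterPort" binding="PrinterSoapBinding"/>
  </service>
</definitions>
"""

ns = {"wsdl": "http://schemas.xmlsoap.org/wsdl/"}
root = ET.fromstring(WSDL)
# Enumerate the abstract definitions a client would read from the WSDL.
messages = [m.get("name") for m in root.findall("wsdl:message", ns)]
service = root.find("wsdl:service", ns).get("name")
print(service)   # → PrinterService
print(messages)  # → ['Printer.SetFont', 'Printer.OldFont']
```

This is exactly the information a client needs before calling SetFont: which service to contact, which port, and which input and output messages the operation exchanges.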
rates for products. Since many webstore applications have already been built, team
members have a strong feeling that creating this component from scratch would be
a waste of time, since it is highly likely that shipping companies like UPS already
created shipping rate web services for use by many companies and organizations all
over the world. Therefore, the question is how to find and integrate these web services.
Figure 9.3: The UDDI interaction model.

Universal Description, Discovery and Integration (UDDI) is a specification for a
web services directory project that was initially viewed as Yellow Pages for electronic
commerce applications8. The main idea is to create a global public registry where
everyone can register components; for example, a company can make its web services
accessible to other companies. A
few commercial companies and many organizations participated in the development of
UDDI specification, and implemented their versions of UDDI as web services whose
clients can use to locate other web services that implement specific requirements for
software projects.
The UDDI model is shown in Figure 9.3. A publisher creates an XML document
(e.g., WSDL) that describes the web services that are made available to users. This
document is stored onto one of Operating Sites that are linked to the central UDDI
Registry, a key/value store that is updated with information coming from all operat-
ing sites. When a client/user makes a call to locate a web service using some UDDI API
function, the information about the properties of the desired web service is passed as
a parameter to this function. Depending on the sophistication of the search algorithm,
WSDL documents describing web services that are matched to the query are returned
to the client who can make choices about what web service to use in her applications.
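The interaction of Figure 9.3 reduces to a key/value lookup: publishers register service descriptions under searchable properties, and clients query by the properties they need. The sketch below is a toy model of this flow; the registry structure, service names, and URLs are illustrative assumptions, not the UDDI API.

```python
# A toy UDDI-style registry: service name -> searchable properties + WSDL URL.
registry: dict[str, dict] = {}

def publish(service_name: str, properties: set[str], wsdl_url: str) -> None:
    """A publisher registers a web service description with the registry."""
    registry[service_name] = {"properties": properties, "wsdl": wsdl_url}

def locate(required: set[str]) -> list[str]:
    """A client asks for services whose properties cover the query."""
    return [entry["wsdl"] for entry in registry.values()
            if required <= entry["properties"]]

publish("UPSShippingRates", {"shipping", "rates"}, "http://example.com/ups.wsdl")
publish("FontService", {"printing"}, "http://example.com/printer.wsdl")
print(locate({"shipping"}))  # → ['http://example.com/ups.wsdl']
```

A real UDDI deployment layers a far richer data model and search algorithm over this idea, but the publish/locate round trip is the same.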
The UDDI specification defines four types that describe any web service: (1) the
business information type is the White Pages of UDDI that describes what business
offers web services, its organizational structure, and other information like unique
identifiers; (2) the service information type groups web services by the business
process they belong to and some technical characteristics; (3) the binding information
allows a user to locate and invoke methods of exposed interfaces of remote web
services, e.g., the URI of a web service and the platforms that it runs on; and finally,
(4) the service information type describes various aspects of a web service, for
example, compatibility with a specification of some other web service the former
depends on. Each <bindingTemplate/> element of the types contains a list of
references to related specifications that can be thought of as a technical fingerprint of
a web service or specification metadata. This information is contained in tModels to
help developers obtain a common point of reference that allows compatible web
services to be discovered and used. Of course, it requires web service vendors to
supply sufficient information about the specifications in a format that is known, can
be parsed using some standardized tools (e.g., XML parsers), and queried.
8 http://www.uddi.org/pubs/ProgrammersAPI-V2.04-Published-20020719.htm
Question 6: Explain the relation between UDDI and location and migration
transparencies.
The relationship between WSDL and UDDI documents is shown in Figure 9.4. The
information in a WSDL document about the web service that it describes is reflected
in the business service type of the corresponding UDDI; the information about ports is
placed in the binding template; and the rest of the information about messages and port
types is placed in the tModel sections. Thus, WSDL with UDDI represent a powerful
mechanism for sharing and reusing web services on the Internet scale.
9.4. REPRESENTATIONAL STATE TRANSFER (REST) 135
GET command retrieves information from a resource that is identified with some URI
that follows the specification given above:
protocol:[//authority]path[?query][#fragment]. For example,
the REST command https://owner-api.teslamotors.com/api/1/vehicles
shows all Tesla vehicles that the client owns, assuming that the authorization is
provided in the part authority of the URI. A conditional GET uses the data
put in the auxiliary header fields, regarding the date of the last modification or
matching some values or a range of values, to return the state of the resource
only when these data evaluate to true. For example, the Tesla GET request
can be made conditional by requesting information only if it was modified since
yesterday and for the model S only.
HEAD command results in the return value that contains only the header. For exam-
ple, the return value 200 OK signals that the web service is available and can
respond to requests. It may be used to check if the authorization is still valid, or
if there are changes to the representation of a resource, among other things.
POST command contains the information about some entity that is related to the ex-
isting resource. It may contain some metadata to annotate the existing resource;
it may contain the representation of a new resource (e.g., a new HTML file or a
message on a social media site); or it may contain the information about method
invocation of some web service and its parameters values. Abstractly, this com-
mand inserts a new resource or updates an existing resource. Executing this
command more than once on the same resource may result in a different state of
the resource. For example, posting the same message N times on a social media
website will result in the appearance of N same messages.
PUT command has the same semantics as the command POST except it is idempotent,
i.e., executing it on the same resource N times will result in the same state of the
resource as if this command was executed only once. In the example above,
PUTting the same message N times on a social media website will result in the
appearance of only one message.
DELETE command deletes a specific resource. Once deleted, the repeated execution
of the command on the same resource identifier has no effect.
OPTIONS command requests information about the specified resource. For example,
it may return a document analogous to WSDL that specifies interfaces and their
methods of the designated web service.
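The key semantic distinction above, POST is not idempotent while PUT and DELETE are, can be sketched with a toy in-memory resource store. This is a model of the HTTP command semantics only, not an HTTP server; the collection and resource names are illustrative.

```python
import itertools

store: dict[str, str] = {}
_ids = itertools.count(1)

def post(collection: str, body: str) -> str:
    # POST creates a new subordinate resource each time, so repeating it
    # changes the state again: N posts yield N messages.
    rid = f"{collection}/{next(_ids)}"
    store[rid] = body
    return rid

def put(rid: str, body: str) -> None:
    # PUT stores the representation at a known identifier; repeating it
    # leaves the same single resource in the same state (idempotent).
    store[rid] = body

def delete(rid: str) -> None:
    store.pop(rid, None)  # deleting again has no further effect

post("messages", "hello"); post("messages", "hello")      # two resources
put("messages/profile", "hi"); put("messages/profile", "hi")  # still one
print(len([k for k in store if k.startswith("messages/")]))  # → 3
```

Two identical POSTs produced two resources, while two identical PUTs produced one, matching the social media example in the text.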
Using the HTTP protocol as the main transport mechanism for RESTful web ser-
vices adds the benefit of caching. The HTTP specification defines fields in the header
of an HTTP request that control how the HTTP responses are cached. For example, the
field Age specifies the duration in seconds since the information was obtained and the
field Date specifies when the representation state response was created. Caching may
provide stale data, however, it may significantly improve the availability of the system.
We will explore the tradeoff between these important properties later in the book.
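A cache consulting the Age field makes a simple freshness decision, sketched below; the 60-second freshness budget is an illustrative assumption (in HTTP it would come from a cache-control directive), not a value from the specification.

```python
def is_fresh(age_seconds: int, max_age_seconds: int = 60) -> bool:
    """Decide whether a cached response whose Age header reports
    age_seconds may still be served without revalidation."""
    return age_seconds < max_age_seconds

print(is_fresh(30))   # → True  (serve from cache; possibly stale data)
print(is_fresh(120))  # → False (revalidate with the origin server)
```

Serving while fresh improves availability at the cost of potential staleness, which is the tradeoff the text refers to.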
9.5. INTEGRATED APPLICATIONS AS WEB SERVICE WORKFLOWS 137
RESTful web services are easier to make available to clients without creating UDDI
registries. All that is required is to report the name of a web service, describe its
semantics, provide the URI, and give some examples of input parameters and the
corresponding responses. Many commercial companies and organizations release development
kits and specifications for their web services that other companies can use. For ex-
ample, United Parcel Service (UPS), an American multinational package delivery and
supply chain management company with revenue of over USD $65 billion as of
2017, offers its UPS developer kit for its web services, with which clients can embed
UPS tracking and shipping information in their applications10 . The list of available
web services includes, but is not limited to, address validation, pickup, shipping, signature
tracking, and many other delivery-specific APIs supported by the UPS web services.
online-tools-rates-svc.page
sends a notification to a company accountant, who uses Purchase Invoices and Esti-
mates (PIE) to create invoices for ordered goods. When the ordered goods are received
from OD, a receiving agent compares them with the entries in QE. The accountant can
view but cannot modify records in QE, and likewise, no other agent but the accountant
can insert and modify data in PIE. If the received goods correspond to the records in
QE, the receiving agent marks the entries for the received goods in QE and notifies the
accountant. After comparing the invoices in PIE with the marked entries in QE and
determining that they match, the accountant authorizes payments.
Each of these services can be used as a standalone application, however, it would
require a significant manual effort to enter the data into each application and to verify
the entries by matching them against one another. Real EPSes are much more so-
phisticated, since there are many government and industry regulations that govern pro-
curement processes, and these regulations are updated and modified. Correspondingly,
EPSes must keep up with these updates and modifications. Unfortunately, modifying
the source code of these services is laborious, error-prone, and it takes time, which is
critical in the competitive environment where other companies can adjust their services
quicker and at a lower cost.
This example shows that it is important to reflect changing business relationships
in integrated applications where the distributed state is maintained by two or more
services of different business parties and business transactions represent a consistent
change in the state of the business relationships. A key to creating flexible integrated
applications is to separate the concerns for business relationships among services from
the concerns that describe the business logic or units of functionality (i.e., features) in
each service. This separation of concerns is easy to achieve in graph representations
called workflows, where nodes represent operations performed by individuals, groups,
or organizations, or machines and edges represent the business relationships between
operations. A workflow can be viewed as an abstraction of some unit of work separated
in finer granular operational units (e.g., tasks, workshares or worksplits) with some
order imposed on the sequence of these operational units.
Question 9: Create a workflow for an EPS and explain how to map its nodes
to web services that implement the EPS’ functionality.
The structure of many workflow systems is based on the Workflow Reference Model
that is developed by the Workflow Management Coalition (WfMC)11 . Workflow appli-
cations are modeled as collections of tasks, where a task is a unit of activity that is
modeled with inputs and outputs. The workflow structure is defined by interdependen-
cies between constituent tasks.
A business process is easy to represent using workflows with starting and comple-
tion conditions, constituent activities and rules for navigating between them, tasks and
references to applications which should be run as part of these tasks, and any workflow
relevant data that is ingested and manipulated by these applications. Once a workflow is
constructed, the workflow enactment software interprets it and controls the instantiation
11 http://www.wfmc.org
9.6. BUSINESS PROCESS EXECUTION LANGUAGE 139
of processes and the sequencing of activities, adds work items to the user work lists,
invokes applications, and manages the internal states associated with the various process
and activity instances under execution. In addition, the workflow enactment software
includes checkpointing and recovery/restart information used by the workflow engines
to coordinate and recover from failure conditions. Essentially, constructing distributed
applications using web services is a workflow wiring activity where the engineer de-
fines a workflow to implement some business applications, then selects web services
with specific semantics that satisfy some business feature requirements, and finally,
wires these web services using the workflow schema into the application that will run
on top of some workflow enactment engine whose job is to deliver the input data to
these web services in some order, take their outputs, and continue in this manner until
some conditions or objectives are met. Thus, the key is how to express workflows, so
that they can be executed by the underlying enactment engine.
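The enactment-engine behavior described above, deliver inputs to services in dependency order and feed their outputs forward, can be sketched as a tiny engine over a task graph. The task names and the dependency schema are illustrative assumptions, not part of the WfMC model.

```python
from graphlib import TopologicalSorter

def run_workflow(tasks, deps, initial):
    """tasks: name -> function(inputs dict) -> value; deps: name -> [input names].
    Runs each task once all of its inputs are available (topological order)."""
    order = TopologicalSorter(deps).static_order()
    values = dict(initial)  # data items already available at the start
    for name in order:
        if name in tasks:
            values[name] = tasks[name]({d: values[d] for d in deps.get(name, [])})
    return values

# A toy procurement-like workflow: validate an order, then invoice and ship.
tasks = {
    "validate_order": lambda inp: inp["order"].strip().lower(),
    "invoice":        lambda inp: f"invoice for {inp['validate_order']}",
    "ship":           lambda inp: f"shipped {inp['validate_order']}",
}
deps = {
    "validate_order": ["order"],
    "invoice":        ["validate_order"],
    "ship":           ["validate_order"],
}
result = run_workflow(tasks, deps, {"order": " Widgets "})
print(result["invoice"])  # → invoice for widgets
print(result["ship"])     # → shipped widgets
```

Replacing each lambda with a call to a remote web service turns this sketch into the wiring activity the text describes: the engine stays the same while the services behind the task names change.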
Figure 9.5: A BPEL program that defines a synchronous RPC between a client and a server.
at which remote objects are deployed and bindings specify how communications are
performed using concrete protocols.
The next two sections are defined by elements variables in lines 7–10 and
sequence in lines 11–16. Variables are named locations that store values and in
our example we use two variables named request and response whose specific
message types are defined in some WSDLs. The variable request is somehow
assigned a value that will be received by the remote function someFunc defined in
the section sequence that is a set of sequential activities receive and reply.
The former activity specifies in lines 12–13 that it receives a message of the type
tns:requestMessage from the partnerlink clnt and it will automatically put
the value of this message into the variable request. Finally, it will return the reply
contained in the variable response to the client caller.
Question 10: Describe how you would implement an EPS with BPEL.
It is important that the client program does not have any knowledge about the server
program. The BPEL program implements the workflow based on the notions of ports
and data types – it provides location and migration transparencies to allow web services
to issue requests and receive reply messages independently of how other web services
are created and deployed. The underlying protocols govern how web services are ac-
cessed, however, each web service is opaque – its internal implementation is hidden
from its users and the users know only of its interfaces and their exposed methods as
declared in WSDLs. Building applications becomes akin to wiring components on a
dashboard, except that these components are web services, which can be located on
servers connected to the Internet around the world.
9.7 Summary
In this chapter, we describe web services, RPC server objects which respond to clients’
calls made to the web service using the Internet, their underlying protocols, and how
to build applications using web services. Our starting point was the Simple Object Ac-
cess Protocol (SOAP), a rather heavyweight way to enable web services to interoperate
by describing method invocations and data exchanges in XML messages embedded in
HTTP. Next, we presented Web Services Description Language (WSDL) as an IDL of
web services. A client program can obtain the WSDL for a web service by sending
a SOAP request and analyze the interfaces and methods (i.e., messages and ports) ex-
posed by the web service. Aggregating many WSDLs for different web services makes
it possible to create a universal database as a web service in the Universal Description,
Discovery and Integration (UDDI) mechanism, where clients can search for other
web services that have desired properties expressed in search queries. As a lightweight
alternative to SOAP, we describe the Representational State Transfer (REST) architec-
tural style for constructing HTTP requests to access and manipulate web services. We
conclude by explaining how distributed applications can be built as workflows connect-
ing web services.
Chapter 10
What is the right granularity for a web service? How to measure the granularity of a
web service? Lines of code (LOC) is a very popular measure in systems and software,
however, like many generic measurements it hides some important semantics, e.g., how
many functional units are implemented in a web service. At one extreme, the entire
application can be created as a single monolithic web service. Opposite to it, the design
of the application can be decomposed into highly granular web services, where each
service may implement only a few low-level instructions (e.g., read a line from a file or
write a value to a memory location). While a single monolithic web service may be
easier to debug, its scalability, efficiency, and performance may suffer, since it
will likely be a bottleneck for processing multiple requests simultaneously. In addition,
different features of this web service may be implemented and supported by different
geographical entities, thus making it difficult to deploy as a single monolithic unit.
Creating highly granular web services has its disadvantages, most notably tracking
deployment and enabling choreography/orchestration of thousands of these services
in a single application, performing transactions across the multitude of web services,
analyzing logs to debug production failures, and updating web services with patches and
new versions, among many other things. Equally important is the issue of how to deploy
highly granular web services – placing each service in a separate VM is likely to result
in thousands of running VMs for a single application. Each VM includes its own
virtualization layer and the OS, thus consuming significant resources, many more than
are needed to run a web service.
In the world of transportation, containerization is a freight mechanism where trans-
ported goods are packed in International Organization for Standardization (ISO)-defined
shipping containers that have standardized dimensions and support for keeping the
transported goods in the original condition. Goods must be isolated from one another
inside a container to prevent their accidental damage during shipping and inadvertent
tampering as well as isolated from the external conditions (e.g., rain or high temper-
atures). Using the transport containerization analogy, a container for service compo-
nents includes an isolation mechanism that allows services to run independently from
one another and the storage mechanism that keeps the service image inside the container.
VMs are containers, of course, but given how diverse web services are, different types
of containers are needed to host them, and this is what we discuss in this chapter.
10.1 Microservices
Creating monolithic applications, which contain programming modules that are often
run in the same address space with explicit hardcoded dependencies, is popular espe-
cially during the prototyping phase of the software development lifecycle. The idea is
to demonstrate the feasibility of a certain business concept and to receive the feedback
from stakeholders. With monolithic application design, programmers are not required
to worry about issues that plague distributed computing, e.g., availability, scalability,
or performance of the prototypes. However, monolithic applications are difficult to
deploy and maintain in the cloud environment for the same reasons that make creating
monolithic applications popular – the hardcoded dependencies in the source code
prevent stakeholders from quickly adding and removing features and from provisioning
resources in a granular manner to the parts of the applications that require different
support from the underlying hardware.
Decomposing an application into smaller components or services and deploying
them in distributed VMs allows stakeholders to distribute the load effectively among
these VMs and to elastically scale up and out these VMs with the increasing workload.
Consider a web store where customers search and view products much more frequently
than they purchase them. Splitting the web store application into searching, purchasing,
and shipping services allows the application owner to scale out searching services to
provide fast responses to customers whereas fewer resources are needed for purchasing
and shipping services, thus reducing the overall deployment cost while maximizing the
performance of the application. Doing so would be difficult if the application was a
single monolithic service itself.
Only the source code of the microservice can change its state; other microservices can obtain
projections of its state by invoking its exposed functions and receiving the return values.
In general, there should be no global memory that can be accessed and manipulated by
microservices, since each microservice would be able to affect the execution of other
microservices by changing values in the global memory locations. Doing so results
in losing the autonomy and increasing the coupling among microservices, since the
behavior of each microservice will depend on specific operations performed by the
other microservices that change the global state.
Microservices can be coupled indirectly via exposed memory regions when these
microservices are deployed on the same hardware platforms. Consider a situation when
one microservice writes data in some memory location it owns before the underlying
OS switches the execution to some other microservice that runs on the same hardware.
If the memory is not erased when switching contexts, which may happen to reduce
the context switching time, then the other microservice may read the data since it will
own the memory locations during its execution. Therefore, it is important that the
underlying platform provides strong isolation of contexts for microservices that will
prevent accidental data sharing and data corruption between microservices.
Question 2: How would you create a distributed application where all
microservices are coupled dynamically, at runtime?
One of the most difficult measures to quantify for microservices is
granularity. At one extreme, a microservice may expose a single function that returns
some constant value. At another extreme, a microservice can export many functions
each of which implements a whole business process and all these functions are used in
one another. The former results in a large number of microservices that are difficult to
remember and programmers are overloaded with information about them. Moreover,
latency becomes a primary concern when an application is composed of too many
microservices, since the number and the frequency of messages between them results in
a significant performance overhead. A rule of thumb is that a microservice implements
some basic unit of functionality that may be used in different subsystems and they can
be composed to implement larger units of functionality.
Naturally, microservices are frequently implemented as RESTful web services or
as software components whose interfaces are exposed via web services. Moreover,
a microservice can be implemented as a function, which is invoked in response to
some events, e.g., (ev1,ev2, store)=>store(ev1.time-ev2.time). In
this case, the function is instantiated at runtime as a lambda function literal that is in-
voked in response to events ev1 and ev2 and another function called store is passed
as a parameter. The body of the function invokes store on the value computed as the difference between the time attributes of the two events.
Recall that the FaaS cloud computing model is designed to invoke “serverless” func-
tions in response to certain events. Thus, an application may consist of microservices
of different granularities that are designed to be hosted in different runtimes and whose
execution times can last from microseconds to many days at a time.
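The lambda literal above can be sketched in Python; this is a minimal illustration of the idea, where the Event class, the in-memory sink, and the store callback are assumptions made for the example, not an actual FaaS runtime API.

```python
from dataclasses import dataclass

@dataclass
class Event:
    """A minimal event carrying only a timestamp, mirroring ev1 and ev2 above."""
    time: float

def make_store(sink: list):
    """Return a store callback that records computed values in an in-memory sink."""
    def store(value: float) -> float:
        sink.append(value)
        return value
    return store

# The serverless-style handler: the runtime would invoke it when both events arrive.
handler = lambda ev1, ev2, store: store(ev1.time - ev2.time)

sink = []
result = handler(Event(time=10.5), Event(time=7.0), make_store(sink))
```

In a real FaaS platform, the runtime instantiates the lambda literal and supplies the events and the store function; here the call is made directly to show the data flow.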
Given the diversity of microservices in terms of their execution time and com-
plexity, they require deployment containers that satisfy different properties of these
microservices. VMs are heavyweight – they have much longer startup and shutdown
times, they occupy a much larger memory footprint, and they contain many virtualization
layers (e.g., hardware, drivers, OSes) with many utilities and tools that are not needed
for most microservices. Clearly, deploying a stateless lambda function that takes only
one millisecond to run in a dedicated VM that hosts a guest OS is an overkill – it costs a
lot and the performance overhead is significant, especially if this function is called once
every couple of minutes or so. Therefore, a variety of containers are created for hosting
different types of microservices. We will discuss them in the rest of this chapter.
Question 3: What problems would you have to deal with when deploying an
application that is created from microservices that create side-effects?
Figure 10.1: Scala pseudocode example of a RESTful web service using Twitter Finch.
methods, and Finch combinators. The route specification is used in matchers that are
methods for finding the endpoint using the route specification. For example, in line 5
the route is given by the method Post followed by the combinator / called and then,
that is followed by the string variable students that is in turn followed by and then
combinator and the social security number (SSN) of the student and finished with the
map combinator /> that takes the input value for the SSN from the route specification and passes it as the input parameter SSN of the programmer-defined function in line 6 – line 7 that creates a record for the student. Thus, the variable enrollStudent
specifies a route to creating a student’s record for this microservice.
The other route, studentInfo, is shown in line 8 – line 13; the reason for creating a separate route is that it deals with the records of already created students, as opposed to the route enrollStudent that deals with requests of students whose records haven't been created yet. The route contains the HTTP methods Get, Put, Delete, and Post for obtaining the list of students in line 9, obtaining the information on a specific student identified by her uid in line 10, updating the information about a specific student in line 11, deleting a student's record in line 12, and creating a subsection in the student's record to list all her awards in line 13. The keywords long and
string are extractors whose goal is to convert a sequence of characters in the URL, delimited by the and then separators, into a value of the designated type. The combinator => is the function mapping operator that separates the input and the result of the programmer-supplied functions for the microservices. Finally, the combinator :+: combines routes
in the second parameter to the method serve in line 14, thereby creating a web server object that listens to clients' HTTP requests and executes functions in response to them.
The web server waits in line 15 for a shutdown signal and once received, it closes the
connection in line 16 after it cleans up necessary resources.
This example illustrates three important points. First, a web server is a distributed
object that contains (micro)services that are defined as combinations of routes to end-
points and user-defined functions that are invoked in response to clients' HTTP requests. Web servers with contained microservices can be deployed in VMs with all
necessary dependencies and load balancers can distribute requests among replicated
web servers in multiple VMs thus scaling out applications. Second, depending on the
functions performed by microservices, they may be deployed together locally in web
servers to minimize communication time among them. As we already know, localities
impact the latencies significantly.
Finally, adding new routes is easy, since the old ones can be kept intact with func-
tions redirecting requests from old routes to new ones if necessary. Unfortunately, it
may not be possible to deploy many microservices in different web servers in the same VM for many reasons, the most important of which is the lack of isolation among microservices and the difficulty of scaling up VMs, since some microservices may need a lot more resources to perform their functions than their counterparts. In addition, web
servers will stay active in the VM all the time consuming resources, even if requests
from clients arrive infrequently. Equally important is the support for interacting with
persistent storage (i.e., files and databases). As containers, web servers are agnostic to what database support is required for microservices; however, some other containers provide this level of support.
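The route-matching idea behind web server containers can be sketched in plain Python. This is a hypothetical dispatcher, not Finch's actual API: a route table maps an HTTP method and a path pattern to a handler, and regular-expression groups play the role of the long and string extractors described above.

```python
import re
from typing import Callable, Dict, Tuple

# In-memory student store; keys are uids assigned at enrollment time.
students: Dict[int, dict] = {}

def enroll(ssn: str) -> dict:
    """Create a student record keyed by a freshly assigned uid."""
    uid = len(students) + 1
    students[uid] = {"uid": uid, "ssn": ssn}
    return students[uid]

def student_info(uid: int) -> dict:
    return students[uid]

# Route table: (HTTP method, path pattern with typed groups) -> handler.
ROUTES: Dict[Tuple[str, str], Callable] = {
    ("POST", r"^/students/(?P<ssn>\d{9})$"): lambda ssn: enroll(ssn),
    ("GET",  r"^/students/(?P<uid>\d+)$"):   lambda uid: student_info(int(uid)),
}

def dispatch(method: str, path: str):
    """Find the first route whose pattern matches the path and invoke its handler."""
    for (m, pattern), handler in ROUTES.items():
        if m == method:
            match = re.match(pattern, path)
            if match:
                return handler(**match.groupdict())
    raise LookupError(f"no route for {method} {path}")
```

A real web server would parse the method and path out of incoming HTTP requests before calling dispatch; the sketch only shows how routes, extractors, and handler functions fit together.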
processes share data while preventing access by other processes is complicated in
a dynamic environment where it is unknown in advance what microservices will be
installed on each physical computer. Moreover, it is difficult to achieve location and
migration transparencies, since when moving a microservice to a different computer it
is important to ensure that all needed libraries and configuration files are installed there,
otherwise the microservice will not run. Finally, it is important to regulate how many resources each microservice can consume, since the owners of microservices pay for the consumed resources, and it is unfair to allow a cheaper service to utilize the same CPU equally with a more expensive microservice whose owner paid for scaling up.
produced is that once the directory is chroot jailed, the directory /bin becomes inaccessible and the program sh cannot be accessed any more.
This leads us to the second part of the idea: resolve all program dependencies and move them to the chrooted directory. One solution is to obtain the list of all dependencies by using the Unix command ldd, which determines all shared objects that
are required to run a specific program. Consider the output of this command in Fig-
ure 10.3. A technically savvy reader will notice that this command is executed under
Cygwin, a Unix OS layer ported to the Windows OS. The command is shown in line 1
with the output of all shared objects/libraries shown in lines 2–14. The output consists
of the name of the shared object with the separator => followed by the absolute path
to the shared object and its hexadecimal memory address in parentheses. All these de-
pendencies and the program itself can be copied to the destination directory and the set
of paths to these objects can be created to mimic the paths to the original destinations.
Detecting and recreating dependencies within chroot jail directories can be easily
automated by extending the original skeleton program shown in Figure 10.2. Moreover,
many popular platforms and frameworks have well-known and documented dependen-
cies, so it is possible to organize these dependencies in compressed file archives and
store them in repositories from which they can be easily retrieved on demand. For
example, Java programs depend on the Java Runtime Environment (JRE), so different
versions of these environments can be retrieved and uncompressed in jails. Of course,
some dependencies may be specified manually in configuration files.
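This automation can be sketched in a few lines of Python; the helper below is hypothetical and assumes ldd-style output as in Figure 10.3: it parses each `name => /abs/path (0xaddr)` line and copies the shared object into the jail directory, recreating its original path there.

```python
import os
import re
import shutil

# Matches ldd output lines of the form: "name => /abs/path (0xaddress)"
LDD_LINE = re.compile(r"^\s*\S+\s+=>\s+(\S+)\s+\(0x[0-9a-fA-F]+\)")

def parse_ldd_output(output: str) -> list:
    """Extract the absolute paths of shared objects from ldd output."""
    paths = []
    for line in output.splitlines():
        match = LDD_LINE.match(line)
        if match:
            paths.append(match.group(1))
    return paths

def copy_into_jail(paths, jail_dir):
    """Recreate each dependency's directory under the jail and copy the file."""
    for path in paths:
        dest = os.path.join(jail_dir, path.lstrip("/"))
        os.makedirs(os.path.dirname(dest), exist_ok=True)
        shutil.copy2(path, dest)
```

Combined with the chroot skeleton of Figure 10.2, running parse_ldd_output over the output of ldd for a target program and then copy_into_jail would populate the jail with dependencies like those shown in Figure 10.3.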
The isolation provided by jailing containers is not absolute. Consider different
levels of caches, e.g., L1, L2, and L3 caches. When the OS kernel context switches
between different processes that run in different containers, the data in these caches
may be shared between processes. A full discussion of the computer architecture issues
with cache data sharing is beyond the scope of this book, however, depending on the
type of memory address reference, i.e., physical or virtual, the OS reads the information from a special unit called the Translation Lookaside Buffer (TLB) to determine what process
1 $ ldd /bin/sh
2 ntdll.dll => /cygdrive/c/WINDOWS/SYSTEM32/ntdll.dll (0x7ffb3b1f0000)
3 KERNEL32.DLL => /cygdrive/c/WINDOWS/System32/KERNEL32.DLL (0x7ffb3a300000)
4 KERNELBASE.dll => /cygdrive/c/WINDOWS/System32/KERNELBASE.dll (0x7ffb37780000)
5 ADVAPI32.DLL => /cygdrive/c/WINDOWS/System32/ADVAPI32.DLL (0x7ffb3a1f0000)
6 msvcrt.dll => /cygdrive/c/WINDOWS/System32/msvcrt.dll (0x7ffb3a150000)
7 sechost.dll => /cygdrive/c/WINDOWS/System32/sechost.dll (0x7ffb3a550000)
8 RPCRT4.dll => /cygdrive/c/WINDOWS/System32/RPCRT4.dll (0x7ffb3b0c0000)
9 SYSFER.DLL => /cygdrive/c/WINDOWS/System32/SYSFER.DLL (0x74ef0000)
10 cygwin1.dll => /usr/bin/cygwin1.dll (0x180040000)
11 cygreadline7.dll => /usr/bin/cygreadline7.dll (0x579cd0000)
12 cygiconv-2.dll => /usr/bin/cygiconv-2.dll (0x5461d0000)
13 cygintl-8.dll => /usr/bin/cygintl-8.dll (0x5ee2d0000)
14 cygncursesw-10.dll => /usr/bin/cygncursesw-10.dll (0x48ca30000)
Figure 10.3: The dependencies of the program /bin/sh are obtained using the utility ldd.
and which memory address accessed by this process has the corresponding entry in
the cache. Yet, the cache access happens in parallel with the TLB access, so that the
speculative cache read will reduce the wait if the TLB lookup confirms the cache hit.
This example shows that absolute isolation is difficult to achieve; however, for most applications the level of isolation provided by containers is enough.
Two more important concepts enable effective program isolation and access man-
agement: capabilities and namespaces. A user who logged into the Unix OS with the
superuser privileges has the permission to run any command and to perform any oper-
ation on any software or hardware installed on the system and to install any software
package or to configure new hardware modules. However, only trusted system admin-
istrators have the superuser privileges and even if they do, they often log into the system
as regular users to prevent accidental damage by executing, for example, the command
rm -r /root. Capabilities are special permission bits that can be set programmatically for processes to execute some privileged commands, and only for the duration of these privileged commands. For example, setting the capability CAP_SYS_RAWIO enables the process to access a USB drive without obtaining the superuser privileges. Fine-grained permission setting is the main benefit of using capabilities.
The reason that capabilities are important for containers is because the container
owner is the superuser of the jailed environment, however, the owner is not the supe-
ruser of the Unix OS in which this container is deployed. It means that a process that
runs in a container may request certain capabilities, e.g., to access ports or files in the
system to obtain the data that is necessary for further execution. Consider a situation
when a containerized process must ssh to a different computer to run a different pro-
gram. On the one hand, ssh is an external dependency for this process, however, it is
also a popular Unix utility. Instead of copying it to every container, it is easier to grant
containerized processes capabilities to run this and similar utilities.
Finally, Unix namespaces partition resources into groups and isolate them from exposure to other groups. Consider as an example the XML namespaces that are used
for resolving name clashes for elements and attributes in an XML document. By spec-
ifying the scope of the namespace, the same name for elements and attributes can be reused in different scopes without clashes.
Question 8: Explain how to prevent name clashes for the filesystem that is
used by different containers.
Figure 10.4: An example of Dockerfile for creating a container with a Java program.
rather than indicate a practical need to do so. We moved away from deploying distributed objects in VMs because of the high overhead of the latter; putting a container into a VM, or vice versa, must accomplish certain objectives that outweigh the cost of the added overhead.
10.4 Docker
Docker is a popular open-source container management system; in addition, its enterprise version is sold commercially. Docker uses the core ideas that we described in Section 10.3 to realize container management in a chroot-jailed environment with several additional services. Similar to the definition of
a process as an instance of a program, a Docker container is an instance of an image,
which is recursively defined as a set of layered images each containing software mod-
ules that the target program depends on. For example, a RESTful microservice written
in Java depends on the JDK and the Apache web server that in turn depend on the JVM
that depends on Ubuntu Linux. These dependencies are specified in a simple declara-
tive program in a file called Dockerfile that the Docker platform uses to build an image that can be instantiated into a container on demand.
Consider how a container is created and deployed with a simple Java applica-
tion that contains a single Java class named PrintSomething with the method main that outputs some constant string to the console. Next, we compile the program and create a jar file named PrintSomething.jar with a manifest that specifies the main class. The program can be run with the command java -jar
PrintSomething.jar. Next, we create the file Dockerfile that contains the
following commands as it is shown in Figure 10.4. In line 1, we specify that the im-
age java:11 will be included in the container image. The first command in the file
is FROM <image>:<tag>, which specifies the name of the docker image and, optionally, its tag; in this case it is the image that contains the JVM and the JDK version 11. The next command, in line 2, is WORKDIR, which sets the working directory for all
commands that follow in the docker configuration file. In line 3 the command ADD
copies the file PrintSomething.jar to the filesystem of the image. The command EXPOSE in line 4 specifies that the docker container listens at port 8080. Only clients on the same Docker network can contact the container on this port; to make it available to clients located in other networks, the port must be published with the --publish (or -p) option of docker run. Finally,
in line 5 the command CMD runs the deployed program in the container.
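Based on the line-by-line description above, the Dockerfile of Figure 10.4 plausibly looks like the following sketch; the working directory and the exact form of the paths are assumptions.

```dockerfile
# Line 1: base image with JDK 11 (and thus the JVM).
FROM java:11
# Line 2: working directory for the commands that follow.
WORKDIR /app
# Line 3: copy the jar into the image's filesystem.
ADD PrintSomething.jar /app/PrintSomething.jar
# Line 4: the container listens at port 8080.
EXPOSE 8080
# Line 5: run the deployed program when the container starts.
CMD ["java", "-jar", "PrintSomething.jar"]
```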
The container image can be built by executing the command docker build -t PrintSomething . where the trailing dot designates the build context directory that contains the Dockerfile. Once the image is built successfully, it can be uploaded to
1 import docker
2 client = docker.from_env()
3 print(client.containers.run("mycontainer", ["ls", "-la"]))
4 for container in client.containers.list(): container.stop()
Figure 10.5: A Python program that uses the Docker SDK API calls to manage containers.
a Docker repository, e.g., DockerHub3 using the docker command docker push
/PrintSomething:latest. The user must log into DockerHub prior to execut-
ing this command and create the repository. Once the image is created and pushed into
the repository successfully, it can be pulled from the repository using the command
docker pull /PrintSomething on a computer where Docker is installed and
then the container can be run using the command docker run /PrintSomething.
Question 9: Install Docker and create a container for a simple Java HelloWorld
application that writes the message into a text file.
Docker is created using the client/server architecture where the Docker daemon is the central component of the server. The Docker daemon is responsible for the entire container management lifecycle, where it controls the creation and deployment of Docker images. Docker clients issue commands to the Docker daemon using the program docker as the Command-Line Interface (CLI). Moreover, Docker clients communicate
with the daemon using the RESTful API calls that are described in Docker’s documen-
tation4 . Consider a Python program that is shown in Figure 10.5. It uses the Docker
SDK API calls to create a container running a version of the Unix OS and to execute
a command to list the content of some current directory. In line 1 the docker package
is imported into the program environment and in line 2 the docker client is initialized.
The variable client references the local docker container environment, and using
this variable, in line 3 we obtain a list of deployed containers, select a container named mycontainer that runs a version of the Unix OS, and send the Unix command ls -la to obtain a list of files in the current directory in the container. Then, in line 4 we iterate through the list of containers and stop each container on the list. This example illustrates how it is possible to create programs that orchestrate containers.
Deploying containers in the cloud is a straightforward exercise. Consider the deploy-
ment instructions for Docker containers on Amazon Web Services (AWS) cloud5 or the
steps for deploying Docker applications on Microsoft Azure Cloud6 . Specific details
vary, but the idea is the same as we illustrated above with creating and deploying a
container image with a Java application. All major cloud vendors provide mechanisms
for container deployment. A programmer must ensure that ports are open and com-
puting nodes at which container images are deployed are up and running. Often, it is
3 https://hub.docker.com
4 https://docs.docker.com/develop/sdk
5 https://aws.amazon.com/getting-started/tutorials/
deploy-docker-containers/
6 https://docs.microsoft.com/en-us/azure/devops/pipelines/apps/cd/
deploy-docker-webapp?view=vsts
Question 10: Discuss pros and cons of using the ZFS for Docker.
Consider a CoW file system where data is organized in blocks. A filesystem struc-
ture called inode stores information about file metadata and pointers to the blocks in
which the file data is stored. Suppose the number of block pointers in an inode is lim-
ited to some N pointers. If the size of a block is less than some predefined threshold
then the inode points to this block directly, otherwise, the data block will contain a
pointer of its own to point to the other blocks of data that contain the spillover data
for the block. Various CoW filesystems have different constraints on how pointers and
blocks are organized, but the idea is basically the same.
Now, suppose new data is written into a file or its existing data is modified. In
CoW, the inode and blocks will be copied and new blocks appended to the pointers
from the existing blocks. Depending on the granularity of a block and how copying is
done, CoW may result in significant data duplication and degraded performance of the filesystem. Instead, it would make sense to use a filesystem that employs branch
overlaying, where an immutable part of a filesystem stays in one layer and a mutable
part of it is located in the other layer in its branch that extends some branch of the
immutable layer. Essentially, branches are unions of the different filesystems where
entries are either shown in all branches or they are hidden or combined using different
techniques. This strategy is implemented using the Union File System (UnionFS) that
combines disjoint file systems into a single representation by overlaying their files and
directories in branches. With UnionFS, applications can save the local states of con-
tainers in the corresponding branches without changing globally shared data among
different containers.
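The branch-overlay idea can be sketched in a few lines of Python; this is a toy model, not the actual UnionFS implementation: a lookup consults the topmost writable branch before falling back to the immutable base layer, and writes and deletions touch only the top layer.

```python
class UnionView:
    """Toy union filesystem: a read-only base layer overlaid by a writable branch."""

    WHITEOUT = object()  # marks a file as deleted in the upper branch

    def __init__(self, base: dict):
        self.base = base      # immutable lower layer, shared among containers
        self.upper = {}       # mutable per-container branch

    def read(self, path: str):
        if path in self.upper:
            value = self.upper[path]
            if value is self.WHITEOUT:
                raise FileNotFoundError(path)
            return value
        return self.base[path]

    def write(self, path: str, data):
        # Copy-up semantics: writes never touch the shared base layer.
        self.upper[path] = data

    def delete(self, path: str):
        # A whiteout entry hides the base file without modifying the base.
        self.upper[path] = self.WHITEOUT
```

Two UnionView instances over the same base dictionary model two containers: each sees its own local writes and deletions while the globally shared base layer stays unchanged.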
10.5 Kubernetes
The goal of Google’s cluster management systems, Borg and Omega, is to manage executions of jobs in a cloud computing cluster; we will discuss them in Section 12.4.
With the containerization of applications, the unit of deployment shifts to a container
7 https://www.opencontainers.org
rather than a process or a sequence of tasks. A container not only provides an isolation environment for a program; it contains images that resolve all dependencies required for the contained application to run. A key benefit is a new unit of abstrac-
tion for controlling and manipulating diverse applications in the heterogeneous cloud
environment – a container!
The container abstraction is important for three main reasons. First, it is important
to know in the cloud environment if the running application is responsive. Naturally,
each application may expose a friendly interface to respond to a heartbeat signal with some statistics of its execution. However, forcing developers of each application to implement such an interface is very difficult. Since containers automatically provide a RESTful interface for monitoring their content, the monitoring and health check problem
is addressed without imposing any constraints on the deployed applications. That is,
the container itself is an application; the actual running processes within the container
are encapsulated by its boundaries and hidden from the cloud that is only concerned
with the measured properties of the deployed containers.
The second reason is obtaining information about the structure of the deployed application and its dependencies. Knowing that the application uses some library and a
database that already exist in the cloud enables the cloud to automatically optimize the
space and the execution time for deployed containers by positioning them on the server
where the library already exists and scheduling them to use an existing connection pool
to the given database. Since metadata that describes the content of a container is given a priori, this data is embedded into the built container, which shares this data with the cloud management system.
Finally, the third reason is improving load balancing using the composability of containers. Instead of thinking about individual hardware resources, they are abstracted
as container units (e.g., a container with two 3GHz CPUs and 8GB of RAM). Within
these outer container units, inner containers are deployed to use hardware resources
attached to the outer units called pods. When an application is divided into collaborat-
ing microservices, they can be deployed in inner containers in a pod and resources can
be scheduled among them based on their constraints (e.g., log data collection can be
performed periodically and sent to a log store somewhere on a server in the cloud).
Question 11: Discuss the analogy of optimizing the space and resources with
physical shipping containers vs the software containers.
located in these containers. The execution context of a pod is shared by the containers
hosted in this pod, and these containers may in turn have their own contexts to provide
additional levels of isolation. Of course, the reason that containers are co-located in
one pod is that they are logically and physically coupled, i.e., the applications in these
containers in a pod would otherwise be executed on the same computer or on com-
puters linked on the same network. All containers within the same pod are assigned
the same IP address and they have direct access to the shared context of the pod. The
applications within the same pod can communicate using Interprocess Communication
Mechanisms (IPC) like pipes or shared memory. Therefore, one main benefit of putting
containers in the same pod is to improve their communication latency.
Kubernetes controls the life cycle of a pod. Once a pod is created, it is given a
unique identifier, and then it is assigned to a computing node on which its applications
execute. The pod can be terminated due to the node becoming inoperable or because
a timeout is reached. Unlike VMs, pods cannot be migrated between nodes; a new identical pod is created with a new unique identifier, and it can replace the existing pod, which will be terminated after the new pod is assigned to a different node. Pods can be controlled
via exposed REST interfaces that are detailed in the documentation9 .
Question 12: Explain why pods are not migrated to different nodes.
Kubernetes is created using the client/server architecture where nodes are clients that are connected to the Kubernetes master (or just the master), which uses a distributed key/value store called etcd10 . The master is implemented as a wrapper called hyperkube that starts three main components: the scheduler kube-scheduler, the controller manager kube-controller-manager, and the API server kube-apiserver. Kubernetes offers a CLI via its command kubectl. Each node runs a process called
kubelet that serves as a client to the master and it runs a web server to respond to
REST API calls. Kubelets collect and report status information and perform container
operations. The Kubernetes proxy kube-proxy configures iptables rules upon startup and load balances jobs for each container. It also acts as a network proxy by maintaining iptables rules to control the distribution of network packets across the containers.
Using Kubernetes for a cloud datacenter is a system administration job. Its essential components are understanding the network topology of the datacenter and its resources, and creating scripts that effectively distribute containers across the clusters and efficiently deploy them to maximize performance. Whereas an application developer cares about how to create, build, and debug applications and deploy them
in containers, a cluster system administrator has the goal of deploying these containers
in pods and scheduling these pods to nodes to utilize the cloud efficiently. This high
specialization of jobs within the cloud datacenter is a relatively new phenomenon.
9 https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.12/
#pod-v1-core
10 https://coreos.com/etcd/
First of all, many installations of the various flavors of Unix have many daemon
programs. Essentially, the boot system of many flavors of the Unix OS is designed
around init – the main daemon process that starts up when Unix boots and it is as-
signed PID = 1. It is the ultimate parent process and all other processes are spawned
by init. A newer module is systemd, a set of Unix utilities that replaces the classic init service. In many versions, the main daemon process consults a configuration file that specifies what other daemon programs it should spawn. Examples of well-known daemon programs are crond and nfsd, a job scheduler for tasks that need to
11 https://github.com/appc/spec
12 https://github.com/appc/spec/blob/master/SPEC.md
be executed at some time intervals and the service for the network file system. Having
many daemon programs may make the system less stable, since crashing and restarting
services make the system resources partially unavailable.
Second, many daemons work in the sleep/wake up/poll/work pattern where the
daemon program sleeps for some time, then wakes up, polls some resources, and performs the task. Doing so introduces an unnecessary overhead on the system, since
the polling mechanism consumes resources without performing any useful tasks. Unlike Docker, which has its own daemon implementation, rkt does not have its own daemons; it depends on systemd to start tasks. For example, to start a container, the
following command is executed: systemd-run --slice=machine rkt run
coreos.com/etcd:vX.X.X where X designates a (sub)version of the key/value
store etcd. Interested readers can study the documentation that describes how rkt is
used with systemd.
10.7 Summary
In this chapter we learned about cloud containerization. We started off by understand-
ing the concept of microservices and how they need lightweight runtime environments,
since they are small and they run for a short period of time. We moved on to review
web servers as containers for RESTful microservices and showed how programming
language support is important for writing concise and easy-to-understand code of mi-
croservices. Next, we introduced the container concept, where it provides a closure
environment for fully packaged services. We rolled up our sleeves and showed how containers can be implemented from scratch using basic OS services. Interestingly, this chapter shows how closely systems and OS research is interwoven with cloud computing. To enable some desired behavior, one often need not look much farther than the API calls exported by the underlying layers of software. The last three
sections of this chapter review three popular industry-wide systems: Docker, Kuber-
netes, and the application container specification and its implementation, rkt. It is
noteworthy that this is a fast-evolving field, and readers should check technical news sources and documentation to keep abreast of new developments in containerization and microservice deployment platforms.
Chapter 11
Cloud Infrastructure:
Organization and Utilization
met, the computation is considered failed, pretty much the same way as if the result
of the computation was incorrect. In fact, for a number of cloud-based applications a
slightly incorrect result may be more acceptable than the result that was computed with
a significant latency.
they send and receive messages asynchronously, and the channels between processes contain messages with which these processes communicate [34, 66]. Also, communication times are bounded, so an indefinite latency does not exist. Without the ability
to obtain a consistent global state at a given time, when a partial failure occurs, i.e., when some computer or a part of a network malfunctions, it is very difficult to determine its effect on the correctness and the performance of the distributed application in general.
Masking failures can be done in computer hardware, so it is important to understand
the cost and impact associated with the choice of hardware for the cloud infrastructure
to provide failure transparency.
Up to this point, we abstracted away the physical components of the cloud infras-
tructure, because we concentrated mostly on programming interfaces of the distributed
objects. However, knowing the physical organization of the cloud infrastructure enables stakeholders to assign distributed components to physical resources in a way that decreases latency and improves the overall performance of the application significantly. Consider the following example, where cluster A is located in North America and cluster E is located in Europe. Suppose that a data object is hosted in A and its client object, which is hosted in E, also hosts some data. The client pulls the data from A and merges it with its local data using some algorithm. Suppose that there are client objects located in the Middle East, M, that submit requests to E to obtain a subset of the results of the data merging using some criterion. A question is whether the assignment of objects to certain clusters results in better performance of this distributed application with higher utilization of the provisioned resources.
One may argue that we need to know more about the sizes of the objects in A and E. If A contains big data and E has a small database, then transferring large chunks of data from A to E results in very high latency, whereas transferring the data from E to A can be done with little overhead. Of course, one can argue that the object from E can be moved to A; however, there may be two arguments against doing so. First, since the M clients are located close to E, the network latency may be much smaller when obtaining the results from E rather than from A. Second, privacy rules and other country-specific laws may dictate that certain types of data and computations must not leave specific geographies. On the other hand, A may not have the computational capacity to perform the intensive computations provided by E's infrastructure. Thus, we can see from this simple example that the cloud infrastructure makes a significant impact on the application's architecture as well as on its deployment to achieve the best performance.
In this chapter, we will review various elements of datacenters and how they are organized to host cloud-based applications. At an abstract level, we view the organization of a cloud as a graph, where nodes are processing units and edges are network links over which these processing units exchange data. A node can represent a whole datacenter, a cluster in a datacenter, a computer in a cluster, a CPU or a Graphics Processing Unit (GPU), a VM, or any other resource that is used to execute instructions. All elements of the graph have latencies and costs, and their capacities are provisioned to distributed applications. We will review the latencies of various components of the cloud
infrastructure, and then we will briefly review the major components, their functions, and their advantages and drawbacks.
11.2 Latencies
Certain latencies are important to know1, since they drastically affect the performance of cloud-based applications. Using approximate ranges for latency values and other
parameters that include the cost of devices, we will construct a hierarchy of devices and
their aggregates in the cloud infrastructure. Since big data applications have significant
presence in cloud computing, we will construct a storage hierarchy of devices that host
and transmit data for computations.
We start with the finest granularity of devices, specifically the on-CPU memories used as caches. Accessing the L1 cache has the smallest latency in the storage hierarchy, since it is built directly into the CPU with a zero wait state, meaning the CPU accesses this memory without any delay. In general, to access memory the CPU must place an address on the bus and wait for a response, so zero wait means that there is no such wait, which is fast but also expensive. One part of L1 keeps data and the other part keeps program instructions. L1 is the most expensive and the smallest available storage, usually less than 100Kb; a reference takes between one and two nanoseconds, i.e., around 10−9 seconds for an L1 cache hit, and this is the smallest latency in the latency hierarchy. The L2 cache is located outside the CPU core; it is larger than L1, able to hold around 512Kb and even over one megabyte on some CPUs, and it also provides zero-wait data access. The latency of an L2 cache hit is approximately three to five nanoseconds, and the idea is that if accessing L1 results in a miss, then L2 is checked for the data before accessing the main memory.
Given how fast the L1 cache is, the reader may wonder why it is not made much bigger and why it is so expensive. Increasing the size of the L1 cache means increasing the area of the CPU, and larger die sizes decrease yield, a term that refers to the number of usable CPUs that can be produced from the source materials. Suppose there are D defects per square inch of the source material. If each chip takes S square inches, then the likelihood that a chip has a defect is approximately DS. The higher this likelihood, the lower the yield of chip production, and this is one of the reasons why circuit sizes decrease. The other reason is heat, which is often proportional to the size of the chip; when bigger chips produce too much heat, the computer systems overheat and get damaged sooner, and bulkier cooling systems are needed to keep these computers functioning. These and other reasons make these small, fast memories very valuable and expensive.
L3 caches frequently appear on multicore CPUs, and unlike the L1 and L2 caches, which are exclusive to each CPU core, the L3 cache is shared among all cores and even other devices, e.g., graphics cards. The size of an L3 cache can be 10Mb, and it is frequently used to store data from inter-core communications. The latency of an L3 cache hit is in the range of 10-20ns depending on the CPU. Finally, the L4 cache is large, measuring 100+Mb, and its access latency after an L3 cache miss is in the range of 30-40ns. L4 is also called a victim cache for L3, since the data that is evicted from
1 https://people.eecs.berkeley.edu/~rcs/research/interactive_latency.html
L3 is moved into L4, which can be accessed both by the CPU and the GPU, making it
potentially valuable for high-performance graphic applications.
When the instructions are executed by the CPU and a jump instruction takes a
branch that was not predicted by a branch prediction algorithm that runs within the
CPU, then the prediction is discarded and the proper instruction from the correct branch
is loaded. This latency is around three nanoseconds, comparable with an L2 cache hit, and it is approximately the latency of a local function call if we count only locating the address of the function's first instruction. Recall a context switch between threads or processes: when a quantum expires or a thread/process is blocked on I/O, the OS kernel saves the state of one thread/process, puts it to sleep, and restores the state of another to execute it. Measuring context switches is a nontrivial exercise in general; the cost of a context switch on Intel CPUs is between 3,000 and 5,000ns at the high end, and depending on the benchmark the context-switch latency for a process is approximately 1,500ns, although these numbers vary with the number of processes/threads and available CPUs2. Interestingly,
transferring data between the CPU and the GPU is expensive and it depends on the size
of the data and the types of the processors, and the range is from 5,000ns for 16 bytes
to 500,000,000ns for 64Mb [55].
A mutex (un)lock takes approximately 20ns, which is close to an L3 cache hit, and it is five times faster than referencing a location in RAM (i.e., ≈100ns), which in turn is 20 times faster than compressing 1Kb with the Zippy compressor (i.e., ≈2,000ns or 2μs). Sending 2Kb of data within the same rack network takes less than 100ns, which is more than 50 times faster than reading one megabyte of data from RAM sequentially (i.e., 5,000ns or 5μs), which is three times faster than a random read from a solid-state drive (SSD), a flash memory technology (e.g., over 15,000ns or 15μs). A round trip of a packet in the same datacenter takes 500,000ns or 500μs. A seek operation of a hard disk drive (HDD) with a moving actuator arm takes six times longer than that, about three milliseconds, which is also three times longer than reading 1Mb from the same HDD sequentially. While the cost of the SSD is much higher, its performance w.r.t. sequential and random reads and writes beats the HDD with a moving arm actuator by orders of magnitude. Finally, a packet roundtrip from California to the Netherlands takes approximately 150 milliseconds.
To summarize, the storage/latency/cost hierarchy can be viewed as the following chain, where the latency numbers in parentheses are in nanoseconds: L1(1) → L2(3) → L3(10) → L4(30) → RAM access(100) → Rack Send2Kb(100) → RAM Read1Mb(5,000) → SSD Read1Mb(15,000) → Datacenter Roundtrip(500,000) → HDD Read1Mb(1,000,000) → USB SSD Read1Mb(2,000,000) → HDD Seek(3,000,000) → USB HDD Read1Mb
2 http://blog.tsunanet.net/2010/11/how-long-does-it-take-to-make-context.html
Figure 11.1: C program fragment for iterating through the values of the two-dimensional array.
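Figure 11.1 itself is not reproduced in this text; based on the description that follows (a two-dimensional array of 10,000×10,000 integers summed row by row), a plausible reconstruction is sketched below. The names sum_rows, sum_cols, n, and a are hypothetical.

```c
/* Row-major traversal of an n-by-n array (n is 10,000 in the text).
 * a[i][j] visits memory contiguously, so locality-based prefetching
 * keeps the L1 cache warm. */
long sum_rows(int n, int a[n][n]) {
    long sum = 0;
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            sum += a[i][j];          /* the access on "line 3" */
    return sum;
}

/* The "jumbled indexes" variant: same result, poor locality, since
 * each access strides a full row of n integers through memory. */
long sum_cols(int n, int a[n][n]) {
    long sum = 0;
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            sum += a[j][i];
    return sum;
}
```

Calling sum_cols instead of sum_rows produces the same total but strides across rows, which is the cache-miss behavior the text attributes to jumbling the indexes in line 3.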
We can draw a few conclusions from these latencies. The L1–L4 caches are very valuable; however, programmers are usually forbidden from using them directly, and instead the OS and hardware load caches using algorithms often based on data locality. That is, if an instruction accesses a byte in RAM, it is likely that contiguous bytes will be accessed immediately after this instruction. Consider the example of a C program fragment in Figure 11.1. The representation of the two-dimensional array is contiguous in RAM, where the first row of 10,000 integers is laid out, followed by the second row of 10,000 integers, and so on. Cache-locality algorithms will preload a few thousand contiguously laid out numbers into L1, and the execution of this program will take, say, 100ms depending on the CPU. Now, suppose that the indexes are jumbled up in line 3. Even though the programs are semantically equivalent, the execution time will increase more than tenfold because of frequent cache misses, since accessing columns means non-sequential accesses of the contiguous memory, where the next data item may not be loaded in the L1 cache. The compiler GCC offers the builtin __builtin_prefetch; however, programmers are discouraged from using it, since it is very difficult to determine how to optimize the code by loading some of its data into caches, especially when executing in a multithreaded environment. Nevertheless, structuring data to preserve locality is good advice for improving performance by reducing cache misses.
Question 4: Would it make sense to build a compiler that analyzes all com-
ponents of the application from the performance point of view using the values of
latencies and optimize the allocation of the components to different VMs in the
cloud environment? Please elaborate your answer.
Next, it is much faster to load all data directly into RAM and then perform computations on the data, since seek and read operations on HDDs with moving arms are very expensive even compared with the corresponding SSD read and write operations. However, RAM is quite expensive, and many commodity computers have limitations on how much RAM can be installed. For example, as of 2017, one terabyte of RAM costs approximately $5,000, which is an expensive proposition for a data center that hosts tens of thousands of computers. Moreover, big data sets are often measured in hundreds of terabytes, which precludes loading all data into RAM. Interested readers can read more about computer memory and caches elsewhere [46, 70].
Next comes permanent storage such as HDDs and SSDs. HDDs have large capacities and they are cheap, but they are also slow. SSDs have lower capacity and
higher cost than HDDs, but they are much faster in terms of the seek time and the read
and write operations. They are also prone to failure – various statistics exist for different
brands of drives showing the annualized failure rate up to 35%3 . A study done in 2007
shows that HDD replacement rates exceed 13% [133], which concurs with the Google
study that shows that at the extreme every tenth HDD in the datacenter can fail once a
year [121]. The authors of the study also concluded the following: “In the lower and
middle temperature ranges, higher temperatures are not associated with higher failure
rates. This is a fairly surprising result, which could indicate that data center or server
designers have more freedom than previously thought when setting operating temper-
atures for equipment that contains disk drives.”
Since it is highly likely that an HDD can fail in a datacenter, to prevent data loss,
data should be replicated across multiple independent computers. One implementation
of this solution is to use a Redundant Array of Inexpensive Disks (RAID), which we
will review in Section 11.4. The other is to use custom solutions for creating multiple replicas of data, which is expensive, and these replicas must be kept synchronized. Clients
will enjoy replication transparency and failure transparency, since with the correctly
implemented solutions, they will see only one failure-free data object. Consider the
replication and failure transparencies in the map/reduce model, where the computation
and the data are discarded on the failed computer, the shard is replicated to some other
healthy computer, and the worker program is restarted there. Deciding how much data to load into RAM from a hard drive, and at what point in program execution, is one of the most difficult problems of performance engineering.
Making a remote call in the cloud infrastructure means that network latency is added to the cache, RAM, or disk latency. The network latency is smaller if remote calls
are made between servers that are mounted on the same rack, which is a sturdy steel-
based framework that contains multiple mounting slots (i.e., bays) that hold blade or
some other rack-mounted servers that are secured in the rack with screws or clamps.
Blade servers designate a server architecture that houses multiple server modules (i.e.,
blades) in a single chassis. Each rack may house its own power supply and a cooling
mechanism as part of its management systems, which include a network router and a
storage unit. Hyperconverged infrastructures contain server blades that include computational units (e.g., the CPU, the GPU), local storage, and hypervisors, which can
be controlled by issuing commands using a console installed on the rack for managing
all of its resources. Due to the locality of all servers on a single rack and a router or
a switch that connects these servers in a close-proximity network, remote calls among
these rack-based servers incur smaller latency when compared to datacenter-wide calls
or remote calls among servers that are located in different datacenters. We discuss
networking infrastructure in datacenters below.
3 https://www.backblaze.com/blog/hard-drive-failure-rates-q1-2017
11.3 Cloud Computing with Graphics Processing Units
A high-level view of the architecture of the GPU is shown in Figure 11.2 with
the on-chip GPU-portion of the architecture encompassed by the dashed line. Arrows
represent data transfers between the components of the architecture. We describe this
architecture from a viewpoint of offloading a unit of computation from the CPU to the
GPU, so that we understand the latencies of communications between the CPU and the
GPU as well as the benefits of parallel computing offered by the GPU.
Once this computation is invoked at the CPU, a request is made to transfer the data on which it operates, e.g., data to be encrypted or decrypted, from the CPU memory to the GPU. A specially dedicated hardware card (e.g., a PCI Express card6) reads data from the CPU memory in blocks of some predefined size and transfers this data to the read-only memory of the GPU (also called a texture). The texture is designed as a two-dimensional array of data, and it can be thought of as a representation of the screen, with each element of the memory mapped to some pixel of the screen.
The GPU comprises many processors P1, ..., Pn, and these processors invoke functions (also called fragments) on the input data stored in the read-only texture.
6 http://en.wikipedia.org/wiki/PCI_Express
Modern GPUs have dozens of processors. Since the GPU is a type of SIMD architecture, all processors execute the same fragment at a given time. Programmers can control the GPU by creating fragments and supplying the input data. However, programmers cannot control the scheduling mechanism of the GPU, and therefore they often do not know how data items are scheduled into different pipelines for the processors. It is assumed that the computation is split among different processors, which execute the fragment in parallel and independently of each other, achieving high efficiency in processing the input data.
The goal of achieving high parallelism is the reason for having separate read-only and write-only textures. If different processors could read from and write to the same texture simultaneously, the GPU would need a lot of additional logic to prevent reading from a previously modified position in the texture memory, in order to prevent errors. Having this logic as part of the GPU architecture would worsen its performance; the downside of having two separate textures, however, is that it is more difficult to design programs for the GPU.

Figure 11.2: A high-level view of the GPU architecture and interactions with the CPU.
When the data is stored in the read-only texture, the call draw is invoked to execute fragments on the GPU. This call is analogous to the call exec, which creates a process in Unix, and it has a high overhead [138]. The call draw loads and initializes data and fragments and passes control to the processors of the GPU to execute the instructions of the fragments. In general, it is better to execute fewer draw calls on larger data blocks to improve the performance of the GPU [137]. Of course, the performance of the GPU also depends on the extent to which instructions and data are independent of one another.
Once the processors have executed the fragments on all input data in parallel, they write the results into the frame buffer. The data from the frame buffer can be displayed on the screen, or it can be passed back to the CPU. To do that, the data is written to the write-only texture and then transferred to the CPU memory using a hardware card. This card may be the same as the card used to transfer data from the CPU memory to the GPU texture, if it supports bidirectional transfer. In general, transferring data between the CPU and the GPU is fast and scalable: increasing the size of the data blocks leads to a linear increase in the transfer time. The results of an older experiment with a GeForce 8800 GTX/PCI/SSE2 show that the data transfer rate is about 4Gb/s [85], which is more than ten times faster than many types of internal hard drives. However, a certain CPU latency is involved when initiating the transfer of a block of data [120].
The GPU programming model is well known and documented. Programmers can use standard libraries (e.g., OpenGL, DirectX, or CUDA) to write applications for the
GPU. These libraries hide hardware-specific details from programmers, thereby improving the design and maintenance of GPU-bound programs. Since the GPU inherently supports parallel computations, many parallelizable, computationally intensive algorithms are good candidates to run on GPU clouds.
11.4 RAID Architectures

RAID is classified into six levels, plus a number of hybrid combinations of these levels, to achieve various guarantees of performance and data protection.
RAID 0 : the data on a hard drive is segmented and the segments are distributed across multiple hard drives in the RAID using a round-robin algorithm, where each data segment is assigned to a hard drive in circular order. This technique is called data striping, and its biggest advantage for RAID is increased speed of data access, since multiple segments can be written to and read from multiple drives in parallel. However, a failure of one drive results in losing the data segments stored on that drive, which means that RAID 0 cannot be used to prevent data loss.
RAID 1 : the data is mirrored on multiple hard drives, so the data is not lost as long as at least one drive remains healthy when the others fail. However, writes are somewhat slower because data must be mirrored to all drives, whereas the speed of reading data is not affected and is the same as that of RAID 0.
RAID 10 : contains at least four hard drives organized in pairs; data is striped across the pairs and mirrored within each pair.
RAID 2 is not currently used in any major system for a number of reasons, including the complexity of the error correction code (ECC) and the need for all hard drives to spin in perfect synchrony, also called lockstep, when all operations occur on multiple drives in parallel at the same time. The data is striped at the bit level and distributed across hard drives, with a parity code for each segment stored on separate hard drives, which makes it inefficient. The data transfer rates are roughly proportional to the number of the ECC hard drives, which means that the cost of RAID 2 increases as the transfer rates increase. A simplified version of a parity code adds the bit one when the number of one-bits in the segment is even and the bit zero when the number of one-bits is odd. RAID 2 uses Hamming ECC, which is computationally intensive and which we will not review here. Interested readers can learn more about ECC from various textbooks on this subject [75, 101].
To understand at a very high level how ECCs work in RAID, consider the logical XOR operation, which we designate here with the symbol ⊕. Two equations describe the meaning of XOR: x ⊕ 0 = x and x ⊕ 1 = ¬x, where x is a variable that represents a bit. From these equations it follows that x ⊕ x = 0. Now, suppose that there are two segments of bits that we designate with the variables a and b, we store these segments on two separate hard drives in RAID 2, and the value ecc = a ⊕ b is stored on a third hard drive. Without loss of generality, suppose that the drive holding segment a fails. After replacing it with a new hard drive, the value of a is restored by computing ecc ⊕ b ≡ a ⊕ b ⊕ b ≡ a ⊕ 0 ≡ a.
RAID 3 is also based on data striping, frequently done at the byte level, and it computes a parity code that is stored on a separate hard drive. Hard drives that store data should also be synchronized, and their number must be a power of two due to the use of the parity code for error correction. Since all stripe writes are done simultaneously on multiple drives, the write performance should be high, but unfortunately it is limited by the need to compute the parity code.
RAID 4 is similar to RAID 3, with the difference that data is divided into blocks of 16Kb to 128Kb and the parity code is written to a single separate hard drive, thus creating contention during concurrent writes.
RAID 5 is similar to RAID 4, with the difference that parity codes are written to multiple hard drives, called distributed parity. As a result, write performance is increased.
RAID 50 : the data is striped across groups of hard drives that are organized in the RAID 5 architecture, thus significantly improving write performance and adding fault tolerance to RAID 0.
RAID 53 or RAID 03 : the data is striped across groups of hard drives that use ECCs
instead of mirroring for fault tolerance.
RAID 6 is similar to RAID 5 with the difference of adding another parity code block.
RAID 60 : the data is striped across groups of hard drives with distributed two-parity
ECCs used for fault tolerance.
RAID 100 is a striped RAID 10 with a minimum of eight hard drives.
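Two of the mechanisms above, the round-robin striping of RAID 0 and the XOR parity recovery used in RAID 2, can be sketched in a few lines of C. The helper names segment_drive, parity, and rebuild are hypothetical and do not come from any RAID controller API.

```c
/* RAID 0 striping: segment i goes to drive i mod n_drives, so
 * consecutive segments can be read/written in parallel. */
int segment_drive(int segment, int n_drives) {
    return segment % n_drives;
}

/* RAID 2-style parity: ecc = a XOR b is kept on a separate drive.
 * If the drive holding a fails, a is rebuilt as ecc XOR b, because
 * a ^ b ^ b == a ^ 0 == a (byte-sized "segments" for illustration). */
unsigned char parity(unsigned char a, unsigned char b) {
    return a ^ b;
}

unsigned char rebuild(unsigned char ecc, unsigned char surviving) {
    return ecc ^ surviving;
}
```

For example, with four drives the sixth segment (index 5) lands on drive 1, and rebuild(parity(a, b), b) recovers a for any byte values a and b, which is exactly the XOR derivation given for RAID 2.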
RAID architectures are more expensive, since they require more hard disks and a more elaborate design of disk controllers. In addition, the performance versus cost tradeoffs are complicated, since they depend on how applications use I/O – an application that frequently reads data, computes results, and infrequently writes small amounts of data needs a different type of RAID than an application that frequently writes large data chunks. However, in many cases the choice depends on the cost of computation – in the map/reduce model a single computation on a shard can be repeated in case of a hard drive failure, thus obviating the need for a RAID architecture. Purchasing commodity computers in a datacenter with a specific RAID architecture will put certain applications at a disadvantage with respect to performance; purchasing computers with different RAID architectures will increase the level of heterogeneity, forcing the hypervisors to decide on which specific computer with a matching RAID architecture an application should be deployed. Often, commodity computers are purchased without any RAID configuration; only dedicated computers may host one, if customers specifically request it.
The need for a Massive Array of Idle Disks (MAID) comes from the requirements for storing and accessing big data – many companies and organizations need to store petabytes and exabytes of data nowadays, and in the next decade they will break the size barrier to store and process zettabytes of data. Magnetic tapes are cheap and can store up to 200Tb of data on a single tape7, whereas the largest hard drive capacity is approximately four times smaller. Also, given the number of hard drives needed to provide the same capacity as a single magnetic tape, it costs up to ten times more to power the RAID-based hard disks than the magnetic tape. On the other hand, accessing data on a magnetic tape is sequential and very slow. MAID architectures therefore enable users to have both the storage capacity and high speed of accessing the data.
Unlike RAID architectures, the main goal of MAID is to reduce energy consumption while maximizing the amount of available storage and guaranteeing good performance.
7 https://www.engadget.com/2014/04/30/sony-185tb-data-tape
11.6 Networking Servers in Datacenters
Question 9: Recall the OSI seven layer model. Map the infrastructure com-
ponents in this section to the layers of the OSI model.
Recall that servers are attached to racks that are connected using a top-level switch,
which forwards messages between servers on the rack and to servers on the other racks
in the datacenter. According to different sources, a micro-datacenter houses less than
500 racks, whereas a standard size datacenter can house from 30,000 to more than
100,000 servers in up to 10,000 racks8. Google operates more than a dozen data centers around
8 http://www.netcraftsmen.com/data-center-of-the-future-how-many-servers
the world, consuming approximately 0.01% of the total power capacity on the planet,
and as of 2013 both Google and Microsoft each own over 1,000,000 servers9 . Racks
contain server sleds, which plug into a networking board in the rack shelf, and with
sleds, some rack components can be upgraded without major disruption to other com-
ponents. Since multiple server sleds share network connections within the same rack,
shorter cables and fewer NICs are used, thus reducing cost and complexity. Organizing these servers and racks of them into an efficiently networked cloud computing facility is a fundamental problem of cloud computing.
At a physical connection level, a network interface device (e.g., NIC or mother-
board LAN) connects servers to the data center network via an Ethernet rack-based
switch. A virtual switch (vSwitch) is often used to connect servers on the rack10; it is a shared-memory switch, since the data often stays in the same shared memory on the servers, and pointers to the local address spaces of the servers are passed between VMs with subsequent translation by the hypervisors. Since many multitenant VMs often share the same physical server and hypervisor, a vSwitch enables them to bypass network communications, especially if the output of one VM is the input to the next VM, i.e., they are pipelined.
Question 10: Suppose that you can include the knowledge about the servers
that run on the same rack into the hypervisors that run on these servers. Explain
how the hypervisors can use this information to improve the performance of ap-
plications that are executed on these servers.
Two popular datacenter designs are called top of rack (ToR) and end of row (EoR) [89]. In the ToR design, servers within the rack are connected with short copper RJ45 patch cables plugged into the Ethernet switch that links the rack to the datacenter network, with a fiber cable running to aggregation Ethernet switches placed in some designated area of the datacenter. With every server connected by a dedicated link to the ToR switch, data is forwarded to other servers in the rack, or out of the rack through high-bandwidth uplink ports. Since thousands of VMs can run on a single rack that can host 50+ servers, and they can share the same network connection, 10Gb Ethernet links the servers with the ToR switch. Thus, the ToR switch is also a gateway between the servers and the rest of the datacenter network. Since the ToR switch intercepts all packets coming to or going out of the rack, it can tunnel and filter messages by inspecting packet headers and matching various header fields using custom-programmed rules. Doing so reduces the complexity of administering the network at the datacenter scale.
Alternatively, in the EoR design, racks with servers, usually a dozen or so, are lined up side by side in a row, with category 6 twisted-pair copper cables connecting them via patch panels on the top of each rack. Each server is connected by a short RJ45 copper patch cable to the corresponding patch panel. Bundles of copper cables from each rack are laid out either over the racks or under the floor. Unlike the ToR design, where each rack has its own switch, an EoR switch connects multiple racks with hundreds of
9 http://www.ieee802.org/3/bs/public/14_05/ghiasi_3bs_01b_0514.pdf
10 http://openvswitch.org
servers, thus simplifying updates for the entire row at once. The ToR design extends the L2 network, where media access control (MAC) addresses are used for message delivery in a spanning-tree network layout, whereas the EoR design extends the L1 cabling layer with fewer network nodes in the architecture of the datacenter.
Distance plays a significant role not only in the organization of the network design in a datacenter, but also in deploying distributed objects in cloud-based applications. Consider that sending a message over a fiber optic cable between continents takes less than 20,000,000 nanoseconds, a bit more with cable Internet, ≈30,000,000 nanoseconds, and a whopping 650,000,000 nanoseconds over a satellite link11. A key limitation is the speed of light, which travels in a fiber optic cable at about 70% of its speed in a vacuum. Receiving 10,000 bits over 1Gb Ethernet takes 10,000 nanoseconds, whereas over a T1 link of roughly 1.5Mb per second the same transfer takes approximately 6,477,000 nanoseconds. The rule of thumb is that the propagation delay in cables is about 4,760 nanoseconds per kilometer. On top of that, there are inherent router and switch latencies, which is the time to shift packets from input to output ports. The constituents of these latencies include converting signals to bits and packets, buffering data, performing lookups to determine output ports based on the packet's destination address (using IP at L3 for a router or MAC addresses at L2 for a switch), and forwarding the packets out of the port by converting the bits back to electrical signals. Saturated network links and network devices such as routers and switches introduce more latency, because the devices may wait thousands of nanoseconds before they place a message onto their output ports.
The formula for the overall latency of sending a message over the network is
LNet = LSend + LRec + ∑netdev (texec + tforward + tqueue) + ∑paths tsend, where LNet is
the total latency of the message, LSend and LRec are the latencies of sending and
receiving the message at the two endpoints, and ∑netdev is the sum of latencies at each
network device (e.g., switches and routers) along the sending path, where each device
spends some time, texec, executing code that inspects a message and takes some actions,
some time, tforward, forwarding the message to the next device, and some time, tqueue,
to (de)queue the message if the processing capacity of the device is fully utilized. In
addition, the latencies, tsend, for transmitting messages as electrical signals across the
paths between devices are summed up in ∑paths and added to the total latency [81].
Various measurement methodologies exist for evaluating the overall latencies of
different paths within and between datacenters to achieve the desired level of SLA.
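As a back-of-the-envelope illustration, the latency formula above can be sketched in a few lines of Python. The device counts and per-device latencies below are hypothetical values chosen for the example, not measurements:

```python
# Sketch of L_Net = L_Send + L_Rec
#   + sum over network devices of (t_exec + t_forward + t_queue)
#   + sum over cable paths of the propagation delay.

PROPAGATION_NS_PER_KM = 4760  # rule-of-thumb fiber delay from the text

def network_latency_ns(l_send, l_rec, devices, path_lengths_km):
    """devices: list of (t_exec, t_forward, t_queue) tuples in nanoseconds;
    path_lengths_km: cable lengths between consecutive devices."""
    device_ns = sum(t_exec + t_fwd + t_q for t_exec, t_fwd, t_q in devices)
    path_ns = sum(km * PROPAGATION_NS_PER_KM for km in path_lengths_km)
    return l_send + l_rec + device_ns + path_ns

# Two switches and a router between the endpoints, 3 km of cable in total;
# the router is the slowest device and queues packets for 2,000 ns.
latency = network_latency_ns(
    l_send=2_000, l_rec=2_000,
    devices=[(500, 300, 0), (500, 300, 0), (1_200, 800, 2_000)],
    path_lengths_km=[1.0, 1.5, 0.5])
assert latency == 23_880.0
```

Even in this toy setting, propagation over 3 km of cable (14,280 ns) dominates the per-device processing, which matches the text's emphasis on distance.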
Adding significant complexity to the network organization of the datacenter results
in various unpredictable situations, one of which is called a network meltdown, when
the network latency grows without bound due to excessive traffic. A typical starting
point of a network meltdown is a broadcast storm, when a broadcast message, often
multicast and not necessarily maliciously crafted, results in a large number of response
messages, and each response message leads to more response messages, described as a
snowball effect. A particular network meltdown is described based on the configuration
of a network where two high-speed backbones are connected to different switches to
which racks are connected [119, 74–76]. The problem was with the configuration of switches
11 https://www.lifewire.com/lag-on-computer-networks-and-online-817370
that send and analyze bridge protocol data unit (BPDU) messages within a LAN
that uses a spanning-tree (i.e., loop-free) protocol topology. BPDU packets contain
information on ports, addresses, priorities, and costs, and they are used to “teach”
switches about the network topology. If a BPDU packet is not received within a
predefined time interval (e.g., 10 seconds), a switch sends a BPDU packet and goes
into a learning mode, in which it forwards received packets. When other switches
become busy and the learning-mode switch forwards packets, the network may exhibit
looping behavior, with messages circulating among switches and multiplying over time,
resulting in full network saturation and its eventual meltdown. As a result, cloud
computing services come to a complete halt, and recovering a datacenter to the working
state may take hours or even days.
There is also the cost dimension to cloud computing, since not only are computing
resources provisioned to customers on demand, but customers also pay for the
provisioned computing resources. Adding tens of thousands of network devices to
datacenters adds tens of millions of dollars to their cost, which is passed on to the
customers who pay for their provisioned resources. Given the high cost of networking
equipment that contains many features cloud providers do not need in their datacenters,
it is no wonder that cloud providers create their own customized networking
devices rather than buying them off-the-shelf 12 . The Open Compute Project (OCP)
is an organization created by Facebook to develop hardware solutions optimized for
datacenters 13 . Whereas OCP advocates a solution where full racks are assembled from
standalone components, rack-scale architecture is an open-source project started by
Intel that advocates production of completely preconfigured and fully tested rack
solutions that can be replaced as whole units on demand 14 . At this point, it is too
early to say which solution provides the best deployment from both the performance
and the economic points of view.
11.7 Summary
In this chapter, we discussed the cloud infrastructure and how it affects the availability
and reliability of cloud-based applications. We reviewed key characteristics of dis-
tributed applications and discussed various latencies in-depth. Next, we considered
RAID and MAID storage organizations, reviewed the architecture of the GPU, and
analyzed the networking infrastructure of a cloud datacenter. We showed that choosing
the right infrastructure and organization of computing units in a datacenter affects
the performance and the cost of cloud-based applications.
12 https://www.geekwire.com/2017/amazon-web-services-secret-weapon-custom-made-hardware-networ
13 http://www.opencompute.org/about
14 https://www.intel.com/content/dam/www/public/us/en/documents/guides/
platform-hardware-design-guide.pdf
Chapter 12
Load Balancing
Deciding what objects to host on what servers in a datacenter is difficult. Clients submit
requests to distributed objects hosted on cloud servers, and these servers provide a
quantifiable computing capacity to process these requests, which is called a workload.
Computational tasks that are submitted by cloud customers result in creating
workloads. We will use the terms computational task and workload interchangeably in
this chapter to designate the work that servers must accomplish.
In general, the term workload includes not only the static part of the input to the
application (i.e., specific methods with the combination of values for their input param-
eters and configuration options), but also the dynamic part that comprises the number
of requests that contain remote methods with input values submitted to the application
per time unit and how this number changes as a function of time [99]. For example, a
workload can specify that the number of clients’ requests to a distributed object
fluctuates periodically according to the following sinusoidal function: yi = α × sin t + β,
where α is the number of method invocations in the workload, β is a constant shift, and
t is the time. When workloads are described by a neat pattern like a sinusoid wave, they are
easy to predict and subsequently, it is easy to proactively (de)provision resources based
on the descriptions of these workloads. Unfortunately, the reality of cloud computing
workloads is much more complicated.
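The sinusoidal workload above can be sketched in Python. The amplitude and shift below are hypothetical values; real workload models would be fit to observed request traces:

```python
import math

def sinusoidal_workload(t, alpha, beta):
    """Requests per time unit at time t: y = alpha * sin(t) + beta,
    where alpha is the amplitude in method invocations and beta is a
    constant shift that keeps the rate non-negative when beta >= alpha."""
    return alpha * math.sin(t) + beta

# Sample roughly one 2*pi period; the numbers are hypothetical.
samples = [sinusoidal_workload(t / 10.0, alpha=100, beta=150)
           for t in range(63)]
assert min(samples) >= 0      # beta >= alpha keeps the rate non-negative
assert max(samples) <= 250    # peak rate is alpha + beta
```

A proactive provisioner could read future values straight off such a curve; the point of the discussion that follows is that real cloud workloads rarely admit such a neat closed form.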
Recall that cloud workloads are often characterized by fast fluctuations and bursti-
ness, where the former designates a fast irregular growth and then a decline in the
number of requests over a short period of time, and the latter means that many in-
puts occur together in bursts separated by lulls in which they do not occur [115]. If a
workload changes rapidly, a resource that was being provisioned before the change to
maintain the desired performance of the application may no longer be needed by the
time it is initialized. Finding how to distribute workloads
to objects to maximize the performance and to handle these workloads efficiently is
one of the primary goals of cloud computing.
12.1 Types of Load Balancing
There are two main types of balancing the load between servers: static, where prior
information about the processing power, memory, and other parameters of the servers
is used, and dynamic, where information about the servers and their environments is
obtained at runtime. These types of load balancing are realized in several models
for which load balancing algorithms are created.
Centralized model centers on a controller that keeps information about all servers.
Using this model makes sense for a smaller private cloud where all servers
are documented and few tasks with predefined performance requirements and
resource demands need to be distributed across these servers.
Distributed models are more suitable for unpredictably changing cloud computing
environments. Opposite to the centralized model, in the distributed model each
computing node makes load balancing decisions autonomously, sometimes in
cooperation with some other computing nodes. If a computing node has a high
workload, it may initiate requests to other nodes to share a part of this workload,
i.e., a sender-initiated load balancing, whereas if it has a low workload, it may
ask other nodes to offload some tasks to it, i.e., receiver-initiated load balancing.
Delay-free model requires that tasks be assigned to servers as soon as they are launched by
customers, whereas in the
Batch model, tasks are grouped depending on some predefined criteria, e.g., deep-
learning tasks may be grouped to be sent to the GPU-based servers.
Single model addresses load balancing of a task that is independent of all other tasks,
i.e., its execution is not synchronized with the executions of other tasks; opposite
to it, the
Collective model accounts for dependencies among tasks. A pipe-and-filter task orga-
nization is a good example, where the execution of some task can proceed only
if the execution of the previous tasks in the pipeline is finished. For example, a
reducer worker can be scheduled only after some mappers complete their jobs in
the map/reduce model.
These goals translate into three main criteria for evaluating a load balancing
algorithm: its complexity, termination, and stability. A complex algorithm is difficult
to analyze, and it may take a long time to compute the distribution of tasks to different
servers. Executing a load balancing algorithm is not free, since its execution takes time
and resources. Its overhead should be very small, preferably less than one tenth of a
percent of the total execution time of the server and its resources, and within this
budget it should make scheduling decisions that are better than a random assignment.
12.2 Selected Algorithms for Load Balancing
Question 3: How does the network latency affect the precision of a scheduling
algorithm? Discuss how scheduling algorithms can be improved with network
latencies taken into consideration when computing an optimal schedule.
We start off by listing non-systematic algorithms and techniques that do not take
into consideration the semantics of tasks and existing workloads of servers. A key
element of these algorithms is that if they are repeated for the same tasks, they are
likely to assign them differently to the servers.
Random algorithm is a static algorithm that is the fastest and simplest to implement.
Essentially, it consists of a random number generator that produces a value between
zero and the total number of servers in the datacenter. Each server is assigned a
unique integer between zero and some maximum value. When a task arrives, the
random generator produces a number that designates the server to which this task is
assigned. The random algorithm is the fastest, since it contains a single step of
generating a random integer. Since random numbers are distributed close to uniformly
and rarely repeat, it is highly likely that the distribution of tasks across multiple
servers will result in a close-to-equal distribution of workloads. The same idea is
applied to selecting a task for execution on a given server. However, random algorithms
do not use any information about the servers’ capacities, the approximate durations of
task executions, and the existing workloads on servers.
It is possible that multiple tasks are assigned to the same servers while other
servers have no tasks to execute. Often, randomization is used in conjunction
with other heuristics for distributing tasks to servers.
Round robin scheduling algorithms also belong to the class of static load balancing
algorithms: each task on a server is given an equal interval of time and an equal
share of all resources in sequence. Dynamic round robin algorithms are based on the
idea of collecting and using runtime information about tasks, servers, and the network
to compute weights as real numbers between zero and one, and to multiply the time
intervals for each task by these weights. Essentially, an implementation of a
round-robin algorithm is a loop in which each task is executed until its time interval
expires, and then another task from the list is selected for the next loop iteration.
The main benefits of round-robin algorithms are that they are fast and simple to
reason about and to implement; however, as with random algorithms, little information
about the tasks and the servers is used to make decisions.
Min-min and max-min algorithms take as input the tasks sorted by their completion
times. In both algorithms, a status table for executing tasks is maintained.
The idea of the min-min algorithm is to give the highest priority to the tasks
with the shortest completion time, whereas the max-min algorithm first selects the
tasks with the maximum completion time. Once a task is assigned, its status
is updated in the table. In the min-min algorithm, longer-running tasks wait
longer to be scheduled, since the priority is given to shorter tasks. Conversely, in
the max-min algorithm, the longer tasks are scheduled before the shorter tasks. Each
algorithm has its benefits; however, a common drawback is that the discriminated
tasks may wait for a long time, resulting in a serious imbalance.
FIFO load balancing technique schedules tasks for execution on a first come, first
served basis. It is a fast and simple technique that suffers from the same drawbacks
as random algorithms.
Hashing algorithms are diverse, but the main idea is the same – using some attribute
of a task (e.g., the name of the program that is executed for a task), its hash value
is computed, and the task is assigned to the server whose index equals Ht mod |S|,
where Ht is the hash code, |S| is the number of servers, and mod is the modulo
operation. In a way, using a hash function is similar to a random assignment, since
using the name of a program or the IP address of the server does not amount to using
any knowledge about the behaviour of the task or the optimality of assigning this
task to a given server.
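The non-systematic strategies above can be condensed into a short Python sketch. The number of servers and the task completion times below are hypothetical; a real scheduler would, as the text notes, deliberately ignore server capacities and current workloads in all four cases:

```python
import hashlib
import itertools
import random

SERVERS = 4  # hypothetical datacenter size

def random_assign(task, rng=random.Random(42)):
    """Random: pick a server index uniformly at random."""
    return rng.randrange(SERVERS)

_rr = itertools.count()
def round_robin_assign(task):
    """Round robin: cycle through the servers in sequence."""
    return next(_rr) % SERVERS

def min_min_order(completion_times):
    """Min-min: give priority to the shortest tasks; reversing this
    order yields the max-min schedule."""
    return sorted(completion_times, key=completion_times.get)

def hash_assign(task_name):
    """Hashing: server index = H_t mod |S| for a stable hash H_t of a
    task attribute such as the program name."""
    digest = hashlib.sha256(task_name.encode()).digest()
    return int.from_bytes(digest[:8], "big") % SERVERS

assert 0 <= random_assign("t1") < SERVERS
assert [round_robin_assign(t) for t in "abcde"] == [0, 1, 2, 3, 0]
assert min_min_order({"t1": 30, "t2": 5, "t3": 120}) == ["t2", "t1", "t3"]
assert hash_assign("wordcount") == hash_assign("wordcount")  # deterministic
```

Note that a SHA-256-based hash is used instead of Python's built-in `hash`, which is randomized between interpreter runs and would defeat the point of a stable task-to-server mapping.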
Diffusion load balancing algorithm (DLBA) is a fundamental static systematic
algorithm [40]. In DLBA, the network is viewed as a hypercube, a graph with n nodes and
D = log2 n dimensions, where each node has exactly D edges. For example, a hypercube
with one node has dimension zero; with two nodes connected by an edge, the dimension
of the graph is one; and with four nodes connected by four edges to form a square, the
dimension is two, and so on. Each node is labeled with its
order number and a constraint is that nodes are connected with an edge if their order
numbers differ only in one bit. The number of bits in the order number is equal to the
dimension with high-order padding bits set to zero. One can think of nodes as servers
and edges as network links in a datacenter.
A key step of DLBA is that at some time t, two neighboring nodes u and v in
a hypercube exchange a fraction of the difference in their workloads,
|w_u^t − w_v^t| / (D + 1), that is inversely proportional to the dimension of the
hypercube plus one. Only the information about the previous time step is remembered,
i.e., w_u^t = w_u^{t−1} + ∑_{v∈N} (w_v^{t−1} − w_u^{t−1}) / (D + 1), where N is the
set of all neighboring nodes of the node u. One can visualize the process as the
workloads diffusing from some nodes to other nodes in the hypercube, one step at a
time. Discussing DLBA in depth is beyond the scope of this book, and interested
readers can study the original paper that discusses constraints for achieving
stability, as well as the large number of subsequent diffusion algorithms published
in the past three decades.
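A minimal simulation of diffusion on a hypercube illustrates the idea. The initial workload values below are hypothetical; the neighbor rule encodes the constraint that connected nodes differ in exactly one bit of their order numbers:

```python
def hypercube_neighbors(node, dim):
    """Neighbors of a node differ in exactly one bit of the order number."""
    return [node ^ (1 << bit) for bit in range(dim)]

def diffuse_step(workloads, dim):
    """One DLBA step: each node uses only the previous step's values and
    exchanges 1/(D+1) of the workload difference with every neighbor."""
    new = list(workloads)
    for u in range(len(workloads)):
        for v in hypercube_neighbors(u, dim):
            new[u] += (workloads[v] - workloads[u]) / (dim + 1)
    return new

# A dimension-2 hypercube (a square) with all 12 units of load on node 0.
loads = [12.0, 0.0, 0.0, 0.0]
for _ in range(20):
    loads = diffuse_step(loads, dim=2)
# The total load is conserved and diffuses toward 3.0 on every node.
```

Because each exchange is antisymmetric between a pair of neighbors, the total workload is conserved at every step while the imbalance shrinks geometrically.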
the servers. In this definition, we do not consider the work of moving data across
the network to servers, which may be considered proportional to the network latency.
However, not only should workloads for specific tasks be scheduled for objects assigned
to different servers, but network traffic should also be taken into account when
assigning tasks to run in VMs on servers.
Question 5: How does the size of the data that the distributed objects need
to access affect the scheduling decisions for distributing workloads across these
distributed objects?
A key idea of the algorithm Oktopus is twofold: first, each server has a predefined
number of slots to run one VM in each slot, and each request contains not only the
number of VMs, N, but also the required bandwidth, B, for each VM. Second, if N
communicating VMs are assigned to two servers connected with a link, where one server
hosts m VMs and the other server hosts N − m VMs, then the maximum bandwidth
required between these servers is min(m × B, (N − m) × B). Indeed, the amount of
data sent both ways is limited by the smaller bandwidth producer or consumer. Using
these two ideas, the algorithm greedily computes the assignments of tasks to VMs, so
that the tasks are distributed across available servers to satisfy the bandwidth
constraints in the tree network topology of the datacenter.
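The bandwidth bound at the heart of Oktopus fits in a one-line helper. This is an illustrative sketch of the formula only, not the allocation algorithm from the paper; the VM counts and per-VM bandwidth are hypothetical:

```python
def required_bandwidth(m, n, b):
    """Bandwidth a link must carry when N communicating VMs, each with
    per-VM bandwidth B, are split m on one side and N - m on the other:
    min(m * B, (N - m) * B)."""
    return min(m * b, (n - m) * b)

# Splitting 10 VMs of 100 Mbps each as 3 + 7: the smaller group of
# 3 VMs caps the traffic the link between the two servers must carry.
assert required_bandwidth(3, 10, 100) == 300
# An even 5 + 5 split is the worst case for the link.
assert required_bandwidth(5, 10, 100) == 500
```

The bound explains why the greedy allocation prefers to pack communicating VMs into the same subtree: the fewer VMs that end up on the far side of a link, the less bandwidth that link must reserve.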
Algorithm Oktopus is shown in Figure 1 in lines 1–17, and the helper function,
Allocate, is given in lines 18–29. Given the number of empty slots on a server, E,
the number of VMs, M, that can be allocated to a server, L, with the remaining link
capacity, CL, is computed in line 6: the number of machines multiplied by their
bandwidth should be less than or equal to the remaining bandwidth, CL. The algorithm
starts with a physical server, L, and it calculates the number of VMs, M, that can be
assigned to the open slots, E, of this server. Then the function Allocate is called, and
it returns in the base case where it is applied to the server in line 19. However, if
the level is higher, i.e., it is a switch, then for each subtree, t, of the switch in
the loop in lines 23–27, the number of VMs allocated to it, m, can be the total number,
Mt, if enough slots are available, leaving the residual number of VMs to allocate,
m − count, as well as the residual bandwidth.
The algorithm minimizes the message traffic at the top nodes in the tree, i.e., the
traffic should be kept mostly within racks and minimized across the datacenter. The
choice is given to subtrees located at the same level of the hierarchy, so that the
traffic is directed using as few switches as possible.
A job consists of one or more tasks. Borglets connect their respective computers to
the Borgmaster, which polls these Borglets several times each minute to determine the
states of the corresponding computers and to push requests to them. Even though the
Borgmaster represents a single point of failure for a cell, the multitude of cells makes
the entire datacenter decentralized and less prone to complete failure.
Borg handles two types of jobs: short latency-sensitive requests that last from
microseconds to milliseconds (e.g., sending an email or saving a document) and batch
jobs that may last from dozens of seconds to days and weeks – sample Google cluster
data are publicly available 1 . Clearly, these types of jobs have different sensitivities
to failures and the underlying hardware support – for example, shorter jobs do not run
in VMs, since the cost of starting and maintaining a VM process is high. Instead, these
jobs are run in lightweight containers, which are explained in Section 10. Also, jobs
may depend on one another or have specific constraints (e.g., to run on a GPU
architecture).
The diversity of jobs creates serious difficulties for job scheduling and load
balancing. Large jobs will take bigger computers, and possibly a portion of their
resources will remain unused, which may be useful in case the workload increases while
a job is executed; however, the resource usage becomes heavily fragmented. In contrast,
trying to pack jobs tightly on a computer in a cell leaves little room for assigning
more resources to jobs whose workloads increase rapidly. And if the constraints are
specified incorrectly for a job, its performance will suffer because no resources may
be available. Finally, packing jobs tightly may lead to poorly balanced jobs in the
cell.
Borg’s scheduler addresses these problems by first determining which jobs are
feasible given their constraints and available resources in the cell, and then by
assigning these jobs to computers using an internal scoring mechanism that uses a
mixture of criteria to maximize the utilization of available resources while satisfying
the constraints of the job. One interesting criterion is to assign jobs to computers
that already have the software packages on which the job depends. However, in the
likely case where no computer is available to run a job, Borg will terminate jobs of
lower priority than the waiting job to free resources. As we will learn later, cloud
spot markets use a similar methodology: applications that pay much lower fees to run
are preempted to enable more expensive applications to run in the cloud.
Omega is a cluster management system widely considered the next generation of
Borg within Google [135]. Omega incorporated elements of Borg that were success-
fully used for years and added a new idea of sharing the state of the entire cell among
many schedulers within each cell. After years of Borg deployment, the contours of two
main types of jobs formed: the first major type is batch jobs, which constitute over
80% of all jobs; however, the Borg scheduler allocated close to 80% of resources on
average to service jobs, which run longer than batch jobs and have fewer tasks,
compared with up to thousands of tasks for each batch job. Scheduling a large number
of jobs with many tasks optimally is a very difficult problem, and a single scheduler
per cell takes dozens of seconds or longer to schedule a job.
1 https://ai.googleblog.com/2011/11/more-google-cluster-data.html
12.5 VM Migration
As part of load balancing, a VM may be required to migrate to a different server to
distribute the load more evenly, or to relocate to a more powerful server to improve
performance. This is an expensive operation, since it requires computing resources and
network bandwidth. In addition, migration transparency requires that the application
owners not be affected by the migration of their VMs. Specifically, the application
that runs within the VM should not be paused for a long period of time or terminated
during migration. In fact, the cost of migration contributes to the cost of load
balancing, so if this cost exceeds a certain threshold, the benefits of load balancing
may be obliterated. In this section, we review three techniques for VM migration [82].
In pre-copy migration, the memory state of the source VM is copied to the destina-
tion VM while the program in the source VM continues its execution. Once copying is
finished, newly updated memory pages within the source VM are identified and copied,
repeating this process until there are no modified pages to copy or some threshold for
the amount of copying is reached. Then, the context of the source VM (e.g.,
environment variables, files, open network connections) is transferred to the
destination VM and the execution is resumed.
Pre-copy migration is popular, and it is used in Xen [20] and VMware [131]; however,
its biggest problem arises when the application frequently writes into memory. As a
result, the migration will use a significant amount of the available network bandwidth
to transfer memory pages, prolonging migration time and thus making load balancing
less effective, to the point of the increased costs negating all benefits. In
real-world, i.e., non-simulated, cloud settings, a 60-second latency means that the
migration process failed [82].
The other technique for VM migration is called post-copy, where the VM con-
text is copied from the source to the destination VM before the memory content is
copied [74]. Next, the execution of the destination VM starts and in parallel, mem-
ory pages are being copied from the source to the destination VM. Ideally, memory
pages that are needed sooner for the execution of the destination VM should be copied
first, however, in general, it is not possible to determine which memory pages will
be needed at what point in execution. Therefore, memory pages may be ordered for
copying based on different criteria: by the most recent time they were accessed, by the
number of accesses, or by locality – start with copying the most recently accessed page
and then copy other pages whose addresses are adjacent to the most recently copied
pages. Since the source VM does not continue to execute, the number of memory pages
that must be copied to the destination VM is bounded.
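One possible ordering of pages for the background copying in post-copy migration can be sketched as follows. The page addresses and access statistics are hypothetical; a hypervisor would obtain them from its page-tracking machinery:

```python
def copy_order(pages):
    """Order pages for post-copy background copying by most recent
    access time first, breaking ties by access count, so that pages the
    destination VM is likely to touch soon are transferred earliest.
    `pages` maps a page address to (last_access_time, access_count)."""
    return sorted(pages, key=lambda p: (pages[p][0], pages[p][1]),
                  reverse=True)

pages = {0x1000: (50, 3), 0x2000: (90, 1), 0x3000: (90, 7)}
assert copy_order(pages) == [0x3000, 0x2000, 0x1000]
```

Whatever heuristic is chosen, it is only a guess about the future access pattern, which is why the demand-paging fallback described next is still needed.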
A key problem with the post-copy technique is that while memory pages are being
copied, the VM executes an application that may need memory pages other than the ones
being copied. If a requested page has not been copied yet, then an exception is raised,
the execution is suspended, the memory page copying is suspended, the required memory
page is located in the source VM and transferred to the destination VM, after which
the process resumes. Generating exceptions and interrupting the migration process is
expensive, and if it happens frequently, the cost of migration will increase.
The third technique, called guide-copy migration, combines elements of the pre-copy
and the post-copy migration techniques. The main idea is to transfer the VM context
from the source VM the same way it is done in the post-copy technique, and then to
continue executing both the source and the destination VMs. The source VM becomes the
guide context, and the destination VM is likely to access the same memory pages as the
guide context. Therefore, the guide context provides the pages that the destination VM
will need, these pages are copied in time for the access, and exceptions are mostly
avoided. However, if copying is delayed for any reason, then an exception is generated
as in the post-copy technique. Disk changes are treated the same way as memory changes,
and disk pages are transferred when they are modified at the source VM. The technique
may terminate the execution of the guide context at any time, since non-transferred
pages will be requested later by throwing exceptions, as is done in the post-copy
technique.
Executing both the source and the destination VMs consumes additional resources, and
the process of transferring memory pages should complete relatively fast to avoid long
duplicated executions. However, it is possible that the application in the source VM
executes for a long time without accessing new memory pages, i.e., the accesses may
happen in bursts followed by long periods of lull. This is the most unfavorable
scenario for the guide-copy technique, and it is handled by early termination of the
source VM. Interested readers may obtain more details from the original paper [82].
method will take an HTTP request as its input and forward it to one of the existing VMs
that run the same application that processes these requests (e.g., a web store). Thus,
a load balancer is a distributed object that works as a multiplexer to send requests to
VMs to improve the workload distribution.
Question 8: Explain how you would design an adaptive system where sched-
ulers are added and removed elastically depending on the inequality of distributing
workloads across multiple servers.
Figure 12.1: A Java-like pseudocode for using a load balancer with Amazon Cloud API.
A Java-like pseudocode for creating and using a load balancer with the Amazon Cloud
API is shown in Figure 12.1. The code is in general self-explanatory with the use
of the Amazon API documentation. Once references to an Elastic Compute Cloud (EC2)
client object are obtained in line 1 and to the elastic load balancer in line 2, a load
balancer request object (i.e., the input) is created in line 3, named in line 4, and
listeners for HTTP requests are created and attached to the load balancer in lines 5–7.
A load balancer result (i.e., the output) object is created in line 8, and existing VM
instances are obtained and registered with the load balancer in the remaining lines of
the code. At this point, the load balancer starts servicing inputs and directing them
to the registered VMs. The flexibility of using the cloud API to create load balancers
and dynamically configure them enables programmers to reconfigure their architecture
on demand to improve the performance and scalability of their applications.
12.7 Summary
In this chapter, we synthesized the knowledge of VMs and cloud infrastructure to intro-
duce the concept of load balancing. We discussed key characteristics of load balancing
algorithms, their types, and reviewed a few basic non-systematic algorithms that do
not take into consideration the load of the network. We also studied a fundamental
diffusing load balancing algorithm and Oktopus, a greedy algorithm for datacenters
that takes into consideration that the network load should be minimized. Next, we
studied techniques for the migration of VMs that load balancers use to distribute the
load more evenly. Finally, we discussed how to use load balancers as distributed
objects to dynamically change architectures of applications in the cloud to adapt them
to changing
workloads.
Chapter 13
Spark Data Processing
data in memory, creating a hash table whose integer key designates a sum of two
integers and whose value is a list of objects, each holding the values of two integers
whose sum is the corresponding key. As new integers arrive in real time, they will be
directed to computational units that will compute their sums and add them to a
corresponding hash table, and these tables can be merged at a later time. Thus, unlike
key-value pairs in map/reduce, these datasets can be reused repeatedly without the I/O
overhead. That is, in a nutshell, the main rationale behind Spark, which extends the
computational model of map/reduce with significant scheduling and performance
improvements [160].
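The in-memory hash table described above, and the later merging of tables built on different computational units, can be sketched in plain Python; the pairs of integers below are hypothetical inputs:

```python
from collections import defaultdict

def index_pairs_by_sum(pairs):
    """Build a hash table mapping a sum to the list of integer pairs
    that produce it; tables built on different nodes can be merged later."""
    table = defaultdict(list)
    for a, b in pairs:
        table[a + b].append((a, b))
    return table

def merge_tables(t1, t2):
    """Merge two sum-indexed tables by concatenating the pair lists."""
    merged = defaultdict(list)
    for table in (t1, t2):
        for key, pairs in table.items():
            merged[key].extend(pairs)
    return merged

# Two nodes index disjoint batches of arriving pairs, merged afterwards.
t1 = index_pairs_by_sum([(1, 4), (2, 3)])
t2 = index_pairs_by_sum([(0, 5), (10, 1)])
merged = merge_tables(t1, t2)
assert merged[5] == [(1, 4), (2, 3), (0, 5)]
assert merged[11] == [(10, 1)]
```

Because the tables stay in memory and merge cheaply, repeated queries over the same data avoid the disk round-trips that a map/reduce job would pay on every pass.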
Consider as an example an algorithm for clustering a dataset based on detecting
centers of data point clusters and grouping the data points together that lie within a
certain proximity to the center of each cluster. This algorithm is implemented within an
external loop whose condition is to determine whether the distance is smaller than some
threshold value between the center of a cluster computed in the previous loop iteration
and the newly computed center of the same cluster. Within this loop, each data point
is assigned to the cluster whose center is closest to it, and then the new center of
each cluster is computed by averaging the coordinates of the data points assigned to
that cluster. The algorithm may take thousands of iterations to converge. Consider the
implication of multiple iterations on
the implementation of this algorithm using the map/reduce model. It means that after
each iteration the output of the map/reduce program becomes an input to itself. The
overhead of recording the outputs that will be discarded is significant. As an exercise,
please implement this algorithm or one of its variations as a map/reduce application.
Since the partitioned collection of objects resides in memory, it can be easily lost
for various reasons, including an electric power surge or a hardware failure. In this
case, the RDD will be rebuilt by replaying the sequence of steps that produced it from
the original dataset in permanent storage (e.g., an HDD). Even though doing so will slow
down the computation, if such failures happen infrequently, RDDs can speed up
computations significantly by reducing the I/O overhead.
As an abstraction, the RDD hides the implementation details, which enables programmers
to obtain a physical dataset and perform operations on it. This is a main reason why
Spark does not define a particular permanent storage type and, unlike Hadoop, does not
have a filesystem associated with its datasets. Doing so gives Spark the flexibility
to implement the specifics of loading and saving RDDs – they can come from collections,
files, databases; in short, from any dataset in any format that can be partitioned and
transformed into some representation in RAM.
Moreover, Spark's flexibility in dealing with RDDs allows programmers to in-
tegrate real-time data streams into Spark processing, where a data stream is a potentially
infinite collection of lazily computed data structures, i.e., the elements of the stream are
evaluated on an as-needed basis. Consider a stream of data from smart meters installed
in houses across a city, where newly arrived sets of data are obtained from a network
connection and added to RDDs based on some partitioning rule. Doing so in Hadoop
is nontrivial, since it requires modifying the framework to add the data structures
to existing or new shards and changing the map/reduce program configuration
to read from these newly modified shards to recompute the results. In Spark, new data
structures are simply added to an RDD and the corresponding computation is triggered
on it automatically; programmers do not need to write any additional code for that.
Abstract parallel operations allow Spark programmers to define standard manipu-
lations on RDDs. Key operations include reduction, which reduces the elements of
a dataset to a concise representation (e.g., multiplying all values in a list of integers
reduces the list to the single value of their product); map iteration, which traverses all
RDD elements and applies a user-defined function that maps each element to some
value; the operation collection, which gathers all elements of the RDD and sends them
to the master node, also called the driver, which allocates resources for Spark
operations; and the operations cache and save, where the former keeps the RDD in
memory with its elements evaluated lazily, whereas the latter evaluates the RDD and
writes it to persistent storage. Thus, unlike the map/reduce model, persistence is
uncoupled from the other operations, thereby allowing programmers to achieve greater
flexibility and better performance in the design of big data processing applications.
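The book's examples are in Scala; as a plain-Python analogue, the first three operations above can be sketched on an ordinary list standing in for a small RDD partition (cache and save have no direct analogue in a single-process sketch):

```python
from functools import reduce

data = [1, 2, 3, 4]   # a stand-in for a small RDD partition

# "reduction": collapse the dataset into a concise representation,
# here the product of all values, as in the text's example.
product = reduce(lambda a, b: a * b, data)

# "map iteration": apply a user-defined function to every element.
squares = [x * x for x in data]

# "collection": gather the elements at the driver (trivially local here).
collected = list(squares)

assert product == 24 and collected == [1, 4, 9, 16]
```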
Finally, Spark relaxes the data-independence requirement of the map/reduce
model by introducing the concept of shared variables, i.e., it allows dependencies among
data elements and enables certain synchronization mechanisms to use these data de-
pendencies. Of course, allowing free-for-all read and write access to shared memory
regions can complicate the design and analysis of Spark-based programs to the point
that scheduling them would not be effective, since some components of the program
may underutilize their resources while waiting to obtain access to a shared memory
region. In many cases Spark applications are stateful; they need to keep state
and share it among the worker and driver nodes. Yet, using a global memory
region for sharing data is likely to lead to side effects that would make it difficult for
Spark to enforce scheduling and resource distribution. Instead, Spark introduces
shared memory abstractions that restrict the freedom of accessing and manipulating
the content of the shared memory.
Shared memory abstractions are called broadcast and accumulator variables. The
former represents read-only data structures (e.g., an employee hash table that maps
employee identifiers to the file locations that contain employee data) and the lat-
ter designates storage that worker nodes can use to update some values (e.g., the
number of words in text documents computed using the map/reduce model). An
example of a broadcast variable is the following Scala statement: val bcVar =
sparkContext.broadcast(List(1, 2)), where the object List is sent to
all worker nodes. However, when an associative operation like summation should be
applied to a collection object to accumulate the result in a variable, the variable can be
declared as an accumulator and used in mapping or reducing operations, as in the
following example: val acVar = sparkContext.accumulator(0, "Counter") and
then sparkContext.parallelize(List(1, 2)).map(x => acVar += x). The
Spark infrastructure takes care of enforcing the semantics of these sharing abstractions.
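The semantics of the two abstractions can be sketched in plain Python, without Spark's actual API: a broadcast value is consulted read-only by every worker, while an accumulator is add-only on workers and read only by the driver. The class and function names here are invented for illustration.

```python
class Accumulator:
    """Add-only shared variable: workers may only add; the driver reads .value."""
    def __init__(self, initial=0):
        self.value = initial
    def add(self, amount):
        self.value += amount

def run_workers(partitions, broadcast, acc):
    # Each "worker" processes one partition, consulting the broadcast
    # value read-only and folding its contribution into the accumulator.
    for part in partitions:
        for x in part:
            if x in broadcast:
                acc.add(x)

acc = Accumulator(0)
run_workers([[1, 2], [2, 3]], broadcast=[1, 2], acc=acc)
assert acc.value == 5   # 1 + 2 from the first partition, 2 from the second
```

Restricting workers to associative additions is what lets a scheduler apply them in any order, or re-apply them on a restarted partition, without changing the final value read at the driver.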
Next, we take a few elements, filter out those not divisible by three, and compute the
sum: endp(1).toList.filter(_ % 3 == 0).flatMap(x => List(x, x*2,
x*3)).foldLeft(0)(_ + _). The mapping operation, flatMap, takes as its
parameter a function that transforms the list by expanding each of its elements into a
list containing the element itself, followed by the element multiplied by two, and then
by three. Finally, the operation foldLeft takes an accumulator value, the zero in the
parentheses, and a reduction function that sums the value of the accumulator, desig-
nated by the first underscore, with the next element of the list, designated by the second
underscore, putting the resulting value into the accumulator and returning it. These
instructions are pipelined in the execution chain, with the output of the previous
operation serving as an input to the next operation in the pipeline. One can view this
pipeline as a conveyor where the stream elements are diced and sliced, and their values
or their types are transformed into some resulting product.
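The same filter/flatMap/foldLeft pipeline can be reproduced in plain Python; the input list below is a hypothetical stand-in for the stream endpoint endp(1) in the text.

```python
# A hypothetical stand-in for the elements delivered by endp(1).
elements = [3, 4, 6, 7, 9]

# filter(_ % 3 == 0) . flatMap(x => List(x, x*2, x*3)) . foldLeft(0)(_ + _)
result = sum(
    y
    for x in elements
    if x % 3 == 0                  # keep only multiples of three
    for y in (x, x * 2, x * 3)     # expand each survivor into x, 2x, 3x
)
assert result == 108               # (3+6+9) + (6+12+18) + (9+18+27)
```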
Using this example, we can see that each operation produces a subset of the stream
computed by the previous operation in the pipeline. Moreover, these operations can
be performed independently on the subsets of the original stream from the endpoint,
endp(1). At a high level, Spark processing can be viewed as splitting the original
stream into subset streams, which are represented as RDDs, and then applying the
pipelined operations to each subset RDD. At some point, the results of the pipelined
operations on each RDD are aggregated and reported as the final computed result.
A key programming concept is the monad, which allows programmers to construct
pipelines that process data in a series of steps. A monad is a function,
M[U]→(U→M[V])→M[V], where M is a type parameterized by some other types, U and V.
One can think of the type M as the container type, say a list or a set, and the types U
and V as the types of elements that the container hosts, say the types Int and String
respectively. The function maps the parameterized type M[U], or, using our concrete
example, a list of integers, List[Int], to the type M[V], or a list of strings,
List[String], via an intermediate function that takes values of the type U and
maps them to values of the parameterized type M[V]. With our example, one can view
the input, List[Int], as a container that holds unique employee identification
numbers, and the monad maps this input onto List[String], a container that holds
the names of the people who are assigned these identification numbers. These monadic
functions enable the type translations that are needed to construct pipelined operational
sequences.
Consider the monadic function def flatMap[V](f: U => M[V]): M[V], defined
on the container M[U], which takes a function that maps a value of the input type U
to the output container M[V] holding values of some other type, V. In our example
above, this function f is represented by x => List(x, x*2, x*3). In fact, there
are two fundamental functions: one that maps a value of some type, U, to a container
that holds values of this type, M[U], and another that maps the container M[U] to the
same kind of container holding values of some other type, M[V]. The former function
is defined as def unit[U](i: U): M[U]. Applying the function unit to a value
v produces the list v::Nil, where :: is the cons operator that prepends an element
to a list. The latter function, bind, can be defined using the function flatMap as bind
container f ≡ container.flatMap(f), where f: U => M[V]. Thus, the functions
unit and bind are the fundamental functions whose composition enables programmers
to create pipelined data processing functions.
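Plain-Python versions of unit, flatMap, and bind over lists make the text's definitions concrete; the employee-name mapping below is a made-up illustration of the List[Int] to List[String] translation.

```python
def unit(i):
    # unit wraps a value into the container type, here a one-element list.
    return [i]

def flat_map(container, f):
    # flatMap applies f (a U -> M[V] function) and flattens the results.
    return [v for u in container for v in f(u)]

def bind(container, f):
    # bind is defined in terms of flatMap, exactly as in the text.
    return flat_map(container, f)

ids = [1, 2, 3]                                  # a "List[Int]" of employee ids
names = bind(ids, lambda i: [f"employee-{i}"])   # becomes a "List[String]"
assert names == ["employee-1", "employee-2", "employee-3"]
```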
Question 4: Write algorithms for bubble sort, quicksort, and k-means cluster-
ing using monadic functions in Scala.
Monadic laws govern the application of the functions unit and bind. The left-
identity law stipulates that unit(i).flatMap(f)≡f(i), since the function f is
applied to the input value i. The right-identity law states that c.flatMap(unit)≡c,
where the function flatMap applies the function unit to each element of the con-
tainer c, resulting in the same container c. Finally, the associativity law states that
c.flatMap(f).flatMap(g)≡c.flatMap(i=>f(i).flatMap(g)), mean-
ing that the composition of the applications of the functions f and g to the container
object c is equivalent to applying to each element of c the function that applies f to
the element and then applies flatMap(g) to the result.
Monadic laws can be demonstrated using a simple example with List as the con-
tainer. We have the function def f(i:Int):List[Int]=List(i,i*2)
and the function def g(i:Int):List[Int]=List(i+1,i+2), each of which maps
an input integer to a list of integers whose elements are transformations of the input.
Let unit i ≡ List(i). Then, applying the left-identity law
unit(i).flatMap(f)≡f(i), we determine that unit(i).flatMap(f)≡
List(i).flatMap(f)=List(i,i*2), which in turn is equivalent to
f(i)=List(i,i*2).
Applying the right-identity law, c.flatMap(unit)≡c, gives the following
result: List(i).flatMap(x=>List(x))=List(i)≡List(i). Finally, for the
associativity law, c.flatMap(f).flatMap(g)=List(i).flatMap(f).flatMap(g)=
List(i,i*2).flatMap(g)=List(i+1,i+2,i*2+1,i*2+2), and
List(i).flatMap(i=>f(i).flatMap(g))=List(i+1,i+2,i*2+1,i*2+2), so the
law holds for this example.
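The three laws can also be checked mechanically. The following Python sketch reuses list-based unit and flatMap and the functions f and g from the text, and asserts each law on sample inputs:

```python
def unit(i):
    return [i]

def flat_map(c, f):
    return [v for u in c for v in f(u)]

f = lambda i: [i, i * 2]       # the function f from the text
g = lambda i: [i + 1, i + 2]   # the function g from the text
i, c = 5, [5, 7]

# Left identity: unit(i).flatMap(f) == f(i)
assert flat_map(unit(i), f) == f(i) == [5, 10]

# Right identity: c.flatMap(unit) == c
assert flat_map(c, unit) == c

# Associativity: c.flatMap(f).flatMap(g) == c.flatMap(x => f(x).flatMap(g))
assert flat_map(flat_map(c, f), g) == flat_map(c, lambda x: flat_map(f(x), g))
```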
The reason that monadic operations are important in Spark is that these op-
erations can be chained together to construct a pipeline, which can be viewed as a
conveyor with elements of the streams arriving in different shapes and containers,
where robotic arms are the monadic operations that filter, slice, dice, and generally
chop and compress data in a variety of ways. One can view the computation of the
average of the integer numbers in a list as a compression or reduction operation, since
all the integers are “reduced” to a single floating point value. In order for a robotic
arm that handles only rectangular objects to process a round object, the object must
first be processed by some other robot that chops off parts of its round shape to make
it rectangular. Viewing the shapes of the objects as their types, monadic operations
become important to ensure the producer/consumer compatibility of the types of
consecutive operations.
Question 5: Discuss pros and cons of compiling Spark code directly to the
OS-level instructions.
Consider the Spark program in Figure 13.1, which obtains a list of integers separated by
semicolons as streaming ASCII characters from a network address. In line 1, the Spark
configuration object is obtained, and it is used in line 2 to create a streaming context
object that batches the streaming data in ten-second intervals. This stream can
be viewed, for example, as a list of two strings, concretely expressed as
List("123;5;6","123;1;579;16"). The ten-second interval is not shown in
this concrete representation, where the two text strings are combined in the list container.
In lines 3 and 4, the object streamContext is used to create a socket that receives
the text stream at the address and port designated by the variables ipaddr and port
respectively. The input parameter StorageLevel.MEMORY_AND_DISK_SER spec-
ifies that the RDD is stored as serialized Java objects and that the partitions that do not
fit into RAM will be stored on the HDD and will not be recomputed when needed.
Figure 13.1: A Spark program for computing the average value for the stream of integers.
In line 4, two monadic functions are invoked on the input text stream: flatMap
and map. The former applies the function split to each element of the input stream
to split it into lists of substrings using the semicolon separator and flattens the result
into a stream of substrings, each of which is transformed into an integer value (or None
if a string contains symbols other than digits). The resulting list of integer
values is assigned to the variable intList. In line 5, the chained monadic methods
map and reduce transform each integer into the value one and apply summation
to determine the number of all input integer values. The method collect
gathers all values from the different nodes to tally them and stores the result in the
variable total. In line 6, the number of negative input values is computed by filtering
out positive values, and in line 7 the sum of all positive integers in the list is computed.
Finally, in line 8 the average is calculated using the previously computed values.
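Since Figure 13.1 itself is not reproduced here, the described logic can be followed in a plain-Python sketch of the same computation on the two concrete batches from the text (the treatment of malformed tokens and the final average over the total count are assumptions based on the description above):

```python
# Two hypothetical ten-second batches, matching the text's concrete example.
batches = ["123;5;6", "123;1;579;16"]

# Lines 3-4 analogue: split each batch on ';' and parse the integers,
# discarding malformed tokens (the text maps those to None instead).
ints = [int(tok) for batch in batches for tok in batch.split(";")
        if tok.lstrip("-").isdigit()]

total = len(ints)                             # line 5: count of all integers
negatives = sum(1 for x in ints if x < 0)     # line 6: count of negative values
positive_sum = sum(x for x in ints if x > 0)  # line 7: sum of positive values
average = positive_sum / total                # line 8: the reported average
```

For these two batches, the pipeline parses seven integers summing to 853, with no negative values.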
This example is highly representative of the core functionality of Spark, where the
main benefit of the Spark abstractions is to reduce the complexity that a programmer
has to deal with when creating an application that manipulates distributed objects. In
Spark, the same RDDs can be reused in multiple iterations of the same algorithm, and
these RDDs, together with the algorithmic operations, are distributed automatically and
efficiently across multiple computing nodes. Nevertheless, programmers write a single
program without reasoning about these multiple nodes, thus preserving location and
migration transparencies. The bulk of the complexity is handled by the underlying
platform called Mesos, on top of which Spark is built.
To achieve this reduction, frameworks use Mesos filter interfaces, where frameworks
pass their resource constraints to the Mesos master as lazily evaluated Boolean
predicates. Doing so enables the Mesos master to select only those resources that pass
the constraint checks, thereby avoiding sending resource offers that are known in
advance to be rejected by some frameworks. Resource offers that pass the constraint
checks can still be rejected by the frameworks if certain runtime conditions warrant
such rejections.
One main assumption behind the design of Mesos is that most tasks have short
durations. That is, the Mesos master is designed to work as a fast switch: it allocates
resources to tasks that run for a short period of time, deallocates the pre-assigned
resources once the tasks finish, and reallocates these resources to waiting tasks. It is
easy to see that if this assumption is violated frequently, Mesos loses its efficiency.
To address this problem, if Mesos determines that many long tasks have hogged
resources, its master informs the offending frameworks that it will kill their tasks after
some grace period. At the same time, it is quite possible to have legitimate long-running
tasks, so Mesos allows frameworks to inform the master that certain tasks require
guaranteed resource allocation without being killed.
Finally, Mesos provides a strong level of isolation among resources using the con-
cept of chroot, or chroot jailing, which became a part of Unix circa 1982. Recall that
the operation chroot allows users to run a program with a root directory that can be
freely chosen as any node in the file system; the running program cannot access
or manipulate files outside the designated directory tree, which is called a chroot jail.
The chrooting mechanism is used in Mesos to prevent running tasks from accessing
one another's resources.
Suppose that a cloud computing cluster has M nodes with L task slots per node, so
that there are S = ML slots in total, where a slot is an abstraction for the resources
available to execute some task, and all tasks take approximately the same time, T, to
run. Let P_j designate the set of nodes that have data for the job j. Thus, the ratio of
preferred nodes for j is p_j = |P_j|/M, where |P_j| is the cardinality of the set P_j.
Also, p_j is the likelihood that the job j has data on each slot that becomes free at
some point in time. Suppose that these slots become free in some sequence and the
first available slots do not have the data for the job j. Then, the job j can wait up to
D slots until the master decides that the job must take a slot that does not have the
local data for this job. The main result that we will show next is that non-locality
decreases exponentially with D.
The likelihood that a task does not get a slot with local data is 1 − p_j, and since the
availability of each slot is an independent event, we can multiply these likelihoods for
D sequentially appearing slots with non-local data: (1 − p_j)^D. Since a slot becomes
available on average every T/S seconds, the job j will wait at most TD/S seconds.
Choosing the value of D thus becomes a critical choice.
Question 9: What is a ballpark good ratio of the long running vs short running
jobs for the delay scheduling approach to be more effective and efficient?
Suppose that the goal is to achieve an overall job locality 0 ≤ λ ≤ 1, where the
value zero means no task has local data and the value one means that all tasks are
assigned to nodes with local data. Suppose that there are N tasks that the job j
must run on the cluster with the data replication coefficient 1 ≤ ρ ≤ M, where the
value one means no data is replicated across nodes and the value M means that the data
has replicas on all nodes. The likelihood that some node does not have a replica to run
the kth task is (1 − k/M)^ρ, and hence p_j = 1 − (1 − k/M)^ρ for the task 1 ≤ k ≤ N.
The reader may recognize the famous compound interest inequality in the limit:
(1 − r/n)^{nt} ≤ e^{−rt} for n → ∞. Applying this inequality, we obtain that the
likelihood that the job j starts a task with local data after waiting up to D slots is
1 − (1 − p_j)^D = 1 − (1 − k/M)^{ρD} ≥ 1 − e^{−ρDk/M}.
At this point, let us determine the average job locality l given the waiting time D:

l(D) = (1/N) Σ_{k=1}^{N} (1 − e^{−ρDk/M}) = 1 − (1/N) Σ_{k=1}^{N} e^{−ρDk/M}
     ≥ 1 − (1/N) Σ_{k=1}^{∞} e^{−ρDk/M} = 1 − e^{−ρD/M} / (N (1 − e^{−ρD/M}))
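This bound can be checked numerically. The sketch below evaluates the average locality, using the exact term 1 − (1 − k/M)^{ρD} for each of the N tasks, against the closed-form lower bound, for hypothetical cluster parameters (M nodes, replication coefficient ρ, waiting budget D slots):

```python
import math

def locality(N, M, rho, D):
    # Average job locality l(D): the mean over tasks k of 1 - (1 - k/M)^(rho*D).
    return sum(1 - (1 - k / M) ** (rho * D) for k in range(1, N + 1)) / N

def lower_bound(N, M, rho, D):
    # Closed-form bound 1 - e^(-rho*D/M) / (N * (1 - e^(-rho*D/M))),
    # obtained from the geometric series over k = 1 .. infinity.
    x = math.exp(-rho * D / M)
    return 1 - x / (N * (1 - x))

N, M, rho = 10, 100, 3
for D in (1, 5, 20, 50):
    # The bound holds for every waiting budget D, and both sides approach
    # full locality (1.0) exponentially as D grows.
    assert locality(N, M, rho, D) >= lower_bound(N, M, rho, D)
```

For small D the closed-form bound is loose (it can even be negative), but as D grows both quantities approach one, which is the exponential decrease of non-locality claimed above.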
Relational database optimization engines are complex programs that take an SQL
statement as input and transform it into a sequence of low-level data access and
manipulation functions to accomplish the goal that programmers encode in the SQL
statement. Many SQL statements can be computed by a very large number of combi-
nations of the low-level functions, and selecting the optimal combination that minimizes
the execution time is in general a computationally intractable problem.
Consider the example SQL statement shown in Figure 13.2. It retrieves
information about employees of the same gender who work in the same de-
partment. The statement is in fact a nested query that contains two SQL statements on
the same table. Depending on the API used by an application, this query is sent to a
relational database engine, which processes and optimizes the query and then returns
the results.
Generally speaking, the application may look like the one shown in Figure 13.1,
where we can insert this SQL query as a string parameter to some API call that
sends the query to the corresponding database for execution. However, the table
EMPLOYEES can be very big and it is unclear how it can be partitioned into RDDs;
even if it can be, say by distributing columns or groups of rows to different nodes,
how does Spark optimize the query to retrieve the data with good performance?
Spark SQL addresses this question – it is a library on top of Spark that presents
abstractions to programmers who need to write SQL statements to retrieve data from
relational datasets. These abstractions are realized in interfaces that can be accessed
using database access software like Open DataBase Connectivity (ODBC) or
with Spark SQL's DataFrame API. An example of the Spark SQL implementa-
tion of the SQL query from Figure 13.2 is shown in Figure 13.3. Although some names
are slightly changed, it is easy to see how similar the two queries are, except that the
Spark SQL query is encoded using DataFrame API calls. Doing so enables Spark SQL
to generate an execution plan to retrieve the resulting data and to optimize the generated
plan by distributing tasks among nodes to improve performance.
1 val employeeData = employees.as("a")
2 .join(employees.as("b"), $"a.deptID" === $"b.deptID")
3 .where($"a.gender" === 1)
4 .groupBy($"a.deptID", $"a.name").agg(count($"b.name"))
Figure 13.3: A Spark SQL statement that retrieves information about employees.
The query optimization function within Spark SQL is performed by the Catalyst
optimizer, an extensible component that is based on the Scala programming language.
At its core, the Catalyst optimizer contains a library for analyzing DataFrame queries
in their tree representations and generating multiple execution plans using custom-
defined rules. These plans are then evaluated and the winning execution plan is chosen.
For example, for the expression agg(count(name)), the root node of the tree
represents the call to agg, the child of this root node represents a call to the function
count, and that node's child in turn represents the literal “name”. Constructing a
parse tree is the first step in the Catalyst optimizer.
Transforming a parse tree involves traversing its nodes and applying rules that re-
place nodes and branches with other nodes and branches to reduce the execution
time while maintaining the semantics of the query that the tree represents.
Transformation rules are expressed in Spark SQL using Scala pattern matching, where
transformations can be viewed as if-then rules, with the antecedent describing the
substructure of the tree and the consequent specifying the actions that should be per-
formed on the instances of trees that match that structure. For example, the rule
case Square(Var(x)) => Multiply(Var(x),Var(x)) transforms an
expensive function that computes the square of a variable into the multiplication of the
variable's value by itself, which is often cheaper. Spark SQL transformation rules are
much more complicated in general and they can contain arbitrary Scala code.
After logical optimization, where rules are applied to transform the parse tree, and
the physical planning phase, which replaces DataFrame API calls with low-level opera-
tors, the resulting code is generated from these low-level operators. To do this, the parse
tree must be traversed internally by the Spark SQL library. This task is accomplished
using Scala quasiquotes, a language-based mechanism for translating programming
expressions internally into parse trees, storing them in variables, and unparsing trees
back into programming expressions [139]. For example, consider a Scala vari-
able, val v = 1. Using a quasiquote, we can obtain a tree val tree = q"$v
+ $v", where tree: Tree = 1.$plus(1). Interested readers can create
an sbt-based Scala project, include the dependency libraryDependencies +=
"org.scala-lang" % "scala-reflect" % scalaVersion.value in the
file build.sbt, and use the module scala.reflect.runtime.universe.
Once the tree is constructed, rules are applied to transform it into a syntac-
tically correct and semantically equivalent Scala program that Spark will execute on
the destination nodes.
13.7 Summary
In this chapter, we studied Spark and its underlying mechanisms for executing big data
processing applications in the cloud. The main abstractions in Spark enable program-
mers to concentrate on the data processing logic of their applications and to avoid
complicated and error-prone reasoning about data location, access, migration, and other
kinds of data manipulation done to improve performance. The underlying platform
called Mesos handles other frameworks besides Spark, and it allows these frameworks
to specify what resources Mesos should provide for executing their tasks. Moreover,
the delay scheduling technique helps Spark balance the fairness of executing diverse
tasks with improving the overall performance of the system. Finally, Spark SQL shows
how extensible Spark is for processing relational datasets and how it transplants
relational database optimization techniques using Scala language mechanisms.
Chapter 14
The CAP Theorem
Imagine yourself at your computer browsing and purchasing items in a web store. You
select an item you like, put it in a shopping cart, and start the checkout process.
Unbeknownst to you, there are hundreds of shoppers who like this item and attempt to
buy it. Yet, there is a limited quantity of this item in the company's warehouse, since
warehouse space is limited and expensive and the company cannot predict the demand
for all items with a high degree of precision. Nevertheless, all shoppers proceed with
their purchases of this item and receive receipts stating a delivery date within the next
day or two.
However, within a day a few shoppers will receive an email informing them that
their item is on backorder and the shipment will be delayed by a couple of days. Few
of us get deeply frustrated by this experience – we shrug it off and do not overly
complain about a couple of days of delay. When we decompose this situation, it is
clear that the web store allowed us to purchase an item that it could not guarantee
to have in stock to satisfy all incoming orders. Instead of putting our shopping carts
on hold and telling us that it may take a few hours to determine whether the requested
item is available, the web store optimizes our shopping experience by fibbing that the
item is available in unlimited quantities and we can proceed with the purchase. Some
web stores actually show a label stating that the item's quantity is limited, but many
shoppers take it as a marketing ploy to boost the item's demand. One way or the other,
the harsh reality is that we accept incorrect data from web stores in return for a fast and
pleasant shopping experience.
This example illustrates an important problem that we describe in this chapter
along with its solutions – how to balance the correctness of the results of a computa-
tion with the need to meet some performance characteristics of the application. When
requests are sent to a remote object, it can queue these requests and process them in
order, or it can respond immediately with approximately correct results and perform
the computations in batches at a later time. Of course, ideally, we want our remote
distributed objects to reply to requests immediately and correctly, but as a sage once
said, experience is what we get when we don't get what we want. Now get ready to
get the experience of the trade-off between the correctness of the computation and the
timing of the delivery of correct results to clients in the presence of network failures.
For web-based businesses like Amazon, the performance of web-based applications
began to take precedence over the correctness of the computations that they perform.
Various studies repeatedly show that, in the presence of competition among web-based
businesses, an average customer is willing to wait less than 15 seconds for a reply from
a web business application. The switching cost is very small in most cases, i.e., it takes
a customer a few seconds to redirect her browser to the URL of a competing web store,
whereas a trip to a physical store takes much longer, and the time already invested in
that trip makes customers more tolerant of a poor shopping experience. Moreover, it is
not just the fickleness of web store shoppers that makes the performance of applications
more important than the consistency of data – high-speed trading, search engines,
social networks, content delivery applications, insurance fraud detection, credit card
transaction management, revenue management, medical X-ray image analysis – the list
of applications where response time is more important than the correctness of the
results keeps growing.
Of course, in the days of yore of computing, applications were mostly monolithic
and they exchanged a modicum of data, by today's standards, with a handful of backend
databases. Failures arose from hardware and from logical errors that programmers made
when implementing requirements in applications. In distributed systems like the
one shown in Figure 14.1, data loss or data corruption may happen at one or more
points between the mobile devices and the databases. In fact, to reduce the load on the
backend database, multiple instances of this database can be running, leading to
situations where some data may not be updated and it takes some time to sync up (e.g.,
one database keeps orders whereas some other database stores customer transactions,
and the orders do not match the transactions for a given customer). Naturally, these
situations can be avoided if transactional correctness is enforced.
The easiest way to think about transaction execution is in terms of strict consistency,
where each operation is assigned a unique time slot and the result of the execution of
an operation is visible to all operations executed in later time slots.
The third property, isolation, provides the sequential consistency model in the con-
current transaction processing context. Without isolation, race conditions are possible
and they will make data inconsistent. Interestingly, changing the atomicity of a trans-
action, e.g., splitting it into two atomic transactions, will expose its internal state,
which may then be changed by some other concurrently executing transaction, leading
to the loss of data consistency; however, in this case, it would be the fault of the software
engineer who made that decision. For example, a decision to split a flight reservation
into two or more transactions, where each transaction reserves seats for only one leg
of the itinerary, may lead to situations where there are no connecting flights available
to match a seat reserved for some leg. Since many concurrent transactions reserve
seats for different passengers, by the time one transaction reserves a seat, other
transactions may take all seats available on the connecting flights, resulting in wasted
computing effort. However, doing so may be a deliberate part of the design in some
systems. The isolation property is important when two transactions operate on the
same data, e.g., two persons make withdrawals from the same bank account at two
different locations – without isolation, the race condition will lead to an incorrect
balance value for the account. With isolation, locks are placed on data items within the
database to ensure that the result is equal to the one obtained under the sequentially
consistent model.
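The bank-account race can be illustrated with a small Python sketch, where a threading lock stands in for the lock a database places on the data item; the scenario and amounts are made up.

```python
import threading

# Two concurrent withdrawals from the same account; the lock plays the
# role of isolation for the read-check-write sequence on the balance.
balance = 100
lock = threading.Lock()

def withdraw(amount):
    global balance
    with lock:                    # no interleaving inside the critical section
        if balance >= amount:
            balance -= amount

t1 = threading.Thread(target=withdraw, args=(70,))
t2 = threading.Thread(target=withdraw, args=(70,))
t1.start(); t2.start(); t1.join(); t2.join()

# Exactly one withdrawal succeeds, whichever acquires the lock first.
# Without the lock, both could pass the balance check before either
# subtracts, overdrawing the account to -40.
assert balance == 30
```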
Finally, the property durability specifies that the output of a transaction should
be stored on some reliable storage and in the case that this storage experiences certain
types of problems, the stored data should not be lost. There is some interesting interplay
with consistency – the durability of the database may fail and some data may be lost,
but the remaining data will be consistent w.r.t. the database constraints and invariants.
Unfortunately, in the presence of failures this simple protocol does not suffice to
ensure the ACID properties: some nonempty subset of the messages can be lost, and S
and C can crash and then restart at arbitrary times while committing a transaction. A
key premise of the 2PC protocol is that every participant, S and C, of the transaction
has durable storage that it uses to write and read its sequence of transaction steps.
Writing information about the transaction steps to durable storage is not committing
a transaction; it simply creates a log entry that describes the operations. This entry
can be made permanent, i.e., committed, or it can be aborted. These commands are
issued by the program (e.g., a web store) that executes the transaction.
The 2PC protocol accomplishes this in two phases: prepare and
commit. In the prepare phase, C sends a corresponding message to each S (e.g.,
ready) to get the transaction ready. Once an S stores its sequence of steps on its durable
storage, it responds with a message that acknowledges that it is ready (e.g., prepared).
When C receives the message prepared from all Ses, it marks the completion
of the first phase of the protocol and initiates the second phase, where it sends
the message commit to all Ses, which proceed with committing the transaction steps
and, when completed, send acknowledgement messages back to C. Once all
acknowledgement messages are received from the Ses, C records that the transaction
is committed and no further action is required.
When messages are lost or their latency exceeds an acceptable limit, C
uses a timeout to send the message abort to the Ses. If a message sent from C is lost, the Ses
may continue to wait; once the timeout is reached, C will abort the transaction and
restart it. The same reasoning applies when a message from some S to C is lost.
The chosen timeout values vary greatly and may significantly impact the overall
performance of the application.
If one or more of the Ses crash, a few scenarios are possible, depending on whether
the crash happens before or after the transaction is written into the local durable storage,
and on whether (and how soon) the crashed S restarts. Often a crash of an S means that C
does not receive a response, and the situation is handled with the timeout. If an S restarts soon
and it wrote the transaction steps before the crash, it will respond with the message
prepared to C and its crash will be masked. If a crashed S does not restart at all, the
node on which this S runs has likely sustained serious damage and may
have to be replaced, with a possibly serious impact on the application's performance.
The crash of C is handled similarly. If it crashed in the first phase, it simply
repeats that phase after the restart. Since C also records its transaction state on the durable
storage, it can determine whether all Ses responded in the second phase and
proceed accordingly. The simplicity and robustness of 2PC make it a standard
protocol for ensuring the ACID properties in most distributed systems.
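The two phases described above can be sketched in a minimal, in-memory form; the class names, message strings, and the failure flag are illustrative assumptions, not part of any specific implementation (a real 2PC implementation also needs durable logs and timeouts):

```python
# A minimal sketch of the 2PC protocol: C is the Coordinator, each S is a
# Participant. A failed prepare models a crash or a lost message.

class Participant:
    def __init__(self, name, fail_prepare=False):
        self.name = name
        self.log = []                  # stands in for the durable storage
        self.fail_prepare = fail_prepare

    def prepare(self, steps):
        if self.fail_prepare:          # models a crash or a lost message
            return None
        self.log.append(("prepared", steps))
        return "prepared"

    def commit(self):
        self.log.append(("committed",))
        return "ack"

    def abort(self):
        self.log.append(("aborted",))

class Coordinator:
    def run(self, participants, steps):
        # Phase 1 (prepare): ask every S to get the transaction ready.
        votes = [p.prepare(steps) for p in participants]
        if any(v != "prepared" for v in votes):
            for p in participants:     # timeout or failure: abort everywhere
                p.abort()
            return "aborted"
        # Phase 2 (commit): all Ses are prepared, tell them to commit.
        acks = [p.commit() for p in participants]
        return "committed" if all(a == "ack" for a in acks) else "aborted"

servers = [Participant("S1"), Participant("S2")]
print(Coordinator().run(servers, ["UPDATE balance"]))   # committed
servers = [Participant("S1"), Participant("S2", fail_prepare=True)]
print(Coordinator().run(servers, ["UPDATE balance"]))   # aborted
```

Note how a single failed prepare vote forces the coordinator to abort everywhere, which is exactly why 2PC blocks availability when messages are slow or lost.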
Let us consider the complexity of the 2PC protocol in terms of the number of messages
it exchanges, since each message exchange involves some latency and consumes
resources. For N servers, C sends N ready messages, to which the Ses respond with
N prepared messages, bringing the total to 2N messages. The second phase also
involves 2N messages, bringing the total to 4N messages. Writing transaction data into
the durable storage can also be viewed as sending 2N + 1 messages to the filesystem
managers (one write per S in each phase, plus one for C), bringing the total to 6N + 1
in the best-case scenario, without failures
and without considering the overhead of memory and data manipulation. In practice,
with complex distributed server topologies and sophisticated multistep transactions, the
overhead of the 2PC protocol is significant and often intolerable.
Question 6: Discuss an analogy between the CAP properties and three prop-
erties of economies: the cost, the availability, and the quality of goods. How can
one ensure that all three properties are guaranteed, say, in the health market? Analo-
gously, how can one guarantee that it is possible to build a distributed application
with the guaranteed CAP properties?
The CAP theorem was formulated and proved in two variants: for asynchronous
and partially synchronous network models [61]. We state the theorems below and
explain the proof in a way that shows why the statement of the theorem holds true
when one attempts to construct a system with all three properties.
Theorem 14.4.1 (CAP Theorem: asynchronous). An algorithm cannot be constructed
to read and write a data object and its replica using an asynchronous network model
where availability and atomic consistency are guaranteed in all fair executions in
which messages can be lost.
To understand this theorem, and why Dr. Brewer's hypothesis should be formulated
as a theorem in the first place, let us specify precisely the terms used in it. Let
us start with the notion of the algorithm that is constructed – it is a sequence of steps that
involve read and write operations on some data object and its replica object. We assume
that this algorithm always terminates, i.e., it cannot execute forever and it must
produce a result or an error code. We do not specify a time limit on long executions;
however, we assume that such a limit exists, i.e., the execution time is bounded.
A more concrete example of this abstract assumption is that
an algorithm cannot have an infinitely executing loop that produces no results.
Moreover, the operations that this algorithm executes in its steps are atomically
consistent. Using the definitions from Section 14.2, we state that each operation in
the algorithm is a transaction and the sequential consistency model applies, where all
operations are viewed as executing in some agreed sequence on a single node with a
strict happens-before order between them. In this setting, neither a violation of the
order of the operations nor an observation of the intermediate state of a step is possible.
Finally, the notion of a fair execution is a way to state that everything that can
occur during an execution of the algorithm is valid to assume in each
specific execution. One can enumerate all possible execution scenarios of the algorithm,
including exceptions and wrong inputs, and view each of the enumerated executions
as a valid possibility. Since all communications between objects in
a distributed system are performed by exchanging messages, it is valid to assume that
each message may be lost in a fair execution. Lost messages are modeled by
erecting a wall, or partition, between distributed objects that prevents these messages
from passing; hence we use the term network partitioning.
Of course, there is a problem in enforcing network partitioning and availability
simultaneously: when a message is lost, the algorithm cannot wait forever for a response.
Therefore, to make this context consistent, we require that the algorithm respond
even when messages are lost, while preserving atomic consistency. One can think of a
message-sending node producing a random or sloppily computed response when some
predefined timeout for receiving a reply to its message expires.
The context for the proof of the CAP theorem for asynchronous networks is shown in
Figure 14.2. A partition, shown as a solid black vertical line, divides the distributed
system into two subsystems, G1 and G2. The algorithm is implemented in the following
way. Each partition has a database that stores a data object, V, in one partition and
its replica in the other partition. The initial value of this data object is v0 in both
partitions. The service S writes a new value, v1, in the partition G1 and the service
P reads the value of the replica in partition G2. Solid black arrows to and from the
services designate the write and read operations in an exactly defined sequence: first
comes the write operation and then the read operation. The states of the objects are
synchronized using messages that are shown with the gray block arrow that crosses the
divide. The network partitioning is indicated by the fat red circle on the divide that
results in the loss of some messages.

Figure 14.2: The context of the CAP theorem.
Suppose that there is an implementation of the algorithm in which all three properties
hold, i.e., a contradiction to the statement of the CAP theorem. This means the following.
The initial state of the data object and its replica is the value v0. Then the service
S performs the transaction write, which changes the value of the data object to v1, with
the subsequent termination of the write operation. The availability assumption
means that the write operation must terminate at some point in time and the service S
must receive a message about the successful completion of the operation.

Next, the service P performs the transaction read on the data object replica. Since
the availability assumption holds, P must receive the value of the data object
replica. The data object and its replica can update their states only using messages that
cross the divide, so if a message that carries the value of the transaction write, i.e., v1, is
lost, then the algorithm either waits forever until the data object replica is updated or it
returns the value v0 to P. The former violates the availability assumption and the latter
violates the atomic consistency assumption. At the same time, assuming that no update
messages that cross the divide are lost violates the fair execution assumption under
network partitioning. This completes the proof.
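The proof's setting can be mimicked with a toy model, assuming plain dictionaries stand in for the stores in G1 and G2 and a boolean stands in for whether the update message crosses the divide:

```python
# A toy model of the setting in Figure 14.2: service S writes v1 into G1,
# service P reads the replica in G2. If the update message is lost, P must
# either wait forever (violating availability) or return the stale v0
# (violating atomic consistency); here it returns v0.

def run(partition_healed):
    g1 = {"V": "v0"}
    g2 = {"V": "v0"}           # replica of the data object V

    # Service S writes v1 into G1; availability says this must terminate.
    g1["V"] = "v1"

    # The update message crosses the divide only if no partition occurred.
    if partition_healed:
        g2["V"] = g1["V"]

    # Service P reads the replica; availability says it must get a value.
    return g2["V"]

print(run(partition_healed=True))   # v1: no partition, atomically consistent
print(run(partition_healed=False))  # v0: available but not consistent
```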
A variant of the distributed system for the CAP theorem is a partially synchronous
distributed system, where the notion of local time is introduced by assigning an abstract
clock to each distributed node and assuming that there is no divergence between these
clocks. That is, the time interval that passes in each node since some event occurs is the
same. The purpose is not to define a global time, but to enable each node to determine
how much time has passed since an event occurred. The idea of using intervals in the proof
is to determine when one transaction completes and the other begins. The format of
the proof is similar to the asynchronous network proof, restricted to some time interval.
Question 7: Write down and explain the proof of the CAP theorem for a
partially synchronous distributed system.
The CAP theorem directs the designers of distributed systems to consider the
following classes of these systems: consistent with partitions (CP), available with par-
titions (AP), and consistent with availability (CA). In CP distributed systems, ter-
mination is not guaranteed, since the operations wait until all messages are delivered
and all transactions execute. The 2PC protocol guarantees the ACID properties; how-
ever, the system may not be available due to waits and the high cost of transactions.
Opposite to it, AP distributed systems relax the consistency requirement, so a possible
result is that the clients will receive stale values, which can be viewed as a result of the
out-of-order execution of operations. Finally, CA systems are not distributed, since
they assume that network partitioning events cannot happen.
Question 8: Assuming that we can quantify the willingness to pay for highly
available services, design an algorithm that can automatically balance the CAP
properties for a distributed application.
for the absolute majority of the users. However, delays in presenting information to
these users will lead them to switch to other, more available platform providers, thus
causing significant revenue losses for less available commercial organizations
and businesses. Hence, availability trumps consistency.
E-commerce application design is often guided to a certain degree by the 15-second
rule, which roughly states that if a website cannot interest the user in the first 15 seconds,
then many users will switch to the competition or leave the website altogether. A high
bounce rate of users is indicative of a poorly constructed e-commerce platform. It is
also widely believed that if the result of a query is not produced within some time
interval, usually much less than 15 seconds, users will bounce from the website.
Therefore, even though the network can be very fast, the latency from enforcing strict
consistency may lead to a high bounce rate and the downfall of a business. For example,
Amazon's SLA rule limits the response time to 300ms for its web-based application
[42]. According to Dr. Vogels, who oversaw the design of the e-commerce platform at
Amazon: “We do not want to relax consistency. Reality, however, forces us to.”
work partitioning and applied later when network access is restored. Second, replicas
or logs are used to record information about operations that cannot be performed due
to network partitioning, or that are avoided because applying them would lead to poorer
performance. As a result, some users will obtain an inconsistent state of some objects
during some intervals of time. The job of the software engineers is to determine how
to resolve inconsistencies while keeping their applications highly available.
14.6 Summary
In this chapter, we introduced the context of the CAP theorem, showed its proof, and
discussed the implications of the CAP theorem on the design and development of large
distributed applications. We showed how the properties of consistency, availability, and
network partitioning are played against one another to obtain a distributed application
that provides results of computations to users with high performance while allowing
some inconsistency in the computations. The deeper issues lie in selecting consistency
models and the corresponding transactional support for distributed applications – de-
pending on the mix of operations on data objects and the level of network partitioning,
even high-throughput networks will not help applications increase their availability,
since providing the desired level of consistency would take longer than some allowed
latency threshold. We will use the framework introduced in the context of the CAP
theorem to review relaxed consistency models and we will see how they are used in
distributed applications.
Chapter 15
Soft State Replication
The CAP theorem serves as a theoretical underpinning for determining trade-offs be-
tween consistency and availability when designing large-scale distributed applications.
A transaction that locks multiple resources until the 2PC protocol commits the trans-
action is often unrealistic in a setting where clients must receive responses to their
requests within a few hundred milliseconds. There are multiple technical solutions for
implementing weaker consistency models: selectively lock data objects to enable opera-
tions on these data objects to complete without waiting on one another, replicate data
objects and assign certain operations to specific data objects while keeping the metain-
formation about these operations in a log (e.g., perform reads on one data object and
writes on some other data object with a log keeping timestamps of these operations),
or simply store all operations in log files and apply them in a large batch operation at
some later point. One way or another, the clients receive fast and sometimes incorrect
response values to their requests, and correctness will be achieved at a future time.
The value of strict consistency guaranteed by ACID transactions is often over-
estimated, even in the domain of financial transactions. Suppose that two clients ac-
cess the same bank account from two ATMs that are located across the world
from each other. One client credits the account whereas the other debits it. Apply-
ing the 2PC protocol to allow these transactions to leave the account's balance in a
consistent state may delay one transaction at the expense of the other.
If this seems like no big deal for a couple of clients, think of a large organization
that sells its product directly via its online store, with thousands of purchase and return
transactions performed each second. Creating multiple replicas of the organization's
bank account and allowing these transactions to proceed in parallel without any locking
will result in inconsistent states of the replicas; however, eventually these replicas
will be synchronized and their states may reach the consistent global state, say, on the
weekend when the store is closed. However, if the store is never closed, the synchro-
nization of the replicas will continue in parallel with the business transactions, perhaps
at a slower pace to avoid conflicts, and the entire application will fluctuate between var-
ious degrees of consistency, getting closer to the expected global state at some times
and farther away from it at others, eventually converging to the expected
global state if all business transactions stop and the application reaches a quiescent state.
This is the essence of the convergence property [64], better known as
eventual consistency [80, 149], which we will study in this chapter.
Question 1: For a set of five processors that execute read (R) and write (W) op-
erations, where R(x)v reads the value, v of the shared variable, x and W (y)q writes
the value, q into the shared variable, y, construct timelines with these operations
that satisfy the strict or the sequential consistency models.
One of the weaker consistency models is called causal consistency, where the
causality between operations that occur in a distributed application is defined by mes-
sages that the objects performing these operations send to each other [86]. That is, if
some object, ok, sends a message to read the data from the object om and at some future
time the object ok sends a message to write some data into the object om, then the read-
ing event precedes the writing event. However, if some object, op, sends a message to
write some data into the object om and there are no messages exchanged between the
objects ok and op, then it is uncertain which value the object ok will receive in response
to its read message – the one from before or after the write message from the object
op was executed. Therefore, the causal consistency model is weaker, since it allows
distributed objects to operate on stale data.
To illustrate causal consistency, suppose that the processor (or a client) P1 writes
values a and c into the same memory location, x, i.e., W (x)(a), . . . ,W (x)(c). Indepen-
dently, the processor (or a client) P2 writes the value b into the same memory location,
x, i.e., . . . ,W (x)(b). The only order that we can infer from this description is the order
between the write operations on P1 . There are no read operations in P2 of the data writ-
ten by P1 and there is no other processor that can establish additional causal relations
between these operations. Therefore, the following orders are all valid as seen by some
other processors: R(x)(b), R(x)(a), R(x)(c); or the order R(x)(a), R(x)(b), R(x)(c); or
the order R(x)(a), R(x)(c), R(x)(b) – the order of reads in which values a and c appear
remains the same as specified in P1 .
Now, let us assume that the processor P2 executes the operation R(x)(a) before its
write operation. Doing so establishes a causal order between the write operation in P1
and the read operation in P2 . Subsequently, the operation W (x)(b) in P2 is causally
ordered after the read operation. Hence, the global history R(x)(b), R(x)(a), R(x)(c) is
no longer valid under the causal consistency model.
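The reasoning above can be checked mechanically; the `valid` function and the single-letter value names are illustrative:

```python
# An observed order of reads is causally valid if it preserves every causal
# edge. In the first scenario the only edge is W(x)a -> W(x)c (P1's program
# order); in the second scenario P2's R(x)a adds the edge W(x)a -> W(x)b.

def valid(observed, causal_edges):
    pos = {v: i for i, v in enumerate(observed)}
    return all(pos[u] < pos[v] for u, v in causal_edges)

edges = [("a", "c")]                     # only P1's program order
print(valid(["b", "a", "c"], edges))     # True
print(valid(["a", "c", "b"], edges))     # True
print(valid(["c", "a", "b"], edges))     # False: c cannot precede a

edges = [("a", "c"), ("a", "b")]         # P2 read a before writing b
print(valid(["b", "a", "c"], edges))     # False under causal consistency
```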
Next in the descending order of strength is the consistency model called
First In First Out (FIFO) or Parallel RAM (PRAM) consistency, also known as the pro-
gram order or object consistency model [90, 116]. In this model, the order in which each
process performs its read and write operations is preserved; however, nothing is guaran-
teed about the order in which operations from different processes are interleaved.
That is, the local histories are preserved for each process and the global history is not
defined. Applying the FIFO consistency model to the ATM example, each client will
see the exact order in which her debit and credit operations were performed on the
distributed account object; however, the order in which these operations are merged is
not guaranteed to be the desired one. Suppose that one client withdraws $100 and then
deposits $1000 from the account that has $100, while the other client withdraws $500
and then deposits $100. If the operations are performed in the described sequence, the
account balance never drops below zero; however, it is possible that after the first client
withdraws $100 and the balance drops to zero, the other client withdraws $500, making
the account delinquent. Thus, FIFO consistency can be viewed as a further relaxation
of the ACID properties, where transactions are not atomic and their internal states are exposed.
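The ATM example above can be sketched as follows; the `apply` function and the operation lists are illustrative, with deposits positive and withdrawals negative:

```python
# Under FIFO consistency each client's local order is preserved, but the
# global interleaving of the two clients' operations is unconstrained.

def apply(balance, ops):
    history = []
    for amount in ops:       # apply each debit/credit in the given order
        balance += amount
        history.append(balance)
    return history

client1 = [-100, +1000]      # withdraw $100, then deposit $1000
client2 = [-500, +100]       # withdraw $500, then deposit $100

# A FIFO-legal interleaving: c1 withdraws, c2 withdraws, then the deposits.
print(apply(100, [-100, -500, +1000, +100]))  # [0, -500, 500, 600]
# The intended sequence never drops below zero:
print(apply(100, [-100, +1000, -500, +100]))  # [0, 1000, 500, 600]
```

Both interleavings respect each client's local order, yet the first one briefly makes the account delinquent, exactly as the text describes.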
Question 2: For a set of five processors that execute read (R) and write (W) op-
erations, where R(x)v reads the value, v of the shared variable, x and W (y)q writes
the value, q into the shared variable, y, construct timelines with these operations
that satisfy either the causal or the FIFO consistency models, but not both.
Even weaker is the cache consistency model, in which a read operation on a
shared memory location or a shared data object may return the value most recently
written by some other operation, not necessarily the value its own process wrote last.
In this model, out-of-order executions are possible, since operations are concurrent,
they are not synchronized, and even the internal per-process operation order is not
preserved [59, 102]. Think of a memory consistency model as a contract between the
low-level hardware layer and the higher-level software layer that contains operations
that are executed by the underlying hardware. The hardware may have additional
low-latency stores that we call caches (e.g., L1-L4 caches) that can hold replicas of
values whose main store is in RAM. Suppose that an operation updates the value stored
in the cache, but the corresponding location in the RAM is not updated instantly; there
will be some time lag before that happens. That is, the location in the RAM stores the
most recent value written by some other operation, not by the operation that updated
the value in the cache. In the meantime, the next read operation may read the value
from the RAM and not from the cache. As a result, the consistency is significantly
weakened when compared to other models like the sequential, causal, and even FIFO
consistency models. Of course, if cache coherence is added to the cache consistency
model, then all cache updates will be propagated to their original locations before
processes read their values. Combinations of these consistency models are also possible,
and their implementations differ in how difficult it is to enforce the constraints of these
models on data replicas.
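The cache/RAM lag described above can be modeled with a toy sketch; the `Memory` class and its single location are illustrative assumptions:

```python
# A toy model of the cache/RAM lag: a write lands in the cache first and
# reaches RAM only after an explicit flush (standing in for coherence).

class Memory:
    def __init__(self, value):
        self.ram = value
        self.cache = {}            # location -> pending cached value

    def write(self, value):
        self.cache["x"] = value    # the update is cached; RAM lags behind

    def read_ram(self):
        return self.ram            # may return a stale value

    def flush(self):               # cache coherence propagates the update
        if "x" in self.cache:
            self.ram = self.cache.pop("x")

m = Memory("old")
m.write("new")
print(m.read_ram())  # old: the read goes to RAM and sees stale data
m.flush()
print(m.read_ram())  # new: after propagation the value is consistent
```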
following activities: find an item, put it in the shopping cart, select the checkout, enter
the payment and shipping information, apply all relevant coupons, then push the
button labeled Purchase. The result of this happy path is an email arriving within an
hour in the customer's mailbox, stating that the item is being shipped to the provided
address and providing the breakdown of the charges.
This simple and straightforward happy path hides many technical issues, since it
may be executed by hundreds of thousands of clients per second. Consider the following
services that must be used in the happy path transaction: warehouse checks, delivery
date confirmation, legal and export restrictions, communications with external vendors,
credit card charge processing, shipping rate checks, calculation of all applicable taxes,
and determining the type of packaging, to name but a few. Some transactions may
involve RPCs with hundreds of remote objects. Yet, the SLA guarantees that the cus-
tomer will receive a response that confirms the transaction, or shows why it failed, in
less than, say, 300ms. Clearly, a straightforward solution of connecting each customer
directly to the backend database and applying the 2PC protocol will result in a serious
performance penalty that violates the SLA.
Question 4: Does it make sense to apply the 2PC protocol only to a part of the
happy path in a distributed application?
whose methods implement various algorithms to process data to deliver the expected
functionality to clients. Finally, the third tier, or the back end, contains data objects that
interact with a durable storage. Caches and hints are located in the first and second
tiers to prevent, whenever possible, expensive operations on the objects in the back end.
Caches short-circuit repetitive expensive computations by storing in memory the
result of a previous computation in the first/second tier and delivering it to the client
instead of repeating this computation. Using caches is rooted in the principle of locality,
which states that the same values or related storage locations are frequently accessed by
the same application, depending on the memory access pattern [87]. By storing the
relation ( f , i) → f (i) as a memory-based map with f and i as keys, the previously
computed value of f (i) can be retrieved using a fast memory lookup. Of course, a cache
may be limited in size, and storing a smaller number of pre-computed values may result
in cache misses, which incur some additional computation penalty; recomputing
f for these missed entries at runtime may be quite expensive, and performance will
worsen correspondingly.
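The relation ( f , i) → f (i) can be sketched as a small bounded cache; the `cached` wrapper and its LRU eviction policy are illustrative assumptions (a production cache would also handle multiple arguments and concurrency):

```python
# A minimal cache for the relation (f, i) -> f(i): a memory-based map
# keyed by the function name and its argument, with a size bound so that
# evictions can cause cache misses.

from collections import OrderedDict

def cached(f, maxsize=2):
    store = OrderedDict()
    def wrapper(i):
        key = (f.__name__, i)
        if key in store:
            store.move_to_end(key)      # cache hit: mark as recently used
            return store[key]
        value = f(i)                    # cache miss: recompute f(i)
        store[key] = value
        if len(store) > maxsize:
            store.popitem(last=False)   # evict the least recently used
        return value
    return wrapper

calls = []
def expensive(i):
    calls.append(i)                     # track how often f really runs
    return i * i

fast = cached(expensive)
print(fast(3), fast(3))  # 9 9 -- the second call is a fast memory lookup
print(len(calls))        # 1 -- expensive ran only once
```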
Question 5: Given a legacy application, explain how would you approach its
redesign to use caches.
Unlike a cache, a hint is the saved result of some computation that may be
wrong, and it is not always stored as a map. Consider a situation where a shipping charge
is computed for an item ordered from a web store. Instead of contacting a shipping
service with the zip code to which the item is going to be shipped, it is possible to look
up the results of other computations that run in parallel to this one and select
the shipping charge of the one whose zip code is the closest to the given location.
Thus, the hint short-circuits an expensive computation; it may return an incorrect result,
but the eventually corrected shipping charge will not differ significantly from the
one that was substituted from the closest zip code order. Even though a hint may be
wrong, eventually checking its correctness and replacing the value is convenient and
improves the performance of the application. Together with caches, hints make distributed
applications run faster. Deciding when to use caches, hints, and replicas is a design
decision. Therefore, replicating a service without full synchronization forces designers
to use hints and caches, and stale data will be returned to clients.
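The zip-code hint can be sketched as follows; the zip codes, charges, and the numeric-distance heuristic are made-up illustrations (real proximity would use geographic data, and the hinted charge would be corrected later):

```python
# A sketch of the shipping-charge hint: instead of querying the shipping
# service, reuse the charge computed in parallel for the "closest" zip
# code. Closeness by numeric difference is a deliberately crude stand-in.

known_charges = {60601: 7.50, 94103: 12.00, 10001: 9.25}

def shipping_hint(zip_code):
    nearest = min(known_charges, key=lambda z: abs(z - zip_code))
    return known_charges[nearest]   # possibly wrong; corrected eventually

print(shipping_hint(60611))  # -> 7.5, the hint taken from 60601
```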
are “consistent” with each other. By this we mean that given a cessation of update
activity to any entry, and enough time for each DBMP to communicate with all other
DBMPs, then the state of that entry (its existence and value) will be identical in all
copies of the database” [79]. We adopt this statement, with small modifications, as the
definition of eventual consistency. An application is eventually consistent if it makes
a transition to a soft state and then reaches its hard state, in which all replicas are
identical, at some time in the future upon the cessation of all update activities.
prefix. A consistent prefix is a function of the bounded staleness constraint, which dic-
tates a certain time limit after which all operations become fully consistent. Services
see different prefixes during the bounded staleness time interval; the resulting guarantees
fall, generally, into four categories.
Read your own writes means that if the service performed a write operation before it
performed a read operation, then the read will return the value written by that
write. The service will never obtain a value written by an out-of-sequence write
operation, even though it may obtain a value written by some other service
that executes concurrently. This is an eventual consistency guarantee that is
implemented in the Facebook Corporation's cloud.
Writes follow reads is an application of causal consistency specifying that if a write
operation follows some previous read operation on the same data object by the
same service, then this write will be performed on the same or a more recent
value of the data object than the one obtained by the previous read. An application
of this guarantee is manifested in the ability to post a reply to a message on some
social media platform that will be shown after the posted message, not before it.
Monotonic reads guarantees that once a value is obtained by some read operation, it
will also be obtained as part of the prefix by all services. Essentially, if there is a
write operation that updated the value of a data object such that the service s sees
this value in a following read operation, then this write and its effect will also be
seen by some other service p. A violation of the monotonic reads guarantee
would be a situation where services obtain different values of a previous write
operation as time goes by.
Monotonic writes guarantees the order of writes on a data object by the same service.
It is also the most difficult guarantee to implement, since it involves a nontrivial
amount of coordination among services to converge to the monotonically written state.
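The first of these guarantees can be sketched with a session object that overlays its own writes on a possibly stale replica; the `Session` class and the shopping-cart values are illustrative assumptions:

```python
# A sketch of read-your-own-writes: the session remembers its own last
# writes and never returns an older value from a stale replica, although
# other sessions reading the replica directly may still see stale data.

class Session:
    def __init__(self, replica):
        self.replica = replica        # possibly stale replica state
        self.own_writes = {}

    def write(self, key, value):
        self.own_writes[key] = value  # propagated to the replica eventually

    def read(self, key):
        # Prefer the session's own write over the stale replica value.
        return self.own_writes.get(key, self.replica.get(key))

replica = {"cart": "empty"}
s = Session(replica)
s.write("cart", "1 item")
print(s.read("cart"))        # 1 item, even though the replica is stale
print(replica["cart"])       # empty: other sessions may still see this
```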
Figure 15.1: A sales ACID transaction that updates data in three database tables.
1 Begin
2 INSERT INTO Order (Id, Seller, Buyer, Amount, Item)
3 Queue Message("UPDATE Vendor", Balance+$Amount, Seller)
4 Queue Message("UPDATE Client", Balance-$Amount, Buyer)
5 Commit
Figure 15.2: A transformed sales transaction that sends messages to update data.
item by the given amount, and the corresponding update statement in line 4 subtracts
this amount from the balance of the client who is the buyer. These three tables are
represented by distributed data objects that are located on different server nodes. Per-
forming this transaction with ACID guarantees will lock these tables for the duration
of the transaction.
The first step of converting this transaction into a BASE update is to break it into
two or three transactions with one SQL statement in each. Besides exposing the inter-
nal state to many other transactions, we still have the problem of locking the distributed
objects for the duration of their updates, thereby preventing other transactions from
reading/writing these objects. What we need is to inject some operation that allows us
to convert an SQL statement into a message that will eventually be delivered and exe-
cuted on some replicas of the distributed objects with some guarantees of eventual
consistency.
Consider the transformed transaction that is shown in Figure 15.2. Updates are con-
verted into messages that take the necessary parameter values, and these messages are
sent and queued by some MQM. Of course, the state of the application becomes soft,
and to converge to the hard state the objects will have to be synchronized using some
algorithm, for example, the one that is shown in Figure 15.3.
The idea of this synchronization algorithm is to update the tables Vendor and Client
at some later time to make them consistent with the table Order. This is one possible syn-
chronization algorithm; it may not be applicable to many applications, depending on
their requirements for converging to consistent data. However, it
illustrates how eventual consistency can be implemented in a generic way using an MQM.
1 For each Message in the Queue Do
2 Peek Message
3 If there exists a transaction in Order with Message.Seller
4 Find Message in Queue with the corresponding Buyer
5 Begin
6 UPDATE Vendor with the corresponding parameters
7 UPDATE Client with the corresponding parameters
8 Commit
9 If Commit is Success, remove processed Messages
10 End For
Figure 15.3: A synchronization algorithm that updates the tables Vendor and Client to make them consistent with the table Order.
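A runnable sketch of this queued-update scheme might look as follows; the table names follow the figures, while the data structures, account names, and amounts are illustrative assumptions:

```python
# A sketch of the BASE transformation: the sale commits only the Order
# row and queues the two balance updates; a synchronizer drains the queue
# later, converging the soft state to the hard state.

from collections import deque

orders = []
vendor = {"acme": 0}         # stands in for the table Vendor
client = {"bob": 500}        # stands in for the table Client
queue = deque()              # stands in for the MQM

def sale(seller, buyer, amount, item):
    orders.append((seller, buyer, amount, item))      # committed now
    queue.append(("UPDATE Vendor", seller, +amount))  # applied later
    queue.append(("UPDATE Client", buyer, -amount))

def synchronize():
    while queue:
        table, who, delta = queue.popleft()
        target = vendor if table == "UPDATE Vendor" else client
        target[who] += delta

sale("acme", "bob", 100, "widget")
print(vendor["acme"], client["bob"])   # 0 500 -- soft state, still stale
synchronize()
print(vendor["acme"], client["bob"])   # 100 400 -- converged hard state
```

Between the sale and the synchronization, readers of Vendor and Client see stale balances, which is exactly the soft state the chapter describes.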
Of course, one big problem with this approach is that it is not scalable: the central
coordinator object introduces a serious bottleneck (i.e., an object that causes a serious
performance degradation of the entire application) and a single point of failure.
Essentially, in a continuously running reactive distributed application, applying
this variant of the 2PC protocol will lead to a loss of availability and serious perfor-
mance degradation. Thus, other protocols are needed that synchronize objects with
their replicas with little overhead.
Question 10: Discuss the pros and cons of using the 2PC protocol only for
a subset of the functionality of some distributed application to keep the states of
some distributed objects in sync.
Question 11: Discuss the overhead of keeping and comparing many versions
of distributed objects. How can you reduce this overhead?
Another solution is to combine the idea of the logical clock with a centralized
sequencer object. Each update is sent to the sequencer object, which assigns a unique
sequence integer to the update and forwards the update message to all replicas, which
apply updates in the order of the sequence numbers assigned to them. Logical
clock-based algorithms are used in a number of cloud computing systems, as we will
learn later in the book.
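The sequencer scheme just described can be sketched as follows. The class and method names are hypothetical, and a real system would also need to handle sequencer failover and lost messages, which this sketch ignores.

```python
import itertools

class Sequencer:
    """Central sequencer: assigns a globally unique, increasing
    sequence number to each update and forwards it to all replicas."""
    def __init__(self, replicas):
        self.counter = itertools.count(1)
        self.replicas = replicas

    def submit(self, update):
        seq = next(self.counter)
        for r in self.replicas:
            r.deliver(seq, update)

class Replica:
    """Applies updates strictly in sequence-number order, buffering
    any update that arrives ahead of its turn."""
    def __init__(self):
        self.next_seq = 1
        self.pending = {}
        self.log = []

    def deliver(self, seq, update):
        self.pending[seq] = update
        while self.next_seq in self.pending:
            self.log.append(self.pending.pop(self.next_seq))
            self.next_seq += 1

replicas = [Replica(), Replica()]
seq = Sequencer(replicas)
seq.submit("set x=1")
seq.submit("set x=2")
# Every replica applies the updates in the same total order.
```

Because the sequence numbers define a single total order, all replicas converge to the same state, at the price of making the sequencer itself a potential bottleneck.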
A class of epidemic-inspired message exchanges can be broken into three categories:
direct, anti-entropy, and rumor-mongering [43]. The problem with sending
many messages between distributed objects and their replicas is that it generates dense
network traffic as the number of object replicas increases, and it takes significant
time to propagate an update to all nodes. Instead, the idea of epidemic propagation is
based on a limited number of messages that are sent from the updated object to its
replicas and to the other objects that need to be informed about the update.
In epidemic-inspired algorithms, distributed objects are viewed from the SIR perspective,
where Susceptible are the nodes that have not received an update message,
Infective are the nodes that received an update and are ready to send it to other susceptible
nodes, and Removed are the nodes that received an update but will no longer
send it to the other nodes. In the direct mail epidemic algorithm, each infected object
sends an update message to all other susceptible objects on its contact list. In the
anti-entropy algorithm, every object regularly chooses some other susceptible object
at random and sends it an update. Finally, with rumor-mongering algorithms, an update
message is shared while it is considered hot and timely: the more objects receive it,
the less it spreads. Benefits of epidemic-inspired algorithms include high availability,
fault tolerance, a tunable propagation rate, and the guaranteed convergence of all
replicas to a consistent state.
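A minimal simulation of the anti-entropy style of exchange might look like the following, assuming a push-only variant in which each node pushes its newest version to one random peer per round; the node names and the (version, value) encoding are illustrative.

```python
import random

def anti_entropy_round(nodes, state):
    """One anti-entropy round: every node picks a random peer and
    pushes its value; 'state' maps node -> (version, value)."""
    for node in list(nodes):
        peer = random.choice([n for n in nodes if n != node])
        # Push-style exchange: the peer keeps whichever version is newer.
        if state[node][0] > state[peer][0]:
            state[peer] = state[node]

nodes = ["a", "b", "c", "d"]
state = {n: (0, None) for n in nodes}
state["a"] = (1, "update-1")          # node "a" is initially infective
rounds = 0
while any(v != (1, "update-1") for v in state.values()):
    anti_entropy_round(nodes, state)
    rounds += 1
# With probability 1 the update eventually reaches every node.
```

The number of rounds is random, but because the set of infected nodes can only grow, convergence is guaranteed with probability 1, which mirrors the eventual-consistency guarantee discussed above.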
Each computing entity must run an Astrolabe agent, a program that enables
these computing entities to exchange information within the zone using an epidemic
gossip protocol. Thus, there is no central distributed object that must collect the
information and update zones and computing entities; with no such single point of
failure, Astrolabe is robust.
In Astrolabe, each zone periodically chooses some of its sibling zones at random,
and these zones exchange MIBs, keeping the one with the most recent timestamp.
The first issue is the propagation of information about zone membership:
in a dynamic open environment, computing entities join and leave frequently and at
arbitrary times, meaning that a significant amount of membership
traffic can be generated. The hierarchical organization allows Astrolabe
to prevent this information from propagating system-wide, keeping it localized within
particular zone hierarchies.
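The timestamp-based MIB exchange can be sketched as below. The attribute names and the map-of-(timestamp, value) representation are assumptions made for illustration, not Astrolabe's actual data structures.

```python
def gossip_exchange(zone_a, zone_b):
    """One Astrolabe-style exchange between two sibling zones: for
    each attribute, both sides keep the MIB entry with the most
    recent timestamp. MIBs map attribute -> (timestamp, value)."""
    for key in set(zone_a) | set(zone_b):
        ts_a, _ = zone_a.get(key, (0, None))
        ts_b, _ = zone_b.get(key, (0, None))
        newest = zone_a[key] if ts_a >= ts_b else zone_b[key]
        zone_a[key] = newest
        zone_b[key] = newest

# Hypothetical MIB copies held by two sibling zones.
mib1 = {"load": (10, 0.7)}
mib2 = {"load": (12, 0.4), "free_disk": (5, "120GB")}
gossip_exchange(mib1, mib2)
# Both zones now hold the newest value of every attribute.
```

After enough random pairings, every sibling ends up with the most recent entry for each attribute, without any central collector.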
Astrolabe uses the following rule for eventual consistency: “given an aggregate
attribute X that depends on some other attribute Y, Astrolabe guarantees with probability
1 that when an update u is made to Y, either u itself, or an update to Y made
after u, is eventually reflected in X.” Interestingly, no guarantees are given on the
time bounds for eventually consistent update propagation; the word “eventual”
specifies only that it will happen at some point in the future. To see the results of the
experimental evaluation of Astrolabe, readers are referred to the original paper.
15.8 Summary
In this chapter, we studied the concepts of soft state replication and eventual consistency
in depth. First, we learned about different consistency and data replication models
and their costs and benefits. Then we looked into different techniques for the software
design of highly available applications, where strict consistency models are relaxed and
possibly inconsistent results of the computations are accepted. Next, we reviewed the
concept of eventual consistency and studied how it is implemented in eBay's BASE using
MQM. In a separate section, we reviewed salient features of the open-source AMQP for MQM.
Of course, at some point inconsistent states must be resolved or converged towards
consistent ones, and this is why we gave an overview of some protocols for computing
convergent states. Finally, we studied how some of these protocols are used in Astrolabe,
an influential experimental platform for hosting eventually consistent distributed
applications, and how the hierarchical organization of the system improves its scalability.
Chapter 16
Facebook Cloud
In this chapter, we reconstruct the internal operations of the Facebook cloud using
peer-reviewed publications by Facebook engineers and scientists at competitive scientific
and engineering conferences. First, we will use a presentation by engineering
managers Messrs. O'Sullivan and Legnitto and by Dr. Coons, a software product stability
engineer working on developer infrastructure at Facebook's scale. Also, we include
information from an academic paper on the development and deployment of software at
Facebook authored by Prof. Dror Feitelson of the Hebrew University, Israel, Dr. Eitan
Frachtenberg, a research scientist at Facebook, and Kent Beck, an engineer at Facebook
[48]. Next, we will study what consistency model is used at Facebook and how
eventual consistency is measured there. After that, we will review configuration
management at Facebook and learn about its tool called Kraken for improving the
performance of web services. Finally, we will study an in-house built VM called HipHop
VM for executing the code of Personal Home Page or PHP: Hypertext Preprocessor
(PHP) pages on the Facebook cloud.
system. One Git repository is used per platform at Facebook, whose engineers run
more than one million source control commands per day with over 100,000 commits
per week. Facebook cloud servers are used to store source code objects in a source
control system: first SVN, then Git, and later Mercurial. This is important, since with
such a high commit frequency the development infrastructure must
provide quick operations, so that developers do not waste their time on operations that
result from concurrent changes and delays in merging them.
Suppose that ten developers modify the source code of the same application and
commit and push their changes at the same time. These concurrent updates are
likely to conflict, since the changes may be applied to the same statements and
operations. Naturally, the push operation fails, and the developers must merge their
changes and repeat the commit/push. Some of them may do it faster than the others,
leaving the latter to repeat the same pull, merge, and commit/push cycle.
By implementing server-side rebases, pushes are simply added
to a queue and processed in order, ensuring that developers do not waste time getting stuck
in a push/pull loop while waiting for an opportunity to push cleanly. The majority
of rebases can be performed without conflict; when a conflict does occur, the developer
is notified to handle the merge manually and submit a new push. It also helps that
Facebook has a very extensive automated testing suite (also discussed in that
presentation), so its engineers can have greater confidence in the integrity of automated merging.
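A toy model of this server-side rebase queue, with hypothetical names, can make the flow concrete; the `conflicts` flag stands in for the real rebase machinery deciding whether a commit replays cleanly on the new head.

```python
from collections import deque

class PushQueue:
    """Sketch of server-side rebasing: pushes join a queue and are
    rebased and applied in order instead of failing on a stale parent."""
    def __init__(self):
        self.history = ["base"]   # committed history on the server
        self.queue = deque()      # pending pushes, processed in order

    def push(self, commit, conflicts=False):
        self.queue.append((commit, conflicts))

    def process(self):
        rejected = []
        while self.queue:
            commit, conflicts = self.queue.popleft()
            if conflicts:
                # Conflicting rebase: notify the developer to merge
                # manually and push again.
                rejected.append(commit)
            else:
                # Clean rebase: replay the commit on the new head.
                self.history.append(commit)
        return rejected

q = PushQueue()
q.push("alice:fix-login")
q.push("bob:update-docs")
q.push("carol:refactor", conflicts=True)
rejected = q.process()
```

The point of the queue is that Alice and Bob never have to race each other: their pushes land in arrival order, and only the genuinely conflicting push bounces back to its author.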
The Facebook web-based application is highly representative of the development and
deployment process of cloud-based companies and organizations. To be competitive,
it is important for businesses to try new features to see how customers like them and
then to provide feedback quickly to developers and other stakeholders. The idea is
to split the users of a cloud-based application into two groups, A and B, where
group A is the control group whose users keep using the baseline application and
group B is the treatment group whose users are exposed to new features of the
same application. This process is called A/B testing [44]. Various measurements are
collected from both groups and then compared statistically to determine whether
these measurements improved w.r.t. certain predefined criteria. If the improvement is
deemed statistically significant, then the new features are rolled out to all users.
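The statistical comparison of the two groups can be as simple as a two-sample z-test on the collected metric. The following is a generic sketch of that idea, not Facebook's actual analysis pipeline, and the click-through numbers are made up.

```python
import math

def ab_ztest(mean_a, var_a, n_a, mean_b, var_b, n_b):
    """Two-sample z-statistic comparing a metric between the
    control group A and the treatment group B."""
    se = math.sqrt(var_a / n_a + var_b / n_b)
    return (mean_b - mean_a) / se

# Hypothetical click-through rates collected from both groups
# (for a 0/1 metric, the variance is p * (1 - p)).
z = ab_ztest(mean_a=0.030, var_a=0.0291, n_a=100_000,
             mean_b=0.032, var_b=0.0310, n_b=100_000)
significant = abs(z) > 1.96   # roughly a 5% significance level
```

With these sample sizes even a 0.2-percentage-point lift clears the 1.96 threshold, which is precisely why A/B testing is so effective at cloud scale: billions of requests make tiny effects statistically detectable.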
Even though we described the development process at Facebook only at a high level,
three key differences between cloud-based and individualized development/deployment
are important. First, the cloud-based process is heavily automated at a large scale,
with monitoring and execution data collection and analysis that provide real-time
feedback on various (non)functional features. Second, users do not install or have
access to the server-side code; they are limited to the functionality presented to them
by the UI. Moreover, they are often not aware that the functionality changed or that
there was a new release of the application, since a release may improve performance
and availability without adding any new functional features. Finally, the time between
identifying new requirements and deploying their implementation to billions of users on
millions of servers is significantly shortened. Achieving that with monolithic
individualized customer installations and deployments would be very difficult.
We have reached a point where the cloud infrastructure is not used just for deploying
applications: it is itself a service, a set of distributed objects whose
performance is monitored and controlled and whose exposed interfaces enable service
personnel to fine-tune application services to the customer base at much higher
response rates. In turn, more jobs are created for fine-tuning the infrastructure
so that developers can write application code that provides better services.
that the clocks must be well-synchronized; at Facebook the clock drift was estimated
at approximately 35ms, and it is accounted for in the consistency-calculating algorithm.
The other problem arises when newer values are equal to the previous values: it
is then not possible to determine the dashed dependency edges. Regardless, the experiment
within Facebook showed that out of hundreds of millions of captured read/write operations,
only a few thousand are registered as loop anomalies, or between 0.0001% and
0.001% of the total requests. Interestingly, only approximately 10% of the objects
experience both reads and writes, and they are the sources of inconsistencies. With a very
small percentage of the total number of objects leading to inconsistency, it is unclear
whether serious remediation measures are needed.
Based on this experiment, Facebook introduced the concept of φ(P)-consistency for
a set of replicas P, which measures the percentage of these replicas that return the
most recently written value for the same read request. Moreover, φ(P)-consistency
is divided into φ(G)-consistency and φ(R)-consistency: the former measures the consistency
of all cache replicas on the global scale, and the latter does so only for some
region. Using these measures allows engineers at Facebook to find undesired latencies
and problems with network configurations more effectively. For more information,
readers are referred to the original paper [92].
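The measure itself is easy to sketch. The following is an illustrative reconstruction rather than Facebook's implementation, with made-up replica responses.

```python
def phi_consistency(responses, latest_value):
    """Percentage of replicas returning the most recently written
    value for the same read request (a sketch of the φ(P)-consistency
    measure; names and data are illustrative)."""
    agree = sum(1 for v in responses if v == latest_value)
    return 100.0 * agree / len(responses)

# Suppose "v2" was the most recent write. Global consistency over all
# cache replicas gives φ(G); restricting to one region gives φ(R).
all_replicas = ["v2", "v2", "v1", "v2", "v2", "v2", "v2", "v2"]
region       = all_replicas[:4]
phi_g = phi_consistency(all_replicas, "v2")   # φ(G)
phi_r = phi_consistency(region, "v2")         # φ(R) for this region
```

A region whose φ(R) lags well behind the global φ(G) is exactly the kind of signal that points engineers at a stale cache or a misconfigured network path.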
open a text file with key/value pair assignments and change the value of some configuration
parameter; the repercussions of this change may manifest themselves in component
failures that are separated from the change spatially (e.g., in an unrelated component
running on a remote server) or temporally (e.g., after a long period of system execution
time). Configuration files are pervasive in cloud datacenters, and they are used to
manage the following aspects, among many others.
Product features are released to groups of customers frequently; their release is controlled
by switching the values of the appropriate configuration parameters.
A/B testing is similar to releasing new product features in that live experiments are
performed with selected groups of users, and it is likewise controlled by selected
configuration parameters.
Network traffic is controlled by configuration parameters that switch on/off routing of
messages or inclusion of features in the product that demand additional message
delivery and processing.
Load balancing config parameters switch on/off virtual routers and change the con-
figuration of the network topology to improve the overall performance of the
cloud infrastructure.
Smart monitoring is adjusted by a multitude of configuration parameters that deter-
mine the monitoring granularity and remedial actions depending on a certain
pattern of data.
validations that enable engineers to avoid many mistakes in a complex cloud environ-
ment with many interdependent processes.
Apache Thrift is used to specify configuration data types as an interface of the remote
configuration object. For example, one configuration data element could be a
financial instrument for trading, e.g., bonds, and another configuration data element
could be a protocol used to send trading messages for a given financial instrument, e.g.,
FIX. In some cases, certain financial instruments can be traded using only specific
protocols, whereas in other cases some protocols are disallowed for specific financial
instruments. These constraints are encoded in a special program
maintained by cloud service personnel. This program contains reusable routines for
validating combinations of the configuration option values. Reusable configurations
are written in cinc files; e.g., if the chosen financial instrument is bond, then the
configuration variable for the protocol is automatically assigned the value FIX. Programmers
specify their changes to configuration options in separate files in a given language, e.g.,
Python, and a different program called the Configerator Compiler verifies the submitted
configuration changes against the validation program and cinc files to determine that
these changes do not violate any constraints. Once verified, the configuration
changes are applied to the system.
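The constraint checking described above might be sketched as follows; the table of allowed protocols and the function names are hypothetical stand-ins for the validation program and the cinc files.

```python
# Hypothetical constraint table: which protocols each financial
# instrument may trade over (standing in for the validator program).
ALLOWED_PROTOCOLS = {
    "bond":   {"FIX"},          # bonds may only trade over FIX
    "equity": {"FIX", "ITCH"},
}

def validate_config(change):
    """Validate a submitted configuration change, filling in defaults
    (the role the cinc files play) and rejecting disallowed pairs."""
    instrument = change["instrument"]
    protocol = change.get("protocol")
    if protocol is None:
        # cinc-style default: derive the protocol from the instrument.
        protocol = next(iter(ALLOWED_PROTOCOLS[instrument]))
        change["protocol"] = protocol
    if protocol not in ALLOWED_PROTOCOLS[instrument]:
        raise ValueError(f"{protocol} not allowed for {instrument}")
    return change

ok = validate_config({"instrument": "bond"})   # protocol defaults to FIX
try:
    validate_config({"instrument": "bond", "protocol": "ITCH"})
    rejected = False
except ValueError:
    rejected = True                            # disallowed pair rejected
```

Running every change through such a check before it reaches production is what lets engineers avoid the spatially and temporally distant failures described above.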
Configuration management is a big part of the services that cloud organizations provide
to their customers. Not only do cloud providers configure sophisticated software
packages for hundreds of thousands of their customers, but they also configure various
software modules internally that are used by these customers. Using the RPC approach
to perform automated centralized configuration management enables cloud providers
to reduce downtime and improve the availability of cloud services.
Developers and testers need performance management tools that identify performance
problems automatically in order to achieve better software performance while keeping
the cost of software maintenance low. Performance bottlenecks (or hot
spots) are phenomena where the performance of the entire system is limited by one
or a few components [4, 11]. In a survey of 148 enterprises, 92% said that improving
application performance was a top priority [134, 158]. The application performance
management market is over USD 2.3Bil and growing at 12% annually, making it one
of the fastest growing segments of the application services market [16, 57]. Existing
performance management tools collect and structure information about executions of
applications so that stakeholders can analyze this information to obtain insight into
performance; unfortunately, identifying performance problems automatically, let alone
correcting them, is the holy grail of performance management. Performance problems
result in productivity losses approaching 20% in different domains due to application
downtime [67].
When a user enters the address “www.facebook.com” in her browser, it is resolved
to the IP address of some Point of Presence (POP) LB server in geographic
proximity to the user. The LB server multiplexes the request from the user to a regional
datacenter, whose LB entry servers multiplex this request further to a cluster within the
datacenter, which in turn uses its LB to direct the request to a server within the cluster.
When making multiplexing choices, LB servers may use sophisticated algorithms;
in reality, however, situations occur in which multiple datacenters can serve a specific
request from a given user, and a datacenter with slightly higher network latency may
have an available server whose hardware configuration results in much smaller
computation latency. That is, it may take longer to submit a user's request to a specific
server, yet this server will perform the computation for the request much faster than a
server in the datacenter that can be reached faster because of its slightly better
geographic proximity.
Kraken is an automatic performance management tool created and used at Facebook that
uses a feedback-directed loop to change the traffic between regions to balance the load
[153]. We encounter feedback-driven loops in devices everywhere: automatic systems in
our houses use sensors to measure the temperature and determine when to turn the air
conditioner on and off, or to measure the level of visibility and change the intensity
of street lights. In Kraken, this idea is applied to collecting various performance
measurements, such as server loads and various latencies among many others,
in order to change the distribution of customer request messages among regions,
datacenters, and their clusters. The latter is achieved by assigning weights to channels
between POPs, regions, datacenters, and clusters and using LBs to change these
weights in response to the collected measurements, thus creating a feedback-directed
loop between the performance indicators of the cloud and the distribution of the load.
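A heavily simplified, hypothetical version of one feedback iteration looks like this: shift weight away from overloaded clusters, toward underutilized ones, and renormalize. The real Kraken controller aggregates many more metrics and adjusts weights far more carefully.

```python
def rebalance(weights, loads, target, step=0.1):
    """One feedback iteration in the spirit of Kraken: shed routing
    weight from clusters above the target load, attract traffic to
    clusters below it, then renormalize the weights (a sketch only)."""
    for cluster, load in loads.items():
        if load > target:
            weights[cluster] *= (1 - step)   # shed traffic
        elif load < target:
            weights[cluster] *= (1 + step)   # attract traffic
    total = sum(weights.values())
    return {c: w / total for c, w in weights.items()}

weights = {"cluster-a": 0.5, "cluster-b": 0.5}
loads = {"cluster-a": 0.9, "cluster-b": 0.4}   # measured CPU utilization
weights = rebalance(weights, loads, target=0.6)
# cluster-a now receives a smaller share of the traffic.
```

Repeating this every measurement interval closes the loop: new metrics reflect the shifted traffic, which drives the next weight adjustment.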
In a nutshell, Facebook's Kraken works as follows. The cloud
monitoring subsystem collects diverse metrics on CPU and memory utilization, error
rates, latency, network utilization, write success rate, exception rate, lengths of
various queues, retransmit rates, and object lease counts for in-memory data processing
software. Of course, various metrics are added over time. These metrics and their
timestamps are collected at predefined time intervals, and they are stored in an in-memory
high-performance database called Gorilla [114], whose open-source implementation
called Beringei is publicly available. Once Kraken determines that an imbalance exists
in the cloud, where some datacenters or their clusters are getting overloaded whereas
other subsystems are underutilized, the weights are modified to change the distribution
of the traffic. As a result, new metrics are collected and the feedback loop continues.
At a lower level of the implementation, Facebook uses Proxygen, a library for
creating LB-based web servers. A Proxygen-based web server ingests a cloud network
topology description from an external configuration file that contains weights for the
channels that connect POPs, datacenters, clusters, and web servers. Every minute Kraken
collects the metrics and aggregates the values, and in the following minute it issues
corrective actions for the weights based on the aggregated metric values. Once the
weights are changed, the traffic shifts.
It is noteworthy that a core assumption behind the use of Kraken is that the servers
in the cloud datacenter must be stateless. Consider a web server that must record
the state of previous computations and use it in subsequent computations. To
use the proper portion of the recorded state for a specific computation, stateful servers
establish sessions that define a logically and physically bounded exchange of messages.
As a result, requests for stateful computations must be routed to the particular servers
that hold the corresponding states. Sessions can be long, and they can
encapsulate many RPC computations. To ensure that all session requests are forwarded
to a specific server, the notion of the sticky session is used, where RPCs are marked as
part of a particular session. Load balancers then have to direct sticky-session requests
to designated servers, which interferes with Kraken's traffic modifications, since Kraken
directs RPCs to arbitrary servers in different datacenters based on the metrics.
robust enforcement of data encapsulation. All in all, PHP programs are difficult to
optimize, and they are much slower to execute and error-prone. However, given the
large PHP codebase at Facebook, it is not a sensible business decision to rewrite it in
some more efficient language, especially given that maintenance and new development
are required to add new features and fix bugs in the existing deployment of Facebook.
If a codebase rewrite is not possible, then a solution is to create a new execution
platform for PHP programs to address their performance drawbacks.
The new execution platform is a process-level VM called HipHop VM (HHVM)
that abstracts away the runtime interpretation of the PHP code by introducing a new
set of instructions called HipHop Byte Code (HHBC) [2]. A set of tools comes with
the HHVM to compile PHP programs to HHBC and to optimize the resulting bytecode
programs using a Just-in-Time (JIT) compiler. Recall from Section 7 that a process VM
does not require a hypervisor; it can be a library loaded in memory. The untyped,
dynamic nature of PHP makes it difficult to determine the exact types of data
and their representations in PHP programs at compile time; hence the programs are
optimized by the HHVM at runtime. As an HHBC program keeps executing with
different input data, the HHBC is optimized based on the types of data detected during
execution, thus making programs run faster the more times they are executed.
Financially, it is likely a sensible solution. Suppose that a new execution platform
can increase the performance of the cloud by 20%. Given the number of servers that
Facebook uses, the improvement in the performance and efficiency of the executing PHP
programs may free the equivalent of millions of servers, resulting in savings of tens of
millions of USD per year, not including service and maintenance costs. Hence, the
development of a new VM is driven not by pure curiosity or simply the desire of software
engineers to create yet another platform, but by a business need to optimize the
resources and the economics of the investment.
HHVM optimizes PHP programs' bytecode using the abstraction of a bytecode unit
called a tracelet, which can be viewed as a box with a single input and multiple outputs.
As the data flows into the input entry, it is processed by the HHBC, and the output data
flows out of the outputs. If the types of the input values were known at compile time,
the HHVM compiler could determine their runtime representations and optimize the
operations. Instead, the JIT compiler discovers the types of the data in a tracelet at
runtime, and it generates multiple instances of the tracelet with guards that describe the
types of the input and output data. When a tracelet is invoked later, the JIT compiler
determines the types of the input data and finds the appropriate version of the
tracelet for these types using the guards.
Consider the example of a tracelet in Figure 16.2 that compares the values of two
input variables and returns zero if they are equal and one otherwise. The types of
the input variables $v1 and $v2 are not known at compile time. When the function
compare(1,2) is invoked, the JIT compiler determines that the type of the input data
is integer, uses HHBC instructions that deal with this type in the generated tracelet, and
correspondingly creates the guards that designate the generated tracelet for the integer
type.
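The guard-based dispatch can be mimicked in a few lines of Python. This is only an analogy to HHVM's machinery: here Python's `type` plays the role of the guards and a dictionary plays the role of the tracelet cache, and all names are illustrative.

```python
def compile_tracelet(fn, arg_types):
    """Pretend JIT compilation: return a version of fn 'specialized'
    for the given concrete argument types."""
    def specialized(*args):
        return fn(*args)
    return specialized

class TraceletCache:
    """Guard-based dispatch: each compiled tracelet is keyed by the
    guard (the tuple of runtime argument types); a miss triggers
    'compilation' of a new specialized instance."""
    def __init__(self, fn):
        self.fn = fn
        self.instances = {}   # guard -> specialized tracelet

    def __call__(self, *args):
        guard = tuple(type(a) for a in args)
        if guard not in self.instances:      # guard miss: JIT a new one
            self.instances[guard] = compile_tracelet(self.fn, guard)
        return self.instances[guard](*args)

@TraceletCache
def compare(v1, v2):
    return 0 if v1 == v2 else 1

compare(1, 2)       # compiles and caches the (int, int) instance
compare("a", "a")   # compiles and caches the (str, str) instance
```

Subsequent calls with already-seen type tuples hit the cache directly, which is the sense in which programs "run faster the more times they are executed."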
16.6 Summary
This chapter was dedicated to the organization and operation of the Facebook cloud.
We live in an amazing time when we can determine how technical solutions are engineered
at large organizations simply by reading the technical papers that they publish.
We reviewed the development infrastructure at Facebook with which the code is created
and tested before being deployed for use by billions of users. Setting aside the business
goals of Facebook, the scale of the cloud deployment is truly breathtaking: billions
of users are served every day to enable a continuous and uninterrupted flow
of information. Interestingly, the software development lifecycle for creating and
deploying software in the cloud differs from the classic waterfall/spiral models, since
the A/B testing process is embedded into the development model. Releasing software to
users who test it by using it may be viewed as an alpha- or beta-testing
preproduction phase, except that from the users' point of view the software is in
production. In a way, this mode of rapid deployment enables software engineers to add
new features to large software applications at a fast pace.
Next, we reviewed Facebook's algorithm for measuring eventual consistency,
and we learned how different levels of eventual consistency are used to adjust caching
and the data flow to make data more consistent while preserving the availability of
the cloud services. We analyzed configuration management in the cloud as an
important mechanism for synchronizing different components into a correctly assembled
software system. The configuration compiler called Configerator
treats configurations as distributed objects with an RPC-based constraint enforcement
mechanism. Finally, we studied Kraken, a feedback-directed performance management
system, and HHVM, a virtual machine with an intermediate bytecode translator and
optimizer.
Chapter 17
Conclusions
All good things come to an end, and this book is no exception. We have covered a
lot of ground describing the components of the new engineering technology called cloud
computing. Its fundamental difference from other related technologies lies in managing
computing resources dynamically at a wide scale and pricing these resources based
on the demand for them. As such, cloud datacenters go well beyond providing distributed
RPC-based services; they are also revenue management systems akin to the airline
industry, where resources represent seats on an airplane. Of course, passengers are not
applications in the sense that they are not forcibly ejected from their seats when the
demand changes (well, not by all airlines). But the idea of balancing the load among
resources based on the elastic demand of applications and the willingness of the
applications' owners to pay for these resources is unique in the short history of software
and systems engineering at a wide scale.
Predicting how technologies will evolve is an impossible task. It has been proven
repeatedly that making these predictions is a fool’s errand. Yet it is likely that the tech-
nology will be improved in the near future incrementally, not by changing everything
overnight. Major trends include computing consistent results in highly available sys-
tems in the presence of partitioning, highly efficient utilization of cloud resources, and
revenue management to determine dynamic price ranges for resources based on their
utilization and the customer demand. We will discuss these trends below.
There is a decade-old debate about the CAP theorem and its influence on the
development and deployment of cloud-based applications. Whereas the CAP theorem
establishes theoretical limits on selecting two out of the three properties, i.e.,
consistency, availability, and partition tolerance, modern cloud infrastructures are
becoming increasingly reliable w.r.t. hardware and network faults. It is expected that
with multiple redundant high-speed network connections and physical servers, replicas
will be synchronized very quickly and the window of inconsistency will be minimal,
leading to highly consistent and available cloud computing.
Intelligent load balancers will lead to significant improvements in the utilization of
cloud services and to decreased latency of computations. LBs will use machine
learning algorithms and techniques to route requests from users to regions, datacenters,
clusters, and specific servers based on collected runtime information, as in
Facebook's Kraken. Of course, these algorithms take time and resources; however, as
they become more efficient and more precise, their benefits will outweigh their
costs, resulting in learning LBs that improve the performance and efficiency of
cloud datacenters.
Cloud computing requires users to pay for the resources they use, so this cost dimension
adds emphasis to better elasticity and smart revenue management. As we learned in this
book, perfectly elastic services are impossible to create, because it is impossible to
determine the exact performance of nontrivial software applications. However, improving
elasticity will result in smaller amounts of under- and over-provisioning of resources,
thereby increasing the value of cloud computing services to customers. Equally important
will be determining the prices of resources dynamically and automatically based
on the changing demand of users. Revenue management is a complex science rooted
in operations research, and the optimization of cloud resource pricing reflects
its use and its value, helping to make cloud computing available, efficient, and
a default choice for deploying software applications for individual customers and for
large companies and organizations for many decades to come.
Bibliography
[5] M. Ahamad, R. A. Bazzi, R. John, P. Kohli, and G. Neiger. The power of proces-
sor consistency. In Proceedings of the Fifth Annual ACM Symposium on Parallel
Algorithms and Architectures, SPAA ’93, pages 251–260, New York, NY, USA,
1993. ACM.
[7] M. Albonico, J.-M. Mottu, and G. Sunyé. Controlling the elasticity of web appli-
cations on cloud computing. In Proceedings of the 31st Annual ACM Symposium
on Applied Computing, SAC ’16, pages 816–819, New York, NY, USA, 2016.
ACM.
[10] N. Amit, D. Tsafrir, A. Schuster, A. Ayoub, and E. Shlomo. Virtual CPU validation.
In Proceedings of the 25th Symposium on Operating Systems Principles,
SOSP ’15, pages 311–327, New York, NY, USA, 2015. ACM.
[11] G. Ammons, J.-D. Choi, M. Gupta, and N. Swamy. Finding and removing per-
formance bottlenecks in large systems. In ECOOP, pages 170–194, 2004.
[17] A. Avritzer and E. J. Weyuker. Generating test suites for software load testing.
In ISSTA, pages 44–57, New York, NY, USA, 1994. ACM.
[23] K. Birman, G. Chockler, and R. van Renesse. Toward a cloud computing re-
search agenda. SIGACT News, 40(2):68–80, June 2009.
[27] P. Bogle and B. Liskov. Reducing cross domain call overhead using batched
futures. In Proceedings of the Ninth Annual Conference on Object-oriented
Programming Systems, Language, and Applications, OOPSLA ’94, pages 341–
354, New York, NY, USA, 1994. ACM.
[28] P. Bollen. BPMN: A meta model for the happy path. 2010.
[29] P. C. Brebner. Is your cloud elastic enough?: Performance modelling the elastic-
ity of infrastructure as a service (IaaS) cloud applications. In Proceedings of the
3rd ACM/SPEC International Conference on Performance Engineering, ICPE
’12, pages 263–266, New York, NY, USA, 2012. ACM.
[35] P. M. Chen and B. D. Noble. When virtual is better than real. In Proceedings of
the Eighth Workshop on Hot Topics in Operating Systems, HOTOS ’01, pages
133–, Washington, DC, USA, 2001. IEEE Computer Society.
[36] Cisco. Cloud computing adoption is slow but steady. ITBusinessEdge, 2012.
[37] D. Colarelli and D. Grunwald. Massive arrays of idle disks for storage archives.
In Proceedings of the 2002 ACM/IEEE Conference on Supercomputing, SC ’02,
pages 1–11, Los Alamitos, CA, USA, 2002. IEEE Computer Society Press.
[39] L. P. Cox, C. D. Murray, and B. D. Noble. Pastiche: Making backup cheap and
easy. SIGOPS Oper. Syst. Rev., 36(SI):285–298, Dec. 2002.
[44] A. Deng, J. Lu, and J. Litz. Trustworthy analysis of online A/B tests: Pitfalls,
challenges and solutions. In Proceedings of the Tenth ACM International Con-
ference on Web Search and Data Mining, WSDM ’17, pages 641–649, New
York, NY, USA, 2017. ACM.
[46] U. Drepper. What every programmer should know about memory, 2007.
[49] C. Ferris and J. Farrell. What are web services? Commun. ACM, 46(6):31–,
June 2003.
[51] M. Flynn. Some computer organizations and their effectiveness. IEEE Trans.
Comput., C-21:948–960, 1972.
[52] I. Foster. Globus toolkit version 4: Software for service-oriented systems. Jour-
nal of Computer Science and Technology, 21(4):513–520, 2006.
[53] I. T. Foster, Y. Zhao, I. Raicu, and S. Lu. Cloud computing and grid computing
360-degree compared. CoRR, abs/0901.0131, 2009.
[55] Y. Fujii, T. Azumi, N. Nishio, S. Kato, and M. Edahiro. Data transfer matters
for GPU computing. In Proceedings of the 2013 International Conference on
Parallel and Distributed Systems, ICPADS ’13, pages 275–282, Washington,
DC, USA, 2013. IEEE Computer Society.
[57] J.-P. Garbani. Market overview: The application performance management mar-
ket. Forrester Research, Oct. 2008.
[60] S. Ghemawat, H. Gobioff, and S.-T. Leung. The Google file system. In Proceed-
ings of the Nineteenth ACM Symposium on Operating Systems Principles, SOSP
’03, pages 29–43, New York, NY, USA, 2003. ACM.
[61] S. Gilbert and N. Lynch. Brewer’s conjecture and the feasibility of consistent,
available, partition-tolerant web services. SIGACT News, 33(2):51–59, June
2002.
[62] Google. Auto scaling on the Google Cloud Platform. Google Cloud Platform,
Oct. 2013.
[64] J. Gray, P. Helland, P. O’Neil, and D. Shasha. The dangers of replication and a
solution. In Proceedings of the 1996 ACM SIGMOD International Conference
on Management of Data, SIGMOD ’96, pages 173–182, New York, NY, USA,
1996. ACM.
[70] J. Handy. The Cache Memory Book. Academic Press Professional, Inc., San
Diego, CA, USA, 1993.
[72] J. Hill and H. Owens. Towards using abstract behavior models to evaluate soft-
ware system performance properties. In Proceedings of the 5th International
Workshop on Advances in Quality of Service Management (AQuSerM 2011),
New York, NY, USA, 2011. ACM.
[74] M. R. Hines and K. Gopalan. Post-copy based live virtual machine migration
using adaptive pre-paging and dynamic self-ballooning. In Proceedings of the
2009 ACM SIGPLAN/SIGOPS International Conference on Virtual Execution
Environments, VEE ’09, pages 51–60, New York, NY, USA, 2009. ACM.
[77] S. Islam, K. Lee, A. Fekete, and A. Liu. How a consumer can measure elasticity
for cloud platforms. In Proceedings of the 3rd ACM/SPEC International Con-
ference on Performance Engineering, ICPE ’12, pages 85–96, New York, NY,
USA, 2012. ACM.
[81] R. Kay. Pragmatic network latency engineering: Fundamental facts and analysis.
[82] J. Kim, D. Chae, J. Kim, and J. Kim. Guide-copy: Fast and silent migration of
virtual machine for datacenters. In Proceedings of the International Conference
on High Performance Computing, Networking, Storage and Analysis, SC ’13,
pages 66:1–66:12, New York, NY, USA, 2013. ACM.
[84] V. Kundra. Federal cloud computing strategy. Office of the CIO, The White House.
[86] L. Lamport. Time, clocks, and the ordering of events in a distributed system.
Commun. ACM, 21(7):558–565, July 1978.
[87] B. W. Lampson. Hints for computer system design. In Proceedings of the Ninth
ACM Symposium on Operating System Principles, SOSP 1983, Bretton Woods,
New Hampshire, USA, October 10-13, 1983, pages 33–48, 1983.
[88] A. N. Langville and C. D. Meyer. Google’s PageRank and Beyond: The Science
of Search Engine Rankings. Princeton University Press, Princeton, NJ, USA,
2006.
[94] A. Madhavapeddy and D. J. Scott. Unikernels: Rise of the virtual library oper-
ating system. Queue, 11(11):30:30–30:44, Dec. 2013.
[97] P. M. Mell and T. Grance. SP 800-145: The NIST definition of cloud computing.
Technical report, Gaithersburg, MD, United States, 2011.
[100] I. Molyneaux. The Art of Application Performance Testing: Help for Program-
mers and Quality Assurance. O’Reilly Media, Inc., 2009.
[102] D. Mosberger. Memory consistency models. SIGOPS Oper. Syst. Rev., 27(1):18–
26, Jan. 1993.
[109] B. Nitzberg and V. Lo. Distributed shared memory: A survey of issues and
algorithms. Computer, 24(8):52–60, Aug. 1991.
[110] S. Niu, J. Zhai, X. Ma, X. Tang, and W. Chen. Cost-effective cloud HPC re-
source provisioning by building semi-elastic virtual clusters. In Proceedings of
SC13: International Conference for High Performance Computing, Networking,
Storage and Analysis, SC ’13, pages 56:1–56:12, New York, NY, USA, 2013.
ACM.
[113] D. A. Patterson, G. Gibson, and R. H. Katz. A case for redundant arrays of in-
expensive disks (RAID). In Proceedings of the 1988 ACM SIGMOD International
Conference on Management of Data, SIGMOD ’88, pages 109–116, New York,
NY, USA, 1988. ACM.
[133] B. Schroeder and G. A. Gibson. Disk failures in the real world: What does
an MTTF of 1,000,000 hours mean to you? In Proceedings of the 5th USENIX
Conference on File and Storage Technologies, FAST ’07, Berkeley, CA, USA,
2007. USENIX Association.
[138] M. Seshadrinathan and K. L. Dempski. High speed skin color detection and
localization on a GPU. In Proceedings of Eurographics 2007, pages 334–350,
Prague, Czech Republic, 2007. Eurographics Association.
[139] D. Shabalin, E. Burmako, and M. Odersky. Quasiquotes for Scala. Technical
report, 2013.
[142] J. Smith and R. Nair. Virtual Machines: Versatile Platforms for Systems and
Processes (The Morgan Kaufmann Series in Computer Architecture and De-
sign). Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 2005.
[148] B. Teabe, V. Nitu, A. Tchana, and D. Hagimont. The lock holder and the lock
waiter pre-emption problems: Nip them in the bud using informed spinlocks
(i-spinlock). In Proceedings of the Twelfth European Conference on Computer
Systems, EuroSys ’17, pages 286–297, New York, NY, USA, 2017. ACM.
[151] R. Van Renesse, K. P. Birman, and W. Vogels. Astrolabe: A robust and scalable
technology for distributed system monitoring, management, and data mining.
ACM Trans. Comput. Syst., 21(2):164–206, May 2003.
[152] R. van der Meulen. Gartner says by 2020 “cloud shift” will affect more than
$1 trillion in IT spending. http://www.gartner.com/newsroom/id/
3384720, May 2017.
[158] N. Yuhanna. Dbms selection: Look beyond basic functions. Forrester Research,
June 2009.