Virtualized Cloud Data Center Networks: Issues in Resource Management

Linjiun Tsai
Wanjiun Liao

SpringerBriefs in Electrical and Computer Engineering
More information about this series at http://www.springer.com/series/10059

Linjiun Tsai, National Taiwan University, Taipei, Taiwan
Wanjiun Liao, National Taiwan University, Taipei, Taiwan
Contents

1 Introduction
  1.1 Cloud Computing
  1.2 Server Virtualization
  1.3 Server Consolidation
  1.4 Scheduling of Virtual Machine Reallocation
  1.5 Intra-Service Communications
  1.6 Topology-Aware Allocation
  1.7 Summary
  References

2 Allocation of Virtual Machines
  2.1 Problem Formulation
  2.2 Adaptive Fit Algorithm
  2.3 Time Complexity of Adaptive Fit
  Reference

3 Transformation of Data Center Networks
  3.1 Labeling Network Links
  3.2 Grouping Network Links
  3.3 Formatting Star Networks
  3.4 Matrix Representation
  3.5 Building Variants of Fat-Tree Networks
  3.6 Fault-Tolerant Resource Allocation
  3.7 Fundamental Properties of Reallocation
  3.8 Traffic Redirection and Server Migration
  Reference

4 Allocation of Servers
  4.1 Problem Formulation
  4.2 Multi-Step Reallocation
  4.3 Generality of the Reallocation Mechanisms
  4.4 On-Line Algorithm

Appendix
Chapter 1
Introduction
Cloud computing lends itself to the processing of large data volumes and time-
varying computational demands. Cloud data centers involve substantial computa-
tional resources, feature inherently flexible deployment, and deliver significant
economic benefit—provided the resources are well utilized while the quality of
service is sufficient to attract as many tenants as possible.
Because cloud data centers naturally bring economies of scale, they have received
extensive attention in both academia and industry. In large-scale public
data centers, there may exist hundreds of thousands of servers, stacked in racks and
connected by high-bandwidth hierarchical networks to jointly form a shared
resource pool for accommodating multiple cloud tenants from all around the world.
The servers are provisioned and released on-demand via a self-service interface at
any time, and tenants are normally given the ability to specify the amount of CPU,
memory, and storage they require. Commercial data centers usually also offer
service-level agreements (SLAs) as a formal contract between a tenant and the
operator. The typical SLA includes penalty clauses that spell out monetary com-
pensations for failure to meet agreed critical performance objectives such as
downtime and network connectivity.
Virtualization [1] is widely adopted in modern cloud data centers for its agile
dynamic server provisioning, application isolation, and efficient and flexible
resource management. Through virtualization, multiple instances of applications
can be hosted by virtual machines (VMs) and thus separated from the underlying
hardware resources. Multiple VMs can be hosted on a single physical server at one
time, as long as their aggregate resource demand does not exceed the server
capacity. VMs can be easily migrated [2] from one server to another via network
connections. However, without proper scheduling and routing, the migration traffic
and workload traffic generated by other services would compete for network
bandwidth. The resultant lower transfer rate invariably prolongs the total migration
time. Migration may also cause a period of downtime to the migrating VMs,
thereby disrupting a number of associated applications or services that need con-
tinuous operation or response to requests. Depending on the type of applications
and services, unexpected downtime may lead to severe errors or huge revenue
losses. For data centers claiming high availability, how to effectively reduce
migration overhead when reallocating resources is therefore one key concern, in
addition to pursuing high resource utilization.
The resource demands of cloud services are highly dynamic and change over time.
Hosting such fluctuating demands, the servers are very likely to be underutilized,
but still incur significant operational cost unless the hardware is perfectly energy
proportional. To reduce costs from inefficient data center operations and the cost of
hosting VMs for tenants, server consolidation techniques have been developed to
pack VMs into as few physical servers as possible, as shown in Fig. 1.1. The
techniques usually also generate the reallocation schedules for the VMs in response
to the changes in their resource demands. Such techniques can be used to con-
solidate all the servers in a data center or just the servers allocated to a single
service.
Fig. 1.1 Server consolidation: VMs are packed onto a few active servers while the non-active servers are turned off
Without proper mechanisms for service allocation and traffic routing, the intra-service
communication of every service sharing the network may suffer serious delay or even be
disrupted. Deploying all the VMs for a service into one single rack to reduce the
impact on the shared network is not always a practical or economical solution. This
is because such a solution may cause the resources of data centers to be
underutilized and fragmented, particularly when the demand of services is highly
dynamic and does not fit the capacity of the rack.
For delay-sensitive and communication-intensive applications, such as mobile
cloud streaming [10, 11], cloud gaming [12, 13], MapReduce applications [14],
scientific computing [15] and Spark applications [16], the problem may become
more acute due to their much greater impact on the shared network and much
stricter requirements in the quality of intra-service transmissions. Such types of
applications usually require all-to-all communications to intensively exchange or
shuffle data among distributed nodes. Therefore, network quality becomes the
primary bottleneck of their performance. In some cases, the problem remains quite
challenging even if the substrate network structure provides high capacity and rich
connectivity, or the switches are not oversubscribed. First, all-to-all traffic patterns
impose strict topology requirements on allocation. Complete graphs, star graphs or
some graphs of high connectivity are required for serving such traffic, which may
be between any two servers. In a data center network where the network resource is
highly fragmented or partially saturated, such topologies are obviously extremely
difficult to allocate, even with significant reallocation cost and time. Second,
dynamically reallocating such services without affecting their performance is also
extremely challenging. It is required to find reallocation schedules that not only
satisfy general migration requirements, such as sufficient residual network band-
width, but also keep their network topologies logically unchanged.
To host delay-sensitive and communication-intensive applications with network
performance guarantees, the network topology and quality (e.g., bandwidth, latency
and connectivity) should be consistently guaranteed, thus continuously supporting
arbitrary intra-service communication patterns among the distributed compute
nodes and providing good predictability of service performance. One of the best
approaches is to allocate every service a non-blocking network. Such a network
must be isolated from any other service, be available during the entire service
lifetime even when some of the compute nodes are reallocated, and support
all-to-all communications. This way, it can give each service the illusion of being
operated on the data center exclusively.
For profit-seeking cloud data centers, the question of how to efficiently provision
non-blocking topologies for services is a crucial one. It also principally affects the
resource utilization of data centers. Different services may request various virtual
topologies to connect their VMs, but it is not necessary for data centers to allocate
the physical topologies for them in exactly the same form. In fact, keeping such
consistency could lead to certain difficulties in optimizing the resources of entire
data center networks, especially when such services request physical topologies of
high connectivity degrees or even cliques.
For example, consider the deployment of a service which requests a four-vertex
clique to serve arbitrary traffic patterns among four VMs on a network with eight
switches and eight servers. Suppose that the link capacity is identical to the
bandwidth requirement of the VM, so there are at least two feasible methods of
allocation, as shown in Fig. 1.2. Allocation 1 uses a star topology, which is clearly
non-blocking for any possible intra-service communication patterns, and occupies
the minimum number of physical links. Allocation 2, however, shows an inefficient
allocation as two more physical links are used to satisfy the same intra-service
communication requirements.
Apart from occupying fewer resources, the star network in Allocation 1 provides
better flexibility in reallocation than more complex structures. This is because
Allocation 1 involves only one link when reallocating any VM while ensuring
topology consistency. Such a property makes it easier for resources to be reallo-
cated in a saturated or fragmented data center network, and thus further affects how
well the resource utilization of data center networks could be optimized, particularly
when the demands dynamically change over time. However, the question then
arises as to how to efficiently allocate every service as a star network. In other
words, how to efficiently divide the hierarchical data center networks into a large
number of star networks for services and dynamically reallocate those star networks
while maintaining high resource utilization? To answer this question, the topology
of underlying networks needs to be considered. In this book, we will introduce a
solution to tackle this problem.
1.7 Summary
So far, the major issues, challenges and requirements for managing the resources of
virtualized cloud data centers have been outlined. The solutions to these problems
will be explored in the following chapters. The approach is to divide the problems
into two parts. The first one is to allocate VMs for every service into one or multiple
virtual servers, and the second one is to allocate virtual servers for all services to
physical servers and to determine network links to connect them. Both sub-
problems are dynamic allocation problems. This is because the mappings from
VMs to virtual servers, the number of required virtual servers, the mapping from
virtual servers to physical servers, and the allocation of network links may all
change over time. For practical considerations, these mechanisms are designed to
be scalable and feasible for cloud data centers of various scales so as to accom-
modate services of different sizes and dynamic characteristics.
The mechanism for allocating and reallocating VMs on servers is called
Adaptive Fit [17], which is designed to pack VMs into as few servers as possible.
The challenge is not just to simply minimize the number of servers. As the demand
of every VM may change over time, it is best to minimize the reallocation overhead
by selecting and keeping some VMs on their last hosting server according to an
estimated saturation degree.
The mechanism for allocating and reallocating physical servers is based on a
framework called StarCube [18], which ensures every service is allocated with an
isolated non-blocking star network and provides some fundamental properties that
allow topology-preserving reallocation. Then, a polynomial-time algorithm will be
introduced which performs on-line, on-demand and cost-bounded server allocation
and reallocation based on those promising properties of StarCube.
References
1. P. Barham et al., Xen and the art of virtualization. ACM SIGOPS Operating Syst. Rev. 37(5),
164–177 (2003)
2. C. Clark et al., in Proceedings of the 2nd Conference on Symposium on Networked Systems
Design & Implementation, Live migration of virtual machines, vol. 2 (2005)
3. V.V. Vazirani, Approximation Algorithms, Springer Science & Business Media (2002)
4. M.R. Garey, D.S. Johnson, Computers and intractability: a guide to the theory of
NP-completeness (WH Freeman & Co., San Francisco, 1979)
5. G. Dósa, The tight bound of first fit decreasing bin-packing algorithm is FFD(I) = (11/9)OPT
(I) + 6/9, Combinatorics, Algorithms, Probabilistic and Experimental Methodologies,
Springer Berlin Heidelberg (2007)
6. B. Xia, Z. Tan, Tighter bounds of the first fit algorithm for the bin-packing problem. Discrete
Appl. Math. 158(15), 1668–1675 (2010)
7. Q. He et al., in Proceedings of the 19th ACM International Symposium on High Performance
Distributed Computing, Case study for running HPC applications in public clouds, (2010)
8. S. Kandula et al., in Proceedings of the 9th ACM SIGCOMM Conference on Internet
Measurement Conference, The nature of data center traffic: measurements & analysis (2009)
9. T. Ristenpart et al., in Proceedings of the 16th ACM Conference on Computer and
Communications Security, Hey, you, get off of my cloud: exploring information leakage in
third-party compute clouds (2009)
10. C.F. Lai et al., A network and device aware QoS approach for cloud-based mobile streaming.
IEEE Trans. on Multimedia 15(4), 747–757 (2013)
11. X. Wang et al., Cloud-assisted adaptive video streaming and social-aware video prefetching
for mobile users. IEEE Wirel. Commun. 20(3), 72–79 (2013)
12. R. Shea et al., Cloud gaming: architecture and performance. IEEE Network Mag. 27(4), 16–21
(2013)
13. S.K. Barker, P. Shenoy, in Proceedings of the first annual ACM Multimedia Systems,
Empirical evaluation of latency-sensitive application performance in the cloud (2010)
14. J. Ekanayake et al., in IEEE Fourth International Conference on eScience, MapReduce for
data intensive scientific analyses (2008)
15. A. Iosup et al., Performance analysis of cloud computing services for many-tasks scientific
computing, IEEE Trans. on Parallel and Distrib. Syst. 22(6), 931–945 (2011)
16. M. Zaharia et al., in Proceedings of the 2nd USENIX conference on Hot topics in cloud
computing, Spark: cluster computing with working sets (2010)
17. L. Tsai, W. Liao, in IEEE 1st International Conference on Cloud Networking, Cost-aware
workload consolidation in green cloud datacenter (2012)
18. L. Tsai, W. Liao, StarCube: an on-demand and cost-effective framework for cloud data center
networks with performance guarantee, IEEE Trans. on Cloud Comput. doi:10.1109/TCC.2015.2464818
Chapter 2
Allocation of Virtual Machines
We consider the case where a system (e.g., a cloud service or a cloud data center) is
allocated a set of servers denoted by H and a set of VMs denoted by V.
We assume the number of servers is always sufficient to host the total resource
requirement of all VMs in the system. Thus, we focus on the consolidation effec-
tiveness and the migration cost incurred by the server consolidation problem.
Further, we assume that VM migration is performed at discrete times. We define
the period of time to perform server consolidation as an epoch. Let T = {t1, t2, …, tk}
denote the set of epochs to perform server consolidation. The placement sequence
for VMs in V in each epoch t is then denoted by F = {ft | ∀ t ∈ T}, where ft is the
VM placement at epoch t and defined as a mapping ft : V → H, which specifies that
each VM i, i ∈ V, is allocated to server ft(i). Note that "ft(i) = 0" denotes that VM i is
not allocated. To model the dynamic nature of the resource requirement and the
migration cost for each VM over time, we let Rt = {rt(i) | ∀ i ∈ V} and Ct = {ct(i) | ∀
i ∈ V} denote the sets of the resource requirement and migration cost, respectively,
for all VMs in epoch t.
The TCC problem is NP-Hard, because it is at least as hard as the server consol-
idation problem. In this section, we present a polynomial-time solution to the
problem. The design objective is to generate VM placement sequences F in
polynomial time and minimize Cost(F).
Recall that the migration cost results from changing the hosting servers of VMs
during the VM migration process. To reduce the total migration cost for all VMs,
we attempt to minimize the number of migrations without degrading the effec-
tiveness of consolidation. To achieve this, we try to allocate each VM i in epoch t to
the same server hosting the VM in epoch t − 1, i.e., ft(i) = ft−1(i). If ft−1(i) does not
have enough capacity in epoch t to satisfy the resource requirement for VM i or is
currently not active, we then start the remaining procedure based on “saturation
degree” estimation. The rationale behind this is described as follows.
Instead of using a greedy method as in existing works, which typically allocate
each migrating VM to an active server with available capacity either based on First
Fit, Best Fit, or Worst Fit, we define a total cost metric called saturation degree to
strike a balance between the two conflicting factors: consolidation effectiveness and
migration overhead. For each iteration of allocation process in epoch t, the satu-
ration degree Xt is defined as follows:
$$
X_t = \frac{\sum_{i \in V} r_t(i)}{\left( \left| H_t' \right| + 1 \right) \cdot 1}
$$
Since the server capacity is normalized to one in this book, the denominator
indicates the total capacity summed over all active servers plus an idle server in
epoch t.
During the allocation process, Xt decreases as |H′t| increases by definition. We
define the saturation threshold u ∈ [0, 1] and say that Xt is low when Xt ≤ u. If Xt is
low, the migrating VMs should be allocated to the set of active servers unless there
are no active servers that have sufficient capacity to host them. On the other hand, if
Xt is large (i.e., Xt > u), the mechanism tends to “lower” the total migration cost as
follows. One of the idle servers will be turned on to host a VM which cannot be
allocated on its “last hosting server” (i.e., ft−1(i) for VM i), even though some of the
active servers still have sufficient residual resource to host the VM. It is expected
that the active servers with residual resource in epoch t are likely to be used for
hosting other VMs which were hosted by them in epoch t − 1. As such, the total
migration cost is minimized.
The process of allocating all VMs in epoch t is then described as follows. In
addition, the details of the mechanism are shown in the Appendix.
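As an illustration of the allocation logic just described, consider the following Python sketch; the function name, the FFD-style descending order and the treatment of an inactive previous host as re-activatable are simplifying assumptions made here for illustration, not the book's pseudocode.

def adaptive_fit(vms, prev_placement, demands, capacity=1.0, u=0.95):
    """Keep each VM on its last hosting server when it still fits; otherwise
    use the saturation degree X_t to choose between packing onto active
    servers and opening an idle server."""
    load = {}                                  # active server -> used capacity
    placement = {}
    total_demand = sum(demands[v] for v in vms)

    def saturation_degree():
        # X_t = (total demand) / (active servers + one idle server),
        # with server capacity normalized to one.
        return total_demand / (len(load) + 1)

    for v in sorted(vms, key=lambda i: demands[i], reverse=True):
        last = prev_placement.get(v)
        if last is not None and load.get(last, 0.0) + demands[v] <= capacity:
            host = last                        # no migration needed
        elif saturation_degree() <= u:
            # X_t is low: pack onto any active server with enough room.
            host = next((h for h in load
                         if load[h] + demands[v] <= capacity), None)
        else:
            # X_t is high: open an idle server and keep the residual capacity
            # of the active servers for the VMs they hosted in epoch t - 1.
            host = None
        if host is None:
            host = ("idle", len(load))         # turn on a fresh server
        load[host] = load.get(host, 0.0) + demands[v]
        placement[v] = host
    return placement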
Reference
1. S. Akoush et al., in Proc. IEEE MASCOTS, Predicting the Performance of Virtual Machine
Migration. pp. 37–46 (2010)
Chapter 3
Transformation of Data Center Networks
In this chapter, we introduce the StarCube framework. Its core concept is the
dynamic and cost-effective partitioning of a hierarchical data center network into
several star networks and the provisioning of each service with a star network that is
consistently independent from other services.
The principal properties guaranteed by our framework include the following:
1. Non-blocking topology. Regardless of traffic pattern, the network topology
provisioned to each service is non-blocking after and even during reallocation.
The data rates of intra-service flows and outbound flows (i.e., those going out of
the data centers) are only bounded by the network interface rates.
2. Multi-tenant isolation. The topology is isolated for each service, with band-
width exclusively allocated. The migration process and the workload traffic are
also isolated among the services.
3. Predictable traffic cost. The per-hop distance of intra-service communications
required by each service is satisfied after and even during reallocation.
4. Efficient resource usage. The number of links allocated to each service to form
a non-blocking topology is the minimum.
The StarCube framework is based on the fat-tree structure [1], which is probably the
most discussed data center network structure and supports extremely high network
capacity with extensive path diversity between racks. As shown in Fig. 3.1, a k-ary
fat-tree network is built from k-port switches and consists of k pods interconnected
by (k/2)^2 core switches. For each pod, there are two layers of k/2 switches, called the
edge layer and the aggregation layer, which jointly form a complete bipartite net-
work with (k/2)^2 links. Each edge switch is connected to k/2 servers through the
downlinks, and each aggregation switch is also connected to k/2 core switches but
through the uplinks. The core switches are separated into (k/2) groups, where the ith
group is connected to the ith aggregation switch in each pod. There are (k/2)^2 servers
in each pod. All the links and network interfaces on the servers or switches are of the
same bandwidth capacity. We assume that every switch supports non-blocking
multiplexing, by which the traffic on downlinks and uplinks can be freely multi-
plexed and the traffic at different ports do not interfere with one another.
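As a quick check of these counts, a small helper (illustrative only, not part of the framework) enumerates the sizes implied by the description above.

def fat_tree_counts(k):
    """Component counts of a k-ary fat-tree (k even), as described above."""
    return {
        "pods": k,
        "core_switches": (k // 2) ** 2,
        "edge_switches_per_pod": k // 2,
        "aggregation_switches_per_pod": k // 2,
        "servers_per_pod": (k // 2) ** 2,
        "servers_total": k * (k // 2) ** 2,
    }

# Example: an 8-ary fat-tree has 8 pods, 16 core switches and 128 servers.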
For ease of explanation, but without loss of generality, we explicitly label all
servers and switches, and then label all network links according to their connections
as follows:
1. At the top layer, the link which connects aggregation switch i in pod k and core
switch j in group i is denoted by Linkt(i, j, k).
2. At the middle layer, the link which connects aggregation switch i in pod k and
edge switch j in pod k is denoted by Linkm(i, j, k).
3. At the bottom layer, the link which connects server i in pod k and edge switch
j in pod k is denoted by Linkb(i, j, k).
For example, in Fig. 3.2, the solid lines indicate Linkt(2, 1, 4), Linkm(2, 1, 4) and
Linkb(2, 1, 4). This labeling rule also determines the routing paths. Thanks to the
symmetry of the fat-tree structure, the same number of servers and aggregation
switches are connected to each edge switch and the same number of edge switches
and core switches are connected to each aggregation switch. Thus, one can easily
verify that each server can be exclusively and exactly paired with one routing path
for connecting to the core layer because each downlink can be bijectively paired
with one exact uplink.
Once the allocation of all Linkm has been determined, the allocation of the
remaining servers, links and switches can be obtained accordingly. In our frame-
work, each allocated server will be paired with such a routing path for connecting
the server to a core switch. Such a server-path pair is called a resource unit in this
book for ease of explanation, and serves as the basic element of allocations in our
framework. Since the resources (e.g. network links and CPU processing power)
must be isolated among tenants so as to guarantee their performance, each resource
unit will be exclusively allocated to at most one cloud service.
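A minimal sketch of the labeling rule and the server-path pairing described above follows; the data structures and function names are illustrative assumptions rather than part of the framework itself.

from collections import namedtuple

Link = namedtuple("Link", "layer i j k")            # layer: "top", "middle" or "bottom"
ResourceUnit = namedtuple("ResourceUnit", "i j k server links")

def resource_unit(i, j, k):
    """Resource unit (i, j, k): server i under edge switch j in pod k, paired
    with the routing path through aggregation switch i of pod k up to core
    switch j in group i."""
    links = (Link("bottom", i, j, k),   # server i      <-> edge switch j        (pod k)
             Link("middle", i, j, k),   # edge switch j <-> aggregation switch i (pod k)
             Link("top",    i, j, k))   # aggregation i <-> core switch j, group i
    return ResourceUnit(i, j, k, server=(k, j, i), links=links)

def single_connected(u1, u2):
    """Lemma 3.2 in code form: two different resource units either attach to
    exactly one common switch or to no common switch at all."""
    if (u1.i, u1.j, u1.k) == (u2.i, u2.j, u2.k):
        return False                                              # identical units
    same_edge = (u1.k == u2.k and u1.j == u2.j)                   # single point: edge switch
    same_aggr = (u1.k == u2.k and u1.i == u2.i)                   # single point: aggregation switch
    same_core = (u1.k != u2.k and u1.i == u2.i and u1.j == u2.j)  # single point: core switch
    return same_edge or same_aggr or same_core

# MirrorUnits(i, j) is then simply the set {resource_unit(i, j, k) for every pod k}.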
Below, we will describe some fundamental properties of the resource unit. In
brief, any two of the resource units are either resource-disjoint or connected with
exactly one switch, regardless of whether they belong to the same pod. The set of
resource units in different pods using the same indices i, j is called MirrorUnits
(i, j) for convenience, which must be connected with exactly one core switch.
Definition 3.1 (Resource unit) For a k-ary fat-tree, a set of resources U = (S, L) is
called a resource unit, where S and L denote the set of servers and links,
respectively, if (1) there exist three integers i, j, k such that L = {Linkt(i, j, k),
Linkm(i, j, k), Linkb(i, j, k)}; and (2) for every server s in the fat-tree, s ∈ S if and
only if there exists a link l ∈ L such that s is connected with l.
Definition 3.2 (Intersection of resource units) For any number of resource units
U1,…,Un, where Ui = (Si, Li) for all i, the overlapping is defined as ∩i=1…n Ui =
(∩i=1…n Si, ∩i=1…n Li).
Lemma 3.1 (Intersection of two resource units) For any two different resource
units U1 = (S1, L1) and U2 = (S2, L2), exactly one of the following conditions holds:
(1) U1 = U2; (2) L1 ∩ L2 = S1 ∩ S2 = ∅.
Proof Let U1 = (S1, L1) and U2 = (S2, L2) be any two different resource units.
Suppose L1 ∩ L2 ≠ ∅ or S1 ∩ S2 ≠ ∅. By the definitions of the resource unit and the
fat-tree, there exists at least one link in L1 ∩ L2, thus implying L1 = L2 and S1 = S2.
This leads to U1 = U2, which is contrary to the statement. The proof is done. □
Definition 3.3 (Single-connected resource units) Consider any different resource
units U1,…,Un, where Ui = (Si, Li) for all i. They are called single-connected if
there exists exactly one switch x, called the single point, that connects every Ui.
(i.e., for every Li, there exists a link l ∈ Li such that l is directly connected to x.)
Lemma 3.2 (Single-connected resource units) For any two different resource units
U1 and U2, exactly one of the following conditions holds true: (1) U1 and U2 are
single-connected; (2) U1 and U2 do not directly connect to any common switch.
Proof Consider any two different resource units U1 and U2. Suppose U1 and U2
directly connect to two or more common switches. By definition, each resource unit
has only one edge switch, one aggregation switch and one core switch. Hence all of
the switches connecting U1 and U2 must be at different layers. By the definition of
the fat-tree structure, there exists only one path connecting any two switches at
different layers. Thus there exists at least one shared link between U1 and U2. It
hence implies U1 = U2 by Lemma 3.1, which is contrary to the statement. The proof
is done. □
Definition 3.4 The set MirrorUnits(i, j) is defined as the union of all resource units
of which the link set consists of a Linkm(i, j, k), where k is an arbitrary integer.
Lemma 3.3 (Mirror units on the same core) For any two resource units U1 and U2,
all of the following are equivalent: (1) {U1, U2} ⊆ MirrorUnits(i, j) for some i, j;
(2) U1 and U2 are single-connected and the single point is a core switch.
Proof We give a bidirectional proof, where for any two resource units U1 and U2,
the following statements are equivalent. There exist two integers i and j such that
{U1, U2} ⊆ MirrorUnits(i, j). There exist two links Linkm(i, j, ka) and Linkm(i, j, kb)
in their link sets, respectively. There exist two links Linkt(i, j, ka) and Linkt(i, j, kb) in
their link sets, respectively. The core switch j in group i connects both U1 and U2,
and by Lemma 3.2, it is a unique single point of U1 and U2. □
Lemma 3.4 (Non-blocking topology) For any n-star A = (S, L), A is a non-
blocking topology connecting any two servers in S.
Proof By the definition of n-star, any n-star must be made of single-connected
resource units, and by Definition 3.3, it is a star network topology. Since we assume
that all the links and network interfaces on the servers or switches are of the same
bandwidth capacity and each switch supports non-blocking multiplexing, it follows
that the topology for those servers is non-blocking. □
Lemma 3.5 (Equal hop-distance) For any n-star A = (S, L), the hop-distance
between any two servers in S is equal.
Proof For any n-star, by definition, the servers are single-connected by an edge
switch, aggregation switch or core switch, and by the definition of resource unit, the
path between each server and the single point must be the shortest path. By the
definition of the fat-tree structure, the hop-distance between any two servers in S is
equal. □
According to the position of each single point, which may be an edge, aggre-
gation or core switch, n-star can further be classified into four types, named type-E,
type-A, type-C and type-S for convenience in this book:
Definition 3.6 For any n-star A, A is called type-E if |A| > 1, and the single point of
A is an edge switch.
Definition 3.7 For any n-star A, A is called type-A if |A| > 1, and the single point of
A is an aggregation switch.
Definition 3.8 For any n-star A, A is called type-C if |A| > 1, and the single point of
A is a core switch.
Definition 3.9 For any n-star A, A is called type-S if |A| = 1.
Figure 3.3 shows some examples of n-star, where three independent cloud
services (from left to right) are allocated as the type-E, type-A and type-C n-stars,
respectively. By definitions, the resource is provisioned in different ways:
Using the properties of a resource unit, the fat-tree can be denoted as a matrix. For a
pod of the fat-tree, the edge layer, aggregation layer and all the links between them
jointly form a bipartite graph, and the allocation of links can hence be equivalently
denoted by a two-dimensional matrix. Therefore, for a data center with multiple
pods, the entire fat-tree can be denoted by a three-dimensional matrix. By
Lemma 3.1, all the resource units are independent. Thus an element of the fat-tree
matrix equivalently represents a resource unit in the fat-tree, and they are used
interchangeably in this book. Let the matrix element m(i, j, k) = 1 if and only if the
resource unit which consists of Linkm(i, j, k) is allocated, and m(i, j, k) = 0 other-
wise. We also let ms(i, j, k) denote the allocation of a resource unit for service s.
Below, we derive several properties for the framework which are the foundation
for developing the topology-preserving reallocation mechanisms. In brief, each
n-star in a fat-tree network can be gracefully represented as a one-dimensional vector
in a matrix, as shown in Fig. 3.4 (type-E allocations lie along the aggregation axis,
connected by one edge switch; type-A allocations along the edge axis, connected by one
aggregation switch; type-C allocations along the pod axis, connected by one core switch),
where the "aggregation axis" (i.e., the columns), the "edge axis" (i.e., the rows) and
the "pod axis" are used to indicate the three
directions of a vector. The intersection of any two n-stars is either an n-star or null,
and the union of any two n-stars remains an n-star if they are single-connected. The
difference of any two n-stars remains an n-star if one is included in the other.
Lemma 3.6 (n-star as vector) For any set of resource units A, A is n-star if and
only if A forms a one-dimensional vector in a matrix.
Proof We exhaust all possible n-star types of A and give a bidirectional proof for
each case. Note that a type-S n-star trivially forms a one-dimensional vector, i.e., a
single element, in a matrix.
Case 1: For any type-E n-star A, by definition, all the resource units of A are
connected to exactly one edge switch in a certain pod. By the definition of matrix
representation, A forms a one-dimensional vector along the aggregation axis.
Case 2: For any type-A n-star A, by definition, all the resource units of A are
connected to exactly one aggregation switch in a certain pod. By the definition of
matrix representation, A forms a one-dimensional vector along the edge axis.
Case 3: For any type-C n-star A, by definition, all the resource units of A are
connected to exactly one core switch. By Lemma 3.3 and the definition of matrix
representation, A forms a one-dimensional vector along the pod axis. □
Figure 3.4 shows several examples of resource allocation using the matrix
representation. For a type-E service which requests four resource units, {m(1, 3, 1),
m(4, 3, 1), m(5, 3, 1), m(7, 3, 1)} is one of the feasible allocations, where the service
is allocated aggregation switches 1, 4, 5, 7 and edge switch 3 in pod 1. For a
type-A service which requests four resource units, {m(3, 2, 1), m(3, 4, 1), m(3, 5, 1),
m(3, 7, 1)} is one of the feasible allocations, where the service is allocated
aggregation switch 3, edge switches 2, 4, 5, 7 in pod 1. For a type-C service which
requests four resource units, {m(1, 6, 2), m(1, 6, 3), m(1, 6, 5), m(1, 6, 8)} is one of
the feasible allocations, where the service is allocated aggregation switch 1, edge
switch 6 in pods 2, 3, 5, and 8.
Within a matrix, we further give some essential operations, such as intersection,
union and difference, for manipulating n-star while ensuring the structure and
properties defined above.
Definition 3.10 The intersection of two n-stars A1 and A2, denoted by A1 ∩ A2, is
defined as {U | U ∈ A1 and U ∈ A2}.
Lemma 3.7 (Intersection of n-stars) For any two n-stars A1 and A2, let Ax = (Sx, Lx)
be their intersection, exactly one of the following is true: (1) they share at least one
common resource unit and Ax is an n-star; (2) Sx = Lx = ∅. If Case 2 holds, we say
A1 and A2 are independent.
Proof From Lemma 3.6, every n-star forms a one-dimensional vector in the matrix,
and only the following cases represent the intersection of any two n-stars A1 and A2
in a matrix:
Case 1: Ax forms a single element or a one-dimensional vector in the matrix. By
Lemma 3.6, both imply that the intersection is an n-star and also indicate the
resource units shared by A1 and A2.
Case 2: Ax is null set. In this case, there is no common resource unit shared by A1
and A2. Therefore, for any two resource units U1 ∈ A1 and U2 ∈ A2, U1 ≠ U2, and by
Lemma 3.1, U1 ∩ U2 is a null set. There are no shared links and servers between A1
and A2, leading to Sx = Lx = ∅. □
Definition 3.11 The union of any two n-stars A1 and A2, denoted by A1 ∪ A2, is
defined as {U | U ∈ A1 or U ∈ A2}.
Lemma 3.8 (Union of n-stars) For any two n-stars A1 and A2, all of the following
are equivalent: (1) A1 ∪ A2 is an n-star; (2) A1 ∪ A2 forms a one-dimensional vector
in the matrix; and (3) A1 ∪ A2 is single-connected.
Proof For any two n-stars A1 and A2, the equivalence between (1) and (2) has been
proved by Lemma 3.6, and the equivalence between (1) and (3) has been given by
the definition of n-star. □
Definition 3.12 The difference of any two n-stars A1 and A2, denoted by A1\A2, is
defined as {U | U ∈ A1 and U ∉ A2}.
Lemma 3.9 (Difference of n-stars) For any two n-stars A1 and A2, if A2 ⊆ A1, then
A1\A2 is an n-star.
Proof By Lemma 3.1, different resource units are resource-independent (i.e., link-
disjoint and server-disjoint), and hence removing some resource units from any
n-star will not influence the remaining resource units.
For any two n-stars A1 and A2, the definition of A1\A2 is equivalent to removing
the resource units of A2 from A1. It is hence equivalent to a removal of some
elements from the one-dimensional vector representing A1 in the matrix. Since the
remaining resource units still form a one-dimensional vector, A1\A2 is an n-star
according to Lemma 3.6. □
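These definitions and lemmas can be condensed into a short sketch (illustrative Python over sets of matrix elements (i, j, k); the names are assumptions, not the book's notation).

def is_n_star(units):
    """Lemma 3.6: a set of matrix elements is an n-star exactly when it forms
    a one-dimensional vector, i.e., at least two of the three indices are
    common to all elements."""
    units = set(units)
    if len(units) <= 1:
        return True                                   # type-S, trivially an n-star
    fixed_axes = sum(len({u[d] for u in units}) == 1 for d in range(3))
    return fixed_axes >= 2

def star_intersection(a, b):
    return set(a) & set(b)        # always an n-star or the empty set (Lemma 3.7)

def star_union(a, b):
    u = set(a) | set(b)
    return u if is_n_star(u) else None   # an n-star only if single-connected (Lemma 3.8)

def star_difference(a, b):
    return set(a) - set(b)        # an n-star whenever b is included in a (Lemma 3.9)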
Fig. 3.5 An example of reducing a fat-tree while keeping the symmetry property
Fig. 3.6 An example of inefficiently reserving backup resources for fault tolerance
units can be completely replaced by any backup resource units. This is because
backup resource units are all leaves (and stems) of a star network and thus inter-
changeable in topology. There is absolutely no need to worry about
topology-related issues when making services fault-tolerant. This feature is par-
ticularly important and resource-efficient when operating services that require fault
tolerance and request complex network topologies. Without star network allocation,
those services may need a lot of reserved links to connect backup servers and active
servers, as shown in Fig. 3.6, an extremely difficult problem in saturated data center
networks; otherwise, after failure recovery, the topology will be changed and
intra-service communication will be disrupted.
The fault tolerance mechanisms can be much more resource-efficient if only one
or a few failures may occur at any point in time. Multiple services, even of different
types, are allowed to share one or more resource units as their backup. An
example is shown in Fig. 3.7, where three services of different types share one
backup resource unit. Such simple but effective backup sharing mechanisms help
raise resource utilization, no matter how complex the topologies requested by
services. Even after reallocation (discussed in the next section), it is not required to
find new backups for those reallocated services as long as they stay on the same
axes. In data centers that are much more prone to failure, services are also allowed
to be backed up with multiple backup resource units to improve survivability, and
those backups can still be shared among services or just dedicated. The ratio of
these two types of backups may be determined according to the levels of fault
tolerance requested by services or provisioned by data center operators.
Fig. 3.7 An example of efficiently reserving backup resources for fault tolerance
Fig. 3.8 An example of efficiently sharing backup resources for an 8-ary fat-tree
As shown in Fig. 3.8, where there is at least one shared backup resource unit on
each axis, fault tolerance can be provided for every resource unit and hence every
service, no matter whether they are type-E, type-A or type-C services. To provide such a
property in a data center using a k-ary fat-tree, only a 2/k fraction of the resources needs
to be reserved. When k equals 64 (i.e., the fat-tree is constructed with 64-port
switches), it takes only about 3 percent of the resources to prevent services from being
disrupted by any single server or link failure.
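A one-line helper (illustrative only) makes the reservation overhead quoted above explicit.

def reserved_backup_fraction(k):
    """Fraction of resource units set aside for shared backups in a k-ary
    fat-tree, following the 2/k figure given above."""
    return 2.0 / k

# reserved_backup_fraction(64) == 0.03125, i.e., roughly 3 percent.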
Resource efficiency and performance guarantee are two critical issues for a cloud
data center. The guarantee of performance is based on the topology consistency of
allocations over their entire life cycle. Based on the properties and operations of n-
star, we design several fundamental reallocation mechanisms which allow an n-star
to be reallocated while the topology allocated to every service is still guaranteed
logically unchanged during and after reallocation. The basic concept of reallocating
each resource unit, called an atomic reallocation, is reallocating it along the same
axes in the matrix, with the reallocated n-star remaining a one-dimensional vector
in the matrix.
The atomic reallocation promises three desirable properties for manipulating n-
star allocations. In brief, (1) the topology is guaranteed logically unchanged; (2) the
migration path is guaranteed independent (link-disjoint and server-disjoint) to other
services or another migration path; and (3) the migration cost is limited and pre-
dictable. Using these properties, we can develop some reallocation mechanisms (or
programming models) for different objectives of data center resource management.
Definition 3.13 For every n-star A or resource unit, the allocation state is either
allocated or available (i.e., not allocated).
Definition 3.14 For any n-star A, any pair of n-stars (x, z) is called an atomic
reallocation for A if (1) A ∪ x ∪ z is single-connected; (2) x ⊆ A; (3) z ⊄ A; (4) x is
allocated; (5) z is available; and (6) |x| = |z| = 1. In addition, reallocation is defined
To realize server and path migration in fat-tree networks, some forwarding tables of
the switches on the path must be modified accordingly, and some services need to
be migrated. The process is different for the various types of n-star. Note that
type-S may be dynamically treated as any type.
1. Type-E. On each switch, it is assumed that the downlinks and uplinks can be
freely multiplexed to route traffic. Therefore, reallocating a type-E n-star does
not incur real server migration but only a modification in certain forwarding
tables. The new routing path will use a different aggregation switch and a
different core switch while the allocation of edge switch remains unchanged.
2. Type-A. It invokes an intra-pod server migration and forwarding table modifi-
cation, by which the path uses a different edge switch and core switch while the
aggregation switch remains the same. The entities involved in migration include
the current server (as the migration source), the current edge switch, the current
aggregation switch, the new edge switch and the new server (as the migration
destination). The first two links are currently allocated to the migrating service,
and the last two links must not have been allocated to any service. After the
migration, the last two links and the link between the current aggregation switch
and the new core switch (which is also available) jointly form the new routing
path for the migrated flow.
3. Type-C. It invokes a cross-pod server migration and forwarding path modifica-
tion. The entities involved in the migration include only the current server (as
the source), the current edge switch, the current aggregation switch, the current
core switch, the new aggregation switch, the new edge switch and the new
server (as the destination). The first three links are currently allocated to the service,
and the last three links must not have been allocated. After the migration, the
last three links of the migration path jointly form the new routing path for the
migrated flow.
Reallocating a type-A or type-C allocation requires servers to be migrated
among racks or pods. This generally incurs service downtime and may degrade the
quality of on-line services. Since such reallocation cost is generally proportional to
the number of server migrations, we could simplify it to the number of migrated
servers in this book.
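The conditions of Definition 3.14 and the cost accounting above can be summarized in a short sketch (illustrative Python over sets of matrix elements; the names and data layout are assumptions).

def _is_vector(units):
    """One-dimensional-vector test of Lemma 3.6: all elements (i, j, k) agree
    on at least two of the three indices."""
    units = set(units)
    if len(units) <= 1:
        return True
    return sum(len({u[d] for u in units}) == 1 for d in range(3)) >= 2

def is_atomic_reallocation(A, x, z, allocated, available):
    """Definition 3.14 in code form: (x, z) is an atomic reallocation for the
    n-star A if the swap stays on one axis and moves exactly one resource unit."""
    A, x, z = set(A), set(x), set(z)
    return (_is_vector(A | x | z)           # (1) A, x and z are single-connected (Lemma 3.8)
            and x <= A                       # (2) x is part of A
            and not (z <= A)                 # (3) z lies outside A
            and x <= set(allocated)          # (4) x is currently allocated
            and z <= set(available)          # (5) z is currently available
            and len(x) == len(z) == 1)       # (6) exactly one resource unit on each side

def migrated_servers(star_type, n_atomic_moves):
    """Cost accounting used above: a type-E move only rewrites forwarding
    tables, while each type-A or type-C atomic reallocation migrates one server."""
    return 0 if star_type == "E" else n_atomic_moves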
Reference
1. M. Al-Fares et al., in Proc. ACM SIGCOMM, A Scalable, Commodity Data Center Network
Architecture, (2008)
Chapter 4
Allocation of Servers
A service is allocated if and only if the exact number of the requested resource
units is satisfied and all the allocated resource units jointly form a one-dimensional
vector in matrix M. Such topology constraints for each allocated service s are
defined as the following two equations, both of which need to be satisfied at the
same time.
$$
x_s d_s =
\begin{cases}
\displaystyle\sum_{i} m_s(i, e_s, p_s), & \text{if } t_s = 1,\\[4pt]
\displaystyle\sum_{j} m_s(a_s, j, p_s), & \text{if } t_s = 2,\\[4pt]
\displaystyle\sum_{k} m_s(a_s, e_s, k), & \text{if } t_s = 3,
\end{cases}
$$

$$
x_s d_s = \sum_{i, j, k} m_s(i, j, k),
$$

$$
1 \le e_s \le N^{(e)}, \qquad 1 \le a_s \le N^{(a)}, \qquad 1 \le p_s \le N^{(p)}.
$$
The following equation ensures each resource unit in the matrix M could be
allocated to at most one service.
$$
\sum_{s} m_s(i, j, k) \le 1.
$$
For service s that exists in both the current and the next epochs, by the reallo-
cation mechanism in the framework, the resource units that have been provisioned
are only allowed to be reallocated along the original axis in the matrix M. For each
service s, we represent its current allocation with variables xs′, ps′, es′, as′ and ms′,
which have the same meanings as xs, ps, es, as and ms except in different epochs. For
new incoming services, these variables are set to zero. The following equations
ensure the topology allocated to each service remains an n-star with exactly the
same single point after reallocation.
$$
\begin{cases}
x_s' x_s \left( \left| e_s - e_s' \right| + \left| p_s - p_s' \right| \right) = 0, & \text{if } t_s = 1,\\
x_s' x_s \left( \left| a_s - a_s' \right| + \left| p_s - p_s' \right| \right) = 0, & \text{if } t_s = 2,\\
x_s' x_s \left( \left| a_s - a_s' \right| + \left| e_s - e_s' \right| \right) = 0, & \text{if } t_s = 3.
\end{cases}
$$
The reallocation cost for service s that exists in both current and next epochs is
defined as the number of modified resource units.
$$
c_s = x_s' x_s \sum_{i, j, k} \frac{\left| m_s(i, j, k) - m_s'(i, j, k) \right|}{2}.
$$
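A small helper (illustrative only, using dictionary-based matrices rather than the book's notation) shows how this cost is evaluated.

def reallocation_cost(x_old, x_new, m_old, m_new):
    """Number of modified resource units for one service: each move releases
    one matrix element and allocates another, hence the division by two."""
    if not (x_old and x_new):                 # the service is absent in one of the epochs
        return 0
    keys = set(m_old) | set(m_new)
    changed = sum(abs(m_new.get(idx, 0) - m_old.get(idx, 0)) for idx in keys)
    return changed // 2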
The time complexity of the algorithm and the resource efficiency are the two most
critical factors. Instead of exhaustively searching all possible assignments of the
matrix, which is impractical for typical data centers, we take advantage of the
properties of the framework to reduce the time complexity of searching feasible
reallocation schedules. Simply searching all possible atomic reallocations, however,
may not be sufficient to make effective reallocation schedules. We use the proved
properties of the atomic reallocation, such as the topology invariance and
matching algorithm is required to apply more than once for a column, leading to
higher time complexity. In later sections, we will show that such a limit on the
length is sufficient to achieve near-optimal resource efficiency, though longer
combined reallocations may render improved efficiency.
In addition to the topology constraints, the length of the combined reallocations
may incur a certain cost. To reduce and limit the reallocation cost, we can gradually
extend the search space for discovering longer candidate combined reallocations
while reserving the possibility of selecting shorter combined reallocations or even
atomic reallocations.
We can use the same mechanism to allocate type-E and type-A services by taking
advantage of the symmetry of fat-tree networks, and using the transposed matrix
(i.e., by inverting the bipartite graph) as the input. With such a transposed matrix,
each allocated type-A service is virtually treated as type-E to preserve the topology
constraints, and each allocated type-E service is also virtually treated as type-A. The
incoming type-A service is treated as type-E and could be allocated to the trans-
posed matrix with the same procedure. Such virtual conversions will not physically
affect any service. Because of the longer per-hop distance between servers on
type-C allocations, with the algorithm, a service is allocated as type-C only if it
cannot be allocated as type-E, type-A or type-S. Such a policy could depend on
service requirements, and modifying the procedure for a direct type-C allocation is
very straightforward.
SCAP tries to allocate an incoming type-E (or type-S) service into a pod. If the
service cannot be allocated even though the total number of available resource units is
sufficient, the network is considered fragmented and certain sub-procedures will be trig-
gered by SCAP to try to perform reallocations. If, after reallocations, a sufficient
number of resource units are available for allocating the incoming service, the
reallocations will be physically performed and the service is allocated. Otherwise,
SCAP will try to perform multi-pod reallocations if the service prefers to be allo-
cated as type-C rather than be rejected. This is particularly helpful when the
available spaces are distributed among pods.
This procedure discovers candidate reallocations for each resource unit in a pod.
The search scope is limited by a parameter, and we use L(x) to denote the list of
discovered candidate reallocations for releasing resource unit x. The detailed steps
of the mechanism are shown in Table A.1.
Step 1. If the search scope ≥ 0, discover every dummy reallocation that makes a
resource unit available without incurring reallocation and other costs (e.g.,
(m(2, 3), m(2, 3)) in Fig. 4.2). In detail, for each available resource unit x,
add (x, z) into L(x), where z = x.
Step 2. If the search scope ≥ 1, discover every atomic reallocation that makes a
resource unit available while incurring exactly one reallocation along a row
in the matrix (e.g., (m(1, 3), m(1, 5)) in Fig. 4.2). In detail, for each type-A
or type-S resource unit x, add all (x, z) into L(x), where z is an available
resource unit in the same row of x.
Step 3. If the search scope ≥ 2, discover every combined reallocation that makes a
resource unit available while incurring exactly two atomic reallocations (e.g.,
(m(1, 3), m(3, 1)) in Fig. 4.2). In detail, for each type-A or type-S resource
unit x, add all combined reallocations constructed by (x, y) and (y, z) into
L(x), where y satisfies all the following conditions: (1) y is a type-E or
type-S resource unit in the same row of x; (2) y is not x; and (3) z is an
available resource unit in the same column of y.
Step 4. If the search scope ≥ 3, discover every atomic reallocation that makes a
resource unit available while incurring exactly one cross-pod reallocation
(e.g., (m(1, 4, 1), m(1, 4, 2)) in Fig. 4.1). In detail, for each type-C
resource unit x, add all (x, z) into L(x), where z satisfies all the following
conditions: (1) z is an available resource unit; (2) z is in another pod; and
(3) z and x belong to the same MirrorUnits.
Step 5. Return all L(x) for every x.
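The candidate-discovery idea of Steps 1-3 can be sketched as follows (illustrative Python; the dictionary layout and names are assumptions, and the cross-pod Step 4 is omitted for brevity).

def lar(pod, scope):
    """Collect, for every resource unit x of a pod, the reallocations that
    would make x available.  `pod` maps a matrix element (row, col) to
    "available" or to the type of the unit allocated there ("E", "A", "S")."""
    rows = sorted({r for r, _ in pod})
    cols = sorted({c for _, c in pod})
    L = {x: [] for x in pod}

    for x, state in pod.items():
        r, c = x
        if scope >= 0 and state == "available":
            L[x].append([(x, x)])                         # Step 1: dummy reallocation
        if scope >= 1 and state in ("A", "S"):
            for c2 in cols:                               # Step 2: one move along the row
                z = (r, c2)
                if z != x and pod.get(z) == "available":
                    L[x].append([(x, z)])
        if scope >= 2 and state in ("A", "S"):
            for c2 in cols:                               # Step 3: two chained moves
                y = (r, c2)
                if y != x and pod.get(y) in ("E", "S"):
                    for r2 in rows:
                        z = (r2, c2)                      # z is in y's column
                        if z != y and pod.get(z) == "available":
                            L[x].append([(x, y), (y, z)])
    return L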
This procedure performs reallocations for obtaining n available resource units along
the pod axis in the matrix. It first invokes LAR to obtain candidate reallocations for
each pod, selects a MirrorUnits(i, j), and then reallocates some resource units such
that there are at least n available resource units in the selected MirrorUnits(i, j).
The procedure repeats for multiple iterations with an incremental value, which indicates
the search scope of the combined-reallocation discovery in LAR and ranges
from 0 to 2. Upon finding a successful result, the iteration stops. The detailed steps
of the procedure are shown in Table A.3.
Step 1. Use LAR to obtain all cost-bounded candidate reallocations for each
matrix.
Step 2. According to the candidate reallocations, for each MirrorUnits(i, j), count
the number of resource units that can be made available (i.e., those for
which at least one candidate reallocation exists).
Step 3. Select the first MirrorUnits(i, j) such that the number derived in the pre-
vious step is at least n. Go to the next iteration if such MirrorUnits(i, j)
cannot be found.
Step 4. In the selected MirrorUnits(i, j), select and release the first n resource units
available. For each resource unit, the first candidate reallocation is used in
case multiple candidate reallocations exist.
Step 5. Return the first n available resource units in the selected MirrorUnits(i, j).
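A compact sketch of the MirrorUnits selection in Steps 2-4 follows (illustrative only; it consumes per-pod candidate lists in the format produced by the LAR sketch above).

def select_mirror_units(pods_L, n):
    """Pick the first MirrorUnits(i, j) offering at least n releasable units,
    using the first candidate reallocation of each unit."""
    releasable = {}
    for pod_id, L in pods_L.items():
        for (i, j), candidates in L.items():
            if candidates:                                     # unit can be made available
                releasable.setdefault((i, j), []).append((pod_id, candidates[0]))
    for (i, j) in sorted(releasable):
        picks = releasable[(i, j)]
        if len(picks) >= n:
            return [(pod_id, i, j, realloc) for pod_id, realloc in picks[:n]]
    return None                                # not found: extend the search scope or give up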
This procedure allocates a type-E service for a request with n resource units into the
matrix. Note that the same procedure can be used to allocate type-A services, which
has been explained before. The detailed steps of the procedure are shown in
Table A.4.
Step 1. Select the first pod with the most available resource units.
Step 2. If the number of available resource units is less than n in the selected pod,
go to Step 5 to perform a cross-pod reallocation (or terminate when type-C
allocation is not allowed).
Step 3. In the selected pod, select the first n available resource units in the first
column with at least n available resource units, and, if found, allocate the
service to it and the procedure successfully terminates.
Step 4. Invoke SPR in the pod and try to obtain n available resource units in the
pod, and, if found, allocate the service to it and the procedure successfully
terminates.
Step 5. This step is valid only if the service also accepts type-C allocation.
Invoke MPR and try to find a MirrorUnits(i, j) with n available resource
units, and, if found, allocate the service to it and the procedure successfully
terminates.
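The overall flow of Steps 1-5 can be sketched as follows (illustrative Python; the column search of Step 3 is implemented inline, while the single-pod and multi-pod reallocation procedures are passed in as optional hooks).

def scap(pods, n, spr=None, mpr=None, allow_type_c=True):
    """Each pod is a dict mapping (row, col) to "available" or a service tag;
    returns the selected resource units or None if the request is rejected."""
    def free_in_column(pod, col):
        return [(r, c) for (r, c) in pod if c == col and pod[(r, c)] == "available"]

    # Step 1: pick the pod with the most available resource units.
    pod = max(pods, key=lambda p: sum(v == "available" for v in p.values()))
    total_free = sum(v == "available" for v in pod.values())
    if total_free >= n:                                   # Step 2: otherwise skip to Step 5
        for col in sorted({c for _, c in pod}):           # Step 3: first column with n free units
            units = free_in_column(pod, col)
            if len(units) >= n:
                return units[:n]
        if spr is not None:                               # Step 4: single-pod reallocation
            units = spr(pod, n)
            if units:
                return units
    if allow_type_c and mpr is not None:                  # Step 5: cross-pod (type-C) allocation
        return mpr(pods, n)
    return None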
There are some desirable properties of the proposed mechanism. In summary,
(1) every service successfully allocated must be provisioned with an isolated,
non-blocking n-star topology; (2) the reallocation cost for each service allocation is
bounded; (3) the reallocations can be concurrently triggered; (4) the network
performance is consistently guaranteed and isolated among tenants; and (5) the time
complexity is polynomial.
Lemma 4.1 (n-star allocation) In the proposed mechanism, the topology allocated
to any service requesting n resource units is an n-star, and also a non-blocking
network.
Proof For any service requesting n resource units, a successful allocation is one of
Steps 3, 4 and 5 in SCAP. In Steps 3 and 4, the service is allocated into a single
column (or a row when the matrix is virtually transposed), and in Step 5, the service
is allocated into cross-pod resource units, which must be in the same MirrorUnits.
The allocation is feasible only if there are n available resource units in such allo-
cation spaces. Therefore, such a feasible allocation must consist of exactly
n available resource units connecting to exactly one switch, and by Lemma 3.6, it is
an n-star. It is also a non-blocking network according to Lemma 3.4. □
Lemma 4.2 (Number of reallocations) When allocating any service requesting n
resource units by the proposed mechanism, it incurs at most n independent com-
bined reallocations of length two (i.e., each of them consists of at most two atomic
reallocations).
Proof When allocating any service requesting n resource units by the proposed
mechanism, the reallocation is equivalent to releasing at most n resource units along
a one-dimensional vector in the matrix. With LAR, releasing any resource unit
incurs either an atomic reallocation or a combined reallocation which is equivalent
to two atomic reallocations (i.e., in a propagation way). SPR and MPR both ensure
that any two combined reallocations do not share common resource units. Thus, it
incurs at most n combined reallocations, which are independent according to
Lemma 3.12, and each of them consists of at most two atomic reallocations. □
Theorem 4.2 (Concurrency of reallocations) When allocating any service
requesting n resource units by the proposed mechanism, it takes at most two time
slots to complete the reallocations, whereas an atomic reallocation takes at most
one time slot.
Proof When allocating any service requesting n resource units by the proposed
mechanism, by Lemma 4.2, there are at most n independent combined reallocations
of length two. By Lemmas 3.11 and 3.12, the first atomic reallocation of them can
be completed in the first time slot, and then the second atomic reallocation of them
can be completed in the next time slot. Thus, all reallocations can be completed in at
most two time slots. □
Theorem 4.3 (Bounded reallocation cost) When allocating any service requesting
n resource units by the proposed mechanism, the number of migrated resource
units is bounded by 2n.
Proof When allocating any service requesting n resource units by the proposed
mechanism, by Lemma 4.2, there are at most n independent combined reallocations
of length two. By Lemma 3.13, every atomic reallocation migrates one resource
unit. Therefore, the number of migrated resource units is bounded by 2n. □
Theorem 4.4 (Multi-tenant isolation) For any service allocated by the proposed
mechanism, the resource units, except the reallocated ones, are consistently and
exclusively allocated to the same service for its entire life cycle.
Proof For any service allocated by the proposed mechanism, by Lemma 4.1, it is
allocated with an n-star. Since the allocation is formed by available resource units,
the n-stars allocated to different services are independent according to Lemma 3.7.
By Lemma 3.11, the resource units are also exclusively allocated when other
services are reallocated. The proof is done. □
Theorem 4.5 (Topology consistency) For any service allocated by the proposed
mechanism, the allocation is consistently n-star, and also consistently a non-
blocking network.
Proof For any service allocated by the proposed mechanism, by Lemma 4.1, the
allocation is n-star, and also a non-blocking network. By Lemmas 3.10 and 3.11, it
remains n-star during and after reallocation. By Theorem 4.4, it also remains n-star when other services are reallocated. Thus it is consistently n-star, and also consistently a non-blocking network by Lemma 3.4. □
Theorem 4.6 (Consistently congestion-free and equal hop-distance) For any service allocated by the proposed mechanism, any traffic pattern of intra-service communications can be served without network congestion, except for servers under reallocation, and the per-hop distance of intra-service communication is consistently equal.
Proof For any service allocated by the proposed mechanism, by Lemma 3.4 and Theorems 4.4 and 4.5, the allocation is consistently an isolated non-blocking network; thus any traffic pattern of intra-service communications can be served without network congestion, except for servers under reallocation, and by Lemma 3.5 the per-hop distance of intra-service communications is consistently equal. □
Theorem 4.7 (Polynomial-time complexity) The complexity of allocating any service by the proposed mechanism is O(N^3.5), where N is the number of servers in a pod.
Proof The time complexity of the proposed mechanism is dominated by the second step of SPR, which uses a maximum cardinality bipartite matching algorithm to select independent reallocation schedules for each column in the matrix. For each column, we form a bipartite graph mapping O(N^0.5) resource units to O(N) reallocation schedules, and hence the bipartite graph has O(N) nodes. With the Hopcroft-Karp algorithm [2], the matching process takes O(N^2.5) for each such bipartite graph. There are O(N^0.5) pods, and O(N^0.5) columns in each pod. SPR iterates at most three times to extend the search scope in LAR. Thus, the complexity of allocating a service becomes O(N^3.5). □
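For illustration only, the sketch below runs the kind of per-column matching the proof refers to, using the Hopcroft-Karp implementation in NetworkX; the unit and schedule names and their compatibility edges are invented stand-ins for the structures SPR actually builds.

```python
# Sketch of the dominant step of SPR for one matrix column: a maximum-cardinality
# bipartite matching between resource units and candidate reallocation schedules.
import networkx as nx

units = ["u1", "u2", "u3"]                  # O(N^0.5) resource units in a column
schedules = ["s1", "s2", "s3", "s4"]        # O(N) candidate reallocation schedules
compatible = [("u1", "s1"), ("u1", "s2"), ("u2", "s2"), ("u3", "s4")]

G = nx.Graph()
G.add_nodes_from(units, bipartite=0)
G.add_nodes_from(schedules, bipartite=1)
G.add_edges_from(compatible)

# Hopcroft-Karp runs in O(E * sqrt(V)); with O(N) nodes per column this matches
# the O(N^2.5) per-column bound used in the proof above.
matching = nx.bipartite.hopcroft_karp_matching(G, top_nodes=units)
print({u: matching[u] for u in units if u in matching})
```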
References
1. G.L. Nemhauser, L.A. Wolsey, Integer and Combinatorial Optimization (John Wiley & Sons, New York, 1988)
2. J.E. Hopcroft, R.M. Karp, An n^5/2 algorithm for maximum matchings in bipartite graphs. SIAM J. Comput. 2(4), 225–231 (1973)
Chapter 5
Performance Evaluation
The simulation setup for evaluating Adaptive Fit is as follows. The number of VMs
in the system varies from 50 to 650. Let the resource requirement for each VM vary
in [0, 1] units; the capacity of each server is fixed at one. The requirement of each
VM is assigned independently and randomly, and stays fixed in each simulation.
The migration cost of each VM varies in [1, 1000] and is independent of the
resource requirement. This assignment is reasonable as the migration cost is related
to the downtime caused by the corresponding migration, which may vary from a
few milliseconds to several seconds. The saturation threshold u is set to 1, 0.95, and 0.9 to demonstrate the ability to balance the tradeoff between migration cost reduction and consolidation effectiveness in different cases.
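A minimal sketch of this setup is shown below; the value ranges follow the text, while drawing both the requirement and the migration cost uniformly at random is an assumption made for illustration.

```python
# Sketch of the simulation setup described above.
import random

def generate_vms(num_vms, seed=0):
    rng = random.Random(seed)
    return [
        {
            "id": i,
            "requirement": rng.uniform(0.0, 1.0),     # resource requirement in [0, 1]
            "migration_cost": rng.uniform(1, 1000),   # independent of the requirement
        }
        for i in range(num_vms)
    ]

SERVER_CAPACITY = 1.0                      # each server provides one unit of capacity
SATURATION_THRESHOLDS = [1.0, 0.95, 0.9]   # values of u evaluated in the simulations
vms = generate_vms(num_vms=650)
```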
We use the total migration cost, the average server utilization and the relative
total cost (RTC) as the metrics to evaluate the performance of Adaptive Fit and
compare it with other heuristics. FFD is chosen as the baseline because of its
simplicity and good performance in the typical server consolidation problem. Note
that since FFD has better performance than FF, we do not show FF in our figures.
RTC is defined as the ratio of the total cost incurred in a VM placement sequence
F to the maximum possible total cost, namely the maximum migration cost plus the
minimum hosting cost. Formally, RTC is defined as follows.
$$\mathrm{RTC} = \frac{\alpha\, m + e}{\alpha + 1}$$

$$m = \frac{\sum_{t \in T \setminus \{t_k\},\; i \in V,\; f_t(i) \neq f_{t+1}(i)} c_t(i)}{\sum_{t \in T \setminus \{t_k\},\; i \in V} c_t(i)}$$

$$e = \frac{\sum_{t \in T} H'_t \cdot 1}{\sum_{t \in T,\; \forall i \in V} r_t(i)}$$

For example, with α = 3, m = 0.4, and e = 1.1, RTC = (3 × 0.4 + 1.1)/(3 + 1) = 0.575.
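The relative total cost follows directly from α, m, and e; the small helper below (an illustration, not part of the evaluated code) reproduces the worked example.

```python
# RTC = (alpha * m + e) / (alpha + 1), as defined above.
def relative_total_cost(alpha, m, e):
    return (alpha * m + e) / (alpha + 1)

# Worked example from the text: alpha = 3, m = 0.4, e = 1.1 gives 0.575.
assert abs(relative_total_cost(3, 0.4, 1.1) - 0.575) < 1e-12
```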
The normalized migration cost is shown in Fig. 5.1. Adaptive Fit (AF) outperforms FFD in terms of total migration cost reduction while keeping a similar average server utilization level, as shown in Fig. 5.2. The reduction in total migration cost remains stable as the number of VMs increases from 50 to 650, which demonstrates that AF works well even for large-scale cloud services and data centers. Moreover, adjusting the saturation threshold u shows that the migration cost of AF decreases as u decreases.
Fig. 5.1 Normalized migration cost of FFD and AF (u = 1, 0.95, 0.9) versus the number of VMs
Fig. 5.2 Average server utilization of FFD and AF (u = 1, 0.95, 0.9) versus the number of VMs
Next, we consider the effectiveness of consolidation. Figure 5.2 shows that the average server utilization of AF is stable and high, averaging 97.4 % at saturation threshold u = 1, which is very close to that of FFD (98.3 % on average). For a lower u, idle servers are more likely to be turned on for VMs that cannot be allocated to their last hosting servers, which leaves more collective residual capacity in the active servers for other VMs to be allocated to their last hosting servers. Therefore, server utilization decreases slightly, but the migration cost can be significantly reduced. At u = 0.9, the average utilization is about 90.5 % and the average migration cost is further reduced to 21.7 %, as shown in Figs. 5.1 and 5.2.
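The effect of u can be pictured with a simple saturation check, sketched below; this is only an illustration of the threshold idea and not a restatement of the Adaptive Fit algorithm itself.

```python
# Illustration only: a saturation-threshold check, not the book's Adaptive Fit.
# A lower u makes it harder for a VM to stay on a nearly full previous host,
# preserving residual capacity there at the price of slightly lower utilization.
def can_stay(host_load, vm_requirement, u, capacity=1.0):
    """Keep the VM on its previous host only if the resulting utilization
    does not exceed the saturation threshold u."""
    return host_load + vm_requirement <= u * capacity

print(can_stay(0.7, 0.2, u=1.0))    # True: the host would reach 90 % utilization
print(can_stay(0.7, 0.2, u=0.85))   # False: a lower u leaves that capacity free
```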
To jointly evaluate the benefit and cost overhead of our server consolidation
mechanism, we compare the relative total cost RTC caused by AF and FFD. By
definition, the total cost depends on different models of revenue and hosting cost;
the total cost reduction is shown in Fig. 5.3. We vary the value of α from 0.25 to 32
to capture the behavior of different scenarios. Migration cost dominates the total
cost at high values of α, and hosting cost dominates the total cost at low values of α.
We fix the number of VMs at 650. As shown in Fig. 5.3, the total cost of AF is much smaller than that of FFD. FFD incurs a very high total cost because it considers only the number of servers in use. The curves of AF match those of FFD very well when α is
very small, because the total cost is then dominated by the hosting cost. When α exceeds 0.5, i.e., when the maximum migration cost is at least half of the minimum hosting cost, AF reduces the total cost substantially.
In summary, the simulation results show the importance and the effect of the adjustable saturation threshold u. (1) For a system with a high α, the migration cost dominates the total cost; a smaller u therefore yields a greater reduction in migration cost and thus a lower total cost. (2) A lower u leaves more residual capacity in the active servers, which can be used to host other VMs without incurring migration. The adjustable saturation threshold thus balances the trade-off between downtime and utilization, making the solution well suited to systems in which low downtime is more critical than high utilization.
We next evaluate the proposed allocation and reallocation mechanism in terms of resource efficiency, reallocation cost, and scalability, and explore its feasibility for cloud data centers with different dynamic demands. The resource efficiency is defined as the ratio of the total number of allocated resource units to the total number of resource units in the data center. The reallocation cost is normalized as the migration ratio, i.e., the ratio of the total number of migrated resource units to the total number of allocated resource units. For evaluating scalability, the data center is constructed as a k-ary fat-tree, where k ranges from 16 to 48 and the number of servers accordingly ranges from 1024 to 27,648, representing small to large data centers.
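For reference, a k-ary fat-tree connects k^3/4 servers, which matches the range quoted above; the two metrics can then be computed directly, as in the sketch below (the helper names are ours, not the book's).

```python
# k-ary fat-tree sizing and the two evaluation metrics defined above.
def fat_tree_servers(k):
    """A k-ary fat-tree connects k^3 / 4 servers, e.g., 16 -> 1024, 48 -> 27,648."""
    return k ** 3 // 4

def resource_efficiency(allocated_units, total_units):
    return allocated_units / total_units

def migration_ratio(migrated_units, allocated_units):
    return migrated_units / allocated_units

assert fat_tree_servers(16) == 1024 and fat_tree_servers(48) == 27648
```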
In each run of the simulations, a set of independent services is randomly generated. The requested type of allocation of each service is type-E or type-A, chosen at random, and may be dynamically changed to type-C by Method 3 in some cases mentioned earlier. Each service requests one to N resource units, where N is the capacity (i.e., the maximum number of downlinks) of an aggregation or edge switch. The demand generation follows a normal distribution with mean N/2 and variance N/6 (such that about 99 % of requests fall within [1, N]; any demand larger than N is dropped). We let the total service demand be exactly equal to the available capacity of the entire data center. In reality, large cloud data centers usually host hundreds or even thousands of independent services. With such a large number, we assume in the simulations that the load of services, taken to be proportional to the number of requested resource units, can be approximated by a normal distribution. We will also show results based on a uniform distribution and discuss the impact of the demand size distribution.
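A sketch of this demand generator is given below; it treats N/6 as the standard deviation of the normal sampler (the reading under which roughly 99 % of draws fall in [1, N]), simply re-draws out-of-range demands, and uses N = 24 as an arbitrary example value.

```python
# Sketch of the per-service demand generator described above.
import random

def generate_demand(N, rng):
    """Return a demand in [1, N], drawn from a normal distribution around N/2."""
    while True:
        d = round(rng.gauss(N / 2, N / 6))
        if 1 <= d <= N:
            return d

rng = random.Random(0)
N = 24                   # example downlink capacity of an aggregation/edge switch
demands = [generate_demand(N, rng) for _ in range(1000)]
print(min(demands), max(demands), sum(demands) / len(demands))
```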
To evaluate the practical capacity of cloud data centers under various uses, we simulate different demand dynamics. Taking 30 % dynamics as an example: in the first phase, demands consuming 100 % of the capacity are generated as the input to each allocation mechanism, and then 30 % of the allocated resource units are randomly released. In the second phase, new demands that consume the current residual capacity are generated as the input to each allocation mechanism. We collect the resource efficiency and reallocation cost after Phase 2. Each data point in every graph is averaged over 50 independent simulation runs.
The simulations for large-scale, fully loaded data centers (i.e., 48-ary and 10 % dynamics) take about 1, 3, and 10 ms on average for Methods 1, 2, and 3, respectively, to allocate an incoming service requesting 10 servers. This shows that the run time of the proposed algorithm adds only a short delay compared with the typical VM startup time.
We evaluate the resource efficiency under different dynamic demands to verify the practicality. As shown in Fig. 5.4, where the data center is constructed with a 48-ary fat-tree (i.e., 27,648 servers), Methods 2 and 3, which use the allocation mechanism in cooperation with the proposed reallocation procedures, achieve almost 100 % resource efficiency regardless of how dynamic the demand is. This excellent performance results from rearranging fragmented resources into larger available clusters for subsequent allocations.
Fig. 5.4 Resource efficiency of Methods 1–3 under different demand dynamics (48-ary fat-tree, normal distribution)
Fig. 5.5 Resource efficiency of Methods 1–3 for different fat-tree scales k (30 % dynamics, normal distribution)
Next, we evaluate the scalability of our mechanism. As shown in Fig. 5.5, where the demand dynamics is fixed at 30 %, Methods 2 and 3 both achieve higher resource efficiency because the proposed mechanisms effectively reallocate resources in cloud data centers of any scale. The result shows scalability even for a large commercial cloud data center hosting more than 20,000 servers. However, since resource fragmentation may occur at any scale and Method 1 does not support reallocation, it achieves only about 80 % resource efficiency.
Fig. 5.6 Resource efficiency of Methods 1–3 under different demand dynamics (48-ary fat-tree, uniform distribution)
Fig. 5.7 Resource efficiency of Methods 1–3 for different fat-tree scales k (30 % dynamics, uniform distribution)
Fig. 5.8 Inter-rack reallocation cost of Method 3 for different fat-tree scales k under demand dynamics of 10–90 % (uniform distribution)
Fig. 5.9 Inter-pod reallocation cost of Method 3 for different fat-tree scales k under demand dynamics of 10–90 % (uniform distribution)
As shown in Fig. 5.8, with high demand dynamics there are relatively large clusters of available resource units, so more services can be allocated without reallocation and the average cost becomes lower. Even with low dynamics, when the resource pool is fragmented into smaller pieces, the cost is still about 0.4. The inter-pod reallocation cost, shown in Fig. 5.9, behaves similarly and is smaller than the inter-rack reallocation cost. This is because the proposed mechanism gives higher priority to intra-pod reallocation so as to reduce cross-pod migration, which has a longer per-hop distance and may lead to a longer migration time. The results show that our method incurs negligible reallocation cost.
Chapter 6
Conclusion