Live Migration of Virtual Machines
We have implemented high-performance migration support for Xen [1], a freely available open source VMM for commodity hardware. Our design and implementation addresses the issues and tradeoffs involved in live local-area migration. Firstly, as we are targeting the migration of active OSes hosting live services, it is critically important to minimize the downtime during which services are entirely unavailable. Secondly, we must consider the total migration time, during which state on both machines is synchronized and which hence may affect reliability. Furthermore, we must ensure that migration does not unnecessarily disrupt active services through resource contention (e.g., CPU, network bandwidth) with the migrating OS.

Our implementation addresses all of these concerns, allowing for example an OS running the SPECweb benchmark to migrate across two physical hosts with only 210ms unavailability, or an OS running a Quake 3 server to migrate with just 60ms downtime. Unlike application-level restart, we can maintain network connections and application state during this process, hence providing effectively seamless migration from a user's point of view.

We achieve this by using a pre-copy approach in which pages of memory are iteratively copied from the source machine to the destination host, all without ever stopping the execution of the virtual machine being migrated. Page-level protection hardware is used to ensure a consistent snapshot is transferred, and a rate-adaptive algorithm is used to control the impact of migration traffic on running services. The final phase pauses the virtual machine, copies any remaining pages to the destination, and resumes execution there. We eschew a 'pull' approach which faults in missing pages across the network, since this adds a residual dependency of arbitrarily long duration, as well as providing in general rather poor performance.

Our current implementation does not address migration across the wide area, nor does it include support for migrating local block devices, since neither of these is required for our target problem space. However, we discuss ways in which such support can be provided in Section 7.

2 Related Work

The Collective project [3] has previously explored VM migration as a tool to provide mobility to users who work on different physical hosts at different times, citing as an example the transfer of an OS instance to a home computer while a user drives home from work. Their work aims to optimize for slow (e.g., ADSL) links and longer time spans, and so stops OS execution for the duration of the transfer, with a set of enhancements to reduce the transmitted image size. In contrast, our efforts are concerned with the migration of live, in-service OS instances on fast networks with only tens of milliseconds of downtime. Other projects that have explored migration over longer time spans by stopping and then transferring include Internet Suspend/Resume [4] and µDenali [5].

Zap [6] uses partial OS virtualization to allow the migration of process domains (pods), essentially process groups, using a modified Linux kernel. Their approach is to isolate all process-to-kernel interfaces, such as file handles and sockets, into a contained namespace that can be migrated. Their approach is considerably faster than results in the Collective work, largely due to the smaller units of migration. However, migration in their system is still on the order of seconds at best, and does not allow live migration; pods are entirely suspended, copied, and then resumed. Furthermore, they do not address the problem of maintaining open connections for existing services.

The live migration system presented here has considerable shared heritage with the previous work on NomadBIOS [7], a virtualization and migration system built on top of the L4 microkernel [8]. NomadBIOS uses pre-copy migration to achieve very short best-case migration downtimes, but makes no attempt at adapting to the writable working set behavior of the migrating OS.

VMware has recently added OS migration support, dubbed VMotion, to their VirtualCenter management software. As this is commercial software and strictly disallows the publication of third-party benchmarks, we are only able to infer its behavior through VMware's own publications. These limitations make a thorough technical comparison impossible. However, based on the VirtualCenter User's Manual [9], we believe their approach is generally similar to ours and would expect it to perform to a similar standard.

Process migration, a hot topic in systems research during the 1980s [10, 11, 12, 13, 14], has seen very little use for real-world applications. Milojicic et al. [2] give a thorough survey of possible reasons for this, including the problem of the residual dependencies that a migrated process retains on the machine from which it migrated. Examples of residual dependencies include open file descriptors, shared memory segments, and other local resources. These are undesirable because the original machine must remain available, and because they usually negatively impact the performance of migrated processes.

For example, Sprite [15] processes executing on foreign nodes require some system calls to be forwarded to the home node for execution, leading to at best reduced performance and at worst widespread failure if the home node is unavailable. Although various efforts were made to ameliorate performance issues, the underlying reliance on the availability of the home node could not be avoided. A similar fragility occurs with MOSIX [14], where a deputy process on the home node must remain available to support remote execution.
We believe the residual dependency problem cannot easily be solved in any process migration scheme – even modern mobile run-times such as Java and .NET suffer from problems when network partition or machine crash causes class loaders to fail. The migration of entire operating systems inherently involves fewer or zero such dependencies, making it more resilient and robust.

Memory transfer can be split into a push phase (pre-copying pages while the VM continues to run), a stop-and-copy phase, and a pull phase (faulting pages across the network on demand); most practical solutions select one or two of the three. For example, pure stop-and-copy [3, 4, 5] involves halting the original VM, copying all pages to the destination, and then starting the new VM. This has advantages in terms of simplicity, but means that both downtime and total migration time are proportional to the amount of physical memory allocated to the VM. This can lead to an unacceptable outage if the VM is running a live service.
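That proportionality is easy to quantify: the outage for pure stop-and-copy is simply the VM's memory size divided by the transfer bandwidth. A one-line sketch, using the VM size and slowest link speed evaluated later in the paper (and treating one Mbit as 2^20 bits so the result matches the figures quoted in Section 4.2):

```python
# Pure stop-and-copy downtime = VM memory / transfer bandwidth.
# 512MB of VM memory over a 128 Mbit/sec link (Mbit taken as 2**20 bits):
vm_memory_bits = 512 * 1024 * 1024 * 8
link_bits_per_sec = 128 * 1024 * 1024
print(vm_memory_bits / link_bits_per_sec)   # 32.0 seconds of downtime
```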
For network resources, we want a migrated OS to maintain all open network connections without relying on forwarding mechanisms on the original host (which may be shut down following migration), or on support from mobility or redirection mechanisms that are not already present (as in [6]). A migrating VM will include all protocol state (e.g. TCP PCBs), and will carry its IP address with it.

In a cluster environment, the network interfaces of the source and destination machines typically exist on a single switched LAN. Our solution for managing migration with respect to network in this environment is to generate an unsolicited ARP reply from the migrated host, advertising that the IP has moved to a new location. This will reconfigure peers to send packets to the new physical address, and while a very small number of in-flight packets may be lost, the migrated domain will be able to continue using open connections with almost no observable interference.

Some routers are configured not to accept broadcast ARP replies (in order to prevent IP spoofing), so an unsolicited ARP may not work in all scenarios. If the operating system is aware of the migration, it can opt to send directed replies only to interfaces listed in its own ARP cache, to remove the need for a broadcast. Alternatively, on a switched network, the migrating OS can keep its original Ethernet MAC address, relying on the network switch to detect its move to a new port¹.

¹ Note that on most Ethernet controllers, hardware MAC filtering will have to be disabled if multiple addresses are in use (though some cards support filtering of multiple addresses in hardware), and so this technique is only practical for switched networks.
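An unsolicited (gratuitous) ARP reply of the kind described above takes only a few lines of user-level code once the VM has resumed on the destination. The sketch below is not the paper's in-kernel implementation; it is a minimal illustration using the third-party scapy library, and the interface name, IP address and MAC address are placeholders.

```python
# Minimal sketch of advertising a migrated IP address with an
# unsolicited (gratuitous) ARP reply, using the scapy library.
# The interface, IP and MAC values below are hypothetical.
from scapy.all import ARP, Ether, sendp

def advertise_moved_ip(ip: str, new_mac: str, iface: str = "eth0") -> None:
    """Broadcast an ARP reply claiming that `ip` now lives at `new_mac`."""
    reply = Ether(dst="ff:ff:ff:ff:ff:ff", src=new_mac) / ARP(
        op=2,            # 2 = ARP reply ("is-at")
        psrc=ip,         # sender protocol address: the migrated IP
        hwsrc=new_mac,   # sender hardware address: the new host's NIC
        pdst=ip,         # target fields mirror the sender for a gratuitous ARP
        hwdst="ff:ff:ff:ff:ff:ff",
    )
    sendp(reply, iface=iface, verbose=False)

if __name__ == "__main__":
    advertise_moved_ip("10.0.0.42", "00:16:3e:00:00:01")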
In the cluster, the migration of storage may be similarly addressed: most modern data centers consolidate their storage requirements using a network-attached storage (NAS) device, in preference to using local disks in individual servers. NAS has many advantages in this environment, including simple centralised administration, widespread vendor support, and reliance on fewer spindles leading to a reduced failure rate. A further advantage for migration is that it obviates the need to migrate disk storage, as the NAS is uniformly accessible from all host machines in the cluster. We do not address the problem of migrating local-disk storage in this paper, although we suggest some possible strategies as part of our discussion of future work.

3.3 Design Overview

The logical steps that we execute when migrating an OS are summarized in Figure 1. We take a conservative approach to the management of migration with regard to safety and failure handling. Although the consequences of hardware failures can be severe, our basic principle is that safe migration should at no time leave a virtual OS more exposed to system failure than when it is running on the original single host. To achieve this, we view the migration process as a transactional interaction between the two hosts involved:

Stage 0: Pre-Migration. We begin with an active VM on physical host A. To speed any future migration, a target host may be preselected where the resources required to receive migration will be guaranteed.

Stage 1: Reservation. A request is issued to migrate an OS from host A to host B. We initially confirm that the necessary resources are available on B and reserve a VM container of that size. Failure to secure resources here means that the VM simply continues to run on A unaffected.

Stage 2: Iterative Pre-Copy. During the first iteration, all pages are transferred from A to B. Subsequent iterations copy only those pages dirtied during the previous transfer phase.

Stage 3: Stop-and-Copy. We suspend the running OS instance at A and redirect its network traffic to B. As described earlier, CPU state and any remaining inconsistent memory pages are then transferred. At the end of this stage there is a consistent suspended copy of the VM at both A and B. The copy at A is still considered to be primary and is resumed in case of failure.

Stage 4: Commitment. Host B indicates to A that it has successfully received a consistent OS image. Host A acknowledges this message as commitment of the migration transaction: host A may now discard the original VM, and host B becomes the primary host.

Stage 5: Activation. The migrated VM on B is now activated. Post-migration code runs to reattach device drivers to the new machine and advertise moved IP addresses.

[Figure 1: Migration timeline. The VM runs normally on host A through Stage 0 (Pre-Migration: an alternate physical host may be preselected, block devices mirrored and free resources maintained) and Stage 1 (Reservation: a container is initialized on the target host); during Stage 2 (Iterative Pre-Copy) shadow paging is enabled and dirty pages are copied in successive rounds, incurring copying overhead; after Stage 3 (Stop-and-Copy), Stage 4 (Commitment) releases the VM state on host A, and in Stage 5 (Activation) the VM starts on host B, connects to local devices and resumes normal operation.]
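The transactional structure of Stages 0–5 can be made concrete with a short coordinator sketch. This is not the Xen control software itself; it is an illustrative outline in which `source`, `dest` and their methods stand for hypothetical control interfaces whose transport, reservation and suspend/resume primitives are assumed to exist elsewhere.

```python
# Illustrative sketch of the Stage 0-5 migration transaction.
# `source` and `dest` are hypothetical control handles for the two hosts.

def migrate(source, dest, vm):
    # Stage 1: Reservation. If B cannot reserve a container, the VM
    # simply keeps running on A.
    if not dest.reserve_container(vm.memory_size):
        return "aborted: no resources on destination"

    try:
        # Stage 2: Iterative pre-copy. Round 1 sends every page;
        # later rounds send only pages dirtied during the previous round.
        dirty = set(range(vm.num_pages))
        while not source.precopy_should_stop(dirty):
            dest.receive_pages(source.read_pages(dirty))
            dirty = source.fetch_and_clear_dirty_bitmap()

        # Stage 3: Stop-and-copy. A keeps a consistent suspended copy
        # and remains primary until commitment.
        source.suspend(vm)
        dest.receive_pages(source.read_pages(dirty))
        dest.receive_cpu_state(source.read_cpu_state())

        # Stage 4: Commitment. Only after B acknowledges a consistent
        # image may A discard its copy.
        dest.commit()
        source.discard(vm)
    except Exception:
        # Any failure before commitment resumes the VM locally on A.
        source.resume(vm)
        raise

    # Stage 5: Activation. B reattaches devices and advertises the
    # migrated IP address (e.g. via an unsolicited ARP reply).
    dest.activate(vm)
    return "migrated"
```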
[Figure 2: WWS curve for a complete run of SPEC CINT2000 (512MB VM). The plot shows the number of 4KB pages dirtied in each 8-second interval (y-axis, up to roughly 80,000 pages) against elapsed time in seconds (x-axis, 0–12,000), annotated with the sub-benchmarks gzip, vpr, gcc, mcf, crafty, parser, eon, perlbmk, gap, vortex, bzip2 and twolf.]
This approach to failure management ensures that at least one host has a consistent VM image at all times during migration. It depends on the assumption that the original host remains stable until the migration commits, and that the VM may be suspended and resumed on that host with no risk of failure. Based on these assumptions, a migration request essentially attempts to move the VM to a new host, and on any sort of failure execution is resumed locally, aborting the migration.

4 Writable Working Sets

When migrating a live operating system, the most significant influence on service performance is the overhead of coherently transferring the virtual machine's memory image. As mentioned previously, a simple stop-and-copy approach will achieve this in time proportional to the amount of memory allocated to the VM. Unfortunately, during this time any running services are completely unavailable.

A more attractive alternative is pre-copy migration, in which the memory image is transferred while the operating system (and hence all hosted services) continues to run. The drawback, however, is the wasted overhead of transferring memory pages that are subsequently modified, and hence must be transferred again. For many workloads there will be a small set of memory pages that are updated very frequently, and which it is not worth attempting to maintain coherently on the destination machine before stopping and copying the remainder of the VM.

The fundamental question for iterative pre-copy migration is: how does one determine when it is time to stop the pre-copy phase because too much time and resource is being wasted? Clearly if the VM being migrated never modifies memory, a single pre-copy of each memory page will suffice to transfer a consistent image to the destination. However, should the VM continuously dirty pages faster than the rate of copying, then all pre-copy work will be in vain and one should immediately stop and copy.

In practice, one would expect most workloads to lie somewhere between these extremes: a certain (possibly large) set of pages will seldom or never be modified and hence are good candidates for pre-copy, while the remainder will be written often and so should best be transferred via stop-and-copy – we dub this latter set of pages the writable working set (WWS) of the operating system, by obvious extension of the original working set concept [17].

In this section we analyze the WWS of operating systems running a range of different workloads in an attempt to obtain some insight to allow us to build heuristics for an efficient and controllable pre-copy implementation.

4.1 Measuring Writable Working Sets

To trace the writable working set behaviour of a number of representative workloads we used Xen's shadow page tables (see Section 5) to track dirtying statistics on all pages used by a particular executing operating system. This allows us to determine within any time period the set of pages written to by the virtual machine.

Using the above, we conducted a set of experiments to sample the writable working set size for a variety of benchmarks.
[Figure 3: Expected downtime due to last-round memory copy on traced page dirtying of a Linux kernel compile. Three panels ("Effect of Bandwidth and Pre-Copy Iterations on Migration Downtime") correspond to migration throughputs of 128, 256 and 512 Mbit/sec; each plots the rate of page dirtying (pages/sec) and the expected downtime (sec) against elapsed time over a roughly 600-second run.]

[Figure 4: Expected downtime due to last-round memory copy on traced page dirtying of the OLTP database benchmark, with the same three bandwidth panels over a roughly 1200-second run.]

[Figure 5: Expected downtime due to last-round memory copy on traced page dirtying of a Quake 3 server, with the same three bandwidth panels over a roughly 500-second run.]

[Figure 6: Expected downtime due to last-round memory copy on traced page dirtying of SPECweb, with the same three bandwidth panels over a roughly 700-second run.]
Xen was running on a dual processor Intel Xeon 2.4GHz machine, and the virtual machine being measured had a memory allocation of 512MB. In each case we started the relevant benchmark in one virtual machine and read the dirty bitmap every 50ms from another virtual machine, cleaning it every 8 seconds – in essence this allows us to compute the WWS with a (relatively long) 8 second window, but estimate it at a finer (50ms) granularity.

The benchmarks we ran were SPEC CINT2000, a Linux kernel compile, the OSDB OLTP benchmark using PostgreSQL, and SPECweb99 using Apache. We also measured a Quake 3 server, as we are particularly interested in highly interactive workloads.

Figure 2 illustrates the writable working set curve produced for the SPEC CINT2000 benchmark run. This benchmark involves running a series of smaller programs in order and measuring the overall execution time. The x-axis measures elapsed time, and the y-axis shows the number of 4KB pages of memory dirtied within the corresponding 8 second interval; the graph is annotated with the names of the sub-benchmark programs.

From this data we observe that the writable working set varies significantly between the different sub-benchmarks. For programs such as 'eon' the WWS is a small fraction of the total working set and hence is an excellent candidate for migration. In contrast, 'gap' has a consistently high dirtying rate and would be problematic to migrate. The other benchmarks go through various phases but are generally amenable to live migration. Thus performing a migration of an operating system will give different results depending on the workload and the precise moment at which migration begins.

4.2 Estimating Migration Effectiveness

We observed that we could use the trace data acquired to estimate the effectiveness of iterative pre-copy migration for various workloads. In particular we can simulate a particular network bandwidth for page transfer, determine how many pages would be dirtied during a particular iteration, and then repeat for successive iterations. Since we know the approximate WWS behaviour at every point in time, we can estimate the overall amount of data transferred in the final stop-and-copy round and hence estimate the downtime.

Figures 3–6 show our results for the four remaining workloads. Each figure comprises three graphs, each of which corresponds to a particular network bandwidth limit for page transfer; each individual graph shows the WWS histogram (in light gray) overlaid with four line plots estimating service downtime for up to four pre-copying rounds.

Looking at the topmost line (one pre-copy iteration), the first thing to observe is that pre-copy migration always performs considerably better than naive stop-and-copy. For a 512MB virtual machine this latter approach would require 32, 16, and 8 seconds downtime for the 128Mbit/sec, 256Mbit/sec and 512Mbit/sec bandwidths respectively. Even in the worst case (the starting phase of SPECweb), a single pre-copy iteration reduces downtime by a factor of four. In most cases we can expect to do considerably better – for example both the Linux kernel compile and the OLTP benchmark typically experience a reduction in downtime of at least a factor of sixteen.

The remaining three lines show, in order, the effect of performing a total of two, three or four pre-copy iterations prior to the final stop-and-copy round. In most cases we see an increased reduction in downtime from performing these additional iterations, although with somewhat diminishing returns, particularly in the higher bandwidth cases.

This is because all the observed workloads exhibit a small but extremely frequently updated set of 'hot' pages. In practice these pages will include the stack and local variables being accessed within the currently executing processes as well as pages being used for network and disk traffic. The hottest pages will be dirtied at least as fast as we can transfer them, and hence must be transferred in the final stop-and-copy phase. This puts a lower bound on the best possible service downtime for a particular benchmark, network bandwidth and migration start time.

This interesting tradeoff suggests that it may be worthwhile increasing the amount of bandwidth used for page transfer in later (and shorter) pre-copy iterations. We will describe our rate-adaptive algorithm based on this observation in Section 5, and demonstrate its effectiveness in Section 6.

5 Implementation Issues

We designed and implemented our pre-copying migration engine to integrate with the Xen virtual machine monitor [1]. Xen securely divides the resources of the host machine amongst a set of resource-isolated virtual machines, each running a dedicated OS instance. In addition, there is one special management virtual machine used for the administration and control of the machine.

We considered two different methods for initiating and managing state transfer. These illustrate two extreme points in the design space: managed migration is performed largely outside the migratee, by a migration daemon running in the management VM; in contrast, self migration is implemented almost entirely within the migratee OS, with only a small stub required on the destination machine.

In the following sections we describe some of the implementation details of these two approaches. We describe how we use dynamic network rate-limiting to effectively
balance network contention against OS downtime. We then proceed to describe how we ameliorate the effects of rapid page dirtying, and describe some performance enhancements that become possible when the OS is aware of its migration – either through the use of self migration, or by adding explicit paravirtualization interfaces to the VMM.

5.1 Managed Migration

Managed migration is performed by migration daemons running in the management VMs of the source and destination hosts. These are responsible for creating a new VM on the destination machine, and coordinating transfer of live system state over the network.

When transferring the memory image of the still-running OS, the control software performs rounds of copying in which it performs a complete scan of the VM's memory pages. Although in the first round all pages are transferred to the destination machine, in subsequent rounds this copying is restricted to pages that were dirtied during the previous round, as indicated by a dirty bitmap that is copied from Xen at the start of each round.

During normal operation the page tables managed by each guest OS are the ones that are walked by the processor's MMU to fill the TLB. This is possible because guest OSes are exposed to real physical addresses, and so the page tables they create do not need to be mapped to physical addresses by Xen.

To log pages that are dirtied, Xen inserts shadow page tables underneath the running OS. The shadow tables are populated on demand by translating sections of the guest page tables. Translation is very simple for dirty logging: all page-table entries (PTEs) are initially read-only mappings in the shadow tables, regardless of what is permitted by the guest tables. If the guest tries to modify a page of memory, the resulting page fault is trapped by Xen. If write access is permitted by the relevant guest PTE then this permission is extended to the shadow PTE. At the same time, we set the appropriate bit in the VM's dirty bitmap.
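Xen performs this dirty logging in C on the hypervisor's page-fault path; the sketch below only restates the decision taken on each write fault, with the page-table entries and bitmap reduced to simple Python stand-ins.

```python
# Sketch of the shadow-page-table dirty-logging decision described above.
# `guest_pte` and `shadow_pte` are simplified stand-ins for real PTEs.

def handle_write_fault(guest_pte, shadow_pte, dirty_bitmap, pfn):
    """Called when the guest writes to a page whose shadow PTE is read-only."""
    if not guest_pte.writable:
        # The guest itself forbids the write: deliver a normal page fault.
        return "inject_fault_to_guest"
    # The write is legitimate; it was blocked only for dirty logging.
    shadow_pte.writable = True    # propagate the guest's write permission
    dirty_bitmap[pfn] = 1         # record the page as dirtied this round
    return "retry_instruction"
```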
When the bitmap is copied to the control software at the start of each pre-copying round, Xen's bitmap is cleared and the shadow page tables are destroyed and recreated as the migratee OS continues to run. This causes all write permissions to be lost: all pages that are subsequently updated are then added to the now-clear dirty bitmap.

When it is determined that the pre-copy phase is no longer beneficial, using heuristics derived from the analysis in Section 4, the OS is sent a control message requesting that it suspend itself in a state suitable for migration. This causes the OS to prepare for resumption on the destination machine; Xen informs the control software once the OS has done this. The dirty bitmap is scanned one last time for remaining inconsistent memory pages, and these are transferred to the destination together with the VM's checkpointed CPU-register state.

Once this final information is received at the destination, the VM state on the source machine can safely be discarded. Control software on the destination machine scans the memory map and rewrites the guest's page tables to reflect the addresses of the memory pages that it has been allocated. Execution is then resumed by starting the new VM at the point that the old VM checkpointed itself. The OS then restarts its virtual device drivers and updates its notion of wallclock time.

Since the transfer of pages is OS agnostic, we can easily support any guest operating system – all that is required is a small paravirtualized stub to handle resumption. Our implementation currently supports Linux 2.4, Linux 2.6 and NetBSD 2.0.

5.2 Self Migration

In contrast to the managed method described above, self migration [18] places the majority of the implementation within the OS being migrated. In this design no modifications are required either to Xen or to the management software running on the source machine, although a migration stub must run on the destination machine to listen for incoming migration requests, create an appropriate empty VM, and receive the migrated system state.

The pre-copying scheme that we implemented for self migration is conceptually very similar to that for managed migration. At the start of each pre-copying round every page mapping in every virtual address space is write-protected. The OS maintains a dirty bitmap tracking dirtied physical pages, setting the appropriate bits as write faults occur. To discriminate migration faults from other possible causes (for example, copy-on-write faults, or access-permission faults) we reserve a spare bit in each PTE to indicate that it is write-protected only for dirty-logging purposes.

The major implementation difficulty of this scheme is to transfer a consistent OS checkpoint. In contrast with a managed migration, where we simply suspend the migratee to obtain a consistent checkpoint, self migration is far harder because the OS must continue to run in order to transfer its final state. We solve this difficulty by logically checkpointing the OS on entry to a final two-stage stop-and-copy phase. The first stage disables all OS activity except for migration and then performs a final scan of the dirty bitmap, clearing the appropriate bit as each page is transferred. Any pages that are dirtied during the final scan, and that are still marked as dirty in the bitmap, are copied to a shadow buffer. The second and final stage then transfers the contents of the shadow buffer – page updates are ignored during this transfer.
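The two-stage final transfer can be summarised in a short sketch. The `send_page` primitive and the in-memory structures are placeholders; the real implementation runs inside the migrating guest kernel rather than as user-level Python.

```python
# Sketch of the two-stage stop-and-copy used by self migration.
# `send_page(pfn, data)` and `memory[pfn]` are placeholder primitives.

def final_two_stage_copy(dirty_bitmap, memory, send_page):
    # Logical checkpoint: from here on, all OS activity except migration
    # is disabled (not shown). Writes that still occur re-set bitmap bits
    # via the dirty-logging fault handler.
    shadow_buffer = {}

    # Stage 1: scan the dirty bitmap, sending pages and clearing bits.
    for pfn, dirty in enumerate(dirty_bitmap):
        if dirty:
            send_page(pfn, memory[pfn])
            dirty_bitmap[pfn] = 0

    # Pages dirtied during stage 1 (still marked dirty afterwards) are
    # copied aside so their contents can no longer change underneath us.
    for pfn, dirty in enumerate(dirty_bitmap):
        if dirty:
            shadow_buffer[pfn] = bytes(memory[pfn])

    # Stage 2: transfer the shadow buffer; further page updates are ignored.
    for pfn, data in shadow_buffer.items():
        send_page(pfn, data)
```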
5.3 Dynamic Rate-Limiting

The analysis in Section 4 showed that we must eventually pay in the form of an extended downtime, because the hottest pages in the writable working set are not amenable to pre-copy migration. The downtime can be reduced by increasing the bandwidth limit, albeit at the cost of additional network contention. Our solution to this impasse is to dynamically adapt the bandwidth limit during each pre-copy round.

[Figure: number of 4kB pages transferred in each pre-copy iteration, for iterations 0–17.]
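The surviving text does not include the algorithm's details, but the idea stated in Section 4.2 – spend little bandwidth on the early, long rounds and progressively more on the later, shorter ones – can be sketched as follows. The specific policy below (transferring slightly faster than the previously observed dirtying rate, within administrator-chosen bounds, and stopping when the remainder fits a downtime budget) is an illustrative reconstruction, not the exact published algorithm.

```python
PAGE_BITS = 4096 * 8  # assumes 4KB pages

def next_round_rate(prev_dirty_pages, prev_round_secs,
                    min_rate_mbps, max_rate_mbps, headroom_mbps=50):
    """Pick a bandwidth limit (Mbit/sec) for the next pre-copy round.

    Illustrative policy: send a little faster than the rate at which the
    VM dirtied memory during the previous round, clamped to [min, max].
    The headroom constant is an assumption for this sketch.
    """
    dirty_mbps = prev_dirty_pages * PAGE_BITS / (prev_round_secs * 1e6)
    return max(min_rate_mbps, min(max_rate_mbps, dirty_mbps + headroom_mbps))

def should_stop_precopy(remaining_bytes, rate_mbps, max_rate_mbps,
                        target_downtime_secs=0.3):
    """Stop pre-copying once the remainder could be sent within the
    downtime budget, or once the bandwidth ceiling has been reached."""
    projected_downtime = remaining_bytes * 8 / (rate_mbps * 1e6)
    return (projected_downtime <= target_downtime_secs
            or rate_mbps >= max_rate_mbps)
```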
One must be careful not to stun important interactive services.

Freeing Page Cache Pages. A typical operating system will have a number of 'free' pages at any time, ranging from truly free (page allocator) to cold buffer cache pages. When informed a migration is to begin, the OS can simply return some or all of these pages to Xen in the same way it would when using the ballooning mechanism described in [1]. This means that the time taken for the first "full pass" iteration of pre-copy migration can be reduced, sometimes drastically. However, should the contents of these pages be needed again, they will need to be faulted back in from disk, incurring greater overall cost.

The first pass transfers 776MB and lasts for 62 seconds, at which point the migration algorithm described in Section 5 increases its rate over several iterations and finally suspends the VM after a further 9.8 seconds. The final stop-and-copy phase then transfers the remaining pages, and the web server resumes at full rate after a 165ms outage.

This simple example demonstrates that a highly loaded server can be migrated with both controlled impact on live services and a short downtime. However, the working set of the server in this case is rather small, and so this should be expected to be a relatively easy case for live migration.

6.3 Complex Web Workload: SPECweb99
[Figure: Effect of migration on web server transmission rate (512KB files, 100 concurrent clients; throughput sampled over 100ms and 500ms intervals). Throughput drops from roughly 870 Mbit/sec to 765 Mbit/sec during the 62-second first pre-copy pass, and to 694 Mbit/sec during the further iterations lasting 9.8 seconds, with a 165ms total downtime.]

[Figure: Iterative progress of live migration for SPECweb99 with 350 clients (90% of max load) and an 800MB VM; total data transmitted 960MB (×1.20). The area of each bar shows the VM memory transferred and the memory dirtied during that iteration. The first iteration is a long, relatively low-rate transfer in which 676.8MB are transferred in 54.1 seconds; these early phases allow non-writable working set data to be transferred with a low impact on active services. Further rounds transfer 126.7MB and 39.0MB, and the remaining intermediate rounds between roughly 14MB and 28.4MB each, at increasing rates. In the final iteration the domain is suspended and the remaining 18.2MB of dirty pages are sent; in addition to the 201ms required to copy this last round of data, an additional 9ms elapse while the VM starts up, for a total downtime of 210ms.]
The migration begins with a long period of low-rate transmission as a first pass is made through the memory of the virtual machine. This first round takes 54.1 seconds and transmits 676.8MB of memory. Two more low-rate rounds follow, transmitting 126.7MB and 39.0MB respectively, before the transmission rate is increased.

The remainder of the graph illustrates how the adaptive algorithm tracks the page dirty rate over successively shorter iterations before finally suspending the VM. When suspension takes place, 18.2MB of memory remains to be sent. This transmission takes 201ms, after which an additional 9ms is required for the domain to resume normal execution.

The total downtime of 210ms experienced by the SPECweb clients is sufficiently brief to maintain the 350 conformant clients. This result is an excellent validation of our approach: a heavily (90% of maximum) loaded server is migrated to a separate physical host with a total migration time of seventy-one seconds. Furthermore, the migration does not interfere with the quality of service demanded by SPECweb's workload. This illustrates the applicability of migration as a tool for administrators of demanding live services.

6.4 Low-Latency Server: Quake 3

Another representative application for hosting environments is a multiplayer on-line game server.
[Figure 10: Effect on packet response time of migrating a running Quake 3 server VM. Packet flight times (up to about 0.12 seconds) are plotted against elapsed time for two successive migrations, with downtimes of 50ms and 48ms.]

[Figure 11: Iterative progress of live migration for a Quake 3 server with 6 clients and a 64MB VM; total data transmitted 88MB (×1.37). The area of each bar shows the VM memory transferred and the memory dirtied during each iteration; intermediate rounds transfer between roughly 0.1MB and 1.6MB each. The final iteration leaves only 148KB of data to transmit; in addition to the 20ms required to copy this last round, an additional 40ms are spent on start-up overhead, for a total downtime of 60ms.]
To determine the effectiveness of our approach in this case we configured a virtual machine with 64MB of memory running a Quake 3 server. Six players joined the game and started to play within a shared arena, at which point we initiated a migration to another machine. A detailed analysis of this migration is shown in Figure 11.

The trace illustrates a generally similar progression as for SPECweb, although in this case the amount of data to be transferred is significantly smaller. Once again the transfer rate increases as the trace progresses, although the final stop-and-copy phase transfers so little data (148KB) that the full bandwidth is not utilized.

Overall, we are able to perform the live migration with a total downtime of 60ms. To determine the effect of migration on the live players, we performed an additional experiment in which we migrated the running Quake 3 server twice and measured the inter-arrival time of packets received by clients. The results are shown in Figure 10. As can be seen, from the client point of view migration manifests itself as a transient increase in response time of 50ms. In neither case was this perceptible to the players.

6.5 A Diabolical Workload: MMuncher

As a final point in our evaluation, we consider the situation in which a virtual machine is writing to memory faster than it can be transferred across the network. We test this diabolical case by running a 512MB host with a simple C program that writes constantly to a 256MB region of memory. The results of this migration are shown in Figure 12.

In the first iteration of this workload, we see that half of the memory has been transmitted, while the other half is immediately marked dirty by our test program. Our algorithm attempts to adapt to this by scaling itself relative to the perceived initial rate of dirtying; this scaling proves insufficient.
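The test program itself is described above only as a constant writer over a 256MB region; an equivalent sketch (in Python here, for consistency with the other examples, rather than the C program actually used) simply touches every page of such a buffer as fast as possible:

```python
# Equivalent sketch of the "diabolical" writer: continuously dirty every
# 4KB page of a 256MB region. (The paper's actual test program is in C.)
buf = bytearray(256 * 1024 * 1024)
PAGE = 4096
counter = 0
while True:
    for offset in range(0, len(buf), PAGE):
        buf[offset] = counter & 0xFF   # one write per page marks it dirty
    counter += 1
```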
[Figure 12: Iterative progress of live migration for the diabolical workload – a 512MB VM with constant writes to a 256MB region; total data transmitted 638MB (×1.25).]

7 Future Work

Although our solution is well-suited for the environment we have targeted – a well-connected data-center or cluster with network-accessed storage – there are a number of areas in which we hope to carry out future work. This would allow us to extend live migration to wide-area networks, and to environments that cannot rely solely on network-attached storage.

7.2 Wide Area Network Redirection

7.3 Migrating Block Devices

Although NAS prevails in the modern data center, some environments may still make extensive use of local disks. These present a significant problem for migration as they are usually considerably larger than volatile memory. If the entire contents of a disk must be transferred to a new host before migration can complete, then total migration times may be intolerably extended.
8 Conclusion

By integrating live OS migration into the Xen virtual machine monitor we enable rapid movement of interactive workloads within clusters and data centers. Our dynamic network-bandwidth adaptation allows migration to proceed with minimal impact on running services, while reducing total downtime to below discernable thresholds.

Our comprehensive evaluation shows that realistic server workloads such as SPECweb99 can be migrated with just 210ms downtime, while a Quake 3 game server is migrated with an imperceptible 60ms outage.

References

[1] Paul Barham, Boris Dragovic, Keir Fraser, Steven Hand, Tim Harris, Alex Ho, Rolf Neugebauer, Ian Pratt, and Andrew Warfield. Xen and the art of virtualization. In Proceedings of the nineteenth ACM Symposium on Operating Systems Principles (SOSP19), pages 164–177. ACM Press, 2003.

[2] D. Milojicic, F. Douglis, Y. Paindaveine, R. Wheeler, and S. Zhou. Process migration. ACM Computing Surveys, 32(3):241–299, 2000.

[3] C. P. Sapuntzakis, R. Chandra, B. Pfaff, J. Chow, M. S. Lam, and M. Rosenblum. Optimizing the migration of virtual computers. In Proc. of the 5th Symposium on Operating Systems Design and Implementation (OSDI-02), December 2002.

[4] M. Kozuch and M. Satyanarayanan. Internet suspend/resume. In Proceedings of the IEEE Workshop on Mobile Computing Systems and Applications, 2002.

[5] Andrew Whitaker, Richard S. Cox, Marianne Shaw, and Steven D. Gribble. Constructing services with interposable virtual hardware. In Proceedings of the First Symposium on Networked Systems Design and Implementation (NSDI '04), 2004.

[6] S. Osman, D. Subhraveti, G. Su, and J. Nieh. The design and implementation of Zap: A system for migrating computing environments. In Proc. 5th USENIX Symposium on Operating Systems Design and Implementation (OSDI-02), pages 361–376, December 2002.

[7] Jacob G. Hansen and Asger K. Henriksen. Nomadic operating systems. Master's thesis, Dept. of Computer Science, University of Copenhagen, Denmark, 2002.

[8] Hermann Härtig, Michael Hohmuth, Jochen Liedtke, and Sebastian Schönberg. The performance of microkernel-based systems. In Proceedings of the sixteenth ACM Symposium on Operating System Principles, pages 66–77. ACM Press, 1997.

[9] VMware, Inc. VMware VirtualCenter Version 1.2 User's Manual. 2004.

[10] Michael L. Powell and Barton P. Miller. Process migration in DEMOS/MP. In Proceedings of the ninth ACM Symposium on Operating System Principles, pages 110–119. ACM Press, 1983.

[11] Marvin M. Theimer, Keith A. Lantz, and David R. Cheriton. Preemptable remote execution facilities for the V-system. In Proceedings of the tenth ACM Symposium on Operating System Principles, pages 2–12. ACM Press, 1985.

[12] Eric Jul, Henry Levy, Norman Hutchinson, and Andrew Black. Fine-grained mobility in the Emerald system. ACM Trans. Comput. Syst., 6(1):109–133, 1988.

[13] Fred Douglis and John K. Ousterhout. Transparent process migration: Design alternatives and the Sprite implementation. Software – Practice and Experience, 21(8):757–785, 1991.

[14] A. Barak and O. La'adan. The MOSIX multicomputer operating system for high performance cluster computing. Journal of Future Generation Computer Systems, 13(4-5):361–372, March 1998.

[15] J. K. Ousterhout, A. R. Cherenson, F. Douglis, M. N. Nelson, and B. B. Welch. The Sprite network operating system. IEEE Computer, 21(2), 1988.

[16] E. Zayas. Attacking the process migration bottleneck. In Proceedings of the eleventh ACM Symposium on Operating Systems Principles, pages 13–24. ACM Press, 1987.

[17] Peter J. Denning. Working sets past and present. IEEE Transactions on Software Engineering, SE-6(1):64–84, January 1980.

[18] Jacob G. Hansen and Eric Jul. Self-migration of operating systems. In Proceedings of the 11th ACM SIGOPS European Workshop (EW 2004), pages 126–130, 2004.

[19] C. E. Perkins and A. Myles. Mobile IP. Proceedings of International Telecommunications Symposium, pages 415–419, 1997.

[20] Alex C. Snoeren and Hari Balakrishnan. An end-to-end approach to host mobility. In Proceedings of the 6th Annual International Conference on Mobile Computing and Networking, pages 155–166. ACM Press, 2000.