The Nanokernel: David L. Mills University of Delaware and Poul-Henning Kamp FreeBSD Project
David L. Mills2
University of Delaware
and
Poul-Henning Kamp
FreeBSD Project
Abstract
Internet timekeeping has come a long way since first demonstrated almost two decades ago. In
that era most computer clocks were driven by the power grid and wandered several seconds per
day relative to UTC. As computers and the Internet became ever faster, hardware and software
synchronization technology became much more sophisticated. The Network Time Protocol
(NTP) evolved over four versions with ever better accuracy now limited only by the underlying
computer hardware clock and adjustment mechanism.
The clock frequency in modern workstations is stabilized by an uncompensated quartz or surface
acoustic wave (SAW) resonator, which is sensitive to temperature, power supply and component
variations. Using NTP and traditional Unix kernels, incidental timing errors with an uncompensated clock oscillator are on the order of a few hundred microseconds relative to a precision
source. Using new kernel software described in this paper, much better performance can be
achieved. Experiments described in this paper demonstrate that errors with a modern workstation and uncompensated clock oscillator are on the order of a microsecond relative to a GPS
receiver or other precision timing source.
1. Introduction
Several years ago the software algorithms to discipline the Unix system clock were overhauled to provide
improved accuracy, stability and resolution [5]. In addition, means were added to discipline the clock
directly from a precision timing source, such as a GPS receiver or cesium oscillator. The software was integrated with several operating system kernels of the day and eventually adopted as standard in Digital
Tru64 (Alpha), Sun Solaris, Linux and FreeBSD. The best performance achieved with workstations of the
day was a few hundred microseconds in time and a few parts-per-million (PPM) in frequency, so a clock
resolution of one microsecond seemed completely adequate.
With workstations and networks of today reaching speeds in the gigahertz range, it is clear the solution of
several years ago is rapidly becoming obsolete. Improved modelling techniques have resulted in better discipline algorithms which are more responsive to phase and frequency characteristics of computer clocks
[3]. Faster processors and a standardized application program interface (API) allow more flexible and precise timing of external signals [7]. Faster network speeds and lower jitter provide more accurate timekeeping over the Internet [4].
1. Sponsored by: DARPA Information Technology Office Contract F30602-98-1-0225 and Digital
Equipment Corporation Research Agreement 1417.
2. David L. Mills is with the Electrical and Computer Engineering Department, University of Delaware,
Newark, DE 19716, [email protected], http://www.eecis.udel.edu/~mills; Poul-Henning Kamp is with
the FreeBSD Project, Valbygårdsvej 8, DK-4200 Slagelse, Denmark. [email protected].
This paper describes new algorithms and kernel software providing much improved time and frequency
resolution, together with a more agile and precise clock discipline mechanism. It discusses the analysis and
design of the algorithms and the results of proof-of-performance experiments. The software has been
implemented and tested in all the kernels mentioned above and is now standard in the Linux and FreeBSD
public distributions.
The kernel software replaces the clock discipline algorithm in a synchronization daemon, such as the Network Time Protocol [6], with equivalent functionality in the kernel. It provides a resolution of 1 ns in time
and 0.001 PPM in frequency. While clock corrections are recomputed about once per minute in the daemon,
they are recomputed once per second and amortized at every tick interrupt in the kernel. This avoids errors
that accumulate between updates due to the intrinsic hardware clock frequency error.
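To illustrate the amortization described above, the following Python sketch spreads a per-second correction over the tick interrupts of that second and computes the error a free-running clock accumulates between updates. The function names, the 100-Hz tick rate and the integer-nanosecond representation are assumptions made for illustration; they are not taken from the kernel sources.

```python
def amortize(correction_ns, hz=100):
    """Split a per-second correction (integer ns) into hz per-tick increments.

    The increments sum exactly to correction_ns. Any remainder left after
    integer division is spread one nanosecond at a time over the first ticks.
    hz=100 is an assumed tick rate, not a value from the paper.
    """
    base, extra = divmod(correction_ns, hz)
    return [base + (1 if i < extra else 0) for i in range(hz)]


def free_run_error_ns(freq_error_ppm, interval_s):
    """Time error (ns) accumulated by an uncorrected frequency offset.

    A frequency error of 1 PPM accumulates 1 us = 1000 ns per second.
    """
    return freq_error_ppm * interval_s * 1000.0
```

Under these assumptions, an uncorrected frequency error of 100 PPM accumulates 6.4 ms over a 64-s update interval but only 100 microseconds over one second, which suggests why recomputing the correction every second is attractive.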
The new software can be compiled for 64-bit machines using native instructions or for 32-bit machines
using a macro package for double precision arithmetic. The software can be compiled for kernels where
the time variable is represented in seconds and nanoseconds and for kernels in which this variable is represented in seconds and microseconds. In either case the resolution of the clock is limited only by the resolution of the clock hardware. Even if the resolution is only to the microsecond, the software provides
extensive signal grooming and averaging to minimize reading errors.
The remaining sections of this paper are organized as follows. Section 2 describes the characteristics of
typical computer clock oscillators, which are based on the Allan deviation statistic used in the most recent
NTP algorithms. Section 3 describes the software design, which is based on two interacting hybrid phase-lock/frequency-lock (PLL/FLL) feedback loops. Section 4 describes the software implementation, which is
integrated in the kernels mentioned above. Section 5 summarizes the results of proof-of-performance
experiments which validate the claims in this paper. Section 6 concludes with suggestions for further
improvements.
same frequency and, in the case of multiprocessor systems, there may be more than one PCC, the kernel
must carefully mitigate the differences and develop a stable, monotonically increasing timescale.
It is well known that the behavior of an oscillator can be characterized in terms of its Allan deviation,
which is a function of stability, interpreted as first-order frequency differences, and averaging interval [1].
In order to determine this statistic for a typical uncompensated computer oscillator, sample offsets relative
to a cesium standard were measured with the computer oscillator allowed to free-run over periods ranging
from 1.5 to 10 days. These data were saved in files and later used to construct plots in log-log coordinates
showing stability versus averaging interval.
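As an illustration of how such a plot can be constructed, the following Python sketch computes the overlapping Allan deviation from a series of equally spaced time offsets. It uses the standard second-difference estimator; the function name and arguments are illustrative and are not taken from the programs used to produce the measurements in this paper.

```python
import math


def allan_deviation(offsets, tau0, n):
    """Overlapping Allan deviation at averaging interval tau = n * tau0.

    offsets: time offsets x_i in seconds, sampled every tau0 seconds.
    The second difference of phase, x[i+2n] - 2*x[i+n] + x[i], equals tau
    times the first difference of the average fractional frequency.
    """
    tau = n * tau0
    count = len(offsets) - 2 * n
    if count < 1:
        raise ValueError("need more than 2*n samples")
    total = 0.0
    for i in range(count):
        second_diff = offsets[i + 2 * n] - 2 * offsets[i + n] + offsets[i]
        total += second_diff ** 2
    return math.sqrt(total / (2.0 * tau * tau * count))
```

Evaluating the function for n = 1, 2, 4, 8, ... and plotting the results against tau in log-log coordinates gives a stability plot of the kind described here. A constant frequency offset (a linear phase ramp) contributes nothing to the statistic, since its second differences vanish.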
In [3] a simple model is developed which characterizes the performance of each individual time server.
The model characterizes each combination of synchronization source and clock oscillator by two intersecting straight lines in log-log coordinates. In general, network and computer latency variations produce jitter,
which is modelled as white phase noise and appears as a straight line with slope -1 on the plot. On the
other hand, oscillator frequency variations produce wander, which is modelled as random-walk frequency
noise and appears as a straight line with slope +0.5. The intersection of the two straight lines is called the
Allan intercept, which serves to characterize the particular combination of source and oscillator. It represents the optimum averaging interval for the best oscillator stability. If the averaging interval is less than
this, errors due to source jitter dominate, while if greater, errors due to oscillator wander dominate.
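The intersection can be worked out in closed form. The sketch below assumes the jitter line falls with slope -1, as is usual for white phase noise, and the wander line rises with slope +0.5, so that the two lines are a/tau and b*sqrt(tau). Setting them equal gives tau = (a/b)^(2/3). The function name and the parameterization by the values of each line at tau = 1 s are assumptions for illustration.

```python
def allan_intercept(jitter_at_1s, wander_at_1s):
    """Return (tau, sigma) where the jitter and wander lines cross.

    jitter line: sigma = jitter_at_1s * tau ** -1.0
    wander line: sigma = wander_at_1s * tau ** 0.5
    Equating the two gives tau ** 1.5 = jitter_at_1s / wander_at_1s.
    """
    tau = (jitter_at_1s / wander_at_1s) ** (2.0 / 3.0)
    sigma = jitter_at_1s / tau
    return tau, sigma
```

With hypothetical values a = 8 and b = 1 in matching units, the lines cross at tau = 4 s where both equal 2. A noisier source (larger a) moves the intercept to longer averaging intervals, while a less stable oscillator (larger b) moves it to shorter ones.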
The averaging interval is roughly equal to the frequency time constant used in the clock discipline algorithm, and this is related to the interval between NTP poll messages sent across the network. With a minimum poll interval of 16 s in the current NTP design, the averaging interval is about 4,000 s, which is on the
high side of the optimum range, and the match gets worse with larger poll intervals. Thus, the best accuracy is achieved at the minimum poll interval, but this may result in unacceptable network overhead.
Therefore, when the NTP daemon is started, it uses a relatively small poll interval in order to respond
quickly to the particular oscillator frequency offset, then gradually increases the interval to an upper limit.
Depending on desired accuracy and allowable network overhead, the upper limit could be as small as a few
seconds or as large as a day or more.
A phase-lock loop (PLL) functions best with poll intervals below the Allan intercept where jitter predominates, while a frequency-lock loop (FLL) functions best above the intercept where wander predominates.
As the result of previous research [2][3], a hybrid PLL/FLL clock discipline algorithm has been designed,
implemented and tested in the NTP version 4 software for Unix, Windows and VMS. A kernel implementation based on this design is described in the following section.
3. Software Design
The nanokernel software design is based on the NTP implementation, but includes two separate but interlocking feedback loops. The PLL/FLL discipline operates with periodic updates produced by a synchronization daemon such as NTP, while the PPS discipline operates with an external PPS signal and modified
serial or parallel port driver. Both algorithms include grooming provisions that significantly reduce the
impact of source selection jitter or clockhopping and network delay transients. In addition, the PPS algorithm can continue to discipline the clock frequency even if other synchronization sources or the daemon
itself crash.
second. The kernel discipline has an inherent resolution of 1 ns in time and 0.001 PPM in frequency and
amortizes adjustments at every tick interrupt.
Both the kernel discipline and NTP discipline operate
as a hybrid of phase-lock and frequency-lock feedback loops. Figure 1 shows the functional components of the kernel discipline. In the NTP discipline
the components below the dotted line are implemented in the daemon. The phase difference Vd
between the reference source r and clock c is determined by the NTP daemon. The value is then
groomed by the NTP clock filter and related algorithms to produce the phase update Vs used by the loop
filter in the kernel to produce the phase prediction x and frequency prediction y. These predictions are used
to produce clock adjustment updates at intervals of 1 s which result in the correction term Vc. This value
represents the increment in time necessary to correct the clock at the end of the next second. The various
performance data displayed later were derived from the phase update Vs, since this is a common measuring
point for both the daemon and kernel.
Figure 1. Clock Discipline Feedback Loop
The x and y predictions are developed from the phase
update Vs as shown in Figure 2. As in the NTP algorithm, the phase and frequency are disciplined separately in both PLL and FLL modes. In both modes x is
the value Vs, but the actual phase adjustment is calculated by the clock adjust process using an exponential
average with an adjustable weight factor. The weight
factor is calculated as the reciprocal of the time constant specified by the API. The value can range from
1 s to an upper limit determined by the Allan intercept. In PLL mode it is important for the best stability
that the update interval does not significantly exceed the time constant for an extended period.
Figure 2. FLL/PLL Prediction Functions
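The phase adjustment can be sketched as follows, assuming the exponential average is applied once per second with weight equal to the reciprocal of the time constant. The function name and the floating-point representation are illustrative assumptions; the kernel itself works in fixed-point arithmetic.

```python
def phase_adjustments(phase_ns, time_constant_s, seconds):
    """Return the per-second phase corrections and the remaining residual.

    Each second a fraction 1/time_constant_s of the residual phase error
    is applied to the clock, so the residual decays exponentially.
    """
    weight = 1.0 / time_constant_s
    residual = float(phase_ns)
    steps = []
    for _ in range(seconds):
        adj = residual * weight   # correction applied during this second
        residual -= adj           # phase error still to be amortized
        steps.append(adj)
    return steps, residual
```

For a phase update of 1000 ns and a time constant of 4 s, the first two corrections under this model are 250 ns and 187.5 ns, leaving 562.5 ns to be amortized. With a time constant of 1 s the entire update is applied in the first second.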
In PLL mode, y is computed using an integration process as required by PLL engineering principles; however, the integration gain is reduced by the square of the time constant, so adjustments become essentially
ineffective with poll intervals above 1024 s. In FLL mode, y is computed directly using an exponential
average with weight 0.25. This value, which was determined from simulation with real and synthetic data,
is a compromise between rapid frequency adaptation and adequate glitch suppression. In operation, PLL
mode is preferred at small update intervals and time constants and FLL mode at large intervals and time
constants. The optimum crossover point between the PLL and FLL modes, as determined by simulation
and analysis, is the Allan intercept. As a compromise, the PLL/FLL algorithm operates in PLL mode for
update intervals of 256 s and smaller and in FLL mode for intervals of 1024 s and larger. Between 256 s
and 1024 s the mode is specified by the API.
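The mode selection rule and the FLL frequency prediction can be sketched as below. The thresholds and the weight of 0.25 come from the text; the function names, the units and the interpretation of the exponential average as "current estimate plus one quarter of the residual frequency error implied by the phase update" are assumptions made for illustration.

```python
PLL, FLL = "PLL", "FLL"


def select_mode(update_interval_s, api_mode=PLL):
    """Choose the discipline mode from the update interval in seconds."""
    if update_interval_s <= 256:
        return PLL
    if update_interval_s >= 1024:
        return FLL
    return api_mode  # between 256 s and 1024 s the API decides


def fll_frequency(freq_ppm, phase_us, update_interval_s, weight=0.25):
    """Exponentially averaged frequency prediction in FLL mode.

    A phase error of phase_us microseconds accumulated over
    update_interval_s seconds implies a residual frequency error of
    phase_us / update_interval_s PPM relative to the current estimate.
    """
    residual_ppm = phase_us / update_interval_s
    return freq_ppm + weight * residual_ppm
```

As a hypothetical example, a current estimate of 10 PPM and a phase update of 2048 microseconds after 1024 s imply a residual error of 2 PPM, so the new prediction is 10.5 PPM. The small weight means a single outlying update moves the frequency by only a quarter of its apparent error.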
5. Performance Evaluation
Following previous practice [3], the ktime.c and micro.c routines have been embedded in a special
purpose, discrete event simulator. In this context it is possible to verify correct operation over a wide range
of operating conditions likely to be found in current and future computer systems and networks and which
cannot be easily duplicated with in-situ implementations. It operates with internally synthesized data or
raw data files produced by the NTP daemon during regular operation. For this purpose raw time offsets are
recorded with NTP operating in an open-loop configuration and later played back to the simulator. Synthetic data having similar statistics are generated as described in [3]. The simulator can measure the
response to time and frequency transients, monitor for unexpected interactions between the simulated
clock oscillator, PCC and PPS signals, and verify correct monotonic behavior as the various counters interact due to small frequency variations.
In order to calibrate the performance of the routines in a functioning system, they were implemented in the
kernels for several architectures, including Alpha, Intel and SPARC. Detailed performance data have been
collected for three systems: Rackety is a busy SPARC IPC time server running SunOS 4.1.3 and connected
to four radio clocks - dual redundant GPS receivers and dual redundant WWVB receivers. The PPS signal
is derived from one of the GPS receivers. Churchy is a Digital Alpha 433au personal workstation running
Tru64 4.0d and connected to a GPS receiver with PPS signal. Hepzibah is an Intel Pentium II 233 laboratory machine running FreeBSD 3.4 and connected to a GPS receiver with PPS signal.
Figure 5 shows the typical behavior of hepzibah. In
this particular configuration the PPS signal was connected via a parallel port and a special kernel driver.
The characteristic is decidedly spiky, in spite of the
signal grooming algorithms used in the PPS discipline. The jitter budget includes contributions from
the source (less than 100 ns), clock resolution (about
4 ns) and the hardware and software interrupt latencies. The interesting thing about this figure is that the
jitter spikes are as often positive as negative. If due
only to interrupt latencies, the spikes would be negative. There is no obvious explanation for this behavior
other than to remark that the standard (RMS) error is less
than a microsecond.
Figure 5. Time Offset for Hepzibah
While hepzibah has no applications or services other
than NTP, rackety is a much slower machine dedicated to NTP service. It services an arrival stream of
some 15 packets per second from an estimated client
population well over 1000. The radio clocks are connected to an 8-input multiplexor which services
other ancillary devices as well. The hardware interrupt load produced by the multiplexor and network
interface is severe, especially since the SPARC IPC is
only a 25-MHz machine. The large negative time offset spikes shown in Figure 6 are clearly the result of
interrupt latencies for the four radio clocks, the PPS
signal and the network interface.
Figure 6. Time Offset for Rackety
Figure 7 shows the typical behavior for churchy, the
fastest machine of the bunch. The PCC for this
machine is derived from a SAW oscillator. Ordinarily,
one would expect low phase noise from this type of
oscillator, but the characteristics shown in the figure
argue otherwise. To the trained eye, the characteristic
is dominated by flicker noise. The source of this
unexpected behavior is yet to be determined.
hybrid PLL/FLL discipline loop is used for NTP control together with separate time and frequency loops
for PPS discipline. The level of performance is probably near the best that can be achieved with an unstabilized clock oscillator. Where a fast computer with precision hardware clock is available, the performance
can be improved to the order of a few tens of nanoseconds at the API. This was verified using a machine
where the system clock was derived from a Rubidium oscillator and FPGA counter; however, this setup
would not ordinarily be considered practical. The practical accuracy expectations of individual applications will vary depending on the mix of applications and operating system scheduling latencies.
Observations of the kernel disciplines in actual operation suggest a few areas where further improvements
may be possible. One of these is the grooming algorithm used in the PPS discipline. The complexity of the
median calculation increases rapidly with the number of register stages, which is only three in the current
design. However, the NTP discipline operates in user space, so its resource commitments are more flexible. The NTP daemon includes a PPS driver with a 60-stage register. The algorithm sorts the offsets, then
iteratively trims off the sample furthest from the median until a prespecified fraction of the original samples are left. Finally, it presents the average of these samples to the kernel PLL/FLL discipline.
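The trimming procedure just described can be sketched in Python as follows. The function name and the default fraction of samples retained are assumptions for illustration; the text does not state the fraction used by the NTP daemon.

```python
import statistics


def trimmed_mean(offsets, keep_fraction=0.6):
    """Average the samples left after trimming outliers about the median.

    keep_fraction is a hypothetical default, not a value from the paper.
    """
    samples = sorted(offsets)
    keep = max(1, int(len(samples) * keep_fraction))
    while len(samples) > keep:
        med = statistics.median(samples)
        # In a sorted list the sample furthest from the median is at an end.
        if med - samples[0] > samples[-1] - med:
            samples.pop(0)
        else:
            samples.pop()
    return sum(samples) / len(samples)
```

For instance, with offsets of -50, 1, 2 and 3 and half the samples retained, the sketch discards -50 first and then 3, and returns 1.5. Recomputing the median after each removal keeps a cluster of outliers on one side from dragging the result toward them.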
The PPS driver provides significantly less jitter than the kernel PPS discipline; however, the performance
advantage due to the quick response of the kernel discipline is lost. While the minimum daemon
update interval is currently limited to 16 s in the interest of minimizing kernel overhead, it might be acceptable in fast machines to reduce that interval to 1 s. Should this be done, it would be practical to do almost
all discipline loop processing in user space and move the per-second processing to the daemon, where
more flexible processor and memory resource commitments are possible.
7. References
Note: Papers and reports by D.L. Mills can be found in PostScript and PDF format at www.eecis.udel.edu/~mills.
1. Allan, D.W. Time and frequency (time-domain) estimation and prediction of precision clocks and
oscillators. IEEE Trans. on Ultrasonics, Ferroelectrics, and Frequency Control UFFC-34, 6 (November 1987), 647-654. Also in: Sullivan, D.B., D.W. Allan, D.A. Howe and F.L. Walls (Eds.). Characterization of Clocks and Oscillators. NIST Technical Note 1337, U.S. Department of Commerce, 1990,
121-128.
2. Levine, J. An algorithm to synchronize the time of a computer to universal time. IEEE/ACM Trans. Networking 3, 1 (February 1995), 42-50.
3. Mills, D.L. Adaptive hybrid clock discipline algorithm for the Network Time Protocol. IEEE/ACM
Trans. Networking 6, 5 (October 1998), 505-514.
4. Mills, D.L. The network computer as precision timekeeper. Proc. Precision Time and Time Interval
(PTTI) Applications and Planning Meeting (Reston VA, December 1996), 96-108.
5. Mills, D.L. Unix kernel modifications for precision time synchronization. Electrical Engineering
Report 94-10-1, University of Delaware, October 1994, 24 pp.
6. Mills, D.L. Network Time Protocol (Version 3) specification, implementation and analysis. Network
Working Group Report RFC-1305, University of Delaware, March 1992, 113 pp.
7. Mogul, J., D. Mills, J. Brittenson, J. Stone and U. Windl. Pulse-per-second API for Unix-like operating
systems, version 1. Request for Comments RFC-2783, Internet Engineering Task Force, March 2000,
31 pp.
8. Network Time Protocol Version 4 software distribution, including sources and documentation. Available via the web at www.ntp.org.