the once negligible cost of I/O stack time becomes more relevant [8,12]. Another important trend in operating with SSDs is that big, sequential, batched I/O requests need no longer be favored over small, random I/O requests [17].

In the traditional block I/O architecture, the operating system's block I/O subsystem performs the task of scheduling I/O requests and forwarding them to block device drivers. This subsystem processes kernel I/O requests, which specify the starting disk sector, target memory address, and size of the I/O transfer, and which originate from a file system, the page cache, or a user application using direct I/O. The block I/O subsystem schedules kernel I/O requests by queueing them in a kernel I/O queue and placing the I/O-issuing thread in an I/O wait state. The queued requests are later forwarded to a low-level block device driver, which translates the requests into device I/O commands specific to the backing storage device.

Upon finishing an I/O command, a storage device is expected to raise a hardware interrupt to inform the device driver of the completion of a previously submitted command. The device driver's interrupt service routine then notifies the block I/O subsystem, which subsequently ends the kernel I/O request by releasing the target memory and unblocking the thread waiting on the completion of the request. A storage device may handle multiple device commands concurrently using its own device queue [2,5,6], and may combine multiple completion interrupts, a technique called interrupt coalescing, to reduce overhead.

As described, the traditional block I/O subsystem uses asynchrony within the I/O path to save CPU cycles for other tasks while the storage device handles I/O commands. Also, using I/O schedulers, the kernel can reorder or combine multiple outstanding kernel I/O requests to better utilize the underlying storage media.

This description of the traditional block storage path captures what we will refer to as the asynchronous I/O completion model. In this model, the kernel submits a device I/O command in a context distinct from the context of the process that originated the I/O. The hardware interrupt generated by the device upon command completion is also handled, at first, by a separate kernel context. The original process is later awakened to resume its execution.
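To make this hand-off concrete, the following simplified sketch (not actual driver code; the nvm_* helpers are hypothetical placeholders, while the completion primitives are Linux's) shows the shape of the interrupt-driven path: the submitting context sleeps, and the interrupt handler later completes the request and wakes it.

    /* Sketch of the asynchronous completion model.  The nvm_* helpers
     * are hypothetical placeholders for real driver routines. */
    #include <linux/completion.h>
    #include <linux/interrupt.h>

    struct nvm_request {
        struct completion done;       /* the submitter sleeps on this */
        int status;
        /* sector, buffer, length, ... */
    };

    static int nvm_submit_and_wait(struct nvm_request *req)
    {
        init_completion(&req->done);
        nvm_post_command(req);            /* hypothetical: queue the device command */
        wait_for_completion(&req->done);  /* block; the scheduler may run other
                                           * threads or let the CPU enter a C-state */
        return req->status;
    }

    static irqreturn_t nvm_irq_handler(int irq, void *data)
    {
        struct nvm_request *req = nvm_reap_completion(data);  /* hypothetical */

        req->status = 0;
        complete(&req->done);             /* wake the submitting thread */
        return IRQ_HANDLED;
    }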
A block I/O subsystem typically provides a set of in-kernel interfaces for a device driver to use. In Linux, a block device driver is expected to implement a 'request_fn' callback that the kernel calls while executing in an interrupt context [7,10]. Linux provides another callback point called 'make_request', which is intended to be used by pseudo block devices, such as a ramdisk. The latter callback differs from the former in that it is positioned at the highest point in Linux's block I/O subsystem and is called within the context of the process thread.

3 Synchronous I/O completion model

When we say a process completes an I/O synchronously, we mean that the kernel's entire path handling an I/O request stays within the process context that initiated the I/O. A necessary condition for this synchronous I/O completion is that the CPU poll the device for completion. This polling must be realized by a spin loop, busy-waiting the CPU while waiting for the completion.
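A minimal sketch of this synchronous path, the counterpart of the asynchronous sketch above, follows; the nvm_* helpers and the 'completed' flag (written by the device or by a small reaping routine) are again hypothetical.

    /* Sketch of the synchronous completion model: the same submission,
     * but the submitting thread busy-waits instead of sleeping, so the
     * whole wait stays in the process context that issued the I/O. */
    static int nvm_submit_and_poll(struct nvm_request *req)
    {
        WRITE_ONCE(req->completed, 0);
        nvm_post_command(req);              /* hypothetical: queue command, ring doorbell */
        while (!READ_ONCE(req->completed))  /* spin until the completion is visible */
            cpu_relax();                    /* pause hint inside the spin loop */
        return req->status;
    }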
Compared to the traditional asynchronous model, synchronous completion can reduce the CPU clock cycles needed for the kernel to process an I/O request. This reduction comes primarily from a shortened kernel path and from the removal of interrupt handling, but synchronous completion brings with it extra clock cycles spent in polling. In this section, we make the case for synchronous completion by quantifying these overheads. We then discuss problems with the asynchronous model and argue for the correctness of the synchronous model.

3.1 Prototype hardware and device driver

For our measurements, we used a DRAM-based prototype block storage device connected to the system with an early prototype of an NVM Express* [5] interface, to serve as a model of a fast future SSD based on next-generation NVM. The device was directly attached to a PCIe* Gen2 bus with eight lanes, with a device-based DMA engine handling data transfers. As described by the NVM Express specification, the device communicates with the device driver via segments of main memory, through which the device receives commands and places completions. The device can instantiate multiple device queues and can be configured to generate hardware interrupts upon command completion.
    I/O completion method                    512 B xfer    4 KiB xfer
    Interrupt, Gen2 bus, enters C-state      3.3 µs        4.6 µs
    Interrupt, Gen2 bus                      2.6 µs        4.1 µs
    Polling, Gen2 bus                        1.5 µs        2.9 µs
    Interrupt, 8 Gbps bus projection         2.0 µs        2.6 µs
    Polling, 8 Gbps bus projection           0.9 µs        1.5 µs

Table 1. Time to finish an I/O command, excluding software time, measured for our prototype device. The numbers measure random-read performance with a device queue depth of 1.

Table 1 shows performance statistics for the prototype device. The 'C-state' entry refers to the latency observed when the CPU enters a power-saving mode while the I/O is outstanding. The performance measured is limited by prototype throughput, not by anything fundamental; future SSDs may well feature higher throughputs. The improved performance projection assumes a higher-throughput SSD on a saturated PCIe Gen3 bus (8 Gbps).
[Figure: I/O completion latency broken down into hardware device and operating system time (µs).]

[Figure: Asynchronous vs. synchronous IOPS (thousands).]
3.4 CPU power-state complications

Power management used in conjunction with the asynchronous model for the short I/O-waits of our device may not only reduce the power savings but also increase I/O completion latency. A modern processor may enter a power-saving 'C-state' when not loaded or only lightly loaded. Transitions among C-states incur latency. In the asynchronous model, the CPU enters a power-saving C-state when the scheduler fails to find a thread to run after sending an I/O command. The synchronous model does not automatically allow this transition to a lower C-state, since the processor stays busy polling.

We have measured the latency impact of C-state transitions. When the processor enters a C-state, the asynchronous path takes an additional 2 µs in observed hardware latency, with higher variability (Figure 1, labeled 'async C-state'). This additional latency is incurred only when the system has no other thread to schedule on the CPU. The end result is that a thread performing I/Os runs slower when it is the only thread active on the CPU; we confirmed this empirically.

It is hard for a driver in the asynchronous model to fine-tune C-state transitions. In the asynchronous path, the C-state transition decision is made primarily by the operating system's CPU scheduler or by the processor hardware itself. A device driver using synchronous completion, on the other hand, can construct its spin-wait loop directly, using instructions with power-state hints, such as mwait [3], to better control C-state transitions.
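As an illustration (a sketch only, assuming an x86 kernel context where MONITOR/MWAIT may be executed; the nvm_* names and the 'completed' field are hypothetical, as before), the driver can arm MONITOR on the cache line holding the completion flag and then MWAIT with a C-state hint, waking when the device's completion write touches that line:

    /* Power-aware spin-wait sketch using MONITOR/MWAIT (x86, ring 0). */
    #include <asm/mwait.h>     /* __monitor(), __mwait() */

    static void nvm_poll_mwait(struct nvm_request *req)
    {
        while (!READ_ONCE(req->completed)) {
            __monitor(&req->completed, 0, 0);   /* arm address monitoring */
            if (READ_ONCE(req->completed))      /* re-check to avoid missing the write */
                break;
            __mwait(0, 0);                      /* wait; the EAX hint selects the C-state */
        }
    }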
3.5 Correctness of synchronous model

A block I/O subsystem is deemed correct when it preserves the ordering requirements for I/O requests made by its frontend clients. Ultimately, we want to address the following problem:

    A client performs I/O calls 'A' and 'B' in order, and its ordering requirement is that B should get to the device after A. Does the synchronous model respect this requirement?

For brevity, we assume the client to be a user application using Linux I/O system calls. We also assume that the file system and the page cache are bypassed. In fact, the file system and the page cache can themselves be considered frontend clients of the block I/O subsystem.

We start with two assumptions:

A1. The application uses blocking I/O system calls.

A2. The application is single-threaded.

Let us consider a single thread submitting A and B in order. The operating system may preempt and reschedule the thread on a different CPU, but this does not affect the ordering of the I/O requests, since there is only a single thread of execution. Therefore, it is guaranteed that B reaches the device after A.

Let us relax A1. The application's ordering requirement now has the thread submit A before B using a non-blocking interface or AIO [4]. With the synchronous model, the device has already completed the I/O for A at the moment the application makes another non-blocking system call for B. Therefore, the synchronous model guarantees that B reaches the device after A with a non-blocking I/O interface.

Relaxing A2, let us imagine two threads, T1 and T2, performing A and B respectively. In order to respect the application's ordering requirement, T2 must synchronize with T1 to avoid a race, in such a way that T2 waits for T1 before submitting B. The end result is that the kernel always sees B after it has safely completed the previously submitted A. Therefore, the synchronous model guarantees the ordering with multi-threaded applications.
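A minimal user-level sketch of this two-thread case follows (paths, offsets, and sizes are illustrative only; the file descriptor is assumed to be opened with O_DIRECT, and O_DIRECT buffers must be suitably aligned, e.g., with posix_memalign). Ordinary application-level synchronization is all that orders B after A, because A's write has already reached the device by the time it returns:

    /* Two threads, one ordering constraint: B must reach the device after A. */
    #include <fcntl.h>
    #include <pthread.h>
    #include <unistd.h>

    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t  a_done_cv = PTHREAD_COND_INITIALIZER;
    static int a_done;
    static int fd;                        /* opened elsewhere with O_DIRECT */

    static void *t1_submit_a(void *buf)
    {
        pwrite(fd, buf, 4096, 0);         /* A: returns only after the device finished it */
        pthread_mutex_lock(&lock);
        a_done = 1;
        pthread_cond_signal(&a_done_cv);
        pthread_mutex_unlock(&lock);
        return NULL;
    }

    static void *t2_submit_b(void *buf)
    {
        pthread_mutex_lock(&lock);
        while (!a_done)                   /* T2 waits for T1 before submitting B */
            pthread_cond_wait(&a_done_cv, &lock);
        pthread_mutex_unlock(&lock);
        pwrite(fd, buf, 4096, 4096);      /* B: necessarily reaches the device after A */
        return NULL;
    }

No I/O barrier appears anywhere in the sketch; under synchronous completion, the pthread synchronization alone is sufficient.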
The above exercise shows that an I/O barrier is unnecessary in the synchronous model to guarantee I/O ordering. This contrasts with the asynchronous model, where a program has to rely on an I/O barrier when it needs to force ordering. Hence, the synchronous model has the potential to further simplify storage I/O routines with respect to guaranteeing data durability and consistency.

Our synchronous device driver written for Linux has been tested with multi-threaded applications using non-blocking system calls. For instance, the driver has withstood many hours of TPC-C* benchmark runs. The driver has also been heavily utilized as a system swap space. We believe that the synchronous completion model is correct and fully compatible with existing applications.

4 Discussion

The asynchronous model may work better when processing I/O requests with large transfer sizes or when handling hardware stalls that cause long latencies. Hence, a favorable solution would be a hybrid of the synchronous and asynchronous models, with two kernel paths for a block device: the synchronous path is the fast path, used for small transfers and most of the time, whereas the asynchronous path is the slow fallback path for large transfers or hardware stalls.
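One way such a hybrid could be structured in a driver is sketched below; the 4 KiB threshold, the poll timeout, and all nvm_* helpers are hypothetical, and a real implementation would have to handle a completion racing with the re-arming of the interrupt.

    /* Sketch of a hybrid completion policy: poll for small transfers,
     * fall back to interrupts for large transfers or stalled commands. */
    #define NVM_SYNC_MAX_BYTES   4096      /* small transfers take the polling path */
    #define NVM_POLL_TIMEOUT_US  50        /* stall guard before falling back */

    static int nvm_submit(struct nvm_request *req)
    {
        if (req->len > NVM_SYNC_MAX_BYTES)
            return nvm_submit_and_wait(req);            /* asynchronous path */

        nvm_post_command(req);
        if (nvm_poll_for_completion(req, NVM_POLL_TIMEOUT_US) == 0)
            return req->status;                         /* common case: completed while polling */

        nvm_enable_irq(req);                            /* stall: sleep until the IRQ arrives */
        return nvm_wait_for_irq(req);
    }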
We believe that existing applications have primarily assumed the asynchronous completion model and traditionally slow storage devices. Although the synchronous completion model requires little change to existing software to run correctly, some changes to the operating system and to applications would allow for faster, more efficient system operation when storage is used synchronously. We did not attempt to rewrite applications, but we do suggest possible software changes.
Perhaps the most significant improvement that could be achieved for I/O-intensive applications is to avoid using a non-blocking user I/O interface, such as AIO calls, when addressing a storage device synchronously. In this case, using the non-blocking interface adds overhead and complexity to the application without benefit, because the operating system already completes the I/O upon the return from the non-blocking I/O submission call. Although applications that use the non-blocking interface are functionally safe and correct with synchronous completion, the use of the non-blocking interface negates the latency and scalability gains achievable in the kernel with the synchronous completion model.
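The simplification is easy to see in user code. With synchronous completion, a plain blocking read already returns with the data in hand; no AIO context, submission, or reaping is needed. The sketch below is illustrative only: the device node, offset, and size are made up, and error handling is omitted.

    #define _GNU_SOURCE                    /* for O_DIRECT */
    #include <fcntl.h>
    #include <stdlib.h>
    #include <unistd.h>

    int main(void)
    {
        void *buf;
        int fd = open("/dev/nvm0", O_RDONLY | O_DIRECT);  /* hypothetical device node */

        posix_memalign(&buf, 4096, 4096);  /* O_DIRECT requires an aligned buffer */
        pread(fd, buf, 4096, 0);           /* completes within this thread's context */

        free(buf);
        close(fd);
        return 0;
    }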
When the backing storage device is fast enough to complete an I/O synchronously, applications that have traditionally self-managed I/O buffers must reevaluate their buffering strategy. We observe that many I/O-intensive applications existing today, such as databases, the operating system's page cache, and disk-swap algorithms, employ elaborate I/O buffering and prefetching schemes. Such custom I/O schemes may add overhead with little value under the synchronous completion model. Although our work on the synchronous model greatly reduces I/O processing overhead in the kernel, application complexity may still become a bottleneck. For instance, I/O prefetching becomes far less effective and could even hurt performance. We found the performance of the page cache and the disk swapper to increase when we disabled page-cache read-ahead and swap-in clustering.
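As one example of the kind of application-level adjustment meant here (a hint to the kernel, not the kernel-side change we made), a program can mark a file for random access so that read-ahead is suppressed for it:

    /* Advise the kernel that access to this file is random, suppressing
     * read-ahead; a length of 0 covers the whole file.  Error handling
     * is omitted for brevity. */
    #include <fcntl.h>

    void disable_readahead(int fd)
    {
        posix_fadvise(fd, 0, 0, POSIX_FADV_RANDOM);
    }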
Informing applications of the presence of synchronous completion is therefore necessary. For example, an ioctl() extension to query the underlying completion model would help applications decide on the best I/O strategy.
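Such an extension might look like the following sketch. The request code BLKGETCOMPLMODEL and its values are hypothetical; no such ioctl exists today.

    /* Hypothetical ioctl for asking which completion model the
     * underlying block device uses. */
    #include <sys/ioctl.h>
    #include <linux/ioctl.h>

    #define BLKGETCOMPLMODEL  _IOR(0x12, 200, int)   /* hypothetical request code */
    #define COMPL_MODEL_ASYNC 0
    #define COMPL_MODEL_SYNC  1

    static int storage_is_synchronous(int fd)
    {
        int model = COMPL_MODEL_ASYNC;

        if (ioctl(fd, BLKGETCOMPLMODEL, &model) < 0)
            return 0;              /* older kernel: assume the asynchronous model */
        return model == COMPL_MODEL_SYNC;
    }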
Operating system processor usage statistics must also account separately for the time spent in the driver's spin-wait loop. Currently there is no accepted method of accounting for these 'spinning I/O wait' cycles. In our prototype implementation, the time spent in the polling loop is simply accounted toward system time. This may mislead people into believing that no I/O has been performed, or into suspecting kernel inefficiency because of the increased system time.

5 Related work

Following the success of NAND-based storage, research interest has surged in next-generation non-volatile memory (NVM) elements [11,14,16,19]. Although the base materials differ, these memory elements commonly promise faster and simpler media access than NAND.

Because of the DRAM-like random accessibility of many next-generation NVM technologies, there is abundant research on storage-class memory (SCM), where NVM is directly exposed as a physical address space. For instance, file systems have been proposed on SCM-based architectures [9,21]. In contrast, we approach next-generation NVM in a more evolutionary way, preserving the current hardware and software storage interface, in keeping with the huge body of existing applications.

Moneta [8] is a recent effort to evaluate the design and impact of next-generation NVM-based SSDs. The Moneta hardware is akin to our prototype device in spirit, because it is a block device connected via a PCIe bus, but implementation differences enabled our hardware to perform faster than Moneta. Moneta also examined spinning to cut the kernel cost, but its description is limited to the latency aspect. In contrast, this paper studies issues relevant to the viability of synchronous completion, such as IOPS scalability, interrupt thrashing, and power states.

Interrupt-driven asynchronous completion has long been the only I/O model used by the kernel to perform real storage I/Os. Storage interface standards have thus embraced hardware queueing techniques that further improve the performance of asynchronous I/O operations [2,5,6]. However, these are mostly effective for devices with slower storage media, such as hard disks or NAND flash.

It is a well-known strategy to choose a poll-based waiting primitive over an event-based one when the waiting time is short. A spinlock, for example, is preferred to a system mutex lock if the duration for which the lock is held is short. Another example is the optional use of polling [18,20] for network message passing among nodes when implementing the MPI* library [13] used in high-performance computing clusters. In such systems, communication latencies among nodes are just several microseconds, due to the use of a low-latency, high-bandwidth communication fabric along with a highly optimized network stack such as Remote Direct Memory Access (RDMA*).

6 Conclusion

This paper makes the case for the synchronous completion of storage I/Os. When performing storage I/O with ultra-low-latency devices employing next-generation non-volatile memories, polling for completion performs better than the traditional interrupt-driven asynchronous I/O path. Our conclusion has practical importance, pointing to the need for kernel researchers to consider optimizations to the traditional kernel block storage interface, with next-generation SSDs built of next-generation NVM elements in mind. It is our belief that non-dramatic changes can reap significant benefit.

Acknowledgements

We thank the members of the Storage Technology Group at Intel Corporation for supporting this work. We also thank our shepherd, David Patterson, and the anonymous reviewers for their detailed feedback and guidance. The views and conclusions in this paper are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of Intel Corporation.
References

[1] Jens Axboe. Flexible I/O tester (fio). http://git.kernel.dk/?p=fio.git;a=summary, 2010.

[2] Amber Huffman and Joni Clark. Serial ATA native command queueing. Technical white paper, http://www.seagate.com/content/pdf/whitepaper/D2c_tech_paper_intc-stx_sata_ncq.pdf, July 2003.

[3] Intel Corporation. Intel® 64 and IA-32 Architectures Software Developer's Manual, Volumes 1-3. Intel, 2008.

[4] M. Tim Jones. Boost application performance using asynchronous I/O. http://www.ibm.com/developerworks/linux/library/l-async/, 2006.

[5] NVMHCI Work Group. NVM Express. http://www.nvmexpress.org/, 2011.

[6] SCSI Tagged Command Queueing, SCSI Architecture Model – 3, 2007.

[7] Daniel P. Bovet and Marco Cesati. Understanding the Linux Kernel, 3rd Ed., O'Reilly, 2005.

[8] Adrian M. Caulfield, Arup De, Joel Coburn, Todor I. Mollov, Rajesh K. Gupta, and Steven Swanson. Moneta: A high-performance storage array architecture for next-generation, non-volatile memories. In Proceedings of the 43rd International Symposium on Microarchitecture (MICRO), Atlanta, GA, December 2010.

[9] Jeremy Condit, Edmund B. Nightingale, Christopher Frost, Engin Ipek, Benjamin Lee, Doug Burger, and Derrick Coetzee. Better I/O through byte-addressable, persistent memory. In Proceedings of the Symposium on Operating Systems Principles (SOSP), pages 133-146, Big Sky, MT, October 2009.

[10] Jonathan Corbet, Alessandro Rubini, and Greg Kroah-Hartman. Linux Device Drivers, 3rd Ed., O'Reilly, 2005.

[11] B. Dieny, R. Sousa, G. Prenat, and U. Ebels. Spin-dependent phenomena and their implementation in spintronic devices. In International Symposium on VLSI Technology, Systems and Applications (VLSI-TSA), 2008.

[12] Annie Foong, Bryan Veal, and Frank Hady. Towards SSD-ready enterprise platforms. In Proceedings of the 1st International Workshop on Accelerating Data Management Systems Using Modern Processor and Storage Architectures (ADMS), Singapore, September 2010.

[13] William Gropp, Ewing Lusk, Nathan Doss, and Anthony Skjellum. A high-performance, portable implementation of the MPI message passing interface standard. Parallel Computing, 22:789-828, September 1996.

[14] S. Parkin. Racetrack memory: A storage class memory based on current controlled magnetic domain wall motion. In Device Research Conference (DRC), pages 3-6, 2009.

[15] Moinuddin K. Qureshi, Vijayalakshmi Srinivasan, and Jude A. Rivers. Scalable high performance main memory system using Phase-Change Memory technology. In Proceedings of the 36th International Symposium on Computer Architecture (ISCA), Austin, TX, June 2009.

[16] S. Raoux, G. W. Burr, M. J. Breitwisch, C. T. Rettner, Y.-C. Chen, R. M. Shelby, M. Salinga, D. Krebs, S.-H. Chen, H.-L. Lung, and C. H. Lam. Phase-change random access memory: A scalable technology. IBM Journal of Research and Development, 52:465-480, 2008.

[17] Dongjun Shin. SSD. In Linux Storage and Filesystem Workshop, San Jose, CA, February 2008.

[18] David Sitsky and Kenichi Hayashi. An MPI library which uses polling, interrupts and remote copying for the Fujitsu AP1000+. In Proceedings of the 2nd International Symposium on Parallel Architectures, Algorithms, and Networks (ISPAN), Beijing, China, June 1996.

[19] D. B. Strukov, G. S. Snider, D. R. Stewart, and R. S. Williams. The missing memristor found. Nature, 453(7191):80-83, May 2008.

[20] Sayantan Sur, Hyun-Wook Jin, Lei Chai, and Dhabaleswar K. Panda. RDMA read based rendezvous protocol for MPI over InfiniBand: design alternatives and benefits. In Proceedings of the 11th Symposium on Principles and Practice of Parallel Programming (PPoPP), pages 32-39, New York, NY, March 2006.

[21] Xiaojian Wu and Narasimha Reddy. SCMFS: A file system for storage class memory. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC11), Seattle, WA, November 2011.
_____________________________
* Other names and brands may be claimed as the property of others.