
When Poll is Better than Interrupt

Jisoo Yang Dave B. Minturn Frank Hady


{jisoo.yang | dave.b.minturn | frank.hady} (at) intel.com
Intel Corporation

Abstract

In a traditional block I/O path, the operating system completes virtually all I/Os asynchronously via interrupts. However, when performing storage I/O with ultra-low latency devices using next-generation non-volatile memory, it can be shown that polling for the completion – hence wasting clock cycles during the I/O – delivers higher performance than traditional interrupt-driven I/O. This paper thus argues for the synchronous completion of block I/O, first by presenting strong empirical evidence showing a stack latency advantage, second by delineating limits of the current interrupt-driven path, and third by proving that synchronous completion is indeed safe and correct. This paper further discusses challenges and opportunities introduced by the synchronous I/O completion model for both operating system kernels and user applications.

1 Introduction

When an operating system kernel processes a block storage I/O request, the kernel usually submits and completes the I/O request asynchronously, releasing the CPU to perform other tasks while the hardware device completes the storage operation. In addition to the CPU cycles saved, the asynchrony provides opportunities to reorder and merge multiple I/O requests to better match the characteristics of the backing device and achieve higher performance. Indeed, this asynchronous I/O strategy has worked well for traditional rotating devices and even for NAND-based solid-state drives (SSDs).

Future SSD devices may well utilize high-performance next-generation non-volatile memory (NVM), calling for a re-examination of the traditional asynchronous completion model. The high performance of such devices both diminishes the CPU cycles saved by asynchrony and reduces the I/O scheduling advantage.

This paper thus argues for the synchronous I/O completion model, by which the kernel path handling an I/O request stays within the process context that initiated the I/O. Synchronous completion allows I/O requests to bypass the kernel's heavyweight asynchronous block I/O subsystem, reducing the CPU clock cycles needed to process I/Os. However, a necessary condition is that the CPU has to spin-wait for the completion from the device, increasing the cycles used.

Using a prototype DRAM-based storage device to mimic the potential performance of a very fast next-generation SSD, we verified that the synchronous model completes an individual I/O faster and consumes fewer CPU clock cycles despite having to poll. The device is fast enough that the spinning time is smaller than the overhead of the asynchronous I/O completion model.

Interrupt-driven asynchronous completion introduces additional performance issues when used with very fast SSDs such as our prototype. Asynchronous completion may suffer from lower I/O rates even when scaled to many outstanding I/Os across many threads. We empirically confirmed this with Linux,* and examine the system overheads of interrupt handling, cache pollution, and CPU power-state transitions associated with the asynchronous model.

We also demonstrate that the synchronous completion model is correct and simple with respect to maintaining I/O ordering when used with application interfaces such as non-blocking I/O and multithreading.

We suggest that current applications may further benefit from the synchronous model by avoiding the non-blocking storage I/O interface and by reassessing buffering strategies such as I/O prefetching. We conclude that with future SSDs built of next-generation NVM elements, introducing the synchronous completion model could reap significant performance benefits.

2 Background

The commercial success of SSDs coupled with reported advancements in NVM technology is significantly reducing the performance gap between mass storage and memory [15]. Experimental storage devices that complete an I/O within a few microseconds have been demonstrated [8].

One of the implications of this trend is that the once negligible cost of I/O stack time becomes more relevant [8,12]. Another important trend in operating with SSDs is that big, sequential, batched I/O requests need no longer be favored over small, random I/O requests [17].

In the traditional block I/O architecture, the operating system's block I/O subsystem performs the task of scheduling I/O requests and forwarding them to block device drivers. This subsystem processes kernel I/O requests specifying the starting disk sector, target memory address, and size of the I/O transfer, and originating from a file system, page cache, or user application using direct I/O. The block I/O subsystem schedules kernel I/O requests by queueing them in a kernel I/O queue and placing the I/O-issuing thread in an I/O wait state. The queued requests are later forwarded to a low-level block device driver, which translates the requests into device I/O commands specific to the backing storage device.

Upon finishing an I/O command, a storage device is expected to raise a hardware interrupt to inform the device driver of the completion of a previously submitted command. The device driver's interrupt service routine then notifies the block I/O subsystem, which subsequently ends the kernel I/O request by releasing the target memory and un-blocking the thread waiting on the completion of the request. A storage device may handle multiple device commands concurrently using its own device queue [2,5,6], and may combine multiple completion interrupts, a technique called interrupt coalescing, to reduce overhead.

As described, the traditional block I/O subsystem uses asynchrony within the I/O path to save CPU cycles for other tasks while the storage device handles I/O commands. Also, using I/O schedulers, the kernel can reorder or combine multiple outstanding kernel I/O requests to better utilize the underlying storage media.

This description of the traditional block storage path captures what we will refer to as the asynchronous I/O completion model. In this model, the kernel submits a device I/O command in a context distinct from the context of the process that originated the I/O. The hardware interrupt generated by the device upon command completion is also handled, at first, by a separate kernel context. The original process is later awakened to resume its execution.

A block I/O subsystem typically provides a set of in-kernel interfaces for device driver use. In Linux, a block device driver is expected to implement a 'request_fn' callback that the kernel calls while executing in an interrupt context [7,10]. Linux provides another callback point called 'make_request', which is intended to be used by pseudo block devices, such as a ramdisk. The latter callback differs from the former in that it is positioned at the highest point in Linux's block I/O subsystem and is called within the context of the process thread.
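For concreteness, the following is a minimal sketch of how a driver of the 2.6.33 era might hook the 'make_request' path to complete I/Os in the issuing thread's context; the device-specific helpers (my_dev_submit, my_dev_poll_completion) are hypothetical placeholders, not part of the Linux API.

    #include <linux/blkdev.h>
    #include <linux/bio.h>

    /* Hypothetical device-specific helpers: submit a command built
     * from a bio, then spin until the device posts its completion. */
    extern void my_dev_submit(struct bio *bio);
    extern void my_dev_poll_completion(struct bio *bio);

    /* Runs in the context of the thread issuing the I/O, at the top
     * of the block layer, bypassing the kernel I/O queue entirely. */
    static int my_make_request(struct request_queue *q, struct bio *bio)
    {
            my_dev_submit(bio);            /* post command to device     */
            my_dev_poll_completion(bio);   /* spin-wait; no interrupt    */
            bio_endio(bio, 0);             /* complete I/O synchronously */
            return 0;
    }

    /* At device init, install the callback on the device's queue:
     *     blk_queue_make_request(queue, my_make_request);            */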
3 Synchronous I/O completion model

When we say a process completes an I/O synchronously, we mean the kernel's entire path handling an I/O request stays within the process context that initiated the I/O. A necessary condition for this synchronous I/O completion is that the CPU poll the device for completion. This polling must be realized by a spin loop, busy-waiting the CPU while waiting for the completion.

Compared to the traditional asynchronous model, synchronous completion can reduce the CPU clock cycles needed for the kernel to process an I/O request. This reduction comes primarily from a shortened kernel path and from the removal of interrupt handling, but synchronous completion brings with it extra clock cycles spent in polling. In this section, we make the case for synchronous completion by quantifying these overheads. We then discuss problems with the asynchronous model and argue the correctness of the synchronous model.
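As a concrete illustration, the spin loop at the heart of this model can be as simple as the following sketch, which busy-waits on a completion status word the device writes into host memory; the completion-record layout is an assumption for illustration, not the NVM Express format.

    #include <stdint.h>

    /* Hypothetical completion record the device DMAs into host memory. */
    struct completion_rec {
            volatile uint32_t status;       /* 0 while command in flight */
    };

    #define CMD_DONE 1u

    /* Busy-wait until the device posts the completion. The volatile
     * read forces the CPU to re-load the status word each iteration. */
    static inline void poll_for_completion(struct completion_rec *c)
    {
            while (c->status != CMD_DONE)
                    ;   /* spin: burns cycles, but avoids the interrupt
                           and the context switches of the async path  */
    }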
3.1 Prototype hardware and device driver

For our measurements, we used a DRAM-based prototype block storage device connected to the system with an early prototype of an NVM Express* [5] interface, serving as a model of a fast future SSD based on next-generation NVM. The device was directly attached to a PCIe* Gen2 bus with eight lanes, with a device-based DMA engine handling data transfers. As described by the NVM Express specification, the device communicates with the device driver via segments of main memory, through which the device receives commands and places completions. The device can instantiate multiple device queues and can be configured to generate hardware interrupts upon command completion.

    I/O completion method                  512B xfer   4KiB xfer
    Interrupt, Gen2 bus, enters C-state    3.3 µs      4.6 µs
    Interrupt, Gen2 bus                    2.6 µs      4.1 µs
    Polling, Gen2 bus                      1.5 µs      2.9 µs
    Interrupt, 8Gbps bus projection        2.0 µs      2.6 µs
    Polling, 8Gbps bus projection          0.9 µs      1.5 µs

Table 1. Time to finish an I/O command, excluding software time, measured for our prototype device. The numbers measure random-read performance with a device queue depth of 1.

Table 1 shows performance statistics for the prototype device. The 'C-state' row refers to the latency when the CPU enters a power-saving mode while the I/O is outstanding. The performance measured is limited by prototype throughput, not by anything fundamental; future SSDs may well feature higher throughputs.

The improved performance projection assumes a higher-throughput SSD on a saturated PCIe Gen3 bus (8Gbps).

We wrote a Linux device driver for the prototype hardware supporting both the asynchronous and synchronous completion models. For the asynchronous model, the driver implements Linux's 'request_fn' callback, thus taking the traditional path of using the stock kernel I/O queue. In this model, the driver uses a hardware interrupt. The driver executes within the interrupt context for both the I/O request submission and the completion. For the synchronous model, the driver implements Linux's 'make_request' callback, bypassing most of Linux's block I/O infrastructure. In this model, the driver polls for completion from the device and hence executes within the context of the thread that issued the I/O.

For this study, we assume that hardware never triggers internal events that incur substantially longer latency than average. We expect that such events are rare and can be easily dealt with by having the operating system fall back to the traditional asynchronous model.

3.2 Experimental setup and methodology

We used 64-bit Fedora* 13 running a 2.6.33 kernel on an x86 dual-socket server with 12GiB of main memory. Each processor socket was populated with a quad-core 2.93GHz Intel® Xeon® with 8MiB of shared L3 cache and 256KiB of per-core L2 cache. Intel® Hyper-Threading Technology was enabled, totaling 16 architectural CPUs available to software. CPU frequency scaling was disabled.

For measurements we used a combination of the CPU timestamp counter and reports from user-level programs. Upon events of interest in the kernel, the device driver executed the 'rdtsc' instruction to read the CPU timestamp counter, whose values were later processed offline to produce kernel path latencies. For application IOPS (I/O Operations Per Second) and I/O system call completion latency, we used the numbers reported by the 'fio' [1] I/O micro-benchmark running in user mode.
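As a rough sketch of this timestamping technique (the cycles-to-microseconds conversion shown is our own illustrative assumption, valid here only because frequency scaling is disabled):

    #include <stdint.h>

    /* Read the CPU timestamp counter; rdtsc returns the cycle count
     * in EDX:EAX. Kernel code of this era used the equivalent macro. */
    static inline uint64_t read_tsc(void)
    {
            uint32_t lo, hi;
            __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
            return ((uint64_t)hi << 32) | lo;
    }

    /* Usage: bracket an event of interest and convert to microseconds.
     *     uint64_t t0 = read_tsc();
     *     ... kernel path of interest ...
     *     uint64_t cycles = read_tsc() - t0;
     *     double usec = cycles / 2930.0;   // 2.93GHz => 2930 cycles/µs
     */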
We bypassed the file system and the buffer cache to isolate the cost of the block I/O subsystem. Note that our objective is to measure the difference between the two completion models when exercising the back-end block I/O subsystem, whose performance is not changed by the use of the file system or the buffer cache and would thus be additive to either completion model. The kernel was compiled with -O3 optimization and kernel preemption was enabled. The I/O scheduler was disabled for the asynchronous path by selecting the 'noop' scheduler in order to make the asynchronous path as fast as possible.

3.3 Storage stack latency comparison

Our measurement answers the following questions:

 • How fast does each completion path complete application I/O requests?
 • How much CPU time is spent by the kernel in each completion model?
 • How much CPU time is available to another user process scheduled in during an asynchronous I/O?

[Figure 1: bar chart omitted.] Figure 1. Storage stack block I/O subsystem cost comparison. Each bar measures application-observed I/O completion latency, which is broken into device hardware latency and non-overlapping operating system latency. Error bars represent +/- one standard deviation.

Figure 1 shows that the synchronous model completes an I/O faster than the asynchronous path in terms of absolute latency. The figure shows actual measured latency for the user application performing 4KiB and 512B random reads. For our fast prototype storage device, the CPU spin-wait cost in the synchronous path is lower than the code-path reduction achieved by the synchronous path, completing a 4KiB I/O synchronously in 4.4µs versus 7.6µs for the asynchronous case. The figure breaks the latency into hardware time and non-overlapping kernel time. The hardware time for the asynchronous path is slightly greater than that of the synchronous path due to interrupt delivery latency.

Figure 2 details the latency component breakdown of the asynchronous kernel path. In the figure, Tu indicates the CPU time actually available to another user process during the time slot vacated during asynchronous-path I/O completion. To measure this time as accurately as possible, we implemented a separate user-level program scheduled to run on the same CPU as the I/O benchmark. This program continuously checked CPU timestamps to detect its scheduled period at a sub-microsecond granularity. Using this program, we measured Tu to be 2.7µs for a 4KiB transfer that the device takes 4.1µs to finish.
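A minimal user-level sketch of such a probe follows; the gap threshold and the cycles-per-microsecond constant are illustrative assumptions, and the probe is pinned to the benchmark's CPU externally.

    #include <stdint.h>
    #include <stdio.h>

    #define CYCLES_PER_US 2930                 /* 2.93GHz, scaling off */
    #define GAP_THRESHOLD (CYCLES_PER_US / 2)  /* 0.5µs: descheduled   */

    static inline uint64_t read_tsc(void)
    {
            uint32_t lo, hi;
            __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
            return ((uint64_t)hi << 32) | lo;
    }

    /* Spin reading the TSC; a jump between consecutive readings larger
     * than the threshold marks the end of one scheduled period, so the
     * run lengths between jumps measure the CPU time we received.     */
    int main(void)
    {
            uint64_t start = read_tsc(), prev = start;
            for (;;) {
                    uint64_t now = read_tsc();
                    if (now - prev > GAP_THRESHOLD) {
                            printf("ran %.2fus, out %.2fus\n",
                                   (prev - start) / (double)CYCLES_PER_US,
                                   (now - prev) / (double)CYCLES_PER_US);
                            start = now;
                    }
                    prev = now;
            }
    }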
The conclusion of the stack latency measurements is a strong one: the synchronous path completes I/Os faster and uses the CPU more efficiently. This is true despite spin-waiting for the duration of the I/O, because the work the CPU performs in the asynchronous path (i.e., Ta + Tb = 6.3µs) is greater than the spin-waiting time of the synchronous path (4.38µs) with this fast prototype SSD. For smaller-sized transfers, synchronous completion by polling wins over asynchronous completion by an even greater margin.

[Figure 2: diagram omitted.] Figure 2. Latency component breakdown of the asynchronous kernel path. Ta (= Ta' + Ta") indicates the cost of the kernel path that does not overlap with Td, which is the interval during which the device is active. Scheduling a user process P2 during the I/O interval incurs kernel scheduling cost, which is Tb. The CPU time available for P2 to make progress is Tu. For a 4KiB transfer, Ta, Td, Tb, and Tu measure 4.9, 4.1, 1.4 and 2.7µs, respectively.

With the synchronous completion model, improvement in hardware latency directly translates to improvement in software stack overhead. However, the same does not hold for the asynchronous model. For instance, using projected PCIe Gen3 bus performance, the spin-wait time is expected to be reduced from the current 2.9µs to 1.5µs, making the synchronous path time 3.0µs, while the asynchronous path overhead remains the same at 6.3µs. Of course the converse is also true: slow SSDs will be felt by the synchronous model, but not by the asynchronous model – clearly these results are most relevant for very low latency NVM.

This measurement study also sets a lower bound on the SSD latency for which the asynchronous completion model recovers absolutely no useful time for other processes: 1.4µs (Tb in Figure 2).

3.4 Further issues with interrupt-driven I/O

The increased stack efficiency gained with the synchronous model for low latency storage devices does not just result in lower latency, but also in higher IOPS. Figure 3 shows the IOPS scaling for an increasing number of CPUs performing 512B randomly addressed reads. For this test, both the synchronous and asynchronous models use 100% of each included CPU. The synchronous model does so with just a single thread per CPU, while the asynchronous model required up to 8 threads per CPU to achieve maximum IOPS. In the asynchronous model, the total number of threads needed increases with the number of processors to compensate for the larger per-I/O latency. The synchronous model shows the best per-CPU I/O performance, scaling linearly with the increased number of CPUs up to 2 million IOPS – the hardware limitation of our prototype device. Even with its larger number of threads per CPU, the asynchronous model displays a significantly lower I/O rate, achieving only 60-70% of the synchronous model. This lower I/O rate is a result of inefficiencies inherent in the use of the asynchronous model when accessing such a low latency storage device. We discuss these inefficiencies in the following sections. It should be noted that this discussion is correct only for a very low latency storage device, like the one used here: traditional higher latency storage devices gain compelling efficiencies from the use of the asynchronous model.

[Figure 3: plot omitted.] Figure 3. Scaling of storage I/Os per second (IOPS) with increased number of CPUs. For asynchronous IOPS, I/O threads are added until the utilization of each CPU reaches 100%.

Interrupt overhead

The asynchronous model necessarily includes the generation and service of an interrupt. This interrupt brings with it extra, otherwise unnecessary work, increasing CPU utilization and therefore decreasing the I/O rate on a fully loaded system. Another problem is that the kernel processes hardware interrupts at high priority. Our prototype device can deliver hundreds of thousands of interrupts per second. Even if the asynchronous model driver completes multiple outstanding I/Os during a single hardware interrupt invocation, the device generates interrupts fast enough to saturate the system and cause user-noticeable delays. Further, while coalescing interrupts reduces CPU utilization overhead, it also increases completion latencies for individual I/Os.

Cache and TLB pollution

The short I/O-wait period in the asynchronous model can cause a degenerative task schedule, polluting hardware caches and TLBs. This is because the default task scheduler eagerly finds any runnable thread to fill the slot vacated by an I/O. With our prototype, the available time for a scheduled-in thread is only 2.7µs, which equals 8000 CPU clock cycles. If the thread scheduled is lower priority than the original thread, the original thread will likely be re-scheduled upon the completion of the I/O – lots of state swapping for little work done. Worse, thread data held in hardware resources such as memory caches and TLBs is replaced, only to be re-populated again when the original thread is scheduled back.

CPU power-state complications

Power management used in conjunction with the asynchronous model for the short I/O-waits of our device may not only reduce the power savings, but also increase I/O completion latency. A modern processor may enter a power-saving 'C-state' when not loaded or lightly loaded. Transitions among C-states incur latency. For the asynchronous model, the CPU enters a power-saving C-state when the scheduler fails to find a thread to run after sending an I/O command. The synchronous model does not automatically allow this transition to a lower C-state since the processor is busy.

We have measured a latency impact from C-state transitions. When the processor enters a C-state, the asynchronous path takes an additional 2µs in observed hardware latency, with higher variability (Figure 1, labeled 'async C-state'). This additional latency is incurred only when the system has no other thread to schedule on the CPU. The end result is that a thread performing I/Os runs slower when it is the only thread active on the CPU – we confirmed this empirically.

It is hard for an asynchronous model driver to fine-tune C-state transitions. In the asynchronous path, the C-state transition decision is primarily made by the operating system's CPU scheduler or by the processor hardware itself. On the other hand, a device driver using synchronous completion can directly construct its spin-wait loop using instructions with power-state hints, such as mwait [3], better controlling C-state transitions.
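For illustration, a kernel-mode spin-wait loop of this form might pair monitor/mwait so the CPU dozes until the device writes the completion word; the completion-record layout and the C-state hint value are assumptions for the sketch (monitor/mwait execute at ring 0).

    #include <stdint.h>

    struct completion_rec {
            volatile uint32_t status;   /* device DMAs nonzero when done */
    };

    /* Arm the monitor on the cache line holding the status word, then
     * mwait: the CPU enters a shallow sleep and wakes when that line
     * is written (or on an interrupt). The EAX hint to mwait selects
     * the target C-state; 0 here requests the shallowest state.      */
    static void poll_with_mwait(struct completion_rec *c)
    {
            while (c->status == 0) {
                    __asm__ __volatile__("monitor"
                            :: "a"(c), "c"(0), "d"(0));
                    if (c->status != 0)         /* re-check before sleeping */
                            break;
                    __asm__ __volatile__("mwait" :: "a"(0), "c"(0));
            }
    }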
3.5 Correctness of synchronous model

A block I/O subsystem is deemed correct when it preserves the ordering requirements for I/O requests made by its frontend clients. Ultimately, we want to address the following problem:

  A client performs I/O calls 'A' and 'B' in order, and its ordering requirement is that B should get to the device after A. Does the synchronous model respect this requirement?

For brevity, we assume the client to be a user application using Linux I/O system calls. We also assume the file system and the page cache are bypassed. In fact, the file system and page cache themselves can be considered frontend clients of the block I/O subsystem.

We start with two assumptions:

  A1. The application uses blocking I/O system calls.
  A2. The application is single threaded.

Let us consider a single thread submitting A and B in order. The operating system may preempt and schedule the thread on a different CPU, but this does not affect the ordering of I/O requests since there is only a single thread of execution. Therefore, it is guaranteed that B reaches the device after A.

Let us relax A1. The application order requires the thread to submit A before B using a non-blocking interface or AIO [4]. With the synchronous model, this means that the device has already completed the I/O for A at the moment that the application makes another non-blocking system call for B. Therefore, the synchronous model guarantees that B reaches the device after A with a non-blocking I/O interface.

Relaxing A2, let us imagine two threads T1 and T2, performing A and B respectively. In order to respect the application's ordering requirement, T2 must synchronize with T1 to avoid a race, in such a way that T2 waits for T1 before submitting B. The end result is that the kernel always sees B after the kernel has safely completed the previously submitted A. Therefore, the synchronous model guarantees the ordering with multi-threaded applications.

The above exercise shows that an I/O barrier is unnecessary in the synchronous model to guarantee I/O ordering. This contrasts with the asynchronous model, where a program has to rely on an I/O barrier when it needs to force ordering. Hence, the synchronous model has the potential to further simplify storage I/O routines with respect to guaranteeing data durability and consistency.

Our synchronous device driver written for Linux has been tested with multi-threaded applications using non-blocking system calls. For instance, the driver has withstood many hours of TPC-C* benchmark runs. The driver has also been heavily utilized as system swap space. We believe that the synchronous completion model is correct and fully compatible with existing applications.
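To make the relaxed-A2 case concrete, here is a minimal two-thread sketch (POSIX threads; the device path and offsets are arbitrary illustrations): once T1's blocking write has returned under synchronous completion, A is already at the device, so ordinary user-level synchronization suffices to order B after A.

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <pthread.h>
    #include <unistd.h>

    static int fd;
    static pthread_mutex_t m  = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t  cv = PTHREAD_COND_INITIALIZER;
    static int a_done;
    static char buf[4096] __attribute__((aligned(4096)));

    static void *t1_submit_a(void *arg)
    {
            pwrite(fd, buf, sizeof buf, 0);  /* A: returns only after the
                                                device completed the I/O */
            pthread_mutex_lock(&m);
            a_done = 1;
            pthread_cond_signal(&cv);
            pthread_mutex_unlock(&m);
            return arg;
    }

    static void *t2_submit_b(void *arg)
    {
            pthread_mutex_lock(&m);
            while (!a_done)                  /* wait for T1: orders B after A */
                    pthread_cond_wait(&cv, &m);
            pthread_mutex_unlock(&m);
            pwrite(fd, buf, sizeof buf, 4096);  /* B: reaches device after A */
            return arg;
    }

    int main(void)
    {
            pthread_t t1, t2;
            fd = open("/dev/prototype_ssd", O_WRONLY | O_DIRECT); /* hypothetical */
            pthread_create(&t1, 0, t1_submit_a, 0);
            pthread_create(&t2, 0, t2_submit_b, 0);
            pthread_join(t1, 0);
            pthread_join(t2, 0);
            return close(fd);
    }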
4 Discussion

The asynchronous model may work better in processing I/O requests with large transfer sizes or in handling hardware stalls that cause long latencies. Hence, a favorable solution would be a synchronous and asynchronous hybrid, where there are two kernel paths for a block device: the synchronous path is the fast path for small transfers and is often used, whereas the asynchronous path is the slow fallback path for large transfers or hardware stalls.

We believe that existing applications have primarily assumed the asynchronous completion model and traditionally slow storage devices. Although the synchronous completion model requires little change to existing software to run correctly, some changes to the operating system and to applications will allow for faster, more efficient system operation when storage is used synchronously. We did not attempt to re-write applications, but do suggest possible software changes.
Perhaps the most significant improvement that could be achieved for I/O-intensive applications is to avoid using the non-blocking user I/O interface, such as AIO calls, when addressing a storage device synchronously. In this case, using the non-blocking interface adds overhead and complexity to the application without benefit, because the operating system already completes the I/O upon the return from a non-blocking I/O submission call. Although applications that use the non-blocking interface are functionally safe and correct with synchronous completion, the use of the non-blocking interface negates the latency and scalability gains achievable in the kernel with the synchronous completion model.
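In other words, under synchronous completion a plain blocking read already behaves like a completed AIO submission; a simple direct-I/O read loop such as the following sketch (the device path is illustrative, error handling omitted) captures the benefit without any AIO machinery.

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdlib.h>
    #include <unistd.h>

    int main(void)
    {
            /* O_DIRECT bypasses the page cache; the buffer is aligned. */
            int fd = open("/dev/prototype_ssd", O_RDONLY | O_DIRECT);
            void *buf;
            posix_memalign(&buf, 4096, 4096);

            /* With synchronous completion, each pread returns only after
             * the device has finished: no io_submit/io_getevents needed. */
            for (off_t off = 0; off < 1 << 20; off += 4096)
                    pread(fd, buf, 4096, off);

            free(buf);
            return close(fd);
    }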
When the backing storage device is fast enough to complete an I/O synchronously, applications that have traditionally self-managed I/O buffers must reevaluate their buffering strategy. We observe that many I/O-intensive applications existing today, such as databases, the operating system's page cache, and disk-swap algorithms, employ elaborate I/O buffering and prefetching schemes. Such custom I/O schemes may add overhead with little value under the synchronous completion model. Although our work in the synchronous model greatly simplifies I/O processing overhead in the kernel, application complexity may still become a bottleneck. For instance, I/O prefetching becomes far less effective and could even hurt performance. We found the performance of the page cache and disk-swapper to increase when we disabled page cache read-ahead and swap-in clustering.

Informing applications of the presence of synchronous completion is therefore necessary. For example, an ioctl() extension to query the underlying completion model should help applications decide the best I/O strategy.
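Such an extension could look like the following hypothetical sketch; the ioctl command number, names, and values are all invented for illustration, and no such interface exists in Linux.

    #include <sys/ioctl.h>

    /* Hypothetical ioctl: ask a block device which completion model
     * its driver uses, so the application can pick an I/O strategy. */
    #define BLKGETCOMPLMODEL _IOR(0x12, 200, int)  /* invented command  */
    #define COMPL_MODEL_ASYNC 0                    /* interrupt-driven  */
    #define COMPL_MODEL_SYNC  1                    /* polled in-context */

    static int use_blocking_io(int fd)
    {
            int model;
            if (ioctl(fd, BLKGETCOMPLMODEL, &model) < 0)
                    return 0;          /* unknown: assume async model   */
            return model == COMPL_MODEL_SYNC;  /* sync: skip AIO,
                                                  rethink prefetching   */
    }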
Operating system processor usage statistics must account separately for the time spent in the driver's spin-wait loop. Currently there is no accepted method of accounting for these 'spinning I/O wait' cycles. In our prototype implementation, the time spent in the polling loop is simply accounted towards system time. This may mislead people to believe no I/O has been performed, or to suspect kernel inefficiency due to increased system time.

5 Related work

Following the success of NAND-based storage, research interest has surged in next-generation non-volatile memory (NVM) elements [11,14,16,19]. Although base materials differ, these memory elements commonly promise faster and simpler media access than NAND.

Because of the DRAM-like random accessibility of many next-generation NVM technologies, there is abundant research in storage-class memories (SCM), where NVM is directly exposed as a physical address space. For instance, file systems have been proposed on SCM-based architectures [9,21]. In contrast, we approach next-generation NVM in a more evolutionary way, preserving the current hardware and software storage interface, in keeping with the huge body of existing applications.

Moneta [8] is a recent effort to evaluate the design and impact of next-generation NVM-based SSDs. Moneta hardware is akin to our prototype device in spirit because it is a block device connected via a PCIe bus, but implementation differences enabled our hardware to perform faster than Moneta. Moneta also examined spinning to cut the kernel cost, but its description is limited to the latency aspect. In contrast, this paper studied issues relevant to the viability of synchronous completion, such as IOPS scalability, interrupt thrashing, and power state.

Interrupt-driven asynchronous completion has long been the only I/O model used by the kernel to perform real storage I/Os. Storage interface standards have thus embraced hardware queueing techniques that further improve the performance of asynchronous I/O operations [2,5,6]. However, these are mostly effective for devices with slower storage media such as hard disks or NAND flash.

It is a well-known strategy to choose a poll-based waiting primitive over an event-based one when the waiting time is short. A spinlock, for example, is preferred to a system mutex lock if the duration for which the lock is held is short. Another example is the optional use of polling [18,20] for network message passing among nodes when implementing the MPI* library [13] used in high-performance computing clusters. In such systems, communication latencies among nodes are just several microseconds due to the use of a low-latency, high-bandwidth communication fabric along with a highly optimized network stack such as Remote Direct Memory Access (RDMA*).

6 Conclusion

This paper makes the case for the synchronous completion of storage I/Os. When performing storage I/O with ultra-low latency devices employing next-generation non-volatile memories, polling for completion performs better than the traditional interrupt-driven asynchronous I/O path. Our conclusion has practical importance, pointing to the need for kernel researchers to consider optimizations to the traditional kernel block storage interface with next-generation SSDs, built of next-generation NVM elements, in mind. It is our belief that non-dramatic changes can reap significant benefit.

Acknowledgements

We thank the members of the Storage Technology Group in Intel Corporation for supporting this work. We also thank our shepherd David Patterson and the anonymous reviewers for their detailed feedback and guidance. The views and conclusions in this paper are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of Intel Corporation.

References

[1] Jens Axboe. Flexible I/O tester (fio). http://git.kernel.dk/?p=fio.git;a=summary, 2010.

[2] Amber Huffman and Joni Clark. Serial ATA native command queueing. Technical white paper, http://www.seagate.com/content/pdf/whitepaper/D2c_tech_paper_intc-stx_sata_ncq.pdf, July 2003.

[3] Intel Corporation. Intel® 64 and IA-32 Architectures Software Developer's Manual, Volume 1-3. Intel, 2008.

[4] M. Tim Jones. Boost application performance using asynchronous I/O. http://www.ibm.com/developerworks/linux/library/l-async/, 2006.

[5] NVMHCI Work Group. NVM Express. http://www.nvmexpress.org/, 2011.

[6] SCSI Tagged Command Queueing, SCSI Architecture Model – 3, 2007.

[7] Daniel P. Bovet and Marco Cesati. Understanding the Linux Kernel, 3rd Ed., O'Reilly, 2005.

[8] Adrian M. Caulfield, Arup De, Joel Coburn, Todor I. Mollov, Rajesh K. Gupta, and Steven Swanson. Moneta: A high-performance storage array architecture for next-generation, non-volatile memories. In Proceedings of the 43rd International Symposium on Microarchitecture (MICRO), Atlanta, GA, December 2010.

[9] Jeremy Condit, Edmund B. Nightingale, Christopher Frost, Engin Ipek, Benjamin Lee, Doug Burger, and Derrick Coetzee. Better I/O through byte-addressable, persistent memory. In Proceedings of the Symposium on Operating Systems Principles (SOSP), pages 133-146, Big Sky, MT, October 2009.

[10] Jonathan Corbet, Alessandro Rubini, and Greg Kroah-Hartman. Linux Device Drivers, 3rd Ed., O'Reilly, 2005.

[11] B. Dieny, R. Sousa, G. Prenat, and U. Ebels. Spin-dependent phenomena and their implementation in spintronic devices. In International Symposium on VLSI Technology, Systems and Applications (VLSI-TSA), 2008.

[12] Annie Foong, Bryan Veal, and Frank Hady. Towards SSD-ready enterprise platforms. In Proceedings of the 1st International Workshop on Accelerating Data Management Systems Using Modern Processor and Storage Architectures (ADMS), Singapore, September 2010.

[13] William Gropp, Ewing Lusk, Nathan Doss, and Anthony Skjellum. A high-performance, portable implementation of the MPI message passing interface standard. Parallel Computing, 22:789-828, September 1996.

[14] S. Parkin. Racetrack memory: A storage class memory based on current controlled magnetic domain wall motion. In Device Research Conference (DRC), pages 3-6, 2009.

[15] Moinuddin K. Qureshi, Vijayalakshmi Srinivasan, and Jude A. Rivers. Scalable high performance main memory system using Phase-Change Memory technology. In Proceedings of the 36th International Symposium on Computer Architecture (ISCA), Austin, TX, June 2009.

[16] S. Raoux, G. W. Burr, M. J. Breitwisch, C. T. Rettner, Y.-C. Chen, R. M. Shelby, M. Salinga, D. Krebs, S.-H. Chen, H.-L. Lung, and C. H. Lam. Phase-change random access memory: A scalable technology. IBM Journal of Research and Development, 52:465-480, 2008.

[17] Dongjun Shin. SSD. In Linux Storage and Filesystem Workshop, San Jose, CA, February 2008.

[18] David Sitsky and Kenichi Hayashi. An MPI library which uses polling, interrupts and remote copying for the Fujitsu AP1000+. In Proceedings of the 2nd International Symposium on Parallel Architectures, Algorithms, and Networks (ISPAN), Beijing, China, June 1996.

[19] D. B. Strukov, G. S. Snider, D. R. Stewart, and R. S. Williams. The missing memristor found. Nature, 453(7191):80-83, May 2008.

[20] Sayantan Sur, Hyun-Wook Jin, Lei Chai, and Dhabaleswar K. Panda. RDMA read based rendezvous protocol for MPI over InfiniBand: design alternatives and benefits. In Proceedings of the 11th Symposium on Principles and Practice of Parallel Programming (PPoPP), pages 32-39, New York, NY, March 2006.

[21] Xiaojian Wu and Narasimha Reddy. SCMFS: A file system for storage class memory. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC11), Seattle, WA, November 2011.

_____________________________
* Other names and brands may be claimed as the property of others.
