Parallel Programming
Edited by:
Paul E. McKenney
Linux Technology Center
IBM Beaverton
paulmck@linux.vnet.ibm.com
January 2, 2017
Legal Statement
This work represents the views of the editor and the authors and does not necessarily
represent the view of their respective employers.
Trademarks:
• IBM, zSeries, and PowerPC are trademarks or registered trademarks of Interna-
tional Business Machines Corporation in the United States, other countries, or
both.
• Linux is a registered trademark of Linus Torvalds.
The non-source-code text and images in this document are provided under the terms
of the Creative Commons Attribution-Share Alike 3.0 United States license.1 In brief,
you may use the contents of this document for any purpose, personal, commercial, or
otherwise, so long as attribution to the authors is maintained. Likewise, the document
may be modified, and derivative works and translations made available, so long as
such modifications and derivations are offered to the public on equal terms as the
non-source-code text and images in the original document.
Source code is covered by various versions of the GPL.2 Some of this code is
GPLv2-only, as it derives from the Linux kernel, while other code is GPLv2-or-later.
See the comment headers of the individual source files within the CodeSamples directory
in the git archive3 for the exact licenses. If you are unsure of the license for a given
code fragment, you should assume GPLv2-only.
Combined work © 2005-2016 by Paul E. McKenney.
1 http://creativecommons.org/licenses/by-sa/3.0/us/
2 http://www.gnu.org/licenses/gpl-2.0.html
3 git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/perfbook.git
Contents
2 Introduction 7
2.1 Historic Parallel Programming Difficulties . . . . . . . . . . . . . . . . 7
2.2 Parallel Programming Goals . . . . . . . . . . . . . . . . . . . . . . 9
2.2.1 Performance . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.2.2 Productivity . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.2.3 Generality . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.3 Alternatives to Parallel Programming . . . . . . . . . . . . . . . . . . 14
2.3.1 Multiple Instances of a Sequential Application . . . . . . . . 15
2.3.2 Use Existing Parallel Software . . . . . . . . . . . . . . . . . 15
2.3.3 Performance Optimization . . . . . . . . . . . . . . . . . . . 15
2.4 What Makes Parallel Programming Hard? . . . . . . . . . . . . . . . 16
2.4.1 Work Partitioning . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.4.2 Parallel Access Control . . . . . . . . . . . . . . . . . . . . . 18
2.4.3 Resource Partitioning and Replication . . . . . . . . . . . . . 18
2.4.4 Interacting With Hardware . . . . . . . . . . . . . . . . . . . 19
2.4.5 Composite Capabilities . . . . . . . . . . . . . . . . . . . . . 19
2.4.6 How Do Languages and Environments Assist With These Tasks? 19
2.5 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
3.3.1 3D Integration . . . . . . . . . . . . . . . . . . . . . . . . . . 31
3.3.2 Novel Materials and Processes . . . . . . . . . . . . . . . . . . 31
3.3.3 Light, Not Electrons . . . . . . . . . . . . . . . . . . . . . . 32
3.3.4 Special-Purpose Accelerators . . . . . . . . . . . . . . . . . 32
3.3.5 Existing Parallel Software . . . . . . . . . . . . . . . . . . . 33
3.4 Software Design Implications . . . . . . . . . . . . . . . . . . . . . . 33
5 Counting 55
5.1 Why Isn’t Concurrent Counting Trivial? . . . . . . . . . . . . . . . . 56
5.2 Statistical Counters . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
5.2.1 Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
5.2.2 Array-Based Implementation . . . . . . . . . . . . . . . . . . 59
5.2.3 Eventually Consistent Implementation . . . . . . . . . . . . . 60
5.2.4 Per-Thread-Variable-Based Implementation . . . . . . . . . . 63
5.2.5 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
5.3 Approximate Limit Counters . . . . . . . . . . . . . . . . . . . . . . 64
5.3.1 Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
5.3.2 Simple Limit Counter Implementation . . . . . . . . . . . . . 65
5.3.3 Simple Limit Counter Discussion . . . . . . . . . . . . . . . . 71
5.3.4 Approximate Limit Counter Implementation . . . . . . . . . 72
5.3.5 Approximate Limit Counter Discussion . . . . . . . . . . . . 72
5.4 Exact Limit Counters . . . . . . . . . . . . . . . . . . . . . . . . . . 72
5.4.1 Atomic Limit Counter Implementation . . . . . . . . . . . . 73
5.4.2 Atomic Limit Counter Discussion . . . . . . . . . . . . . . . . 77
5.4.3 Signal-Theft Limit Counter Design . . . . . . . . . . . . . . . 77
5.4.4 Signal-Theft Limit Counter Implementation . . . . . . . . . . 79
5.4.5 Signal-Theft Limit Counter Discussion . . . . . . . . . . . . 85
5.5 Applying Specialized Parallel Counters . . . . . . . . . . . . . . . . 85
5.6 Parallel Counting Discussion . . . . . . . . . . . . . . . . . . . . . . 86
5.6.1 Parallel Counting Performance . . . . . . . . . . . . . . . . . 86
5.6.2 Parallel Counting Specializations . . . . . . . . . . . . . . . . 87
7 Locking 135
7.1 Staying Alive . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 136
7.1.1 Deadlock . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 137
7.1.2 Livelock and Starvation . . . . . . . . . . . . . . . . . . . . 145
7.1.3 Unfairness . . . . . . . . . . . . . . . . . . . . . . . . . . . 146
7.1.4 Inefficiency . . . . . . . . . . . . . . . . . . . . . . . . . . . . 147
7.2 Types of Locks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 147
7.2.1 Exclusive Locks . . . . . . . . . . . . . . . . . . . . . . . . 148
7.2.2 Reader-Writer Locks . . . . . . . . . . . . . . . . . . . . . . 148
7.2.3 Beyond Reader-Writer Locks . . . . . . . . . . . . . . . . . 148
7.2.4 Scoped Locking . . . . . . . . . . . . . . . . . . . . . . . . 150
7.3 Locking Implementation Issues . . . . . . . . . . . . . . . . . . . . . 152
7.3.1 Sample Exclusive-Locking Implementation Based on Atomic
Exchange . . . . . . . . . . . . . . . . . . . . . . . . . . . . 152
7.3.2 Other Exclusive-Locking Implementations . . . . . . . . . . 153
7.4 Lock-Based Existence Guarantees . . . . . . . . . . . . . . . . . . . 155
7.5 Locking: Hero or Villain? . . . . . . . . . . . . . . . . . . . . . . . . . 157
7.5.1 Locking For Applications: Hero! . . . . . . . . . . . . . . . . . 157
7.5.2 Locking For Parallel Libraries: Just Another Tool . . . . . . . . 157
7.5.3 Locking For Parallelizing Sequential Libraries: Villain! . . . . . 161
7.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 163
11 Validation 271
11.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 271
11.1.1 Where Do Bugs Come From? . . . . . . . . . . . . . . . . . 272
11.1.2 Required Mindset . . . . . . . . . . . . . . . . . . . . . . . . 273
11.1.3 When Should Validation Start? . . . . . . . . . . . . . . . . . 274
E Credits 679
E.1 Authors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 679
E.2 Reviewers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 679
E.3 Machine Owners . . . . . . . . . . . . . . . . . . . . . . . . . . . . 680
E.4 Original Publications . . . . . . . . . . . . . . . . . . . . . . . . . . 680
E.5 Figure Credits . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 680
E.6 Other Support . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 682
Chapter 1
How To Use This Book
The purpose of this book is to help you program shared-memory parallel machines
without risking your sanity.1 We hope that this book’s design principles will help
you avoid at least some parallel-programming pitfalls. That said, you should think
of this book as a foundation on which to build, rather than as a completed cathedral.
Your mission, should you choose to accept it, is to help make further progress in the exciting
field of parallel programming—progress that will in time render this book obsolete.
Parallel programming is not as hard as some say, and we hope that this book makes your
parallel-programming projects easier and more fun.
In short, where parallel programming once focused on science, research, and grand-
challenge projects, it is quickly becoming an engineering discipline. We therefore
examine specific parallel-programming tasks and describe how to approach them. In
some surprisingly common cases, they can even be automated.
This book is written in the hope that presenting the engineering discipline underlying
successful parallel-programming projects will free a new generation of parallel hackers
from the need to slowly and painstakingly reinvent old wheels, enabling them to instead
focus their energy and creativity on new frontiers. We sincerely hope that parallel
programming brings you at least as much fun, excitement, and challenge as it has
brought to us!
1.1 Roadmap
This book is a handbook of widely applicable and heavily used design techniques, rather
than a collection of optimal algorithms with tiny areas of applicability. You are currently
reading Chapter 1, but you knew that already. Chapter 2 gives a high-level overview of
parallel programming.
Chapter 3 introduces shared-memory parallel hardware. After all, it is difficult
to write good parallel code unless you understand the underlying hardware. Because
hardware constantly evolves, this chapter will always be out of date. We will nevertheless
do our best to keep up. Chapter 4 then provides a very brief overview of common shared-
memory parallel-programming primitives.
Chapter 5 takes an in-depth look at parallelizing one of the simplest problems
imaginable, namely counting. Because almost everyone has an excellent grasp of
1 Or, perhaps more accurately, without much greater risk to your sanity than that incurred by non-parallel
programming. Which, come to think of it, might not be saying all that much.
counting, this chapter is able to delve into many important parallel-programming issues
without the distractions of more-typical computer-science problems. My impression is
that this chapter has seen the greatest use in parallel-programming coursework.
Chapter 6 introduces a number of design-level methods of addressing the issues
identified in Chapter 5. It turns out that it is important to address parallelism at the
design level when feasible: To paraphrase Dijkstra [Dij68], “retrofitted parallelism
considered grossly suboptimal” [McK12b].
The next three chapters examine three important approaches to synchronization.
Chapter 7 covers locking, which in 2014 is not only the workhorse of production-quality
parallel programming, but is also widely considered to be parallel programming’s worst
villain. Chapter 8 gives a brief overview of data ownership, an often overlooked but
remarkably pervasive and powerful approach. Finally, Chapter 9 introduces a number
of deferred-processing mechanisms, including reference counting, hazard pointers,
sequence locking, and RCU.
Chapter 10 applies the lessons of previous chapters to hash tables, which are heavily
used due to their excellent partitionability, which (usually) leads to excellent perfor-
mance and scalability.
As many have learned to their sorrow, parallel programming without validation is a
sure path to abject failure. Chapter 11 covers various forms of testing. It is of course
impossible to test reliability into your program after the fact, so Chapter 12 follows up
with a brief overview of a couple of practical approaches to formal verification.
Chapter 13 contains a series of moderate-sized parallel programming problems.
The difficulty of these problems varies, but they should be appropriate for someone who has
mastered the material in the previous chapters.
Chapter 14 looks at advanced synchronization methods, including memory barriers
and non-blocking synchronization, while Chapter 15 looks at the nascent field of
parallel real-time computing. Chapter 16 follows up with some ease-of-use advice.
Finally, Chapter 17 looks at a few possible future directions, including shared-memory
parallel system design, software and hardware transactional memory, and functional
programming for parallelism.
This chapter is followed by a number of appendices. The most popular of these
appears to be Appendix B, which covers memory barriers. Appendix C contains the
answers to the infamous Quick Quizzes, which are discussed in the next section.
In short, if you need a deep understanding of the material, then you should invest
some time into answering the Quick Quizzes. Don’t get me wrong, passively reading
the material can be quite valuable, but gaining full problem-solving capability really
does require that you practice solving problems.
I learned this the hard way during coursework for my late-in-life Ph.D. I was
studying a familiar topic, and was surprised at how few of the chapter’s exercises I
could answer off the top of my head.2 Forcing myself to answer the questions greatly
increased my retention of the material. So with these Quick Quizzes I am not asking
you to do anything that I have not been doing myself!
Finally, the most common learning disability is thinking that you already know. The
quick quizzes can be an extremely effective cure.
2 So I suppose that it was just as well that my professors refused to let me waive that class!
5. If your primary focus is scientific and technical computing, and you prefer a
patternist approach, you might try Mattson et al.’s textbook [MSM05]. It covers
Java, C/C++, OpenMP, and MPI. Its patterns are admirably focused first on design,
then on implementation.
6. If your primary focus is scientific and technical computing, and you are interested
in GPUs, CUDA, and MPI, you might check out Norm Matloff’s “Programming
on Parallel Machines” [Mat13].
7. If you are interested in POSIX Threads, you might take a look at David R. Buten-
hof’s book [But97]. In addition, W. Richard Stevens’s book [Ste92] covers UNIX
and POSIX, and Stewart Weiss’s lecture notes [Wei13] provide a thorough and
accessible introduction with a good set of examples.
8. If you are interested in C++11, you might like Anthony Williams’s “C++ Concur-
rency in Action: Practical Multithreading” [Wil12].
9. If you are interested in C++, but in a Windows environment, you might try Herb
Sutter’s “Effective Concurrency” series in Dr. Dobb’s Journal [Sut08]. This series
does a reasonable job of presenting a commonsense approach to parallelism.
10. If you want to try out Intel Threading Building Blocks, then perhaps James
Reinders’s book [Rei07] is what you are looking for.
11. Those interested in learning how various types of multi-processor hardware cache
organizations affect the implementation of kernel internals should take a look at
Curt Schimmel’s classic treatment of this subject [Sch94].
12. Finally, those using Java might be well-served by Doug Lea’s textbooks [Lea97,
GPB+ 07].
However, if you are interested in principles of parallel design for low-level software,
especially software written in C, read on!
This command will locate the file rcu_rcpls.c, which is called out in Sec-
tion 9.5.5. Other types of systems have well-known ways of locating files by filename.
To create patches or git pull requests, you will need the LATEX source to the
book, which is at git://git.kernel.org/pub/scm/linux/kernel/git/
paulmck/perfbook.git. You will of course also need git and LATEX, which
are available as part of most mainstream Linux distributions. Other packages may be
required, depending on the distribution you use. The required list of packages for a few
popular distributions is listed in the file FAQ-BUILD.txt in the LATEX source to the
book.
To create and display a current LATEX source tree of this book, use the list of Linux
commands shown in Figure 1.1. In some environments, the evince command that
displays perfbook.pdf may need to be replaced, for example, with acroread. The
git clone command need only be used the first time you create a PDF, subsequently,
you can run the commands shown in Figure 1.2 to pull in any updates and generate an
updated PDF. The commands in Figure 1.2 must be run within the perfbook directory
created by the commands shown in Figure 1.1.
PDFs of this book are sporadically posted at http://kernel.org/pub/linux/
kernel/people/paulmck/perfbook/perfbook.html and at http://www.
rdrop.com/users/paulmck/perfbook/.
The actual process of contributing patches and sending git pull requests is
similar to that of the Linux kernel, which is documented in the Documentation/
SubmittingPatches file in the Linux source tree. One important requirement is
that each patch (or commit, in the case of a git pull request) must contain a valid
Signed-off-by: line, which has the following format:

Signed-off-by: My Name <myname@example.org>

By adding this line, you are certifying that:
1. The contribution was created in whole or in part by me and I have the right to
submit it under the open source license indicated in the file; or
2. The contribution is based upon previous work that, to the best of my knowledge,
is covered under an appropriate open source License and I have the right under
that license to submit that work with modifications, whether created in whole
or in part by me, under the same open source license (unless I am permitted to
submit under a different license), as indicated in the file; or
3. The contribution was provided directly to me by some other person who certified
(a), (b) or (c) and I have not modified it.
4. I understand and agree that this project and the contribution are public and that
a record of the contribution (including all personal information I submit with
it, including my sign-off) is maintained indefinitely and may be redistributed
consistent with this project or the open source license(s) involved.
This is similar to the Developer’s Certificate of Origin (DCO) 1.1 used by the
Linux kernel. Item #4 notes that both this project and your contribution are public, and
that a record of the contribution (including your sign-off) may be retained and redistributed
indefinitely. If multiple people authored a contribution, each should have a Signed-off-by: line.
You must use your real name: I unfortunately cannot accept pseudonymous or
anonymous contributions.
The language of this book is American English; however, the open-source nature
of this book permits translations, and I personally encourage them. The open-source
licenses covering this book additionally allow you to sell your translation, if you wish. I
do request that you send me a copy of the translation (hardcopy if available), but this
is a request made as a professional courtesy, and is not in any way a prerequisite to
the permission that you already have under the Creative Commons and GPL licenses.
Please see the FAQ.txt file in the source tree for a list of translations currently in
progress. I consider a translation effort to be “in progress” once at least one chapter has
been fully translated.
As noted at the beginning of this section, I am this book’s editor. However, if you
choose to contribute, it will be your book as well. With that, I offer you Chapter 2, our
introduction.
If parallel programming is so hard, why are there any
parallel programs?
Unknown
Chapter 2
Introduction
Parallel programming has earned a reputation as one of the most difficult areas a hacker
can tackle. Papers and textbooks warn of the perils of deadlock, livelock, race conditions,
non-determinism, Amdahl’s-Law limits to scaling, and excessive realtime latencies. And
these perils are quite real; we authors have accumulated uncounted years of experience
dealing with them, and all of the emotional scars, grey hairs, and hair loss that go with
such experiences.
However, new technologies that are difficult to use at introduction invariably become
easier over time. For example, the once-rare ability to drive a car is now commonplace
in many countries. This dramatic change came about for two basic reasons: (1) cars
became cheaper and more readily available, so that more people had the opportunity
to learn to drive, and (2) cars became easier to operate due to automatic transmissions,
automatic chokes, automatic starters, greatly improved reliability, and a host of other
technological improvements.
The same is true of many other technologies, including computers. It is no
longer necessary to operate a keypunch in order to program. Spreadsheets allow
most non-programmers to get results from their computers that would have required
a team of specialists a few decades ago. Perhaps the most compelling example is
web-surfing and content creation, which since the early 2000s has been easily done
by untrained, uneducated people using various now-commonplace social-networking
tools. As recently as 1968, such content creation was a far-out research project [Eng68],
described at the time as “like a UFO landing on the White House lawn” [Gri00].
Therefore, if you wish to argue that parallel programming will remain as difficult as
it is currently perceived by many to be, it is you who bears the burden of proof, keeping
in mind the many centuries of counter-examples in a variety of fields of endeavor.
2.1 Historic Parallel Programming Difficulties
The difficulties of parallel programming have historically fallen into several categories, including:
1. The historic high cost and relative rarity of parallel systems.
2. The typical researcher’s and practitioner’s lack of experience with parallel systems.
3. The paucity of publicly accessible parallel code.
4. The lack of a widely understood engineering discipline for parallel programming.
5. The high cost of communication relative to that of processing, even in tightly coupled shared-memory systems.
Many of these historic difficulties are well on the way to being overcome. First, over
the past few decades, the cost of parallel systems has decreased from many multiples of
that of a house to a fraction of that of a bicycle, courtesy of Moore’s Law. Papers calling
out the advantages of multicore CPUs were published as early as 1996 [ONH+ 96]. IBM
introduced simultaneous multi-threading into its high-end POWER family in 2000, and
multicore in 2001. Intel introduced hyperthreading into its commodity Pentium line
in November 2000, and both AMD and Intel introduced dual-core CPUs in 2005. Sun
followed with the multicore/multi-threaded Niagara in late 2005. In fact, by 2008, it
was becoming difficult to find a single-CPU desktop system, with single-core CPUs
being relegated to netbooks and embedded devices. By 2012, even smartphones were
starting to sport multiple CPUs.
Second, the advent of low-cost and readily available multicore systems means
that the once-rare experience of parallel programming is now available to almost all
researchers and practitioners. In fact, parallel systems are now well within the budget of
students and hobbyists. We can therefore expect greatly increased levels of invention
and innovation surrounding parallel systems, and that increased familiarity will over
time make the once prohibitively expensive field of parallel programming much more
friendly and commonplace.
Third, in the 20th century, large systems of highly parallel software were almost
always closely guarded proprietary secrets. In happy contrast, the 21st century has
seen numerous open-source (and thus publicly available) parallel software projects,
including the Linux kernel [Tor03], database systems [Pos08, MS08], and message-
passing systems [The08, UoC08]. This book will draw primarily from the Linux kernel,
but will provide much material suitable for user-level applications.
Fourth, even though the large-scale parallel-programming projects of the 1980s and
1990s were almost all proprietary projects, these projects have seeded other communities
with a cadre of developers who understand the engineering discipline required to develop
production-quality parallel code. A major purpose of this book is to present this
engineering discipline.
Unfortunately, the fifth difficulty, the high cost of communication relative to that
of processing, remains largely in force. Although this difficulty has been receiving
increasing attention during the new millennium, according to Stephen Hawking, the
finite speed of light and the atomic nature of matter are likely to limit progress in this
area [Gar07, Moo03]. Fortunately, this difficulty has been in force since the late 1980s,
so that the aforementioned engineering discipline has evolved practical and effective
strategies for handling it. In addition, hardware designers are increasingly aware of
these issues, so perhaps future hardware will be more friendly to parallel software as
discussed in Section 3.3.
Quick Quiz 2.1: Come on now!!! Parallel programming has been known to be
exceedingly hard for many decades. You seem to be hinting that it is not so hard. What
sort of game are you playing?
2.2 Parallel Programming Goals
The three major goals of parallel programming (over and above those of sequential programming) are as follows:
1. Performance.
2. Productivity.
3. Generality.
Unfortunately, given the current state of the art, it is possible to achieve at best two
of these three goals for any given parallel program. These three goals therefore form the
iron triangle of parallel programming, a triangle upon which overly optimistic hopes all
too often come to grief.1
Quick Quiz 2.3: Oh, really??? What about correctness, maintainability, robustness,
and so on?
Quick Quiz 2.4: And if correctness, maintainability, and robustness don’t make the
list, why do productivity and generality?
Quick Quiz 2.5: Given that parallel programs are much harder to prove correct than
are sequential programs, again, shouldn’t correctness really be on the list?
Quick Quiz 2.6: What about just having fun?
Each of these goals is elaborated upon in the following sections.
2.2.1 Performance
Performance is the primary goal behind most parallel-programming effort. After all, if
performance is not a concern, why not do yourself a favor: Just write sequential code,
and be happy? It will very likely be easier and you will probably get done much more
quickly.
Quick Quiz 2.7: Are there no cases where parallel programming is about something
other than performance?
Note that “performance” is interpreted quite broadly here, including scalability
(performance per CPU) and efficiency (for example, performance per watt).
That said, the focus of performance has shifted from hardware to parallel software.
This change in focus is due to the fact that, although Moore’s Law continues to deliver
increases in transistor density, it has ceased to provide the traditional single-threaded
[Figure 2.1: MIPS/clock-frequency trend, plotted by year from 1975 to 2015 on a logarithmic scale]
performance increases. This can be seen in Figure 2.1 (see footnote 2), which shows that writing
single-threaded code and simply waiting a year or two for the CPUs to catch up may
no longer be an option. Given the recent trends on the part of all major manufacturers
towards multicore/multithreaded systems, parallelism is the way to go for those wanting
to avail themselves of the full performance of their systems.
Even so, the first goal is performance rather than scalability, especially given that the
easiest way to attain linear scalability is to reduce the performance of each CPU [Tor01].
Given a four-CPU system, which would you prefer? A program that provides 100
transactions per second on a single CPU, but does not scale at all? Or a program that
provides 10 transactions per second on a single CPU, but scales perfectly? The first
program seems like a better bet, though the answer might change if you happened to
have a 32-CPU system.
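To make this comparison concrete, suppose the second program scales perfectly, so that its throughput on N CPUs grows in proportion to N:

    throughput_1(N) = 100 transactions/second   (no scaling)
    throughput_2(N) = 10 * N transactions/second (perfect scaling)

The second program pulls ahead only once N exceeds ten: on the four-CPU system it delivers a mere 40 transactions per second, but on a 32-CPU system it would deliver 320 against the first program’s unchanging 100.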
That said, just because you have multiple CPUs is not necessarily in and of itself
a reason to use them all, especially given the recent decreases in price of multi-CPU
systems. The key point to understand is that parallel programming is primarily a
performance optimization, and, as such, it is one potential optimization of many. If your
program is fast enough as currently written, there is no reason to optimize, either by
parallelizing it or by applying any of a number of potential sequential optimizations.3
By the same token, if you are looking to apply parallelism as an optimization to a
sequential program, then you will need to compare parallel algorithms to the best
sequential algorithms. This may require some care, as far too many publications ignore the sequential case when analyzing the performance of parallel algorithms.
2 This plot shows clock frequencies for newer CPUs theoretically capable of retiring one or more
instructions per clock, and MIPS (millions of instructions per second, usually from the old Dhrystone
benchmark) for older CPUs requiring multiple clocks to execute even the simplest instruction. The reason for
shifting between these two measures is that the newer CPUs’ ability to retire multiple instructions per clock is
typically limited by memory-system performance. Furthermore, the benchmarks commonly used on the older
CPUs are obsolete, and it is difficult to run the newer benchmarks on systems containing the old CPUs, in part
because it is hard to find working instances of the old CPUs.
3 Of course, if you are a hobbyist whose primary interest is writing parallel software, that is more than enough reason to parallelize whatever software you are interested in.
[Figure 2.2: MIPS per die, plotted by year from 1975 to 2015 on a logarithmic scale]
2.2.2 Productivity
Quick Quiz 2.8: Why all this prattling on about non-technical issues??? And not just
any non-technical issue, but productivity of all things? Who cares?
Productivity has been becoming increasingly important in recent decades. To see
this, consider that the price of early computers was tens of millions of dollars at a time
when engineering salaries were but a few thousand dollars a year. If dedicating a team
of ten engineers to such a machine would improve its performance, even by only 10%,
then their salaries would be repaid many times over.
One such machine was the CSIRAC, the oldest still-intact stored-program computer,
which was put into operation in 1949 [Mus04, Dep06]. Because this machine was built
before the transistor era, it was constructed of 2,000 vacuum tubes, ran with a clock
frequency of 1kHz, consumed 30kW of power, and weighed more than three metric tons.
Given that this machine had but 768 words of RAM, it is safe to say that it did not suffer
from the productivity issues that often plague today’s large-scale software projects.
Today, it would be quite difficult to purchase a machine with so little computing
power. Perhaps the closest equivalents are 8-bit embedded microprocessors exemplified
by the venerable Z80 [Wik08], but even the old Z80 had a CPU clock frequency more
than 1,000 times faster than the CSIRAC. The Z80 CPU had 8,500 transistors, and could
be purchased in 2008 for less than $2 US per unit in 1,000-unit quantities. In stark
contrast to the CSIRAC, software-development costs are anything but insignificant for
the Z80.
The CSIRAC and the Z80 are two points in a long-term trend, as can be seen in
Figure 2.2. This figure plots an approximation to computational power per die over the
past three decades, showing a consistent four-order-of-magnitude increase. Note that
the advent of multicore CPUs has permitted this increase to continue unabated despite
the clock-frequency wall encountered in 2003.
One of the inescapable consequences of the rapid decrease in the cost of hardware is that software productivity becomes increasingly important: it is no longer sufficient to use the hardware efficiently; the time of software developers must be used efficiently as well.
2.2.3 Generality
One way to justify the high cost of developing parallel software is to strive for maximal
generality. All else being equal, the cost of a more-general software artifact can be
spread over more users than that of a less-general one. In fact, this economic force
explains much of the maniacal focus on portability, which can be seen as an important
special case of generality.4
Unfortunately, generality often comes at the cost of performance, productivity, or
both. For example, portability is often achieved via adaptation layers, which inevitably
exact a performance penalty. To see this more generally, consider the following popular
parallel programming environments:
C/C++ “Locking Plus Threads” : This category, which includes POSIX Threads
(pthreads) [Ope97], Windows Threads, and numerous operating-system kernel
environments, offers excellent performance (at least within the confines of a
single SMP system) and also offers good generality. Pity about the relatively low
productivity.
MPI : This Message Passing Interface [MPI08] powers the largest scientific and
technical computing clusters in the world and offers unparalleled performance
and scalability. In theory, it is general purpose, but it is mainly used for scientific
and technical computing. Its productivity is believed by many to be even lower
than that of C/C++ “locking plus threads” environments.
OpenMP : This set of compiler directives can be used to parallelize loops. It is thus
quite specific to this task, and this specificity often limits its performance. It is,
however, much easier to use than MPI or C/C++ “locking plus threads.”
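For example, a loop whose iterations are independent can often be parallelized with a single OpenMP directive, as in the following minimal sketch (illustrative only; the array, its size, and the reduction are arbitrary choices, and the code assumes a compiler flag such as gcc -fopenmp):

#include <stdio.h>

#define N 1000000

int main(void)
{
	static double a[N];
	double sum = 0.0;
	int i;

	/* The directive splits the iterations across threads and
	 * combines the per-thread partial sums at the end. */
#pragma omp parallel for reduction(+:sum)
	for (i = 0; i < N; i++) {
		a[i] = (double)i;
		sum += a[i];
	}

	printf("sum = %g\n", sum);
	return 0;
}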
[Figure 2.3: software layers — application, middleware (e.g., DBMS), system libraries, operating-system kernel, firmware, and hardware — set against the performance, productivity, and generality axes]
[Figure 2.4: a general-purpose environment and a hardware/abstraction-tailored environment at the center, surrounded by special-purpose environments, each productive for one of users 1 through 4]
It is important to note that a tradeoff between productivity and generality has existed
for centuries in many fields. For but one example, a nailgun is more productive than
a hammer for driving nails, but in contrast to the nailgun, a hammer can be used for
many things besides driving nails. It should therefore be no surprise to see similar
tradeoffs appear in the field of parallel computing. This tradeoff is shown schematically
in Figure 2.4. Here, users 1, 2, 3, and 4 have specific jobs that they need the computer to
help them with. The most productive possible language or environment for a given user is
one that simply does that user’s job, without requiring any programming, configuration,
or other setup.
Quick Quiz 2.10: This is a ridiculously unachievable ideal! Why not focus on
something that is achievable in practice?
Unfortunately, a system that does the job required by user 1 is unlikely to do
user 2’s job. In other words, the most productive languages and environments are
domain-specific, and thus by definition lacking generality.
Another option is to tailor a given programming language or environment to the
hardware system (for example, low-level languages such as assembly, C, C++, or Java)
or to some abstraction (for example, Haskell, Prolog, or Snobol), as is shown by the
circular region near the center of Figure 2.4. These languages can be considered to
be general in the sense that they are equally ill-suited to the jobs required by users 1,
2, 3, and 4. In other words, their generality is purchased at the expense of decreased
productivity when compared to domain-specific languages and environments. Worse yet,
a language that is tailored to a given abstraction is also likely to suffer from performance
and scalability problems unless and until someone figures out how to efficiently map
that abstraction to real hardware.
Is there no escape from the iron triangle’s three conflicting goals of performance,
productivity, and generality?
It turns out that there often is an escape, for example, using the alternatives to
parallel programming discussed in the next section. After all, parallel programming can
be a great deal of fun, but it is not always the best tool for the job.
The speedup available from parallelism is limited to roughly the number of CPUs (but see Section 6.5 for an
interesting exception). In contrast, the speedup available from traditional single-threaded
software optimizations can be much larger. For example, replacing a long linked list with
a hash table or a search tree can improve performance by many orders of magnitude. This
highly optimized single-threaded program might run much faster than its unoptimized
parallel counterpart, making parallelization unnecessary. Of course, a highly optimized
parallel program would be even better, aside from the added development effort required.
Furthermore, different programs might have different performance bottlenecks. For
example, if your program spends most of its time waiting on data from your disk drive,
using multiple CPUs will probably just increase the time wasted waiting for the disks.
In fact, if the program was reading from a single large file laid out sequentially on a
rotating disk, parallelizing your program might well make it a lot slower due to the
added seek overhead. You should instead optimize the data layout so that the file can be
smaller (thus faster to read), split the file into chunks which can be accessed in parallel
from different drives, cache frequently accessed data in main memory, or, if possible,
reduce the amount of data that must be read.
Quick Quiz 2.12: What other bottlenecks might prevent additional CPUs from
providing additional performance?
Parallelism can be a powerful optimization technique, but it is not the only such
technique, nor is it appropriate for all situations. Of course, the easier it is to parallelize
your program, the more attractive parallelization becomes as an optimization. Paral-
lelization has a reputation of being quite difficult, which leads to the question “exactly
what makes parallel programming so difficult?”
2.4 What Makes Parallel Programming Hard?
[Figure 2.5: categories of tasks required of parallel programmers — work partitioning, parallel access control, resource partitioning and replication, and interacting with hardware — set among the performance, productivity, and generality goals]
Parallel programming requires the programmer to undertake tasks that sequential programming does not. These tasks fall into the four categories shown in Figure 2.5, each of which is covered in the following sections.
2.5 Discussion
This section has given an overview of the difficulties with, goals of, and alternatives
to parallel programming. This overview was followed by a discussion of what can
make parallel programming hard, along with a high-level approach for dealing with
parallel programming’s difficulties. Those who still insist that parallel programming
is impossibly difficult should review some of the older guides to parallel programming [Seq88, Dig89, BK85, Inm85]. The following quote from Andrew Birrell’s
monograph [Dig89] is especially telling:
Writing concurrent programs has a reputation for being exotic and difficult.
I believe it is neither. You need a system that provides you with good
primitives and suitable libraries, you need a basic caution and carefulness,
you need an armory of useful techniques, and you need to know of the
common pitfalls. I hope that this paper has helped you towards sharing my
belief.
The authors of these older guides were well up to the parallel programming challenge
back in the 1980s. As such, there are simply no excuses for refusing to step up to the
parallel-programming challenge here in the 21st century!
We are now ready to proceed to the next chapter, which dives into the relevant
properties of the parallel hardware underlying our parallel software.
Premature abstraction is the root of all evil.
A cast of thousands
Chapter 3
Hardware and its Habits
Most people have an intuitive understanding that passing messages between systems is
considerably more expensive than performing simple calculations within the confines of
a single system. However, it is not always so clear that communicating among threads
within the confines of a single shared-memory system can also be quite expensive. This
chapter therefore looks at the cost of synchronization and communication within a
shared-memory system. These few pages can do no more than scratch the surface of
shared-memory parallel hardware design; readers desiring more detail would do well to
start with a recent edition of Hennessy and Patterson’s classic text [HP11, HP95].
Quick Quiz 3.1: Why should parallel programmers bother learning low-level prop-
erties of the hardware? Wouldn’t it be easier, better, and more general to remain at a
higher level of abstraction?
3.1 Overview
Careless reading of computer-system specification sheets might lead one to believe that
CPU performance is a footrace on a clear track, as illustrated in Figure 3.1, where the
race always goes to the swiftest.
Although there are a few CPU-bound benchmarks that approach the ideal shown
in Figure 3.1, the typical program more closely resembles an obstacle course than a
race track. This is because the internal architecture of CPUs has changed dramatically
over the past few decades, courtesy of Moore’s Law. These changes are described in the
following sections.
Branch prediction works well for programs with highly predictable control flow, for example, tight loops performing arithmetic on large matrices or vectors. The CPU can then correctly predict that the branch at the end of the
loop will be taken in almost all cases, allowing the pipeline to be kept full and the CPU
to execute at full speed.
However, branch prediction is not always so easy. For example, consider a program
with many loops, each of which iterates a small but random number of times. For
another example, consider an object-oriented program with many virtual objects that can
reference many different real objects, all with different implementations for frequently
invoked member functions. In these cases, it is difficult or even impossible for the
CPU to predict where the next branch might lead. Then either the CPU must stall
waiting for execution to proceed far enough to be certain where that branch leads, or
it must guess. Although guessing works extremely well for programs with predictable
control flow, for unpredictable branches (such as those in binary search) the guesses will
frequently be wrong. A wrong guess can be expensive because the CPU must discard
any speculatively executed instructions following the corresponding branch, resulting in
a pipeline flush. If pipeline flushes appear too frequently, they drastically reduce overall
performance, as fancifully depicted in Figure 3.3.
[Figure 3.3: a CPU encountering a pipeline flush caused by a branch misprediction]
Unfortunately, pipeline flushes are not the only hazards in the obstacle course that
modern CPUs must run. The next section covers the hazards of referencing memory.
1 It is only fair to add that each of these single cycles lasted no less than 1.6 microseconds.
One such obstacle is atomic operations. The problem here is that the whole idea of an
atomic operation conflicts with the piece-at-a-time assembly-line operation of a CPU
pipeline. To hardware designers’ credit, modern CPUs use a number of extremely clever
tricks to make such operations look atomic even though they are in fact being executed
piece-at-a-time, with one common trick being to identify all the cachelines containing
the data to be atomically operated on, ensure that these cachelines are owned by the
CPU executing the atomic operation, and only then proceed with the atomic operation
while ensuring that these cachelines remain owned by this CPU. Because all the data
is private to this CPU, other CPUs are unable to interfere with the atomic operation
despite the piece-at-a-time nature of the CPU’s pipeline. Needless to say, this sort of
trick can require that the pipeline must be delayed or even flushed in order to perform
the setup operations that permit a given atomic operation to complete correctly.
In contrast, when executing a non-atomic operation, the CPU can load values from
cachelines as they appear and place the results in the store buffer, without the need
to wait for cacheline ownership. Fortunately, CPU designers have focused heavily on
atomic operations, so that as of early 2014 they have greatly reduced their overhead.
Even so, the resulting effect on performance is all too often as depicted in Figure 3.5.
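As a concrete sketch of a CAS-based operation (using GCC’s __atomic builtins here rather than any particular kernel or library primitive), an atomic increment can be built from a compare-and-swap retry loop; each CAS attempt must gain exclusive ownership of the variable’s cacheline before it can succeed:

static unsigned long counter;

/* Add 1 to *p atomically, retrying if some other CPU got there first. */
static void atomic_inc(unsigned long *p)
{
	unsigned long old, newval;

	do {
		old = __atomic_load_n(p, __ATOMIC_RELAXED);
		newval = old + 1;
	} while (!__atomic_compare_exchange_n(p, &old, newval, 0,
					      __ATOMIC_RELAXED,
					      __ATOMIC_RELAXED));
}

A call such as atomic_inc(&counter) then behaves as a single indivisible update even when many CPUs invoke it concurrently.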
Unfortunately, atomic operations usually apply only to single elements of data. Be-
cause many parallel algorithms require that ordering constraints be maintained between
updates of multiple data elements, most CPUs provide memory barriers. These memory
barriers also serve as performance-sapping obstacles, as described in the next section.
Quick Quiz 3.2: What types of machines would allow atomic operations on multiple
data elements?
[Figure 3.6: a CPU encountering a memory barrier]
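Consider, for example, a thread executing a simple lock-based critical section along the following lines (Linux-kernel-style primitives are shown purely for concreteness):

	spin_lock(&mylock);
	a = a + 1;
	spin_unlock(&mylock);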
If the CPU were not constrained to execute these statements in the order shown, the
effect would be that the variable “a” would be incremented without the protection of
“mylock”, which would certainly defeat the purpose of acquiring it. To prevent such
destructive reordering, locking primitives contain either explicit or implicit memory
barriers. Because the whole purpose of these memory barriers is to prevent reorderings
that the CPU would otherwise undertake in order to increase performance, memory
barriers almost always reduce performance, as depicted in Figure 3.6.
As with atomic operations, CPU designers have been working hard to reduce
memory-barrier overhead, and have made substantial progress.
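Where no lock is involved, the barriers may instead be written explicitly. The following sketch (using C11-style fences via GCC builtins; the variable names are illustrative) publishes data through a flag, with the fences preventing the CPU and compiler from reordering accesses across them:

static int data;
static int ready;

static void producer(void)
{
	data = 42;
	__atomic_thread_fence(__ATOMIC_RELEASE);	/* order data before ready */
	__atomic_store_n(&ready, 1, __ATOMIC_RELAXED);
}

static int consumer(void)
{
	while (!__atomic_load_n(&ready, __ATOMIC_RELAXED))
		continue;	/* spin until the flag is set */
	__atomic_thread_fence(__ATOMIC_ACQUIRE);	/* order ready before data */
	return data;
}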
The ratio of the overhead of communication to that of the actual work being performed is a key design parameter. A major goal of parallel hardware de-
sign is to reduce this ratio as needed to achieve the relevant performance and scalability
goals. In turn, as will be seen in Chapter 6, a major goal of parallel software design is to
reduce the frequency of expensive operations like communications cache misses.
Of course, it is one thing to say that a given operation is an obstacle, and quite
another to show that the operation is a significant obstacle. This distinction is discussed
in the following sections.
3.2 Overheads
This section presents actual overheads of the obstacles to performance listed out in the
previous section. However, it is first necessary to get a rough view of hardware system
architecture, which is the subject of the next section.
[Figure: system hardware architecture — eight CPUs (CPU 0 through CPU 7), each with its own cache, each pair of CPUs on a die sharing an interconnect, and the dies joined by a system interconnect]
Consider what happens when CPU 0 performs a compare-and-swap (CAS) on a variable whose cacheline currently resides in CPU 7’s cache. The following simplified sequence of events ensues:
1. CPU 0 checks its local cache, and does not find the cacheline.
2. The request is forwarded to CPU 0’s and 1’s interconnect, which checks CPU 1’s
local cache, and does not find the cacheline.
3. The request is forwarded to the system interconnect, which checks with the other
three dies, learning that the cacheline is held by the die containing CPU 6 and 7.
4. The request is forwarded to CPU 6’s and 7’s interconnect, which checks both
CPUs’ caches, finding the value in CPU 7’s cache.
5. CPU 7 forwards the cacheline to its interconnect, and also flushes the cacheline
from its cache.
6. CPU 6’s and 7’s interconnect forwards the cacheline to the system interconnect.
7. The system interconnect forwards the cacheline to CPU 0’s and 1’s interconnect.
8. CPU 0’s and 1’s interconnect forwards the cacheline to CPU 0’s cache.
9. CPU 0 can now perform the CAS operation on the value in its cache.
Quick Quiz 3.4: This is a simplified sequence of events? How could it possibly be
any more complex?
Quick Quiz 3.5: Why is it necessary to flush the cacheline from CPU 7’s cache?
This simplified sequence is just the beginning of a discipline called cache-coherency
protocols [HP95, CSG99, MHS12, SHW11].
Operation               Cost (ns)      Ratio (cost/clock)
Clock period                  0.6                 1.0
Best-case CAS                37.9                63.2
Best-case lock               65.6               109.3
Single cache miss           139.5               232.5
CAS cache miss              306.0               510.0
Comms Fabric              5,000.0             8,330.0
Global Comms        195,000,000.0       325,000,000.0
Although modern microprocessors can often retire multiple instructions per clock period, the operations’ costs are nevertheless normalized to a clock period in the third column, labeled “Ratio”. The first thing to note about this table is the large values of many of the ratios.
The best-case compare-and-swap (CAS) operation consumes almost forty nanosec-
onds, a duration more than sixty times that of the clock period. Here, “best case” means
that the same CPU now performing the CAS operation on a given variable was the
last CPU to operate on this variable, so that the corresponding cache line is already
held in that CPU’s cache. Similarly, the best-case lock operation (a “round trip” pair
consisting of a lock acquisition followed by a lock release) consumes more than sixty
nanoseconds, or more than one hundred clock cycles. Again, “best case” means that
the data structure representing the lock is already in the cache belonging to the CPU
acquiring and releasing the lock. The lock operation is more expensive than CAS
because it requires two atomic operations on the lock data structure.
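For example, a minimal exclusive lock can be built from atomic exchange (a sketch only, not the implementation measured here): both acquisition and release perform an atomic operation on the lock word, and each must therefore gain exclusive access to the corresponding cacheline.

struct xchg_lock {
	int locked;	/* 0: available, 1: held */
};

static void xchg_lock_acquire(struct xchg_lock *lp)
{
	/* Spin until the previous value was 0, meaning we now hold the lock. */
	while (__atomic_exchange_n(&lp->locked, 1, __ATOMIC_ACQUIRE))
		continue;
}

static void xchg_lock_release(struct xchg_lock *lp)
{
	/* A second atomic operation on the lock word hands it back. */
	(void)__atomic_exchange_n(&lp->locked, 0, __ATOMIC_RELEASE);
}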
An operation that misses the cache consumes almost one hundred and forty nanosec-
onds, or more than two hundred clock cycles. The code used for this cache-miss
measurement passes the cache line back and forth between a pair of CPUs, so this cache
miss is satisfied not from memory, but rather from the other CPU’s cache. A CAS
operation, which must look at the old value of the variable as well as store a new value,
consumes over three hundred nanoseconds, or more than five hundred clock cycles.
Think about this a bit. In the time required to do one CAS operation, the CPU could
have executed more than five hundred normal instructions. This should demonstrate the
limitations not only of fine-grained locking, but of any other synchronization mechanism
relying on fine-grained global agreement.
Quick Quiz 3.6: Surely the hardware designers could be persuaded to improve
this situation! Why have they been content with such abysmal performance for these
single-instruction operations?
I/O operations are even more expensive. As shown in the “Comms Fabric” row,
high performance (and expensive!) communications fabric, such as InfiniBand or any
number of proprietary interconnects, has a latency of roughly five microseconds for an
end-to-end round trip, during which time more than eight thousand instructions might
have been executed. Standards-based communications networks often require some
sort of protocol processing, which further increases the latency. Of course, geographic
distance also increases latency, with the speed-of-light through optical fiber latency
around the world coming to roughly 195 milliseconds, or more than 300 million clock cycles.
[Figure 3.11: latency benefit of 3D integration — a single 3 cm silicon die compared to a stack of four 1.5 cm dies, each layer roughly 70 micrometers thick]
Possible avenues by which hardware might improve matters include:
1. 3D integration,
2. Novel materials and processes,
3. Substituting light for electricity,
4. Special-purpose accelerators, and
5. Existing parallel software.
3.3.1 3D Integration
3-dimensional integration (3DI) is the practice of bonding very thin silicon dies to
each other in a vertical stack. This practice provides potential benefits, but also poses
significant fabrication challenges [Kni08].
Perhaps the most important benefit of 3DI is decreased path length through the
system, as shown in Figure 3.11. A 3-centimeter silicon die is replaced with a stack of
four 1.5-centimeter dies, in theory decreasing the maximum path through the system by
a factor of two, keeping in mind that each layer is quite thin. In addition, given proper
attention to design and placement, long horizontal electrical connections (which are
both slow and power hungry) can be replaced by short vertical electrical connections,
which are both faster and more power efficient.
However, delays due to levels of clocked logic will not be decreased by 3D in-
tegration, and significant manufacturing, testing, power-supply, and heat-dissipation
problems must be solved for 3D integration to reach production while still delivering on
its promise. The heat-dissipation problems might be solved using semiconductors based
on diamond, which is a good conductor for heat, but an electrical insulator. That said, it
remains difficult to grow large single diamond crystals, to say nothing of slicing them
into wafers. In addition, it seems unlikely that any of these technologies will be able to
deliver the exponential increases to which some people have become accustomed. That
said, they may be necessary steps on the path to the late Jim Gray’s “smoking hairy golf
balls” [Gra02].
The finite speed of light and the atomic nature of matter do impose fundamental limits, but there are nevertheless a few avenues of research and development focused on working around them.
One workaround for the atomic nature of matter is so-called “high-K dielectric”
materials, which allow larger devices to mimic the electrical properties of infeasibly
small devices. These materials pose some severe fabrication challenges, but nevertheless
may help push the frontiers out a bit farther. Another more-exotic workaround stores
multiple bits in a single electron, relying on the fact that a given electron can exist at a
number of energy levels. It remains to be seen if this particular approach can be made
to work reliably in production semiconductor devices.
Another proposed workaround is the “quantum dot” approach that allows much
smaller device sizes, but which is still in the research stage.
3.4 Software Design Implications
The lesson should be quite clear: parallel algorithms must be explicitly designed with
these hardware properties firmly in mind. One approach is to run nearly independent
threads. The less frequently the threads communicate, whether by atomic operations,
locks, or explicit messages, the better the application’s performance and scalability will
be. This approach will be touched on in Chapter 5, explored in Chapter 6, and taken to
its logical extreme in Chapter 8.
Another approach is to make sure that any sharing be read-mostly, which allows the
CPUs’ caches to replicate the read-mostly data, in turn allowing all CPUs fast access.
This approach is touched on in Section 5.2.3, and explored more deeply in Chapter 9.
In short, achieving excellent parallel performance and scalability means striving for
embarrassingly parallel algorithms and implementations, whether by careful choice of
data structures and algorithms, use of existing parallel applications and environments, or
transforming the problem into one for which an embarrassingly parallel solution exists.
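As a tiny illustration of the nearly-independent-threads approach (a sketch only; Chapter 5 treats this problem properly), each thread below counts in a private variable and communicates exactly once, when it finishes:

#include <pthread.h>
#include <stdio.h>

#define NR_THREADS 4

static unsigned long total;

static void *worker(void *arg)
{
	unsigned long local = 0;	/* thread-private: no cacheline sharing */
	long i;

	(void)arg;
	for (i = 0; i < 1000000; i++)
		local++;		/* the fast path communicates with no one */

	/* Communicate only once, at the very end. */
	__atomic_add_fetch(&total, local, __ATOMIC_RELAXED);
	return NULL;
}

int main(void)
{
	pthread_t tid[NR_THREADS];
	long i;

	for (i = 0; i < NR_THREADS; i++)
		pthread_create(&tid[i], NULL, worker, NULL);
	for (i = 0; i < NR_THREADS; i++)
		pthread_join(tid[i], NULL);
	printf("total = %lu\n", total);
	return 0;
}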
Quick Quiz 3.10: OK, if we are going to have to apply distributed-programming
techniques to shared-memory parallel programs, why not just always use these dis-
tributed techniques and dispense with shared memory?
So, to sum up:
1. The good news is that multicore systems are inexpensive and readily available.
2. More good news: The overhead of many synchronization operations is much
lower than it was on parallel systems from the early 2000s.
3. The bad news is that the overhead of cache misses is still high, especially on large
systems.
The remainder of this book describes ways of handling this bad news.
In particular, Chapter 4 will cover some of the low-level tools used for parallel
programming, Chapter 5 will investigate problems and solutions to parallel counting,
and Chapter 6 will discuss design disciplines that promote performance and scalability.
You are only as good as your tools, and your tools
are only as good as you are.
Unknown
Chapter 4
Tools of the Trade
This chapter provides a brief introduction to some basic tools of the parallel-programming
trade, focusing mainly on those available to user applications running on operating
systems similar to Linux. Section 4.1 begins with scripting languages, Section 4.2
describes the multi-process parallelism supported by the POSIX API and touches on
POSIX threads, Section 4.3 presents analogous operations in other environments, and
finally, Section 4.4 helps to choose the tool that will get the job done.
Quick Quiz 4.1: You call these tools??? They look more like low-level synchro-
nization primitives to me!
Please note that this chapter provides but a brief introduction. More detail is available
from the references cited (and especially from the Internet), and more information on how
best to use these tools will be provided in later chapters.
4.1 Scripting Languages
For example, consider the following shell script, which runs two instances of the compute_it program in parallel:

1 compute_it 1 > compute_it.1.out &
2 compute_it 2 > compute_it.2.out &
3 wait
4 cat compute_it.1.out
5 cat compute_it.2.out

Lines 1 and 2 launch two instances of this program, redirecting their output to two separate files, with the & character directing the shell to run the two instances of the program in the background. Line 3 waits for both instances to complete, and lines 4 and 5 display their output. The resulting execution is as shown in Figure 4.1: the two instances of compute_it execute in parallel, wait completes after both of them do, and then the two instances of cat execute sequentially.
Quick Quiz 4.2: But this silly shell script isn’t a real parallel program! Why bother with such trivia???
Quick Quiz 4.3: Is there a simpler way to create a parallel shell script? If so, how? If not, why not?
For another example, the make software-build scripting language provides a -j option that specifies how much parallelism should be introduced into the build process.
For example, typing make -j4 when building a Linux kernel specifies that up to four
parallel compiles be carried out concurrently.
It is hoped that these simple examples convince you that parallel programming need
not always be complex or difficult.
Quick Quiz 4.4: But if script-based parallel programming is so easy, why bother
with anything else?
1 pid = fork();
2 if (pid == 0) {
3 /* child */
4 } else if (pid < 0) {
5 /* parent, upon error */
6 perror("fork");
7 exit(-1);
8 } else {
9 /* parent, pid == child ID */
10 }
1 void waitall(void)
2 {
3 int pid;
4 int status;
5
6 for (;;) {
7 pid = wait(&status);
8 if (pid == -1) {
9 if (errno == ECHILD)
10 break;
11 perror("wait");
12 exit(-1);
13 }
14 }
15 }
The typical usage of the fork() primitive is shown in Figure 4.2 (forkjoin.c). Line 1 executes the fork() primitive, and saves its return
value in local variable pid. Line 2 checks to see if pid is zero, in which case, this is the
child, which continues on to execute line 3. As noted earlier, the child may terminate via
the exit() primitive. Otherwise, this is the parent, which checks for an error return
from the fork() primitive on line 4, and prints an error and exits on lines 5-7 if so.
Otherwise, the fork() has executed successfully, and the parent therefore executes
line 9 with the variable pid containing the process ID of the child.
The parent process may use the wait() primitive to wait for its children to com-
plete. However, use of this primitive is a bit more complicated than its shell-script
counterpart, as each invocation of wait() waits for but one child process. It is there-
fore customary to wrap wait() into a function similar to the waitall() function
shown in Figure 4.3 (api-pthread.h), with this waitall() function having se-
mantics similar to the shell-script wait command. Each pass through the loop spanning
lines 6-15 waits on one child process. Line 7 invokes the wait() primitive, which
blocks until a child process exits, and returns that child’s process ID. If the process ID
is instead −1, this indicates that the wait() primitive was unable to wait on a child. If
so, line 9 checks for the ECHILD errno, which indicates that there are no more child
processes, so that line 10 exits the loop. Otherwise, lines 11 and 12 print an error and
exit.
Quick Quiz 4.5: Why does this wait() primitive need to be so complicated? Why
not just make it work like the shell-script wait does?
It is critically important to note that the parent and child do not share memory. This
is illustrated by the program shown in Figure 4.4 (forkjoinvar.c), in which the
child sets a global variable x to 1 on line 6, prints a message on line 7, and exits on
line 8. The parent continues at line 14, where it waits on the child, and on line 15 finds
that its copy of the variable x is still zero. The output is thus as follows:
1 int x = 0;
2 int pid;
3
4 pid = fork();
5 if (pid == 0) { /* child */
6 x = 1;
7 printf("Child process set x=1\n");
8 exit(0);
9 }
10 if (pid < 0) { /* parent, upon error */
11 perror("fork");
12 exit(-1);
13 }
14 waitall();
15 printf("Parent process sees x=%d\n", x);
Quick Quiz 4.6: Isn’t there a lot more to fork() and wait() than discussed
here?
The finest-grained parallelism requires shared memory, and this is covered in Sec-
tion 4.2.2. That said, shared-memory parallelism can be significantly more complex
than fork-join parallelism.
Note that this program carefully makes sure that only one of the threads stores a
value to variable x at a time. Any situation in which one thread might be storing a
value to a given variable while some other thread either loads from or stores to that
1 int x = 0;
2
3 void *mythread(void *arg)
4 {
5 x = 1;
6 printf("Child process set x=1\n");
7 return NULL;
8 }
9
10 int main(int argc, char *argv[])
11 {
12 pthread_t tid;
13 void *vp;
14
15 if (pthread_create(&tid, NULL,
16 mythread, NULL) != 0) {
17 perror("pthread_create");
18 exit(-1);
19 }
20 if (pthread_join(tid, &vp) != 0) {
21 perror("pthread_join");
22 exit(-1);
23 }
24 printf("Parent process sees x=%d\n", x);
25 return 0;
26 }
same variable is termed a “data race”. Because the C language makes no guarantee that
the results of a data race will be in any way reasonable, we need some way of safely
accessing and modifying data concurrently, such as the locking primitives discussed in
the following section.
Quick Quiz 4.8: If the C language makes no guarantees in the presence of a data race,
then why does the Linux kernel have so many data races? Are you trying to tell me that
the Linux kernel is completely broken???
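For example, the data race in the program above can be eliminated by having both the parent and the child access x only while holding a pthread_mutex_t. The following is merely an illustrative sketch (error checking of the mutex operations is omitted for brevity, and this is not the lock.c example analyzed next):

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
int x = 0;

void *mythread(void *arg)
{
	pthread_mutex_lock(&lock);
	x = 1;                      /* store to x only while holding the lock */
	pthread_mutex_unlock(&lock);
	return NULL;
}

int main(int argc, char *argv[])
{
	pthread_t tid;
	int myx;

	if (pthread_create(&tid, NULL, mythread, NULL) != 0) {
		perror("pthread_create");
		exit(-1);
	}
	pthread_mutex_lock(&lock);
	myx = x;                    /* load from x under the same lock */
	pthread_mutex_unlock(&lock);
	printf("Parent sees x=%d\n", myx);  /* either 0 or 1, never garbage */
	if (pthread_join(tid, NULL) != 0) {
		perror("pthread_join");
		exit(-1);
	}
	return 0;
}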
Figure 4.6 shows an example use of these primitives, with two threads manipulating a shared variable x.
Lines 5-28 define a function lock_reader(), which repeatedly reads the shared
variable x while holding the lock specified by arg. Line 10 casts arg to a pointer to a
pthread_mutex_t, as required by the pthread_mutex_lock() and pthread_
mutex_unlock() primitives.
Quick Quiz 4.10: Why not simply make the argument to lock_reader() on
line 5 of Figure 4.6 be a pointer to a pthread_mutex_t?
Lines 12-15 acquire the specified pthread_mutex_t, checking for errors and
exiting the program if any occur. Lines 16-23 repeatedly check the value of x, printing
the new value each time that it changes. Line 22 sleeps for one millisecond, which
allows this demonstration to run nicely on a uniprocessor machine. Lines 24-27 release
the pthread_mutex_t, again checking for errors and exiting the program if any
occur. Finally, line 28 returns NULL, again to match the function type required by
pthread_create().
Quick Quiz 4.11: Writing four lines of code for each acquisition and release of a
pthread_mutex_t sure seems painful! Isn’t there a better way?
Lines 31-49 of Figure 4.6 show lock_writer(), which periodically updates
the shared variable x while holding the specified pthread_mutex_t. As with
lock_reader(), line 34 casts arg to a pointer to pthread_mutex_t, lines 36-
39 acquire the specified lock, and lines 44-47 release it. While holding the lock,
lines 40-43 increment the shared variable x, sleeping for five milliseconds between each
increment.
Figure 4.7 shows a code fragment that runs lock_reader() and lock_writer()
as threads using the same lock, namely, lock_a. Lines 2-6 create a thread running
lock_reader(), and then lines 7-11 create a thread running lock_writer().
Lines 12-19 wait for both threads to complete. The output of this code fragment is as
follows:
Creating two threads using same lock:
lock_reader(): x = 0
Because both threads are using the same lock, the lock_reader() thread cannot
see any of the intermediate values of x produced by lock_writer() while holding
the lock.
Quick Quiz 4.12: Is “x = 0” the only possible output from the code fragment shown
in Figure 4.7? If so, why? If not, what other output could appear, and why?
Figure 4.8 shows a similar code fragment, but this time using different locks: lock_
a for lock_reader() and lock_b for lock_writer(). The output of this code
fragment is as follows:
Creating two threads w/different locks:
lock_reader(): x = 0
lock_reader(): x = 1
lock_reader(): x = 2
lock_reader(): x = 3
Because the two threads are using different locks, they do not exclude each other,
and can run concurrently. The lock_reader() function can therefore see the inter-
mediate values of x stored by lock_writer().
Quick Quiz 4.13: Using different locks could cause quite a bit of confusion, what
with threads seeing each others’ intermediate states. So should well-written parallel
programs restrict themselves to using a single lock in order to avoid this kind of
confusion?
Quick Quiz 4.14: In the code shown in Figure 4.8, is lock_reader() guaran-
teed to see all the values produced by lock_writer()? Why or why not?
Quick Quiz 4.15: Wait a minute here!!! Figure 4.7 didn’t initialize shared variable
x, so why does it need to be initialized in Figure 4.8?
Although there is quite a bit more to POSIX exclusive locking, these primitives
provide a good start and are in fact sufficient in a great many situations. The next section
takes a brief look at POSIX reader-writer locking.
[Figure 4.10: measured performance relative to ideal (the "ideal" line at 1.0) as a function of Number of CPUs (Threads), with one curve per holdtime value from 1K to 10M.]
This variable is initially set to GOFLAG_INIT, then set to GOFLAG_RUN after all the
reader threads have started, and finally set to GOFLAG_STOP to terminate the test run.
Lines 12-41 define reader(), which is the reader thread. Line 18 atomically
increments the nreadersrunning variable to indicate that this thread is now running,
and lines 19-21 wait for the test to start. The READ_ONCE() primitive forces the
compiler to fetch goflag on each pass through the loop—the compiler would otherwise
be within its rights to assume that the value of goflag would never change.
Quick Quiz 4.16: Instead of using READ_ONCE() everywhere, why not just
declare goflag as volatile on line 10 of Figure 4.9?
Quick Quiz 4.17: READ_ONCE() only affects the compiler, not the CPU. Don’t we
also need memory barriers to make sure that the change in goflag’s value propagates
to the CPU in a timely fashion in Figure 4.9?
Quick Quiz 4.18: Would it ever be necessary to use READ_ONCE() when access-
ing a per-thread variable, for example, a variable declared using the gcc __thread
storage class?
The loop spanning lines 22-38 carries out the performance test. Lines 23-26 acquire
the lock, lines 27-29 hold the lock for the specified duration (and the barrier()
directive prevents the compiler from optimizing the loop out of existence), lines 30-33
release the lock, and lines 34-36 wait for the specified duration before re-acquiring the
lock. Line 37 counts this lock acquisition.
Line 39 moves the lock-acquisition count to this thread’s element of the readcounts[]
array, and line 40 returns, terminating this thread.
Figure 4.10 shows the results of running this test on a 64-core Power-5 system
with two hardware threads per core for a total of 128 software-visible CPUs. The
thinktime parameter was zero for all these tests, and the holdtime parameter was set
to values ranging from one thousand (“1K” on the graph) to 100 million (“100M” on
the graph). The actual value plotted is:
    L_N / (N L_1)                                    (4.1)
succeeded and 0 if it failed, for example, if the prior value was not equal to the spec-
ified old value. The second variant returns the prior value of the location, which, if
equal to the specified old value, indicates that the operation succeeded. Either of these
compare-and-swap operations is “universal” in the sense that any atomic operation on a
single location can be implemented in terms of compare-and-swap, though the earlier
operations are often more efficient where they apply. The compare-and-swap operation
is also capable of serving as the basis for a wider set of atomic operations, though
the more elaborate of these often suffer from complexity, scalability, and performance
problems [Her90].
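For example, an atomic add can be synthesized from a compare-and-swap retry loop, which is the sense in which compare-and-swap is universal. The following sketch uses gcc's __sync_bool_compare_and_swap() builtin; the function name is this example's own:

#include <stdio.h>

/* Build an atomic add out of compare-and-swap: retry until the CAS succeeds. */
static unsigned long atomic_add_via_cas(unsigned long *p, unsigned long delta)
{
	unsigned long old;
	unsigned long new;

	do {
		old = *p;          /* snapshot the current value */
		new = old + delta; /* compute the desired new value */
	} while (!__sync_bool_compare_and_swap(p, old, new));
	return new;
}

int main(void)
{
	unsigned long x = 5;

	printf("%lu\n", atomic_add_via_cas(&x, 3)); /* prints 8 */
	return 0;
}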
The __sync_synchronize() primitive issues a “memory barrier”, which con-
strains both the compiler’s and the CPU’s ability to reorder operations, as discussed in
Section 14.2. In some cases, it is sufficient to constrain the compiler’s ability to reorder
operations, while allowing the CPU free rein, in which case the barrier() primitive
may be used, as it in fact was on line 28 of Figure 4.9. In some cases, it is only necessary
to ensure that the compiler avoids optimizing away a given memory read, in which case
the READ_ONCE() primitive may be used, as it was on line 17 of Figure 4.6. Similarly,
the WRITE_ONCE() primitive may be used to prevent the compiler from optimizing
away a given memory write. These last two primitives are not provided directly by gcc,
but may be implemented straightforwardly as follows:
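For example, definitions along the following lines, patterned after the Linux kernel's long-standing volatile-cast (ACCESS_ONCE()) approach, suffice; the exact definitions in this book's CodeSamples may differ:

#define ACCESS_ONCE(x) (*(volatile typeof(x) *)&(x))     /* volatile access to x */
#define READ_ONCE(x) ACCESS_ONCE(x)                      /* force exactly one load */
#define WRITE_ONCE(x, val) ({ ACCESS_ONCE(x) = (val); }) /* force exactly one store */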
Quick Quiz 4.24: Given that these atomic operations will often be able to generate
single atomic instructions that are directly supported by the underlying instruction set,
shouldn’t they be the fastest possible way to get things done?
int smp_thread_id(void)
thread_id_t create_thread(void *(*func)(void *), void *arg)
for_each_thread(t)
for_each_running_thread(t)
void *wait_thread(thread_id_t tid)
void wait_all_threads(void)
stop (which has no POSIX equivalent), kthread_stop() to wait for them to stop,
and schedule_timeout_interruptible() for a timed wait. There are quite
a few additional kthread-management APIs, but this provides a good start, as well as
good search terms.
The CodeSamples API focuses on “threads”, which are a locus of control.2 Each
such thread has an identifier of type thread_id_t, and no two threads running at a
given time will have the same identifier. Threads share everything except for per-thread
local state,3 which includes program counter and stack.
The thread API is shown in Figure 4.11, and members are described in the following
sections.
4.3.2.1 create_thread()
The create_thread() primitive creates a new thread, starting the new thread’s
execution at the function func specified by create_thread()’s first argument,
and passing it the argument specified by create_thread()’s second argument.
This newly created thread will terminate when it returns from the starting function
specified by func. The create_thread() primitive returns the thread_id_t
corresponding to the newly created child thread.
This primitive will abort the program if more than NR_THREADS threads are created,
counting the one implicitly created by running the program. NR_THREADS is a compile-
time constant that may be modified, though some systems may have an upper bound for
the allowable number of threads.
4.3.2.2 smp_thread_id()
Because the thread_id_t returned from create_thread() is system-dependent,
the smp_thread_id() primitive returns a thread index corresponding to the thread
making the request. This index is guaranteed to be less than the maximum number of
threads that have been in existence since the program started, and is therefore useful for
bitmasks, array indices, and the like.
4.3.2.3 for_each_thread()
The for_each_thread() macro loops through all threads that exist, including all
threads that would exist if created. This macro is useful for handling per-thread variables
as will be seen in Section 4.3.5.
2There are many other names for similar software constructs, including “process”, “task”, “fiber”,
“event”, and so on. Similar design principles apply to all of them.
3 How is that for a circular definition?
4.3.2.4 for_each_running_thread()
The for_each_running_thread() macro loops through only those threads that currently exist, that is, those that are currently running.
4.3.2.5 wait_thread()
The wait_thread() primitive waits for completion of the thread specified by the
thread_id_t passed to it. This in no way interferes with the execution of the
specified thread; instead, it merely waits for it. Note that wait_thread() returns the
value that was returned by the corresponding thread.
4.3.2.6 wait_all_threads()
The wait_all_threads() primitive waits for completion of all currently running threads, providing semantics roughly similar to those of the shell-script wait command.
Figure 4.12 shows an example hello-world-like child thread. As noted earlier, each
thread is allocated its own stack, so each thread has its own private arg argument
and myarg variable. Each child simply prints its argument and its smp_thread_
id() before exiting. Note that the return statement on line 7 terminates the thread,
returning a NULL to whoever invokes wait_thread() on this thread.
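A child thread along these lines might look as follows. This is a sketch against the CodeSamples API; Figure 4.12 itself is not reproduced here, so details such as the function name are assumptions:

#include <stdint.h>
#include <stdio.h>

void *thread_test(void *arg)
{
	intptr_t myarg = (intptr_t)arg;   /* each thread gets its own copy of arg */

	printf("child thread %ld: smp_thread_id() = %d\n",
	       (long)myarg, smp_thread_id());
	return NULL;  /* this value is handed to whoever calls wait_thread() */
}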
The parent program is shown in Figure 4.13. It invokes smp_init() to initialize
the threading system on line 6, parses arguments on lines 7-14, and announces its
presence on line 15. It creates the specified number of child threads on lines 16-17, and
waits for them to complete on line 18. Note that wait_all_threads() discards
the threads' return values, as in this case they are all NULL, which is not very interesting.
Quick Quiz 4.25: What happened to the Linux-kernel equivalents to fork() and
wait()?
4.3.3 Locking
A good starting subset of the Linux kernel’s locking API is shown in Figure 4.14, each
API element being described in the following sections. This book’s CodeSamples
locking API closely follows that of the Linux kernel.
4.3.3.1 spin_lock_init()
The spin_lock_init() primitive initializes the specified spinlock_t variable,
and must be invoked before this variable is passed to any other spinlock primitive.
4.3.3.2 spin_lock()
The spin_lock() primitive acquires the specified spinlock, if necessary, waiting
until the spinlock becomes available. In some environments, such as the Linux kernel,
this waiting will involve “spinning”, while in others, such as pthreads, it will involve
blocking.
The key point is that only one thread may hold a spinlock at any given time.
4.3.3.3 spin_trylock()
The spin_trylock() primitive acquires the specified spinlock, but only if it is
immediately available. It returns true if it was able to acquire the spinlock and false
otherwise.
4.3.3.4 spin_unlock()
The spin_unlock() primitive releases the specified spinlock, allowing other threads
to acquire it.
spin_lock(&mutex);
counter++;
spin_unlock(&mutex);
Quick Quiz 4.26: What problems could occur if the variable counter were
incremented without the protection of mutex?
However, the spin_lock() and spin_unlock() primitives do have perfor-
mance consequences, as will be seen in Section 4.3.6.
Quick Quiz 4.27: How could you work around the lack of a per-thread-variable
API on systems that do not provide it?
4 You could instead use __thread or _Thread_local.
4.3.5.1 DEFINE_PER_THREAD()
The DEFINE_PER_THREAD() primitive defines a per-thread variable. Unfortunately,
it is not possible to provide an initializer in the way permitted by the Linux kernel's
DEFINE_PER_CPU() primitive, but there is an init_per_thread() primi-
tive that permits easy runtime initialization.
4.3.5.2 DECLARE_PER_THREAD()
The DECLARE_PER_THREAD() primitive is a declaration in the C sense, as opposed
to a definition. Thus, a DECLARE_PER_THREAD() primitive may be used to access a
per-thread variable defined in some other file.
4.3.5.3 per_thread()
The per_thread() primitive accesses the specified thread’s variable.
4.3.5.4 __get_thread_var()
The __get_thread_var() primitive accesses the current thread’s variable.
4.3.5.5 init_per_thread()
The init_per_thread() primitive sets all threads’ instances of the specified vari-
able to the specified value. The Linux kernel accomplishes this via normal C initializa-
tion, relying on clever use of linker scripts and code executed during the CPU-online
process.
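For example, assuming that init_per_thread() takes the variable's name and the desired initial value as its arguments, a per-thread counter could be zeroed at the start of a measurement run as follows:

DEFINE_PER_THREAD(long, counter);   /* one instance of counter per thread */

void reset_counter(void)
{
	init_per_thread(counter, 0);    /* set every thread's instance to zero */
}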
The value of the counter is then the sum of its instances. A snapshot of the value of
the counter can thus be collected as follows:
for_each_thread(i)
sum += per_thread(counter, i);
Again, it is possible to gain a similar effect using other mechanisms, but per-thread
variables combine convenience and high performance.
4.3.6 Performance
It is instructive to compare the performance of the locked increment shown in Sec-
tion 4.3.4 to that of per-CPU (or per-thread) variables (see Section 4.3.5), as well as to
conventional increment (as in “counter++”).
The difference in performance is quite large, to put it mildly. The purpose of this
book is to help you write SMP programs, perhaps with realtime response, while avoiding
such performance pitfalls. Chapter 5 starts this process by describing a few parallel
counting algorithms.
Of course, the actual overheads will depend not only on your hardware, but most
critically on the manner in which you use the primitives. In particular, randomly hacking
multi-threaded code is a spectacularly bad idea, especially given that shared-memory
parallel systems use your own intelligence against you: The smarter you are, the deeper
a hole you will dig for yourself before you realize that you are in trouble [Pok16].
Therefore, it is necessary to make the right design choices as well as the correct choice
of individual primitives, as is discussed at length in subsequent chapters.
As easy as 1, 2, 3!
Unknown
Chapter 5
Counting
Counting is perhaps the simplest and most natural thing a computer can do. However,
counting efficiently and scalably on a large shared-memory multiprocessor can be quite
challenging. Furthermore, the simplicity of the underlying concept of counting allows
us to explore the fundamental issues of concurrency without the distractions of elaborate
data structures or complex synchronization primitives. Counting therefore provides an
excellent introduction to parallel programming.
This chapter covers a number of special cases for which there are simple, fast, and
scalable counting algorithms. But first, let us find out how much you already know
about concurrent counting.
Quick Quiz 5.1: Why on earth should efficient and scalable counting be hard? After
all, computers have special hardware for the sole purpose of doing counting, addition,
subtraction, and lots more besides, don’t they???
Quick Quiz 5.2: Network-packet counting problem. Suppose that you need
to collect statistics on the number of networking packets (or total number of bytes)
transmitted and/or received. Packets might be transmitted or received by any CPU on
the system. Suppose further that this large machine is capable of handling a million
packets per second, and that there is a systems-monitoring package that reads out the
count every five seconds. How would you implement this statistical counter?
Quick Quiz 5.3: Approximate structure-allocation limit problem. Suppose
that you need to maintain a count of the number of structures allocated in order to
fail any allocations once the number of structures in use exceeds a limit (say, 10,000).
Suppose further that these structures are short-lived, that the limit is rarely exceeded,
and that a “sloppy” approximate limit is acceptable.
Quick Quiz 5.4: Exact structure-allocation limit problem. Suppose that you
need to maintain a count of the number of structures allocated in order to fail any
allocations once the number of structures in use exceeds an exact limit (again, say
10,000). Suppose further that these structures are short-lived, and that the limit is rarely
exceeded, that there is almost always at least one structure in use, and suppose further
still that it is necessary to know exactly when this counter reaches zero, for example, in
order to free up some memory that is not required unless there is at least one structure
in use.
Quick Quiz 5.5: Removable I/O device access-count problem. Suppose that
you need to maintain a reference count on a heavily used removable mass-storage device,
so that you can tell the user when it is safe to remove the device. This device follows
the usual removal procedure where the user indicates a desire to remove the device, and
1 long counter = 0;
2
3 void inc_count(void)
4 {
5 counter++;
6 }
7
8 long read_count(void)
9 {
10 return counter;
11 }
1 Interestingly enough, a pair of threads non-atomically incrementing a counter will cause the counter to
increase more quickly than a pair of threads atomically incrementing the counter. Of course, if your only goal
is to make the counter increase quickly, an easier approach is to simply assign a large value to the counter.
Nevertheless, there is likely to be a role for algorithms that use carefully relaxed notions of correctness in
[Graph: performance of counting as a function of Number of CPUs (Threads), 1 through 8.]
This poor performance should not be a surprise, given the discussion in Chapter 3,
nor should it be a surprise that the performance of atomic increment gets slower as
the number of CPUs and threads increase, as shown in Figure 5.3. In this figure, the
horizontal dashed line resting on the x axis is the ideal performance that would be
achieved by a perfectly scalable algorithm: with such an algorithm, a given increment
would incur the same overhead that it would in a single-threaded program. Atomic
increment of a single global variable is clearly decidedly non-ideal, and gets worse as
you add CPUs.
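The atomic increment being measured can be thought of as a counter whose inc_count() uses a single atomic read-modify-write instruction, for example via the gcc __sync primitives described in Chapter 4. The following is only a sketch, not necessarily the exact code used to generate the figure:

long counter = 0;

void inc_count(void)
{
	__sync_fetch_and_add(&counter, 1);  /* one atomic read-modify-write */
}

long read_count(void)
{
	return READ_ONCE(counter);          /* a plain load suffices for reading */
}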
Quick Quiz 5.8: Why doesn’t the dashed line on the x axis meet the diagonal line
at x = 1?
Quick Quiz 5.9: But atomic increment is still pretty fast. And incrementing a single
variable in a tight loop sounds pretty unrealistic to me, after all, most of the program’s
execution should be devoted to actually doing work, not accounting for the work it has
done! Why should I care about making this go faster?
For another perspective on global atomic increment, consider Figure 5.4. In order
for each CPU to get a chance to increment a given global variable, the cache line
containing that variable must circulate among all the CPUs, as shown by the red arrows.
Such circulation will take significant time, resulting in the poor performance seen in
Figure 5.3, which might be thought of as shown in Figure 5.5.
The following sections discuss high-performance counting, which avoids the delays inherent in such circulation.
5.2.1 Design
Statistical counting is typically handled by providing a counter per thread (or CPU,
when running in the kernel), so that each thread updates its own counter. The aggregate
value of the counters is read out by simply summing up all of the threads’ counters,
relying on the commutative and associative properties of addition. This is an example
1 DEFINE_PER_THREAD(long, counter);
2
3 void inc_count(void)
4 {
5 __get_thread_var(counter)++;
6 }
7
8 long read_count(void)
9 {
10 int t;
11 long sum = 0;
12
13 for_each_thread(t)
14 sum += per_thread(counter, t);
15 return sum;
16 }
that the counter is being incremented at rate r counts per unit time, and that read_
count()’s execution consumes ∆ units of time. What is the expected error in the
return value?
However, this excellent update-side scalability comes at great read-side expense for
large numbers of threads. The next section shows one way to reduce read-side expense
while still retaining the update-side scalability.
This approach gives extremely fast counter read-out while still supporting linear
counter-update performance. However, this excellent read-side performance and update-
side scalability comes at the cost of the additional thread running eventual().
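A sketch of this design follows, using the per-thread-variable and thread APIs of Section 4.3 (includes and thread creation are omitted). It follows the description of Figure 5.8 given here, but since that figure is not reproduced, details such as the stopflag used for clean shutdown are assumptions:

DEFINE_PER_THREAD(unsigned long, counter);  /* per-thread counts, updated locally */
unsigned long global_count;                 /* eventually consistent sum */
int stopflag;                               /* assumed shutdown flag */

void inc_count(void)
{
	WRITE_ONCE(__get_thread_var(counter),
	           READ_ONCE(__get_thread_var(counter)) + 1);
}

unsigned long read_count(void)
{
	return READ_ONCE(global_count);     /* fast: a single load */
}

void *eventual(void *arg)                   /* runs in its own thread */
{
	int t;
	unsigned long sum;

	while (READ_ONCE(stopflag) == 0) {
		sum = 0;
		for_each_thread(t)
			sum += READ_ONCE(per_thread(counter, t));
		WRITE_ONCE(global_count, sum);
		poll(NULL, 0, 1);           /* sleep roughly one millisecond */
	}
	return NULL;
}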
Quick Quiz 5.17: Why doesn’t inc_count() in Figure 5.8 need to use atomic
instructions? After all, we now have multiple threads accessing the per-thread counters!
Quick Quiz 5.18: Won’t the single global thread in the function eventual() of
Figure 5.8 be just as severe a bottleneck as a global lock would be?
Quick Quiz 5.19: Won’t the estimate returned by read_count() in Figure 5.8
become increasingly inaccurate as the number of threads rises?
thread exits.
Quick Quiz 5.25: Fine, but the Linux kernel doesn’t have to acquire a lock when
reading out the aggregate value of per-CPU counters. So why should user-space code
need to do this???
5.2.5 Discussion
These three implementations show that it is possible to obtain uniprocessor performance
for statistical counters, despite running on a parallel machine.
Quick Quiz 5.26: What fundamental difference is there between counting packets
and counting the total number of bytes in the packets, given that the packets vary in
size?
Quick Quiz 5.27: Given that the reader must sum all the threads’ counters, this
could take a long time given large numbers of threads. Is there any way that the
increment operation can remain fast and scalable while allowing readers to also enjoy
reasonable performance and scalability?
Given what has been presented in this section, you should now be able to answer the
Quick Quiz about statistical counters for networking near the beginning of this chapter.
5.3.1 Design
One possible design for limit counters is to divide the limit of 10,000 by the number
of threads, and give each thread a fixed pool of structures. For example, given 100
threads, each thread would manage its own pool of 100 structures. This approach is
simple, and in some cases works well, but it does not handle the common case where
a given structure is allocated by one thread and freed by another [MS93]. On the one
hand, if a given thread takes credit for any structures it frees, then the thread doing
most of the allocating runs out of structures, while the threads doing most of the freeing
have lots of credits that they cannot use. On the other hand, if freed structures are
credited to the CPU that allocated them, it will be necessary for CPUs to manipulate
each others’ counters, which will require expensive atomic instructions or other means
of communicating between threads.2
In short, for many important workloads, we cannot fully partition the counter.
Given that partitioning the counters was what brought the excellent update-side perfor-
mance for the three schemes discussed in Section 5.2, this might be grounds for some
pessimism. However, the eventually consistent algorithm presented in Section 5.2.3 pro-
vides an interesting hint. Recall that this algorithm kept two sets of books, a per-thread
2 That said, if each structure will always be freed by the same CPU (or thread) that allocated it, then this simple partitioning approach works quite well.
counter variable for updaters and a global_count variable for readers, with an
eventual() thread that periodically updated global_count to be eventually con-
sistent with the values of the per-thread counter. The per-thread counter perfectly
partitioned the counter value, while global_count kept the full value.
For limit counters, we can use a variation on this theme, in that we partially partition
the counter. For example, each of four threads could have a per-thread counter, but
each could also have a per-thread maximum value (call it countermax).
But then what happens if a given thread needs to increment its counter, but
counter is equal to its countermax? The trick here is to move half of that thread’s
counter value to a globalcount, then increment counter. For example, if a
given thread’s counter and countermax variables were both equal to 10, we do
the following:
1. Acquire a global lock.
2. Add five to globalcount.
3. To balance out the addition, subtract five from this thread’s counter.
4. Release the global lock.
5. Increment this thread’s counter, resulting in a value of six.
Although this procedure still requires a global lock, that lock need only be ac-
quired once for every five increment operations, greatly reducing that lock’s level of
contention. We can reduce this contention as low as we wish by increasing the value
of countermax. However, the corresponding penalty for increasing the value of
countermax is reduced accuracy of globalcount. To see this, note that on a
four-CPU system, if countermax is equal to ten, globalcount will be in error by
at most 40 counts. In contrast, if countermax is increased to 100, globalcount
might be in error by as much as 400 counts.
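In general, if each of N threads uses the same countermax value, globalcount omits at most countermax counts per thread, so that:

\[ \text{error} \le N \times \texttt{countermax} \]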
This raises the question of just how much we care about globalcount’s de-
viation from the aggregate value of the counter, where this aggregate value is the
sum of globalcount and each thread’s counter variable. The answer to this
question depends on how far the aggregate value is from the counter’s limit (call it
globalcountmax). The larger the difference between these two values, the larger
countermax can be without risk of exceeding the globalcountmax limit. This
means that the value of a given thread’s countermax variable can be set based on this
difference. When far from the limit, the countermax per-thread variables are set to
large values to optimize for performance and scalability, while when close to the limit,
these same variables are set to small values to minimize the error in the checks against
the globalcountmax limit.
This design is an example of parallel fastpath, which is an important design pattern
in which the common case executes with no expensive instructions and no interactions
between threads, but where occasional use is also made of a more conservatively
designed (and higher overhead) global algorithm. This design pattern is covered in more
detail in Section 6.4.
[Figure 5.11: bar diagram relating globalcount, globalreserve, and globalcountmax to each thread's counter and countermax variables.]
globalcountmax variable on line 3 contains the upper bound for the aggregate
counter, and the globalcount variable on line 4 is the global counter. The sum of
globalcount and each thread’s counter gives the aggregate value of the overall
counter. The globalreserve variable on line 5 is the sum of all of the per-thread
countermax variables. The relationship among these variables is shown by Fig-
ure 5.11:
1. The sum of globalcount and globalreserve must be less than or equal to globalcountmax.
2. The sum of all threads' countermax values must be less than or equal to globalreserve.
3. Each thread’s counter must be less than or equal to that thread’s countermax.
If the test on line 3 fails, we must access global variables, and thus must acquire
gblcnt_mutex on line 7, which we release on line 11 in the failure case or on line 16
in the success case. Line 8 invokes globalize_count(), shown in Figure 5.13,
which clears the thread-local variables, adjusting the global variables as needed, thus
simplifying global processing. (But don’t take my word for it, try coding it yourself!)
Lines 9 and 10 check to see if addition of delta can be accommodated, with the
meaning of the expression preceding the less-than sign shown in Figure 5.11 as the
difference in height of the two red (leftmost) bars. If the addition of delta cannot be
accommodated, then line 11 (as noted earlier) releases gblcnt_mutex and line 12
returns indicating failure.
Otherwise, we take the slowpath. Line 14 adds delta to globalcount, and then
line 15 invokes balance_count() (shown in Figure 5.13) in order to update both the
global and the per-thread variables. This call to balance_count() will usually set
this thread’s countermax to re-enable the fastpath. Line 16 then releases gblcnt_
mutex (again, as noted earlier), and, finally, line 17 returns indicating success.
Quick Quiz 5.30: Why does globalize_count() zero the per-thread variables,
only to later call balance_count() to refill them in Figure 5.12? Why not just leave
the per-thread variables non-zero?
Lines 20-36 show sub_count(), which subtracts the specified delta from the
counter. Line 22 checks to see if the per-thread counter can accommodate this subtrac-
tion, and, if so, line 23 does the subtraction and line 24 returns success. These lines
form sub_count()’s fastpath, and, as with add_count(), this fastpath executes
no costly operations.
If the fastpath cannot accommodate subtraction of delta, execution proceeds to
the slowpath on lines 26-35. Because the slowpath must access global state, line 26
acquires gblcnt_mutex, which is released either by line 29 (in case of failure) or
by line 34 (in case of success). Line 27 invokes globalize_count(), shown in
Figure 5.13, which again clears the thread-local variables, adjusting the global variables
as needed. Line 28 checks to see if the counter can accommodate subtracting delta,
and, if not, line 29 releases gblcnt_mutex (as noted earlier) and line 30 returns
failure.
Quick Quiz 5.31: Given that globalreserve counted against us in add_
count(), why doesn’t it count for us in sub_count() in Figure 5.12?
Quick Quiz 5.32: Suppose that one thread invokes add_count() shown in
Figure 5.12, and then another thread invokes sub_count(). Won’t sub_count()
return failure even though the value of the counter is non-zero?
If, on the other hand, line 28 finds that the counter can accommodate subtracting
delta, we complete the slowpath. Line 32 does the subtraction and then line 33
invokes balance_count() (shown in Figure 5.13) in order to update both global
and per-thread variables (hopefully re-enabling the fastpath). Then line 34 releases
gblcnt_mutex, and line 35 returns success.
Quick Quiz 5.33: Why have both add_count() and sub_count() in Fig-
ure 5.12? Why not simply pass a negative number to add_count()?
Lines 38-50 show read_count(), which returns the aggregate value of the
counter. It acquires gblcnt_mutex on line 43 and releases it on line 48, excluding
global operations from add_count() and sub_count(), and, as we will see, also
excluding thread creation and exit. Line 44 initializes local variable sum to the value of
globalcount, and then the loop spanning lines 45-47 sums the per-thread counter
variables. Line 49 then returns the sum.
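A sketch consistent with this description follows. Figure 5.12 itself is not reproduced here, and the counterp[] array of pointers to the threads' counter variables is introduced later, along with count_register_thread():

unsigned long read_count(void)
{
	int t;
	unsigned long sum;

	spin_lock(&gblcnt_mutex);           /* exclude other global operations */
	sum = globalcount;
	for_each_thread(t)
		if (counterp[t] != NULL)    /* registered threads only */
			sum += *counterp[t];
	spin_unlock(&gblcnt_mutex);
	return sum;
}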
Figure 5.13 shows a number of utility functions used by the add_count(), sub_
count(), and read_count() primitives shown in Figure 5.12.
Lines 1-7 show globalize_count(), which zeros the current thread’s per-
thread counters, adjusting the global variables appropriately. It is important to note that
this function does not change the aggregate value of the counter, but instead changes how
the counter’s current value is represented. Line 3 adds the thread’s counter variable to
globalcount, and line 4 zeroes counter. Similarly, line 5 subtracts the per-thread
countermax from globalreserve, and line 6 zeroes countermax. It is helpful
to refer to Figure 5.11 when reading both this function and balance_count(),
which is next.
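In code, globalize_count() can be as simple as the following sketch, which transfers this thread's contribution to the global variables exactly as just described (Figure 5.13 itself is not reproduced here):

static void globalize_count(void)
{
	globalcount += counter;         /* fold the local count into the global count */
	counter = 0;
	globalreserve -= countermax;    /* return this thread's reservation */
	countermax = 0;
}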
Lines 9-19 show balance_count(), which is roughly speaking the inverse of
globalize_count(). This function’s job is to set the current thread’s countermax
variable to the largest value that avoids the risk of the counter exceeding the globalcountmax
limit. Changing the current thread’s countermax variable of course requires corre-
sponding adjustments to counter, globalcount and globalreserve, as can
be seen by referring back to Figure 5.11. By doing this, balance_count() max-
imizes use of add_count()’s and sub_count()’s low-overhead fastpaths. As
with globalize_count(), balance_count() is not permitted to change the
aggregate value of the counter.
Lines 11-13 compute this thread’s share of that portion of globalcountmax that
is not already covered by either globalcount or globalreserve, and assign the
computed quantity to this thread’s countermax. Line 14 makes the corresponding ad-
justment to globalreserve. Line 15 sets this thread’s counter to the middle of the
range from zero to countermax. Line 16 checks to see whether globalcount can
in fact accommodate this value of counter, and, if not, line 17 decreases counter
accordingly. Finally, in either case, line 18 makes the corresponding adjustment to
globalcount.
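A sketch of balance_count() consistent with this description follows; the num_online_threads() helper, assumed to return the current number of threads, is not confirmed by the text above, and Figure 5.13 itself is not reproduced here:

static void balance_count(void)
{
	/* This thread's share of the remaining headroom. */
	countermax = globalcountmax - globalcount - globalreserve;
	countermax /= num_online_threads();
	globalreserve += countermax;
	/* Start in the middle of the local range... */
	counter = countermax / 2;
	/* ...unless globalcount cannot spare that much. */
	if (counter > globalcount)
		counter = globalcount;
	globalcount -= counter;
}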
Quick Quiz 5.34: Why set counter to countermax / 2 in line 15 of Fig-
ure 5.13? Wouldn’t it be simpler to just take countermax counts?
It is helpful to look at a schematic depicting how the relationship of the coun-
ters changes with the execution of first globalize_count() and then balance_
count(), as shown in Figure 5.14. Time advances from left to right, with the leftmost
configuration roughly that of Figure 5.11. The center configuration shows the rela-
tionship of these same counters after globalize_count() is executed by thread 0.
As can be seen from the figure, thread 0’s counter (“c 0” in the figure) is added
to globalcount, while the value of globalreserve is reduced by this same
amount. Both thread 0’s counter and its countermax (“cm 0” in the figure) are
reduced to zero. The other three threads’ counters are unchanged. Note that this
change did not affect the overall value of the counter, as indicated by the bottommost
dotted line connecting the leftmost and center configurations. In other words, the
sum of globalcount and the four threads’ counter variables is the same in both
configurations. Similarly, this change did not affect the sum of globalcount and
globalreserve, as indicated by the upper dotted line.
The rightmost configuration shows the relationship of these counters after balance_
count() is executed, again by thread 0. One-quarter of the remaining count, denoted
by the vertical line extending up from all three configurations, is added to thread 0’s
countermax and half of that to thread 0’s counter. The amount added to thread 0’s
counter is also subtracted from globalcount in order to avoid changing the
overall value of the counter (which is again the sum of globalcount and the four
threads' counter variables), again as indicated by the lowermost of the two dotted
lines connecting the center and rightmost configurations.
[Figure 5.14: schematic showing each thread's counter ("c n") and countermax ("cm n") together with globalcount and globalreserve, in the initial configuration, after globalize_count(), and after balance_count(), each executed by thread 0.]
The globalreserve variable is also adjusted so that it remains equal to the sum of the four threads'
countermax variables. Because thread 0’s counter is less than its countermax,
thread 0 can once again increment the counter locally.
Quick Quiz 5.35: In Figure 5.14, even though a quarter of the remaining count up
to the limit is assigned to thread 0, only an eighth of the remaining count is consumed,
as indicated by the uppermost dotted line connecting the center and the rightmost
configurations. Why is that?
Lines 21-28 show count_register_thread(), which sets up state for newly
created threads. This function simply installs a pointer to the newly created thread’s
counter variable into the corresponding entry of the counterp[] array under the
protection of gblcnt_mutex.
Finally, lines 30-38 show count_unregister_thread(), which tears down
state for a soon-to-be-exiting thread. Line 34 acquires gblcnt_mutex and line 37
releases it. Line 35 invokes globalize_count() to clear out this thread’s counter
state, and line 36 clears this thread’s entry in the counterp[] array.
be an approximate limit, there is usually a limit to exactly how much approximation can
be tolerated. One way to limit the degree of approximation is to impose an upper limit
on the value of the per-thread countermax instances. This task is undertaken in the
next section.
way to do this is to use atomic instructions. Of course, atomic instructions will slow
down the fastpath, but on the other hand, it would be silly not to at least give them a try.
Lines 1-22 of Figure 5.21 show the code for balance_count(), which refills
the calling thread’s local ctrandmax variable. This function is quite similar to that
of the preceding algorithms, with changes required to handle the merged ctrandmax
variable. Detailed analysis of the code is left as an exercise for the reader, as it is with
the count_register_thread() function starting on line 24 and the count_
the IDLE state, and when add_count() or sub_count() find that the combination
of the local thread’s count and the global count cannot accommodate the request, the
corresponding slowpath sets each thread’s theft state to REQ (unless that thread has
no count, in which case it transitions directly to READY). Only the slowpath, which
holds the gblcnt_mutex lock, is permitted to transition from the IDLE state, as
indicated by the green color.3 The slowpath then sends a signal to each thread, and the
corresponding signal handler checks the corresponding thread’s theft and counting
variables. If the theft state is not REQ, then the signal handler is not permitted to
change the state, and therefore simply returns. Otherwise, if the counting variable is
set, indicating that the current thread’s fastpath is in progress, the signal handler sets the
theft state to ACK, otherwise to READY.
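The signal handler just described might look roughly like the following sketch, which assumes per-thread theft and counting variables and THEFT_IDLE, THEFT_REQ, THEFT_ACK, and THEFT_READY state values; the book's count_lim_sig.c is not reproduced here, so details differ:

static void flush_local_count_sig(int unused)
{
	if (READ_ONCE(theft) != THEFT_REQ)
		return;                           /* not our request: leave the state alone */
	if (counting)                             /* fastpath in progress... */
		WRITE_ONCE(theft, THEFT_ACK);     /* ...so ask it to finish the handoff */
	else
		WRITE_ONCE(theft, THEFT_READY);   /* count may be stolen immediately */
}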
If the theft state is ACK, only the fastpath is permitted to change the theft
state, as indicated by the blue color. When the fastpath completes, it sets the theft
state to READY.
Once the slowpath sees a thread's theft state is READY, the slowpath is permitted to steal that thread's count. The slowpath then sets that thread's theft state to IDLE.
3 For those with black-and-white versions of this book, IDLE and READY are green, REQ is red, and ACK is blue.
[Figure 5.22: the theft state machine, with states IDLE, REQ, ACK, and READY.]
Quick Quiz 5.46: In Figure 5.22, why is the REQ theft state colored red?
Quick Quiz 5.47: In Figure 5.22, what is the point of having separate REQ and
ACK theft states? Why not simplify the state machine by collapsing them into a
single REQACK state? Then whichever of the signal handler or the fastpath gets there
first could set the state to READY.
the reader. Similarly, the structure of sub_count() in Figure 5.26 is the same as that
of add_count(), so the analysis of sub_count() is also left as an exercise for the
reader, as is the analysis of read_count() in Figure 5.27.
1 void count_init(void)
2 {
3 struct sigaction sa;
4
5 sa.sa_handler = flush_local_count_sig;
6 sigemptyset(&sa.sa_mask);
7 sa.sa_flags = 0;
8 if (sigaction(SIGUSR1, &sa, NULL) != 0) {
9 perror("sigaction");
10 exit(-1);
11 }
12 }
13
14 void count_register_thread(void)
15 {
16 int idx = smp_thread_id();
17
18 spin_lock(&gblcnt_mutex);
19 counterp[idx] = &counter;
20 countermaxp[idx] = &countermax;
21 theftp[idx] = &theft;
22 spin_unlock(&gblcnt_mutex);
23 }
24
25 void count_unregister_thread(int nthreadsexpected)
26 {
27 int idx = smp_thread_id();
28
29 spin_lock(&gblcnt_mutex);
30 globalize_count();
31 counterp[idx] = NULL;
32 countermaxp[idx] = NULL;
33 theftp[idx] = NULL;
34 spin_unlock(&gblcnt_mutex);
35 }
The signal-theft implementation runs more than twice as fast as the atomic implementa-
tion on my Intel Core Duo laptop. Is it always preferable?
The signal-theft implementation would be vastly preferable on Pentium-4 systems,
given their slow atomic instructions, but the old 80386-based Sequent Symmetry sys-
tems would do much better with the shorter path length of the atomic implementation.
However, this increased update-side performance comes at the price of higher read-side
overhead: Those POSIX signals are not free. If ultimate performance is of the essence,
you will need to measure them both on the system that your application is to be deployed
on.
Quick Quiz 5.53: Not only are POSIX signals slow, sending one to each thread
simply does not scale. What would you do if you had (say) 10,000 threads and needed
the read side to be fast?
This is but one reason why high-quality APIs are so important: they permit imple-
mentations to be changed as required by ever-changing hardware performance charac-
teristics.
Quick Quiz 5.54: What if you want an exact limit counter to be exact only for its
lower limit, but to allow the upper limit to be inexact?
Although the exact limit counter implementations in Section 5.4 can be very useful, they
are not much help if the counter’s value remains near zero at all times, as it might when
counting the number of outstanding accesses to an I/O device. The high overhead of
such near-zero counting is especially painful given that we normally don’t care how
many references there are. As noted in the removable I/O device access-count problem
posed by Quick Quiz 5.5, the number of accesses is irrelevant except in those rare cases
when someone is actually trying to remove the device.
One simple solution to this problem is to add a large “bias” (for example, one
billion) to the counter in order to ensure that the value is far enough from zero that
the counter can operate efficiently. When someone wants to remove the device, this
bias is subtracted from the counter value. Counting the last few accesses will be quite
inefficient, but the important point is that the many prior accesses will have been counted
at full speed.
Quick Quiz 5.55: What else had you better have done when using a biased counter?
Although a biased counter can be quite helpful and useful, it is only a partial
solution to the removable I/O device access-count problem called out on page 55. When
attempting to remove a device, we must not only know the precise number of current
I/O accesses, we also need to prevent any future accesses from starting. One way to
accomplish this is to read-acquire a reader-writer lock when updating the counter, and to
write-acquire that same reader-writer lock when checking the counter. Code for doing
I/O might be as follows:
1 read_lock(&mylock);
2 if (removing) {
3 read_unlock(&mylock);
4 cancel_io();
5 } else {
6 add_count(1);
7 read_unlock(&mylock);
8 do_io();
9 sub_count(1);
10 }
Line 1 read-acquires the lock, and either line 3 or 7 releases it. Line 2 checks to
see if the device is being removed, and, if so, line 3 releases the lock and line 4 cancels
the I/O, or takes whatever action is appropriate given that the device is to be removed.
Otherwise, line 6 increments the access count, line 7 releases the lock, line 8 performs
the I/O, and line 9 decrements the access count.
Quick Quiz 5.56: This is ridiculous! We are read-acquiring a reader-writer lock to
update the counter? What are you playing at???
The code to remove the device might be as follows:
1 write_lock(&mylock);
2 removing = 1;
3 sub_count(mybias);
4 write_unlock(&mylock);
5 while (read_count() != 0) {
6 poll(NULL, 0, 1);
7 }
8 remove_device();
Line 1 write-acquires the lock and line 4 releases it. Line 2 notes that the device is
being removed, and the loop spanning lines 5-7 wait for any I/O operations to complete.
Finally, line 8 does any additional processing needed to prepare for device removal.
Quick Quiz 5.57: What other issues would need to be accounted for in a real
system?
Table 5.1: Statistical counter performance.
                                                    Reads
  Algorithm              Section   Updates     1 Core    32 Cores
  count_stat.c           5.2.2     11.5 ns     408 ns       409 ns
  count_stat_eventual.c  5.2.3     11.6 ns       1 ns         1 ns
  count_end.c            5.2.4      6.3 ns     389 ns    51,200 ns
  count_end_rcu.c        13.3.1     5.7 ns     354 ns       501 ns

Table 5.2: Limit counter performance.
                                                             Reads
  Algorithm           Section   Exact?   Updates     1 Core    64 Cores
  count_lim.c         5.3.2     N         3.6 ns     375 ns    50,700 ns
  count_lim_app.c     5.3.4     N        11.7 ns     369 ns    51,000 ns
  count_lim_atomic.c  5.4.1     Y        51.4 ns     427 ns    49,400 ns
  count_lim_sig.c     5.4.4     Y        10.2 ns     370 ns    54,000 ns
consider the C-language ++ operator. The fact is that it does not work in general, only
for a restricted range of numbers. If you need to deal with 1,000-digit decimal numbers,
the C-language ++ operator will not work for you.
Quick Quiz 5.62: The ++ operator works just fine for 1,000-digit numbers! Haven’t
you heard of operator overloading???
This problem is not specific to arithmetic. Suppose you need to store and query
data. Should you use an ASCII file? XML? A relational database? A linked list? A
dense array? A B-tree? A radix tree? Or one of the plethora of other data structures and
environments that permit data to be stored and queried? It depends on what you need
to do, how fast you need it done, and how large your data set is—even on sequential
systems.
Similarly, if you need to count, your solution will depend on how large the numbers
you need to work with are, how many CPUs need to be manipulating a given number
concurrently, how the number is to be used, and what level of performance and scalability
you will need.
Nor is this problem specific to software. The design for a bridge meant to allow
people to walk across a small brook might be as simple as a single wooden plank. But
you would probably not use a plank to span the kilometers-wide mouth of the Columbia
River, nor would such a design be advisable for bridges carrying concrete trucks. In
short, just as bridge design must change with increasing span and load, so must software
design change as the number of CPUs increases. That said, it would be good to automate
this process, so that the software adapts to changes in hardware configuration and in
workload. There has in fact been some research into this sort of automation [AHS+ 03,
SAH+ 03], and the Linux kernel does some boot-time reconfiguration, including limited
binary rewriting. This sort of adaptation will become increasingly important as the
number of CPUs on mainstream systems continues to increase.
In short, as discussed in Chapter 3, the laws of physics constrain parallel software
just as surely as they constrain mechanical artifacts such as bridges. These constraints
force specialization, though in the case of software it might be possible to automate the
choice of specialization to fit the hardware and workload in question.
Of course, even generalized counting is quite specialized. We need to do a great
number of other things with computers. The next section relates what we have learned
from counters to topics taken up later in this book.
The partially partitioned counting algorithms used locking to guard the global data,
and locking is the subject of Chapter 7. In contrast, the partitioned data tended to be fully
under the control of the corresponding thread, so that no synchronization whatsoever
was required. This data ownership will be introduced in Section 6.3.4 and discussed in
more detail in Chapter 8.
Because integer addition and subtraction are extremely cheap operations compared
to typical synchronization operations, achieving reasonable scalability requires synchro-
nization operations be used sparingly. One way of achieving this is to batch the addition
and subtraction operations, so that a great many of these cheap operations are handled
by a single synchronization operation. Batching optimizations of one sort or another are
used by each of the counting algorithms listed in Tables 5.1 and 5.2.
Finally, the eventually consistent statistical counter discussed in Section 5.2.3
showed how deferring activity (in that case, updating the global counter) can pro-
vide substantial performance and scalability benefits. This approach allows common
case code to use much cheaper synchronization operations than would otherwise be
possible. Chapter 9 will examine a number of additional ways that deferral can improve
performance, scalability, and even real-time response.
Summarizing the summary:
2. Partial partitioning, that is, partitioning applied only to common code paths, works
almost as well.
3. Partial partitioning can be applied to code (as in Section 5.2’s statistical counters’
partitioned updates and non-partitioned reads), but also across time (as in Sec-
tion 5.3’s and Section 5.4’s limit counters running fast when far from the limit,
but slowly when close to the limit).
4. Partitioning across time often batches updates locally in order to reduce the num-
ber of expensive global operations, thereby decreasing synchronization overhead,
in turn improving performance and scalability. All the algorithms shown in
Tables 5.1 and 5.2 make heavy use of batching.
8. Different levels of performance and scalability will affect algorithm and data-
structure design, as do a large number of other factors. Figure 5.3 illustrates this
point: Atomic increment might be completely acceptable for a two-CPU system,
but be completely inadequate for an eight-CPU system.
Summarizing still further, we have the “big three” methods of increasing perfor-
mance and scalability, namely (1) partitioning over CPUs or threads, (2) batching
[Figure 5.29: the partitioning, batching, and weakening optimizations mapped onto the Work Partitioning, Resource Partitioning and Replication, Parallel Access Control, and Interacting With Hardware categories.]
so that more work can be done by each expensive synchronization operation, and
(3) weakening synchronization operations where feasible. As a rough rule of thumb, you
should apply these methods in this order, as was noted earlier in the discussion of Fig-
ure 2.6 on page 19. The partitioning optimization applies to the “Resource Partitioning
and Replication” bubble, the batching optimization to the “Work Partitioning” bubble,
and the weakening optimization to the “Parallel Access Control” bubble, as shown in
Figure 5.29. Of course, if you are using special-purpose hardware such as digital signal
processors (DSPs), field-programmable gate arrays (FPGAs), or general-purpose graph-
ical processing units (GPGPUs), you may need to pay close attention to the “Interacting
With Hardware” bubble throughout the design process. For example, the structure of a
GPGPU’s hardware threads and memory connectivity might richly reward very careful
partitioning and batching design decisions.
In short, as noted at the beginning of this chapter, the simplicity of counting has
allowed us to explore many fundamental concurrency issues without the distraction of
complex synchronization primitives or elaborate data structures. Such synchronization
primitives and data structures are covered in later chapters.
Divide and rule.
Philip II of Macedon
Chapter 6
Partitioning and
Synchronization Design
This chapter describes how to design software to take advantage of the multiple CPUs
that are increasingly appearing in commodity systems. It does this by presenting a
number of idioms, or “design patterns” [Ale79, GHJV95, SSRB00] that can help you
balance performance, scalability, and response time. As noted in earlier chapters, the
most important decision you will make when creating parallel software is how to carry
out the partitioning. Correctly partitioned problems lead to simple, scalable, and high-
performance solutions, while poorly partitioned problems result in slow and complex
solutions. This chapter will help you design partitioning into your code, with some
discussion of batching and weakening as well. The word “design” is very important:
You should partition first, batch second, weaken third, and code fourth. Changing this
order often leads to poor performance and scalability along with great frustration.
To this end, Section 6.1 presents partitioning exercises, Section 6.2 reviews partition-
ability design criteria, Section 6.3 discusses selecting an appropriate synchronization
granularity, Section 6.4 gives an overview of important parallel-fastpath designs that
provide speed and scalability in the common case with a simpler but less-scalable
fallback “slow path” for unusual situations, and finally Section 6.5 takes a brief look
beyond partitioning.
[Figure (not reproduced): the Dining Philosophers Problem, philosophers P1-P5 seated around a table.]
1 Readers who have difficulty imagining a food that requires two forks are invited to instead think in
terms of chopsticks.
2 It is all too easy to denigrate Dijkstra from the viewpoint of the year 2012, more than 40 years after the
fact. If you still feel the need to denigrate Dijkstra, my advice is to publish something, wait 40 years, and then
see how your words stood the test of time.
[Figure (not reproduced): the Dining Philosophers Problem with the five forks numbered 1-5 between philosophers P1-P5.]
the highest-numbered fork. The philosopher sitting in the uppermost position in the
diagram thus picks up the leftmost fork first, then the rightmost fork, while the rest of the
philosophers instead pick up their rightmost fork first. Because two of the philosophers
will attempt to pick up fork 1 first, and because only one of those two philosophers will
succeed, there will be five forks available to four philosophers. At least one of these
four will be guaranteed to have two forks, and thus be able to proceed eating.
This general technique of numbering resources and acquiring them in numerical
order is heavily used as a deadlock-prevention technique. However, it is easy to imagine
a sequence of events that will result in only one philosopher eating at a time even though
all are hungry:
2. P3 picks up fork 2.
3. P4 picks up fork 3.
4. P5 picks up fork 4.
In short, this algorithm can result in only one philosopher eating at a given time,
even when all five philosophers are hungry, despite the fact that there are more than
enough forks for two philosophers to eat concurrently.
Please think about ways of partitioning the Dining Philosophers Problem before
reading further.
[Figure 6.4 (not reproduced): a partitioned Dining Philosophers Problem with four philosophers P1-P4 and two bundled pairs of forks.]
One approach is shown in Figure 6.4, which includes four philosophers rather than
five to better illustrate the partition technique. Here the upper and rightmost philosophers
share a pair of forks, while the lower and leftmost philosophers share another pair of
forks. If all philosophers are simultaneously hungry, at least two will always be able to
eat concurrently. In addition, as shown in the figure, the forks can now be bundled so
that the pair are picked up and put down simultaneously, simplifying the acquisition and
release algorithms.
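A minimal sketch of this partitioned approach appears below. It is illustrative only (not the book's code): each bundled pair of forks is represented by a single pthread mutex, philosophers are numbered 0-3, and philosophers 0 and 1 share one pair while philosophers 2 and 3 share the other.

#include <pthread.h>

#define NR_FORK_PAIRS 2

static pthread_mutex_t fork_pair[NR_FORK_PAIRS] = {
  PTHREAD_MUTEX_INITIALIZER, PTHREAD_MUTEX_INITIALIZER
};

/* Philosophers 0 and 1 share pair 0; philosophers 2 and 3 share pair 1. */
static void philosopher_eat(int id)
{
  pthread_mutex_t *pair = &fork_pair[id / 2];

  pthread_mutex_lock(pair);   /* pick up both forks at once */
  /* eat */
  pthread_mutex_unlock(pair); /* put down both forks at once */
}

Because the two pairs are independent, at least two philosophers can always eat concurrently, and each meal requires only a single lock acquisition and release.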
Quick Quiz 6.1: Is there a better solution to the Dining Philosophers Problem?
This is an example of “horizontal parallelism” [Inm85] or “data parallelism”, so
named because there is no dependency among the pairs of philosophers. In a horizontally
parallel data-processing system, a given item of data would be processed by only one of
a replicated set of software components.
Quick Quiz 6.2: And in just what sense can this “horizontal parallelism” be said to
be “horizontal”?
[Figures (not reproduced): a double-ended queue with left- and right-hand locks ("Lock L", "Lock R") and headers, shown with zero through four elements; and a compound queue split into "DEQ L" and "DEQ R", each with its own lock.]
four elements on the list. This overlap is due to the fact that removing any given element
affects not only that element, but also its left- and right-hand neighbors. These domains
are indicated by color in the figure, with blue with downward stripes indicating the
domain of the left-hand lock, red with upward stripes indicating the domain of the
right-hand lock, and purple (with no stripes) indicating overlapping domains. Although
it is possible to create an algorithm that works this way, the fact that it has no fewer than
five special cases should raise a big red flag, especially given that concurrent activity at
the other end of the list can shift the queue from one special case to another at any time.
It is far better to consider other designs.
[Figure 6.7 (not reproduced): the hashed double-ended queue in its initial state, with "Index L", "Index R", and the left- and right-hand locks.]
1. If holding the right-hand lock, release it and acquire the left-hand lock.
[Figure 6.8 (not reproduced): the hashed double-ended queue after a right-enqueue of R1, after three further right-enqueues ("Enq 3R"), and after three left-enqueues plus one right-enqueue ("Enq 3L1R"), with "Index L" and "Index R" shown in each state.]
incremented to reference hash chain 2. The middle portion of this same figure shows
the state after three more elements have been right-enqueued. As you can see, the
indexes are back to their initial states (see Figure 6.7); however, each hash chain is
now non-empty. The lower portion of this figure shows the state after three additional
elements have been left-enqueued and an additional element has been right-enqueued.
From the last state shown in Figure 6.8, a left-dequeue operation would return
element “L−2 ” and leave the left-hand index referencing hash chain 2, which would
then contain only a single element (“R2 ”). In this state, a left-enqueue running concur-
rently with a right-enqueue would result in lock contention, but the probability of such
contention can be reduced to arbitrarily low levels by using a larger hash table.
Figure 6.9 shows how 16 elements would be organized in a four-hash-bucket parallel
double-ended queue. Each underlying single-lock double-ended queue holds a one-
quarter slice of the full parallel double-ended queue.
[Figure 6.9 (not reproduced): sixteen elements distributed across a four-hash-bucket parallel double-ended queue.]
Figure 6.10 shows the corresponding C-language data structure, assuming an existing
struct deq that provides a trivially locked double-ended-queue implementation.
This data structure contains the left-hand lock on line 2, the left-hand index on line 3,
the right-hand lock on line 4 (which is cache-aligned in the actual implementation),
the right-hand index on line 5, and, finally, the hashed array of simple lock-based
double-ended queues on line 6. A high-performance implementation would of course
use padding or special alignment directives to avoid false sharing.
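Figure 6.10 itself is not reproduced here, but a field layout consistent with the description above might look as follows. This is a sketch only: the type spinlock_t and struct deq are assumed as in the text, the field and constant names are illustrative rather than the book's exact listing, and the comments are keyed to the line numbers mentioned above.

#define PDEQ_N_BKTS 4                   /* assumed number of hash buckets */

struct pdeq {
  spinlock_t llock;                     /* left-hand lock (line 2) */
  int lidx;                             /* left-hand index (line 3) */
  spinlock_t rlock;                     /* right-hand lock (line 4); cache-aligned in the actual implementation */
  int ridx;                             /* right-hand index (line 5) */
  struct deq bkt[PDEQ_N_BKTS];          /* hashed array of simple lock-based deques (line 6) */
};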
Figure 6.11 (lockhdeq.c) shows the implementation of the enqueue and de-
queue functions.3 Discussion will focus on the left-hand operations, as the right-hand
operations are trivially derived from them.
Lines 1-13 show pdeq_pop_l(), which left-dequeues and returns an element if
possible, returning NULL otherwise. Line 6 acquires the left-hand spinlock, and line 7
computes the index to be dequeued from. Line 8 dequeues the element, and, if line 9
finds the result to be non-NULL, line 10 records the new left-hand index. Either way,
line 11 releases the lock, and, finally, line 12 returns the element if there was one, or
NULL otherwise.
Lines 29-38 show pdeq_push_l(), which left-enqueues the specified element.
Line 33 acquires the left-hand lock, and line 34 picks up the left-hand index. Line 35 left-
enqueues the specified element onto the double-ended queue indexed by the left-hand
index. Line 36 then updates the left-hand index and line 37 releases the lock.
As noted earlier, the right-hand operations are completely analogous to their left-
handed counterparts, so their analysis is left as an exercise for the reader.
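Figure 6.11 is likewise not reproduced, so the following sketch of the two left-hand functions follows the line-by-line description above. The element type, the moveleft()/moveright() index helpers, and the underlying deq_push_l()/deq_pop_l() operations are assumptions consistent with the text, not necessarily the book's lockhdeq.c.

struct list_head *pdeq_pop_l(struct pdeq *d)
{
  struct list_head *e;
  int i;

  spin_lock(&d->llock);        /* acquire the left-hand lock */
  i = moveright(d->lidx);      /* compute the index to dequeue from */
  e = deq_pop_l(&d->bkt[i]);   /* attempt the dequeue */
  if (e != NULL)
    d->lidx = i;               /* record the new left-hand index */
  spin_unlock(&d->llock);      /* release the lock */
  return e;                    /* element, or NULL if empty */
}

void pdeq_push_l(struct list_head *e, struct pdeq *d)
{
  int i;

  spin_lock(&d->llock);        /* acquire the left-hand lock */
  i = d->lidx;                 /* pick up the left-hand index */
  deq_push_l(e, &d->bkt[i]);   /* enqueue onto the indexed simple deque */
  d->lidx = moveleft(d->lidx); /* update the left-hand index */
  spin_unlock(&d->llock);      /* release the lock */
}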
Quick Quiz 6.4: Is the hashed double-ended queue a good solution? Why or why
not?
3 One could easily create a polymorphic implementation in any number of languages, but doing so is left as an exercise for the reader.
The compound implementation is somewhat more complex than the hashed variant
presented in Section 6.1.2.3, but is still reasonably simple. Of course, a more intelligent
rebalancing scheme could be arbitrarily complex, but the simple scheme shown here
has been shown to perform well compared to software alternatives [DCW+ 11] and even
compared to algorithms using hardware assist [DLM+ 10]. Nevertheless, the best we
can hope for from such a scheme is 2x scalability, as at most two threads can be holding
the dequeue’s locks concurrently. This limitation also applies to algorithms based on
non-blocking synchronization, such as the compare-and-swap-based dequeue algorithm
of Michael [Mic03].4
Quick Quiz 6.9: Why are there not one but two solutions to the double-ended queue
problem?
In fact, as noted by Dice et al. [DLM+ 10], an unsynchronized single-threaded
double-ended queue significantly outperforms any of the parallel implementations they
studied. Therefore, the key point is that there can be significant overhead enqueuing to
or dequeuing from a shared queue, regardless of implementation. This should come as
no surprise given the material in Chapter 3, given the strict FIFO nature of these queues.
Furthermore, these strict FIFO queues are strictly FIFO only with respect to linearization points [HW90]5 that are not visible to the caller; in fact, in these examples,
the linearization points are buried in the lock-based critical sections. These queues
are not strictly FIFO with respect to (say) the times at which the individual operations
started [HKLP12]. This indicates that the strict FIFO property is not all that valuable in
concurrent programs, and in fact, Kirsch et al. present less-strict queues that provide
improved performance and scalability [KLP12].6 All that said, if you are pushing all
the data used by your concurrent program through a single queue, you really need to
rethink your overall design.
not needed for lock-free implementations of double-ended queues. Instead, the common compare-and-swap
(e.g., x86 cmpxchg) suffices.
5 In short, a linearization point is a single point within a given function where that function can be said
to have taken effect. In this lock-based implementation, the linearization points can be said to be anywhere
within the critical section that does the work.
6 Nir Shavit produced relaxed stacks for roughly the same reasons [Sha11]. This situation leads some to
believe that the linearization points are useful to theorists rather than developers, and leads others to wonder
to what extent the designers of such data structures and algorithms were considering the needs of their users.
other than microscopically tiny, the space of possible parallel programs is so huge that
convergence is not guaranteed in the lifetime of the universe. Besides, what exactly is
the “best possible parallel program”? After all, Section 2.2 called out no fewer than
three parallel-programming goals of performance, productivity, and generality, and
the best possible performance will likely come at a cost in terms of productivity and
generality. We clearly need to be able to make higher-level choices at design time in
order to arrive at an acceptably good parallel program before that program becomes
obsolete.
However, more detailed design criteria are required to actually produce a real-world
design, a task taken up in this section. This being the real world, these criteria often
conflict to a greater or lesser degree, requiring that the designer carefully balance the
resulting tradeoffs.
As such, these criteria may be thought of as the “forces” acting on the design, with
particularly good tradeoffs between these forces being called “design patterns” [Ale79,
GHJV95].
The design criteria for attaining the three parallel-programming goals are speedup,
contention, overhead, read-to-write ratio, and complexity:
These criteria will act together to enforce a maximum speedup. The first three criteria are
deeply interrelated, so the remainder of this section analyzes these interrelationships.8
Note that these criteria may also appear as part of the requirements specification.
For example, speedup may act as a relative desideratum (“the faster, the better”) or as
an absolute requirement of the workload (“the system must support at least 1,000,000
web hits per second”). Classic design pattern languages describe relative desiderata as
forces and absolute requirements as context.
An understanding of the relationships between these design criteria can be very
helpful when identifying appropriate design tradeoffs for a parallel program.
1. The less time a program spends in critical sections, the greater the potential
speedup. This is a consequence of Amdahl’s Law [Amd67] and of the fact that
only one CPU may execute within a given critical section at a given time.
More specifically, the fraction of time that the program spends in a given exclusive
critical section must be much less than the reciprocal of the number of CPUs for
the actual speedup to approach the number of CPUs. For example, a program
running on 10 CPUs must spend much less than one tenth of its time in the
most-restrictive critical section if it is to scale at all well (see the worked example following this list).
2. Contention effects will consume the excess CPU and/or wallclock time should
the actual speedup be less than the number of available CPUs. The larger the
gap between the number of CPUs and the actual speedup, the less efficiently the
CPUs will be used. Similarly, the greater the desired efficiency, the smaller the
achievable speedup.
4. If the critical sections have high overhead compared to the primitives guarding
them, the best way to improve speedup is to increase parallelism by moving to
reader/writer locking, data locking, asymmetric primitives, or data ownership.
8 A real-world parallel system will be subject to many additional design criteria, such as data-structure
layout, memory size, memory-hierarchy latencies, bandwidth limitations, and I/O issues.
[Figure (not reproduced): design progression from "Sequential Program" to "Code Locking" to "Data Locking" to "Data Ownership", via "Partition"/"Batch" and "Own"/"Disown" transitions.]
5. If the critical sections have high overhead compared to the primitives guarding
them and the data structure being guarded is read much more often than modi-
fied, the best way to increase parallelism is to move to reader/writer locking or
asymmetric primitives.
6. Many changes that improve SMP performance, for example, reducing lock con-
tention, also improve real-time latencies [McK05c].
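The worked example promised in criterion 1 above uses Amdahl's Law [Amd67]: if an exclusive critical section consumes a fraction f of the execution, the speedup on n CPUs is bounded by (in LaTeX notation)

S(n) = \frac{1}{f + (1 - f)/n}

For n = 10, f = 0.1 gives S(10) = 1/(0.1 + 0.09) ≈ 5.3, while f = 0.01 gives S(10) = 1/(0.01 + 0.099) ≈ 9.2, so the critical-section fraction must indeed be much smaller than 1/n for the speedup to approach n.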
Quick Quiz 6.12: Don’t all these problems with critical sections mean that we
should just always use non-blocking synchronization [Her90], which doesn't have critical sections?
[Figure (not reproduced): MIPS/clock-frequency trend by year, 1975-2015, log scale.]
result in single chips with thousands of CPUs will not be settled soon, but given that
Paul is typing this sentence on a dual-core laptop, the age of SMP does seem to be upon
us. It is also important to note that Ethernet bandwidth is continuing to grow, as shown
in Figure 6.15. This growth will motivate multithreaded servers in order to handle the
communications load.
Please note that this does not mean that you should code each and every program in
a multi-threaded manner. Again, if a program runs quickly enough on a single processor,
spare yourself the overhead and complexity of SMP synchronization primitives. The
simplicity of the hash-table lookup code in Figure 6.16 underscores this point.10 A key
point is that speedups due to parallelism are normally limited to the number of CPUs.
In contrast, speedups due to sequential optimizations, for example, careful choice of
data structure, can be arbitrarily large.
On the other hand, if you are not in this happy situation, read on!
instructions per clock, and MIPS for older CPUs requiring multiple clocks to execute even the simplest instruction. The reason for taking this approach is that the newer CPUs' ability to retire multiple instructions per clock is typically limited by memory-system performance.
10 The examples in this section are taken from Hart et al. [HMB06], adapted for clarity by gathering
instances, you are instead using “data locking”, described in Section 6.3.3.
[Figure 6.15 (not reproduced): relative Ethernet performance by year, 1970-2015, log scale.]
1 struct hash_table
2 {
3 long nbuckets;
4 struct node **buckets;
5 };
6
7 typedef struct node {
8 unsigned long key;
9 struct node *next;
10 } node_t;
11
12 int hash_search(struct hash_table *h, long key)
13 {
14 struct node *cur;
15
16 cur = h->buckets[key % h->nbuckets];
17 while (cur != NULL) {
18 if (cur->key >= key) {
19 return (cur->key == key);
20 }
21 cur = cur->next;
22 }
23 return 0;
24 }
In these cases, code locking will provide a relatively simple program that is very similar
to its sequential counterpart, as can be seen in Figure 6.17. However, note that the
simple return of the comparison in hash_search() in Figure 6.16 has now become
three statements due to the need to release the lock before returning.
1 spinlock_t hash_lock;
2
3 struct hash_table
4 {
5 long nbuckets;
6 struct node **buckets;
7 };
8
9 typedef struct node {
10 unsigned long key;
11 struct node *next;
12 } node_t;
13
14 int hash_search(struct hash_table *h, long key)
15 {
16 struct node *cur;
17 int retval;
18
19 spin_lock(&hash_lock);
20 cur = h->buckets[key % h->nbuckets];
21 while (cur != NULL) {
22 if (cur->key >= key) {
23 retval = (cur->key == key);
24 spin_unlock(&hash_lock);
25 return retval;
26 }
27 cur = cur->next;
28 }
29 spin_unlock(&hash_lock);
30 return 0;
31 }
always translates into increased performance and scalability. For this reason, data
locking was heavily used by Sequent in both its DYNIX and DYNIX/ptx operating
systems [BK85, Inm85, Gar90, Dov90, MD92, MG92, MS93].
However, as those who have taken care of small children can again attest, even
providing enough to go around is no guarantee of tranquillity. The analogous situation
can arise in SMP programs. For example, the Linux kernel maintains a cache of files
and directories (called “dcache”). Each entry in this cache has its own lock, but the
entries corresponding to the root directory and its direct descendants are much more
likely to be traversed than are more obscure entries. This can result in many CPUs
contending for the locks of these popular entries, resulting in a situation not unlike that
shown in Figure 6.21.
In many cases, algorithms can be designed to reduce the incidence of data skew, and
in some cases eliminate it entirely (as appears to be possible with the Linux kernel’s
dcache [MSS04]). Data locking is often used for partitionable data structures such as
hash tables, as well as in situations where multiple entities are each represented by an
instance of a given data structure. The task list in version 2.6.17 of the Linux kernel is
an example of the latter, each task structure having its own proc_lock.
A key challenge with data locking on dynamically allocated structures is ensuring
that the structure remains in existence while the lock is being acquired. The code in
Figure 6.19 finesses this challenge by placing the locks in the statically allocated hash
buckets, which are never freed. However, this trick would not work if the hash table
were resizeable, so that the locks were now dynamically allocated. In this case, there
would need to be some means to prevent the hash bucket from being freed during the
time that its lock was being acquired.
Quick Quiz 6.13: What are some ways of preventing a structure from being freed
while its lock is being acquired?
1 struct hash_table
2 {
3 long nbuckets;
4 struct bucket **buckets;
5 };
6
7 struct bucket {
8 spinlock_t bucket_lock;
9 node_t *list_head;
10 };
11
12 typedef struct node {
13 unsigned long key;
14 struct node *next;
15 } node_t;
16
17 int hash_search(struct hash_table *h, long key)
18 {
19 struct bucket *bp;
20 struct node *cur;
21 int retval;
22
23 bp = h->buckets[key % h->nbuckets];
24 spin_lock(&bp->bucket_lock);
25 cur = bp->list_head;
26 while (cur != NULL) {
27 if (cur->key >= key) {
28 retval = (cur->key == key);
29 spin_unlock(&bp->bucket_lock);
30 return retval;
31 }
32 cur = cur->next;
33 }
34 spin_unlock(&bp->bucket_lock);
35 return 0;
36 }
happens to be that owned by a single CPU, that CPU will be a “hot spot”, sometimes
with results resembling that shown in Figure 6.21. However, in situations where no
sharing is required, data ownership achieves ideal performance, and with code that can
be as simple as the sequential-program case shown in Figure 6.16. Such situations
are often referred to as “embarrassingly parallel”, and, in the best case, resemble the
situation previously shown in Figure 6.20.
Another important instance of data ownership occurs when the data is read-only, in
which case, all threads can “own” it via replication.
Data ownership will be presented in more detail in Chapter 8.
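As a minimal illustration of data ownership (not from the book), the following sketch gives each thread outright ownership of its own event counter via a per-thread variable, so that the common-case update requires no synchronization at all; aggregating the counters is a separate, far less frequent operation of the kind discussed in Chapter 5.

/* Each thread owns its own event counter. */
static __thread unsigned long my_events;

static inline void count_event(void)
{
  my_events++;  /* owned data: no locks, no atomic operations */
}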
tion was zero, and ignoring the fact that CPUs must wait on each other to complete
their synchronization operations, in other words, µ can be roughly thought of as the
synchronization overhead in absence of contention. For example, suppose that each
synchronization operation involves an atomic increment instruction, and that a computer
system is able to do an atomic increment every 25 nanoseconds on each CPU to a private
variable.12 The value of µ is therefore about 40,000,000 atomic increments per second.
Of course, the value of λ increases with increasing numbers of CPUs, as each CPU
is capable of processing transactions independently (again, ignoring synchronization):
λ = nλ0 (6.1)
where n is the number of CPUs and λ0 is the transaction-processing capability of a
single CPU. Note that the expected time for a single CPU to execute a single transaction
is 1/λ0 .
Because the CPUs have to “wait in line” behind each other to get their chance to
increment the single shared variable, we can use the M/M/1 queueing-model expression
for the expected total waiting time:
T = 1/(µ − λ) (6.2)
Substituting the above value of λ:
T = 1/(µ − nλ0) (6.3)
Now, the efficiency is just the ratio of the time required to process a transaction
in absence of synchronization (1/λ0 ) to the time required including synchronization
(T + 1/λ0 ):
12 Of course, if there are 8 CPUs all incrementing the same shared variable, then each CPU must wait
at least 175 nanoseconds for each of the other CPUs to do its increment before consuming an additional 25
nanoseconds doing its own increment. In actual fact, the wait will be longer due to the need to move the
variable from one CPU to another.
[Figure 6.22 (not reproduced): synchronization efficiency versus number of CPUs (threads), 10-100, for overhead ratios f = 100, 75, 50, 25, and 10.]
e = (1/λ0) / (T + 1/λ0) (6.4)
Substituting the above value for T and simplifying:
e = (µ/λ0 − n) / (µ/λ0 − (n − 1)) (6.5)
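The simplification leading from Equation 6.4 to Equation 6.5 multiplies numerator and denominator by λ0(µ − nλ0); in LaTeX notation:

e = \frac{1/\lambda_0}{\frac{1}{\mu - n\lambda_0} + \frac{1}{\lambda_0}}
  = \frac{\mu - n\lambda_0}{\lambda_0 + \mu - n\lambda_0}
  = \frac{\mu/\lambda_0 - n}{\mu/\lambda_0 - (n - 1)}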
But the value of µ/λ0 is just the ratio of the time required to process the transaction
(absent synchronization overhead) to that of the synchronization overhead itself (absent
contention). If we call this ratio f , we have:
e = (f − n) / (f − (n − 1)) (6.6)
Figure 6.22 plots the synchronization efficiency e as a function of the number of
CPUs/threads n for a few values of the overhead ratio f . For example, again using the
25-nanosecond atomic increment, the f = 10 line corresponds to each CPU attempting
an atomic increment every 250 nanoseconds, and the f = 100 line corresponds to each
CPU attempting an atomic increment every 2.5 microseconds, which in turn corresponds
to several thousand instructions. Given that each trace drops off sharply with increasing
numbers of CPUs or threads, we can conclude that synchronization mechanisms based
on atomic manipulation of a single global shared variable will not scale well if used
heavily on current commodity hardware. This is a mathematical depiction of the forces
leading to the parallel counting algorithms that were discussed in Chapter 5.
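To make Equation 6.6 concrete, take f = 100 (the 2.5-microsecond case above): at n = 64, e = (100 − 64)/(100 − 63) ≈ 0.97, but at n = 95 efficiency has fallen to (100 − 95)/(100 − 94) ≈ 0.83, and it reaches zero at n = f = 100.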
The concept of efficiency is useful even in cases having little or no formal synchro-
nization. Consider for example a matrix multiply, in which the columns of one matrix
are multiplied (via “dot product”) by the rows of another, resulting in an entry in a
third matrix. Because none of these operations conflict, it is possible to partition the
columns of the first matrix among a group of threads, with each thread computing the
corresponding columns of the result matrix. The threads can therefore operate entirely
independently, with no synchronization overhead whatsoever, as is done in matmul.c.
One might therefore expect a parallel matrix multiply to have a perfect efficiency of 1.0.
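Such a partitioning can be sketched as follows; this is illustrative only, not the book's matmul.c. Each thread computes a disjoint block of the result's columns, so no synchronization is needed beyond the final joins.

#include <pthread.h>

#define N 512
#define NTHREADS 4

static double a[N][N], b[N][N], c[N][N];  /* a and b assumed initialized elsewhere */

struct slice { int col_lo; int col_hi; };

static void *matmul_slice(void *arg)
{
  struct slice *s = arg;
  int i, j, k;

  for (j = s->col_lo; j < s->col_hi; j++)  /* this thread's columns of c */
    for (i = 0; i < N; i++) {
      double sum = 0.0;
      for (k = 0; k < N; k++)
        sum += a[i][k] * b[k][j];
      c[i][j] = sum;
    }
  return NULL;
}

int main(void)
{
  pthread_t tid[NTHREADS];
  struct slice s[NTHREADS];
  int t;

  for (t = 0; t < NTHREADS; t++) {
    s[t].col_lo = t * N / NTHREADS;
    s[t].col_hi = (t + 1) * N / NTHREADS;
    pthread_create(&tid[t], NULL, matmul_slice, &s[t]);
  }
  for (t = 0; t < NTHREADS; t++)
    pthread_join(tid[t], NULL);
  return 0;
}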
However, Figure 6.23 tells a different story, especially for a 64-by-64 matrix multiply,
which never gets above an efficiency of about 0.7, even when running single-threaded.
The 512-by-512 matrix multiply’s efficiency is measurably less than 1.0 on as few as 10
threads, and even the 1024-by-1024 matrix multiply deviates noticeably from perfection
at a few tens of threads. Nevertheless, this figure clearly demonstrates the performance
and scalability benefits of batching: If you must incur synchronization overhead, you
may as well get your money’s worth.
Quick Quiz 6.14: How can a single-threaded 64-by-64 matrix multiply possibly
have an efficiency of less than 1.0? Shouldn’t all of the traces in Figure 6.23 have
efficiency of exactly 1.0 when running on only one thread?
Given these inefficiencies, it is worthwhile to look into more-scalable approaches
such as the data locking described in Section 6.3.3 or the parallel-fastpath approach
discussed in the next section.
Quick Quiz 6.15: How are data-parallel techniques going to help with matrix
multiply? It is already data parallel!!!
[Figure (not reproduced): parallel-fastpath design patterns: Reader/Writer Locking, RCU, Hierarchical Locking, and Allocator Caches.]
4. Resource Allocator Caches ([McK96a, MS93]). See Section 6.4.3 for more detail.
1 rwlock_t hash_lock;
2
3 struct hash_table
4 {
5 long nbuckets;
6 struct node **buckets;
7 };
8
9 typedef struct node {
10 unsigned long key;
11 struct node *next;
12 } node_t;
13
14 int hash_search(struct hash_table *h, long key)
15 {
16 struct node *cur;
17 int retval;
18
19 read_lock(&hash_lock);
20 cur = h->buckets[key % h->nbuckets];
21 while (cur != NULL) {
22 if (cur->key >= key) {
23 retval = (cur->key == key);
24 read_unlock(&hash_lock);
25 return retval;
26 }
27 cur = cur->next;
28 }
29 read_unlock(&hash_lock);
30 return 0;
31 }
Quick Quiz 6.16: In what situation would hierarchical locking work well?
The basic problem facing a parallel memory allocator is the tension between the need to
provide extremely fast memory allocation and freeing in the common case and the need
to efficiently distribute memory in face of unfavorable allocation and freeing patterns.
To see this tension, consider a straightforward application of data ownership to this
problem—simply carve up memory so that each CPU owns its share. For example,
suppose that a system with two CPUs has two gigabytes of memory (such as the one that
I am typing on right now). We could simply assign each CPU one gigabyte of memory,
and allow each CPU to access its own private chunk of memory, without the need for
locking and its complexities and overheads. Unfortunately, this simple scheme breaks
down if an algorithm happens to have CPU 0 allocate all of the memory and CPU 1 free it, as would happen in a simple producer-consumer workload.
The other extreme, code locking, suffers from excessive lock contention and over-
head [MS93].
1 struct hash_table
2 {
3 long nbuckets;
4 struct bucket **buckets;
5 };
6
7 struct bucket {
8 spinlock_t bucket_lock;
9 node_t *list_head;
10 };
11
12 typedef struct node {
13 spinlock_t node_lock;
14 unsigned long key;
15 struct node *next;
16 } node_t;
17
18 int hash_search(struct hash_table *h, long key)
19 {
20 struct bucket *bp;
21 struct node *cur;
22 int retval;
23
24 bp = h->buckets[key % h->nbuckets];
25 spin_lock(&bp->bucket_lock);
26 cur = bp->list_head;
27 while (cur != NULL) {
28 if (cur->key >= key) {
29 spin_lock(&cur->node_lock);
30 spin_unlock(&bp->bucket_lock);
31 retval = (cur->key == key);
32 spin_unlock(&cur->node_lock);
33 return retval;
34 }
35 cur = cur->next;
36 }
37 spin_unlock(&bp->bucket_lock);
38 return 0;
39 }
[Figure (not reproduced): allocator cache schematic: a code-locked Global Pool, per-CPU pools (CPU 0 Pool, CPU 1 Pool) that overflow into it when full and refill from it when empty, and Allocate/Free operating on the per-CPU pools.]
13 Both pool sizes (TARGET_POOL_SIZE and GLOBAL_POOL_SIZE) are unrealistically small, but
this small size makes it easier to single-step the program in order to get a feel for its operation.
on line 9 and released on line 16. Lines 10-14 move blocks from the global to the
per-thread pool until either the local pool reaches its target size (half full) or the global
pool is exhausted, and line 15 sets the per-thread pool’s count to the proper value.
In either case, line 18 checks for the per-thread pool still being empty, and if not,
lines 19-21 remove a block and return it. Otherwise, line 23 tells the sad tale of memory
exhaustion.
and 14 acquiring and releasing the spinlock. Lines 9-12 implement the loop moving
blocks from the local to the global pool, and line 13 sets the per-thread pool’s count to
the proper value.
In either case, line 16 then places the newly freed block into the per-thread pool.
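Because the corresponding figures are not reproduced here, the following sketch captures the structure just described: a code-locked global pool plus per-thread pools holding up to twice TARGET_POOL_SIZE blocks, with the allocation slowpath refilling an empty per-thread pool to half full and the free slowpath draining a full one (in this sketch, back to half full). The names and details are illustrative assumptions in the spirit of the text, not the book's smpalloc.c, and the initialization code that fills the global pool is omitted.

#include <pthread.h>
#include <stddef.h>

#define TARGET_POOL_SIZE 3
#define GLOBAL_POOL_SIZE 40   /* sized to hold every block, as in the text */

struct memblock { struct memblock *next; };  /* payload omitted */

static struct {
  pthread_mutex_t mutex;
  int cur;                                /* index of last block, -1 if empty */
  struct memblock *pool[GLOBAL_POOL_SIZE];
} globalmem = { .mutex = PTHREAD_MUTEX_INITIALIZER, .cur = -1 };

struct perthreadmempool {
  int cur;                                /* -1 when empty */
  struct memblock *pool[2 * TARGET_POOL_SIZE];
};

static __thread struct perthreadmempool perthreadmem = { .cur = -1 };

struct memblock *memblock_alloc(void)
{
  struct perthreadmempool *p = &perthreadmem;

  if (p->cur < 0) {                       /* slowpath: refill from the global pool */
    pthread_mutex_lock(&globalmem.mutex);
    while (p->cur < TARGET_POOL_SIZE - 1 && globalmem.cur >= 0)
      p->pool[++p->cur] = globalmem.pool[globalmem.cur--];
    pthread_mutex_unlock(&globalmem.mutex);
  }
  if (p->cur < 0)
    return NULL;                          /* memory exhausted */
  return p->pool[p->cur--];               /* fastpath: pure data ownership */
}

void memblock_free(struct memblock *b)
{
  struct perthreadmempool *p = &perthreadmem;

  if (p->cur >= 2 * TARGET_POOL_SIZE - 1) {  /* slowpath: spill to the global pool */
    pthread_mutex_lock(&globalmem.mutex);
    while (p->cur >= TARGET_POOL_SIZE)       /* drain back down to half full */
      globalmem.pool[++globalmem.cur] = p->pool[p->cur--];
    pthread_mutex_unlock(&globalmem.mutex);
  }
  p->pool[++p->cur] = b;                  /* fastpath */
}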
6.4.3.6 Performance
Rough performance results14 are shown in Figure 6.32, running on a dual-core Intel
x86 running at 1GHz (4300 bogomips per CPU) with at most six blocks allowed in
each CPU’s cache. In this micro-benchmark, each thread repeatedly allocates a group
of blocks and then frees all the blocks in that group, with the number of blocks in the
group being the “allocation run length” displayed on the x-axis. The y-axis shows the
number of successful allocation/free pairs per microsecond—failed allocations are not
counted. The “X”s are from a two-thread run, while the “+”s are from a single-threaded
run.
Note that run lengths up to six scale linearly and give excellent performance, while
run lengths greater than six show poor performance and almost always also show nega-
tive scaling. It is therefore quite important to size TARGET_POOL_SIZE sufficiently
large, which fortunately is usually quite easy to do in actual practice [MSK01], espe-
cially given today’s large memories. For example, in most systems, it is quite reasonable
to set TARGET_POOL_SIZE to 100, in which case allocations and frees are guaranteed
to be confined to per-thread pools at least 99% of the time.
As can be seen from the figure, the situations where the common-case data-ownership
applies (run lengths up to six) provide greatly improved performance compared to the
cases where locks must be acquired. Avoiding synchronization in the common case will
be a recurring theme through this book.
Quick Quiz 6.17: In Figure 6.32, there is a pattern of performance rising with
increasing run length in groups of three samples, for example, for run lengths 10, 11,
and 12. Why?
Quick Quiz 6.18: Allocation failures were observed in the two-thread tests at run
lengths of 19 and greater. Given the global-pool size of 40 and the per-thread target
pool size s of three, number of threads n equal to two, and assuming that the per-thread
14 This data was not collected in a statistically meaningful way, and therefore should be viewed with great
skepticism and suspicion. Good data-collection and -reduction practice is discussed in Chapter 11. That said,
repeated runs gave similar results, and these results match more careful evaluations of similar algorithms.
[Figure 6.32 (not reproduced): allocations/frees per microsecond versus allocation run length, 0-25.]
pools are initially empty with none of the memory in use, what is the smallest allocation
run length m at which failures can occur? (Recall that each thread repeatedly allocates
m blocks of memory, and then frees the m blocks of memory.) Alternatively, given n
threads each with pool size s, and where each thread repeatedly first allocates m blocks
of memory and then frees those m blocks, how large must the global pool size be? Note:
Obtaining the correct answer will require you to examine the smpalloc.c source
code, and very likely single-step it as well. You have been warned!
The toy parallel resource allocator was quite simple, but real-world designs expand on
this approach in a number of ways.
First, real-world allocators are required to handle a wide range of allocation sizes,
as opposed to the single size shown in this toy example. One popular way to do this is
to offer a fixed set of sizes, spaced so as to balance external and internal fragmentation,
such as in the late-1980s BSD memory allocator [MK88]. Doing this would mean that
the “globalmem” variable would need to be replicated on a per-size basis, and that the
associated lock would similarly be replicated, resulting in data locking rather than the
toy program’s code locking.
Second, production-quality systems must be able to repurpose memory, meaning
that they must be able to coalesce blocks into larger structures, such as pages [MS93].
This coalescing will also need to be protected by a lock, which again could be replicated
on a per-size basis.
Third, coalesced memory must be returned to the underlying memory system, and
pages of memory must also be allocated from the underlying memory system. The
locking required at this level will depend on that of the underlying memory system, but
could well be code locking. Code locking can often be tolerated at this level, because
this level is so infrequently reached in well-designed systems [MSK01].
Despite this real-world design’s greater complexity, the underlying idea is the same—
repeated application of parallel fastpath, as shown in Table 6.1.
[Figure (not reproduced): CDF of SEQ and PWQ solution times (ms).]
algorithm: At most one thread may be making progress along the solution path at any
given time. This weakness is addressed in the next section.
[Figure 6.39 (not reproduced): CDF of Solution Times For SEQ, PWQ, and PART.]
shown in Figure 6.38. Lines 8-9 check to see if the cells are connected, returning failure
if not. The loop spanning lines 11-18 attempts to mark the new cell visited. Line 13
checks to see if it has already been visited, in which case line 16 returns failure, but only
after line 14 checks to see if we have encountered the other thread, in which case line 15
indicates that the solution has been located. Line 19 updates to the new cell, lines 20
and 21 update this thread’s visited array, and line 22 returns success.
Performance testing revealed a surprising anomaly, shown in Figure 6.39. The
median solution time for PART (17 milliseconds) is more than four times faster than
that of SEQ (79 milliseconds), despite running on only two threads. The next section
analyzes this anomaly.
[Figure 6.40 (not reproduced): CDF of SEQ/PWQ and SEQ/PART speedup ratios relative to SEQ.]
then run all solvers on that maze. It therefore makes sense to plot the CDF of the ratios
of solution times for each generated maze, as shown in Figure 6.40, greatly reducing
the CDFs’ overlap. This plot reveals that for some mazes, PART is more than forty
times faster than SEQ. In contrast, PWQ is never more than about two times faster than
SEQ. A forty-times speedup on two threads demands explanation. After all, this is
not merely embarrassingly parallel, where partitionability means that adding threads
does not increase the overall computational cost. It is instead humiliatingly parallel:
Adding threads significantly reduces the overall computational cost, resulting in large
algorithmic superlinear speedups.
Further investigation showed that PART sometimes visited fewer than 2% of the
maze’s cells, while SEQ and PWQ never visited fewer than about 9%. The reason for
this difference is shown by Figure 6.41. If the thread traversing the solution from the
upper left reaches the circle, the other thread cannot reach the upper-right portion of
the maze. Similarly, if the other thread reaches the square, the first thread cannot reach
the lower-left portion of the maze. Therefore, PART will likely visit a small fraction of
the non-solution-path cells. In short, the superlinear speedups are due to threads getting
in each other's way. This is a sharp contrast with decades of experience with parallel programming, where workers have struggled to keep threads out of each other's way.
Figure 6.42 confirms a strong correlation between cells visited and solution time
for all three methods. The slope of PART’s scatterplot is smaller than that of SEQ,
indicating that PART’s pair of threads visits a given fraction of the maze faster than can
SEQ’s single thread. PART’s scatterplot is also weighted toward small visit percentages,
[Figure 6.42 (not reproduced): solution time (ms) versus percent of maze cells visited, for SEQ, PWQ, and PART.]
confirming that PART does less total work, hence the observed humiliating parallelism.
The fraction of cells visited by PWQ is similar to that of SEQ. In addition, PWQ’s
solution time is greater than that of PART, even for equal visit fractions. The reason for
this is shown in Figure 6.43, which has a red circle on each cell with more than two
neighbors. Each such cell can result in contention in PWQ, because one thread can enter
but two threads can exit, which hurts performance, as noted earlier in this chapter. In
contrast, PART can incur such contention but once, namely when the solution is located.
Of course, SEQ never contends.
Although PART’s speedup is impressive, we should not neglect sequential optimiza-
tions. Figure 6.44 shows that SEQ, when compiled with -O3, is about twice as fast as
unoptimized PWQ, approaching the performance of unoptimized PART. Compiling all
three algorithms with -O3 gives results similar to (albeit faster than) those shown in
Figure 6.40, except that PWQ provides almost no speedup compared to SEQ, in keeping
with Amdahl’s Law [Amd67]. However, if the goal is to double performance compared
to unoptimized SEQ, as opposed to achieving optimality, compiler optimizations are
quite attractive.
Cache alignment and padding often improves performance by reducing false sharing.
However, for these maze-solution algorithms, aligning and padding the maze-cell array
degrades performance by up to 42% for 1000x1000 mazes. Cache locality is more
important than avoiding false sharing, especially for large mazes. For smaller 20-by-
20 or 50-by-50 mazes, aligning and padding can produce up to a 40% performance
improvement for PART, but for these small sizes, SEQ performs better anyway because
[Figure 6.44 (not reproduced): CDF of speedup relative to SEQ, for SEQ -O3, PWQ, and PART.]
[Figure (not reproduced): CDF of speedup relative to SEQ (-O3), for COPART, PWQ, and PART.]
there is insufficient time for PART to make up for the overhead of thread creation and
destruction.
In short, the partitioned parallel maze solver is an interesting example of an algo-
rithmic superlinear speedup. If “algorithmic superlinear speedup” causes cognitive
dissonance, please proceed to the next section.
[Figure (not reproduced): PART and PWQ results versus maze size, 10-1000.]
[Figure (not reproduced): speedup relative to COPART (-O3) versus maze size, 10-1000, for PART and PWQ.]
[Figure (not reproduced): PART and PWQ speedup versus number of threads, 1-8.]
efficiency breakeven is within the 90% confidence interval for seven and eight threads.
The reasons for the peak at two threads are (1) the lower complexity of termination
detection in the two-thread case and (2) the fact that there is a lower probability of the
third and subsequent threads making useful forward progress: Only the first two threads
are guaranteed to start on the solution line. This disappointing performance compared
to results in Figure 6.47 is due to the less-tightly integrated hardware available in the
larger and older Xeon® system running at 2.66GHz.
that this experience will motivate work on parallelism as a first-class design-time whole-
application optimization technique, rather than as a grossly suboptimal after-the-fact
micro-optimization to be retrofitted into existing programs.
Chapter 7
Locking
In recent concurrency research, the role of villain is often played by locking. In many
papers and presentations, locking stands accused of promoting deadlocks, convoying,
starvation, unfairness, data races, and all manner of other concurrency sins. Interestingly
enough, the role of workhorse in production-quality shared-memory parallel software is
played by, you guessed it, locking. This chapter will look into this dichotomy between
villain and hero, as fancifully depicted in Figures 7.1 and 7.2.
There are a number of reasons behind this Jekyll-and-Hyde dichotomy:
1. Many of locking’s sins have pragmatic design solutions that work well in most
cases, for example:
2. Some of locking’s sins are problems only at high levels of contention, levels
reached only by poorly designed programs.
4. Until quite recently, almost all large shared-memory parallel programs were
developed in secret, so that it was difficult for most researchers to learn of these
pragmatic solutions.
5. Locking works extremely well for some software artifacts and extremely poorly
for others. Developers who have worked on artifacts for which locking works
well can be expected to have a much more positive opinion of locking than those
who have worked on artifacts for which locking works poorly, as will be discussed
in Section 7.5.
6. All good stories need a villain, and locking has a long and honorable history
serving as a research-paper whipping boy.
Quick Quiz 7.1: Just how can serving as a whipping boy be considered to be in any
way honorable???
This chapter will give an overview of a number of ways to avoid locking’s more
serious sins.
[Figure 7.3 (not reproduced): deadlock scenario graph with Threads A, B, and C and Locks 1-4.]
7.1.1 Deadlock
Deadlock occurs when each of a group of threads is holding at least one lock while at
the same time waiting on a lock held by a member of that same group.
Without some sort of external intervention, deadlock is forever. No thread can
acquire the lock it is waiting on until that lock is released by the thread holding it, but
the thread holding it cannot release it until the holding thread acquires the lock that it is
waiting on.
We can create a directed-graph representation of a deadlock scenario with nodes for
threads and locks, as shown in Figure 7.3. An arrow from a lock to a thread indicates
that the thread holds the lock, for example, Thread B holds Locks 2 and 4. An arrow
from a thread to a lock indicates that the thread is waiting on the lock, for example,
Thread B is waiting on Lock 3.
A deadlock scenario will always contain at least one deadlock cycle. In Figure 7.3,
this cycle is Thread B, Lock 3, Thread C, Lock 4, and back to Thread B.
Quick Quiz 7.2: But the definition of deadlock only said that each thread was
holding at least one lock and waiting on another lock that was held by some thread.
How do you know that there is a cycle?
Although there are some software environments such as database systems that can
repair an existing deadlock, this approach requires either that one of the threads be
killed or that a lock be forcibly stolen from one of the threads. This killing and forcible
stealing can be appropriate for transactions, but is often problematic for kernel and
application-level use of locking: dealing with the resulting partially updated structures
can be extremely complex, hazardous, and error-prone.
Kernels and applications therefore work to avoid deadlocks rather than to recover
from them. There are a number of deadlock-avoidance strategies, including locking
hierarchies (Section 7.1.1.1), local locking hierarchies (Section 7.1.1.2), layered locking
hierarchies (Section 7.1.1.3), strategies for dealing with APIs containing pointers to
locks (Section 7.1.1.4), conditional locking (Section 7.1.1.5), acquiring all needed locks
first (Section 7.1.1.6), single-lock-at-a-time designs (Section 7.1.1.7), and strategies for
signal/interrupt handlers (Section 7.1.1.8). Although there is no deadlock-avoidance
138 CHAPTER 7. LOCKING
strategy that works perfectly for all situations, there is a good selection of deadlock-
avoidance tools to choose from.
Locking hierarchies order the locks and prohibit acquiring locks out of order. In
Figure 7.3, we might order the locks numerically, so that a thread was forbidden from
acquiring a given lock if it already held a lock with the same or a higher number.
Thread B has violated this hierarchy because it is attempting to acquire Lock 3 while
holding Lock 4, which permitted the deadlock to occur.
Again, to apply a locking hierarchy, order the locks and prohibit out-of-order
lock acquisition. In large program, it is wise to use tools to enforce your locking
hierarchy [Cor06a].
However, the global nature of locking hierarchies make them difficult to apply to library
functions. After all, the program using a given library function has not even been written
yet, so how can the poor library-function implementor possibly hope to adhere to the
yet-to-be-written program’s locking hierarchy?
One special case that is fortunately the common case is when the library function
does not invoke any of the caller’s code. In this case, the caller’s locks will never be
acquired while holding any of the library’s locks, so that there cannot be a deadlock
cycle containing locks from both the library and the caller.
Quick Quiz 7.3: Are there any exceptions to this rule, so that there really could be
a deadlock cycle containing locks from both the library and the caller, even given that
the library code never invokes any of the caller’s functions?
But suppose that a library function does invoke the caller’s code. For example,
the qsort() function invokes a caller-provided comparison function. A concurrent
implementation of qsort() likely uses locking, which might result in deadlock in
the perhaps-unlikely case where the comparison function is a complicated function
that itself acquires locks. How can the library function avoid deadlock?
The golden rule in this case is “Release all locks before invoking unknown code.”
To follow this rule, the qsort() function must release all locks before invoking the
comparison function.
Quick Quiz 7.4: But if qsort() releases all its locks before invoking the compar-
ison function, how can it protect against races with other qsort() threads?
To see the benefits of local locking hierarchies, compare Figures 7.4 and 7.5. In
both figures, application functions foo() and bar() invoke qsort() while holding
Locks A and B, respectively. Because this is a parallel implementation of qsort(), it
acquires Lock C. Function foo() passes function cmp() to qsort(), and cmp()
acquires Lock B. Function bar() passes a simple integer-comparison function (not
shown) to qsort(), and this simple function does not acquire any locks.
Now, if qsort() holds Lock C while calling cmp() in violation of the golden
release-all-locks rule above, as shown in Figure 7.4, deadlock can occur. To see this,
suppose that one thread invokes foo() while a second thread concurrently invokes
bar(). The first thread will acquire Lock A and the second thread will acquire Lock B.
If the first thread’s call to qsort() acquires Lock C, then it will be unable to acquire
Lock B when it calls cmp(). But the first thread holds Lock C, so the second thread’s
[Figures 7.4 and 7.5 (not reproduced): application functions invoking the library's qsort(), which holds Lock C while calling caller code in Figure 7.4, but releases Lock C first in Figure 7.5.]
call to qsort() will be unable to acquire it, and thus unable to release Lock B,
resulting in deadlock.
In contrast, if qsort() releases Lock C before invoking the comparison function
(which is unknown code from qsort()'s perspective), then deadlock is avoided as
shown in Figure 7.5.
If each module releases all locks before invoking unknown code, then deadlock is
avoided if each module separately avoids deadlock. This rule therefore greatly simplifies
deadlock analysis and greatly improves modularity.
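The following is a minimal, hypothetical sketch of this golden rule (not from the book): a library routine updates its internal state under its own lock, but drops that lock before invoking the caller-supplied callback, so the callback is free to acquire the caller's locks without creating a deadlock cycle spanning both modules.

#include <pthread.h>

static pthread_mutex_t lib_lock = PTHREAD_MUTEX_INITIALIZER;  /* the library's internal lock */
static int lib_state;

void lib_update_and_notify(int delta, void (*callback)(int newval))
{
  int snapshot;

  pthread_mutex_lock(&lib_lock);
  lib_state += delta;               /* internal state protected by the library's lock */
  snapshot = lib_state;
  pthread_mutex_unlock(&lib_lock);  /* release before invoking unknown code */

  callback(snapshot);               /* caller code may now acquire its own locks safely */
}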
[Figure 7.6 (not reproduced): layered locking hierarchy: application foo() and bar() with Locks A and B, library qsort() with Lock C, and cmp() with Lock D.]
construct a layered locking hierarchy, as shown in Figure 7.6. Here, the cmp() function
uses a new Lock D that is acquired after all of Locks A, B, and C, avoiding deadlock.
We therefore have three layers to the global deadlock hierarchy, the first containing
Locks A and B, the second containing Lock C, and the third containing Lock D.
Please note that it is not typically possible to mechanically change cmp() to use
the new Lock D. Quite the opposite: It is often necessary to make profound design-level
modifications. Nevertheless, the effort required for such modifications is normally a
small price to pay in order to avoid deadlock.
For another example where releasing all locks before invoking unknown code is
impractical, imagine an iterator over a linked list, as shown in Figure 7.7 (locked_
list.c). The list_start() function acquires a lock on the list and returns the
first element (if there is one), and list_next() either returns a pointer to the next
element in the list or releases the lock and returns NULL if the end of the list has been
reached.
Figure 7.8 shows how this list iterator may be used. Lines 1-4 define the list_
ints element containing a single integer, and lines 6-17 show how to iterate over the
list. Line 11 locks the list and fetches a pointer to the first element, line 13 provides a
pointer to our enclosing list_ints structure, line 14 prints the corresponding integer,
and line 15 moves to the next element. This is quite simple, and hides all of the locking.
That is, the locking remains hidden as long as the code processing each list element
does not itself acquire a lock that is held across some other call to list_start() or
list_next(), which results in deadlock. We can avoid the deadlock by layering the
locking hierarchy to take the list-iterator locking into account.
This layered approach can be extended to an arbitrarily large number of layers, but
1 struct locked_list {
2 spinlock_t s;
3 struct list_head h;
4 };
5
6 struct list_head *list_start(struct locked_list *lp)
7 {
8 spin_lock(&lp->s);
9 return list_next(lp, &lp->h);
10 }
11
12 struct list_head *list_next(struct locked_list *lp,
13 struct list_head *np)
14 {
15 struct list_head *ret;
16
17 ret = np->next;
18 if (ret == &lp->h) {
19 spin_unlock(&lp->s);
20 ret = NULL;
21 }
22 return ret;
23 }
1 struct list_ints {
2 struct list_head n;
3 int a;
4 };
5
6 void list_print(struct locked_list *lp)
7 {
8 struct list_head *np;
9 struct list_ints *ip;
10
11 np = list_start(lp);
12 while (np != NULL) {
13 ip = list_entry(np, struct list_ints, n);
14 printf("\t%d\n", ip->a);
15 np = list_next(lp, np);
16 }
17 }
1 spin_lock(&lock2);
2 layer_2_processing(pkt);
3 nextlayer = layer_1(pkt);
4 spin_lock(&nextlayer->lock1);
5 layer_1_processing(pkt);
6 spin_unlock(&lock2);
7 spin_unlock(&nextlayer->lock1);
each added layer increases the complexity of the locking design. Such increases in
complexity are particularly inconvenient for some types of object-oriented designs, in
which control passes back and forth among a large group of objects in an undisciplined
manner.1 This mismatch between the habits of object-oriented design and the need to
avoid deadlock is an important reason why parallel programming is perceived by some
to be so difficult.
Some alternatives to highly layered locking hierarchies are covered in Chapter 9.
1 retry:
2 spin_lock(&lock2);
3 layer_2_processing(pkt);
4 nextlayer = layer_1(pkt);
5 if (!spin_trylock(&nextlayer->lock1)) {
6 spin_unlock(&lock2);
7 spin_lock(&nextlayer->lock1);
8 spin_lock(&lock2);
9 if (layer_1(pkt) != nextlayer) {
10 spin_unlock(&nextlayer->lock1);
11 spin_unlock(&lock2);
12 goto retry;
13 }
14 }
15 layer_1_processing(pkt);
16 spin_unlock(&lock2);
17 spin_unlock(&nextlayer->lock1);
Figure 7.10. Instead of unconditionally acquiring the layer-1 lock, line 5 conditionally
acquires the lock using the spin_trylock() primitive. This primitive acquires the
lock immediately if the lock is available (returning non-zero), and otherwise returns
zero without acquiring the lock.
If spin_trylock() was successful, line 15 does the needed layer-1 processing.
Otherwise, line 6 releases the lock, and lines 7 and 8 acquire them in the correct order.
Unfortunately, there might be multiple networking devices on the system (e.g., Ethernet
and WiFi), so that the layer_1() function must make a routing decision. This
decision might change at any time, especially if the system is mobile.2 Therefore, line 9
must recheck the decision, and if it has changed, must release the locks and start over.
Quick Quiz 7.7: Can the transformation from Figure 7.9 to Figure 7.10 be applied
universally?
Quick Quiz 7.8: But the complexity in Figure 7.10 is well worthwhile given that it
avoids deadlock, right?
on the ability to abort transactions, although this can be simplified by avoiding making
any changes to shared data until all needed locks are acquired. Livelock and deadlock
are issues in such systems, but practical solutions may be found in any of a number of
database textbooks.
In some cases, it is possible to avoid nesting locks, thus avoiding deadlock. For example,
if a problem is perfectly partitionable, a single lock may be assigned to each partition.
Then a thread working on a given partition need only acquire the one corresponding
lock. Because no thread ever holds more than one lock at a time, deadlock is impossible.
However, there must be some mechanism to ensure that the needed data structures
remain in existence during the time that neither lock is held. One such mechanism is
discussed in Section 7.4 and several others are presented in Chapter 9.
Deadlocks involving signal handlers are often quickly dismissed by noting that it is
not legal to invoke pthread_mutex_lock() from within a signal handler [Ope97].
However, it is possible (though almost always unwise) to hand-craft locking primitives
that can be invoked from signal handlers. Besides which, almost all operating-system
kernels permit locks to be acquired from within interrupt handlers, which are the kernel
analog to signal handlers.
The trick is to block signals (or disable interrupts, as the case may be) when acquiring
any lock that might be acquired within an interrupt handler. Furthermore, if holding
such a lock, it is illegal to attempt to acquire any lock that is ever acquired outside of a
signal handler without blocking signals.
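The following sketch (hypothetical names, not from the book) shows the thread-context side of this discipline: SIGUSR1 is blocked before acquiring a lock that the SIGUSR1 handler also acquires, and the previous signal mask is restored only after the lock has been released. A pthread mutex stands in for the lock here; as noted above, a real handler-shared lock would have to be a hand-crafted primitive, since pthread_mutex_lock() may not be invoked from a signal handler.

#include <pthread.h>
#include <signal.h>

/* Stand-in for a lock that the SIGUSR1 handler also acquires. */
static pthread_mutex_t handler_lock = PTHREAD_MUTEX_INITIALIZER;

void update_state_shared_with_handler(void (*update)(void))
{
  sigset_t mask, omask;

  sigemptyset(&mask);
  sigaddset(&mask, SIGUSR1);
  pthread_sigmask(SIG_BLOCK, &mask, &omask);  /* handler cannot run on this thread now */

  pthread_mutex_lock(&handler_lock);
  update();                                   /* manipulate the shared state */
  pthread_mutex_unlock(&handler_lock);

  pthread_sigmask(SIG_SETMASK, &omask, NULL); /* restore the previous signal mask */
}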
Quick Quiz 7.10: Why is it illegal to acquire a Lock A that is acquired outside of a
signal handler without blocking signals while holding a Lock B that is acquired within a
signal handler?
If a lock is acquired by the handlers for several signals, then each and every one of
these signals must be blocked whenever that lock is acquired, even when that lock is
acquired within a signal handler.
Quick Quiz 7.11: How can you legally block signals within a signal handler?
Unfortunately, blocking and unblocking signals can be expensive in some operating
systems, notably including Linux, so performance concerns often mean that locks
acquired in signal handlers are only acquired in signal handlers, and that lockless
synchronization mechanisms are used to communicate between application code and
signal handlers.
Or that signal handlers are avoided completely except for handling fatal errors.
Quick Quiz 7.12: If acquiring locks in signal handlers is such a bad idea, why even
discuss ways of making it safe?
7.1.1.9 Discussion
1 void thread1(void)
2 {
3 retry:
4 spin_lock(&lock1);
5 do_one_thing();
6 if (!spin_trylock(&lock2)) {
7 spin_unlock(&lock1);
8 goto retry;
9 }
10 do_another_thing();
11 spin_unlock(&lock2);
12 spin_unlock(&lock1);
13 }
14
15 void thread2(void)
16 {
17 retry:
18 spin_lock(&lock2);
19 do_a_third_thing();
20 if (!spin_trylock(&lock1)) {
21 spin_unlock(&lock2);
22 goto retry;
23 }
24 do_a_fourth_thing();
25 spin_unlock(&lock1);
26 spin_unlock(&lock2);
27 }
tool in their toolbox: locking is a powerful concurrency tool, but there are jobs better
addressed with other tools.
Quick Quiz 7.13: Given an object-oriented application that passes control freely
among a group of objects such that there is no straightforward locking hierarchy,3
layered or otherwise, how can this application be parallelized?
Nevertheless, the strategies described in this section have proven quite useful in
many settings.
1 void thread1(void)
2 {
3 unsigned int wait = 1;
4 retry:
5 spin_lock(&lock1);
6 do_one_thing();
7 if (!spin_trylock(&lock2)) {
8 spin_unlock(&lock1);
9 sleep(wait);
10 wait = wait << 1;
11 goto retry;
12 }
13 do_another_thing();
14 spin_unlock(&lock2);
15 spin_unlock(&lock1);
16 }
17
18 void thread2(void)
19 {
20 unsigned int wait = 1;
21 retry:
22 spin_lock(&lock2);
23 do_a_third_thing();
24 if (!spin_trylock(&lock1)) {
25 spin_unlock(&lock2);
26 sleep(wait);
27 wait = wait << 1;
28 goto retry;
29 }
30 do_a_fourth_thing();
31 spin_unlock(&lock1);
32 spin_unlock(&lock2);
33 }
Quick Quiz 7.14: How can the livelock shown in Figure 7.11 be avoided?
Livelock can be thought of as an extreme form of starvation where a group of threads
starve, rather than just one of them.4
Livelock and starvation are serious issues in software transactional memory implementations, and so the concept of a contention manager has been introduced to encapsulate these issues. In the case of locking, simple exponential backoff can often address
livelock and starvation. The idea is to introduce exponentially increasing delays before
each retry, as shown in Figure 7.12.
Quick Quiz 7.15: What problems can you spot in the code in Figure 7.12?
However, for better results, the backoff should be bounded, and even better high-
contention results have been obtained via queued locking [And90], which is discussed
more in Section 7.3.2. Of course, best of all is to use a good parallel design so that lock
contention remains low.
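For example, a minimal variation on Figure 7.12's retry path caps the delay; the MAX_BACKOFF value is a hypothetical choice, and the spin_lock()/spin_trylock()/spin_unlock() primitives and do_*() functions are assumed to be those used throughout this chapter:

#include <unistd.h>

#define MAX_BACKOFF 64U  /* hypothetical cap on the sleep() argument, in seconds */

void thread1_bounded(void)
{
  unsigned int wait = 1;

retry:
  spin_lock(&lock1);
  do_one_thing();
  if (!spin_trylock(&lock2)) {
    spin_unlock(&lock1);
    sleep(wait);
    wait = wait << 1;
    if (wait > MAX_BACKOFF)
      wait = MAX_BACKOFF;  /* bound the exponential backoff */
    goto retry;
  }
  do_another_thing();
  spin_unlock(&lock2);
  spin_unlock(&lock1);
}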
7.1.3 Unfairness
Unfairness can be thought of as a less-severe form of starvation, where a subset of
threads contending for a given lock are granted the lion’s share of the acquisitions. This
can happen on machines with shared caches or NUMA characteristics, for example, as
4 Try not to get too hung up on the exact definitions of terms like livelock, starvation, and unfairness.
Anything that causes a group of threads to fail to make adequate forward progress is a problem that needs to
be fixed, regardless of what name you choose for it.
shown in Figure 7.13.
[Figure 7.13: system architecture, with pairs of CPUs sharing caches and interconnects]
If CPU 0 releases a lock that all the other CPUs are attempting to
acquire, the interconnect shared between CPUs 0 and 1 means that CPU 1 will have an
advantage over CPUs 2-7. Therefore CPU 1 will likely acquire the lock. If CPU 1 holds the lock long enough that CPU 0 is requesting it by the time CPU 1 releases it, and vice versa, the lock can shuttle between CPUs 0 and 1, bypassing CPUs 2-7.
Quick Quiz 7.16: Wouldn’t it be better just to use a good parallel design so that
lock contention was low enough to avoid unfairness?
7.1.4 Inefficiency
Locks are implemented using atomic instructions and memory barriers, and often involve
cache misses. As we saw in Chapter 3, these instructions are quite expensive, roughly
two orders of magnitude greater overhead than simple instructions. This can be a serious
problem for locking: If you protect a single instruction with a lock, you will increase the
overhead by a factor of one hundred. Even assuming perfect scalability, one hundred
CPUs would be required to keep up with a single CPU executing the same code without
locking.
This situation underscores the synchronization-granularity tradeoff discussed in
Section 6.3, especially Figure 6.22: Too coarse a granularity will limit scalability, while
too fine a granularity will result in excessive synchronization overhead.
That said, once a lock is held, the data protected by that lock can be accessed by
the lock holder without interference. Acquiring a lock might be expensive, but once
held, the CPU’s caches are an effective performance booster, at least for large critical
sections.
Quick Quiz 7.17: How might the lock holder be interfered with?
reader-writer locks (Section 7.2.2), multi-role locks (Section 7.2.3), and scoped locking
(Section 7.2.4).
VAX/VMS DLM lock-mode compatibility (an “X” marks an incompatible pair of modes):

                   Null   Concurrent  Concurrent  Protected  Protected  Exclusive
                   (Not   Read        Write       Read       Write
                   Held)
Null (Not Held)
Concurrent Read                                                          X
Concurrent Write                                   X          X          X
Protected Read                        X                       X          X
Protected Write                       X            X          X          X
Exclusive                 X           X            X          X          X
The VAX/VMS DLM uses six modes. For purposes of comparison, exclusive locks
use two modes (not held and held), while reader-writer locks use three modes (not held,
read held, and write held).
The first mode is null, or not held. This mode is compatible with all other modes,
which is to be expected: If a thread is not holding a lock, it should not prevent any other
thread from acquiring that lock.
The second mode is concurrent read, which is compatible with every other mode ex-
cept for exclusive. The concurrent-read mode might be used to accumulate approximate
statistics on a data structure, while permitting updates to proceed concurrently.
The third mode is concurrent write, which is compatible with null, concurrent read,
and concurrent write. The concurrent-write mode might be used to update approximate
statistics, while still permitting reads and concurrent updates to proceed concurrently.
The fourth mode is protected read, which is compatible with null, concurrent read,
and protected read. The protected-read mode might be used to obtain a consistent
snapshot of the data structure, while permitting reads but not updates to proceed concur-
rently.
The fifth mode is protected write, which is compatible with null and concurrent
read. The protected-write mode might be used to carry out updates to a data structure
that could interfere with protected readers but which could be tolerated by concurrent
readers.
The sixth and final mode is exclusive, which is compatible only with null. The
exclusive mode is used when it is necessary to exclude all other accesses.
It is interesting to note that exclusive locks and reader-writer locks can be emulated
by the VAX/VMS DLM. Exclusive locks would use only the null and exclusive modes,
while reader-writer locks might use the null, protected-read, and protected-write modes.
Quick Quiz 7.19: Is there any other way for the VAX/VMS DLM to emulate a
reader-writer lock?
Although the VAX/VMS DLM policy has seen widespread production use for dis-
tributed databases, it does not appear to be used much in shared-memory applications.
One possible reason for this is that the greater communication overheads of distributed
databases can hide the greater overhead of the VAX/VMS DLM’s more-complex admis-
sion policy.
Nevertheless, the VAX/VMS DLM is an interesting illustration of just how flexible
the concepts behind locking can be. It also serves as a very simple introduction to the
locking schemes used by modern DBMSes, which can have more than thirty locking
modes, compared to VAX/VMS’s six.
6 My later work with parallelism at Sequent Computer Systems very quickly disabused me of this
misguided notion.
[Figure 7.14: Locking Hierarchy: a root rcu_node structure fanning out to leaf structures, each covering a block of CPUs]
(lines 7-8) spins until the lock is available, at which point the outer loop makes another
attempt to acquire the lock.
Quick Quiz 7.23: Why bother with the inner loop on lines 7-8 of Figure 7.16? Why
not simply repeatedly do the atomic exchange operation on line 6?
Lock release is carried out by the xchg_unlock() function shown on lines 12-15.
Line 14 atomically exchanges the value zero (“unlocked”) into the lock, thus marking it
as having been released.
Quick Quiz 7.24: Why not simply store zero into the lock word on line 14 of
Figure 7.16?
This lock is a simple example of a test-and-set lock [SR84], but very similar mecha-
nisms have been used extensively as pure spinlocks in production.
7 Besides, the best way of handling high lock contention is to avoid it in the first place! However, there
are some situations where high lock contention is the lesser of the available evils, and in any case, studying
schemes that deal with high levels of contention is good mental exercise.
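For concreteness, here is a minimal sketch of such a test-and-set lock, written in terms of C11 atomics rather than the xchg() primitive used in Figure 7.16; the type and function names are hypothetical:

#include <stdatomic.h>

typedef atomic_int xchg_spinlock_t;  /* 0 == unlocked, 1 == locked */

static void xchg_spin_lock(xchg_spinlock_t *lp)
{
  /* Outer loop: attempt the atomic exchange. */
  while (atomic_exchange_explicit(lp, 1, memory_order_acquire)) {
    /* Inner loop: spin with plain loads until the lock appears free,
     * avoiding a stream of expensive read-modify-write operations. */
    while (atomic_load_explicit(lp, memory_order_relaxed))
      continue;
  }
}

static void xchg_spin_unlock(xchg_spinlock_t *lp)
{
  /* Store zero ("unlocked"), releasing the critical section. */
  atomic_store_explicit(lp, 0, memory_order_release);
}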
More recent queued-lock implementations also take the system’s architecture into
account, preferentially granting locks locally, while also taking steps to avoid starva-
tion [SSVM02, RH03, RH02, JMRR02, MCM02]. Many of these can be thought of as
analogous to the elevator algorithms traditionally used in scheduling disk I/O.
Unfortunately, the same scheduling logic that improves the efficiency of queued
locks at high contention also increases their overhead at low contention. Beng-Hong Lim
and Anant Agarwal therefore combined a simple test-and-set lock with a queued lock,
using the test-and-set lock at low levels of contention and switching to the queued lock at
high levels of contention [LA94], thus getting low overhead at low levels of contention
and getting fairness and high throughput at high levels of contention. Browning et
al. took a similar approach, but avoided the use of a separate flag, so that the test-and-
set fast path uses the same sequence of instructions that would be used in a simple
test-and-set lock [BMMM05]. This approach has been used in production.
Another issue that arises at high levels of contention is delay of the lock holder, especially delay due to preemption. Such delay can result in priority inversion, where a low-priority thread holding a lock is preempted by a medium-priority CPU-bound thread, so that a high-priority process blocks while attempting to acquire the lock. The net effect is that the CPU-bound medium-priority process prevents the high-priority process from running. One solution is priority
inheritance [LR80], which has been widely used for real-time computing [SRL90a,
Cor06b], despite some lingering controversy over this practice [Yod04a, Loc02].
Another way to avoid priority inversion is to prevent preemption while a lock is
held. Because preventing preemption while locks are held also improves throughput,
most proprietary UNIX kernels offer some form of scheduler-conscious synchronization
mechanism [KWS97], largely due to the efforts of a certain sizable database vendor.
These mechanisms usually take the form of a hint that preemption would be inappro-
priate. These hints frequently take the form of a bit set in a particular machine register,
which enables extremely low per-lock-acquisition overhead for these mechanisms. In
contrast, Linux avoids these hints, instead getting similar results from a mechanism
called futexes [FRK02, Mol06, Ros06, Dre11].
Interestingly enough, atomic instructions are not strictly needed to implement
locks [Dij65, Lam74]. An excellent exposition of the issues surrounding locking imple-
mentations based on simple loads and stores may be found in Herlihy’s and Shavit’s
textbook [HS08]. The main point echoed here is that such implementations currently
have little practical application, although a careful study of them can be both entertaining
and enlightening. Nevertheless, with one exception described below, such study is left
as an exercise for the reader.
Gamsa et al. [GKAS99, Section 5.3] describe a token-based mechanism in which a
token circulates among the CPUs. When the token reaches a given CPU, it has exclusive
access to anything protected by that token. There are any number of schemes that may
be used to implement the token-based mechanism, for example:
1. Maintain a per-CPU flag, which is initially zero for all but one CPU. When a
CPU’s flag is non-zero, it holds the token. When it finishes with the token, it
zeroes its flag and sets the flag of the next CPU to one (or to any other non-zero
value).
CPU (taking counter wrap into account), the first CPU holds the token. When it
is finished with the token, it sets the next CPU’s counter to a value one greater
than its own counter.
Quick Quiz 7.25: How can you tell if one counter is greater than another, while
accounting for counter wrap?
Quick Quiz 7.26: Which is better, the counter approach or the flag approach?
This lock is unusual in that a given CPU cannot necessarily acquire it immediately,
even if no other CPU is using it at the moment. Instead, the CPU must wait until the
token comes around to it. This is useful in cases where CPUs need periodic access
to the critical section, but can tolerate variances in token-circulation rate. Gamsa et
al. [GKAS99] used it to implement a variant of read-copy update (see Section 9.5), but
it could also be used to protect periodic per-CPU operations such as flushing per-CPU
caches used by memory allocators [MS93], garbage-collecting per-CPU data structures,
or flushing per-CPU data to shared storage (or to mass storage, for that matter).
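A minimal sketch of the flag-based variant described in item 1 above, using C11 atomics, might look as follows; NR_CPUS and the function names are hypothetical:

#include <stdatomic.h>

#define NR_CPUS 4  /* hypothetical CPU count */

/* One flag per CPU; CPU 0 initially holds the token. */
static atomic_int token_flag[NR_CPUS] = { 1 };

/* Spin until this CPU holds the token. */
static void token_wait(int cpu)
{
  while (!atomic_load_explicit(&token_flag[cpu], memory_order_acquire))
    continue;
}

/* Finished with the token: clear this CPU's flag, then set the next CPU's. */
static void token_pass(int cpu)
{
  atomic_store_explicit(&token_flag[cpu], 0, memory_order_relaxed);
  atomic_store_explicit(&token_flag[(cpu + 1) % NR_CPUS], 1,
                        memory_order_release);
}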
As increasing numbers of people gain familiarity with parallel hardware and paral-
lelize increasing amounts of code, we can expect more special-purpose locking primi-
tives to appear. Nevertheless, you should carefully consider this important safety tip:
Use the standard synchronization primitives whenever humanly possible. The big ad-
vantage of the standard synchronization primitives over roll-your-own efforts is that the
standard primitives are typically much less bug-prone.8
1. Global variables and static local variables in the base module will exist as long as
the application is running.
2. Global variables and static local variables in a loaded module will exist as long as
that module remains loaded.
8 And yes, I have done at least my share of roll-your-own synchronization primitives. However, you will
notice that my hair is much greyer than it was before I started doing that sort of work. Coincidence? Maybe.
But are you really willing to risk your own hair turning prematurely grey?
3. A module will remain loaded as long as at least one of its functions has an active
instance.
4. A given function instance’s on-stack variables will exist until that instance returns.
5. If you are executing within a given function or have been called (directly or
indirectly) from that function, then the given function has an active instance.
lock is running in the parent but not the child, if the child calls your library function,
deadlock will ensue.
The following strategies may be used to avoid deadlock problems in these cases:
Let the caller control synchronization. This works extremely well when the library
functions are operating on independent caller-visible instances of a data structure, each
of which may be synchronized separately. For example, if the library functions operate
on a search tree, and if the application needs a large number of independent search trees,
then the application can associate a lock with each tree. The application then acquires
and releases locks as needed, so that the library need not be aware of parallelism at all.
Instead, the application controls the parallelism, so that locking can work very well, as
was discussed in Section 7.5.1.
However, this strategy fails if the library implements a data structure that requires
internal concurrency, for example, a hash table or a parallel sort. In this case, the library
absolutely must control its own synchronization.
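A hedged sketch of the per-tree arrangement just described, in which the application owns both the lock and the purely sequential library tree, might look as follows; all names are hypothetical:

#include <pthread.h>

struct lib_tree;  /* opaque, purely sequential library type */
struct lib_node;
extern int lib_tree_insert(struct lib_tree *tree, struct lib_node *node);

/* Application-side wrapper: one lock per tree, invisible to the library. */
struct app_tree {
  pthread_mutex_t lock;
  struct lib_tree *tree;
};

int app_tree_insert(struct app_tree *t, struct lib_node *node)
{
  int ret;

  pthread_mutex_lock(&t->lock);          /* application controls locking */
  ret = lib_tree_insert(t->tree, node);  /* library is parallelism-oblivious */
  pthread_mutex_unlock(&t->lock);
  return ret;
}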
The idea here is to add arguments to the library’s API to specify which locks to acquire,
how to acquire and release them, or both. This strategy allows the application to take on
the global task of avoiding deadlock by specifying which locks to acquire (by passing in
pointers to the locks in question) and how to acquire them (by passing in pointers to lock
acquisition and release functions), but also allows a given library function to control its
own concurrency by deciding where the locks should be acquired and released.
In particular, this strategy allows the lock acquisition and release functions to block
signals as needed without the library code needing to be concerned with which signals
need to be blocked by which locks. The separation of concerns used by this strategy can
be quite effective, but in some cases the strategies laid out in the following sections can
work better.
That said, passing explicit pointers to locks to external APIs must be very carefully
considered, as discussed in Section 7.1.1.4. Although this practice is sometimes the
right thing to do, you should do yourself a favor by looking into alternative designs first.
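One possible shape for such a parameterized API, with the caller passing in both the lock and the acquisition/release functions, is sketched below; all names are hypothetical:

/* Caller-supplied locking strategy: the application chooses which lock to
 * use and how to acquire it (possibly blocking signals), while the library
 * chooses where in its code the lock is held. */
struct locking_ops {
  void *lock;                    /* opaque lock object owned by the caller */
  void (*acquire)(void *lock);
  void (*release)(void *lock);
};

struct lib_table;  /* opaque library data structure */
struct lib_item;
extern int lib_table_insert_locked(struct lib_table *tbl,
                                   struct lib_item *item);

int lib_table_insert(struct lib_table *tbl, struct lib_item *item,
                     const struct locking_ops *ops)
{
  int ret;

  ops->acquire(ops->lock);
  ret = lib_table_insert_locked(tbl, item);
  ops->release(ops->lock);
  return ret;
}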
The basic rule behind this strategy was discussed in Section 7.1.1.2: “Release all locks
before invoking unknown code.” This is usually the best approach because it allows
the application to ignore the library’s locking hierarchy: the library remains a leaf or
isolated subtree of the application’s overall locking hierarchy.
In cases where it is not possible to release all locks before invoking unknown code,
the layered locking hierarchies described in Section 7.1.1.3 can work well. For example,
if the unknown code is a signal handler, this implies that the library function block
signals across all lock acquisitions, which can be complex and slow. Therefore, in
cases where signal handlers (probably unwisely) acquire locks, the strategies in the next
section may prove helpful.
1. If the application invokes the library function from within a signal handler, then
that signal must be blocked every time that the library function is invoked from
outside of a signal handler.
2. If the application invokes the library function while holding a lock acquired within
a given signal handler, then that signal must be blocked every time that the library
function is called outside of a signal handler.
These rules can be enforced by using tools similar to the Linux kernel’s lockdep
lock dependency checker [Cor06a]. One of the great strengths of lockdep is that it is
not fooled by human intuition [Ros11].
1. The data structures protected by that lock are likely to be in some intermedi-
ate state, so that naively breaking the lock might result in arbitrary memory
corruption.
2. If the child creates additional threads, two threads might break the lock concur-
rently, with the result that both threads believe they own the lock. This could
again result in arbitrary memory corruption.
The atfork() function is provided to help deal with these situations. The idea is
to register a triplet of functions, one to be called by the parent before the fork(), one
to be called by the parent after the fork(), and one to be called by the child after the
fork(). Appropriate cleanups can then be carried out at these three points.
Be warned, however, that coding of atfork() handlers is quite subtle in general.
The cases where atfork() works best are cases where the data structure in question
can simply be re-initialized by the child.
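The POSIX name for this facility is pthread_atfork(). A minimal sketch for the easy case in which the child simply re-initializes the lock follows; the lock name and the data it protects are hypothetical:

#include <pthread.h>

static pthread_mutex_t data_lock = PTHREAD_MUTEX_INITIALIZER;

static void prepare(void)       /* parent, just before fork() */
{
  pthread_mutex_lock(&data_lock);
}

static void parent_after(void)  /* parent, just after fork() */
{
  pthread_mutex_unlock(&data_lock);
}

static void child_after(void)   /* child, just after fork() */
{
  /* The easy case: re-initialize the lock (and, in a real application,
   * the data structure it protects). */
  pthread_mutex_init(&data_lock, NULL);
}

static void install_atfork_handlers(void)
{
  pthread_atfork(prepare, parent_after, child_after);
}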
These flaws and the consequences for locking are discussed in the following sections.
1. Determining when to resize the hash table. In this case, an approximate count
should work quite well. It might also be useful to trigger the resizing operation
from the length of the longest chain, which can be computed and maintained in a
nicely partitioned per-chain manner.
2. Producing an estimate of the time required to traverse the entire hash table. An
approximate count works well in this case, also.
3. For diagnostic purposes, for example, to check for items being lost when trans-
ferring them to and from the hash table. This clearly requires an exact count.
However, given that this usage is diagnostic in nature, it might suffice to maintain
the lengths of the hash chains, then to infrequently sum them up while locking
out addition and deletion operations.
It turns out that there is now a strong theoretical basis for some of the constraints that
performance and scalability place on a parallel library’s APIs [AGH+ 11a, AGH+ 11b,
McK11b]. Anyone designing a parallel library needs to pay close attention to those
constraints.
Although it is all too easy to blame locking for what are really problems due to a
concurrency-unfriendly API, doing so is not helpful. On the other hand, one has little
choice but to sympathize with the hapless developer who made this choice in (say)
1985. It would have been a rare and courageous developer to anticipate the need for
parallelism at that time, and it would have required an even more rare combination of
brilliance and luck to actually arrive at a good parallel-friendly API.
Times change, and code must change with them. That said, there might be a huge
number of users of a popular library, in which case an incompatible change to the API
would be quite foolish. Adding a parallel-friendly API to complement the existing
heavily used sequential-only API is probably the best course of action in this situation.
Nevertheless, human nature being what it is, we can expect our hapless developer
to be more likely to complain about locking than about his or her own poor (though
understandable) API design choices.
Sections 7.1.1.2, 7.1.1.3, and 7.5.2 described how undisciplined use of callbacks can
result in locking woes. These sections also described how to design your library function
to avoid these problems, but it is unrealistic to expect a 1990s programmer with no
experience in parallel programming to have followed such a design. Therefore, someone
attempting to parallelize an existing callback-heavy single-threaded library will likely
have many opportunities to curse locking’s villainy.
If there are a very large number of uses of a callback-heavy library, it may be wise to
again add a parallel-friendly API to the library in order to allow existing users to convert
their code incrementally. Alternatively, some advocate use of transactional memory in
these cases. While the jury is still out on transactional memory, Section 17.2 discusses
its strengths and weaknesses. It is important to note that hardware transactional memory
(discussed in Section 17.3) cannot help here unless the hardware transactional memory
implementation provides forward-progress guarantees, which few do. Other alternatives
that appear to be quite practical (if less heavily hyped) include the methods discussed in
Sections 7.1.1.5 and 7.1.1.6, as well as those that will be discussed in Chapters 8 and 9.
is worth some time spent thinking about not only alternative ways to accomplish that
particular task, but also alternative tasks that might better solve the problem at hand.
7.6 Summary
Locking is perhaps the most widely used and most generally useful synchronization
tool. However, it works best when designed into an application or library from the
beginning. Given the large quantity of pre-existing single-threaded code that might
need to one day run in parallel, locking should therefore not be the only tool in your
parallel-programming toolbox. The next few chapters will discuss other tools, and how
they can best be used in concert with locking and with each other.
It is mine, I tell you. My own. My precious. Yes, my
precious.
Chapter 8
Data Ownership
One of the simplest ways to avoid the synchronization overhead that comes with locking
is to parcel the data out among the threads (or, in the case of kernels, CPUs) so that a
given piece of data is accessed and modified by only one of the threads. Interestingly
enough, data ownership covers each of the “big three” parallel design techniques: It
partitions over threads (or CPUs, as the case may be), it batches all local operations, and
its elimination of synchronization operations is weakening carried to its logical extreme.
It should therefore be no surprise that data ownership is used extremely heavily; indeed, it is one usage pattern that even novices use almost instinctively. It is used so heavily that this chapter will not introduce any new examples, but will instead reference examples from previous chapters.
Quick Quiz 8.1: What form of data ownership is extremely difficult to avoid when
creating shared-memory parallel programs (for example, using pthreads) in C or C++?
There are a number of approaches to data ownership. Section 8.1 presents the
logical extreme in data ownership, where each thread has its own private address space.
Section 8.2 looks at the opposite extreme, where the data is shared, but different threads
own different access rights to the data. Section 8.3 describes function shipping, which
is a way of allowing other threads to have indirect access to data owned by a particular
thread. Section 8.4 describes how designated threads can be assigned ownership of a
specified function and the related data. Section 8.5 discusses improving performance
by transforming algorithms with shared data to instead use data ownership. Finally,
Section 8.6 lists a few software environments that feature data ownership as a first-class
citizen.
is owned by that process, so that almost the entirety of data in the above example
is owned. This approach almost entirely eliminates synchronization overhead. The
resulting combination of extreme simplicity and optimal performance is obviously quite
attractive.
Quick Quiz 8.2: What synchronization remains in the example shown in Sec-
tion 8.1?
Quick Quiz 8.3: Is there any shared data in the example shown in Section 8.1?
This same pattern can be written in C as well as in sh, as illustrated by Figures 4.2
and 4.3.
The next section discusses use of data ownership in shared-memory parallel pro-
grams.
8.5 Privatization
One way of improving the performance and scalability of a shared-memory parallel
program is to transform it so as to convert shared data to private data that is owned by a
particular thread.
An excellent example of this is shown in the answer to one of the Quick Quizzes in
Section 6.1.1, which uses privatization to produce a solution to the Dining Philosophers
problem with much better performance and scalability than that of the standard textbook
solution. The original problem has five philosophers sitting around the table with one
fork between each adjacent pair of philosophers, which permits at most two philosophers
to eat concurrently.
We can trivially privatize this problem by providing an additional five forks, so
that each philosopher has his or her own private pair of forks. This allows all five
philosophers to eat concurrently, and also offers a considerable reduction in the spread
of certain types of disease.
In other cases, privatization imposes costs. For example, consider the simple
limit counter shown in Figure 5.12 on page 67. This is an example of an algorithm
where threads can read each others’ data, but are only permitted to update their own
data. A quick review of the algorithm shows that the only cross-thread accesses are
in the summation loop in read_count(). If this loop is eliminated, we move to
the more-efficient pure data ownership, but at the cost of a less-accurate result from
read_count().
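The general shape of such an algorithm is a set of per-thread counters that only their owner updates, but that read_count() may read. The following is a minimal sketch (not the Figure 5.12 code itself; names are hypothetical), assuming that aligned unsigned-long loads are atomic on the target:

#define NR_THREADS 8  /* hypothetical maximum number of threads */

/* Each thread updates only its own counter (data ownership for writes),
 * but read_count() reads all of them (cross-thread reads). */
static unsigned long __thread my_counter;
static unsigned long *counterp[NR_THREADS];  /* registered at thread start */

static inline void inc_count(void)
{
  my_counter++;                /* owned: no synchronization needed */
}

static unsigned long read_count(void)
{
  unsigned long sum = 0;
  int t;

  for (t = 0; t < NR_THREADS; t++)
    if (counterp[t] != NULL)
      sum += *counterp[t];     /* may observe slightly stale values */
  return sum;
}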
Quick Quiz 8.7: Is it possible to obtain greater accuracy while still maintaining full
privacy of the per-thread data?
In short, privatization is a powerful tool in the parallel programmer’s toolbox, but it
must nevertheless be used with care. Just like every other synchronization primitive, it
has the potential to increase complexity while decreasing performance and scalability.
Violet Fane
Chapter 9
Deferred Processing
The strategy of deferring work goes back before the dawn of recorded history. It
has occasionally been derided as procrastination or even as sheer laziness. However,
in the last few decades workers have recognized this strategy’s value in simplifying
and streamlining parallel algorithms [KL80, Mas92]. Believe it or not, “laziness”
in parallel programming often outperforms and out-scales industriousness! These
performance and scalability benefits stem from the fact that deferring work often enables
weakening of synchronization primitives, thereby reducing synchronization overhead.
General approaches of work deferral include reference counting (Section 9.2), hazard
pointers (Section 9.3), sequence locking (Section 9.4), and RCU (Section 9.5). Finally,
Section 9.6 describes how to choose among the work-deferral schemes covered in this
chapter and Section 9.7 discusses the role of updates. But first we will introduce an
example algorithm that will be used to compare and contrast these approaches.
2 Weizenbaum discusses reference counting as if it was already well-known, so it likely dates back to
the 1950s and perhaps even to the 1940s. And perhaps even further. People repairing and maintaining large
machines have long used a mechanical reference-counting technique, where each worker had a padlock.
1 struct route_entry {
2 struct cds_list_head re_next;
3 unsigned long addr;
4 unsigned long iface;
5 };
6 CDS_LIST_HEAD(route_list);
7
8 unsigned long route_lookup(unsigned long addr)
9 {
10 struct route_entry *rep;
11 unsigned long ret;
12
13 cds_list_for_each_entry(rep,
14 &route_list, re_next) {
15 if (rep->addr == addr) {
16 ret = rep->iface;
17 return ret;
18 }
19 }
20 return ULONG_MAX;
21 }
22
23 int route_add(unsigned long addr,
24 unsigned long interface)
25 {
26 struct route_entry *rep;
27
28 rep = malloc(sizeof(*rep));
29 if (!rep)
30 return -ENOMEM;
31 rep->addr = addr;
32 rep->iface = interface;
33 cds_list_add(&rep->re_next, &route_list);
34 return 0;
35 }
36
37 int route_del(unsigned long addr)
38 {
39 struct route_entry *rep;
40
41 cds_list_for_each_entry(rep,
42 &route_list, re_next) {
43 if (rep->addr == addr) {
44 cds_list_del(&rep->re_next);
45 free(rep);
46 return 0;
47 }
48 }
49 return -ENOENT;
50 }
finally lines 34-35 invoke re_free() if the new value of the reference count is zero.
Quick Quiz 9.2: Why doesn’t route_del() in Figure 9.4 use reference counts
to protect the traversal to the element to be freed?
Figure 9.5 shows the performance and scalability of reference counting on a read-
only workload with a ten-element list running on a single-socket four-core hyperthreaded
2.5GHz x86 system. The “ideal” trace was generated by running the sequential code
shown in Figure 9.2, which works only because this is a read-only workload. The
reference-counting performance is abysmal and its scalability even more so, with the
“refcnt” trace dropping down onto the x-axis. This should be no surprise in view of
Chapter 3: The reference-count acquisitions and releases have added frequent shared-
memory writes to an otherwise read-only workload, thus incurring severe retribution
from the laws of physics. As well it should, given that all the wishful thinking in the
world is not going to increase the speed of light or decrease the size of the atoms used in modern digital electronics.
2. Thread B invokes route_del() in Figure 9.4 to delete the route entry for
address 42. It completes successfully, and because this entry’s ->re_refcnt
field was equal to the value one, it invokes re_free() to set the ->re_freed
field and to free the entry.
The problem is that the reference count is located in the object to be protected, but
that means that there is no protection during the instant in time when the reference
count itself is being acquired! This is the reference-counting counterpart of a locking
issue noted by Gamsa et al. [GKAS99]. One could imagine using a global lock or
reference count to protect the per-route-entry reference-count acquisition, but this
would result in severe contention issues. Although algorithms exist that allow safe
reference-count acquisition in a concurrent environment [Val95], they are not only
extremely complex and error-prone [MS95], but also provide terrible performance and
scalability [HMBW07].
In short, concurrency has most definitely reduced the usefulness of reference count-
ing!
Quick Quiz 9.5: If concurrency has “most definitely reduced the usefulness of
reference counting”, why are there so many reference counters in the Linux kernel?
That said, sometimes it is necessary to look at a problem in an entirely different way
in order to successfully solve it. The next section describes what could be thought of as
an inside-out reference count that provides decent performance and scalability.
1 struct route_entry {
2 struct hazptr_head hh;
3 struct route_entry *re_next;
4 unsigned long addr;
5 unsigned long iface;
6 int re_freed;
7 };
8 struct route_entry route_list;
9 DEFINE_SPINLOCK(routelock);
10 hazard_pointer __thread *my_hazptr;
11
12 unsigned long route_lookup(unsigned long addr)
13 {
14 int offset = 0;
15 struct route_entry *rep;
16 struct route_entry **repp;
17
18 retry:
19 repp = &route_list.re_next;
20 do {
21 rep = ACCESS_ONCE(*repp);
22 if (rep == NULL)
23 return ULONG_MAX;
24 if (rep == (struct route_entry *)HAZPTR_POISON)
25 goto retry;
26 my_hazptr[offset].p = &rep->hh;
27 offset = !offset;
28 smp_mb();
29 if (ACCESS_ONCE(*repp) != rep)
30 goto retry;
31 repp = &rep->re_next;
32 } while (rep->addr != addr);
33 if (ACCESS_ONCE(rep->re_freed))
34 abort();
35 return rep->iface;
36 }
The Pre-BSD routing example can use hazard pointers as shown in Figure 9.7
for data structures and route_lookup(), and in Figure 9.8 for route_add()
and route_del() (route_hazptr.c). As with reference counting, the hazard-
pointers implementation is quite similar to the sequential algorithm shown in Figure 9.2
on page 171, so only differences will be discussed.
Starting with Figure 9.7, line 2 shows the ->hh field used to queue objects pending
hazard-pointer free, line 6 shows the ->re_freed field used to detect use-after-free
bugs, and lines 24-30 attempt to acquire a hazard pointer, branching to line 18’s retry
label on failure.
In Figure 9.8, line 11 initializes ->re_freed, lines 32 and 33 poison the ->re_
next field of the newly removed object, and line 35 passes that object to the hazard
pointers’s hazptr_free_later() function, which will free that object once it is
safe to do so. The spinlocks work the same as in Figure 9.4.
Figure 9.9 shows the hazard-pointers-protected Pre-BSD routing algorithm's performance on the same read-only workload as for Figure 9.5. Although hazard pointers scale much better than reference counting does, they still require readers to write to shared memory (albeit with much improved locality of reference), and also require a full memory barrier and retry check for each object traversed. Therefore, the performance of hazard pointers falls far short of ideal. On the other hand, hazard pointers do operate correctly for workloads involving concurrent updates.
Quick Quiz 9.10: The paper “Structured Deferral: Synchronization via Procrasti-
nation” [McK13] shows that hazard pointers have near-ideal performance. Whatever
happened in Figure 9.9???
The next section attempts to improve on hazard pointers by using sequence locks,
which avoid both read-side writes and per-object memory barriers.
Figure 9.10, it is important to design code using sequence locks so that readers very
rarely need to retry.
Quick Quiz 9.11: Why isn’t this sequence-lock discussion in Chapter 7, you know,
the one on locking?
The key component of sequence locking is the sequence number, which has an even
value in the absence of updaters and an odd value if there is an update in progress.
Readers can then snapshot the value before and after each access. If either snapshot has
an odd value, or if the two snapshots differ, there has been a concurrent update, and the
reader must discard the results of the access and then retry it. Readers therefore use
the read_seqbegin() and read_seqretry() functions shown in Figure 9.11
when accessing data protected by a sequence lock. Writers must increment the value
before and after each update, and only one writer is permitted at a given time. Writers
therefore use the write_seqlock() and write_sequnlock() functions shown
in Figure 9.12 when updating data protected by a sequence lock.
As a result, sequence-lock-protected data can have an arbitrarily large number of
concurrent readers, but only one writer at a time. Sequence locking is used in the Linux
kernel to protect calibration quantities used for timekeeping. It is also used in pathname
traversal to detect concurrent rename operations.
A simple implementation of sequence locks is shown in Figure 9.13 (seqlock.h).
The seqlock_t data structure is shown on lines 1-4, and contains the sequence
number along with a lock to serialize writers. Lines 6-10 show seqlock_init(),
1 do {
2 seq = read_seqbegin(&test_seqlock);
3 /* read-side access. */
4 } while (read_seqretry(&test_seqlock, seq));
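The corresponding write-side usage is symmetric; a minimal sketch using the write_seqlock() and write_sequnlock() primitives of Figure 9.12:

write_seqlock(&test_seqlock);
/* update-side access. */
write_sequnlock(&test_seqlock);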
1 typedef struct {
2 unsigned long seq;
3 spinlock_t lock;
4 } seqlock_t;
5
6 static void seqlock_init(seqlock_t *slp)
7 {
8 slp->seq = 0;
9 spin_lock_init(&slp->lock);
10 }
11
12 static unsigned long read_seqbegin(seqlock_t *slp)
13 {
14 unsigned long s;
15
16 s = ACCESS_ONCE(slp->seq);
17 smp_mb();
18 return s & ~0x1UL;
19 }
20
21 static int read_seqretry(seqlock_t *slp,
22 unsigned long oldseq)
23 {
24 unsigned long s;
25
26 smp_mb();
27 s = ACCESS_ONCE(slp->seq);
28 return s != oldseq;
29 }
30
31 static void write_seqlock(seqlock_t *slp)
32 {
33 spin_lock(&slp->lock);
34 ++slp->seq;
35 smp_mb();
36 }
37
38 static void write_sequnlock(seqlock_t *slp)
39 {
40 smp_mb();
41 ++slp->seq;
42 spin_unlock(&slp->lock);
43 }
1 struct route_entry {
2 struct route_entry *re_next;
3 unsigned long addr;
4 unsigned long iface;
5 int re_freed;
6 };
7 struct route_entry route_list;
8 DEFINE_SEQ_LOCK(sl);
9
10 unsigned long route_lookup(unsigned long addr)
11 {
12 struct route_entry *rep;
13 struct route_entry **repp;
14 unsigned long ret;
15 unsigned long s;
16
17 retry:
18 s = read_seqbegin(&sl);
19 repp = &route_list.re_next;
20 do {
21 rep = ACCESS_ONCE(*repp);
22 if (rep == NULL) {
23 if (read_seqretry(&sl, s))
24 goto retry;
25 return ULONG_MAX;
26 }
27 repp = &rep->re_next;
28 } while (rep->addr != addr);
29 if (ACCESS_ONCE(rep->re_freed))
30 abort();
31 ret = rep->iface;
32 if (read_seqretry(&sl, s))
33 goto retry;
34 return ret;
35 }
Quick Quiz 9.15: What prevents sequence-locking updaters from starving readers?
Lines 31-36 show write_seqlock(), which simply acquires the lock, increments the sequence number, and executes a memory barrier to ensure that this increment is ordered before the caller's critical section. Lines 38-43 show write_sequnlock(), which executes a memory barrier to ensure that the caller's critical section is ordered before the increment of the sequence number on line 41, and then releases the lock.
Quick Quiz 9.16: What if something else serializes writers, so that the lock is not
needed?
Quick Quiz 9.17: Why isn’t seq on line 2 of Figure 9.13 unsigned rather than
unsigned long? After all, if unsigned is good enough for the Linux kernel,
shouldn’t it be good enough for everyone?
So what happens when sequence locking is applied to the Pre-BSD routing table?
Figure 9.14 shows the data structures and route_lookup(), and Figure 9.15 shows
route_add() and route_del() (route_seqlock.c). This implementation
is once again similar to its counterparts in earlier sections, so only the differences will
be highlighted.
In Figure 9.14, line 5 adds ->re_freed, which is checked on lines 29 and 30.
Line 8 adds a sequence lock, which is used by route_lookup() on lines 18, 23,
and 32, with lines 24 and 33 branching back to the retry label on line 17. The effect
is to retry any lookup that runs concurrently with an update.
In Figure 9.15, lines 12, 15, 24, and 40 acquire and release the sequence lock, while
lines 11, 33, and 44 handle ->re_freed. This implementation is therefore quite
straightforward.
It also performs better on the read-only workload, as can be seen in Figure 9.16,
though its performance is still far from ideal.
Unfortunately, it also suffers use-after-free failures. The problem is that the reader
might encounter a segmentation violation due to accessing an already-freed structure
before it comes to the read_seqretry().
Quick Quiz 9.18: Can this bug be fixed? In other words, can you use sequence locks
as the only synchronization mechanism protecting a linked list supporting concurrent
addition, deletion, and lookup?
Both the read-side and write-side critical sections of a sequence lock can be thought
of as transactions, and sequence locking therefore can be thought of as a limited form
of transactional memory, which will be discussed in Section 17.2. The limitations of
sequence locking are: (1) Sequence locking restricts updates and (2) sequence locking
does not permit traversal of pointers to objects that might be freed by updaters. These
limitations are of course overcome by transactional memory, but can also be overcome
by combining other synchronization primitives with sequence locking.
Sequence locks allow writers to defer readers, but not vice versa. This can result
in unfairness and even starvation in writer-heavy workloads. On the other hand, in the
absence of writers, sequence-lock readers are reasonably fast and scale linearly. It is only
human to want the best of both worlds: fast readers without the possibility of read-side
failure, let alone starvation. In addition, it would also be nice to overcome sequence
locking’s limitations with pointers. The following section presents a synchronization
mechanism with exactly these properties.
pointers covered by Section 9.3 use implicit counters in the guise of per-thread lists of pointers. This avoids read-side contention, but requires full memory barriers in read-side
primitives. The sequence lock presented in Section 9.4 also avoids read-side contention,
but does not protect pointer traversals and, like hazard pointers, requires full memory
barriers in read-side primitives. These schemes’ shortcomings raise the question of
whether it is possible to do better.
This section introduces read-copy update (RCU), which provides an API that allows
delays to be identified in the source code, rather than as expensive updates to shared data.
The remainder of this section examines RCU from a number of different perspectives.
Section 9.5.1 provides the classic introduction to RCU, Section 9.5.2 covers fundamental
RCU concepts, Section 9.5.3 introduces some common uses of RCU, Section 9.5.4
presents the Linux-kernel API, Section 9.5.5 covers a sequence of “toy” implementations
of user-level RCU, and finally Section 9.5.6 provides some RCU exercises.
4 On many computer systems, simple assignment is insufficient due to interference from both the compiler
and, on DEC Alpha systems, the CPU as well. This will be covered in Section 9.5.2.
6 And yet again, this approximates reality, which will be expanded on in Section 9.5.2.
[Figure: publication of a dynamically allocated structure: (1) gptr initially empty, (2) kmalloc() returns a structure with ->addr and ->iface uninitialized, (3) initialization sets ->addr=42 and ->iface=1, (4) gptr = p (almost) makes the structure visible to readers]
shows that this can also result in long delays, just as can the locking and sequence-
locking approaches that we already rejected.
Let’s consider the logical extreme where the readers do absolutely nothing to
announce their presence. This approach clearly allows optimal performance for readers
(after all, free is a very good price), but leaves open the question of how the updater can
possibly determine when all the old readers are done. We clearly need some additional
constraints if we are to provide a reasonable answer to this question.
One constraint that fits well with some operating-system kernels is to consider the
case where threads are not subject to preemption. In such non-preemptible environments,
each thread runs until it explicitly and voluntarily blocks. This means that an infinite
loop without blocking will render a CPU useless for any other purpose from the start of
the infinite loop onwards.7 Non-preemptibility also requires that threads be prohibited
from blocking while holding spinlocks. Without this prohibition, all CPUs might be
consumed by threads spinning attempting to acquire a spinlock held by a blocked thread.
The spinning threads will not relinquish their CPUs until they acquire the lock, but
the thread holding the lock cannot possibly release it until one of the spinning threads
relinquishes a CPU. This is a classic deadlock situation.
Let us impose this same constraint on reader threads traversing the linked list:
7 In contrast, an infinite loop in a preemptible environment might be preempted. This infinite loop might
still waste considerable CPU time, but the CPU in question would nevertheless be able to do other work.
such threads are not allowed to block until after completing their traversal.
[Figure 9.18: deletion of element B from list A-B-C with concurrent readers: (1) one version, readers present; (2) after list_del() (almost), two versions, since pre-existing readers may still reference B; (3) once those readers complete, one version; (4) after free(), the list is A-C]
Returning
to the second row of Figure 9.18, where the updater has just completed executing
list_del(), imagine that CPU 0 executes a context switch. Because readers are
not permitted to block while traversing the linked list, we are guaranteed that all prior
readers that might have been running on CPU 0 will have completed. Extending this
line of reasoning to the other CPUs, once each CPU has been observed executing a
context switch, we are guaranteed that all prior readers have completed, and that there
are no longer any reader threads referencing element B. The updater can then safely
free element B, resulting in the state shown at the bottom of Figure 9.18.
This approach is termed quiescent state based reclamation (QSBR) [HMB06]. A
QSBR schematic is shown in Figure 9.19, with time advancing from the top of the figure
to the bottom.
Although production-quality implementations of this approach can be quite complex,
a toy implementation is exceedingly simple:
1 for_each_online_cpu(cpu)
2 run_on(cpu);
The for_each_online_cpu() primitive iterates over all CPUs, and the run_
on() function causes the current thread to execute on the specified CPU, which forces
the destination CPU to execute a context switch. Therefore, once the for_each_
online_cpu() has completed, each CPU has executed a context switch, which in
turn guarantees that all pre-existing reader threads have completed.
[Figure 9.19: QSBR schematic: list_del(), then a grace period that ends once each CPU has executed a context switch (and hence all pre-existing readers have completed), then free()]
Please note that this approach is not production quality. Correct handling of a
number of corner cases and the need for a number of powerful optimizations mean that
production-quality implementations have significant additional complexity. In addition,
RCU implementations for preemptible environments require that readers actually do
something. However, this simple non-preemptible approach is conceptually complete,
and forms a good initial basis for understanding the RCU fundamentals covered in the
following section.
1 struct foo {
2 int a;
3 int b;
4 int c;
5 };
6 struct foo *gp = NULL;
7
8 /* . . . */
9
10 p = kmalloc(sizeof(*p), GFP_KERNEL);
11 p->a = 1;
12 p->b = 2;
13 p->c = 3;
14 gp = p;
possibly work). This section addresses these questions from a fundamental viewpoint; later sections look at them from usage and from API viewpoints. The last of these also includes a list of references.
RCU is made up of three fundamental mechanisms, the first being used for insertion,
the second being used for deletion, and the third being used to allow readers to tolerate
concurrent insertions and deletions. Section 9.5.2.1 describes the publish-subscribe
mechanism used for insertion, Section 9.5.2.2 describes how waiting for pre-existing
RCU readers enables deletion, and Section 9.5.2.3 discusses how maintaining multiple
versions of recently updated objects permits concurrent insertions and deletions. Finally,
Section 9.5.2.4 summarizes RCU fundamentals.
Although this code fragment might well seem immune to misordering, unfortunately,
the DEC Alpha CPU [McK05a, McK05b] and value-speculation compiler optimizations
can, believe it or not, cause the values of p->a, p->b, and p->c to be fetched before
the value of p. This is perhaps easiest to see in the case of value-speculation compiler optimizations, where the compiler guesses the value of p, fetches p->a, p->b, and p->c, and then fetches the actual value of p in order to check whether its guess was correct.
This sort of optimization is quite aggressive, perhaps insanely so, but does actually
occur in the context of profile-driven optimization.
Clearly, we need to prevent this sort of skullduggery on the part of both the compiler
and the CPU. The rcu_dereference() primitive uses whatever memory-barrier
instructions and compiler directives are required for this purpose:8
1 rcu_read_lock();
2 p = rcu_dereference(gp);
3 if (p != NULL) {
4 do_something_with(p->a, p->b, p->c);
5 }
6 rcu_read_unlock();
a memory barrier instruction. In the C11 and C++11 standards, memory_order_consume is intended
to provide longer-term support for rcu_dereference(), but no compilers implement this natively yet.
(They instead strengthen memory_order_consume to memory_order_acquire, thus emitting a
needless memory-barrier instruction on weakly ordered systems.)
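On the update side, the corresponding publish operation is rcu_assign_pointer(), which supplies whatever ordering is needed to ensure that the structure is initialized before it is made visible to readers. The earlier assignment fragment then becomes (sketch only):

p = kmalloc(sizeof(*p), GFP_KERNEL);
p->a = 1;
p->b = 2;
p->c = 3;
rcu_assign_pointer(gp, p);  /* orders initialization before publication */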
1 struct foo {
2 struct list_head *list;
3 int a;
4 int b;
5 int c;
6 };
7 LIST_HEAD(head);
8
9 /* . . . */
10
11 p = kmalloc(sizeof(*p), GFP_KERNEL);
12 p->a = 1;
13 p->b = 2;
14 p->c = 3;
15 list_add_rcu(&p->list, &head);
1 struct foo {
2 struct hlist_node *list;
3 int a;
4 int b;
5 int c;
6 };
7 HLIST_HEAD(head);
8
9 /* . . . */
10
11 p = kmalloc(sizeof(*p), GFP_KERNEL);
12 p->a = 1;
13 p->b = 2;
14 p->c = 3;
15 hlist_add_head_rcu(&p->list, &head);
The set of RCU publish and subscribe primitives are shown in Table 9.1, along with
additional primitives to “unpublish”, or retract.
Note that the list_replace_rcu(), list_del_rcu(), hlist_replace_
rcu(), and hlist_del_rcu() APIs add a complication. When is it safe to free
up the data element that was replaced or removed? In particular, how can we possibly
know when all the readers have released their references to that data element?
These questions are addressed in the following section.
[Figure: RCU readers and an RCU grace period: the grace period extends as needed until all pre-existing readers complete; removal precedes the grace period and reclamation follows it]
2. Wait for all pre-existing RCU read-side critical sections to completely finish (for
example, by using the synchronize_rcu() primitive or its asynchronous
counterpart, call_rcu(), which invokes a specified function at the end of a
future grace period). The key observation here is that subsequent RCU read-side
critical sections have no way to gain a reference to the newly removed element.
3. Clean up, for example, free the element that was replaced above.
The code fragment shown in Figure 9.27, adapted from those in Section 9.5.2.1,
demonstrates this process, with field a being the search key.
Lines 19, 20, and 21 implement the three steps called out above. Lines 16-19 give RCU (“read-copy update”) its name: while permitting concurrent reads, line 16 copies and lines 17-19 do an update.
As discussed in Section 9.5.1, the synchronize_rcu() primitive can be quite
simple (see Section 9.5.5 for additional “toy” RCU implementations). However,
production-quality implementations must deal with difficult corner cases and also incor-
porate powerful optimizations, both of which result in significant complexity. Although
it is good to know that there is a simple conceptual implementation of synchronize_
rcu(), other questions remain. For example, what exactly do RCU readers see when
1 struct foo {
2 struct list_head *list;
3 int a;
4 int b;
5 int c;
6 };
7 LIST_HEAD(head);
8
9 /* . . . */
10
11 p = search(head, key);
12 if (p == NULL) {
13 /* Take appropriate action, unlock, & return. */
14 }
15 q = kmalloc(sizeof(*p), GFP_KERNEL);
16 *q = *p;
17 q->b = 2;
18 q->c = 3;
19 list_replace_rcu(&p->list, &q->list);
20 synchronize_rcu();
21 kfree(p);
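The deletion code that the following paragraphs walk through (referencing list_del_rcu() on line 3 and synchronize_rcu() on line 4) is not reproduced above; a minimal reconstruction consistent with those references is:

1 p = search(head, key);
2 if (p != NULL) {
3   list_del_rcu(&p->list);
4   synchronize_rcu();
5   kfree(p);
6 }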
This code will update the list as shown in Figure 9.28. The triples in each element
represent the values of fields a, b, and c, respectively. The red-shaded elements indicate
that RCU readers might be holding references to them, so in the initial state at the
top of the diagram, all elements are shaded red. Please note that we have omitted the
backwards pointers and the link from the tail of the list to the head for clarity.
After the list_del_rcu() on line 3 has completed, the 5,6,7 element has
been removed from the list, as shown in the second row of Figure 9.28. Since readers do
not synchronize directly with updaters, readers might be concurrently scanning this list.
These concurrent readers might or might not see the newly removed element, depending
on timing. However, readers that were delayed (e.g., due to interrupts, ECC memory
errors, or, in CONFIG_PREEMPT_RT kernels, preemption) just after fetching a pointer
to the newly removed element might see the old version of the list for quite some time
after the removal.
[Figure 9.28: deletion of element 5,6,7 via list_del_rcu(), synchronize_rcu(), and kfree(); elements 1,2,3 and 11,4,8 remain throughout]
Therefore, we now have two versions of the list, one with element
5,6,7 and one without. The 5,6,7 element in the second row of the figure is now
shaded yellow, indicating that old readers might still be referencing it, but that new
readers cannot obtain a reference to it.
Please note that readers are not permitted to maintain references to element 5,6,7
after exiting from their RCU read-side critical sections. Therefore, once the synchronize_
rcu() on line 4 completes, so that all pre-existing readers are guaranteed to have
completed, there can be no more readers referencing this element, as indicated by its
green shading on the third row of Figure 9.28. We are thus back to a single version of
the list.
At this point, the 5,6,7 element may safely be freed, as shown on the final row
of Figure 9.28. At this point, we have completed the deletion of element 5,6,7. The
following example covers replacement.
The initial state of the list, including the pointer p, is the same as for the deletion
example, as shown on the first row of Figure 9.29.
As before, the triples in each element represent the values of fields a, b, and c,
[Figure 9.29: replacement of element 5,6,7: allocate (?,?,?), copy (5,6,7), update to 5,2,3, list_replace_rcu(), synchronize_rcu(), then kfree() of the old element]
Discussion These examples assumed that a mutex was held across the entire update
operation, which would mean that there could be at most two versions of the list active
at a given time.
Quick Quiz 9.21: How would you modify the deletion example to permit more
than two versions of the list to be active?
Quick Quiz 9.22: How many RCU versions of a given list can be active at any
given time?
This sequence of events shows how RCU updates use multiple versions to safely carry out changes in the presence of concurrent readers. Of course, some algorithms cannot
gracefully handle multiple versions. There are techniques for adapting such algorithms
to RCU [McK04], but these are beyond the scope of this section.
Quick Quiz 9.23: How can RCU updaters possibly delay RCU readers, given
that the rcu_read_lock() and rcu_read_unlock() primitives neither spin
nor block?
These three RCU components allow data to be updated in face of concurrent readers,
and can be combined in different ways to implement a surprising variety of different
types of RCU-based algorithms, some of which are described in the following section.
1 struct route_entry {
2 struct rcu_head rh;
3 struct cds_list_head re_next;
4 unsigned long addr;
5 unsigned long iface;
6 int re_freed;
7 };
8 CDS_LIST_HEAD(route_list);
9 DEFINE_SPINLOCK(routelock);
10
11 unsigned long route_lookup(unsigned long addr)
12 {
13 struct route_entry *rep;
14 unsigned long ret;
15
16 rcu_read_lock();
17 cds_list_for_each_entry_rcu(rep, &route_list,
18 re_next) {
19 if (rep->addr == addr) {
20 ret = rep->iface;
21 if (ACCESS_ONCE(rep->re_freed))
22 abort();
23 rcu_read_unlock();
24 return ret;
25 }
26 }
27 rcu_read_unlock();
28 return ULONG_MAX;
29 }
The answer to this is shown in Figure 9.33, which presents the RCU QSBR results as the
trace between the RCU and the ideal traces. RCU QSBR’s performance and scalability
is very nearly that of an ideal synchronization-free workload, as desired.
Quick Quiz 9.24: Why doesn’t RCU QSBR give exactly ideal results?
Quick Quiz 9.25: Given RCU QSBR’s read-side performance, why bother with
any other flavor of userspace RCU?
[Figure 9.33: lookups per millisecond versus number of CPUs (threads), 1-8, for the ideal, RCU, seqlock, hazptr, and refcnt implementations]
[Figure: per-operation overhead (nanoseconds, log scale) versus number of CPUs, comparing rwlock and rcu]
[Figure: per-operation overhead (nanoseconds, log scale) versus number of CPUs, comparing rwlock and rcu]
[Figure: per-operation overhead (nanoseconds) versus critical-section duration (microseconds), comparing rwlock and rcu]
Note that do_update() is executed under the protection of the lock and under
RCU read-side protection.
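The pattern being described, with the update-side lock acquired inside an RCU read-side critical section, looks as follows; the lock name is hypothetical and the fragment is only a sketch:

rcu_read_lock();
spin_lock(&update_lock);   /* hypothetical update-side lock */
do_update();               /* runs under both the lock and RCU protection */
spin_unlock(&update_lock);
rcu_read_unlock();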
Another interesting consequence of RCU’s deadlock immunity is its immunity to a
large class of priority inversion problems. For example, low-priority RCU readers cannot
prevent a high-priority RCU updater from acquiring the update-side lock. Similarly, a
low-priority RCU updater cannot prevent high-priority RCU readers from entering an
RCU read-side critical section.
Quick Quiz 9.29: Immunity to both deadlock and priority inversion??? Sounds too
good to be true. Why should I believe that this is even possible?
Realtime Latency Because RCU read-side primitives neither spin nor block, they
offer excellent realtime latencies. In addition, as noted earlier, this means that they are
immune to priority inversion involving the RCU read-side primitives and locks.
However, RCU is susceptible to more subtle priority-inversion scenarios, for exam-
ple, a high-priority process blocked waiting for an RCU grace period to elapse can be
blocked by low-priority RCU readers in -rt kernels. This can be solved by using RCU
priority boosting [McK07c, GMTW08].
[Figure 9.37: Timeline of reader-writer-locking readers versus RCU readers around the "Update Received" point.]
RCU Readers and Updaters Run Concurrently Because RCU readers neither spin
nor block, and because updaters are not subject to any sort of rollback or abort semantics,
RCU readers and updaters must necessarily run concurrently. This means that RCU
readers might access stale data, and might even see inconsistencies, either of which can
render conversion from reader-writer locking to RCU non-trivial.
However, in a surprisingly large number of situations, inconsistencies and stale data
are not problems. The classic example is the networking routing table. Because routing
updates can take considerable time to reach a given system (seconds or even minutes),
the system will have been sending packets the wrong way for quite some time when
the update arrives. It is usually not a problem to continue sending packets the wrong
way for a few additional milliseconds. Furthermore, because RCU updaters can make
changes without waiting for RCU readers to finish, the RCU readers might well see the
change more quickly than would batch-fair reader-writer-locking readers, as shown in
Figure 9.37.
Once the update is received, the rwlock writer cannot proceed until the last reader
completes, and subsequent readers cannot proceed until the writer completes. However,
these subsequent readers are guaranteed to see the new value, as indicated by the green
shading of the rightmost boxes. In contrast, RCU readers and updaters do not block
each other, which permits the RCU readers to see the updated values sooner. Of course,
because their execution overlaps that of the RCU updater, all of the RCU readers might
well see updated values, including the three readers that started before the update.
Nevertheless only the green-shaded rightmost RCU readers are guaranteed to see the
updated values.
Reader-writer locking and RCU simply provide different guarantees. With reader-
writer locking, any reader that begins after the writer begins is guaranteed to see new
values, and any reader that attempts to begin while the writer is spinning might or
might not see new values, depending on the reader/writer preference of the rwlock
implementation in question. In contrast, with RCU, any reader that begins after the
updater completes is guaranteed to see new values, and any reader that completes after
the updater begins might or might not see new values, depending on timing.
The key point here is that, although reader-writer locking does indeed guarantee
consistency within the confines of the computer system, there are situations where this
consistency comes at the price of increased inconsistency with the outside world. In
other words, reader-writer locking obtains internal consistency at the price of silently
stale data with respect to the outside world.
Nevertheless, there are situations where inconsistency and stale data within the
confines of the system cannot be tolerated. Fortunately, there are a number of approaches
that avoid inconsistency and stale data [McK04, ACMS03], and some methods based
on reference counting are discussed in Section 9.2.
RCU Grace Periods Extend for Many Milliseconds With the exception of QRCU
and several of the “toy” RCU implementations described in Section 9.5.5, RCU grace
periods extend for multiple milliseconds. Although there are a number of techniques to
render such long delays harmless, including use of the asynchronous interfaces where
available (call_rcu() and call_rcu_bh()), this situation is a major reason for
the rule of thumb that RCU be used in read-mostly situations.
Comparison of Reader-Writer Locking and RCU Code In the best case, the con-
version from reader-writer locking to RCU is quite simple, as shown in Figures 9.38,
9.39, and 9.40, all taken from Wikipedia [MPA+ 06].
Reader-writer locking:

struct el {
  struct list_head lp;
  long key;
  spinlock_t mutex;
  int data;
  /* Other data fields */
};
DEFINE_RWLOCK(listmutex);
LIST_HEAD(head);

int search(long key, int *result)
{
  struct el *p;

  read_lock(&listmutex);
  list_for_each_entry(p, &head, lp) {
    if (p->key == key) {
      *result = p->data;
      read_unlock(&listmutex);
      return 1;
    }
  }
  read_unlock(&listmutex);
  return 0;
}

RCU:

struct el {
  struct list_head lp;
  long key;
  spinlock_t mutex;
  int data;
  /* Other data fields */
};
DEFINE_SPINLOCK(listmutex);
LIST_HEAD(head);

int search(long key, int *result)
{
  struct el *p;

  rcu_read_lock();
  list_for_each_entry_rcu(p, &head, lp) {
    if (p->key == key) {
      *result = p->data;
      rcu_read_unlock();
      return 1;
    }
  }
  rcu_read_unlock();
  return 0;
}
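The delete() function of Figure 9.40 is not reproduced above; under the same conventions, an RCU-based deletion might look like the following sketch (illustrative, not necessarily the exact figure):

int delete(long key)
{
  struct el *p;

  spin_lock(&listmutex);
  list_for_each_entry(p, &head, lp) {
    if (p->key == key) {
      list_del_rcu(&p->lp);          /* unlink; readers may still hold p */
      spin_unlock(&listmutex);
      synchronize_rcu();             /* wait for pre-existing readers */
      kfree(p);                      /* now safe to free */
      return 1;
    }
  }
  spin_unlock(&listmutex);
  return 0;
}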
More-elaborate cases of replacing reader-writer locking with RCU are beyond the
scope of this document.
The assignment to head prevents any future references to p from being acquired,
and the synchronize_rcu() waits for any previously acquired references to be
released.
Quick Quiz 9.30: But wait! This is exactly the same code that might be used when
thinking of RCU as a replacement for reader-writer locking! What gives?
Of course, RCU can also be combined with traditional reference counting, as
discussed in Section 13.2.
But why bother? Again, part of the answer is performance, as shown in Figure 9.41,
again showing data taken on a 16-CPU 3GHz Intel x86 system.
Quick Quiz 9.31: Why the dip in refcnt overhead near 6 CPUs?
[Figure 9.41: Overhead (nanoseconds, log scale) versus number of CPUs, 0-16, comparing refcnt and rcu.]
[Figure 9.42: Overhead (nanoseconds) versus critical-section duration (microseconds), 0-10, comparing refcnt and rcu.]
And, as with reader-writer locking, the performance advantages of RCU are most
pronounced for short-duration critical sections, as shown in Figure 9.42 for a 16-CPU
system. In addition, as with reader-writer locking, many system calls (and thus any
RCU read-side critical sections that they contain) complete in a few microseconds.
However, the restrictions that go with RCU can be quite onerous. For example, in
many cases, the prohibition against sleeping while in an RCU read-side critical section
would defeat the entire purpose. The next section looks at ways of addressing this
problem, while also reducing the complexity of traditional reference counting, at least
in some cases.
all pre-existing RCU read-side critical sections to complete, line 19 frees the newly
removed element, and line 20 indicates success. If the element is no longer the one we
want, line 22 releases the lock, line 23 leaves the RCU read-side critical section, and
line 24 indicates failure to delete the specified key.
Quick Quiz 9.33: Why is it OK to exit the RCU read-side critical section on line 15
of Figure 9.43 before releasing the lock on line 17?
Quick Quiz 9.34: Why not exit the RCU read-side critical section on line 23 of
Figure 9.43 before releasing the lock on line 22?
Alert readers will recognize this as only a slight variation on the original “RCU
is a way of waiting for things to finish” theme, which is addressed in Section 9.5.3.8.
They might also note the deadlock-immunity advantages over the lock-based existence
guarantees discussed in Section 7.4.
prevent any data from a SLAB_DESTROY_BY_RCU slab ever being returned to the
system, possibly resulting in OOM events?
These algorithms typically use a validation step that checks to make sure that the
newly referenced data structure really is the one that was requested [LS86, Section 2.5].
These validation checks require that portions of the data structure remain untouched by
the free-reallocate process. Such validation checks are usually very hard to get right,
and can hide subtle and difficult bugs.
Therefore, although type-safety-based lockless algorithms can be extremely helpful
in a very few difficult situations, you should instead use existence guarantees where
possible. Simpler is after all almost always better!
1. Make a change, for example, to the way that the OS reacts to an NMI.
2. Wait for all pre-existing read-side critical sections to completely finish (for ex-
ample, by using the synchronize_sched() primitive). The key observation
here is that subsequent RCU read-side critical sections are guaranteed to see
whatever change was made.
3. Clean up, for example, return status indicating that the change was successfully
made.
The remainder of this section presents example code adapted from the Linux ker-
nel. In this example, the timer_stop function uses synchronize_sched() to
ensure that all in-flight NMI notifications have completed before freeing the associated
resources. A simplified version of this code is shown in Figure 9.44.
Lines 1-4 define a profile_buffer structure, containing a size and an indefinite
array of entries. Line 5 defines a pointer to a profile buffer, which is presumably
initialized elsewhere to point to a dynamically allocated region of memory.
Lines 7-16 define the nmi_profile() function, which is called from within an
NMI handler. As such, it cannot be preempted, nor can it be interrupted by a normal
interrupt handler; however, it is still subject to delays due to cache misses, ECC errors,
and cycle stealing by other hardware threads within the same core. Line 9 gets a local
pointer to the profile buffer using the rcu_dereference() primitive to ensure
memory ordering on DEC Alpha, and lines 11 and 12 exit from this function if there is
no profile buffer currently allocated, while lines 13 and 14 exit from this function if the
pcvalue argument is out of range.
1 struct profile_buffer {
2 long size;
3 atomic_t entry[0];
4 };
5 static struct profile_buffer *buf = NULL;
6
7 void nmi_profile(unsigned long pcvalue)
8 {
9 struct profile_buffer *p = rcu_dereference(buf);
10
11 if (p == NULL)
12 return;
13 if (pcvalue >= p->size)
14 return;
15 atomic_inc(&p->entry[pcvalue]);
16 }
17
18 void nmi_stop(void)
19 {
20 struct profile_buffer *p = buf;
21
22 if (p == NULL)
23 return;
24 rcu_assign_pointer(buf, NULL);
25 synchronize_sched();
26 kfree(p);
27 }
In the meantime, Figure 9.45 shows some rough rules of thumb on where RCU is
most helpful.
As shown in the blue box at the top of the figure, RCU works best if you have
read-mostly data where stale and inconsistent data is permissible (but see below for
more information on stale and inconsistent data). The canonical example of this case
in the Linux kernel is routing tables. Because it may have taken many seconds or
even minutes for the routing updates to propagate across the Internet, the system has been
sending packets the wrong way for quite some time. Having some small probability of
continuing to send some of them the wrong way for a few more milliseconds is almost
never a problem.
If you have a read-mostly workload where consistent data is required, RCU works
well, as shown by the green “read-mostly, need consistent data” box. One example
of this case is the Linux kernel’s mapping from user-level System-V semaphore IDs
to the corresponding in-kernel data structures. Semaphores tend to be used far more
frequently than they are created and destroyed, so this mapping is read-mostly. However,
it would be erroneous to perform a semaphore operation on a semaphore that has
already been deleted. This need for consistency is handled by using the lock in the
in-kernel semaphore data structure, along with a “deleted” flag that is set when deleting
a semaphore. If a user ID maps to an in-kernel data structure with the “deleted” flag set,
the data structure is ignored, so that the user ID is flagged as invalid.
Although this requires that the readers acquire a lock for the data structure repre-
senting the semaphore itself, it allows them to dispense with locking for the mapping
data structure. The readers therefore locklessly traverse the tree used to map from ID to
data structure, which in turn greatly improves performance, scalability, and real-time
response.
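A minimal sketch of this lookup pattern is shown below; the structure and function names are hypothetical, not the Linux kernel's actual System-V semaphore code:

#include <linux/rcupdate.h>
#include <linux/spinlock.h>

struct my_sem {
  int id;
  spinlock_t lock;
  int deleted;        /* set under ->lock when the semaphore is removed */
};

/* Assumed lockless, RCU-protected ID-to-structure lookup. */
struct my_sem *my_sem_find(int id);

struct my_sem *my_sem_lock(int id)
{
  struct my_sem *sp;

  rcu_read_lock();
  sp = my_sem_find(id);
  if (sp == NULL) {
    rcu_read_unlock();
    return NULL;
  }
  spin_lock(&sp->lock);
  if (sp->deleted) {            /* removed while we were looking it up */
    spin_unlock(&sp->lock);
    rcu_read_unlock();
    return NULL;                /* treat the ID as invalid */
  }
  rcu_read_unlock();            /* ->lock now provides the existence guarantee */
  return sp;                    /* caller must release sp->lock */
}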
As indicated by the yellow “read-write” box, RCU can also be useful for read-write
workloads where consistent data is required, although usually in conjunction with a
number of other synchronization primitives. For example, the directory-entry cache in
recent Linux kernels uses RCU in conjunction with sequence locks, per-CPU locks, and
per-data-structure locks to allow lockless traversal of pathnames in the common case.
Although RCU can be very beneficial in this read-write case, such use is often more
complex than that of the read-mostly cases.
Finally, as indicated by the red box at the bottom of the figure, update-mostly
workloads requiring consistent data are rarely good places to use RCU, though there are
some exceptions [DMS+ 12]. In addition, as noted in Section 9.5.3.7, within the Linux
kernel, the SLAB_DESTROY_BY_RCU slab-allocator flag provides type-safe memory
to RCU readers, which can greatly simplify non-blocking synchronization and other
lockless algorithms.
In short, RCU is an API that includes a publish-subscribe mechanism for adding
new data, a way of waiting for pre-existing RCU readers to finish, and a discipline of
maintaining multiple versions to allow updates to avoid harming or unduly delaying
concurrent RCU readers. This RCU API is best suited for read-mostly situations,
especially if stale and inconsistent data can be tolerated by the application.
1. RCU BH: read-side critical sections must guarantee forward progress against
everything except for NMI and interrupt handlers, but not including software-
interrupt (softirq) handlers. RCU BH is global in scope.
2. RCU Sched: read-side critical sections must guarantee forward progress against
everything except for NMI and irq handlers, including softirq handlers. RCU
Sched is global in scope.
3. RCU (both classic and real-time): read-side critical sections must guarantee
forward progress against everything except for NMI handlers, irq handlers,
softirq handlers, and (in the real-time case) higher-priority real-time tasks.
RCU is global in scope.
4. SRCU: read-side critical sections need not guarantee forward progress unless
some other task is waiting for the corresponding grace period to complete, in
which case these read-side critical sections should complete in no more than a
few seconds (and preferably much more quickly).10 SRCU’s scope is defined by
the use of the corresponding srcu_struct.
[Figure 9.46: RCU API usage constraints, showing which primitives (for example, rcu_assign_pointer(), call_rcu(), and synchronize_rcu()) may be used from NMI, IRQ, and process context.]
Quick Quiz 9.48: Are there any downsides to the fact that these traversal and update
primitives can be used with any of the RCU API family members?
Figure 9.46 shows which APIs may be used in which in-kernel environments. The
RCU read-side primitives may be used in any environment, including NMI, the RCU
mutation and asynchronous grace-period primitives may be used in any environment
other than NMI, and, finally, the RCU synchronous grace-period primitives may be used
only in process context. The RCU list-traversal primitives include list_for_each_
entry_rcu(), hlist_for_each_entry_rcu(), etc. Similarly, the RCU list-
mutation primitives include list_add_rcu(), hlist_del_rcu(), etc.
Note that primitives from other families of RCU may be substituted, for example,
srcu_read_lock() may be used in any context in which rcu_read_lock()
may be used.
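For concreteness, a minimal sketch of Linux-kernel SRCU usage follows; the srcu_read_lock(), srcu_read_unlock(), and synchronize_srcu() calls are the kernel's SRCU API, but the surrounding code is illustrative only:

#include <linux/srcu.h>

static struct srcu_struct my_srcu;  /* defines this SRCU domain's scope */

/* init_srcu_struct(&my_srcu) must run before the first reader. */

void my_reader(void)
{
  int idx;

  idx = srcu_read_lock(&my_srcu);
  /* SRCU read-side critical section: blocking is permitted here. */
  srcu_read_unlock(&my_srcu, idx);
}

void my_updater(void)
{
  /* ... remove the item from the SRCU-protected structure ... */
  synchronize_srcu(&my_srcu);       /* waits only for my_srcu's readers */
  /* ... now safe to free the removed item ... */
}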
At its core, RCU is nothing more nor less than an API that supports publication and
subscription for insertions, waiting for all RCU readers to complete, and maintenance
of multiple versions. That said, it is possible to build higher-level constructs on top of
RCU, including the reader-writer-locking, reference-counting, and existence-guarantee
constructs listed in Section 9.5.3. Furthermore, I have no doubt that the Linux com-
munity will continue to find interesting new uses for RCU, just as they do for any of a
number of synchronization primitives throughout the kernel.
Of course, a more-complete view of RCU would also include all of the things you
can do with these APIs.
However, for many people, a complete view of RCU must include sample RCU
implementations. The next section therefore presents a series of “toy” RCU implemen-
tations of increasing complexity and capability.
1 atomic_t rcu_refcnt;
2
3 static void rcu_read_lock(void)
4 {
5 atomic_inc(&rcu_refcnt);
6 smp_mb();
7 }
8
9 static void rcu_read_unlock(void)
10 {
11 smp_mb();
12 atomic_dec(&rcu_refcnt);
13 }
14
15 void synchronize_rcu(void)
16 {
17 smp_mb();
18 while (atomic_read(&rcu_refcnt) != 0) {
19 poll(NULL, 0, 10);
20 }
21 smp_mb();
22 }
However, this implementation still has some serious shortcomings. First, the
atomic operations in rcu_read_lock() and rcu_read_unlock() are still quite
heavyweight, with read-side overhead ranging from about 100 nanoseconds on a single
Power5 CPU up to almost 40 microseconds on a 64-CPU system. This means that
the RCU read-side critical sections have to be extremely long in order to get any real
read-side parallelism. On the other hand, in the absence of readers, grace periods elapse
in about 40 nanoseconds, many orders of magnitude faster than production-quality
implementations in the Linux kernel.
Quick Quiz 9.55: How can the grace period possibly elapse in 40 nanoseconds
when synchronize_rcu() contains a 10-millisecond delay?
Second, if there are many concurrent rcu_read_lock() and rcu_read_
unlock() operations, there will be extreme memory contention on rcu_refcnt,
resulting in expensive cache misses.
1 DEFINE_SPINLOCK(rcu_gp_lock);
2 atomic_t rcu_refcnt[2];
3 atomic_t rcu_idx;
4 DEFINE_PER_THREAD(int, rcu_nesting);
5 DEFINE_PER_THREAD(int, rcu_read_idx);
Design It is the two-element rcu_refcnt[] array that provides the freedom from
starvation. The key point is that synchronize_rcu() is only required to wait for
pre-existing readers. If a new reader starts after a given instance of synchronize_
rcu() has already begun execution, then that instance of synchronize_rcu()
need not wait on that new reader. At any given time, when a given reader enters its RCU
read-side critical section via rcu_read_lock(), it increments the element of the
rcu_refcnt[] array indicated by the rcu_idx variable. When that same reader
exits its RCU read-side critical section via rcu_read_unlock(), it decrements
whichever element it incremented, ignoring any possible subsequent changes to the
rcu_idx value.
This arrangement means that synchronize_rcu() can avoid starvation by
complementing the value of rcu_idx, as in rcu_idx = !rcu_idx. Suppose that
the old value of rcu_idx was zero, so that the new value is one. New readers that arrive
after the complement operation will increment rcu_refcnt[1], while the old readers that
previously incremented rcu_refcnt[0] will decrement rcu_refcnt[0] when they exit
their RCU read-side critical sections. This means that the value of rcu_refcnt[0] will
no longer be incremented, and thus will be monotonically decreasing. This means that
all that synchronize_rcu() need do is wait for the value of rcu_refcnt[0] to
reach zero.
With this background, we are ready to look at the implementation of the actual
primitives.
1 void synchronize_rcu(void)
2 {
3 int i;
4
5 smp_mb();
6 spin_lock(&rcu_gp_lock);
7 i = atomic_read(&rcu_idx);
8 atomic_set(&rcu_idx, !i);
9 smp_mb();
10 while (atomic_read(&rcu_refcnt[i]) != 0) {
11 poll(NULL, 0, 10);
12 }
13 smp_mb();
14 atomic_set(&rcu_idx, i);
15 smp_mb();
16 while (atomic_read(&rcu_refcnt[!i]) != 0) {
17 poll(NULL, 0, 10);
18 }
19 spin_unlock(&rcu_gp_lock);
20 smp_mb();
21 }
Discussion There are still some serious shortcomings. First, the atomic operations
in rcu_read_lock() and rcu_read_unlock() are still quite heavyweight. In
fact, they are more complex than those of the single-counter variant shown in Figure 9.49,
with the read-side primitives consuming about 150 nanoseconds on a single Power5 CPU
and almost 40 microseconds on a 64-CPU system. The update-side synchronize_
rcu() primitive is more costly as well, ranging from about 200 nanoseconds on a
single Power5 CPU to more than 40 microseconds on a 64-CPU system. This means
that the RCU read-side critical sections have to be extremely long in order to get any
real read-side parallelism.
Second, if there are many concurrent rcu_read_lock() and rcu_read_
unlock() operations, there will be extreme memory contention on the rcu_refcnt
elements, resulting in expensive cache misses. This further extends the RCU read-side
critical-section duration required to provide parallel read-side access. These first two
shortcomings defeat the purpose of RCU in most situations.
Third, the need to flip rcu_idx twice imposes substantial overhead on updates,
1 DEFINE_SPINLOCK(rcu_gp_lock);
2 DEFINE_PER_THREAD(int [2], rcu_refcnt);
3 atomic_t rcu_idx;
4 DEFINE_PER_THREAD(int, rcu_nesting);
5 DEFINE_PER_THREAD(int, rcu_read_idx);
1 DEFINE_SPINLOCK(rcu_gp_lock);
2 DEFINE_PER_THREAD(int [2], rcu_refcnt);
3 long rcu_idx;
4 DEFINE_PER_THREAD(int, rcu_nesting);
5 DEFINE_PER_THREAD(int, rcu_read_idx);
Figure 9.56: RCU Read-Side Using Per-Thread Reference-Count Pair and Shared
Update Data
1 static void rcu_read_lock(void)
2 {
3 int i;
4 int n;
5
6 n = __get_thread_var(rcu_nesting);
7 if (n == 0) {
8 i = ACCESS_ONCE(rcu_idx) & 0x1;
9 __get_thread_var(rcu_read_idx) = i;
10 __get_thread_var(rcu_refcnt)[i]++;
11 }
12 __get_thread_var(rcu_nesting) = n + 1;
13 smp_mb();
14 }
15
16 static void rcu_read_unlock(void)
17 {
18 int i;
19 int n;
20
21 smp_mb();
22 n = __get_thread_var(rcu_nesting);
23 if (n == 1) {
24 i = __get_thread_var(rcu_read_idx);
25 __get_thread_var(rcu_refcnt)[i]--;
26 }
27 __get_thread_var(rcu_nesting) = n - 1;
28 }
Figure 9.57: RCU Read-Side Using Per-Thread Reference-Count Pair and Shared
Update
increases linearly with the number of threads, imposing substantial overhead on applica-
tions with large numbers of threads.
Third, as before, although concurrent RCU updates could in principle be satisfied
by a common grace period, this implementation serializes grace periods, preventing
grace-period sharing.
Finally, as noted in the text, the need for per-thread variables and for enumerating
threads may be problematic in some software environments.
That said, the read-side primitives scale very nicely, requiring about 115 nanoseconds
regardless of whether running on a single-CPU or a 64-CPU Power5 system. As noted
above, the synchronize_rcu() primitive does not scale, ranging in overhead from
almost a microsecond on a single Power5 CPU up to almost 200 microseconds on a
64-CPU system. This implementation could conceivably form the basis for a production-
quality user-level RCU implementation.
The next section describes an algorithm permitting more efficient concurrent RCU
updates.
grace periods. The main difference from the earlier implementation shown in Fig-
ure 9.54 is that rcu_idx is now a long that counts freely, so that line 8 of Figure 9.57
must mask off the low-order bit. We also switched from using atomic_read() and
atomic_set() to using ACCESS_ONCE(). The data is also quite similar, as shown
in Figure 9.56, with rcu_idx now being a long instead of an atomic_t.
Figure 9.58 (rcu_rcpls.c) shows the implementation of synchronize_rcu()
and its helper function flip_counter_and_wait(). These are similar to those in
Figure 9.55. The differences in flip_counter_and_wait() include:
1. Line 6 uses ACCESS_ONCE() instead of atomic_set(), and increments
rather than complementing.
2. A new line 7 masks the counter down to its bottom bit.
The changes to synchronize_rcu() are more pervasive:
1. There is a new oldctr local variable that captures the pre-lock-acquisition value
of rcu_idx on line 23.
2. Line 26 uses ACCESS_ONCE() instead of atomic_read().
3. Lines 27-30 check to see if at least three counter flips were performed by other
threads while the lock was being acquired, and, if so, releases the lock, does
a memory barrier, and returns. In this case, there were two full waits for the
counters to go to zero, so those other threads already did all the required work.
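The following sketch is consistent with this description of Figure 9.58, using the data declarations shown below; it is an illustrative reconstruction, so the actual figure may differ in detail:

static void flip_counter_and_wait(int ctr)
{
  int i;
  int t;

  ACCESS_ONCE(rcu_idx) = ctr + 1;   /* increment rather than complement */
  i = ctr & 0x1;                    /* mask down to the bottom bit */
  smp_mb();
  for_each_thread(t) {              /* wait for pre-existing readers on this index */
    while (per_thread(rcu_refcnt, t)[i] != 0) {
      poll(NULL, 0, 10);
    }
  }
  smp_mb();
}

void synchronize_rcu(void)
{
  int ctr;
  int oldctr;

  smp_mb();
  oldctr = ACCESS_ONCE(rcu_idx);    /* pre-lock-acquisition snapshot */
  smp_mb();
  spin_lock(&rcu_gp_lock);
  ctr = ACCESS_ONCE(rcu_idx);
  if (ctr - oldctr >= 3) {          /* other threads already did the work */
    spin_unlock(&rcu_gp_lock);
    smp_mb();
    return;
  }
  flip_counter_and_wait(ctr);
  if (ctr - oldctr < 2)             /* need a second flip unless one was shared */
    flip_counter_and_wait(ctr + 1);
  spin_unlock(&rcu_gp_lock);
  smp_mb();
}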
1 DEFINE_SPINLOCK(rcu_gp_lock);
2 long rcu_gp_ctr = 0;
3 DEFINE_PER_THREAD(long, rcu_reader_gp);
4 DEFINE_PER_THREAD(long, rcu_reader_gp_snap);
Figure 9.60 (rcu.h and rcu.c) shows an RCU implementation based on a single
global free-running counter that takes on only even-numbered values, with data shown in
Figure 9.59. The resulting rcu_read_lock() implementation is extremely straight-
forward. Lines 3 and 4 simply add one to the global free-running rcu_gp_ctr variable
and store the resulting odd-numbered value into the rcu_reader_gp per-thread
variable. Line 5 executes a memory barrier to prevent the content of the subsequent
RCU read-side critical section from “leaking out”.
The rcu_read_unlock() implementation is similar. Line 10 executes a mem-
ory barrier, again to prevent the prior RCU read-side critical section from “leaking out”.
Lines 11 and 12 then copy the rcu_gp_ctr global variable to the rcu_reader_gp
per-thread variable, leaving this per-thread variable with an even-numbered value so
that a concurrent instance of synchronize_rcu() will know to ignore it.
Quick Quiz 9.63: If any even value is sufficient to tell synchronize_rcu() to
ignore a given task, why don’t lines 10 and 11 of Figure 9.60 simply assign zero to
rcu_reader_gp?
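The read-side primitives described above might be sketched as follows; this is an illustrative reconstruction consistent with the description, not necessarily the exact code of Figure 9.60:

static void rcu_read_lock(void)
{
  /* Store an odd value so that synchronize_rcu() waits for us. */
  __get_thread_var(rcu_reader_gp) = ACCESS_ONCE(rcu_gp_ctr) + 1;
  smp_mb();
}

static void rcu_read_unlock(void)
{
  smp_mb();
  /* Store an even value so that synchronize_rcu() ignores us. */
  __get_thread_var(rcu_reader_gp) = ACCESS_ONCE(rcu_gp_ctr);
}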
Thus, synchronize_rcu() could wait for all of the per-thread rcu_reader_
gp variables to take on even-numbered values. However, it is possible to do much better
than that because synchronize_rcu() need only wait on pre-existing RCU read-
side critical sections. Line 19 executes a memory barrier to prevent prior manipulations
of RCU-protected data structures from being reordered (by either the CPU or the
compiler) to follow the increment on line 21. Line 20 acquires the rcu_gp_lock
(and line 30 releases it) in order to prevent multiple synchronize_rcu() instances
from running concurrently. Line 21 then increments the global rcu_gp_ctr variable
1 DEFINE_SPINLOCK(rcu_gp_lock);
2 #define RCU_GP_CTR_SHIFT 7
3 #define RCU_GP_CTR_BOTTOM_BIT (1 << RCU_GP_CTR_SHIFT)
4 #define RCU_GP_CTR_NEST_MASK (RCU_GP_CTR_BOTTOM_BIT - 1)
5 long rcu_gp_ctr = 0;
6 DEFINE_PER_THREAD(long, rcu_reader_gp);
by two, so that all pre-existing RCU read-side critical sections will have corresponding
per-thread rcu_reader_gp variables with values less than that of rcu_gp_ctr,
modulo the machine’s word size. Recall also that threads with even-numbered values
of rcu_reader_gp are not in an RCU read-side critical section, so that lines 23-29
scan the rcu_reader_gp values until they all are either even (line 24) or are greater
than the global rcu_gp_ctr (lines 25-26). Line 27 blocks for a short period of time
to wait for a pre-existing RCU read-side critical section, but this can be replaced with a
spin-loop if grace-period latency is of the essence. Finally, the memory barrier at line 31
ensures that any subsequent destruction will not be reordered into the preceding loop.
Quick Quiz 9.64: Why are the memory barriers on lines 19 and 31 of Figure 9.60
needed? Aren’t the memory barriers inherent in the locking primitives on lines 20
and 30 sufficient?
This approach achieves much better read-side performance, incurring roughly
63 nanoseconds of overhead regardless of the number of Power5 CPUs. Updates
incur more overhead, ranging from about 500 nanoseconds on a single Power5 CPU to
more than 100 microseconds on 64 such CPUs.
Quick Quiz 9.65: Couldn’t the update-side batching optimization described in
Section 9.5.5.6 be applied to the implementation shown in Figure 9.60?
This implementation suffers from some serious shortcomings in addition to the high
update-side overhead noted earlier. First, it is no longer permissible to nest RCU read-
side critical sections, a topic that is taken up in the next section. Second, if a reader is
preempted at line 3 of Figure 9.60 after fetching from rcu_gp_ctr but before storing
to rcu_reader_gp, and if the rcu_gp_ctr counter then runs through more than
half but less than all of its possible values, then synchronize_rcu() will ignore
the subsequent RCU read-side critical section. Third and finally, this implementation
requires that the enclosing software environment be able to enumerate threads and
maintain per-thread variables.
Quick Quiz 9.66: Is the possibility of readers being preempted in lines 3-4 of
Figure 9.60 a real problem, in other words, is there a real sequence of events that could
lead to failure? If not, why not? If so, what is the sequence of events, and how can the
failure be addressed?
1 DEFINE_SPINLOCK(rcu_gp_lock);
2 long rcu_gp_ctr = 0;
3 DEFINE_PER_THREAD(long, rcu_reader_qs_gp);
1 void synchronize_rcu(void)
2 {
3 int t;
4
5 smp_mb();
6 spin_lock(&rcu_gp_lock);
7 rcu_gp_ctr += 2;
8 smp_mb();
9 for_each_thread(t) {
10 while (rcu_gp_ongoing(t) &&
11 ((per_thread(rcu_reader_qs_gp, t) -
12 rcu_gp_ctr) < 0)) {
13 poll(NULL, 0, 10);
14 }
15 }
16 spin_unlock(&rcu_gp_lock);
17 smp_mb();
18 }
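Read-side primitives consistent with this quiescent-state-based synchronize_rcu() might be sketched as follows; this is an illustrative reconstruction, assuming an rcu_quiescent_state() primitive by which each thread announces its quiescent states:

static void rcu_read_lock(void)   /* generates no code */
{
}

static void rcu_read_unlock(void) /* generates no code */
{
}

static void rcu_quiescent_state(void)
{
  smp_mb();
  /* Odd value: this thread is online and has caught up to rcu_gp_ctr. */
  __get_thread_var(rcu_reader_qs_gp) = ACCESS_ONCE(rcu_gp_ctr) + 1;
  smp_mb();
}

static void rcu_thread_offline(void)
{
  smp_mb();
  /* Even value: this thread need not be waited on at all. */
  __get_thread_var(rcu_reader_qs_gp) = ACCESS_ONCE(rcu_gp_ctr);
  smp_mb();
}

static int rcu_gp_ongoing(int t)
{
  /* An odd-valued counter means the thread is still being waited on. */
  return per_thread(rcu_reader_qs_gp, t) & 0x1;
}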
rcu(), and then acquire that same lock within an RCU read-side critical section? This
should be a deadlock, but how can a primitive that generates absolutely no code possibly
participate in a deadlock cycle?
In addition, this implementation does not permit concurrent calls to synchronize_
rcu() to share grace periods. That said, one could easily imagine a production-quality
RCU implementation based on this version of RCU.
Quick Quiz 9.75: Given that grace periods are prohibited within RCU read-side
critical sections, how can an RCU data structure possibly be updated while in an RCU
read-side critical section?
This section is organized as a series of Quick Quizzes that invite you to apply RCU
to a number of examples earlier in this book. The answer to each Quick Quiz gives
some hints, and also contains a pointer to a later section where the solution is explained at
length. The rcu_read_lock(), rcu_read_unlock(), rcu_dereference(),
rcu_assign_pointer(), and synchronize_rcu() primitives should suffice
for most of these exercises.
Quick Quiz 9.76: The statistical-counter implementation shown in Figure 5.9
(count_end.c) used a global lock to guard the summation in read_count(),
which resulted in poor performance and negative scalability. How could you use RCU
to provide read_count() with excellent performance and good scalability? (Keep in
mind that read_count()’s scalability will necessarily be limited by its need to scan
all threads’ counters.)
Quick Quiz 9.77: Section 5.5 showed a fanciful pair of code fragments that dealt
with counting I/O accesses to removable devices. These code fragments suffered from
high overhead on the fastpath (starting an I/O) due to the need to acquire a reader-writer
lock. How would you use RCU to provide excellent performance and scalability? (Keep
in mind that the performance of the common-case first code fragment that does I/O
accesses is much more important than that of the device-removal code fragment.)
We have already seen one situation featuring high performance and scalability
for writers, namely the counting algorithms surveyed in Chapter 5. These algorithms
featured partially partitioned data structures so that updates can operate locally, while the
more-expensive reads must sum across the entire data structure. Silas Boyd-Wickizer
has generalized this notion to produce OpLog, which he has applied to Linux-kernel
pathname lookup, VM reverse mappings, and the stat() system call [BW14].
Another approach, called “Disruptor”, is designed for applications that process
high-volume streams of input data. The approach is to rely on single-producer-single-
consumer FIFO queues, minimizing the need for synchronization [Sut13]. For Java
applications, Disruptor also has the virtue of minimizing use of the garbage collector.
And of course, where feasible, fully partitioned or “sharded” systems provide
excellent performance and scalability, as noted in Chapter 6.
The next chapter will look at updates in the context of several types of data struc-
tures.
Bad programmers worry about the code. Good
programmers worry about data structures and their
relationships.
Linus Torvalds
Chapter 10
Data Structures
Births, captures, and purchases result in insertions, while deaths, releases, and sales
result in deletions. Because Schrödinger’s zoo contains a large quantity of short-lived
animals, including mice and insects, the database must be able to support a high update
rate.
Those interested in Schrödinger’s animals can query them; however, Schrödinger
has noted extremely high rates of queries for his cat, so much so that he suspects that
his mice might be using the database to check up on their nemesis. This means that
Schrödinger’s application must be able to support a high rate of queries to a single data
element.
Please keep this application in mind as various data structures are presented.
1 struct ht_elem {
2 struct cds_list_head hte_next;
3 unsigned long hte_hash;
4 };
5
6 struct ht_bucket {
7 struct cds_list_head htb_head;
8 spinlock_t htb_lock;
9 };
10
11 struct hashtab {
12 unsigned long ht_nbuckets;
13 struct ht_bucket ht_bkt[0];
14 };
[Figure 10.2: A hashtab structure with ->ht_nbuckets = 4; bucket 0's list contains two ht_elem structures, bucket 2's list contains one, and buckets 1 and 3 are empty.]
caches the corresponding element’s hash value in the ->hte_hash field. The ht_
elem structure would be included in the larger structure being placed in the hash table,
and this larger structure might contain a complex key.
The diagram shown in Figure 10.2 has bucket 0 with two elements and bucket 2
with one.
Figure 10.3 shows mapping and locking functions. Lines 1 and 2 show the macro
HASH2BKT(), which maps from a hash value to the corresponding ht_bucket
structure. This macro uses a simple modulus: if more aggressive hashing is required, the
1 #define HASH2BKT(htp, h) \
2 (&(htp)->ht_bkt[h % (htp)->ht_nbuckets])
3
4 static void hashtab_lock(struct hashtab *htp,
5 unsigned long hash)
6 {
7 spin_lock(&HASH2BKT(htp, hash)->htb_lock);
8 }
9
10 static void hashtab_unlock(struct hashtab *htp,
11 unsigned long hash)
12 {
13 spin_unlock(&HASH2BKT(htp, hash)->htb_lock);
14 }
1 struct ht_elem *
2 hashtab_lookup(struct hashtab *htp,
3 unsigned long hash,
4 void *key,
5 int (*cmp)(struct ht_elem *htep,
6 void *key))
7 {
8 struct ht_bucket *htb;
9 struct ht_elem *htep;
10
11 htb = HASH2BKT(htp, hash);
12 cds_list_for_each_entry(htep,
13 &htb->htb_head,
14 hte_next) {
15 if (htep->hte_hash != hash)
16 continue;
17 if (cmp(htep, key))
18 return htep;
19 }
20 return NULL;
21 }
caller needs to implement it when mapping from key to hash value. The remaining two
functions acquire and release the ->htb_lock corresponding to the specified hash
value.
Figure 10.4 shows hashtab_lookup(), which returns a pointer to the element
with the specified hash and key if it exists, or NULL otherwise. This function takes both
a hash value and a pointer to the key because this allows users of this function to use
arbitrary keys and arbitrary hash functions, with the key-comparison function passed
in via cmp(), in a manner similar to qsort(). Line 11 maps from the hash value
to a pointer to the corresponding hash bucket. Each pass through the loop spanning
lines 12-19 examines one element of the bucket’s hash chain. Line 15 checks to see if
the hash values match, and if not, line 16 proceeds to the next element. Line 17 checks
to see if the actual key matches, and if so, line 18 returns a pointer to the matching
element. If no element matches, line 20 returns NULL.
Quick Quiz 10.2: But isn’t the double comparison on lines 15-18 in Figure 10.4
inefficient in the case where the key fits into an unsigned long?
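For concreteness, the following hypothetical element type and comparison function illustrate how hashtab_lookup() might be used; the zoo_* names and the identity hash are assumptions, not part of the implementation above:

/* Hypothetical element type; ht_elem placed first so a cast suffices. */
struct zoo_elem {
  struct ht_elem he;
  unsigned long key;
  int data;
};

static int zoo_cmp(struct ht_elem *htep, void *key)
{
  return ((struct zoo_elem *)htep)->key == *(unsigned long *)key;
}

/* Copy out ->data for the given key, using the key itself as the hash. */
int zoo_lookup_data(struct hashtab *htp, unsigned long key, int *data)
{
  struct ht_elem *htep;
  int ret = 0;

  hashtab_lock(htp, key);
  htep = hashtab_lookup(htp, key, &key, zoo_cmp);
  if (htep != NULL) {
    *data = ((struct zoo_elem *)htep)->data;
    ret = 1;
  }
  hashtab_unlock(htp, key);
  return ret;
}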
Figure 10.5 shows the hashtab_add() and hashtab_del() functions that
add and delete elements from the hash table, respectively.
The hashtab_add() function simply sets the element’s hash value on line 6, then
adds it to the corresponding bucket on lines 7 and 8. The hashtab_del() function
simply removes the specified element from whatever hash chain it is on, courtesy of the
1 struct hashtab *
2 hashtab_alloc(unsigned long nbuckets)
3 {
4 struct hashtab *htp;
5 int i;
6
7 htp = malloc(sizeof(*htp) +
8 nbuckets *
9 sizeof(struct ht_bucket));
10 if (htp == NULL)
11 return NULL;
12 htp->ht_nbuckets = nbuckets;
13 for (i = 0; i < nbuckets; i++) {
14 CDS_INIT_LIST_HEAD(&htp->ht_bkt[i].htb_head);
15 spin_lock_init(&htp->ht_bkt[i].htb_lock);
16 }
17 return htp;
18 }
19
20 void hashtab_free(struct hashtab *htp)
21 {
22 free(htp);
23 }
[Figure: Total lookups per millisecond versus number of CPUs (threads), 1-8, with the ideal trace shown for comparison.]
doubly linked nature of the hash-chain lists. Before calling either of these two functions,
the caller is required to ensure that no other thread is accessing or modifying this same
bucket, for example, by invoking hashtab_lock() beforehand.
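A sketch consistent with this description of Figure 10.5 (illustrative, rather than the exact figure) follows:

void hashtab_add(struct hashtab *htp, unsigned long hash,
                 struct ht_elem *htep)
{
  htep->hte_hash = hash;                     /* cache the hash value */
  cds_list_add(&htep->hte_next,              /* link into the bucket's list */
               &HASH2BKT(htp, hash)->htb_head);
}

void hashtab_del(struct ht_elem *htep)
{
  cds_list_del_init(&htep->hte_next);        /* unlink from its hash chain */
}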
[Figure: Total lookups per millisecond versus number of CPUs (threads), out to 60 CPUs, for the 1024-bucket hash table.]
Socket  Cores (hardware threads)
0        0  1  2  3  4  5  6  7    32 33 34 35 36 37 38 39
1        8  9 10 11 12 13 14 15    40 41 42 43 44 45 46 47
2       16 17 18 19 20 21 22 23    48 49 50 51 52 53 54 55
3       24 25 26 27 28 29 30 31    56 57 58 59 60 61 62 63
One key property of the Schrödinger’s-zoo runs discussed thus far is that they are
all read-only. This makes the performance degradation due to lock-acquisition-induced
cache misses all the more painful. Even though we are not updating the underlying hash
table itself, we are still paying the price for writing to memory. Of course, if the hash
table was never going to be updated, we could dispense entirely with mutual exclusion.
This approach is quite straightforward and is left as an exercise for the reader. But
even with the occasional update, avoiding writes avoids cache misses, and allows the
read-mostly data to be replicated across all the caches, which in turn promotes locality
of reference.
The next section therefore examines optimizations that can be carried out in read-
mostly cases where updates are rare, but could happen at any time.
1 struct ht_elem
2 *hashtab_lookup(struct hashtab *htp,
3 unsigned long hash,
4 void *key,
5 int (*cmp)(struct ht_elem *htep,
6 void *key))
7 {
8 struct ht_bucket *htb;
9 struct ht_elem *htep;
10
11 htb = HASH2BKT(htp, hash);
12 cds_list_for_each_entry_rcu(htep,
13 &htb->htb_head,
14 hte_next) {
15 if (htep->hte_hash != hash)
16 continue;
17 if (cmp(htep, key))
18 return htep;
19 }
20 return NULL;
21 }
1 void
2 hashtab_add(struct hashtab *htp,
3 unsigned long hash,
4 struct ht_elem *htep)
5 {
6 htep->hte_hash = hash;
7 cds_list_add_rcu(&htep->hte_next,
8 &HASH2BKT(htp, hash)->htb_head);
9 }
10
11 void hashtab_del(struct ht_elem *htep)
12 {
13 cds_list_del_rcu(&htep->hte_next);
14 }
[Figure: Total lookups per millisecond versus number of CPUs (threads), log scale, comparing the ideal, RCU, hazptr, bucket, and global traces.]
lookups, doesn’t that mean that a lookup could return a reference to a data element that
was deleted immediately after it was looked up?
Figure 10.12 shows hashtab_add() and hashtab_del(), both of which are
quite similar to their counterparts in the non-RCU hash table shown in Figure 10.5. The
hashtab_add() function uses cds_list_add_rcu() instead of cds_list_
add() in order to ensure proper ordering when an element is added to the hash
table at the same time that it is being looked up. The hashtab_del() function
uses cds_list_del_rcu() instead of cds_list_del_init() to allow for the
case where an element is looked up just before it is deleted. Unlike cds_list_
del_init(), cds_list_del_rcu() leaves the forward pointer intact, so that
hashtab_lookup() can traverse to the newly deleted element’s successor.
Of course, after invoking hashtab_del(), the caller must wait for an RCU grace
period (e.g., by invoking synchronize_rcu()) before freeing or otherwise reusing
the memory for the newly deleted element.
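For example, a hypothetical removal helper might look like this; the zoo_remove() name and the enclosing-element parameter are illustrative only:

/* Remove the element enclosing htep, then reclaim it after a grace period. */
void zoo_remove(struct hashtab *htp, unsigned long hash,
                struct ht_elem *htep, void *enclosing_elem)
{
  hashtab_lock(htp, hash);     /* exclude concurrent updates to this bucket */
  hashtab_del(htep);           /* pre-existing RCU readers may still see it */
  hashtab_unlock(htp, hash);
  synchronize_rcu();           /* wait for those pre-existing readers */
  free(enclosing_elem);        /* now safe to reclaim the enclosing element */
}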
scalability despite the larger numbers of threads and the NUMA effects. Results from a
globally locked implementation are also shown, and as expected the results are even
worse than those of the per-bucket-locked implementation. RCU does slightly better
than hazard pointers, but the difference is not readily visible in this log-scale plot.
Figure 10.14 shows the same data on a linear scale. This drops the global-locking
trace into the x-axis, but allows the relative performance of RCU and hazard pointers to
be more readily discerned. Both show a change in slope at 32 CPUs, and this is due to
hardware multithreading. At 32 and fewer CPUs, each thread has a core to itself. In this
regime, RCU does better than does hazard pointers because hazard pointers’s read-side
memory barriers result in dead time within the core. In short, RCU is better able to
utilize a core from a single hardware thread than is hazard pointers.
This situation changes above 32 CPUs. Because RCU is using more than half of
each core’s resources from a single hardware thread, RCU gains relatively little benefit
from the second hardware thread in each core. The slope of hazard pointers’s trace also
decreases at 32 CPUs, but less dramatically, because the second hardware thread is able
to fill in the time that the first hardware thread is stalled due to memory-barrier latency.
As we will see in later sections, hazard pointers’s second-hardware-thread advantage
depends on the workload.
As noted earlier, Schrödinger is surprised by the popularity of his cat [Sch35], but
recognizes the need to reflect this popularity in his design. Figure 10.15 shows the
results of 60-CPU runs, varying the number of CPUs that are doing nothing but looking
up the cat. Both RCU and hazard pointers respond well to this challenge, but bucket
locking scales negatively, eventually performing even worse than global locking. This
should not be a surprise because if all CPUs are doing nothing but looking up the cat,
the lock corresponding to the cat’s bucket is for all intents and purposes a global lock.
This cat-only benchmark illustrates one potential problem with fully partitioned
sharding approaches. Only the CPUs associated with the cat’s partition are able to access
the cat, limiting the cat-only throughput. Of course, a great many applications have
good load-spreading properties, and for these applications sharding works quite well.
However, sharding does not handle “hot spots” very well, with the hot spot exemplified
by Schrödinger’s cat being but one case in point.
Of course, if we were only ever going to read the data, we would not need any
[Figure 10.15: Lookups per millisecond versus number of CPUs (threads) looking up the cat, log scale, with the bucket and global traces falling below RCU and hazard pointers.]
[Figure 10.16: Lookups per millisecond versus number of CPUs doing updates, log scale, comparing the RCU, hazptr, bucket, and global traces.]
concurrency control to begin with. Figure 10.16 therefore shows the effect of updates.
At the extreme left-hand side of this graph, all 60 CPUs are doing lookups, while to
the right all 60 CPUs are doing updates. For all four implementations, the number of
lookups per millisecond decreases as the number of updating CPUs increases, of course
reaching zero lookups per millisecond when all 60 CPUs are updating. RCU does
well relative to hazard pointers due to the fact that hazard pointers’s read-side memory
barriers incur greater overhead in the presence of updates. It therefore seems likely
that modern hardware heavily optimizes memory-barrier execution, greatly reducing
memory-barrier overhead in the read-only case.
Where Figure 10.16 showed the effect of increasing update rates on lookups, Fig-
ure 10.17 shows the effect of increasing update rates on the updates themselves. Hazard
pointers and RCU start off with a significant advantage because, unlike bucket locking,
readers do not exclude updaters. However, as the number of updating CPUs increases,
update-side overhead starts to make its presence known, first for RCU and then for
hazard pointers. Of course, all three of these implementations fare much better than
does global locking.
[Figure 10.17: Updates per millisecond versus number of CPUs doing updates, log scale, comparing the bucket, RCU, hazard-pointer, and global traces.]
Of course, it is quite possible that the differences in lookup performance are affected
by the differences in update rates. One way to check this is to artificially throttle the
update rates of per-bucket locking and hazard pointers to match that of RCU. Doing so
does not significantly improve the lookup performance of per-bucket locking, nor does
it close the gap between hazard pointers and RCU. However, removing hazard point-
ers’s read-side memory barriers (thus resulting in an unsafe implementation of hazard
pointers) does nearly close the gap between hazard pointers and RCU. Although this
unsafe hazard-pointer implementation will usually be reliable enough for benchmarking
purposes, it is absolutely not recommended for production use.
Quick Quiz 10.6: The dangers of extrapolating from eight CPUs to 60 CPUs were
made quite clear in Section 10.2.3. But why should extrapolating up from 60 CPUs be
any safer?
[Figure: Initial two-bucket hash table containing elements A, B, C, and D.]
developed for the Linux kernel by Herbert Xu [Xu10], and is described in the following
sections. The other two are covered briefly in Section 10.4.4.
The key insight behind the first hash-table implementation is that each data element
can have two sets of list pointers, with one set currently being used by RCU readers (as
well as by non-RCU updaters) and the other being used to construct a new resized hash
table. This approach allows lookups, insertions, and deletions to all run concurrently
with a resize operation (as well as with each other).
The resize operation proceeds as shown in Figures 10.20-10.23, with the initial
two-bucket state shown in Figure 10.20 and with time advancing from figure to figure.
The initial state uses the zero-index links to chain the elements into hash buckets. A
four-bucket array is allocated, and the one-index links are used to chain the elements
into these four new hash buckets. This results in state (b) shown in Figure 10.21, with
readers still using the original two-bucket array.
The new four-bucket array is exposed to readers and then a grace-period operation
waits for all readers, resulting in state (c), shown in Figure 10.22. In this state, all
readers are using the new four-bucket array, which means that the old two-bucket array
may now be freed, resulting in state (d), shown in Figure 10.23.
This design leads to a relatively straightforward implementation, which is the subject
of the next section.
[Figures: Intermediate and final states of the resize operation, with elements A, B, C, and D linked into the old two-bucket array and the new four-bucket array (see Figures 10.21-10.23).]
1 struct ht_elem {
2 struct rcu_head rh;
3 struct cds_list_head hte_next[2];
4 unsigned long hte_hash;
5 };
6
7 struct ht_bucket {
8 struct cds_list_head htb_head;
9 spinlock_t htb_lock;
10 };
11
12 struct ht {
13 long ht_nbuckets;
14 long ht_resize_cur;
15 struct ht *ht_new;
16 int ht_idx;
17 void *ht_hash_private;
18 int (*ht_cmp)(void *hash_private,
19 struct ht_elem *htep,
20 void *key);
21 long (*ht_gethash)(void *hash_private,
22 void *key);
23 void *(*ht_getkey)(struct ht_elem *htep);
24 struct ht_bucket ht_bkt[0];
25 };
26
27 struct hashtab {
28 struct ht *ht_cur;
29 spinlock_t ht_lock;
30 };
->ht_nbuckets field on line 13. The size is stored in the same structure containing
the array of buckets (->ht_bkt[] on line 24) in order to avoid mismatches between
the size and the array. The ->ht_resize_cur field on line 14 is equal to −1 unless
a resize operation is in progress, in which case it indicates the index of the bucket whose
elements are being inserted into the new hash table, which is referenced by the ->ht_
new field on line 15. If there is no resize operation in progress, ->ht_new is NULL.
Thus, a resize operation proceeds by allocating a new ht structure and referencing it via
the ->ht_new pointer, then advancing ->ht_resize_cur through the old table’s
buckets. When all the elements have been added to the new table, the new table is linked
into the hashtab structure’s ->ht_cur field. Once all old readers have completed,
the old hash table’s ht structure may be freed.
The ->ht_idx field on line 16 indicates which of the two sets of list pointers are
being used by this instantiation of the hash table, and is used to index the ->hte_
next[] array in the ht_elem structure on line 3.
The ->ht_hash_private, ->ht_cmp(), ->ht_gethash(), and ->ht_
getkey() fields on lines 17-23 collectively define the per-element key and the
hash function. The ->ht_hash_private allows the hash function to be per-
turbed [McK90a, McK90b, McK91], which can be used to avoid denial-of-service
attacks based on statistical estimation of the parameters used in the hash function. The
->ht_cmp() function compares a specified key with that of the specified element,
the ->ht_gethash() calculates the specified key’s hash, and ->ht_getkey()
extracts the key from the enclosing data element.
The ht_bucket structure is the same as before, and the ht_elem structure differs
from that of previous implementations only in providing a two-element array of list
pointer sets in place of the prior single set of list pointers.
In a fixed-sized hash table, bucket selection is quite straightforward: Simply trans-
form the hash value to the corresponding bucket index. In contrast, when resizing, it is
also necessary to determine which of the old and new sets of buckets to select from. If
the bucket that would be selected from the old table has already been distributed into
the new table, then the bucket should be selected from the new table. Conversely, if the
bucket that would be selected from the old table has not yet been distributed, then the
bucket should be selected from the old table.
Bucket selection is shown in Figure 10.25, which shows ht_get_bucket_
single() on lines 1-8 and ht_get_bucket() on lines 10-24. The ht_get_
bucket_single() function returns a reference to the bucket corresponding to the
specified key in the specified hash table, without making any allowances for resizing.
It also stores the hash value corresponding to the key into the location referenced by
parameter b on lines 5 and 6. Line 7 then returns a reference to the corresponding
bucket.
The ht_get_bucket() function handles hash-table selection, invoking ht_
get_bucket_single() on line 16 to select the bucket corresponding to the hash in
the current hash table, storing the hash value through parameter b. If line 17 determines
that the table is being resized and that line 16’s bucket has already been distributed
across the new hash table, then line 18 selects the new hash table and line 19 selects
the bucket corresponding to the hash in the new hash table, again storing the hash value
through parameter b.
Quick Quiz 10.7: The code in Figure 10.25 computes the hash twice! Why this
blatant inefficiency?
If line 21 finds that parameter i is non-NULL, then line 22 stores the pointer-set
index for the selected hash table. Finally, line 23 returns a reference to the selected hash
bucket.
Quick Quiz 10.8: How does the code in Figure 10.25 protect against the resizing
process progressing past the selected bucket?
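The following sketch is consistent with the line-by-line description of Figure 10.25; it is an illustrative reconstruction, so details may differ from the actual figure:

static struct ht_bucket *
ht_get_bucket_single(struct ht *htp, void *key, long *b)
{
  /* Store the bucket index corresponding to the key's hash through *b. */
  *b = htp->ht_gethash(htp->ht_hash_private, key) %
       htp->ht_nbuckets;
  return &htp->ht_bkt[*b];
}

static struct ht_bucket *
ht_get_bucket(struct ht **htp, void *key, long *b, int *i)
{
  struct ht_bucket *htbp;

  htbp = ht_get_bucket_single(*htp, key, b);
  if (*b <= (*htp)->ht_resize_cur) {   /* bucket already distributed? */
    *htp = (*htp)->ht_new;             /* switch to the new table */
    htbp = ht_get_bucket_single(*htp, key, b);
  }
  if (i)
    *i = (*htp)->ht_idx;               /* which pointer set to use */
  return htbp;
}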
This implementation of ht_get_bucket_single() and ht_get_bucket()
will permit lookups and modifications to run concurrently with a resize operation.
Read-side concurrency control is provided by RCU as was shown in Figure 10.10, but
1 struct ht_elem *
2 hashtab_lookup(struct hashtab *htp_master,
3 void *key)
4 {
5 long b;
6 int i;
7 struct ht *htp;
8 struct ht_elem *htep;
9 struct ht_bucket *htbp;
10
11 htp = rcu_dereference(htp_master->ht_cur);
12 htbp = ht_get_bucket(&htp, key, &b, &i);
13 cds_list_for_each_entry_rcu(htep,
14 &htbp->htb_head,
15 hte_next[i]) {
16 if (htp->ht_cmp(htp->ht_hash_private,
17 htep, key))
18 return htep;
19 }
20 return NULL;
21 }
22
23 void
24 hashtab_add(struct hashtab *htp_master,
25 struct ht_elem *htep)
26 {
27 long b;
28 int i;
29 struct ht *htp;
30 struct ht_bucket *htbp;
31
32 htp = rcu_dereference(htp_master->ht_cur);
33 htbp = ht_get_bucket(&htp, htp->ht_getkey(htep),
34 &b, &i);
35 cds_list_add_rcu(&htep->hte_next[i],
36 &htbp->htb_head);
37 }
38
39 void
40 hashtab_del(struct hashtab *htp_master,
41 struct ht_elem *htep)
42 {
43 long b;
44 int i;
45 struct ht *htp;
46 struct ht_bucket *htbp;
47
48 htp = rcu_dereference(htp_master->ht_cur);
49 htbp = ht_get_bucket(&htp, htp->ht_getkey(htep),
50 &b, &i);
51 cds_list_del_rcu(&htep->hte_next[i]);
52 }
hash table during a resize operation. What prevents this insertion from being lost due to
a subsequent resize operation completing before the insertion does?
Now that we have bucket selection and concurrency control in place, we are ready to
search and update our resizable hash table. The hashtab_lookup(), hashtab_
add(), and hashtab_del() functions are shown in Figure 10.27.
The hashtab_lookup() function on lines 1-21 of the figure does hash lookups.
Line 11 fetches the current hash table and line 12 obtains a reference to the bucket
corresponding to the specified key. This bucket will be located in a new resized hash
table when a resize operation has progressed past the bucket in the old hash table that
contained the desired data element. Note that line 12 also passes back the index that
will be used to select the correct set of pointers from the pair in each element. The loop
spanning lines 13-19 searches the bucket, so that if line 16 detects a match, line 18
returns a pointer to the enclosing data element. Otherwise, if there is no match, line 20
returns NULL to indicate failure.
Quick Quiz 10.11: In the hashtab_lookup() function in Figure 10.27, the
code carefully finds the right bucket in the new hash table if the element to be looked up
has already been distributed by a concurrent resize operation. This seems wasteful for
RCU-protected lookups. Why not just stick with the old hash table in this case?
The hashtab_add() function on lines 23-37 of the figure adds new data el-
ements to the hash table. Lines 32-34 obtain a pointer to the hash bucket corre-
sponding to the key (and provide the index), as before, and line 35 adds the new
element to the table. The caller is required to handle concurrency, for example, by
invoking hashtab_lock_mod() before the call to hashtab_add() and invok-
ing hashtab_unlock_mod() afterwards. These two concurrency-control functions
will correctly synchronize with a concurrent resize operation: If the resize operation has
already progressed beyond the bucket that this data element would have been added to,
then the element is added to the new table.
The hashtab_del() function on lines 39-52 of the figure removes an existing
element from the hash table. Lines 48-50 provide the bucket and index as before, and
line 51 removes the specified element. As with hashtab_add(), the caller is respon-
sible for concurrency control and this concurrency control suffices for synchronizing
with a concurrent resize operation.
Quick Quiz 10.12: The hashtab_del() function in Figure 10.27 does not
always remove the element from the old hash table. Doesn’t this mean that readers
might access this newly removed element after it has been freed?
The actual resizing itself is carried out by hashtab_resize, shown in Fig-
ure 10.28 on page 261. Line 17 conditionally acquires the top-level ->ht_lock, and
if this acquisition fails, line 18 returns -EBUSY to indicate that a resize is already in
progress. Otherwise, line 19 picks up a reference to the current hash table, and lines 21-
24 allocate a new hash table of the desired size. If a new set of hash/key functions have
been specified, these are used for the new table, otherwise those of the old table are
preserved. If line 25 detects memory-allocation failure, line 26 releases ->ht_lock
and line 27 returns a failure indication.
Line 29 starts the bucket-distribution process by installing a reference to the new
table into the ->ht_new field of the old table. Line 30 ensures that all readers who are
not aware of the new table complete before the resize operation continues. Line 31 picks
up the current table’s index and stores its inverse to the new hash table, thus ensuring
that the two hash tables avoid overwriting each other’s linked lists.
Each pass through the loop spanning lines 33-44 distributes the contents of one of
the old hash table’s buckets into the new hash table. Line 34 picks up a reference to the
old table’s current bucket, line 35 acquires that bucket’s spinlock, and line 36 updates
->ht_resize_cur to indicate that this bucket is being distributed.
Quick Quiz 10.13: In the hashtab_resize() function in Figure 10.28, what
guarantees that the update to ->ht_new on line 29 will be seen as happening before
the update to ->ht_resize_cur on line 36 from the perspective of hashtab_
lookup(), hashtab_add(), and hashtab_del()?
Each pass through the loop spanning lines 37-42 adds one data element from the
current old-table bucket to the corresponding new-table bucket, holding the new-table
bucket’s lock during the add operation. Finally, line 43 releases the old-table bucket
lock.
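Because Figure 10.28 is not reproduced here, the following sketch of the bucket-distribution phase may help; the ht_bkt[] array, the ht_nbuckets field, and the ht_get_bucket_single() helper are assumed names rather than quotations from that figure.

/* Sketch of the bucket-distribution loop; field and helper names are assumptions. */
static void distribute_buckets(struct ht *htp, struct ht *htp_new, int idx)
{
	long i;
	long b;
	struct ht_elem *htep;
	struct ht_bucket *htbp;
	struct ht_bucket *htbp_new;

	for (i = 0; i < htp->ht_nbuckets; i++) {
		htbp = &htp->ht_bkt[i];
		spin_lock(&htbp->htb_lock);
		htp->ht_resize_cur = i;	/* This bucket is now being distributed. */
		cds_list_for_each_entry(htep, &htbp->htb_head[idx], hte_next[idx]) {
			htbp_new = ht_get_bucket_single(htp_new,
							htp_new->ht_getkey(htep), &b);
			spin_lock(&htbp_new->htb_lock);
			cds_list_add_rcu(&htep->hte_next[!idx],
					 &htbp_new->htb_head[!idx]);
			spin_unlock(&htbp_new->htb_lock);
		}
		spin_unlock(&htbp->htb_lock);
	}
}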
Execution reaches line 45 once all old-table buckets have been distributed across
the new table. Line 45 installs the newly created table as the current one, and line 46
waits for all old readers (who might still be referencing the old table) to complete. Then
line 47 releases the resize-serialization lock, line 48 frees the old hash table, and finally
line 49 returns success.
[Figure: lookup performance (log-scale y-axis) versus number of CPUs (threads), comparing the 1024-bucket and 2048-bucket fixed-size hash tables with the resizable hash table for 2048-, 16,384-, and 131,072-element data sets during repeated resize operations.]
The uppermost three traces are for the 2048-element hash table. The upper trace
corresponds to the 2048-bucket fixed-size hash table, the middle trace to the 1024-
bucket fixed-size hash table, and the lower trace to the resizable hash table. In this case,
the short hash chains cause normal lookup overhead to be so low that the overhead
of resizing dominates. Nevertheless, the larger fixed-size hash table has a significant
performance advantage, so that resizing can be quite beneficial, at least given sufficient
time between resizing operations: One millisecond is clearly too short a time.
The middle three traces are for the 16,384-element hash table. Again, the upper
trace corresponds to the 2048-bucket fixed-size hash table, but the middle trace now
corresponds to the resizable hash table and the lower trace to the 1024-bucket fixed-size
hash table. However, the performance difference between the resizable and the 1024-
bucket hash table is quite small. One consequence of the eight-fold increase in number
of elements (and thus also in hash-chain length) is that incessant resizing is now no
worse than maintaining a too-small hash table.
The lower three traces are for the 131,072-element hash table. The upper trace
corresponds to the 2048-bucket fixed-size hash table, the middle trace to the resizable
hash table, and the lower trace to the 1024-bucket fixed-size hash table. In this case,
longer hash chains result in higher lookup overhead, so that this lookup overhead
dominates that of resizing the hash table. However, the performance of all three
approaches at the 131,072-element level is more than an order of magnitude worse
than that at the 2048-element level, suggesting that the best strategy would be a single
64-fold increase in hash-table size.
The key point from this data is that the RCU-protected resizable hash table performs
and scales almost as well as does its fixed-size counterpart. The performance during
an actual resize operation of course suffers somewhat due to the cache misses caused
by the updates to each element’s pointers, and this effect is most pronounced when the
hash table’s bucket lists are short. This indicates that hash tables should be resized by
substantial amounts, and that hysteresis should be applied to prevent performance
degradation due to too-frequent resize operations. In memory-rich environments, hash-
table sizes should furthermore be increased much more aggressively than they are
decreased.
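A hypothetical policy embodying this advice might look something like the following sketch; the thresholds are purely illustrative.

/* Hypothetical hysteresis policy: grow aggressively, shrink reluctantly. */
unsigned long pick_nbuckets(unsigned long nelems, unsigned long nbuckets)
{
	if (nelems > 2 * nbuckets)		/* Average chain longer than two: */
		return 4 * nbuckets;		/* resize by a substantial amount. */
	if (nelems < nbuckets / 8 && nbuckets > 1024)
		return nbuckets / 2;		/* Shrink, but only reluctantly. */
	return nbuckets;			/* Otherwise, leave well enough alone. */
}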
[Figures: step-by-step states, labeled (a) through (g), of shrinking and of growing a relativistic hash table, showing elements 0 through 3 being unzipped between the “all” list and the “even” and “odd” hash buckets.]
colored black to indicate that only those readers traversing the odd-values hash bucket
may reach it.
Next, the last odd-numbered element in the first consecutive run of such elements
now has its pointer-to-next updated to reference the following odd-numbered element.
After a subsequent grace-period operation, the result is state (f). A final unzipping
operation (including a grace-period operation) results in the final state (g).
In short, the relativistic hash table reduces the number of per-element list pointers
at the expense of additional grace periods incurred during resizing. These additional
grace periods are usually not a problem because insertions, deletions, and lookups may
proceed concurrently with a resize operation.
It turns out that it is possible to reduce the per-element memory overhead from a pair
of pointers to a single pointer, while still retaining O(1) deletions. This is accomplished
by augmenting split-order list [SS06] with RCU protection [Des09, MDJ13a]. The data
elements in the hash table are arranged into a single sorted linked list, with each hash
bucket referencing the first element in that bucket. Elements are deleted by setting
low-order bits in their pointer-to-next fields, and these elements are removed from the
list by later traversals that encounter them.
This RCU-protected split-order list is complex, but offers lock-free progress guaran-
tees for all insertion, deletion, and lookup operations. Such guarantees can be important
in real-time applications. An implementation is available from recent versions of the
userspace RCU library [Des09].
10.6 Micro-Optimization
The data structures shown in this section were coded straightforwardly, with no adap-
tation to the underlying system’s cache hierarchy. In addition, many of the imple-
mentations used pointers to functions for key-to-hash conversions and other frequent
operations. Although this approach provides simplicity and portability, in many cases it
does give up some performance.
The following sections touch on specialization, memory conservation, and hardware
considerations. Please do not mistake these short sections for a definitive treatise on this
subject. Whole books have been written on optimizing to a specific CPU, let alone to
the set of CPU families in common use today.
10.6.1 Specialization
The resizable hash table presented in Section 10.4 used an opaque type for the key.
This allows great flexibility, permitting any sort of key to be used, but it also incurs
significant overhead due to the calls via pointers to functions. Now, modern hardware
uses sophisticated branch-prediction techniques to minimize this overhead, but on the
other hand, real-world software is often larger than can be accommodated even by
today’s large hardware branch-prediction tables. This is especially the case for calls via
pointers, in which case the branch prediction hardware must record a pointer in addition
to branch-taken/branch-not-taken information.
This overhead can be eliminated by specializing a hash-table implementation to
a given key type and hash function. Doing so eliminates the ->ht_cmp(), ->ht_
gethash(), and ->ht_getkey() function pointers in the ht structure shown in
Figure 10.24 on page 256. It also eliminates the corresponding calls through these point-
ers, which could allow the compiler to inline the resulting fixed functions, eliminating
not only the overhead of the call instruction, but the argument marshalling as well.
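For example, a hash table specialized for unsigned long keys might use fixed inline helpers along the following lines. This is only a sketch: the element type and hash function are illustrative, and struct ht_elem is the element type from Figure 10.24.

struct ul_elem {			/* Illustrative specialized element type. */
	struct ht_elem e;		/* Must be first so that a cast suffices below. */
	unsigned long key;
};

static inline unsigned long ul_gethash(unsigned long key, unsigned long nbuckets)
{
	return (key * 0x61c88647UL) % nbuckets;		/* Simple multiplicative hash. */
}

static inline int ul_cmp(struct ht_elem *htep, unsigned long key)
{
	return ((struct ul_elem *)htep)->key == key;	/* e is the first member. */
}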
In addition, the resizable hash table is designed to fit an API that segregates bucket
selection from concurrency control. Although this allows a single torture test to exercise
all the hash-table implementations in this chapter, it also means that many operations
must compute the hash and interact with possible resize operations twice rather than just
once. In a performance-conscious environment, the hashtab_lock_mod() function
would also return a reference to the bucket selected, eliminating the subsequent call to
ht_get_bucket().
Quick Quiz 10.14: Couldn’t the hashtorture.h code be modified to accommo-
date a version of hashtab_lock_mod() that subsumes the ht_get_bucket()
functionality?
Quick Quiz 10.15: How much do these specializations really save? Are they really
worth it?
All that aside, one of the great benefits of modern hardware compared to that
available when I first started learning to program back in the early 1970s is that much
less specialization is required. This allows much greater productivity than was possible
back in the days of four-kilobyte address spaces.
could be eliminated, for example, by stealing a bit from the ->ht_resize_key field.
This works because the ->ht_resize_key field is large enough to address every
byte of memory and the ht_bucket structure is more than one byte long, so that the
->ht_resize_key field must have several bits to spare.
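A generic sketch of this sort of bit stealing follows; it is illustrative only, and is not tied to the ->ht_resize_key field in particular.

#define LOW_BIT 0x1UL

static inline unsigned long pack(void *p, int flag)	/* p must be at least 2-byte aligned. */
{
	return (unsigned long)p | (flag ? LOW_BIT : 0);
}

static inline void *unpack_ptr(unsigned long v)
{
	return (void *)(v & ~LOW_BIT);
}

static inline int unpack_flag(unsigned long v)
{
	return (int)(v & LOW_BIT);
}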
This sort of bit-packing trick is frequently used in data structures that are highly
replicated, as is the page structure in the Linux kernel. However, the resizable hash
table’s ht structure is not all that highly replicated. It is instead the ht_bucket
structures we should focus on. There are two major opportunities for shrinking the
ht_bucket structure: (1) Placing the ->htb_lock field in a low-order bit of one of
the ->htb_head pointers and (2) Reducing the number of pointers required.
The first opportunity might make use of bit-spinlocks in the Linux kernel, which
are provided by the include/linux/bit_spinlock.h header file. These are
used in space-critical data structures in the Linux kernel, but are not without their
disadvantages:
1 struct hash_elem {
2 struct ht_elem e;
3 long __attribute__ ((aligned(64))) counter;
4 };
cacheline to be present at the CPU doing the incrementing, but nowhere else. If other
CPUs attempted to traverse the hash bucket list containing that element, they would
incur expensive cache misses, degrading both performance and scalability.
One way to solve this problem on systems with 64-byte cache lines is shown in
Figure 10.32. Here a gcc aligned attribute is used to force the ->counter and the
ht_elem structure into separate cache lines. This would allow CPUs to traverse the
hash bucket list at full speed despite the frequent incrementing.
Of course, this raises the question “How did we know that cache lines are 64
bytes in size?” On a Linux system, this information may be obtained from the /sys/
devices/system/cpu/cpu*/cache/ directories, and it is even possible to make
the installation process rebuild the application to accommodate the system’s hardware
structure. However, this would be more difficult if you wanted your application to also
run on non-Linux systems. Furthermore, even if you were content to run only on Linux,
such a self-modifying installation poses validation challenges.
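For example, on Linux the cache-line size can often be queried at run time, either via a glibc sysconf() extension or from sysfs; the following sketch falls back to 64 bytes if both fail.

#include <stdio.h>
#include <unistd.h>

long cache_line_size(void)
{
	long sz = -1;

#ifdef _SC_LEVEL1_DCACHE_LINESIZE
	sz = sysconf(_SC_LEVEL1_DCACHE_LINESIZE);	/* glibc extension. */
#endif
	if (sz <= 0) {
		FILE *fp = fopen("/sys/devices/system/cpu/cpu0/cache/index0/coherency_line_size", "r");

		if (fp) {
			if (fscanf(fp, "%ld", &sz) != 1)
				sz = -1;
			fclose(fp);
		}
	}
	return sz > 0 ? sz : 64;	/* Assume 64 bytes if all else fails. */
}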
Fortunately, there are some rules of thumb that work reasonably well in practice,
which were gathered into a 1995 paper [GKPS95].3 The first group of rules involve
rearranging structures to accommodate cache geometry:
1. Separate read-mostly data from data that is frequently updated. For example,
place read-mostly data at the beginning of the structure and frequently updated
data at the end. Where possible, place data that is rarely accessed in between.
2. If the structure has groups of fields such that each group is updated by an indepen-
dent code path, separate these groups from each other. Again, it can make sense
to place data that is rarely accessed between the groups. In some cases, it might
also make sense to place each such group into a separate structure referenced by
the original structure.
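For example, a structure laid out according to these rules might look as follows; the field names are illustrative.

struct conn {				/* Illustrative layout only. */
	/* Read-mostly after initialization. */
	int id;
	int proto;
	void (*handler)(struct conn *);

	/* Rarely accessed. */
	char name[64];

	/* Frequently updated: kept away from the read-mostly fields. */
	unsigned long rx_bytes;
	unsigned long tx_bytes;
};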
There has recently been some work towards automated trace-based rearrangement
of structure fields [GDZE10]. This work might well ease one of the more painstaking
tasks required to get excellent performance and scalability from multithreaded software.
An additional set of rules of thumb deal with locks:
1. Given a heavily contended lock protecting data that is frequently modified, take
one of the following approaches:
(a) Place the lock in a different cacheline than the data that it protects.
3 A number of these rules are paraphrased and expanded on here with permission from Orran Krieger.
(b) Use a lock that is adapted for high contention, such as a queued lock.
(c) Redesign to reduce lock contention. (This approach is best, but can require
quite a bit of work.)
2. Place uncontended locks into the same cache line as the data that they protect.
This approach means that the cache miss that brought the lock to the current CPU
also brought its data.
3. Protect read-mostly data with RCU, or, if RCU cannot be used and the critical
sections are of very long duration, reader-writer locks.
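For example, assuming 64-byte cache lines, the first and second of these rules might be applied as in the following pthreads-based sketch.

#include <pthread.h>

struct hot_counter {			/* Heavily contended lock (rule 1a): */
	pthread_mutex_t __attribute__ ((aligned(64))) lock;
	unsigned long __attribute__ ((aligned(64))) value;	/* Data in a separate cache line. */
};

struct mostly_idle {			/* Uncontended lock (rule 2): */
	pthread_mutex_t lock;		/* Same cache line as the data it protects, */
	unsigned long value;		/* so one cache miss fetches both. */
};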
Of course, these are rules of thumb rather than absolute rules. Some experimentation
is required to work out which are most applicable to your particular situation.
10.7 Summary
This chapter has focused primarily on hash tables, including resizable hash tables, which
are not fully partitionable. Section 10.5 gave a quick overview of a few non-hash-table
data structures. Nevertheless, this exposition of hash tables is an excellent introduction
to the many issues surrounding high-performance scalable data access, including:
1. Fully partitioned data structures work well on small systems, for example, single-
socket systems.
2. Larger systems require locality of reference as well as full partitioning.
3. Read-mostly techniques, such as hazard pointers and RCU, provide good locality
of reference for read-mostly workloads, and thus provide excellent performance
and scalability even on larger systems.
4. Read-mostly techniques also work well on some types of non-partitionable data
structures, such as resizable hash tables.
5. Additional performance and scalability can be obtained by specializing the data
structure to a specific workload, for example, by replacing a general key with a
32-bit integer.
6. Although requirements for portability and for extreme performance often conflict,
there are some data-structure-layout techniques that can strike a good balance
between these two sets of requirements.
That said, performance and scalability are of little use without reliability, so the next
chapter covers validation.
If it is not tested, it doesn’t work.
Unknown
Chapter 11
Validation
I have had a few parallel programs work the first time, but that is only because I have
written a large number of parallel programs over the past two decades. And I have had far
more parallel programs that fooled me into thinking that they were working correctly
the first time than actually were working the first time.
I have therefore had great need of validation for my parallel programs. The basic
trick behind parallel validation, as with other software validation, is to realize that the
computer knows what is wrong. It is therefore your job to force it to tell you. This
chapter can therefore be thought of as a short course in machine interrogation.1
A longer course may be found in many recent books on validation, as well as at least
one rather old but quite worthwhile one [Mye79]. Validation is an extremely important
topic that cuts across all forms of software, and is therefore worth intensive study in
its own right. However, this book is primarily about concurrency, so this chapter will
necessarily do little more than scratch the surface of this critically important topic.
Section 11.1 introduces the philosophy of debugging. Section 11.2 discusses tracing,
Section 11.3 discusses assertions, and Section 11.4 discusses static analysis. Section 11.5
describes some unconventional approaches to code review that can be helpful when the
fabled 10,000 eyes happen not to be looking at your code. Section 11.6 overviews the
use of probability for validating parallel software. Because performance and scalability
are first-class requirements for parallel programming, Section 11.7 covers these topics.
Finally, Section 11.8 gives a fanciful summary and a short list of statistical traps to
avoid.
But never forget that the two best debugging tools are a solid design and a good
night’s sleep!
11.1 Introduction
Section 11.1.1 discusses the sources of bugs, and Section 11.1.2 overviews the mindset
required when validating software. Section 11.1.3 discusses when you should start
validation, and Section 11.1.4 describes the surprisingly effective open-source regimen
of code review and community testing.
1 But you can leave the thumbscrews and waterboards at home. This chapter covers much more
sophisticated and effective methods, especially given that most computer systems neither feel pain nor fear
drowning. At least as far as we know.
The first two points should be uncontroversial, as they are illustrated by any number
of failed products, perhaps most famously Clippy and Microsoft Bob. By attempting
to relate to users as people, these two products raised common-sense and theory-of-
mind expectations that they proved incapable of meeting. Perhaps the set of software
assistants that have recently started appearing on smartphones will fare better. That said,
the developers working on them by all accounts still develop the old way: The assistants
might well benefit end users, but not so much their own developers.
This human love of fragmentary plans deserves more explanation, especially given
that it is a classic two-edged sword. This love of fragmentary plans is apparently due
to the assumption that the person carrying out the plan will have (1) common sense
and (2) a good understanding of the intent behind the plan. This latter assumption is
especially likely to hold in the common case where the person doing the planning and
the person carrying out the plan are one and the same: In this case, the plan will be
revised almost subconsciously as obstacles arise. Therefore, the love of fragmentary
plans has served human beings well, in part because it is better to take random actions
that have a high probability of locating food than to starve to death while attempting to
plan the unplannable. However, the past usefulness of fragmentary plans in everyday
life is no guarantee of their future usefulness in stored-program computers.
Furthermore, the need to follow fragmentary plans has had important effects on the
human psyche, due to the fact that throughout much of human history, life was often
difficult and dangerous. It should come as no surprise that executing a fragmentary
plan that has a high probability of a violent encounter with sharp teeth and claws
requires almost insane levels of optimism—a level of optimism that actually is present
in most human beings. These insane levels of optimism extend to self-assessments
of programming ability, as evidenced by the effectiveness of (and the controversy
over) interviewing techniques involving coding trivial programs [Bra07]. In fact, the
clinical term for a human being with less-than-insane levels of optimism is “clinically
depressed.” Such people usually have extreme difficulty functioning in their daily lives,
underscoring the perhaps counter-intuitive importance of insane levels of optimism to
a normal, healthy life. If you are not insanely optimistic, you are less likely to start a
difficult project.2
From these definitions, it logically follows that any reliable non-trivial program
contains at least one bug that you do not know about. Therefore, any validation effort
undertaken on a non-trivial program that fails to find any bugs is itself a failure. A good
validation is therefore an exercise in destruction. This means that if you are the type of
person who enjoys breaking things, validation is just the right type of job for you.
Quick Quiz 11.2: Suppose that you are writing a script that processes the output of
the time command, which looks as follows:
real 0m0.132s
user 0m0.040s
sys 0m0.008s
The script is required to check its input for errors, and to give appropriate diagnostics
if fed erroneous time output. What test inputs should you provide to this program to
test it for use with time output generated by single-threaded programs?
But perhaps you are a super-programmer whose code is always perfect the first time
every time. If so, congratulations! Feel free to skip this chapter, but I do hope that you
2 There are some famous exceptions to this rule of thumb. One set of exceptions is people who take on
difficult or risky projects in order to make at least a temporary escape from their depression. Another set is
people who have nothing to lose: the project is literally a matter of life or death.
will forgive my skepticism. You see, I have met far more people who claimed to be
able to write perfect code the first time than I have people who were actually capable
of carrying out this feat, which is not too surprising given the previous discussion of
optimism and over-confidence. And even if you really are a super-programmer, you just
might find yourself debugging lesser mortals’ work.
One approach for the rest of us is to alternate between our normal state of insane
optimism (Sure, I can program that!) and severe pessimism (It seems to work, but I just
know that there have to be more bugs hiding in there somewhere!). It helps if you enjoy
breaking things. If you don’t, or if your joy in breaking things is limited to breaking
other people’s things, find someone who does love breaking your code and get them to
help you test it.
Another helpful frame of mind is to hate it when other people find bugs in your
code. This hatred can help motivate you to torture your code beyond reason in order to
increase the probability that you find the bugs rather than someone else.
One final frame of mind is to consider the possibility that someone’s life depends on
your code being correct. This can also motivate you to torture your code into revealing
the whereabouts of its bugs.
This wide variety of frames of mind opens the door to the possibility of multiple
people with different frames of mind contributing to the project, with varying levels of
optimism. This can work well, if properly organized.
Some people might see vigorous validation as a form of torture, as depicted in
Figure 11.1.3 Such people might do well to remind themselves that, Tux cartoons aside,
they are really torturing an inanimate object, as shown in Figure 11.2. In addition, rest
assured that those who fail to torture their code are doomed to be tortured by it.
However, this leaves open the question of exactly when during the project lifetime
validation should start, a topic taken up by the next section.
To see this, consider that tracking down a bug is much harder in a large program
than in a small one. Therefore, to minimize the time and effort required to track down
bugs, you should test small units of code. Although you won’t find all the bugs this way,
you will find a substantial fraction, and it will be much easier to find and fix the ones
you do find. Testing at this level can also alert you to larger flaws in your overall design,
minimizing the time you waste writing code that is quite literally broken by design.
But why wait until you have code before validating your design?4 Hopefully reading
Chapters 3 and 4 provided you with the information required to avoid some regrettably
common design flaws, but discussing your design with a colleague or even simply
writing it down can help flush out additional flaws.
However, it is all too often the case that waiting to start validation until you have a
design is waiting too long. Mightn’t your natural level of optimism have caused you to start
the design before you fully understood the requirements? The answer to this question
will almost always be “yes”. One good way to avoid flawed requirements is to get to
know your users. To really serve them well, you will have to live among them.
Quick Quiz 11.3: You are asking me to do all this validation BS before I even start
coding??? That sounds like a great way to never get started!!!
First-of-a-kind projects require different approaches to validation, for example, rapid
prototyping. Here, the main goal of the first few prototypes is to learn how the project
should be implemented, not so much to create a correct implementation on the first try.
However, it is important to keep in mind that you should not omit validation, but rather
take a radically different approach to it.
Now that we have established that you should start validation when you start the
project, the following sections cover a number of validation techniques and methods
that have proven their worth.
4 The old saying “First we must code, then we have incentive to think” notwithstanding.
Therefore, even when writing code for an open-source project, you need to be
prepared to develop and run your own test suite. Test development is an underappreciated
and very valuable skill, so be sure to take full advantage of any existing test suites
available to you. Important as test development is, we will leave further discussion of it
to books dedicated to that topic. The following sections therefore discuss locating bugs
in your code given that you already have a good test suite.
11.2 Tracing
When all else fails, add a printk()! Or a printf(), if you are working with
user-mode C-language applications.
The rationale is simple: If you cannot figure out how execution reached a given point
in the code, sprinkle print statements earlier in the code to work out what happened. You
can get a similar effect, and with more convenience and flexibility, by using a debugger
such as gdb (for user applications) or kgdb (for debugging Linux kernels). Much more
sophisticated tools exist, with some of the more recent offering the ability to rewind
backwards in time from the point of failure.
These brute-force testing tools are all valuable, especially now that typical systems
have more than 64K of memory and CPUs running faster than 4MHz. Much has been
written about these tools, so this chapter will add little more.
However, these tools all have a serious shortcoming when the job at hand is to
convince the fastpath of a high-performance parallel algorithm to tell you what is
going wrong, namely, they often have excessive overheads. There are special tracing
technologies for this purpose, which typically leverage data ownership techniques (see
Chapter 8) to minimize the overhead of runtime data collection. One example within the
Linux kernel is “trace events” [Ros10b, Ros10c, Ros10d, Ros10a], which uses per-CPU
buffers to allow data to be collected with extremely low overhead. Even so, enabling
tracing can sometimes change timing enough to hide bugs, resulting in heisenbugs,
which are discussed in Section 11.6 and especially Section 11.6.4. In userspace code,
there is a huge number of tools that can help you. One good starting point is Brendan
Gregg’s blog.5
Even if you avoid heisenbugs, other pitfalls await you. For example, although the
machine really does know all, what it knows is almost always way more than your head
can hold. For this reason, high-quality test suites normally come with sophisticated
scripts to analyze the voluminous output. But beware—scripts won’t necessarily notice
surprising things. My rcutorture scripts are a case in point: Early versions of those
scripts were quite satisfied with a test run in which RCU grace periods stalled indefinitely.
This of course resulted in the scripts being modified to detect RCU grace-period stalls,
but this does not change the fact that the scripts will detect only those problems that I think
to make them detect. But note well that unless you have a solid design, you won’t know
what your script should check for!
Another problem with tracing and especially with printk() calls is that their
overhead is often too much for production use. In some such cases, assertions can be
helpful.
5 http://www.brendangregg.com/blog/
11.3 Assertions
Assertions are usually implemented in the following manner:
1 if (something_bad_is_happening())
2 complain();
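In user-mode C code, the standard assert() macro is a common way of doing this, for example:

#include <assert.h>

int divide(int a, int b)
{
	assert(b != 0);		/* Complain loudly (and abort) if something bad is happening. */
	return a / b;
}

Compiling with -DNDEBUG removes such checks, and within the Linux kernel, BUG_ON() and WARN_ON() play a similar role.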
11.5.1 Inspection
Traditionally, formal code inspections take place in face-to-face meetings with formally
defined roles: moderator, developer, and one or two other participants. The developer
reads through the code, explaining what it is doing and why it works. The one or two
other participants ask questions and raise issues, while the moderator’s job is to resolve
any conflicts and to take notes. This process can be extremely effective at locating bugs,
particularly if all of the participants are familiar with the code at hand.
However, this face-to-face formal procedure does not necessarily work well in
the global Linux kernel community, although it might work well via an IRC session.
Instead, individuals review code separately and provide comments via email or IRC.
The note-taking is provided by email archives or IRC logs, and moderators volunteer
their services as appropriate. Give or take the occasional flamewar, this process also
works reasonably well, particularly if all of the participants are familiar with the code at
hand.6
It is quite likely that the Linux kernel community’s review process is ripe for
improvement:
1. There is sometimes a shortage of people with the time and expertise required to
carry out an effective review.
2. Even though all review discussions are archived, they are often “lost” in the sense
that insights are forgotten and people often fail to look up the discussions. This
can result in re-insertion of the same old bugs.
3. It is sometimes difficult to resolve flamewars when they do break out, especially
when the combatants have disjoint goals, experience, and vocabulary.
11.5.2 Walkthroughs
A traditional code walkthrough is similar to a formal inspection, except that the group
“plays computer” with the code, driven by specific test cases. A typical walkthrough team
has a moderator, a secretary (who records bugs found), a testing expert (who generates
the test cases), and perhaps one to two others. These can be extremely effective, albeit
also extremely time-consuming.
6 That said, one advantage of the Linux kernel community approach over traditional formal inspections is
the greater probability of contributions from people not familiar with the code, who therefore might not be
blinded by the invalid assumptions harbored by those familiar with the code.
It has been some decades since I have participated in a formal walkthrough, and
I suspect that a present-day walkthrough would use single-stepping debuggers. One
could imagine a particularly sadistic procedure as follows:
2. The moderator starts the code under a debugger, using the specified test case as
input.
3. Before each statement is executed, the developer is required to predict the outcome
of the statement and explain why this outcome is correct.
4. If the outcome differs from that predicted by the developer, this is taken as
evidence of a potential bug.
5. In parallel code, a “concurrency shark” asks what code might execute concurrently
with this code, and why such concurrency is harmless.
11.5.3 Self-Inspection
Although developers are usually not all that effective at inspecting their own code,
there are a number of situations where there is no reasonable alternative. For example,
the developer might be the only person authorized to look at the code, other qualified
developers might all be too busy, or the code in question might be sufficiently bizarre
that the developer is unable to convince anyone else to take it seriously until after
demonstrating a prototype. In these cases, the following procedure can be quite helpful,
especially for complex parallel code:
1. Write design document with requirements, diagrams for data structures, and
rationale for design choices.
3. Write the code in pen on paper, correcting errors as you go. Resist the temptation to
refer to pre-existing nearly identical code sequences; instead, copy them.
4. If there were errors, copy the code in pen on fresh paper, correcting errors as you
go. Repeat until the last two copies are identical.
6. Where possible, test the code fragments from the bottom up.
7. When all the code is integrated, do full-up functional and stress testing.
8. Once the code passes all tests, write code-level documentation, perhaps as an
extension to the design document discussed above.
When I faithfully follow this procedure for new RCU code, there are normally only a
few bugs left at the end. With a few prominent (and embarrassing) exceptions [McK11a],
I usually manage to locate these bugs before others do. That said, this is getting more
difficult over time as the number and variety of Linux-kernel users increases.
Quick Quiz 11.5: Why would anyone bother copying existing code in pen on
paper??? Doesn’t that just increase the probability of transcription errors?
Quick Quiz 11.6: This procedure is ridiculously over-engineered! How can you
expect to get a reasonable amount of software written doing it this way???
The above procedure works well for new code, but what if you need to inspect code
that you have already written? You can of course apply the above procedure for old
code in the special case where you wrote one to throw away [FPB79], but the following
approach can also be helpful in less desperate circumstances:
This works because describing the code in detail is an excellent way to spot
bugs [Mye79]. Although this second procedure is also a good way to get your head
around someone else’s code, in many cases, the first step suffices.
Although review and inspection by others is probably more efficient and effective,
the above procedures can be quite helpful in cases where for whatever reason it is not
feasible to involve others.
At this point, you might be wondering how to write parallel code without having to
do all this boring paperwork. Here are some time-tested ways of accomplishing this:
1. Write a sequential program that scales through use of available parallel library
functions.
2. Write sequential plug-ins for a parallel framework, such as map-reduce, BOINC,
or a web-application server.
3. Do such a good job of parallel design that the problem is fully partitioned, then
just implement sequential program(s) that run in parallel without communication.
4. Stick to one of the application areas (such as linear algebra) where tools can
automatically decompose and parallelize the problem.
5. Make extremely disciplined use of parallel-programming primitives, so that the
resulting code is easily seen to be correct. But beware: It is always tempting to
break the rules “just a little bit” to gain better performance or scalability. Breaking
the rules often results in general breakage. That is, unless you carefully do the
paperwork described in this section.
But the sad fact is that even if you do the paperwork or use one of the above ways to
more-or-less safely avoid paperwork, there will be bugs. If nothing else, more users and
a greater variety of users will expose more bugs more quickly, especially if those users
[Cartoon: “Hooray! I passed the stress test!”]
are doing things that the original developers did not consider. The next section describes
how to handle the probabilistic bugs that occur all too commonly when validating
parallel software.
A boot-up test of a Linux kernel patch is an example of a discrete test. You
boot the kernel, and it either comes up or it does not. Although you might spend an hour
boot-testing your kernel, the number of times you attempted to boot the kernel and the
number of times the boot-up succeeded would often be of more interest than the length
of time you spent testing. Functional tests tend to be discrete.
On the other hand, if my patch involved RCU, I would probably run rcutorture,
which is a kernel module that, strangely enough, tests RCU. Unlike booting the kernel,
where the appearance of a login prompt signals the successful end of a discrete test,
rcutorture will happily continue torturing RCU until either the kernel crashes or until
you tell it to stop. The duration of the rcutorture test is therefore (usually) of more
interest than the number of times you started and stopped it. Therefore, rcutorture is an
example of a continuous test, a category that includes many stress tests.
The statistics governing discrete and continuous tests differ somewhat. However,
the statistics for discrete tests is simpler and more familiar than that for continuous tests,
and furthermore the statistics for discrete tests can often be pressed into service (with
some loss of accuracy) for continuous tests. We therefore start with discrete tests.
However, many people find it easier to work with a formula than a series of steps,
although if you prefer the above series of steps, have at it! For those who like formulas,
call the probability of a single failure f . The probability of a single success is then 1 − f
and the probability that all of n tests will succeed is then:
S_n = (1 − f)^n    (11.1)
The probability of failure is 1 − S_n, or:
F_n = 1 − (1 − f)^n    (11.2)
Quick Quiz 11.8: Say what??? When I plug the earlier example of five tests each
with a 10% failure rate into the formula, I get 59,050% and that just doesn’t make
sense!!!
So suppose that a given test has been failing 10% of the time. How many times do
you have to run the test to be 99% sure that your supposed fix has actually improved
matters?
Another way to ask this question is “How many times would we need to run the
test to cause the probability of failure to rise above 99%?” After all, if we were to run
the test enough times that the probability of seeing at least one failure becomes 99%, if
[Figure 11.4: Number of Tests Required for 99 Percent Confidence Given Failure Rate. The number of test runs (log-scale y-axis) is plotted against the per-run failure probability (x-axis, 0 to 1).]
there are no failures, there is only 1% probability of this being due to dumb luck. And if
we plug f = 0.1 into Equation 11.2 and vary n, we find that 43 runs gives us a 98.92%
chance of at least one test failing given the original 10% per-test failure rate, while 44
runs gives us a 99.03% chance of at least one test failing. So if we run the test on our fix
44 times and see no failures, there is a 99% probability that our fix was actually a real
improvement.
But repeatedly plugging numbers into Equation 11.2 can get tedious, so let’s solve
for n:
F_n = 1 − (1 − f)^n    (11.3)
1 − F_n = (1 − f)^n    (11.4)
log(1 − F_n) = n log(1 − f)    (11.5)
Finally the number of tests required is given by:
n = log(1 − F_n) / log(1 − f)    (11.6)
Plugging f = 0.1 and Fn = 0.99 into Equation 11.6 gives 43.7, meaning that we need
44 consecutive successful test runs to be 99% certain that our fix was a real improvement.
This matches the number obtained by the previous method, which is reassuring.
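For those who would rather let the computer do the arithmetic, a small C program implementing Equation 11.6 might look as follows (compile with -lm):

#include <math.h>
#include <stdio.h>

/* Number of consecutive failure-free runs needed for confidence Fn,
   given a per-run failure probability f (Equation 11.6). */
static double runs_needed(double f, double Fn)
{
	return log(1.0 - Fn) / log(1.0 - f);
}

int main(void)
{
	printf("%.1f\n", runs_needed(0.1, 0.99));	/* Prints 43.7, so 44 runs. */
	return 0;
}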
Quick Quiz 11.9: In Equation 11.6, are the logarithms base-10, base-2, or base-e?
Figure 11.4 shows a plot of this function. Not surprisingly, the less frequently each
test run fails, the more test runs are required to be 99% confident that the bug has been
fixed. If the bug caused the test to fail only 1% of the time, then a mind-boggling 458
test runs are required. As the failure probability decreases, the number of test runs
required increases, going to infinity as the failure probability goes to zero.
The moral of this story is that when you have found a rarely occurring bug, your
testing job will be much easier if you can come up with a carefully targeted test with a
much higher failure rate. For example, if your targeted test raised the failure rate from
1% to 30%, then the number of runs required for 99% confidence would drop from 458
test runs to a mere thirteen test runs.
But these thirteen test runs would only give you 99% confidence that your fix had
produced “some improvement”. Suppose you instead want to have 99% confidence that
your fix reduced the failure rate by an order of magnitude. How many failure-free test
runs are required?
An order of magnitude improvement from a 30% failure rate would be a 3% failure
rate. Plugging these numbers into Equation 11.6 yields:
n = log(1 − 0.99) / log(1 − 0.03) = 151.2    (11.7)
So our order of magnitude improvement requires roughly an order of magnitude
more testing. Certainty is impossible, and high probabilities are quite expensive. Clearly
making tests run more quickly and making failures more probable are essential skills
in the development of highly reliable software. These skills will be covered in Sec-
tion 11.6.4.
F_m = (λ^m / m!) e^(−λ)    (11.8)
Here F_m is the probability of m failures in the test and λ is the expected failure
rate per unit time. A rigorous derivation may be found in any advanced probability
textbook, for example, Feller’s classic “An Introduction to Probability Theory and Its
Applications” [Fel50], while a more intuitive approach may be found in the first edition
of this book [McK14a].
Let’s try reworking the example from Section 11.6.2 using the Poisson distribution.
Recall that this example involved a test with a 30% failure rate per hour, and that the
question was how long the test would need to run error-free on an alleged fix to be 99%
certain that the fix actually reduced the failure rate. In this case, m is zero, so that
Equation 11.8 reduces to:
F_0 = e^(−λ)    (11.9)
Solving this requires setting F_0 to 0.01 and solving for λ, resulting in:
This is the Poisson cumulative distribution function, which can be written more
compactly as:
F_{i≤m} = Σ_{i=0}^{m} (λ^i / i!) e^(−λ)    (11.13)
Here m is the number of errors in the long test run (in this case, two) and λ is the
expected number of errors in the long test run (in this case, 24). Plugging m = 2 and
λ = 24 into this expression gives the probability of two or fewer failures as about
1.2 × 10−8 , in other words, we have a high level of confidence that the fix actually had
some relationship to the bug.8
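A short C function that evaluates Equation 11.13 by direct summation (adequate for modest values of m; compile with -lm) is shown below:

#include <math.h>
#include <stdio.h>

/* Probability of m or fewer failures given an expected count lambda (Equation 11.13). */
static double poisson_cdf(int m, double lambda)
{
	double term = exp(-lambda);	/* The i = 0 term. */
	double sum = term;
	int i;

	for (i = 1; i <= m; i++) {
		term *= lambda / i;	/* Builds lambda^i / i! incrementally. */
		sum += term;
	}
	return sum;
}

int main(void)
{
	printf("%e\n", poisson_cdf(2, 24.0));	/* Roughly 1.2e-08, as in the text. */
	return 0;
}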
Quick Quiz 11.11: Doing the summation of all the factorials and exponentials is a
real pain. Isn’t there an easier way?
Quick Quiz 11.12: But wait!!! Given that there has to be some number of fail-
ures (including the possibility of zero failures), shouldn’t the summation shown in
Equation 11.13 approach the value 1 as m goes to infinity?
The Poisson distribution is a powerful tool for analyzing test results, but the fact is
that in this last example there were still two remaining test failures in a 24-hour test
run. Such a low failure rate results in very long test runs. The next section discusses
counter-intuitive ways of improving this situation.
8 Of course, this result in no way excuses you from finding and fixing the bug(s) resulting in the remaining
two failures!
However, it is often the case that the bug is in a specific subsystem, and the structure
of the program limits the amount of stress that can be applied to that subsystem. The
next section addresses this situation.
[Figure: timeline running from call_rcu() through the grace-period start and grace-period end to the callback invocation, with the intervals corresponding to an actual error and to a “near miss” marked.]
this can be problematic because different CPUs can have different opinions as to exactly
where a given grace period starts and ends, as indicated by the jagged lines.11 Using the
near misses as the error condition could therefore result in false positives, which need to
be avoided in the automated rcutorture testing.
By sheer dumb luck, rcutorture happens to include some statistics that are
sensitive to the near-miss version of the grace period. As noted above, these statistics
are subject to false positives due to their unsynchronized access to RCU’s state variables,
but these false positives turn out to be extremely rare on strongly ordered systems such
as the IBM mainframe and x86, occurring less than once per thousand hours of testing.
These near misses occurred roughly once per hour, about two orders of magnitude
more frequently than the actual errors. Use of these near misses allowed the bug’s root
cause to be identified in less than a week and a high degree of confidence in the fix to be
built in less than a day. In contrast, excluding the near misses in favor of the real errors
would have required months of debug and validation time.
To sum up near-miss counting, the general approach is to replace counting of
infrequent failures with more-frequent near misses that are believed to be correlated
with those failures. These near-misses can be considered an anti-heisenbug to the real
failure’s heisenbug because the near-misses, being more frequent, are likely to be more
robust in the face of changes to your code, for example, the changes you make to add
debugging code.
Thus far, we have focused solely on bugs in the parallel program’s functional-
ity. However, because performance is a first-class requirement for a parallel program
(otherwise, why not write a sequential program?), the next section looks into finding
performance bugs.
every microsecond matters and every nanosecond is needed. Therefore, for parallel
programs, insufficient performance is just as much a bug as is incorrectness.
Quick Quiz 11.16: That is ridiculous!!! After all, isn’t getting the correct answer
later than one would like better than getting an incorrect answer???
Quick Quiz 11.17: But if you are going to put in all the hard work of parallelizing an
application, why not do it right? Why settle for anything less than optimal performance
and linear scalability?
Validating a parallel program must therefore include validating its performance. But
validating performance means having a workload to run and performance criteria with
which to evaluate the program at hand. These needs are often met by performance
benchmarks, which are discussed in the next section.
11.7.1 Benchmarking
The old saying goes “There are lies, damn lies, statistics, and benchmarks.” However,
benchmarks are heavily used, so it is not helpful to be too dismissive of them.
Benchmarks span the range from ad hoc test jigs to international standards, but
regardless of their level of formality, benchmarks serve four major purposes:
Of course, the only completely fair framework is the intended application itself. So
why would anyone who cared about fairness in benchmarking bother creating imperfect
benchmarks rather than simply using the application itself as the benchmark?
Running the actual application is in fact the best approach where it is practical.
Unfortunately, it is often impractical for the following reasons:
1. The application might be proprietary, and you might not have the right to run the
intended application.
2. The application might require more hardware than you have access to.
3. The application might use data that you cannot legally access, for example, due
to privacy regulations.
In these cases, creating a benchmark that approximates the application can help
overcome these obstacles. A carefully constructed benchmark can help promote perfor-
mance, scalability, energy efficiency, and much else besides.
11.7.2 Profiling
In many cases, a fairly small portion of your software is responsible for the majority of
the performance and scalability shortfall. However, developers are notoriously unable
to identify the actual bottlenecks by hand. For example, in the case of a kernel buffer
allocator, all attention focused on a search of a dense array which turned out to represent
only a few percent of the allocator’s execution time. An execution profile collected via
a logic analyzer focused attention on the cache misses that were actually responsible for
the majority of the problem [MS93].
An old-school but quite effective method of tracking down performance and scalabil-
ity bugs is to run your program under a debugger, then periodically interrupt it, recording
the stacks of all threads at each interruption. The theory here is that if something is
slowing down your program, it has to be visible in your threads’ executions.
That said, there are a number of tools that will usually do a much better job of
helping you to focus your attention where it will do the most good. Two popular choices
are gprof and perf. To use perf on a single-process program, prefix your command
with perf record, then after the command completes, type perf report. There
is a lot of work on tools for performance debugging of multi-threaded programs, which
should make this important job easier. Again, one good starting point is Brendan Gregg’s
blog.12
11.7.4 Microbenchmarking
Microbenchmarking can be useful when deciding which algorithms or data structures
are worth incorporating into a larger body of software for deeper evaluation.
One common approach to microbenchmarking is to measure the time, run some
number of iterations of the code under test, then measure the time again. The difference
between the two times divided by the number of iterations gives the measured time
required to execute the code under test.
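A minimal sketch of this approach using clock_gettime() follows; code_under_test() is a stand-in for whatever is actually being measured (on older systems, link with -lrt).

#include <stdio.h>
#include <time.h>

#define NITER 1000000L

static volatile unsigned long sink;

static void code_under_test(void)	/* Stand-in for the code being measured. */
{
	sink++;
}

int main(void)
{
	struct timespec t1, t2;
	double ns;
	long i;

	clock_gettime(CLOCK_MONOTONIC, &t1);
	for (i = 0; i < NITER; i++)
		code_under_test();
	clock_gettime(CLOCK_MONOTONIC, &t2);
	ns = (t2.tv_sec - t1.tv_sec) * 1e9 + (t2.tv_nsec - t1.tv_nsec);
	printf("%.1f ns/iteration\n", ns / NITER);
	return 0;
}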
Unfortunately, this approach to measurement allows any number of errors to creep
in, including:
1. The measurement will include some of the overhead of the time measurement.
This source of error can be reduced to an arbitrarily small value by increasing the
number of iterations.
2. The first few iterations of the test might incur cache misses or (worse yet) page
faults that might inflate the measured value. This source of error can also be
12 http://www.brendangregg.com/blog/
The first and fourth sources of interference provide conflicting advice, which is one
sign that we are living in the real world. The remainder of this section looks at ways of
resolving this conflict.
Quick Quiz 11.18: But what about other sources of error, for example, due to
interactions between caches and memory layout?
The following sections discuss ways of dealing with these measurement errors, with
Section 11.7.5 covering isolation techniques that may be used to prevent some forms of
interference, and with Section 11.7.6 covering methods for detecting interference so as
to reject measurement data that might have been corrupted by that interference.
11.7.5 Isolation
The Linux kernel provides a number of ways to isolate a group of CPUs from outside
interference.
First, let’s look at interference by other processes, threads, and tasks. The Linux-specific
sched_setaffinity() system call may be used to move most tasks off of a
given set of CPUs and to confine your tests to that same group. The
user-level taskset command may be used for the same purpose, though both sched_
setaffinity() and taskset require elevated permissions. Linux-specific control
groups (cgroups) may be used for this same purpose. This approach can be quite
effective at reducing interference, and is sufficient in many cases. However, it does have
limitations, for example, it cannot do anything about the per-CPU kernel threads that
are often used for housekeeping tasks.
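For example, a test program might confine itself to CPU 3 as follows (error handling kept minimal), or equivalently be launched via taskset -c 3:

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>

/* Confine the calling thread to CPU 3 (Linux-specific). */
static void bind_to_cpu3(void)
{
	cpu_set_t set;

	CPU_ZERO(&set);
	CPU_SET(3, &set);
	if (sched_setaffinity(0, sizeof(set), &set) != 0) {
		perror("sched_setaffinity");
		exit(EXIT_FAILURE);
	}
}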
One way to avoid interference from per-CPU kernel threads is to run your test at a
high real-time priority, for example, by using the POSIX sched_setscheduler()
system call. However, note that if you do this, you are implicitly taking on responsibility
for avoiding infinite loops, because otherwise your test will prevent part of the kernel
from functioning.13
These approaches can greatly reduce, and perhaps even eliminate, interference
from processes, threads, and tasks. However, they do nothing to prevent interference
from device interrupts, at least in the absence of threaded interrupts. Linux allows
some control of threaded interrupts via the /proc/irq directory, which contains
numerical directories, one per interrupt vector. Each numerical directory contains
13 This is an example of the Spiderman Principle: “With great power comes great responsibility.”
interrupt to be shut off on any CPU that has only one runnable task, but as of early 2013, this is unfortunately
still work in progress.
1 #include <sys/time.h>
2 #include <sys/resource.h>
3
4 /* Return 0 if test results should be rejected. */
5 int runtest(void)
6 {
7 struct rusage ru1;
8 struct rusage ru2;
9
10 if (getrusage(RUSAGE_SELF, &ru1) != 0) {
11 perror("getrusage");
12 abort();
13 }
14 /* run test here. */
15 if (getrusage(RUSAGE_SELF, &ru2) != 0) {
16 perror("getrusage");
17 abort();
18 }
19 return (ru1.ru_nvcsw == ru2.ru_nvcsw &&
20 ru1.ru_nivcsw == ru2.ru_nivcsw);
21 }
The fact that smaller measurements are more likely to be accurate than larger
measurements suggests that sorting the measurements in increasing order is likely to be
productive.15 The fact that the measurement uncertainty is known allows us to accept
measurements within this uncertainty of each other: If the effects of interference are
large compared to this uncertainty, this will ease rejection of bad data. Finally, the fact
that some fraction (for example, one third) can be assumed to be good allows us to
blindly accept the first portion of the sorted list, and this data can then be used to gain
an estimate of the natural variation of the measured data, over and above the assumed
measurement error.
The approach is to take the specified number of leading elements from the beginning
of the sorted list, and use these to estimate a typical inter-element delta, which in turn
may be multiplied by the number of elements in the list to obtain an upper bound on
permissible values. The algorithm then repeatedly considers the next element of the list.
If it falls below the upper bound, and if the distance between the next element and the
previous element is not too much greater than the average inter-element distance for the
portion of the list accepted thus far, then the next element is accepted and the process
repeats. Otherwise, the remainder of the list is rejected.
Figure 11.7 shows a simple sh/awk script implementing this notion. Input consists
of an x-value followed by an arbitrarily long list of y-values, and output consists of one
line for each input line, with fields as follows:
1. The x-value.
15 To paraphrase the old saying, “Sort first and ask questions later.”
1 divisor=3
2 relerr=0.01
3 trendbreak=10
4 while test $# -gt 0
5 do
6 case "$1" in
7 --divisor)
8 shift
9 divisor=$1
10 ;;
11 --relerr)
12 shift
13 relerr=$1
14 ;;
15 --trendbreak)
16 shift
17 trendbreak=$1
18 ;;
19 esac
20 shift
21 done
22
23 awk -v divisor=$divisor -v relerr=$relerr \
24 -v trendbreak=$trendbreak '{
25 for (i = 2; i <= NF; i++)
26 d[i - 1] = $i;
27 asort(d);
28 i = int((NF + divisor - 1) / divisor);
29 delta = d[i] - d[1];
30 maxdelta = delta * divisor;
31 maxdelta1 = delta + d[i] * relerr;
32 if (maxdelta1 > maxdelta)
33 maxdelta = maxdelta1;
34 for (j = i + 1; j < NF; j++) {
35 if (j <= 2)
36 maxdiff = d[NF - 1] - d[1];
37 else
38 maxdiff = trendbreak * \
39 (d[j - 1] - d[1]) / (j - 2);
40 if (d[j] - d[1] > maxdelta && \
41 d[j] - d[j - 1] > maxdiff)
42 break;
43 }
44 n = sum = 0;
45 for (k = 1; k < j; k++) {
46 sum += d[k];
47 n++;
48 }
49 min = d[1];
50 max = d[j - 1];
51 avg = sum / n;
52 print $1, avg, min, max, n, NF - 1;
53 }'
• --divisor: Number of segments to divide the list into, for example, a divisor
of four means that the first quarter of the data elements will be assumed to be
good. This defaults to three.
• --relerr: Relative measurement error. The script assumes that values that
differ by less than this error are for all intents and purposes equal. This defaults
to 0.01, which is equivalent to 1%.
Lines 1-3 of Figure 11.7 set the default values for the parameters, and lines 4-21
parse any command-line overriding of these parameters. The awk invocation on lines 23
and 24 sets the values of the divisor, relerr, and trendbreak variables to their
sh counterparts. In the usual awk manner, lines 25-52 are executed on each input line.
The loop spanning lines 25 and 26 copies the input y-values to the d array, which line 27
sorts into increasing order. Line 28 computes the number of y-values that are to be
trusted absolutely by applying divisor and rounding up.
Lines 29-33 compute the maxdelta value used as a lower bound on the upper
bound of y-values. To this end, lines 29 and 30 multiply the difference in values over
the trusted region of data by the divisor, which projects the difference observed
across the trusted region onto the entire set of y-values. However, this value might
well be much smaller than the relative error, so line 31 computes the absolute error
(d[i] * relerr) and adds that to the difference delta across the trusted portion
of the data. Lines 32 and 33 then compute the maximum of these two values.
Each pass through the loop spanning lines 34-43 attempts to add another data value
to the set of good data. Lines 35-39 compute the trend-break delta, with line 36 disabling
this limit if we don’t yet have enough values to compute a trend, and with lines 38 and 39
multiplying trendbreak by the average difference between pairs of data values in
the good set. If line 40 determines that the candidate data value would exceed the lower
bound on the upper bound (maxdelta) and line 41 determines that the difference
between the candidate data value and its predecessor exceeds the trend-break difference
(maxdiff), then line 42 exits the loop: We have the full good set of data.
Lines 44-52 then compute and print the statistics for the data set.
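For example, assuming that the script of Figure 11.7 has been saved under the (hypothetical) name rejectoutliers.sh, and that each line of a (likewise hypothetical) file measurements.dat contains an x-value followed by its y-values, the script might be invoked as follows, with the flag values shown simply restating the script’s defaults:

sh rejectoutliers.sh --divisor 3 --relerr 0.01 --trendbreak 10 < measurements.dat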
Quick Quiz 11.20: This approach is just plain weird! Why not use means and
standard deviations, like we were taught in our statistics classes?
Quick Quiz 11.21: But what if all the y-values in the trusted group of data are
exactly zero? Won’t that cause the script to reject any non-zero value?
Although statistical interference detection can be quite useful, it should be used only
as a last resort. It is far better to avoid interference in the first place (Section 11.7.5), or,
failing that, to detect interference via measurement (Section 11.7.6.1).
11.8 Summary
Although validation never will be an exact science, much can be gained by taking
an organized approach to it, as an organized approach will help you choose the right
validation tools for your job, avoiding situations like the one fancifully depicted in
Figure 11.8.
A key choice is that of statistics. Although the methods described in this chapter
work very well most of the time, they do have their limitations. These limitations
are inherent because we are attempting to do something that is in general impossible,
courtesy of the Halting Problem [Tur37, Pul00]. Fortunately for us, there are a huge
number of special cases in which we can not only work out whether a given program
will halt, but also establish estimates for how long it will run before halting, as discussed
in Section 11.7. Furthermore, in cases where a given program might or might not work
correctly, we can often establish estimates for what fraction of the time it will work
correctly, as discussed in Section 11.6.
Nevertheless, unthinking reliance on these estimates is brave to the point of
foolhardiness. After all, we are summarizing a huge mass of complexity in code and data
structures down to a single solitary number. Even though we can get away with such
bravery a surprisingly large fraction of the time, it is only reasonable to expect that the
code and data being abstracted away will occasionally cause severe problems.
One possible problem is variability, where repeated runs might give wildly different
results. This is often dealt with by maintaining a standard deviation as well as a mean,
but the fact is that attempting to summarize the behavior of a large and complex program
with two numbers is almost as brave as summarizing its behavior with only one number.
In computer programming, the surprising thing is that use of the mean or the mean and
standard deviation is often sufficient, but there are no guarantees.
One cause of variation is confounding factors. For example, the CPU time consumed
by a linked-list search will depend on the length of the list. Averaging together runs
with wildly different list lengths will probably not be useful, and adding a standard
deviation to the mean will not be much better. The right thing to do would be to control
for list length, either by holding the length constant or by measuring CPU time as a function
of list length.
Of course, this advice assumes that you are aware of the confounding factors, and
Murphy says that you probably will not be. I have been involved in projects that had
confounding factors as diverse as air conditioners (which drew considerable power at
startup, thus causing the voltage supplied to the computer to momentarily drop too low,
sometimes resulting in failure), cache state (resulting in odd variations in performance),
I/O errors (including disk errors, packet loss, and duplicate Ethernet MAC addresses),
and even porpoises (which could not resist playing with an array of transponders, which,
in the absence of porpoises, could be used for high-precision acoustic positioning and
navigation). And this is but one reason why a good night’s sleep is such an effective
debugging tool.
In short, validation always will require some measure of the behavior of the system.
Because this measure must be a severe summarization of the system, it can be misleading.
So as the saying goes, “Be careful. It is a real world out there.”
But suppose you are working on the Linux kernel, which as of 2013 has about a
billion instances throughout the world? In that case, a bug that would be encountered
once every million years will be encountered almost three times per day across the
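A quick back-of-the-envelope check of this figure, and of the nine-orders-of-magnitude estimate in the next sentence, is sketched below; the sketch assumes exponentially distributed failures, and the constants are of course only rough assumptions:

#include <math.h>
#include <stdio.h>

int main(void)
{
	double instances = 1e9;      /* Linux instances, circa 2013. */
	double mtbf_years = 1e6;     /* Assumed per-instance mean time between failures. */
	double hours_per_year = 365.25 * 24.0;

	/* Aggregate failure rate across the installed base. */
	double failures_per_day = instances / mtbf_years / 365.25;

	/* Per-instance MTBF giving a 50% chance of failure in a one-hour run. */
	double needed_mtbf_hours = 1.0 / log(2.0);
	double speedup = mtbf_years * hours_per_year / needed_mtbf_hours;

	printf("failures/day across installed base: %.1f\n", failures_per_day);
	printf("required acceleration: %.1e (%.1f orders of magnitude)\n",
	       speedup, log10(speedup));
	return 0;
}

This prints roughly 2.7 failures per day and a required acceleration of about 6e9, consistent with the “almost three times per day” and “more than nine orders of magnitude” figures quoted here.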
installed base. A test with a 50% chance of encountering this bug in a one-hour run
would need to increase that bug’s probability of occurrence by more than nine orders
of magnitude, which poses a severe challenge to today’s testing methodologies. One
important tool that can sometimes be applied with good effect to such situations is
formal verification, the subject of the next chapter.
Beware of bugs in the above code; I have only
proved it correct, not tried it.
Donald Knuth
Chapter 12
Formal Verification
Parallel algorithms can be hard to write, and even harder to debug. Testing, though
essential, is insufficient, as fatal race conditions can have extremely low probabilities
of occurrence. Proofs of correctness can be valuable, but in the end are just as prone
to human error as is the original algorithm. In addition, a proof of correctness cannot
be expected to find errors in your assumptions, shortcomings in the requirements,
misunderstandings of the underlying software or hardware primitives, or errors that you
did not think to construct a proof for. This means that formal methods can never replace
testing; however, formal methods are nevertheless a valuable addition to your validation
toolbox.
It would be very helpful to have a tool that could somehow locate all race conditions.
A number of such tools exist, for example, Section 12.1 provides an introduction to
the general-purpose state-space search tools Promela and Spin, Section 12.2 similarly
introduces the special-purpose ppcmem and cppmem tools, Section 12.3 looks at an
example axiomatic approach, Section 12.4 briefly overviews SAT solvers, and finally
Section 12.5 sums up use of formal-verification tools for verifying parallel algorithms.
1 #define NUMPROCS 2
2
3 byte counter = 0;
4 byte progress[NUMPROCS];
5
6 proctype incrementer(byte me)
7 {
8 int temp;
9
10 temp = counter;
11 counter = temp + 1;
12 progress[me] = 1;
13 }
14
15 init {
16 int i = 0;
17 int sum = 0;
18
19 atomic {
20 i = 0;
21 do
22 :: i < NUMPROCS ->
23 progress[i] = 0;
24 run incrementer(i);
25 i++
26 :: i >= NUMPROCS -> break
27 od;
28 }
29 atomic {
30 i = 0;
31 sum = 0;
32 do
33 :: i < NUMPROCS ->
34 sum = sum + progress[i];
35 i++
36 :: i >= NUMPROCS -> break
37 od;
38 assert(sum < NUMPROCS || counter == NUMPROCS)
39 }
40 }
your algorithm, either verifying or finding counter-examples for assertions that you can
include in your Promela program.
This full-state search can be extremely powerful, but can also be a two-edged sword.
If your algorithm is too complex or your Promela implementation is careless, there
might be more states than fit in memory. Furthermore, even given sufficient memory, the
state-space search might well run for longer than the expected lifetime of the universe.
Therefore, use this tool for compact but complex parallel algorithms. Attempts to
naively apply it to even moderate-scale algorithms (let alone the full Linux kernel) will
end badly.
Promela and Spin may be downloaded from http://spinroot.com/spin/
whatispin.html.
The above site also gives links to Gerard Holzmann’s excellent book [Hol03] on
Promela and Spin, as well as searchable online references starting at: http://www.
spinroot.com/spin/Man/index.html.
The remainder of this section describes how to use Promela to debug parallel
algorithms, starting with simple examples and progressing to more complex uses.
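To run the incrementer model shown earlier through Spin, assuming that it has been saved as increment.spin (the name used by the trail-file command below), a typical command sequence is the following sketch; the exact compiler invocation may vary from system to system:

spin -a increment.spin      # generate pan.c from the model
cc -DSAFETY -o pan pan.c    # compile the verifier
./pan                       # run the full state-space search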
This will produce output as shown in Figure 12.2. The first line tells us that our
assertion was violated (as expected given the non-atomic increment!). The second line notes
that a trail file was written describing how the assertion was violated. The “Warning”
line reiterates that all was not well with our model. The second paragraph describes the
type of state-search being carried out, in this case for assertion violations and invalid
end states. The third paragraph gives state-size statistics: this small model had only 45
states. The final line shows memory usage.
The trail file may be rendered human-readable as follows:
spin -t -p increment.spin
This gives the output shown in Figure 12.3. As can be seen, the first portion of the
init block created both incrementer processes, both of which first fetched the counter,
then both incremented and stored it, losing a count. The assertion then triggered, after
which the global state is displayed.
• spin -a qrcu.spin Create a file pan.c that fully searches the state machine.
The -DSAFETY generates optimizations that are appropriate if you have only
assertions (and perhaps never statements). If you have liveness, fairness, or
forward-progress checks, you may need to compile without -DSAFETY. If you
leave off -DSAFETY when you could have used it, the program will let you know.
The optimizations produced by -DSAFETY greatly speed things up, so you should
use it when you can. An example situation where you cannot use -DSAFETY is
when checking for livelocks (AKA “non-progress cycles”) via -DNP.
• ./pan This actually searches the state space. The number of states can reach into
the tens of millions with very small state machines, so you will need a machine
with large memory. For example, qrcu.spin with 3 readers and 2 updaters
required 2.7GB of memory.
If you aren’t sure whether your machine has enough memory, run top in one
window and ./pan in another. Keep the focus on the ./pan window so that
you can quickly kill execution if need be. As soon as CPU time drops much below
100%, kill ./pan. If you have removed focus from the window running ./pan,
you may wait a long time for the windowing system to grab enough memory to
do anything for you.
Don’t forget to capture the output, especially if you are working on a remote
machine.
If your model includes forward-progress checks, you will likely need to enable
“weak fairness” via the -f command-line argument to ./pan. If your forward-
progress checks involve accept labels, you will also need the -a argument.
• spin -t -p qrcu.spin Given trail file output by a run that encountered
an error, output the sequence of steps leading to that error. The -g flag will also
include the values of changed global variables, and the -l flag will also include
the values of changed local variables.
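Putting these commands together, a typical safety-only verification run might look like the following sketch (add the -f and -a flags to the ./pan invocation if your model contains forward-progress checks, as noted above):

spin -a qrcu.spin             # generate pan.c
cc -DSAFETY -o pan pan.c      # compile (assertions only, so -DSAFETY applies)
./pan | tee pan.out           # search the state space, capturing the output
spin -t -p -g -l qrcu.spin    # on error, replay the trail with globals and locals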
6. In C torture-test code, it is often wise to keep per-task control variables. They are
cheap to read, and greatly aid in debugging the test code. In Promela, per-task
control variables should be used only when there is no other alternative. To see
this, consider a 5-task verification with one bit each to indicate completion. This
gives 32 states. In contrast, a simple counter would have only six states, more
than a five-fold reduction. That factor of five might not seem like a problem, at
least not until you are struggling with a verification program possessing more
than 150 million states consuming more than 10GB of memory!
7. One of the most challenging things both in C torture-test code and in Promela is
formulating good assertions. Promela also allows never claims that act sort of
like an assertion replicated between every line of code.
1 if
2 :: 1 -> r1 = x;
3 r2 = y
4 :: 1 -> r2 = y;
5 r1 = x
6 fi
2. State reduction. If you have complex assertions, evaluate them under atomic.
After all, they are not part of the algorithm. One example of a complex assertion
(to be discussed in more detail later) is as shown in Figure 12.6.
1 i = 0;
2 sum = 0;
3 do
4 :: i < N_QRCU_READERS ->
5 sum = sum + (readerstart[i] == 1 &&
6 readerprogress[i] == 1);
7 i++
8 :: i >= N_QRCU_READERS ->
9 assert(sum == 0);
10 break
11 od
3. Promela does not provide functions. You must instead use C preprocessor macros.
However, you must use them carefully in order to avoid combinatorial explosion.
1 #include "lock.h"
2
3 #define N_LOCKERS 3
4
5 bit mutex = 0;
6 bit havelock[N_LOCKERS];
7 int sum;
8
9 proctype locker(byte me)
10 {
11 do
12 :: 1 ->
13 spin_lock(mutex);
14 havelock[me] = 1;
15 havelock[me] = 0;
16 spin_unlock(mutex)
17 od
18 }
19
20 init {
21 int i = 0;
22 int j;
23
24 end: do
25 :: i < N_LOCKERS ->
26 havelock[i] = 0;
27 run locker(i);
28 i++
29 :: i >= N_LOCKERS ->
30 sum = 0;
31 j = 0;
32 atomic {
33 do
34 :: j < N_LOCKERS ->
35 sum = sum + havelock[j];
36 j = j + 1
37 :: j >= N_LOCKERS ->
38 break
39 od
40 }
41 assert(sum <= 1);
42 break
43 od
44 }
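The lock.h file defining the spin_lock() and spin_unlock() macros is not reproduced in this excerpt; a reconstruction consistent with the line-by-line description that follows (a sketch, not necessarily the verbatim original) is shown here:

1 #define spin_lock(mutex) \
2   do \
3   :: 1 -> atomic { \
4        if \
5        :: mutex == 0 -> \
6          mutex = 1; \
7          break \
8        :: else -> skip \
9        fi \
10     } \
11   od
12
13 #define spin_unlock(mutex) \
14   mutex = 0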
The body of the spin_lock() macro’s outer do-od loop is a single atomic block that contains an if-fi statement. The if-fi construct is similar
to the do-od construct, except that it takes a single pass rather than looping. If the lock
is not held on line 5, then line 6 acquires it and line 7 breaks out of the enclosing do-od
loop (and also exits the atomic block). On the other hand, if the lock is already held on
line 8, we do nothing (skip), and fall out of the if-fi and the atomic block so as to take
another pass through the outer loop, repeating until the lock is available.
The spin_unlock() macro simply marks the lock as no longer held.
Note that memory barriers are not needed because Promela assumes full ordering.
In any given Promela state, all processes agree on both the current state and the order
of state changes that caused us to arrive at the current state. This is analogous to the
“sequentially consistent” memory model used by a few computer systems (such as MIPS
and PA-RISC). As noted earlier, and as will be seen in a later example, weak memory
ordering must be explicitly coded.
These macros are tested by the Promela code shown in Figure 12.9. This code is
similar to that used to test the increments, with the number of locking processes defined
by the N_LOCKERS macro definition on line 3. The mutex itself is defined on line 5, an
array to track the lock owner on line 6, and line 7 is used by assertion code to verify that at most one process holds the lock at any given time.
The output will look something like that shown in Figure 12.10. As expected, this
run has no assertion failures (“errors: 0”).
Quick Quiz 12.1: Why is there an unreached statement in locker? After all, isn’t
this a full state-space search?
Quick Quiz 12.2: What are some Promela code-style issues with this example?
idx = qrcu_read_lock(&my_qrcu_struct);
/* read-side critical section. */
qrcu_read_unlock(&my_qrcu_struct, idx);
A Linux-kernel patch for QRCU has been produced [McK07b], but has not yet been
included in the Linux kernel as of April 2008.
1 #include "lock.h"
2
3 #define N_QRCU_READERS 2
4 #define N_QRCU_UPDATERS 2
5
6 bit idx = 0;
7 byte ctr[2];
8 byte readerprogress[N_QRCU_READERS];
9 bit mutex = 0;
Returning to the Promela code for QRCU, the global variables are as shown in
Figure 12.11. This example uses locking, hence including lock.h. Both the number of
readers and writers can be varied using the two #define statements, giving us not one
but two ways to create combinatorial explosion. The idx variable controls which of the
two elements of the ctr array will be used by readers, and the readerprogress
variable allows an assertion to determine when all the readers are finished (since a QRCU
update cannot be permitted to complete until all pre-existing readers have completed
their QRCU read-side critical sections). The readerprogress array elements have
values as follows, indicating the state of the corresponding reader:
atomically increment it (and break from the infinite loop) if its value was non-zero
(atomic_inc_not_zero()). Line 17 marks entry into the RCU read-side critical
section, and line 18 marks exit from this critical section, both lines for the benefit of the
assert() statement that we shall encounter later. Line 19 atomically decrements the
same counter that we incremented, thereby exiting the RCU read-side critical section.
1 #define sum_unordered \
2 atomic { \
3 do \
4 :: 1 -> \
5 sum = ctr[0]; \
6 i = 1; \
7 break \
8 :: 1 -> \
9 sum = ctr[1]; \
10 i = 0; \
11 break \
12 od; \
13 } \
14 sum = sum + ctr[i]
The C-preprocessor macro shown in Figure 12.13 sums the pair of counters so as
to emulate weak memory ordering. Lines 2-13 fetch one of the counters, and line 14
fetches the other of the pair and sums them. The atomic block consists of a single do-od
statement. This do-od statement (spanning lines 3-12) is unusual in that it contains
two unconditional branches with guards on lines 4 and 8, which causes Promela to
non-deterministically choose one of the two (but again, the full state-space search causes
Promela to eventually make all possible choices in each applicable situation). The
first branch fetches the zero-th counter and sets i to 1 (so that line 14 will fetch the
first counter), while the second branch does the opposite, fetching the first counter and
setting i to 0 (so that line 14 will fetch the second counter).
Quick Quiz 12.3: Is there a more straightforward way to code the do-od statement?
With the sum_unordered macro in place, we can now proceed to the update-side
process shown in Figure 12.14. The update-side process repeats indefinitely, with the
corresponding do-od loop ranging over lines 7-57. Each pass through the loop first
snapshots the global readerprogress array into the local readerstart array on
lines 12-21. This snapshot will be used for the assertion on line 53. Line 23 invokes
sum_unordered, and then lines 24-27 re-invoke sum_unordered if the fastpath
is potentially usable.
Lines 28-40 execute the slowpath code if need be, with lines 30 and 38 acquiring and
releasing the update-side lock, lines 31-33 flipping the index, and lines 34-37 waiting
for all pre-existing readers to complete.
Lines 44-56 then compare the current values in the readerprogress array to
those collected in the readerstart array, forcing an assertion failure should any
readers that started before this update still be in progress.
Quick Quiz 12.4: Why are there atomic blocks at lines 12-21 and lines 44-56, when
the operations within those atomic blocks have no atomic implementation on any current
production microprocessor?
Quick Quiz 12.5: Is the re-summing of the counters on lines 24-27 really necessary?
1 init {
2 int i;
3
4 atomic {
5 ctr[idx] = 1;
6 ctr[!idx] = 0;
7 i = 0;
8 do
9 :: i < N_QRCU_READERS ->
10 readerprogress[i] = 0;
11 run qrcu_reader(i);
12 i++
13 :: i >= N_QRCU_READERS -> break
14 od;
15 i = 0;
16 do
17 :: i < N_QRCU_UPDATERS ->
18 run qrcu_updater(i);
19 i++
20 :: i >= N_QRCU_UPDATERS -> break
21 od
22 }
23 }
All that remains is the initialization block shown in Figure 12.15. This block simply
initializes the counter pair on lines 5-6, spawns the reader processes on lines 7-14, and
spawns the updater processes on lines 15-21. This is all done within an atomic block to
reduce state space.
The resulting output shows that this model passes all of the cases shown in Table 12.2.
Now, it would be nice to run the case with three readers and three updaters; however,
simple extrapolation indicates that this will require on the order of a terabyte of memory
best case. So, what to do? Here are some possible approaches:
1. See whether a smaller number of readers and updaters suffice to prove the general
case.
2. The counter corresponding to this reader will have been at least 1 during this time
interval.
4. Therefore, at any given point in time, either one of the counters will be at least 2,
or both of the counters will be at least one.
5. However, the synchronize_qrcu() fastpath code can read only one of the
counters at a given time. It is therefore possible for the fastpath code to fetch the
first counter while zero, but to race with a counter flip so that the second counter
is seen as one.
6. There can be at most one reader persisting through such a race condition, as
otherwise the sum would be two or greater, which would cause the updater to
take the slowpath.
7. But if the race occurs on the fastpath’s first read of the counters, and then again
on its second read, there have to have been two counter flips.
8. Because a given updater flips the counter only once, and because the update-side
lock prevents a pair of updaters from concurrently flipping the counters, the
only way that the fastpath code can race with a flip twice is if the first updater
completes.
9. But the first updater will not complete until after all pre-existing readers have
completed.
10. Therefore, if the fastpath races with a counter flip twice in succession, all pre-
existing readers must have completed, so that it is safe to take the fastpath.
Of course, not all parallel algorithms have such simple proofs. In such cases, it may
be necessary to enlist more capable tools.
This effort produced a series of seven increasingly realistic Promela models, the last of which passes,
consuming about 40GB of main memory for the state space.
More important, Promela and Spin did find a very subtle bug for me!
Quick Quiz 12.7: Yeah, that’s just great! Now, just what am I supposed to do if I
don’t happen to have a machine with 40GB of main memory???
Still better would be to come up with a simpler and faster algorithm that has a
smaller state space. Even better would be an algorithm so simple that its correctness
was obvious to the casual observer!
Section 12.1.5.1 gives an overview of preemptible RCU’s dynticks interface, Section 12.1.6 presents the Promela model used to validate that interface, and Section 12.1.6.8 lists lessons (re)learned during this effort.
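The walkthrough below refers to rcu_irq_enter(), whose listing is not included in this excerpt; a reconstruction consistent with that walkthrough (a sketch of the 2008-era code, not necessarily the verbatim original) is as follows:

1 void rcu_irq_enter(void)
2 {
3   int cpu = smp_processor_id();
4
5   if (per_cpu(rcu_update_flag, cpu))
6     per_cpu(rcu_update_flag, cpu)++;
7   if (!in_interrupt() &&
8       (per_cpu(dynticks_progress_counter,
9                cpu) & 0x1) == 0) {
10    per_cpu(dynticks_progress_counter, cpu)++;
11    smp_mb();
12    per_cpu(rcu_update_flag, cpu)++;
13  }
14 }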
Line 3 fetches the current CPU’s number, while lines 5 and 6 increment the rcu_
update_flag nesting counter if it is already non-zero. Lines 7-9 check to see
whether we are the outermost level of interrupt, and, if so, whether dynticks_
progress_counter needs to be incremented. If so, line 10 increments dynticks_
progress_counter, line 11 executes a memory barrier, and line 12 increments
rcu_update_flag. As with rcu_exit_nohz(), the memory barrier ensures
that any other CPU that sees the effects of an RCU read-side critical section in the
interrupt handler (following the rcu_irq_enter() invocation) will also see the
increment of dynticks_progress_counter.
Quick Quiz 12.8: Why not simply increment rcu_update_flag, and then only
increment dynticks_progress_counter if the old value of rcu_update_
flag was zero???
Quick Quiz 12.9: But if line 7 finds that we are the outermost interrupt, wouldn’t
we always need to increment dynticks_progress_counter?
Interrupt exit is handled similarly by rcu_irq_exit():
1 void rcu_irq_exit(void)
2 {
3 int cpu = smp_processor_id();
4
5 if (per_cpu(rcu_update_flag, cpu)) {
6 if (--per_cpu(rcu_update_flag, cpu))
7 return;
8 WARN_ON(in_interrupt());
9 smp_mb();
10 per_cpu(dynticks_progress_counter, cpu)++;
11 WARN_ON(per_cpu(dynticks_progress_counter,
12 cpu) & 0x1);
13 }
14 }
Line 3 fetches the current CPU’s number, as before. Line 5 checks to see if the
rcu_update_flag is non-zero, returning immediately (via falling off the end of
the function) if not. Otherwise, lines 6 through 12 come into play. Line 6 decrements
rcu_update_flag, returning if the result is not zero. Line 8 verifies that we are
indeed leaving the outermost level of nested interrupts, line 9 executes a memory barrier,
line 10 increments dynticks_progress_counter, and lines 11 and 12 verify that
this variable is now even. As with rcu_enter_nohz(), the memory barrier ensures
that any other CPU that sees the increment of dynticks_progress_counter
will also see the effects of an RCU read-side critical section in the interrupt handler
(preceding the rcu_irq_exit() invocation).
These two sections have described how the dynticks_progress_counter
variable is maintained during entry to and exit from dynticks-idle mode, both by tasks
and by interrupts and NMIs. The following section describes how this variable is used
by preemptible RCU’s grace-period machinery.
[Figure 12.16: Preemptible RCU’s grace-period state machine: rcu_try_flip_idle_state (no RCU activity), rcu_try_flip_waitack_state (wait for acknowledgements), rcu_try_flip_waitzero_state (wait for RCU read-side critical sections to complete), and rcu_try_flip_waitmb_state (wait for memory barriers), with memory-barrier transitions between states.]
Of the four preemptible RCU grace-period states shown in Figure 12.16, only the
rcu_try_flip_waitack_state() and rcu_try_flip_waitmb_state()
states need to wait for other CPUs to respond.
Of course, if a given CPU is in dynticks-idle state, we shouldn’t wait for it. Therefore,
just before entering one of these two states, the preceding state takes a snapshot of each
CPU’s dynticks_progress_counter variable, placing the snapshot in another
per-CPU variable, rcu_dyntick_snapshot. This is accomplished by invoking
dyntick_save_progress_counter(), shown below:
1 static void dyntick_save_progress_counter(int cpu)
2 {
3 per_cpu(rcu_dyntick_snapshot, cpu) =
4 per_cpu(dynticks_progress_counter, cpu);
5 }
1 proctype dyntick_nohz()
2 {
3 byte tmp;
4 byte i = 0;
5
6 do
7 :: i >= MAX_DYNTICK_LOOP_NOHZ -> break;
8 :: i < MAX_DYNTICK_LOOP_NOHZ ->
9 tmp = dynticks_progress_counter;
10 atomic {
11 dynticks_progress_counter = tmp + 1;
12 assert((dynticks_progress_counter & 1) == 1);
13 }
14 tmp = dynticks_progress_counter;
15 atomic {
16 dynticks_progress_counter = tmp + 1;
17 assert((dynticks_progress_counter & 1) == 0);
18 }
19 i++;
20 od;
21 }
Lines 6 and 20 define a loop. Line 7 exits the loop once the loop counter i has
exceeded the limit MAX_DYNTICK_LOOP_NOHZ. Line 8 tells the loop construct
to execute lines 9-19 for each pass through the loop. Because the conditionals on
lines 7 and 8 are exclusive of each other, the normal Promela random selection of
true conditions is disabled. Lines 9 and 11 model rcu_exit_nohz()’s non-atomic
increment of dynticks_progress_counter, while line 12 models the WARN_
ON(). The atomic construct simply reduces the Promela state space, given that
the WARN_ON() is not strictly speaking part of the algorithm. Lines 14-18 similarly
model the increment and WARN_ON() for rcu_enter_nohz(). Finally, line 19
increments the loop counter.
Each pass through the loop therefore models a CPU exiting dynticks-idle mode (for
example, starting to execute a task), then re-entering dynticks-idle mode (for example, that same task blocking).
1 proctype grace_period()
2 {
3 byte curr;
4 byte snap;
5
6 atomic {
7 printf("MDLN = %d\n", MAX_DYNTICK_LOOP_NOHZ);
8 snap = dynticks_progress_counter;
9 }
10 do
11 :: 1 ->
12 atomic {
13 curr = dynticks_progress_counter;
14 if
15 :: (curr == snap) && ((curr & 1) == 0) ->
16 break;
17 :: (curr - snap) > 2 || (snap & 1) == 0 ->
18 break;
19 :: 1 -> skip;
20 fi;
21 }
22 od;
23 snap = dynticks_progress_counter;
24 do
25 :: 1 ->
26 atomic {
27 curr = dynticks_progress_counter;
28 if
29 :: (curr == snap) && ((curr & 1) == 0) ->
30 break;
31 :: (curr != snap) ->
32 break;
33 :: 1 -> skip;
34 fi;
35 }
36 od;
37 }
Lines 6-9 print out the loop limit (but only into the .trail file in case of error) and
model a line of code from rcu_try_flip_idle() and its call to dyntick_save_
progress_counter(), which takes a snapshot of the current CPU’s dynticks_
progress_counter variable. These two lines are executed atomically to reduce
state space.
Lines 10-22 model the relevant code in rcu_try_flip_waitack() and its call
to rcu_try_flip_waitack_needed(). This loop is modeling the grace-period
state machine waiting for a counter-flip acknowledgement from each CPU, but only that
part that interacts with dynticks-idle CPUs.
Line 23 models a line from rcu_try_flip_waitzero() and its call to dyntick_save_progress_counter().
Lines 6, 10, 25, 26, 29, and 44 update the gp_state variable (combining atomically with
algorithmic operations where feasible) to allow the dyntick_nohz() process to
verify the basic RCU safety property. The form of this verification is to assert that the
value of the gp_state variable cannot jump from GP_IDLE to GP_DONE during a
time period over which RCU readers could plausibly persist.
Quick Quiz 12.14: Given there are a pair of back-to-back changes to gp_state
on lines 25 and 26, how can we be sure that line 25’s changes won’t be lost?
The dyntick_nohz() Promela process implements this verification as shown
below:
1 proctype dyntick_nohz()
2 {
3 byte tmp;
4 byte i = 0;
5 bit old_gp_idle;
6
7 do
8 :: i >= MAX_DYNTICK_LOOP_NOHZ -> break;
9 :: i < MAX_DYNTICK_LOOP_NOHZ ->
10 tmp = dynticks_progress_counter;
11 atomic {
12 dynticks_progress_counter = tmp + 1;
13 old_gp_idle = (gp_state == GP_IDLE);
14 assert((dynticks_progress_counter & 1) == 1);
15 }
16 atomic {
17 tmp = dynticks_progress_counter;
18 assert(!old_gp_idle ||
19 gp_state != GP_DONE);
20 }
21 atomic {
22 dynticks_progress_counter = tmp + 1;
23 assert((dynticks_progress_counter & 1) == 0);
24 }
25 i++;
26 od;
27 }
Line 13 sets a new old_gp_idle flag if the value of the gp_state variable is
GP_IDLE at the beginning of task execution, and the assertion at lines 18 and 19 fires
if the gp_state variable has advanced to GP_DONE during task execution, which
would be illegal given that a single RCU read-side critical section could span the entire
intervening time period.
The resulting model (dyntickRCU-base-s.spin), when run with the runspin.
sh script, generates 964 states and passes without errors, which is reassuring. That said,
although safety is critically important, it is also quite important to avoid indefinitely
stalling grace periods. The next section therefore covers verifying liveness.
40 shouldexit = dyntick_nohz_done;
41 curr = dynticks_progress_counter;
42 if
43 :: (curr == snap) && ((curr & 1) == 0) ->
44 break;
45 :: (curr != snap) ->
46 break;
47 :: else -> skip;
48 fi;
49 }
50 od;
51 gp_state = GP_DONE;
52 }
The first part of the condition is correct, because if curr and snap differ by two,
there will be at least one even number in between, corresponding to having passed
completely through a dynticks-idle phase. However, the second part of the condition
corresponds to having started in dynticks-idle mode, not having finished in this mode.
We therefore need to be testing curr rather than snap for being an even number.
The corrected C code is as follows:
1 static inline int
2 rcu_try_flip_waitack_needed(int cpu)
3 {
4 long curr;
5 long snap;
6
7 curr = per_cpu(dynticks_progress_counter, cpu);
8 snap = per_cpu(rcu_dyntick_snapshot, cpu);
9 smp_mb();
10 if ((curr == snap) && ((curr & 0x1) == 0))
11 return 0;
12 if ((curr - snap) > 2 || (curr & 0x1) == 0)
13 return 0;
14 return 1;
15 }
Lines 10-13 can now be combined and simplified, resulting in the following. A
similar simplification can be applied to rcu_try_flip_waitmb_needed().
1 static inline int
2 rcu_try_flip_waitack_needed(int cpu)
3 {
4 long curr;
5 long snap;
6
7 curr = per_cpu(dynticks_progress_counter, cpu);
8 snap = per_cpu(rcu_dyntick_snapshot, cpu);
9 smp_mb();
10 if ((curr - snap) >= 2 || (curr & 0x1) == 0)
11 return 0;
12 return 1;
13 }
12.1.6.4 Interrupts
There are a couple of ways to model interrupts in Promela:
1. using C-preprocessor tricks to insert the interrupt handler between each and every
statement of the dynticks_nohz() process, or
2. modeling the interrupt handler as a separate process.
A bit of thought indicated that the second approach would have a smaller state
space, though it requires that the interrupt handler somehow run atomically with respect
to the mainline dyntick_nohz() process.
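The irq-only version of the EXECUTE_MAINLINE() macro described below is not reproduced in this excerpt; a reconstruction consistent with that description, and with the NMI-aware version shown later, is:

1 #define EXECUTE_MAINLINE(label, stmt) \
2 label: skip; \
3   atomic { \
4     if \
5     :: in_dyntick_irq -> goto label; \
6     :: else -> stmt; \
7     fi; \
8   } \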
Line 2 of the macro creates the specified statement label. Lines 3-8 are an atomic
block that tests the in_dyntick_irq variable, and if this variable is set (indicating
that the interrupt handler is active), branches out of the atomic block back to the label.
Otherwise, line 6 executes the specified statement. The overall effect is that mainline
execution stalls any time an interrupt is active, as required.
the relevant snippet of __irq_exit(), and finally lines 32-43 model rcu_irq_
exit().
Quick Quiz 12.18: What property of interrupts is this dynticks_irq() process
unable to model?
The grace_period() process then becomes as follows:
1 proctype grace_period()
2 {
3 byte curr;
4 byte snap;
5 bit shouldexit;
6
7 gp_state = GP_IDLE;
8 atomic {
9 printf("MDLN = %d\n", MAX_DYNTICK_LOOP_NOHZ);
10 printf("MDLI = %d\n", MAX_DYNTICK_LOOP_IRQ);
11 shouldexit = 0;
12 snap = dynticks_progress_counter;
13 gp_state = GP_WAITING;
14 }
15 do
16 :: 1 ->
17 atomic {
18 assert(!shouldexit);
19 shouldexit = dyntick_nohz_done && dyntick_irq_done;
20 curr = dynticks_progress_counter;
21 if
22 :: (curr - snap) >= 2 || (curr & 1) == 0 ->
23 break;
24 :: else -> skip;
25 fi;
26 }
27 od;
28 gp_state = GP_DONE;
29 gp_state = GP_IDLE;
30 atomic {
31 shouldexit = 0;
32 snap = dynticks_progress_counter;
33 gp_state = GP_WAITING;
34 }
35 do
36 :: 1 ->
37 atomic {
38 assert(!shouldexit);
39 shouldexit = dyntick_nohz_done && dyntick_irq_done;
40 curr = dynticks_progress_counter;
41 if
42 :: (curr != snap) || ((curr & 1) == 0) ->
43 break;
44 :: else -> skip;
45 fi;
46 }
47 od;
48 gp_state = GP_DONE;
49 }
69 od;
70 dyntick_irq_done = 1;
71 }
36 if
37 :: rcu_update_flag == 0 ->
38 tmp = dynticks_progress_counter;
39 dynticks_progress_counter = tmp + 1;
40 :: else -> skip;
41 fi;
42 :: else -> skip;
43 fi;
44 atomic {
45 i++;
46 in_dyntick_nmi = 0;
47 }
48 od;
49 dyntick_nmi_done = 1;
50 }
Of course, the fact that we have NMIs requires adjustments in the other components.
For example, the EXECUTE_MAINLINE() macro now needs to pay attention to the
NMI handler (in_dyntick_nmi) as well as the interrupt handler (in_dyntick_
irq) by checking the dyntick_nmi_done variable as follows:
1 #define EXECUTE_MAINLINE(label, stmt) \
2 label: skip; \
3 atomic { \
4 if \
5 :: in_dyntick_irq || \
6 in_dyntick_nmi -> goto label; \
7 :: else -> stmt; \
8 fi; \
9 } \
24 fi;
25 }
26 stmt1_then: skip;
27 EXECUTE_IRQ(stmt1_1, tmp = rcu_update_flag)
28 EXECUTE_IRQ(stmt1_2, rcu_update_flag = tmp + 1)
29 stmt1_else: skip;
30 stmt2: skip; atomic {
31 if
32 :: in_dyntick_nmi -> goto stmt2;
33 :: !in_dyntick_nmi &&
34 !in_interrupt &&
35 (dynticks_progress_counter & 1) == 0 ->
36 goto stmt2_then;
37 :: else -> goto stmt2_else;
38 fi;
39 }
40 stmt2_then: skip;
41 EXECUTE_IRQ(stmt2_1, tmp = dynticks_progress_counter)
42 EXECUTE_IRQ(stmt2_2,
43 dynticks_progress_counter = tmp + 1)
44 EXECUTE_IRQ(stmt2_3, tmp = rcu_update_flag)
45 EXECUTE_IRQ(stmt2_4, rcu_update_flag = tmp + 1)
46 stmt2_else: skip;
47 EXECUTE_IRQ(stmt3, tmp = in_interrupt)
48 EXECUTE_IRQ(stmt4, in_interrupt = tmp + 1)
49 stmt5: skip;
50 atomic {
51 if
52 :: in_dyntick_nmi -> goto stmt4;
53 :: !in_dyntick_nmi && outermost ->
54 old_gp_idle = (gp_state == GP_IDLE);
55 :: else -> skip;
56 fi;
57 }
58 i++;
59 :: j < i ->
60 stmt6: skip;
61 atomic {
62 if
63 :: in_dyntick_nmi -> goto stmt6;
64 :: !in_dyntick_nmi && j + 1 == i ->
65 assert(!old_gp_idle ||
66 gp_state != GP_DONE);
67 :: else -> skip;
68 fi;
69 }
70 EXECUTE_IRQ(stmt7, tmp = in_interrupt);
71 EXECUTE_IRQ(stmt8, in_interrupt = tmp - 1);
72
73 stmt9: skip;
74 atomic {
75 if
76 :: in_dyntick_nmi -> goto stmt9;
77 :: !in_dyntick_nmi && rcu_update_flag != 0 ->
78 goto stmt9_then;
79 :: else -> goto stmt9_else;
80 fi;
81 }
82 stmt9_then: skip;
83 EXECUTE_IRQ(stmt9_1, tmp = rcu_update_flag)
84 EXECUTE_IRQ(stmt9_2, rcu_update_flag = tmp - 1)
85 stmt9_3: skip;
86 atomic {
87 if
88 :: in_dyntick_nmi -> goto stmt9_3;
89 :: !in_dyntick_nmi && rcu_update_flag == 0 ->
90 goto stmt9_3_then;
91 :: else -> goto stmt9_3_else;
92 fi;
93 }
94 stmt9_3_then: skip;
95 EXECUTE_IRQ(stmt9_3_1,
96 tmp = dynticks_progress_counter)
97 EXECUTE_IRQ(stmt9_3_2,
98 dynticks_progress_counter = tmp + 1)
99 stmt9_3_else:
100 stmt9_else: skip;
101 atomic {
102 j++;
103 in_dyntick_irq = (i != j);
104 }
105 od;
106 dyntick_irq_done = 1;
107 }
Note that we have open-coded the “if” statements (for example, lines 17-29). In
addition, statements that process strictly local state (such as line 58) need not exclude
dyntick_nmi().
Finally, grace_period() requires only a few changes:
1 proctype grace_period()
2 {
3 byte curr;
4 byte snap;
5 bit shouldexit;
6
7 gp_state = GP_IDLE;
8 atomic {
9 printf("MDLN = %d\n", MAX_DYNTICK_LOOP_NOHZ);
10 printf("MDLI = %d\n", MAX_DYNTICK_LOOP_IRQ);
11 printf("MDLN = %d\n", MAX_DYNTICK_LOOP_NMI);
12 shouldexit = 0;
13 snap = dynticks_progress_counter;
14 gp_state = GP_WAITING;
15 }
16 do
17 :: 1 ->
18 atomic {
19 assert(!shouldexit);
20 shouldexit = dyntick_nohz_done &&
21 dyntick_irq_done &&
22 dyntick_nmi_done;
23 curr = dynticks_progress_counter;
24 if
25 :: (curr - snap) >= 2 || (curr & 1) == 0 ->
26 break;
27 :: else -> skip;
28 fi;
29 }
30 od;
31 gp_state = GP_DONE;
32 gp_state = GP_IDLE;
33 atomic {
34 shouldexit = 0;
35 snap = dynticks_progress_counter;
36 gp_state = GP_WAITING;
37 }
38 do
39 :: 1 ->
40 atomic {
41 assert(!shouldexit);
42 shouldexit = dyntick_nohz_done &&
43 dyntick_irq_done &&
44 dyntick_nmi_done;
45 curr = dynticks_progress_counter;
46 if
47 :: (curr != snap) || ((curr & 1) == 0) ->
48 break;
49 :: else -> skip;
50 fi;
51 }
52 od;
53 gp_state = GP_DONE;
54 }
2. Documenting code can help locate bugs. In this case, the documentation effort
located a misplaced memory barrier in rcu_enter_nohz() and rcu_exit_
nohz(), as shown by the patch in Figure 12.17.
3. Validate your code early, often, and up to the point of destruction. This effort
located one subtle bug in rcu_try_flip_waitack_needed() that would
have been quite difficult to test or debug, as shown by the patch in Figure 12.18.
4. Always verify your verification code. The usual way to do this is to insert
a deliberate bug and verify that the verification code catches it. Of course, if
the verification code fails to catch this bug, you may also need to verify the
bug itself, and so on, recursing infinitely. However, if you find yourself in this
position, getting a good night’s sleep can be an extremely effective debugging
technique. You will then see that the obvious verify-the-verification technique is
to deliberately insert bugs in the code being verified. If the verification fails to
find them, the verification clearly is buggy.
6. The need for complex formal verification often indicates a need to re-think
your design.
1 struct rcu_dynticks {
2 int dynticks_nesting;
3 int dynticks;
4 int dynticks_nmi;
5 };
6
7 struct rcu_data {
8 ...
9 int dynticks_snap;
10 int dynticks_nmi_snap;
11 ...
12 };
To this last point, it turns out that there is a much simpler solution to the dynticks
problem, which is presented in the next section.
The complexity of the dynticks interface for preemptible RCU is primarily due to the
fact that both irqs and NMIs use the same code path and the same state variables. This
leads to the notion of providing separate code paths and variables for irqs and NMIs,
as has been done for hierarchical RCU [McK08a] as indirectly suggested by Manfred
Spraul [Spr08].
Figure 12.19 shows the new per-CPU state variables. These variables are grouped into
structs to allow multiple independent RCU implementations (e.g., rcu and rcu_bh)
to conveniently and efficiently share dynticks state. In what follows, they can be thought
of as independent per-CPU variables.
The dynticks_nesting, dynticks, and dynticks_snap variables are for
the irq code paths, and the dynticks_nmi and dynticks_nmi_snap variables
are for the NMI code paths, although the NMI code path will also reference (but not
modify) the dynticks_nesting variable. These variables are used as follows:
1 void rcu_enter_nohz(void)
2 {
3 unsigned long flags;
4 struct rcu_dynticks *rdtp;
5
6 smp_mb();
7 local_irq_save(flags);
8 rdtp = &__get_cpu_var(rcu_dynticks);
9 rdtp->dynticks++;
10 rdtp->dynticks_nesting--;
11 WARN_ON(rdtp->dynticks & 0x1);
12 local_irq_restore(flags);
13 }
14
15 void rcu_exit_nohz(void)
16 {
17 unsigned long flags;
18 struct rcu_dynticks *rdtp;
19
20 local_irq_save(flags);
21 rdtp = &__get_cpu_var(rcu_dynticks);
22 rdtp->dynticks++;
23 rdtp->dynticks_nesting++;
24 WARN_ON(!(rdtp->dynticks & 0x1));
25 local_irq_restore(flags);
26 smp_mb();
27 }
1 void rcu_nmi_enter(void)
2 {
3 struct rcu_dynticks *rdtp;
4
5 rdtp = &__get_cpu_var(rcu_dynticks);
6 if (rdtp->dynticks & 0x1)
7 return;
8 rdtp->dynticks_nmi++;
9 WARN_ON(!(rdtp->dynticks_nmi & 0x1));
10 smp_mb();
11 }
12
13 void rcu_nmi_exit(void)
14 {
15 struct rcu_dynticks *rdtp;
16
17 rdtp = &__get_cpu_var(rcu_dynticks);
18 if (rdtp->dynticks & 0x1)
19 return;
20 smp_mb();
21 rdtp->dynticks_nmi++;
22 WARN_ON(rdtp->dynticks_nmi & 0x1);
23 }
1 void rcu_irq_enter(void)
2 {
3 struct rcu_dynticks *rdtp;
4
5 rdtp = &__get_cpu_var(rcu_dynticks);
6 if (rdtp->dynticks_nesting++)
7 return;
8 rdtp->dynticks++;
9 WARN_ON(!(rdtp->dynticks & 0x1));
10 smp_mb();
11 }
12
13 void rcu_irq_exit(void)
14 {
15 struct rcu_dynticks *rdtp;
16
17 rdtp = &__get_cpu_var(rcu_dynticks);
18 if (--rdtp->dynticks_nesting)
19 return;
20 smp_mb();
21 rdtp->dynticks++;
22 WARN_ON(rdtp->dynticks & 0x1);
23 if (__get_cpu_var(rcu_data).nxtlist ||
24 __get_cpu_var(rcu_bh_data).nxtlist)
25 set_need_resched();
26 }
1 static int
2 dyntick_save_progress_counter(struct rcu_data *rdp)
3 {
4 int ret;
5 int snap;
6 int snap_nmi;
7
8 snap = rdp->dynticks->dynticks;
9 snap_nmi = rdp->dynticks->dynticks_nmi;
10 smp_mb();
11 rdp->dynticks_snap = snap;
12 rdp->dynticks_nmi_snap = snap_nmi;
13 ret = ((snap & 0x1) == 0) &&
14 ((snap_nmi & 0x1) == 0);
15 if (ret)
16 rdp->dynticks_fqs++;
17 return ret;
18 }
1 static int
2 rcu_implicit_dynticks_qs(struct rcu_data *rdp)
3 {
4 long curr;
5 long curr_nmi;
6 long snap;
7 long snap_nmi;
8
9 curr = rdp->dynticks->dynticks;
10 snap = rdp->dynticks_snap;
11 curr_nmi = rdp->dynticks->dynticks_nmi;
12 snap_nmi = rdp->dynticks_nmi_snap;
13 smp_mb();
14 if ((curr != snap || (curr & 0x1) == 0) &&
15 (curr_nmi != snap_nmi ||
16 (curr_nmi & 0x1) == 0)) {
17 rdp->dynticks_fqs++;
18 return 1;
19 }
20 return rcu_implicit_offline_qs(rdp);
21 }
12.1.6.15 Discussion
A slight shift in viewpoint resulted in a substantial simplification of the dynticks interface
for RCU. The key change leading to this simplification was to minimize the sharing
between irq and NMI contexts. The only sharing in this simplified interface is references
from NMI context to irq variables (the dynticks variable). This type of sharing is
benign, because the NMI functions never update this variable, so that its value remains
constant through the lifetime of the NMI handler. This limitation of sharing allows the
individual functions to be understood one at a time, in happy contrast to the situation
described in Section 12.1.5, where an NMI might change shared state at any point during
execution of the irq functions.
Verification can be a good thing, but simplicity is even better.
1 PPC SB+lwsync-RMW-lwsync+isync-simple
2 ""
3 {
4 0:r2=x; 0:r3=2; 0:r4=y; 0:r10=0; 0:r11=0; 0:r12=z;
5 1:r2=y; 1:r4=x;
6 }
7 P0 | P1 ;
8 li r1,1 | li r1,1 ;
9 stw r1,0(r2) | stw r1,0(r2) ;
10 lwsync | sync ;
11 | lwz r3,0(r4) ;
12 lwarx r11,r10,r12 | ;
13 stwcx. r11,r10,r12 | ;
14 bne Fail1 | ;
15 isync | ;
16 lwz r3,0(r4) | ;
17 Fail1: | ;
18
19 exists
20 (0:r3=0 /\ 1:r3=0)
Promela and Spin, however, do not by themselves understand memory models or any sort of reordering semantics. This section therefore
describes some state-space search tools that understand memory models used by pro-
duction systems, greatly simplifying the verification of weakly ordered code.
For example, Section 12.1.4 showed how to convince Promela to account for weak
memory ordering. Although this approach can work well, it requires that the developer
fully understand the system’s memory model. Unfortunately, few (if any) developers
fully understand the complex memory models of modern CPUs.
Therefore, another approach is to use a tool that already understands this memory
ordering, such as the PPCMEM tool produced by Peter Sewell and Susmit Sarkar at the
University of Cambridge, Luc Maranget, Francesco Zappa Nardelli, and Pankaj Pawan
at INRIA, and Jade Alglave at Oxford University, in cooperation with Derek Williams
of IBM [AMP+ 11]. This group formalized the memory models of Power, ARM, x86, as
well as that of the C/C++11 standard [Bec11], and produced the PPCMEM tool based
on the Power and ARM formalizations.
Quick Quiz 12.22: But x86 has strong memory ordering! Why would you need to
formalize its memory model?
The PPCMEM tool takes litmus tests as input. A sample litmus test is presented
in Section 12.2.1. Section 12.2.2 relates this litmus test to the equivalent C-language
program, Section 12.2.3 describes how to apply PPCMEM to this litmus test, and
Section 12.2.4 discusses the implications.
Lines 4 and 5 give the initial values of the registers in the form P:R=V, where P is the process identifier, R is the register identifier, and V is the value. For example,
process 0’s register r3 initially contains the value 2. If the value is a variable (x, y, or z
in the example) then the register is initialized to the address of the variable. It is also
possible to initialize the contents of variables, for example, x=1 initializes the value
of x to 1. Uninitialized variables default to the value zero, so that in the example, x, y,
and z are all initially zero.
Line 7 provides identifiers for the two processes, so that the 0:r3=2 on line 4 could
instead have been written P0:r3=2. Line 7 is required, and the identifiers must be of
the form Pn, where n is the column number, starting from zero for the left-most column.
This may seem unnecessarily strict, but it does prevent considerable confusion in actual
use.
Quick Quiz 12.23: Why does line 8 of Figure 12.25 initialize the registers? Why
not instead initialize them on lines 4 and 5?
Lines 8-17 are the lines of code for each process. A given process can have empty
lines, as is the case for P0’s line 11 and P1’s lines 12-17. Labels and branches are
permitted, as demonstrated by the branch on line 14 to the label on line 17. That said,
too-free use of branches will expand the state space. Use of loops is a particularly good
way to explode your state space.
Lines 19-20 show the assertion, which in this case indicates that we are interested
in whether P0’s and P1’s r3 registers can both contain zero after both threads complete
execution. This assertion is important because there are a number of use cases that
would fail miserably if both P0 and P1 saw zero in their respective r3 registers.
This should give you enough information to construct simple litmus tests. Some
additional documentation is available, though much of this additional documentation is
intended for a different research tool that runs tests on actual hardware. Perhaps more
importantly, a large number of pre-existing litmus tests are available with the online
tool (available via the “Select ARM Test” and “Select POWER Test” buttons). It is
quite likely that one of these pre-existing litmus tests will answer your Power or ARM
memory-ordering question.
1 void P0(void)
2 {
3 int r3;
4
5 x = 1; /* Lines 8 and 9 */
6 atomic_add_return(0, &z); /* Lines 10-15 */
7 r3 = y; /* Line 16 */
8 }
9
10 void P1(void)
11 {
12 int r3;
13
14 y = 1; /* Lines 8-9 */
15 smp_mb(); /* Line 10 */
16 r3 = x; /* Line 11 */
17 }
1 PPC IRIW.litmus
2 ""
3 (* Traditional IRIW. *)
4 {
5 0:r1=1; 0:r2=x;
6 1:r1=1; 1:r4=y;
7 2:r2=x; 2:r4=y;
8 3:r2=x; 3:r4=y;
9 }
10 P0 | P1 | P2 | P3 ;
11 stw r1,0(r2) | stw r1,0(r4) | lwz r3,0(r2) | lwz r3,0(r4) ;
12 | | sync | sync ;
13 | | lwz r5,0(r4) | lwz r5,0(r2) ;
14
15 exists
16 (2:r3=1 /\ 2:r5=0 /\ 3:r3=1 /\ 3:r5=0)
In the words of Donald Knuth quoted at the beginning of this chapter, “Beware of bugs in the above code; I have only proved it correct, not tried it.”
That said, one strength of these tools is that they are designed to model the full range
of behaviors allowed by the architectures, including behaviors that are legal, but which
current hardware implementations do not yet inflict on unwary software developers.
Therefore, an algorithm that is vetted by these tools likely has some additional safety
margin when running on real hardware. Furthermore, testing on real hardware can only
find bugs; such testing is inherently incapable of proving a given usage correct. To
appreciate this, consider that the researchers routinely ran in excess of 100 billion test
runs on real hardware to validate their model. In one case, behavior that is allowed
by the architecture did not occur, despite 176 billion runs [AMP+ 11]. In contrast, the
full-state-space search allows the tool to prove code fragments correct.
It is worth repeating that formal methods and tools are no substitute for testing. The
fact is that producing large reliable concurrent software artifacts, the Linux kernel for
example, is quite difficult. Developers must therefore be prepared to apply every tool at
their disposal towards this goal. The tools presented in this chapter are able to locate
bugs that are quite difficult to produce (let alone track down) via testing. On the other
hand, testing can be applied to far larger bodies of software than the tools presented in
this chapter are ever likely to handle. As always, use the right tools for the job!
Of course, it is always best to avoid the need to work at this level by designing
your parallel code to be easily partitioned and then using higher-level primitives (such
as locks, sequence counters, atomic operations, and RCU) to get your job done more
straightforwardly. And even if you absolutely must use low-level memory barriers and
read-modify-write instructions to get your job done, the more conservative your use of
these sharp instruments, the easier your life is likely to be.
1 PPC IRIW5.litmus
2 ""
3 (* Traditional IRIW, but with five stores instead of just one. *)
4 {
5 0:r1=1; 0:r2=x;
6 1:r1=1; 1:r4=y;
7 2:r2=x; 2:r4=y;
8 3:r2=x; 3:r4=y;
9 }
10 P0 | P1 | P2 | P3 ;
11 stw r1,0(r2) | stw r1,0(r4) | lwz r3,0(r2) | lwz r3,0(r4) ;
12 addi r1,r1,1 | addi r1,r1,1 | sync | sync ;
13 stw r1,0(r2) | stw r1,0(r4) | lwz r5,0(r4) | lwz r5,0(r2) ;
14 addi r1,r1,1 | addi r1,r1,1 | | ;
15 stw r1,0(r2) | stw r1,0(r4) | | ;
16 addi r1,r1,1 | addi r1,r1,1 | | ;
17 stw r1,0(r2) | stw r1,0(r4) | | ;
18 addi r1,r1,1 | addi r1,r1,1 | | ;
19 stw r1,0(r2) | stw r1,0(r4) | | ;
20
21 exists
22 (2:r3=1 /\ 2:r5=0 /\ 3:r3=1 /\ 3:r5=0)
Although 14 CPU hours can seem like a long time, it is much shorter than weeks or even months.
However, the time required is a bit surprising given the simplicity of the litmus test,
which has two threads storing to two separate variables and two other threads loading
from these two variables in opposite orders. The assertion triggers if the two loading
threads disagree on the order of the two stores. This litmus test is simple, even by the
standards of memory-order litmus tests.
One reason for the amount of time and space consumed is that PPCMEM does a
trace-based full-state-space search, which means that it must generate and evaluate all
possible orders and combinations of events at the architectural level. At this level, both
loads and stores correspond to ornate sequences of events and actions, resulting in a very
large state space that must be completely searched, in turn resulting in large memory
and CPU consumption.
Of course, many of the traces are quite similar to one another, which suggests that
an approach that treated similar traces as one might improve performance. One such
approach is the axiomatic approach of Alglave et al. [AMT14], which creates a set of
axioms to represent the memory model and then converts litmus tests to theorems that
might be proven or disproven over this set of axioms. The resulting tool, called “herd”,
conveniently takes as input the same litmus tests as PPCMEM, including the IRIW
litmus test shown in Figure 12.29.
However, where PPCMEM requires 14 CPU hours to solve IRIW, herd does so
in 17 milliseconds, which represents a speedup of more than six orders of magnitude.
That said, the problem is exponential in nature, so we should expect herd to exhibit
exponential slowdowns for larger problems. And this is exactly what happens, for
example, if we add four more writes per writing CPU as shown in Figure 12.30, herd
slows down by a factor of more than 50,000, requiring more than 15 minutes of CPU
time. Adding threads also results in exponential slowdowns [MS14].
Despite their exponential nature, both PPCMEM and herd have proven quite useful
for checking key parallel algorithms, including the queued-lock handoff on x86 systems.
The weaknesses of the herd tool are similar to those of PPCMEM, which were described
in Section 12.2.4. There are some obscure (but very real) cases for which the PPCMEM
and herd tools disagree, and as of late 2014 resolving these disagreements was ongoing.
Longer term, the hope is that the axiomatic approaches will incorporate axioms
describing higher-level software artifacts. This could potentially allow axiomatic verification
of much larger software systems. Another alternative is to press the axioms of boolean
logic into service, as described in the next section.
12.5 Summary
The formal-verification techniques described in this chapter are very powerful tools
for validating small parallel algorithms, but they should not be the only tools in your
toolbox. Despite decades of focus on formal verification, testing remains the validation
workhorse for large parallel software systems [Cor06a, Jon11].
It is nevertheless quite possible that this will not always be the case. To see this,
consider that there are more than one billion instances of the Linux kernel as of 2013.
Suppose that the Linux kernel has a bug that manifests on average every million years of
runtime. As noted at the end of the preceding chapter, this bug will be appearing three
times per day across the installed base. But the fact remains that most formal validation
techniques can be used only on very small code bases. So what is a concurrency coder
to do?
One approach is to think in terms of finding the first bug, the first relevant bug, the
last relevant bug, and the last bug.
The first bug is normally found via inspection or compiler diagnostics. Although
the increasingly sophisticated diagnostics provided by modern compilers might be
considered to be a lightweight sort of formal verification, it is not common to think of
them in those terms. This is in part due to an odd practitioner prejudice which says “If I
am using it, it cannot be formal verification” on the one hand, and the large difference
in sophistication between compiler diagnostics and verification research on the other.
Although the first relevant bug might be located via inspection or compiler diagnos-
tics, it is not unusual for these two steps to find only typos and false positives. Either
way, the bulk of the relevant bugs, that is, those bugs that might actually be encountered
in production, will often be found via testing.
When testing is driven by anticipated or real use cases, it is not uncommon for the
last relevant bug to be located by testing. This situation might motivate a complete
rejection of formal verification; however, irrelevant bugs have a bad habit of suddenly
becoming relevant at the least convenient moment possible, courtesy of black-hat attacks.
For security-critical software, which appears to be a continually increasing fraction of
the total, there can thus be strong motivation to find and fix the last bug. Testing is
demonstrably unable to find the last bug, so there is a possible role for formal verification.
That is, there is such a role if and only if formal verification proves capable of growing
into it. As this chapter has shown, current formal verification systems are extremely
limited.
Another approach is to consider that formal verification is often much harder to
use than is testing. This is of course in part a cultural statement, and there is every
reason to hope that formal verification will be perceived to be easier as more people
become familiar with it. That said, very simple test harnesses can find significant bugs
in arbitrarily large software systems. In contrast, the effort required to apply formal
verification seems to increase dramatically as the system size increases.
I have nevertheless made occasional use of formal verification for more than 20
years, playing to formal verification’s strengths, namely design-time verification of
small complex portions of the overarching software construct. The larger overarching
software construct is of course validated by testing.
Quick Quiz 12.26: In light of the full verification of the L4 microkernel, isn’t this
limited view of formal verification just a little bit obsolete?
One final approach is to consider the following two definitions and the consequence
that they imply:
Definition: Bug-free programs are trivial programs.
Definition: Reliable programs have no known bugs.
Consequence: Any non-trivial reliable program contains at least one as-yet-unknown bug.
From this viewpoint, any advances in validation and verification can have but two
effects: (1) An increase in the number of trivial programs or (2) A decrease in the number
of reliable programs. Of course, the human race’s increasing reliance on multicore
systems and software provides extreme motivation for a very sharp increase in the
number of trivial programs!
However, if your code is so complex that you find yourself relying too heavily
on formal-verification tools, you should carefully rethink your design, especially if
your formal-verification tools require your code to be hand-translated to a special-
purpose language. For example, a complex implementation of the dynticks interface
for preemptible RCU that was presented in Section 12.1.5 turned out to have a much
simpler alternative implementation, as discussed in Section 12.1.6.9. All else being
equal, a simpler implementation is much better than a mechanical proof for a complex
implementation!
And the open challenge to those working on formal verification techniques and
systems is to prove this summary wrong!
You don’t learn how to shoot and then learn how to
launch and then learn to do a controlled spin—you
learn to launch-shoot-spin.
Chapter 13
Putting It All Together
results depicted in Figure 10.8. This will require that the counter increments be atomic
operations, especially for user-mode execution where a given thread could migrate to
another CPU at any time.
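For user-mode execution, one straightforward way to make the increments atomic is C11's atomic fetch-and-add, as in the following sketch; the element structure and field names here are purely illustrative:

#include <stdatomic.h>

struct element {
	atomic_ulong lookups;	/* per-element lookup counter */
	/* ... other fields ... */
};

static inline void count_lookup(struct element *ep)
{
	/* Remains atomic even if the thread migrates to another CPU mid-update. */
	atomic_fetch_add_explicit(&ep->lookups, 1, memory_order_relaxed);
}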
If some elements are looked up very frequently, there are a number of approaches
that batch updates by maintaining a per-thread log, where multiple log entries for a given
element can be merged. After a given log entry has a sufficiently large increment or
after sufficient time has passed, the log entries may be applied to the corresponding data
elements. Silas Boyd-Wickizer has done some work formalizing this notion [BW14].
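As a rough sketch of this notion (not Boyd-Wickizer's formalization), the following code maintains a small per-thread log of pending increments, merging repeated updates to the same element and applying the log once an entry's increment grows large or the log fills. A time-based flush is omitted, the struct element comes from the previous sketch, and all names are illustrative:

#define LOG_SIZE 16
#define LOG_FLUSH_THRESHOLD 1024

struct log_entry {
	struct element *ep;	/* element being counted */
	unsigned long delta;	/* increments not yet applied */
};

static __thread struct log_entry count_log[LOG_SIZE];
static __thread int count_log_used;

static void count_log_flush(void)
{
	int i;

	for (i = 0; i < count_log_used; i++)
		atomic_fetch_add_explicit(&count_log[i].ep->lookups,
					  count_log[i].delta,
					  memory_order_relaxed);
	count_log_used = 0;
}

static void count_lookup_batched(struct element *ep)
{
	int i;

	for (i = 0; i < count_log_used; i++) {
		if (count_log[i].ep == ep) {	/* merge into existing entry */
			if (++count_log[i].delta >= LOG_FLUSH_THRESHOLD)
				count_log_flush();
			return;
		}
	}
	if (count_log_used == LOG_SIZE)		/* log full: apply everything */
		count_log_flush();
	count_log[count_log_used].ep = ep;
	count_log[count_log_used].delta = 1;
	count_log_used++;
}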
1. A lock residing outside of the object must be held while manipulating the reference
count.
2. The object is created with a non-zero reference count, and new references may be
acquired only when the current value of the reference counter is non-zero. If a
thread does not have a reference to a given object, it may obtain one with the help
of another thread that already has a reference.
3. An existence guarantee is provided for the object, preventing it from being freed
while some other entity might be attempting to acquire a reference. Existence
guarantees are often provided by automatic garbage collectors, and, as will be
seen in Section 9.5, by RCU.
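The second of these mechanisms can be implemented with a compare-and-swap loop that refuses to hand out a reference once the count has fallen to zero, in the manner of the Linux kernel's atomic_inc_not_zero(). The following sketch is illustrative rather than being taken from any particular implementation:

/* Acquire a reference only if the count is currently non-zero. */
static int ref_get_unless_zero(atomic_t *refcnt)
{
	int old;
	int cur = atomic_read(refcnt);

	while (cur != 0) {
		old = atomic_cmpxchg(refcnt, cur, cur + 1);
		if (old == cur)
			return 1;	/* reference acquired */
		cur = old;		/* lost a race; retry with the new value */
	}
	return 0;			/* count reached zero: object is going away */
}

If zero is returned, the object is already on its way out, and the caller must instead rely on one of the other two mechanisms.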
                              Release Synchronization
  Acquisition                            Reference
  Synchronization             Locking    Counting    RCU
  Locking                        -         CAM        CA
  Reference Counting             A         AM         A
  RCU                            CA        MCA        CA
1. Simple counting with neither atomic operations, memory barriers, nor alignment
constraints (“-”).
2. Atomic counting without memory barriers (“A”).
3. Atomic counting, with memory barriers required only on release (“AM”).
4. Atomic counting with a check combined with the atomic acquisition operation,
and with memory barriers required only on release (“CAM”).
5. Atomic counting with a check combined with the atomic acquisition operation
(“CA”).
6. Atomic counting with a check combined with the atomic acquisition operation,
and with memory barriers also required on acquisition (“MCA”).
However, because all Linux-kernel atomic operations that return a value are defined
to contain memory barriers,1 all release operations contain memory barriers, and all
checked acquisition operations also contain memory barriers. Therefore, cases “CA”
and “MCA” are equivalent to “CAM”, so that there are sections below for only the first
four cases: “-”, “A”, “AM”, and “CAM”. The Linux primitives that support reference
counting are presented in Section 13.2.2. Later sections cite optimizations that can
improve performance if reference acquisition and release is very frequent, and the
reference count need be checked for zero only very rarely.
1 With atomic_read() and ATOMIC_INIT() being the exceptions that prove the rule.
1 struct sref {
2 int refcount;
3 };
4
5 void sref_init(struct sref *sref)
6 {
7 sref->refcount = 1;
8 }
9
10 void sref_get(struct sref *sref)
11 {
12 sref->refcount++;
13 }
14
15 int sref_put(struct sref *sref,
16 void (*release)(struct sref *sref))
17 {
18 WARN_ON(release == NULL);
19 WARN_ON(release == (void (*)(struct sref *))kfree);
20
21 if (--sref->refcount == 0) {
22 release(sref);
23 return 1;
24 }
25 return 0;
26 }
but where a reference to the object must be held after the lock is released. Figure 13.1
shows a simple API that might be used to implement simple non-atomic reference
counting—although simple reference counting is almost always open-coded instead.
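For example, the first mechanism listed above might combine these sref primitives with a lock residing outside of the object, roughly as follows. The shark structure, shark_lock, and shark_release() are hypothetical, and a container_of() macro like the Linux kernel's is assumed:

DEFINE_SPINLOCK(shark_lock);	/* guards all shark reference counts */

struct shark {
	struct sref count;
	/* ... other fields ... */
};

void shark_release(struct sref *sref)
{
	free(container_of(sref, struct shark, count));
}

void shark_get(struct shark *sp)
{
	spin_lock(&shark_lock);
	sref_get(&sp->count);
	spin_unlock(&shark_lock);
}

void shark_put(struct shark *sp)
{
	spin_lock(&shark_lock);
	sref_put(&sp->count, shark_release);	/* frees sp on the last put */
	spin_unlock(&shark_lock);
}

Because every manipulation of ->count occurs under shark_lock, the non-atomic increments and decrements of Figure 13.1 suffice.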
1 struct kref {
2 atomic_t refcount;
3 };
4
5 void kref_init(struct kref *kref)
6 {
7 atomic_set(&kref->refcount, 1);
8 }
9
10 void kref_get(struct kref *kref)
11 {
12 WARN_ON(!atomic_read(&kref->refcount));
13 atomic_inc(&kref->refcount);
14 }
15
16 static inline int
17 kref_sub(struct kref *kref, unsigned int count,
18 void (*release)(struct kref *kref))
19 {
20 WARN_ON(release == NULL);
21
22 if (atomic_sub_and_test((int) count,
23 &kref->refcount)) {
24 release(kref);
25 return 1;
26 }
27 return 0;
28 }
sections.
Quick Quiz 13.2: Why isn’t it necessary to guard against cases where one CPU
acquires a reference just after another CPU releases the last reference?
The kref structure itself, consisting of a single atomic data item, is shown in lines 1-
3 of Figure 13.2. The kref_init() function on lines 5-8 initializes the counter to
the value “1”. Note that the atomic_set() primitive is a simple assignment; the
name stems from the data type atomic_t rather than from the operation. The
kref_init() function must be invoked during object creation, before the object has
been made available to any other CPU.
The kref_get() function on lines 10-14 unconditionally atomically increments
the counter. The atomic_inc() primitive does not necessarily explicitly disable
compiler optimizations on all platforms, but the fact that the kref primitives are
in a separate module and that the Linux kernel build process does no cross-module
optimizations has the same effect.
The kref_sub() function on lines 16-28 atomically decrements the counter, and
if the result is zero, line 24 invokes the specified release() function and line 25
returns, informing the caller that release() was invoked. Otherwise, kref_sub()
returns zero, informing the caller that release() was not called.
Quick Quiz 13.3: Suppose that just after the atomic_sub_and_test() on
line 22 of Figure 13.2 is invoked, that some other CPU invokes kref_get(). Doesn’t
this result in that other CPU now having an illegal reference to a released object?
Quick Quiz 13.4: Suppose that kref_sub() returns zero, indicating that the
release() function was not invoked. Under what conditions can the caller rely on
the continued existence of the enclosing object?
Quick Quiz 13.5: Why not just pass kfree() as the release function?
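The following usage sketch shows how these primitives fit together; the cat structure, cat_release(), and the allocation function are hypothetical rather than part of any existing API:

struct cat {
	struct kref refcount;
	/* ... other fields ... */
};

static void cat_release(struct kref *kref)
{
	kfree(container_of(kref, struct cat, refcount));
}

struct cat *cat_alloc(void)
{
	struct cat *cp = kmalloc(sizeof(*cp), GFP_KERNEL);

	if (cp)
		kref_init(&cp->refcount);	/* count starts at one for the caller */
	return cp;
}

/* Caller must already hold a reference or otherwise guarantee existence. */
void cat_get(struct cat *cp)
{
	kref_get(&cp->refcount);
}

void cat_put(struct cat *cp)
{
	kref_sub(&cp->refcount, 1, cat_release);	/* drop one reference */
}

Passing a dedicated cat_release() function rather than kfree() itself allows container_of() to recover the enclosing structure (see Quick Quiz 13.5).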
1 static inline
2 struct dst_entry * dst_clone(struct dst_entry * dst)
3 {
4 if (dst)
5 atomic_inc(&dst->__refcnt);
6 return dst;
7 }
8
9 static inline
10 void dst_release(struct dst_entry * dst)
11 {
12 if (dst) {
13 WARN_ON(atomic_read(&dst->__refcnt) < 1);
14 smp_mb__before_atomic_dec();
15 atomic_dec(&dst->__refcnt);
16 }
17 }
decrements the reference count, and, if the result was zero, line 32 invokes the
call_rcu() primitive in order to free up the file structure (via the file_free_rcu()
function specified in call_rcu()’s second argument), but only after all currently-
executing RCU read-side critical sections complete. The time period required for all
currently-executing RCU read-side critical sections to complete is termed a “grace
period”. Note that the atomic_dec_and_test() primitive contains a memory
barrier. This memory barrier is not necessary in this example, since the structure cannot
be destroyed until the RCU read-side critical section completes, but in Linux, all atomic
operations that return a result must by definition contain memory barriers.
Once the grace period completes, the file_free_rcu() function obtains a
pointer to the file structure on line 39, and frees it on line 40.
This approach is also used by Linux's virtual-memory system; see get_page_
unless_zero() and put_page_testzero() for page structures as well as
try_to_unuse() and mmput() for memory-map structures.
13.3.1.1 Design
The hope is to use RCU rather than final_mutex to protect the thread traversal in
read_count() in order to obtain excellent performance and scalability from read_
count(), rather than just from inc_count(). However, we do not want to give
up any accuracy in the computed sum. In particular, when a given thread exits, we
absolutely cannot lose the exiting thread’s count, nor can we double-count it. Such
an error could result in inaccuracies equal to the full precision of the result; in other
words, such an error would make the result completely useless. And in fact, one of the
purposes of final_mutex is to ensure that threads do not come and go in the middle
of read_count() execution.
Quick Quiz 13.9: Just what is the accuracy of read_count(), anyway?
Therefore, if we are to dispense with final_mutex, we will need to come up with
some other method for ensuring consistency. One approach is to place the total count
for all previously exited threads and the array of pointers to the per-thread counters into
a single structure. Such a structure, once made available to read_count(), is held
constant, ensuring that read_count() sees consistent data.
13.3.1.2 Implementation
Lines 1-4 of Figure 13.5 show the countarray structure, which contains a ->total
field for the count from previously exited threads, and a counterp[] array of pointers
to the per-thread counter for each currently running thread. This structure allows a
given execution of read_count() to see a total that is consistent with the indicated
set of running threads.
Lines 6-8 contain the definition of the per-thread counter variable, the global
pointer countarrayp referencing the current countarray structure, and the final_
mutex spinlock.
Lines 10-13 show inc_count(), which is unchanged from Figure 5.9.
Lines 15-29 show read_count(), which has changed significantly. Lines 21
and 27 substitute rcu_read_lock() and rcu_read_unlock() for acquisition
and release of final_mutex. Line 22 uses rcu_dereference() to snapshot
the current countarray structure into local variable cap. Proper use of RCU will
guarantee that this countarray structure will remain with us through at least the
end of the current RCU read-side critical section at line 27. Line 23 initializes sum to
cap->total, which is the sum of the counts of threads that have previously exited.
Lines 24-26 add up the per-thread counters corresponding to currently running threads,
and, finally, line 28 returns the sum.
1 struct countarray {
2 unsigned long total;
3 unsigned long *counterp[NR_THREADS];
4 };
5
6 long __thread counter = 0;
7 struct countarray *countarrayp = NULL;
8 DEFINE_SPINLOCK(final_mutex);
9
10 void inc_count(void)
11 {
12 counter++;
13 }
14
15 long read_count(void)
16 {
17 struct countarray *cap;
18 unsigned long sum;
19 int t;
20
21 rcu_read_lock();
22 cap = rcu_dereference(countarrayp);
23 sum = cap->total;
24 for_each_thread(t)
25 if (cap->counterp[t] != NULL)
26 sum += *cap->counterp[t];
27 rcu_read_unlock();
28 return sum;
29 }
30
31 void count_init(void)
32 {
33 countarrayp = malloc(sizeof(*countarrayp));
34 if (countarrayp == NULL) {
35 fprintf(stderr, "Out of memory\n");
36 exit(-1);
37 }
38 memset(countarrayp, '\0', sizeof(*countarrayp));
39 }
40
41 void count_register_thread(void)
42 {
43 int idx = smp_thread_id();
44
45 spin_lock(&final_mutex);
46 countarrayp->counterp[idx] = &counter;
47 spin_unlock(&final_mutex);
48 }
49
50 void count_unregister_thread(int nthreadsexpected)
51 {
52 struct countarray *cap;
53 struct countarray *capold;
54 int idx = smp_thread_id();
55
56 cap = malloc(sizeof(*countarrayp));
57 if (cap == NULL) {
58 fprintf(stderr, "Out of memory\n");
59 exit(-1);
60 }
61 spin_lock(&final_mutex);
62 *cap = *countarrayp;
63 cap->total += counter;
64 cap->counterp[idx] = NULL;
65 capold = countarrayp;
66 rcu_assign_pointer(countarrayp, cap);
67 spin_unlock(&final_mutex);
68 synchronize_rcu();
69 free(capold);
70 }
13.3.1.3 Discussion
Quick Quiz 13.11: Wow! Figure 13.5 contains 69 lines of code, compared to only 42
in Figure 5.9. Is this extra complexity really worth it?
Use of RCU enables exiting threads to wait until other threads are guaranteed
to be done using the exiting threads’ __thread variables. This allows the read_
count() function to dispense with locking, thereby providing excellent performance
and scalability for both the inc_count() and read_count() functions. However,
this performance and scalability come at the cost of some increase in code complexity.
It is hoped that compiler and library writers employ user-level RCU [Des09] to provide
safe cross-thread access to __thread variables, greatly reducing the complexity seen
by users of __thread variables.
Section 5.5 showed a fanciful pair of code fragments for dealing with counting I/O
accesses to removable devices. These code fragments suffered from high overhead on
the fastpath (starting an I/O) due to the need to acquire a reader-writer lock.
This section shows how RCU may be used to avoid this overhead.
The code for performing an I/O is quite similar to the original, with an RCU read-side
critical section being substituted for the reader-writer lock read-side critical section in
the original:
1 struct foo {
2 int length;
3 char *a;
4 };
1 rcu_read_lock();
2 if (removing) {
3 rcu_read_unlock();
4 cancel_io();
5 } else {
6 add_count(1);
7 rcu_read_unlock();
8 do_io();
9 sub_count(1);
10 }
The RCU read-side primitives have minimal overhead, thus speeding up the fastpath,
as desired.
The updated code fragment removing a device is as follows:
1 spin_lock(&mylock);
2 removing = 1;
3 sub_count(mybias);
4 spin_unlock(&mylock);
5 synchronize_rcu();
6 while (read_count() != 0) {
7 poll(NULL, 0, 1);
8 }
9 remove_device();
Here we replace the reader-writer lock with an exclusive spinlock and add a
synchronize_rcu() to wait for all of the RCU read-side critical sections to com-
plete. Because of the synchronize_rcu(), once we reach line 6, we know that all
remaining I/Os have been accounted for.
Of course, the overhead of synchronize_rcu() can be large, but given that
device removal is quite rare, this is usually a good tradeoff.
1. The array is initially 16 characters long, and thus ->length is equal to 16.
2. CPU 0 loads the value of ->length, obtaining the value 16.
3. CPU 1 shrinks the array to be of length 8, and assigns a pointer to a new 8-
character block of memory into ->a[].
4. CPU 0 picks up the new pointer from ->a[], and stores a new value into element
12. Because the array has only 8 characters, this results in a SEGV or (worse yet)
memory corruption.
1 struct foo_a {
2 int length;
3 char a[0];
4 };
5
6 struct foo {
7 struct foo_a *fa;
8 };
1. The array is initially 16 characters long, and thus ->length is equal to 16.
2. CPU 0 loads the value of ->fa, obtaining a pointer to the structure containing
the value 16 and the 16-byte array.
3. CPU 0 loads the value of ->fa->length, obtaining the value 16.
4. CPU 1 shrinks the array to be of length 8, and assigns a pointer to a new foo_a
structure containing an 8-character block of memory into ->fa.
5. CPU 0 stores a new value into element 12 of the array. But because CPU 0 is
still referencing the old foo_a structure that contains the 16-byte array, all is well.
Of course, in both cases, CPU 1 must wait for a grace period before freeing the old
array.
A more general version of this approach is presented in the next section.
1 struct measurement {
2 double meas_1;
3 double meas_2;
4 double meas_3;
5 };
6
7 struct animal {
8 char name[40];
9 double age;
10 struct measurement *mp;
11 char photo[0]; /* large bitmap. */
12 };
confused. How can we guarantee that readers will see coordinated sets of these three
values?
One approach would be to allocate a new animal structure, copy the old structure
into the new structure, update the new structure’s meas_1, meas_2, and meas_3
fields, and then replace the old structure with a new one by updating the pointer. This
does guarantee that all readers see coordinated sets of measurement values, but it
requires copying a large structure due to the ->photo[] field. This copying might
incur unacceptably large overhead.
Another approach is to insert a level of indirection, as shown in Figure 13.9. When
a new measurement is taken, a new measurement structure is allocated, filled in
with the measurements, and the animal structure’s ->mp field is updated to point to
this new measurement structure using rcu_assign_pointer(). After a grace
period elapses, the old measurement structure can be freed.
Quick Quiz 13.12: But can't the approach shown in Figure 13.9 result in extra
cache misses, in turn resulting in additional read-side overhead?
This approach enables readers to see correlated values for selected fields with
minimal read-side overhead.
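A minimal sketch of the update and read paths just described follows, assuming an updater-side spinlock (here called animal_update_lock) and kernel-style allocation; error handling is omitted:

DEFINE_SPINLOCK(animal_update_lock);	/* serializes measurement updates */

void animal_measure(struct animal *ap, double m1, double m2, double m3)
{
	struct measurement *newp;
	struct measurement *oldp;

	newp = kmalloc(sizeof(*newp), GFP_KERNEL);	/* error check omitted */
	newp->meas_1 = m1;
	newp->meas_2 = m2;
	newp->meas_3 = m3;
	spin_lock(&animal_update_lock);
	oldp = ap->mp;
	rcu_assign_pointer(ap->mp, newp);	/* publish the new measurements */
	spin_unlock(&animal_update_lock);
	synchronize_rcu();			/* wait for pre-existing readers */
	kfree(oldp);
}

void animal_read_measurements(struct animal *ap, struct measurement *resultp)
{
	rcu_read_lock();
	*resultp = *rcu_dereference(ap->mp);	/* coordinated snapshot */
	rcu_read_unlock();
}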
seqretry() loop. Note that sequence locks are not a replacement for RCU protection:
Sequence locks protect against concurrent modifications, but RCU is still needed to
protect against concurrent deletions.
This approach works quite well when the number of correlated elements is small,
the time to read these elements is short, and the update rate is low. Otherwise, updates
might happen so quickly that readers might never complete. Although Schrödinger
does not expect that even his least-sane relatives will marry and divorce quickly enough
for this to be a problem, he does realize that this problem could well arise in other
situations. One way to avoid this reader-starvation problem is to have the readers use
the update-side primitives if there have been too many retries, but this can degrade both
performance and scalability.
In addition, if the update-side primitives are used too frequently, poor performance
and scalability will result due to lock contention. One way to avoid this is to maintain a
per-element sequence lock, and to hold both spouses’ locks when updating their marital
status. Readers can do their retry looping on either of the spouses’ locks to gain a stable
view of any change in marital status involving both members of the pair. This avoids
contention due to high marriage and divorce rates, but complicates gaining a stable view
of all marital statuses during a single scan of the database.
If the element groupings are well-defined and persistent, which marital status is
hoped to be, then one approach is to add pointers to the data elements to link together
the members of a given group. Readers can then traverse these pointers to access all the
data elements in the same group as the first one located.
Other approaches using version numbering are left as exercises for the interested
reader.
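One possible shape for the per-element sequence-lock approach is sketched below; the field and function names are hypothetical, and updaters are assumed to hold both partners' write-side sequence locks (acquired in a fixed order to avoid deadlock) while changing either record:

struct person {
	seqlock_t marital_seq;	/* guards ->married and ->spouse */
	int married;
	struct person *spouse;
	/* ... other fields ... */
};

/* Obtain a consistent view of one person's marital status. */
int person_marital_status(struct person *p, struct person **spousep)
{
	int married;
	unsigned int seq;

	do {
		seq = read_seqbegin(&p->marital_seq);
		married = p->married;
		*spousep = p->spouse;
	} while (read_seqretry(&p->marital_seq, seq));
	return married;
}

As noted above, RCU or some other existence guarantee is still needed to prevent the person structure itself from being freed while the reader is examining it.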
2 Why would such a quantity be useful? Beats me! But group statistics in general are often useful.
If a little knowledge is a dangerous thing, just
imagine all the havoc you could wreak with a lot of
knowledge!
Unknown
Chapter 14
Advanced Synchronization
This section discusses a number of ways of using weaker, and hopefully lower-cost,
synchronization primitives. This weakening can be quite helpful; in fact, some have
argued that weakness is a virtue [Alg13]. Nevertheless, in parallel programming, as
in many other aspects of life, weakness is not a panacea. For example, as noted at the
end of Chapter 5, you should thoroughly apply partitioning, batching, and well-tested
packaged weak APIs (see Chapters 8 and 9) before even thinking about unstructured
weakening.
But after doing all that, you still might find yourself needing the advanced techniques
described in this chapter. To that end, Section 14.1 summarizes techniques used thus far
for avoiding locks, Section 14.2 covers use of memory barriers, and finally Section 14.3
gives a brief overview of non-blocking synchronization.
Thread 1 Thread 2
x = 1; y = 1;
r1 = y; r2 = x;
assert(r1 == 1 || r2 == 1);
In short, lockless techniques are quite useful and are heavily used.
However, it is best if lockless techniques are hidden behind a well-defined API,
such as the inc_count(), memblock_alloc(), rcu_read_lock(), and so
on. The reason for this is that undisciplined use of lockless techniques is a good way to
create difficult bugs.
A key component of many lockless techniques is the memory barrier, which is
described in the following section.
of things?
Many people do indeed expect their computers to keep track of things, but many also
insist that they keep track of things quickly. One difficulty that modern computer-system
vendors face is that the main memory cannot keep up with the CPU—modern CPUs
can execute hundreds of instructions in the time required to fetch a single variable
from memory. CPUs therefore sport increasingly large caches, as shown in Figure 14.1.
Variables that are heavily used by a given CPU will tend to remain in that CPU’s cache,
allowing high-speed access to the corresponding data.
Figure 14.1: Modern computer system cache structure: CPU 0 and CPU 1 each have their own cache and are connected through an interconnect to memory.
Unfortunately, when a CPU accesses data that is not yet in its cache, the result is an
expensive “cache miss”, requiring the data to be fetched from main memory. Doubly
unfortunately, running typical code results in a significant number of cache misses. To
limit the resulting performance degradation, CPUs have been designed to execute other
instructions and memory references while waiting for a cache miss to fetch data from
memory. This clearly causes instructions and memory references to execute out of
order, which could cause serious confusion, as illustrated in Figure 14.2. Compilers and
synchronization primitives (such as locking and RCU) are responsible for maintaining
the illusion of ordering through use of “memory barriers” (for example, smp_mb() in
the Linux kernel). These memory barriers can be explicit instructions, as they are on
ARM, POWER, Itanium, and Alpha, or they can be implied by other instructions, as
they are on x86.
Since the standard synchronization primitives preserve the illusion of ordering, your
path of least resistance is to stop reading this section and simply use these primitives.
However, if you need to implement the synchronization primitives themselves, or if
you are simply interested in understanding how memory ordering and memory barriers
work, read on!
The next sections present counter-intuitive scenarios that you might encounter when
using explicit memory barriers.
Figure 14.2: A CPU proclaims “Look! I can do things out of order.”
This line of reasoning, intuitively obvious though it may be, is completely and
utterly incorrect. Please note that this is not a theoretical assertion: actually running this
code on real-world weakly-ordered hardware (a 1.5GHz 16-CPU POWER 5 system)
resulted in the assertion firing 16 times out of 10 million runs. Clearly, anyone who
produces code with explicit memory barriers should do some extreme testing—although
a proof of correctness might be helpful, the strongly counter-intuitive nature of the
behavior of memory barriers should in turn strongly limit one’s trust in such proofs. The
requirement for extreme testing should not be taken lightly, given that a number of dirty
hardware-dependent tricks were used to greatly increase the probability of failure in
this run.
Quick Quiz 14.1: How on earth could the assertion on line 21 of the code in
Figure 14.3 on page 373 possibly fail?
So what should you do? Your best strategy, if possible, is to use existing primitives
that incorporate any needed memory barriers, so that you can simply ignore the rest of
this chapter.
Of course, if you are implementing synchronization primitives, you don’t have this
luxury. The following discussion of memory ordering and memory barriers is for you.
1 thread0(void)
2 {
3 A = 1;
4 smp_wmb();
5 B = 1;
6 }
7
8 thread1(void)
9 {
10 while (B != 1)
11 continue;
12 barrier();
13 C = 1;
14 }
15
16 thread2(void)
17 {
18 while (C != 1)
19 continue;
20 barrier();
21 assert(A != 0);
22 }
Upon exit from the loop, firsttb will hold a timestamp taken shortly after the
assignment and lasttb will hold a timestamp taken before the last sampling of the
shared variable that still retained the assigned value, or a value equal to firsttb if
the shared variable had changed before entry into the loop. This allows us to plot each
CPU’s view of the value of state.variable over a 532-nanosecond time period,
as shown in Figure 14.5. This data was collected in 2006 on a 1.5GHz POWER5 system
with 8 cores, each containing a pair of hardware threads. CPUs 1, 2, 3, and 4 recorded
the values, while CPU 0 controlled the test. The timebase counter period was about
5.3 nanoseconds.
Figure 14.5: A variable with multiple simultaneous values: CPU 1 observes the value 1 and then 2, CPU 2 observes 2, CPU 3 observes 3 and then 2, and CPU 4 observes 4 and then 2.
Each horizontal bar represents the observations of a given CPU over time, with
the black regions to the left indicating the time before the corresponding CPU’s first
measurement. During the first 5ns, only CPU 3 has an opinion about the value of the
variable. During the next 10ns, CPUs 2 and 3 disagree on the value of the variable,
but thereafter agree that the value is “2”, which is in fact the final agreed-upon value.
However, CPU 1 believes that the value is “1” for almost 300ns, and CPU 4 believes
that the value is “4” for almost 500ns.
Quick Quiz 14.4: How could CPUs possibly have different views of the value of a
single variable at the same time?
Quick Quiz 14.5: Why do CPUs 2 and 3 come to agreement so quickly, when it
takes so long for CPUs 1 and 4 to come to the party?
And if you think that the situation with four CPUs was intriguing, consider Fig-
ure 14.6, which shows the same situation, but with 15 CPUs each assigning their number
to a single shared variable at time t = 0. Both diagrams in the figure are drawn in
the same way as Figure 14.5. The only difference is that the unit of the horizontal axis
is timebase ticks, with each tick lasting about 5.3 nanoseconds. The entire sequence
therefore lasts a bit longer than the events recorded in Figure 14.5, consistent with the
increase in the number of CPUs. The upper diagram shows the overall picture, while the
lower one shows a close-up of the first 50 timebase ticks.
Again, CPU 0 coordinates the test, so does not record any values.
All CPUs eventually agree on the final value of 9, but not before the values 15 and
12 take early leads. Note that there are fourteen different opinions on the variable’s
value at time 21 indicated by the vertical line in the lower diagram. Note also that all
CPUs see sequences whose orderings are consistent with the directed graph shown in
Figure 14.7. Nevertheless, both figures underscore the importance of proper use of
memory barriers for code that cares about memory ordering.
We have entered a regime where we must bid a fond farewell to comfortable
intuitions about values of variables and the passage of time. This is the regime where
memory barriers are needed.
All that aside, it is important to remember the lessons from Chapters 3 and 6. Having
all CPUs write concurrently to the same variable is absolutely no way to design a parallel
program, at least not if performance and scalability are at all important to you.
Figure 14.6: A variable with more simultaneous values: fifteen CPUs each assign their number to a single shared variable at time t = 0; the upper diagram shows the full sequence of values observed by each CPU, and the lower diagram shows the first 50 timebase ticks.
Figure 14.7: Directed graph with which all CPUs' observed sequences of values are consistent.
the bottom of the memory-barrier story, at least from the viewpoint of portable code.
If you just want to be told what the rules are rather than suffering through the actual
derivation, please feel free to skip to Section 14.2.6.
The exact semantics of memory barriers vary wildly from one CPU to another, so
portable code must rely only on the least-common-denominator semantics of memory
barriers.
Fortunately, all CPUs impose the following rules:
1. All accesses by a given CPU will appear to that CPU to have occurred in program
order.
2. All CPUs’ accesses to a single variable will be consistent with some global
ordering of stores to that variable.
1 Or, better yet, you can avoid explicit use of memory barriers entirely. But that would be the subject of
other sections.
A given CPU will see its own accesses as occurring in “program order”, as if the CPU
was executing only one instruction at a time with no reordering or speculation. For older
CPUs, this restriction is necessary for binary compatibility, and only secondarily for
the sanity of us software types. There have been a few CPUs that violate this rule to a
limited extent, but in those cases, the compiler has been responsible for ensuring that
ordering is explicitly enforced as needed.
Either way, from the programmer’s viewpoint, the CPU sees its own accesses in
program order.
CPU 1 CPU 2
access(A); access(B);
smp_mb(); smp_mb();
access(B); access(A);
Quick Quiz 14.6: But if the memory barriers do not unconditionally force ordering,
how the heck can a device driver reliably execute sequences of loads and stores to
MMIO registers?
Of course, accesses must be either loads or stores, and these do have different
properties. Table 14.2 shows all possible combinations of loads and stores from a pair
of CPUs. Of course, to enforce conditional ordering, there must be a memory barrier
between each CPU’s pair of operations.
Pairing 1. In this pairing, one CPU executes a pair of loads separated by a memory
barrier, while a second CPU executes a pair of stores also separated by a memory barrier,
as follows (both A and B are initially equal to zero):
CPU 1 CPU 2
A=1; Y=B;
smp_mb(); smp_mb();
B=1; X=A;
After both CPUs have completed executing these code sequences, if Y==1, then we
must also have X==1. In this case, the fact that Y==1 means that CPU 2’s load prior to
its memory barrier has seen the store following CPU 1’s memory barrier. Due to the
pairwise nature of memory barriers, CPU 2’s load following its memory barrier must
therefore see the store that precedes CPU 1’s memory barrier, so that X==1.
On the other hand, if Y==0, the memory-barrier condition does not hold, and so in
this case, X could be either 0 or 1.
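The same pairing can be demonstrated in user-mode C11, with atomic_thread_fence(memory_order_seq_cst) standing in for smp_mb(); the names below are illustrative, and this is a demonstration rather than a recommended coding style:

#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>

static atomic_int A, B;	/* both initially zero */
static int X, Y;

static void *cpu1(void *arg)
{
	atomic_store_explicit(&A, 1, memory_order_relaxed);	/* A = 1; */
	atomic_thread_fence(memory_order_seq_cst);		/* smp_mb(); */
	atomic_store_explicit(&B, 1, memory_order_relaxed);	/* B = 1; */
	return NULL;
}

static void *cpu2(void *arg)
{
	Y = atomic_load_explicit(&B, memory_order_relaxed);	/* Y = B; */
	atomic_thread_fence(memory_order_seq_cst);		/* smp_mb(); */
	X = atomic_load_explicit(&A, memory_order_relaxed);	/* X = A; */
	return NULL;
}

int main(void)
{
	pthread_t t1, t2;

	pthread_create(&t1, NULL, cpu1, NULL);
	pthread_create(&t2, NULL, cpu2, NULL);
	pthread_join(t1, NULL);
	pthread_join(t2, NULL);
	if (Y == 1)
		assert(X == 1);	/* guaranteed by the paired barriers */
	return 0;
}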
Pairing 2. In this pairing, each CPU executes a load followed by a memory barrier
followed by a store, as follows (both A and B are initially equal to zero):
CPU 1 CPU 2
X=A; Y=B;
smp_mb(); smp_mb();
B=1; A=1;
After both CPUs have completed executing these code sequences, if X==1, then we
must also have Y==0. In this case, the fact that X==1 means that CPU 1’s load prior to
its memory barrier has seen the store following CPU 2’s memory barrier. Due to the
pairwise nature of memory barriers, CPU 1’s store following its memory barrier must
therefore see the results of CPU 2’s load preceding its memory barrier, so that Y==0.
On the other hand, if X==0, the memory-barrier condition does not hold, and so in
this case, Y could be either 0 or 1.
The two CPUs’ code sequences are symmetric, so if Y==1 after both CPUs have
finished executing these code sequences, then we must have X==0.
Pairing 3. In this pairing, one CPU executes a load followed by a memory barrier
followed by a store, while the other CPU executes a pair of stores separated by a memory
barrier, as follows (both A and B are initially equal to zero):
CPU 1 CPU 2
X=A; B=2;
smp_mb(); smp_mb();
B=1; A=1;
After both CPUs have completed executing these code sequences, if X==1, then we
must also have B==1. In this case, the fact that X==1 means that CPU 1’s load prior to
its memory barrier has seen the store following CPU 2’s memory barrier. Due to the
pairwise nature of memory barriers, CPU 1’s store following its memory barrier must
therefore see the results of CPU 2’s store preceding its memory barrier. This means that
CPU 1’s store to B will overwrite CPU 2’s store to B, resulting in B==1.
On the other hand, if X==0, the memory-barrier condition does not hold, and so in
this case, B could be either 1 or 2.
The following pairings from Table 14.2 can be used on modern hardware, but might
fail on some systems that were produced in the 1900s. However, these can safely be
used on all mainstream hardware introduced since the year 2000. So if you think that
memory barriers are difficult to deal with, please keep in mind that they used to be a lot
harder on some systems!
Ears to Mouths. Since the stores cannot see the results of the loads (again, ignoring
MMIO registers for the moment), it is not always possible to determine whether the
memory-barrier condition has been met. However, 21st-century hardware would guar-
antee that at least one of the loads saw the value stored by the corresponding store (or
some later value for that same variable).
Quick Quiz 14.7: How do we know that modern hardware guarantees that at least
one of the loads will see the value stored by the other thread in the ears-to-mouths
scenario?
Stores “Pass in the Night”. In the following example, after both CPUs have fin-
ished executing their code sequences, it is quite tempting to conclude that the result
{A==1,B==2} cannot happen.
CPU 1 CPU 2
A=1; B=2;
smp_mb(); smp_mb();
B=1; A=2;
In the following combinations from Table 14.2, the memory barriers have very limited
use in portable code, even on 21st-century hardware. However, “limited use” is different
than “no use”, so let’s see what can be done! Avid readers will want to write toy
programs that rely on each of these combinations in order to fully understand how this
works.
Ears to Ears. Since loads do not change the state of memory (ignoring MMIO
registers for the moment), it is not possible for one of the loads to see the results of the
other load. However, if we know that CPU 2’s load from B returned a newer value than
CPU 1’s load from B, then we also know that CPU 2’s load from A returned either the
same value as CPU 1’s load from A or some later value.
Mouth to Mouth, Ear to Ear. One of the variables is only loaded from, and the other
is only stored to. Because (once again, ignoring MMIO registers) it is not possible
for one load to see the results of the other, it is not possible to detect the conditional
ordering provided by the memory barrier.
However, it is possible to determine which store happened last, but this requires an
additional load from B. If this additional load from B is executed after both CPUs 1
and 2 complete, and if it turns out that CPU 2’s store to B happened last, then we know
that CPU 2’s load from A returned either the same value as CPU 1’s load from A or
some later value.
Only One Store. Because there is only one store, only one of the variables permits
one CPU to see the results of the other CPU’s access. Therefore, there is no way to
detect the conditional ordering provided by the memory barriers.
At least not straightforwardly. But suppose that in combination 1 from Table 14.2,
CPU 1’s load from A returns the value that CPU 2 stored to A. Then we know that
CPU 1's load from B returned either the same value as CPU 2's load from B or some
later value.
Quick Quiz 14.8: How can the other “Only one store” entries in Table 14.2 be
used?
2. The lock acquisitions and releases must appear to have executed in a single global
order.2
3. Suppose a given variable has not yet been stored in a critical section that is
currently executing. Then any load from a given variable performed in that
critical section must see the last store to that variable from the last previous
critical section that stored to it.
The difference between the last two properties is a bit subtle: the second requires
that the lock acquisitions and releases occur in a well-defined order, while the third
requires that the critical sections not “bleed out” far enough to cause difficulties for
other critical sections.
Why are these properties necessary?
Suppose the first property did not hold. Then the assertion in the following code
might well fail!
a = 1;
b = 1 + a;
assert(b == 2);
Quick Quiz 14.9: How could the assertion b==2 on page 381 possibly fail?
Suppose that the second property did not hold. Then the following code might leak
memory!
spin_lock(&mylock);
if (p == NULL)
p = kmalloc(sizeof(*p), GFP_ATOMIC);
spin_unlock(&mylock);
Quick Quiz 14.10: How could the code on page 381 possibly leak memory?
Suppose that the third property did not hold. Then the counter shown in the following
code might well count backwards. This third property is crucial, as it cannot be strictly
true with pairwise memory barriers.
spin_lock(&mylock);
ctr = ctr + 1;
spin_unlock(&mylock);
Quick Quiz 14.11: How could the code on page 381 possibly count backwards?
2 Of course, this order might be different from one run to the next. On any given run, however, all CPUs
and threads must have a consistent view of the order of critical sections for a given exclusive lock.
If you are convinced that these rules are necessary, let’s look at how they interact
with a typical locking implementation.
In this particular case, pairwise memory barriers suffice to keep the two critical
sections in place. CPU 2’s atomic_xchg(&lck->a, 1) has seen CPU 1’s lck->
a=0, so therefore everything in CPU 2’s following critical section must see everything
that CPU 1’s preceding critical section did. Conversely, CPU 1’s critical section cannot
see anything that CPU 2’s critical section will do.
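For concreteness, a minimal exchange-based lock along these lines might look as follows. This is a sketch rather than the Linux kernel's actual (queued) spinlock, and it ignores fairness and contention:

struct xchglock {
	atomic_t a;	/* 0: lock available, 1: lock held */
};

void xchg_lock(struct xchglock *lck)
{
	/* atomic_xchg() returns the old value and implies full memory
	 * barriers, so the critical section cannot leak out above the
	 * acquisition. */
	while (atomic_xchg(&lck->a, 1) != 0)
		continue;	/* someone else holds the lock: spin */
}

void xchg_unlock(struct xchglock *lck)
{
	smp_mb();		/* order critical-section accesses before release */
	atomic_set(&lck->a, 0);
}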
2. If a single shared variable is loaded and stored by multiple CPUs, then the series
of values seen by a given CPU will be consistent with the series seen by the other
CPUs, and there will be at least one sequence consisting of all values stored to
that variable with which each CPU's series will be consistent.3
3. If one CPU does ordered stores to variables A and B,4 and if a second CPU does
ordered loads from B and A,5 then if the second CPU’s load from B gives the
value stored by the first CPU, then the second CPU’s load from A must give the
value stored by the first CPU.
4. If one CPU does a load from A ordered before a store to B, and if a second CPU
does a load from B ordered before a store to A, and if the second CPU’s load
from B gives the value stored by the first CPU, then the first CPU’s load from A
must not give the value stored by the second CPU.
5. If one CPU does a load from A ordered before a store to B, and if a second CPU
does a store to B ordered before a store to A, and if the first CPU’s load from A
gives the value stored by the second CPU, then the first CPU’s store to B must
happen after the second CPU’s store to B, hence the value stored by the first CPU
persists.6
(Diagram: each CPU and a device access memory through an interface to the rest of the system, shown as dotted lines.)
Each CPU executes a program that generates memory access operations. In the
abstract CPU, memory operation ordering is very relaxed, and a CPU may actually
perform the memory operations in any order it likes, provided program causality appears
to be maintained. Similarly, the compiler may also arrange the instructions it emits in
any order it likes, provided it doesn’t affect the apparent operation of the program.
So in the above diagram, the effects of the memory operations performed by a CPU
are perceived by the rest of the system as the operations cross the interface between the
CPU and rest of the system (the dotted lines).
3 A given CPU’s series may of course be incomplete, for example, if a given CPU never loaded or stored
the shared variable, then it can have no opinion about that variable’s value.
4 For example, by executing the store to A, a memory barrier, and then the store to B.
5 For example, by executing the load from B, a memory barrier, and then the load from A.
6 Or, for the more competitively oriented, the first CPU’s store to B “wins”.
For example, consider the following sequence of events given the initial values {A
= 1, B = 2}:
CPU 1 CPU 2
A = 3; x = A;
B = 4; y = B;
The set of accesses as seen by the memory system in the middle can be arranged in
24 different combinations, with loads denoted by “ld” and stores denoted by “st”:
st A=3, st B=4, x=ld A→3, y=ld B→4
st A=3, st B=4, y=ld B→4, x=ld A→3
st A=3, x=ld A→3, st B=4, y=ld B→4
st A=3, x=ld A→3, y=ld B→2, st B=4
st A=3, y=ld B→2, st B=4, x=ld A→3
st A=3, y=ld B→2, x=ld A→3, st B=4
st B=4, st A=3, x=ld A→3, y=ld B→4
st B=4, ...
...
Furthermore, the stores committed by a CPU to the memory system may not be
perceived by the loads made by another CPU in the same order as the stores were
committed.
As a further example, consider this sequence of events given the initial values {A =
1, B = 2, C = 3, P = &A, Q = &C}:
CPU 1 CPU 2
B = 4; Q = P;
P = &B D = *Q;
There is an obvious data dependency here, as the value loaded into D depends on
the address retrieved from P by CPU 2. At the end of the sequence, any of the following
results are possible:
(Q == &A) and (D == 1)
(Q == &B) and (D == 2)
(Q == &B) and (D == 4)
Note that CPU 2 will never try and load C into D because the CPU will load P into
Q before issuing the load of *Q.
STORE *A = 5, x = LOAD *D
x = LOAD *D, STORE *A = 5
the second of which will almost certainly result in a malfunction, since it sets the
address after attempting to read the register.
14.2.9 Guarantees
There are some minimal guarantees that may be expected of a CPU:
1. On any given CPU, dependent memory accesses will be issued in order, with
respect to itself. This means that for:
Q = P; D = *Q;
the CPU will issue the following memory operations, and always in that order:
Q = LOAD P, D = LOAD *Q
2. Overlapping loads and stores within a particular CPU will appear to be ordered
within that CPU. This means that for:
a = *X; *X = b;
the CPU will only issue the following sequence of memory operations:
a = LOAD *X, STORE *X = b
And for:
*X = c; d = *X;
the CPU will only issue:
STORE *X = c, d = LOAD *X
(Loads and stores overlap if they are targeted at overlapping pieces of memory).
3. A series of stores to a single variable will appear to all CPUs to have occurred in
a single order, though this order might not be predictable from the code, and in
fact the order might vary from one run to another.
And there are a number of things that must or must not be assumed:
1. It must not be assumed that independent loads and stores will be issued in the
order given. This means that for:
X = *A; Y = *B; *D = Z;
the two loads and the store may be issued in any order. Similarly, adjacent and
overlapping accesses may be merged or discarded, so that for:
*A = X; *(A + 4) = Y;
the two stores may be issued in either order or combined into a single wider store.
Finally, for:
*A = X; *A = Y;
the first store may be discarded entirely, leaving only the store of Y.
Write Memory Barriers A write memory barrier gives a guarantee that all the
STORE operations specified before the barrier will appear to happen before all the
STORE operations specified after the barrier with respect to the other components of
the system.
A write barrier is a partial ordering on stores only; it is not required to have any
effect on loads.
A CPU can be viewed as committing a sequence of store operations to the memory
system as time progresses. All stores before a write barrier will occur in the sequence
before all the stores after the write barrier.
Note that write barriers should normally be paired with read or data dependency
barriers; see Section 14.2.10.6.
Read Memory Barriers A read barrier is a data dependency barrier plus a guarantee
that all the LOAD operations specified before the barrier will appear to happen before
all the LOAD operations specified after the barrier with respect to the other components
of the system.
A read barrier is a partial ordering on loads only; it is not required to have any effect
on stores.
Read memory barriers imply data dependency barriers, and so can substitute for
them.
Note that read barriers should normally be paired with write barriers; see Sec-
tion 14.2.10.6.
General Memory Barriers A general memory barrier gives a guarantee that all the
LOAD and STORE operations specified before the barrier will appear to happen before
all the LOAD and STORE operations specified after the barrier with respect to the other
components of the system.
A general memory barrier is a partial ordering over both loads and stores.
General memory barriers imply both read and write memory barriers, and so can
substitute for either.
There are a couple of types of implicit memory barriers, so called because they are
embedded into locking primitives:
1. LOCK operations.
2. UNLOCK operations.
There are certain things that memory barriers cannot guarantee outside of the confines
of a given architecture:
1. There is no guarantee that any of the memory accesses specified before a memory
barrier will be complete by the completion of a memory barrier instruction; the
barrier can be considered to draw a line in that CPU’s access queue that accesses
of the appropriate type may not cross.
2. There is no guarantee that issuing a memory barrier on one CPU will have any
direct effect on another CPU or any other hardware in the system. The indirect
effect will be the order in which the second CPU sees the effects of the first CPU’s
accesses occur, but see the next point.
3. There is no guarantee that a CPU will see the correct order of effects from a
second CPU’s accesses, even if the second CPU uses a memory barrier, unless
the first CPU also uses a matching memory barrier (see Section 14.2.10.6).
4. There is no guarantee that some intervening piece of off-the-CPU hardware7 will
not reorder the memory accesses. CPU cache coherency mechanisms should
propagate the indirect effects of a memory barrier between CPUs, but might not
do so in order.
There’s a clear data dependency here, and it would seem intuitively obvious that by
the end of the sequence, Q must be either &A or &B, and that:
(Q == &A) implies (D == 1)
(Q == &B) implies (D == 4)
Counter-intuitive though it might be, it is quite possible that CPU 2’s perception of
P might be updated before its perception of B, thus leading to the following situation:
(Q == &B) and (D == 2) ????
Whilst this may seem like a failure of coherency or causality maintenance, it isn’t,
and this behaviour can be observed on certain real CPUs (such as the DEC Alpha).
To deal with this, a data dependency barrier must be inserted between the address
load and the data load (again with initial values of {A = 1, B = 2, C = 3, P
= &A, Q = &C}):
CPU 1 CPU 2
B = 4;
<write barrier>
P = &B;
Q = P;
<data dependency barrier>
D = *Q;
This enforces the occurrence of one of the two implications, and prevents the third
possibility from arising.
7 This is of concern primarily in operating-system kernels. For more information on hardware opera-
tions and memory ordering, see the files pci.txt, DMA-API-HOWTO.txt, and DMA-API.txt in the
Documentation directory in the Linux source tree [Tor03].
Note that this extremely counterintuitive situation arises most easily on machines
with split caches, so that, for example, one cache bank processes even-numbered cache
lines and the other bank processes odd-numbered cache lines. The pointer P might
be stored in an odd-numbered cache line, and the variable B might be stored in an
even-numbered cache line. Then, if the even-numbered bank of the reading CPU’s cache
is extremely busy while the odd-numbered bank is idle, one can see the new value of
the pointer P (which is &B), but the old value of the variable B (which is 2).
Another example of where data dependency barriers might be required is where a
number is read from memory and then used to calculate the index for an array access
with initial values {M[0] = 1, M[1] = 2, M[3] = 3, P = 0, Q = 3}:
CPU 1 CPU 2
M[1] = 4;
<write barrier>
P = 1;
Q = P;
<data dependency barrier>
D = M[Q];
The data dependency barrier is very important to the Linux kernel’s RCU system, for
example, see rcu_dereference() in include/linux/rcupdate.h. This
permits the current target of an RCU’d pointer to be replaced with a new modified target,
without the replacement target appearing to be incompletely initialised.
See also Section 14.2.13.1 for a larger example.
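As a sketch of that usage, rcu_assign_pointer() supplies the write barrier on the update side and rcu_dereference() supplies the data dependency barrier, where needed, on the read side; the gp pointer and struct myobj below are hypothetical:

struct myobj {
	int a;
	int b;
};

struct myobj *gp;	/* RCU-protected pointer, initially NULL */

void publish_myobj(void)
{
	struct myobj *p = kmalloc(sizeof(*p), GFP_KERNEL);

	p->a = 1;
	p->b = 2;
	rcu_assign_pointer(gp, p);	/* write barrier before publication */
}

int read_myobj(void)
{
	struct myobj *p;
	int sum = 0;

	rcu_read_lock();
	p = rcu_dereference(gp);	/* data dependency barrier, as needed */
	if (p != NULL)
		sum = p->a + p->b;	/* guaranteed to see initialized values */
	rcu_read_unlock();
	return sum;
}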
Placing a data dependency barrier between a load of x and a subsequent load of y that
is merely guarded by an “if” test will not have the desired effect because there is no actual data dependency, but
rather a control dependency that the CPU may short-circuit by attempting to predict the
outcome in advance, so that other CPUs see the load from y as having happened before
the load from x. In such a case what’s actually required is:
1 q = READ_ONCE(x);
2 if (q) {
3 <read barrier>
4 q = READ_ONCE(y);
5 }
However, stores are not speculated. This means that ordering is provided for load-
store control dependencies, as in the following example:
1 q = READ_ONCE(x);
2 if (q)
3 WRITE_ONCE(y, 1);
Control dependencies pair normally with other types of barriers. That said, please
note that neither READ_ONCE() nor WRITE_ONCE() are optional! Without the
READ_ONCE(), the compiler might combine the load from x with other loads from x.
Without the WRITE_ONCE(), the compiler might combine the store to y with other
stores to y. Either can result in highly counterintuitive effects on ordering.
Worse yet, if the compiler is able to prove (say) that the value of variable x is
always non-zero, it would be well within its rights to optimize the original example by
eliminating the “if” statement as follows:
1 q = READ_ONCE(x);
2 WRITE_ONCE(y, 1); /* BUG: CPU can reorder!!! */
Now there is no conditional between the load from x and the store to y, which
means that the CPU is within its rights to reorder them: The conditional is absolutely
required, and must be present in the assembly code even after all compiler optimizations
have been applied. Therefore, if you need ordering in this example, you need explicit
memory barriers, for example, a release store:
1 q = READ_ONCE(x);
2 if (q) {
3 smp_store_release(&y, 1);
4 do_something();
5 } else {
6 smp_store_release(&y, 1);
7 do_something_else();
8 }
The initial READ_ONCE() is still required to prevent the compiler from proving
the value of x.
In addition, you need to be careful what you do with the local variable q, otherwise
the compiler might be able to guess the value and again remove the needed conditional.
For example:
1 q = READ_ONCE(x);
2 if (q % MAX) {
3 WRITE_ONCE(y, 1);
4 do_something();
5 } else {
6 WRITE_ONCE(y, 2);
7 do_something_else();
8 }
If MAX is defined to be 1, then the compiler knows that (q%MAX) is equal to zero,
in which case the compiler is within its rights to transform the above code into the
following:
1 q = READ_ONCE(x);
2 WRITE_ONCE(y, 2);
3 do_something_else();
Given this transformation, the CPU is not required to respect the ordering between
the load from variable x and the store to variable y. It is tempting to add a barrier()
to constrain the compiler, but this does not help. The conditional is gone, and the barrier
won’t bring it back. Therefore, if you are relying on this ordering, you should make sure
that MAX is greater than one, perhaps as follows:
1 q = READ_ONCE(x);
2 BUILD_BUG_ON(MAX <= 1);
3 if (q % MAX) {
4 WRITE_ONCE(y, 1);
5 do_something();
6 } else {
7 WRITE_ONCE(y, 2);
8 do_something_else();
9 }
Please note once again that the stores to y differ. If they were identical, as noted
earlier, the compiler could pull this store outside of the “if” statement.
You must also avoid excessive reliance on boolean short-circuit evaluation. Consider
this example:
1 q = READ_ONCE(x);
2 if (q || 1 > 0)
3 WRITE_ONCE(y, 1);
Because the first condition cannot fault and the second condition is always true, the
compiler can transform this example as follows, defeating the control dependency:
1 q = READ_ONCE(x);
2 WRITE_ONCE(y, 1);
This example underscores the need to ensure that the compiler cannot out-guess your
code. More generally, although READ_ONCE() does force the compiler to actually
emit code for a given load, it does not force the compiler to use the results.
In addition, control dependencies apply only to the then-clause and else-clause of
the if-statement in question. In particular, they do not necessarily apply to code following
the if-statement:
1 q = READ_ONCE(x);
2 if (q) {
3 WRITE_ONCE(y, 1);
4 } else {
5 WRITE_ONCE(y, 2);
6 }
7 WRITE_ONCE(z, 1); /* BUG: No ordering. */
It is tempting to argue that there in fact is ordering because the compiler cannot
reorder volatile accesses and also cannot reorder the writes to y with the condition.
Unfortunately for this line of reasoning, the compiler might compile the two writes to y
as conditional-move instructions, as in this fanciful pseudo-assembly language:
1 ld r1,x
2 cmp r1,$0
3 cmov,ne r4,$1
4 cmov,eq r4,$2
5 st r4,y
6 st $1,z
A weakly ordered CPU would have no dependency of any sort between the load
from x and the store to z. The control dependencies would extend only to the pair
of cmov instructions and the store depending on them. In short, control dependencies
apply only to the stores in the “then” and “else” of the “if” in question (including
functions invoked by those two clauses), not to code following that “if”.
Finally, control dependencies do not provide transitivity. This is demonstrated by
two related examples, with the initial values of x and y both being zero:
CPU 0 CPU 1
r1 = READ_ONCE(x); r2 = READ_ONCE(y);
if (r1 > 0) if (r2 > 0)
WRITE_ONCE(y, 1); WRITE_ONCE(x, 1);
The above two-CPU example will never trigger the assert(). However, if control
dependencies guaranteed transitivity (which they do not), then adding the following
CPU would guarantee a related assertion:
CPU 2
WRITE_ONCE(y, 1);
But because control dependencies do not provide transitivity, the above assertion
can fail after the combined three-CPU example completes. If you need the three-CPU
example to provide ordering, you will need smp_mb() between the loads and stores
in the CPU 0 and CPU 1 code fragments, that is, just before or just after the “if”
statements. Furthermore, the original two-CPU example is very fragile and should be
avoided.
The two-CPU example is known as LB (load buffering) and the three-CPU example
as WWC [MSS12].
The following list of rules summarizes the lessons of this section:
2. Control dependencies can order prior loads against later stores. However, they
do not guarantee any other sort of ordering: Not prior loads against later loads,
nor prior stores against later anything. If you need these other forms of ordering,
use smp_rmb(), smp_wmb(), or, in the case of prior stores and later loads,
smp_mb().
3. If both legs of the “if” statement begin with identical stores to the same variable,
then those stores must be ordered, either by preceding both of them with smp_
mb() or by using smp_store_release() to carry out the stores. Please
note that it is not sufficient to use barrier() at the beginning of each leg of the
“if” statement because, as shown by the example above, optimizing compilers
can destroy the control dependency while respecting the letter of the barrier()
law.
4. Control dependencies require at least one run-time conditional between the prior
load and the subsequent store, and this conditional must involve the prior load. If
the compiler is able to optimize the conditional away, it will have also optimized
away the ordering. Careful use of READ_ONCE() and WRITE_ONCE() can
help to preserve the needed conditional.
5. Control dependencies require that the compiler avoid reordering the depen-
dency into nonexistence. Careful use of READ_ONCE(), atomic_read(), or
atomic64_read() can help to preserve your control dependency.
6. Control dependencies apply only to the “then” and “else” of the “if” con-
taining the control dependency, including any functions that these two clauses
call. Control dependencies do not apply to code following the “if” containing
the control dependency.
7. Control dependencies pair normally with other types of barriers.
8. Control dependencies do not provide transitivity. If you need transitivity, use
smp_mb().
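To illustrate rule 3 above, here is a minimal sketch (again assuming the Linux-kernel READ_ONCE() and smp_store_release() primitives; the globals x and y and the do_this() and do_that() helpers are hypothetical) of an "if" statement whose two legs begin with identical stores:

int x, y;

void identical_stores(void)
{
	if (READ_ONCE(x)) {
		/* The release store stays ordered after the load of x even if */
		/* the compiler hoists the two identical stores out of the "if". */
		smp_store_release(&y, 1);
		do_this();
	} else {
		smp_store_release(&y, 1);
		do_that();
	}
}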
When dealing with CPU-CPU interactions, memory barriers should be paired: a write
barrier on one CPU pairs with a read barrier or data dependency barrier on the CPU
observing its stores. For example, the data dependency barrier variant of such pairing
looks like this:
CPU 1                      CPU 2
A = 1;
<write barrier>
B = &A;                    X = B;
                           <data dependency barrier>
                           Y = *X;
One way or another, the read barrier must always be present, even though it might
be of a weaker type.8
Note that the stores before the write barrier would normally be expected to match
the loads after the read barrier or data dependency barrier, and vice versa:
CPU 1                      CPU 2
a = 1;                     v = c;
b = 2;                     w = d;
<write barrier>            <read barrier>
c = 3;                     x = a;
d = 4;                     y = b;
8 By “weaker”, we mean “makes fewer ordering guarantees”. A weaker barrier is usually also lower-overhead than a stronger barrier.
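In Linux-kernel terms, such a pairing might look like the following sketch (assuming the kernel's READ_ONCE(), WRITE_ONCE(), smp_wmb(), and smp_rmb() primitives, with a, b, c, and d corresponding to the variables above):

int a, b, c, d;

void cpu1(void)
{
	WRITE_ONCE(a, 1);
	WRITE_ONCE(b, 2);
	smp_wmb();              /* Pairs with the smp_rmb() in cpu2(). */
	WRITE_ONCE(c, 3);
	WRITE_ONCE(d, 4);
}

void cpu2(void)
{
	int v = READ_ONCE(c);
	int w = READ_ONCE(d);

	smp_rmb();              /* Pairs with the smp_wmb() in cpu1(). */
	int x = READ_ONCE(a);
	int y = READ_ONCE(b);
	/* If v == 3 or w == 4, then x == 1 and y == 2 are guaranteed. */
}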
Firstly, write barriers act as partial orderings on store operations. Consider the following
sequence of events:
STORE A = 1
STORE B = 2
STORE C = 3
<write barrier>
STORE D = 4
STORE E = 5
This sequence of events is committed to the memory system in an order that the rest
of the system might perceive as the unordered set of {A=1,B=2,C=3} all
occurring before the unordered set of {D=4,E=5}, as shown in Figure 14.9.

[Figure 14.9: the sequence in which stores are committed to the memory system by
CPU 1. At the point of the write barrier, all stores prior to the barrier (C=3, A=1, B=2)
must be committed before further stores (E=5, D=4) may take place, and only these
committed events are perceptible to the rest of the system.]
Secondly, data dependency barriers act as partial orderings on data-dependent loads.
Consider the following sequence of events, with initial values {B = 7, X = 9, Y
= 8, C = &Y}:

CPU 1                      CPU 2
A = 1;
B = 2;
<write barrier>
C = &B;                    LOAD X
D = 4;                     LOAD C (gets &B)
                           LOAD *C (reads B)

Without intervention, CPU 2 may perceive the events on CPU 1 in some effectively
random order, despite the write barrier issued by CPU 1, as shown in Figure 14.10.
In the above example, CPU 2 perceives that B is 7, despite the load of *C (which
would be B) coming after the LOAD of C.

[Figure 14.10: Data Dependency Barrier Omitted. Without the barrier, CPU 2's
perception of B is apparently incorrect: it perceives the stale value B=7 even though
its load of C has already returned &B.]
If, however, a data dependency barrier were to be placed between the load of C and
the load of *C (i.e.: B) on CPU 2, again with initial values of {B = 7, X = 9, Y
= 8, C = &Y}:
CPU 1                      CPU 2
A = 1;
B = 2;
<write barrier>
C = &B;                    LOAD X
D = 4;                     LOAD C (gets &B)
                           <data dependency barrier>
                           LOAD *C (reads B)
then ordering will be as intuitively expected, as shown in Figure 14.11.

[Figure 14.11: with the data dependency barrier in place, all stores prior to CPU 1's
store of C are perceptible to CPU 2's subsequent loads, so CPU 2 correctly perceives
B=2.]
And thirdly, a read barrier acts as a partial order on loads. Consider the following
sequence of events, with initial values {A = 0, B = 9}:
CPU 1                      CPU 2
A = 1;
<write barrier>
B = 2;
                           LOAD B
                           LOAD A
Without intervention, CPU 2 may then choose to perceive the events on CPU 1 in
some effectively random order, despite the write barrier issued by CPU 1, as shown in
Figure 14.12.
If, however, a read barrier were to be placed between the load of B and the load of A
on CPU 2, again with initial values of {A = 0, B = 9}:
CPU 1                      CPU 2
A = 1;
<write barrier>
B = 2;
                           LOAD B
                           <read barrier>
                           LOAD A

[Figure 14.12: Read Barrier Needed. Without a read barrier, CPU 2 may perceive
B=2 while still perceiving the stale value A=0, despite CPU 1's write barrier.]
then the partial ordering imposed by CPU 1’s write barrier will be perceived correctly
by CPU 2, as shown in Figure 14.13.
[Figure 14.13: with the read barrier in place, all effects prior to CPU 1's store of B
are made perceptible to CPU 2, so once CPU 2 has perceived B=2 it also perceives
A=1.]
To illustrate this more completely, consider what could happen if the code contained
a load of A either side of the read barrier, once again with the same initial values of {A
= 0, B = 9}:
CPU 1                      CPU 2
A = 1;
<write barrier>
B = 2;
                           LOAD B
                           LOAD A (1st)
                           <read barrier>
                           LOAD A (2nd)
Even though the two loads of A both occur after the load of B, they may both come
up with different values, as shown in Figure 14.14.
Of course, it may well be that CPU 1’s update to A becomes perceptible to CPU 2
before the read barrier completes, as shown in Figure 14.15.
[Figure 14.14: Read Barrier Supplied, Double Load. The first load may return the
old value A=0, but the read barrier causes all effects prior to CPU 1's store of B to be
perceptible to CPU 2, so the second load returns A=1.]
[Figure 14.15: Read Barrier Supplied, Take Two. CPU 1's update to A becomes
perceptible to CPU 2 before the read barrier completes, so both loads return A=1.]
The guarantee is that the second load will always come up with A == 1 if the load
of B came up with B == 2. No such guarantee exists for the first load of A; that may
come up with either A == 0 or A == 1.
Many CPUs speculate with loads: that is, they see that they will need to load an item
from memory, and they find a time where they’re not using the bus for any other loads,
and then do the load in advance—even though they haven’t actually got to that point
in the instruction execution flow yet. Later on, this potentially permits the actual load
instruction to complete immediately because the CPU already has the value on hand.
It may turn out that the CPU didn’t actually need the value (perhaps because a
branch circumvented the load) in which case it can discard the value or just cache it for
later use. For example, consider the following:
CPU 1                      CPU 2
                           LOAD B
                           DIVIDE
                           DIVIDE
                           LOAD A
On some CPUs, divide instructions can take a long time to complete, which means
that CPU 2’s bus might go idle during that time. CPU 2 might therefore speculatively
load A before the divides complete. In the (hopefully) unlikely event of an exception
from one of the divides, this speculative load will have been wasted, but in the (again,
hopefully) common case, overlapping the load with the divides will permit the load to
complete more quickly, as illustrated by Figure 14.16.
[Figure 14.16: the CPU, being busy doing a division, speculates on the LOAD of A;
once the divisions are complete, the CPU can then perform the LOAD with immediate
effect.]
Placing a read barrier or a data dependency barrier just before the second load:
CPU 1                      CPU 2
                           LOAD B
                           DIVIDE
                           DIVIDE
                           <read barrier>
                           LOAD A
[Figure 14.17: Speculative Load and Barrier. The CPU, being busy doing a division,
speculates on the LOAD of A; the read barrier forces the speculated value to be
reconsidered once the divisions complete.]

[Figure: if A was updated by another CPU in the meantime, the speculation is
discarded and an updated value (A=1) is retrieved.]
LOCK and UNLOCK operations are only semi-permeable barriers: it is possible for an access preceding the
LOCK to happen after the LOCK, and an access following the UNLOCK to happen
before the UNLOCK, and the two accesses can themselves then cross. For example, the
following:
1 *A = a;
2 LOCK
3 UNLOCK
4 *B = b;
Again, always remember that LOCK and UNLOCK are permitted to let preceding
operations and following operations “bleed in” to the critical section respectively.
Quick Quiz 14.13: What sequence of LOCK-UNLOCK operations would act as a
full memory barrier?
Quick Quiz 14.14: What (if any) CPUs have memory-barrier instructions from
which these semi-permeable locking primitives might be constructed?
Now consider a longer sequence of memory operations:
1 *A = a;
2 *B = b;
3 LOCK
4 *C = c;
5 *D = d;
6 UNLOCK
7 *E = e;
8 *F = f;
This could legitimately execute in the following order, where pairs of operations on
the same line indicate that the CPU executed those operations concurrently:
3 LOCK
1 *A = a; *F = f;
7 *E = e;
4 *C = c; *D = d;
2 *B = b;
6 UNLOCK
Quick Quiz 14.15: Given that operations grouped in curly braces are executed con-
currently, which of the rows of Table 14.3 are legitimate reorderings of the assignments
to variables “A” through “F” and the LOCK/UNLOCK operations? (The order in the
code is A, B, LOCK, C, D, UNLOCK, E, F.) Why or why not?
Ordering with Multiple Locks: Code containing multiple locks still sees ordering
constraints from those locks, but one must be careful to keep track of which lock is
which. For example, consider the code shown in Table 14.4, which uses a pair of locks
named “M” and “Q”.
CPU 1 CPU 2
A = a; E = e;
LOCK M; LOCK Q;
B = b; F = f;
C = c; G = g;
UNLOCK M; UNLOCK Q;
D = d; H = h;
In this example, there are no guarantees as to what order the assignments to vari-
ables “A” through “H” will appear in, other than the constraints imposed by the locks
themselves, as described in the previous section.
Quick Quiz 14.16: What are the constraints for Table 14.4?
Ordering with Multiple CPUs on One Lock: Suppose, instead of the two different
locks as shown in Table 14.4, both CPUs acquire the same lock, as shown in Table 14.5?
CPU 1 CPU 2
A = a; E = e;
LOCK M; LOCK M;
B = b; F = f;
C = c; G = g;
UNLOCK M; UNLOCK M;
D = d; H = h;
In this case, either CPU 1 acquires M before CPU 2 does, or vice versa. In the first
case, the assignments to A, B, and C must precede those to F, G, and H. On the other
hand, if CPU 2 acquires the lock first, then the assignments to E, F, and G must precede
those to B, C, and D.
[Figure: a CPU connected to memory through its cache and the cache-coherency
mechanism.]

A CPU's instructions may generate memory accesses that must be queued in the CPU's memory
access queue, but execution may nonetheless continue until the CPU either fills up its
internal resources or until it must wait for some queued memory access to complete.
Although cache-coherence protocols guarantee that a given CPU sees its own accesses
in order, and that all CPUs agree on the order of modifications to a single variable
contained within a single cache line, there is no guarantee that modifications to different
variables will be seen in the same order by all CPUs—although some computer systems
do make some such guarantees, portable software cannot rely on them.
[Figure 14.20: a two-CPU system in which CPU 1 has split caches A and B, CPU 2
has split caches C and D, and all four caches are connected to the memory system.]
To see why reordering can occur, consider the two-CPU system shown in Fig-
ure 14.20, in which each CPU has a split cache. In this system, odd-numbered cache
lines are handled by one of each CPU's caches (cache A in the case of CPU 1) and
even-numbered cache lines by the other (cache B for CPU 1), and each cache processes
its own queue of pending operations independently.9
In short, if cache A is busy, but cache B is idle, then CPU 1's stores to odd-numbered
cache lines may be delayed relative to its stores to even-numbered cache lines.
In not-so-extreme cases, CPU 2 may see CPU 1's operations out of order.
Much more detail on memory ordering in hardware and software may be found in
Appendix B.
9 But note that in “superscalar” systems, the CPU might well be accessing both halves of its cache at
once, and might in fact be performing multiple concurrent accesses to each of the halves.
NBS classes 1, 2 and 3 were first formulated in the early 1990s, class 4 was first
formulated in the early 2000s, and class 5 was first formulated in 2013. The final two
classes have seen informal use for a great many decades, but were reformulated in 2013.
In theory, any parallel algorithm can be cast into wait-free form, but there is a
relatively small subset of NBS algorithms that are in common use. A few of these are
listed in the following section.
10 As we will see below, some recent NBS work relaxes this guarantee.
11 Again, some recent NBS work relaxes this guarantee.
Chapter 15
Parallel Real-Time Computing
If the slow but accurate method fails to finish in time, kill it and use the answer from the fast but inaccurate
method. One candidate for the fast but inaccurate method is to take no control action
during the current time period, and another candidate is to take the same control action
as was taken during the preceding time period.
In short, it does not make sense to talk about soft real time without some measure of
exactly how soft it is.
There must also be a requirement that the system meet its deadline some fraction of the time,
or perhaps that it be prohibited from missing its deadlines on more than a certain number
of consecutive operations.
We clearly cannot take a sound-bite approach to either hard or soft real time. The
next section therefore takes a more real-world approach.
2 Decades later, the acceptance tests for some types of computer systems involve large detonations, and
some types of communications networks must deal with what is delicately termed “ballistic jamming.”
Just as with people, it is often possible to prevent a real-time system from meeting
its deadlines by overloading it. For example, if the system is being interrupted too
frequently, it might not have sufficient CPU bandwidth to handle its real-time application.
A hardware solution to this problem might limit the rate at which interrupts were
delivered to the system. Possible software solutions include disabling interrupts for
some time if they are being received too frequently, resetting the device generating
too-frequent interrupts, or even avoiding interrupts altogether in favor of polling.
Overloading can also degrade response times due to queueing effects, so it is not
unusual for real-time systems to overprovision CPU bandwidth, so that a running system
has (say) 80% idle time. This approach also applies to storage and networking devices.
In some cases, separate storage and networking hardware might be reserved for the sole
use of high-priority portions of the real-time application. It is of course not unusual
for this hardware to be mostly idle, given that response time is more important than
throughput in real-time systems.
Quick Quiz 15.2: But given the results from queueing theory, won’t low utilization
merely improve the average response time rather than improving the worst-case response
time? And isn’t worst-case response time all that most real-time systems really care
about?
Of course, maintaining sufficiently low utilization requires great discipline through-
out the design and implementation. There is nothing quite like a little feature creep to
destroy deadlines.
It is easier to provide bounded response time for some operations than for others. For
example, it is quite common to see response-time specifications for interrupts and for
wake-up operations, but quite rare for (say) filesystem unmount operations. One reason
for this is that it is quite difficult to bound the amount of work that a filesystem-unmount
operation might need to do, given that the unmount is required to flush all of that
filesystem’s in-memory data to mass storage.
This means that real-time applications must be confined to operations for which
bounded latencies can reasonably be provided. Other operations must either be pushed
out into the non-real-time portions of the application or forgone entirely.
There might also be constraints on the non-real-time portions of the application. For
example, is the non-real-time application permitted to use CPUs used by the real-time
portion? Are there time periods during which the real-time portion of the application is
expected to be unusually busy, and if so, is the non-real-time portion of the application
permitted to run at all during those times? Finally, by what amount is the real-time
portion of the application permitted to degrade the throughput of the non-real-time
portion?
3 Important safety tip: Worst-case response times from USB devices can be extremely long. Real-time
systems should therefore take care to place any USB devices well away from critical paths.
As with other kinds of technical budget, a strong validation effort is required in order
to ensure proper focus on latencies and to give early warning of latency problems. A
successful validation effort will almost always include a good test suite, which might be
unsatisfying to the theorists, but has the virtue of helping to get the job done. As a point
of fact, as of early 2015, most real-world real-time systems use an acceptance test rather
than formal proofs.
That said, the widespread use of test suites to validate real-time systems does have
a very real disadvantage, namely that real-time software is validated only on specific
hardware in specific hardware and software configurations. Adding additional hardware
and configurations requires additional costly and time-consuming testing. Perhaps the
field of formal verification will advance sufficiently to change this situation, but as of
early 2015, rather large advances are required.
Quick Quiz 15.3: Formal verification is already quite capable, benefiting from
decades of intensive study. Are additional advances really required, or is this just a
practitioner’s excuse to continue to be lazy and ignore the awesome power of formal
verification?
In addition to latency requirements for the real-time portions of the application, there
will likely be performance and scalability requirements for the non-real-time portions of
the application. These additional requirements reflect the fact that ultimate real-time
latencies are often attained by degrading scalability and average performance.
Software-engineering requirements can also be important, especially for large ap-
plications that must be developed and maintained by large teams. These requirements
often favor increased modularity and fault isolation.
This is a mere outline of the work that would be required to specify deadlines and
environmental constraints for a production real-time system. It is hoped that this outline
clearly demonstrates the inadequacy of the sound-bite-based approach to real-time
computing.
[Figure: a real-time system structured as hard real-time "reflexes" that connect
stimulus to response, coupled with a non-real-time strategy and planning component.]
These four areas could be characterized as “in search of production”, “in search of life”,
“in search of death”, and “in search of money”.
Financial-services applications differ subtly from applications in the other three
categories in that money is non-material, meaning that non-computational latencies are
quite small. In contrast, mechanical delays inherent in the other three categories provide
a very real point of diminishing returns beyond which further reductions in the applica-
tion’s real-time response provide little or no benefit. This means that financial-services
applications, along with other real-time information-processing applications, face an
arms race, where the application with the lowest latencies normally wins. Although the
resulting latency requirements can still be specified as described in Section 15.1.3.4, the
unusual nature of these requirements has led some to refer to financial and information-
processing applications as “low latency” rather than “real time”.
Regardless of exactly what we choose to call it, there is substantial need for real-time
computing [Pet06, Inm07].
[Figure: a scale of achievable response times running from 1 s down to 1 ns, with
scripting languages at the slow end, followed by the Linux 2.4 kernel, real-time Java
(with GC), the Linux 2.6.x/3.x kernel, real-time Java (no GC), the Linux -rt patchset,
specialty RTOSes (no MMU), hand-coded assembly, and custom digital hardware at
the fast end.]
[Figure: an RTOS hosting both Linux processes and RTOS processes, with the Linux
kernel's RCU read-side critical sections, spinlock critical sections, interrupt handlers,
scheduling-clock interrupts, interrupt-disabled regions, and preempt-disabled regions
layered above the RTOS.]
This RTOS-plus-Linux approach has its drawbacks, however. Even where POSIX system calls are forwarded
from the RTOS to a utility thread running on Linux, there are invariably rough edges.
In addition, the RTOS must interface to both the hardware and to the Linux kernel,
thus requiring significant maintenance with changes in both hardware and kernel. Fur-
thermore, each such RTOS often has its own system-call interface and set of system
libraries, which can balkanize both ecosystems and developers. In fact, these problems
seem to be what drove the combination of RTOSes with Linux, as this approach allowed
access to the full real-time capabilities of the RTOS, while allowing the application’s
non-real-time code full access to Linux’s rich and vibrant open-source ecosystem.
Although pairing RTOSes with the Linux kernel was a clever and useful short-term
response during the time that the Linux kernel had minimal real-time capabilities, it
also motivated adding real-time capabilities to the Linux kernel. Progress towards this
goal is shown in Figure 15.7. The upper row shows a diagram of the Linux kernel with
preemption disabled, thus having essentially no real-time capabilities. The middle row
shows a set of diagrams showing the increasing real-time capabilities of the mainline
Linux kernel with preemption enabled. Finally, the bottom row shows a diagram
of the Linux kernel with the -rt patchset applied, maximizing real-time capabilities.
Functionality from the -rt patchset is added to mainline, hence the increasing capabilities
of the mainline Linux kernel over time. Nevertheless, the most demanding real-time
applications continue to use the -rt patchset.
The non-preemptible kernel shown at the top of Figure 15.7 is built with CONFIG_
PREEMPT=n, so that execution within the Linux kernel cannot be preempted. This
means that the kernel’s real-time response latency is bounded below by the longest
code path in the Linux kernel, which is indeed long. However, user-mode execution is
preemptible, so that one of the real-time Linux processes shown in the upper right may
preempt any of the non-real-time Linux processes shown in the upper left anytime the
non-real-time process is executing in user mode.
The preemptible kernels shown in the middle row of Figure 15.7 are built with
CONFIG_PREEMPT=y, so that most process-level code within the Linux kernel can be
preempted. This of course greatly improves real-time response latency, but preemption
is still disabled within RCU read-side critical sections, spinlock critical sections, inter-
rupt handlers, interrupt-disabled code regions, and preempt-disabled code regions, as
indicated by the red boxes in the left-most diagram in the middle row of the figure. The
advent of preemptible RCU allowed RCU read-side critical sections to be preempted,
as shown in the central diagram, and the advent of threaded interrupt handlers allowed
device-interrupt handlers to be preempted, as shown in the right-most diagram. Of
course, a great deal of other real-time functionality was added during this time; how-
ever, it cannot be as easily represented on this diagram. It will instead be discussed in
Section 15.4.1.1.
A final approach is simply to get everything out of the way of the real-time process,
clearing all other processing off of any CPUs that this process needs, as shown in
Figure 15.8. This was implemented in the 3.10 Linux kernel via the CONFIG_NO_HZ_
FULL Kconfig parameter [Wei12]. It is important to note that this approach requires at
least one housekeeping CPU to do background processing, for example running kernel
daemons. However, when there is only one runnable task on a given non-housekeeping
CPU, scheduling-clock interrupts are shut off on that CPU, removing an important
source of interference and OS jitter.4
4 A residual once-per-second scheduling-clock interrupt nevertheless remains on such CPUs. Future work includes addressing the remaining concerns and eliminating this residual interrupt.
[Figure 15.7: the growing real-time capabilities of the Linux kernel. With CONFIG_
PREEMPT=n, RCU read-side critical sections, spinlock critical sections, interrupt
handlers, scheduling-clock interrupts, interrupt-disabled regions, and preempt-disabled
regions all stand in the way of real-time Linux processes; with CONFIG_PREEMPT=y,
first RCU read-side critical sections and then interrupt handlers become preemptible;
the -rt patchset goes further still.]

[Figure 15.8: all other processing cleared off of the CPUs needed by the real-time
Linux processes.]
With a few exceptions, the kernel does not force other processing off of the non-housekeeping CPUs, but instead simply provides better
performance when only one runnable task is present on a given CPU. If configured
properly, a non-trivial undertaking, CONFIG_NO_HZ_FULL offers real-time threads
levels of performance nearly rivaling that of bare-metal systems.
There has of course been much debate over which of these approaches is best for
real-time systems, and this debate has been going on for quite some time [Cor04a,
Cor04c]. As usual, the answer seems to be “It depends,” as discussed in the following
sections. Section 15.4.1.1 considers event-driven real-time systems, and Section 15.4.1.2
considers real-time systems that use a CPU-bound polling loop.
Timers are clearly critically important for real-time operations. After all, if you
cannot specify that something be done at a specific time, how are you going to respond
by that time? Even in non-real-time systems, large numbers of timers are generated,
so they must be handled extremely efficiently. Example uses include retransmit timers
for TCP connections (which are almost always cancelled before they have a chance to
fire),5 timed delays (as in sleep(1), which are rarely cancelled), and timeouts for the
poll() system call (which are often cancelled before they have a chance to fire). A
good data structure for such timers would therefore be a priority queue whose addition
and deletion primitives were fast and O(1) in the number of timers posted.
The classic data structure for this purpose is the calendar queue, which in the Linux
kernel is called the timer wheel. This age-old data structure is also heavily used in
discrete-event simulation. The idea is that time is quantized, for example, in the Linux
kernel, the duration of the time quantum is the period of the scheduling-clock interrupt.
5 At least assuming reasonably low packet-loss rates!
[Figure 15.9: a two-level timer wheel for an eight-bit clock, with an upper array of
16 elements (0x through fx) indexed by the high-order four bits of the time and a
lower array of 16 elements (x0 through xf) indexed by the low-order four bits; the
current time is 0x1f.]
A given time can be represented by an integer, and any attempt to post a timer at some
non-integral time will be rounded to a convenient nearby integral time quantum.
One straightforward implementation would be to allocate a single array, indexed
by the low-order bits of the time. This works in theory, but in practice systems create
large numbers of long-duration timeouts (for example, 45-minute keepalive timeouts for
TCP sessions) that are almost always cancelled. These long-duration timeouts cause
problems for small arrays because much time is wasted skipping timeouts that have not
yet expired. On the other hand, an array that is large enough to gracefully accommodate
a large number of long-duration timeouts would consume too much memory, especially
given that performance and scalability concerns require one such array for each and
every CPU.
A common approach for resolving this conflict is to provide multiple arrays in a
hierarchy. At the lowest level of this hierarchy, each array element represents one unit
of time. At the second level, each array element represents N units of time, where N is
the number of elements in each array. At the third level, each array element represents
N 2 units of time, and so on up the hierarchy. This approach allows the individual arrays
to be indexed by different bits, as illustrated by Figure 15.9 for an unrealistically small
eight-bit clock. Here, each array has 16 elements, so the low-order four bits of the time
(currently 0xf) index the low-order (rightmost) array, and the next four bits (currently
0x1) index the next level up. Thus, we have two arrays each with 16 elements, for a
total of 32 elements, which, taken together, is much smaller than the 256-element array
that would be required for a single array.
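As a rough sketch of this indexing scheme for the two-level, eight-bit-clock wheel of Figure 15.9 (the constant and function names here are hypothetical, not the Linux kernel's):

#include <stdint.h>

#define LEVEL_BITS 4                      /* Four bits of the time per level... */
#define LEVEL_SIZE (1 << LEVEL_BITS)      /* ...so 16 elements per array.       */
#define LEVEL_MASK (LEVEL_SIZE - 1)

/* Map an eight-bit expiration time to its bucket in each of the two arrays. */
static void wheel_buckets(uint8_t expires, unsigned int *lower, unsigned int *upper)
{
	*lower = expires & LEVEL_MASK;                  /* Low-order four bits. */
	*upper = (expires >> LEVEL_BITS) & LEVEL_MASK;  /* Next four bits up.   */
}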
This approach works extremely well for throughput-based systems. Each timer
operation is O(1) with small constant, and each timer element is touched at most m + 1
times, where m is the number of levels.
Unfortunately, timer wheels do not work well for real-time systems, and for two
reasons. The first reason is that there is a harsh tradeoff between timer accuracy and timer
overhead, which is fancifully illustrated by Figures 15.10 and 15.11. In Figure 15.10,
timer processing happens only once per millisecond, which keeps overhead acceptably
low for many (but not all!) workloads, but which also means that timeouts cannot
be set for finer than one-millisecond granularities. On the other hand, Figure 15.11
shows timer processing taking place every ten microseconds, which provides acceptably
fine timer granularity for most (but not all!) workloads, but which processes timers so
frequently that the system might well not have time to do anything else.
The second reason is the need to cascade timers from higher levels to lower levels.
Referring back to Figure 15.9, we can see that any timers enqueued on element 1x
in the upper (leftmost) array must be cascaded down to the lower (rightmost) array
so that they may be invoked when their time arrives. Unfortunately, there could be a large
number of timeouts waiting to be cascaded, especially for timer wheels with larger
numbers of levels. The power of statistics causes this cascading to be a non-problem for
throughput-oriented systems, but cascading can result in problematic degradations of
latency in real-time systems.

[Figure: with a non-threaded interrupt handler, the entire handler runs between the
interrupt and the return from interrupt, preempting mainline code for a long time and
degrading response time; with a threaded handler, only a short handler runs in interrupt
context, the remainder runs in a preemptible IRQ thread, and response time improves.]
Of course, real-time systems could simply choose a different data structure, for
example, some form of heap or tree, giving up O(1) bounds on insertion and deletion
operations to gain O(log n) limits on data-structure-maintenance operations. This can
be a good choice for special-purpose RTOSes, but is inefficient for general-purpose
systems such as Linux, which routinely support extremely large numbers of timers.
The solution chosen for the Linux kernel’s -rt patchset is to differentiate between
timers that schedule later activity and timeouts that schedule error handling for low-
probability errors such as TCP packet losses. One key observation is that error handling
is normally not particularly time-critical, so that a timer wheel’s millisecond-level
granularity is good and sufficient. Another key observation is that error-handling
timeouts are normally cancelled very early, often before they can be cascaded. A
final observation is that systems commonly have many more error-handling timeouts
than they do timer events, so that an O(log n) data structure should provide acceptable
performance for timer events.
In short, the Linux kernel’s -rt patchset uses timer wheels for error-handling timeouts
and a tree for timer events, providing each category the required quality of service.
With threaded interrupts, most of each interrupt handler's work is moved to an IRQ kernel thread that runs at a configurable priority. The device interrupt handler then runs for only a short
time, just long enough to make the IRQ thread aware of the new event. As shown in
the figure, threaded interrupts can greatly improve real-time latencies, in part because
interrupt handlers running in the context of the IRQ thread may be preempted by
high-priority real-time threads.
However, there is no such thing as a free lunch, and there are downsides to threaded
interrupts. One downside is increased interrupt latency. Instead of immediately running
the interrupt handler, the handler’s execution is deferred until the IRQ thread gets around
to running it. Of course, this is not a problem unless the device generating the interrupt
is on the real-time application’s critical path.
Another downside is that poorly written high-priority real-time code might starve
the interrupt handler, for example, preventing networking code from running, in turn
making it very difficult to debug the problem. Developers must therefore take great
care when writing high-priority real-time code. This has been dubbed the Spiderman
principle: With great power comes great responsibility.
Priority inheritance is used to handle priority inversion, which can be caused by,
among other things, locks acquired by preemptible interrupt handlers [SRL90b]. Sup-
pose that a low-priority thread holds a lock, but is preempted by a group of medium-
priority threads, at least one such thread per CPU. If an interrupt occurs, a high-priority
IRQ thread will preempt one of the medium-priority threads, but only until it decides to
acquire the lock held by the low-priority thread. Unfortunately, the low-priority thread
cannot release the lock until it starts running, which the medium-priority threads prevent
it from doing. So the high-priority IRQ thread cannot acquire the lock until after one of
the medium-priority threads releases its CPU. In short, the medium-priority threads are
indirectly blocking the high-priority IRQ threads, a classic case of priority inversion.
Note that this priority inversion could not happen with non-threaded interrupts
because the low-priority thread would have to disable interrupts while holding the lock,
which would prevent the medium-priority threads from preempting it.
In the priority-inheritance solution, the high-priority thread attempting to acquire
the lock donates its priority to the low-priority thread holding the lock until such time as
the lock is released, thus preventing long-term priority inversion.
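In userspace, for example, priority inheritance can be requested for a pthread mutex via the standard POSIX PTHREAD_PRIO_INHERIT protocol, along the lines of the following sketch (error handling abbreviated; the lock name is hypothetical):

#include <pthread.h>

pthread_mutex_t my_lock;  /* Hypothetical lock protecting shared state. */

int init_pi_lock(void)
{
	pthread_mutexattr_t attr;

	if (pthread_mutexattr_init(&attr) != 0)
		return -1;
	/* Ask that waiters donate their priority to the mutex holder. */
	if (pthread_mutexattr_setprotocol(&attr, PTHREAD_PRIO_INHERIT) != 0)
		return -1;
	return pthread_mutex_init(&my_lock, &attr) == 0 ? 0 : -1;
}

A high-priority thread that later blocks on my_lock will then lend its priority to whichever lower-priority thread currently holds the lock.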
Of course, priority inheritance does have its limitations. For example, if you can
design your application to avoid priority inversion entirely, you will likely obtain
somewhat better latencies [Yod04b]. This should be no surprise, given that priority
inheritance adds a pair of context switches to the worst-case latency. That said, priority
inheritance can convert indefinite postponement into a limited increase in latency, and
the software-engineering benefits of priority inheritance may outweigh its latency costs
in many applications.
Another limitation is that it addresses only lock-based priority inversions within
the context of a given operating system. One priority-inversion scenario that it cannot
address is a high-priority thread waiting on a network socket for a message that is to
be written by a low-priority process that is preempted by a set of CPU-bound medium-
priority processes. In addition, a potential disadvantage of applying priority inheritance
to user input is fancifully depicted in Figure 15.14.
A final limitation involves reader-writer locking. Suppose that we have a very large
number of low-priority threads, perhaps even thousands of them, each of which read-
holds a particular reader-writer lock. Suppose that all of these threads are preempted
by a set of medium-priority threads, with at least one medium-priority thread per CPU.
Finally, suppose that a high-priority thread awakens and attempts to write-acquire this
same reader-writer lock. No matter how vigorously we boost the priority of the threads
read-holding this lock, it could well be a good long time before the high-priority thread
can complete its write-acquisition.
There are a number of possible solutions to this reader-writer lock priority-inversion
conundrum, one of which is to allow only one reader at a time to read-acquire the lock.
Quick Quiz 15.5: But if you only allow one reader at a time to read-acquire a
reader-writer lock, isn’t that the same as an exclusive lock???
In some cases, reader-writer lock priority inversion can be avoided by converting
the reader-writer lock to RCU, as briefly discussed in the next section.
1 void __rcu_read_lock(void)
2 {
3 current->rcu_read_lock_nesting++;
4 barrier();
5 }
6
7 void __rcu_read_unlock(void)
8 {
9 struct task_struct *t = current;
10
11 if (t->rcu_read_lock_nesting != 1) {
12 --t->rcu_read_lock_nesting;
13 } else {
14 barrier();
15 t->rcu_read_lock_nesting = INT_MIN;
16 barrier();
17 if (ACCESS_ONCE(t->rcu_read_unlock_special.s))
18 rcu_read_unlock_special(t);
19 barrier();
20 t->rcu_read_lock_nesting = 0;
21 }
22 }
Preemptible RCU places tasks that are preempted within their RCU read-side critical sections onto lists, so that grace periods can wait for those preempted
read-side critical sections. A grace period is permitted to end: (1) Once all CPUs have
completed any RCU read-side critical sections that were in effect before the start of the
current grace period and (2) Once all tasks that were preempted while in one of those
pre-existing critical sections have removed themselves from their lists. A simplified
version of this implementation is shown in Figure 15.15. The __rcu_read_lock()
function spans lines 1-5 and the __rcu_read_unlock() function spans lines 7-22.
Line 3 of __rcu_read_lock() increments a per-task count of the number of
nested rcu_read_lock() calls, and line 4 prevents the compiler from reordering
the subsequent code in the RCU read-side critical section to precede the rcu_read_
lock().
Line 11 of __rcu_read_unlock() checks to see if the nesting level count is
one, in other words, if this corresponds to the outermost rcu_read_unlock() of
a nested set. If not, line 12 decrements this count, and control returns to the caller.
Otherwise, this is the outermost rcu_read_unlock(), which requires the end-of-
critical-section handling carried out by lines 14-20.
Line 14 prevents the compiler from reordering the code in the critical section with
the code comprising the rcu_read_unlock(). Line 15 sets the nesting count to
a large negative number in order to prevent destructive races with RCU read-side
critical sections contained within interrupt handlers [McK11a], and line 16 prevents
the compiler from reordering this assignment with line 17’s check for special handling.
If line 17 determines that special handling is required, line 18 invokes rcu_read_
unlock_special() to carry out that special handling.
There are several types of special handling that can be required, but we will focus on
that required when the RCU read-side critical section has been preempted. In this case,
the task must remove itself from the list that it was added to when it was first preempted
within its RCU read-side critical section. However, it is important to note that these
lists are protected by locks, which means that rcu_read_unlock() is no longer
lockless. However, the highest-priority threads will not be preempted, and therefore, for
those highest-priority threads, rcu_read_unlock() will never attempt to acquire
any locks. In addition, if implemented carefully, locking can be used to synchronize
real-time software [Bra11].
Whether or not special handling is required, line 19 prevents the compiler from
reordering the check on line 17 with the zeroing of the nesting count on line 20.
Quick Quiz 15.6: Suppose that preemption occurs just after the load from t->
rcu_read_unlock_special.s on line 17 of Figure 15.15. Mightn’t that result in
the task failing to invoke rcu_read_unlock_special(), thus failing to remove
itself from the list of tasks blocking the current grace period, in turn causing that grace
period to extend indefinitely?
This preemptible RCU implementation enables real-time response for read-mostly
data structures without the delays inherent to priority boosting of large numbers of
readers.
Preemptible spinlocks are an important part of the -rt patchset due to the long-
duration spinlock-based critical sections in the Linux kernel. This functionality has
not yet reached mainline: Although they are a conceptually simple substitution of
sleeplocks for spinlocks, they have proven relatively controversial.6 However, they
are quite necessary to the task of achieving real-time latencies down in the tens of
microseconds.
There are of course any number of other Linux-kernel components that are critically
important to achieving world-class real-time latencies, most recently deadline schedul-
ing; however, those listed in this section give a good feeling for the workings of the
Linux kernel augmented by the -rt patchset.
6 In addition, development of the -rt patchset has slowed in recent years, perhaps because the real-time
functionality that is already in the mainline Linux kernel suffices for a great many use cases [Edg13, Edg14].
However, OSADL (http://osadl.org/) is working to raise funds to move the remaining code from the
-rt patchset to mainline.
A first source of OS jitter is device interrupts. The Documentation/IRQ-affinity.txt
file in the Linux source tree describes how to direct device interrupts to specified CPUs,
which as of early 2015 involves something like the following:
echo 0f > /proc/irq/44/smp_affinity
This command would confine interrupt #44 to CPUs 0-3. Note that scheduling-clock
interrupts require special handling, and are discussed later in this section.
A second source of OS jitter is due to kernel threads and daemons. Individual
kernel threads, such as RCU’s grace-period kthreads (rcu_bh, rcu_preempt, and
rcu_sched), may be forced onto any desired CPUs using the taskset command,
the sched_setaffinity() system call, or cgroups.
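For example, a helper program might confine a given kernel thread (or any other task) to housekeeping CPUs 0 and 1 using sched_setaffinity(), roughly as follows (the choice of CPUs and the helper name are illustrative):

#define _GNU_SOURCE
#include <sched.h>
#include <sys/types.h>

/* Confine the task identified by pid to housekeeping CPUs 0 and 1. */
int confine_to_housekeeping(pid_t pid)
{
	cpu_set_t mask;

	CPU_ZERO(&mask);
	CPU_SET(0, &mask);
	CPU_SET(1, &mask);
	return sched_setaffinity(pid, sizeof(mask), &mask);
}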
Per-CPU kthreads are often more challenging, sometimes constraining hardware
configuration and workload layout. Preventing OS jitter from these kthreads requires
either that certain types of hardware not be attached to real-time systems, that all
interrupts and I/O initiation take place on housekeeping CPUs, that special kernel
Kconfig or boot parameters be selected in order to direct work away from the worker
CPUs, or that worker CPUs never enter the kernel. Specific per-kthread advice may
be found in the Linux kernel source Documentation directory at kernel-per-CPU-kthreads.txt.
A third source of OS jitter in the Linux kernel for CPU-bound threads running
at real-time priority is the scheduler itself. This is an intentional debugging feature,
designed to ensure that important non-realtime work is allotted at least 50 milliseconds
out of each second, even if there is an infinite-loop bug in your real-time application.
However, when you are running a polling-loop-style real-time application, you will
need to disable this debugging feature. This can be done as follows:
echo -1 > /proc/sys/kernel/sched_rt_runtime_us
You will of course need to be running as root to execute this command, and you
will also need to carefully consider the Spiderman principle. One way to minimize the
risks is to offload interrupts and kernel threads/daemons from all CPUs running CPU-
bound real-time threads, as described in the paragraphs above. In addition, you should
carefully read the material in the Documentation/scheduler directory. The
material in the sched-rt-group.txt file is particularly important, especially if
you are using the cgroups real-time features enabled by the CONFIG_RT_GROUP_
SCHED Kconfig parameter, in which case you should also read the material in the
Documentation/cgroups directory.
A fourth source of OS jitter comes from timers. In most cases, keeping a given CPU
out of the kernel will prevent timers from being scheduled on that CPU. One important
exception is recurring timers, where a given timer handler posts a later occurrence of
that same timer. If such a timer gets started on a given CPU for any reason, that timer
will continue to run periodically on that CPU, inflicting OS jitter indefinitely. One crude
but effective way to offload recurring timers is to use CPU hotplug to offline all worker
CPUs that are to run CPU-bound real-time application threads, online these same CPUs,
then start your real-time application.
A fifth source of OS jitter is provided by device drivers that were not intended for
real-time use. For an old canonical example, in 2005, the VGA driver would blank the
screen by zeroing the frame buffer with interrupts disabled, which resulted in tens of
milliseconds of OS jitter. One way of avoiding device-driver-induced OS jitter is to
carefully select devices that have been used heavily in real-time systems, and which
have therefore had their real-time bugs fixed. Another way is to confine the device's
interrupts and all code using that device to designated housekeeping CPUs. A third way
is to test the device's ability to support real-time workloads and fix any real-time bugs.7

Figure 15.16:
1 cd /sys/kernel/debug/tracing
2 echo 1 > max_graph_depth
3 echo function_graph > current_tracer
4 # run workload
5 cat per_cpu/cpuN/trace
A sixth source of OS jitter is provided by some in-kernel full-system synchronization
algorithms, perhaps most notably the global TLB-flush algorithm. This can be avoided
by avoiding memory-unmapping operations, and especially avoiding unmapping op-
erations within the kernel. As of early 2015, the way to avoid in-kernel unmapping
operations is to avoid unloading kernel modules.
A seventh source of OS jitter is provided by scheduling-clock interrupts and RCU
callback invocation. These may be avoided by building your kernel with the NO_HZ_
FULL Kconfig parameter enabled, and then booting with the nohz_full= parameter
specifying the list of worker CPUs that are to run real-time threads. For example,
nohz_full=2-7 would designate CPUs 2, 3, 4, 5, 6, and 7 as worker CPUs, thus
leaving CPUs 0 and 1 as housekeeping CPUs. The worker CPUs would not incur
scheduling-clock interrupts as long as there is no more than one runnable task on each
worker CPU, and each worker CPU’s RCU callbacks would be invoked on one of the
housekeeping CPUs. A CPU that has suppressed scheduling-clock interrupts due to
there only being one runnable task on that CPU is said to be in adaptive ticks mode.
As an alternative to the nohz_full= boot parameter, you can build your kernel
with NO_HZ_FULL_ALL, which will designate CPU 0 as a housekeeping CPU and
all other CPUs as worker CPUs. Either way, it is important to ensure that you have
designated enough housekeeping CPUs to handle the housekeeping load imposed by
the rest of the system, which requires careful benchmarking and tuning.
Of course, there is no free lunch, and NO_HZ_FULL is no exception. As noted
earlier, NO_HZ_FULL makes kernel/user transitions more expensive due to the need for
delta process accounting and the need to inform kernel subsystems (such as RCU) of the
transitions. It also prevents CPUs running processes with POSIX CPU timers enabled
from entering adaptive-ticks mode. Additional limitations, tradeoffs, and configuration
advice may be found in Documentation/timers/NO_HZ.txt.
An eighth source of OS jitter is page faults. Because most Linux implementations
use an MMU for memory protection, real-time applications running on these systems
can be subject to page faults. Use the mlock() and mlockall() system calls to pin
your application’s pages into memory, thus avoiding major page faults. Of course, the
Spiderman principle applies, because locking down too much memory may prevent the
system from getting other work done.
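For example (a minimal sketch using the standard mlockall() interface; error handling abbreviated):

#include <sys/mman.h>
#include <stdio.h>

/* Pin all current and future pages into memory to avoid major page faults. */
int lock_down_memory(void)
{
	if (mlockall(MCL_CURRENT | MCL_FUTURE) != 0) {
		perror("mlockall");
		return -1;
	}
	return 0;
}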
A ninth source of OS jitter is unfortunately the hardware and firmware. It is therefore
important to use systems that have been designed for real-time use. OSADL runs long-
term tests of systems, so referring to their website (http://osadl.org/) can be
helpful.
Unfortunately, this list of OS-jitter sources can never be complete, as it will change
with each new version of the kernel. This makes it necessary to be able to track down
additional sources of OS jitter. Given a CPU N running a CPU-bound usermode thread,
7 If you take this approach, please submit your fixes upstream so that others can benefit. Keep in mind
that when you need to port your application to a later version of the Linux kernel, you will be one of those
“others”.
the commands shown in Figure 15.16 will produce a list of all the times that this CPU
entered the kernel. Of course, the N on line 5 must be replaced with the number of the
CPU in question, and the 1 on line 2 may be increased to show additional levels of
function call within the kernel. The resulting trace can help track down the source of
the OS jitter.
As you can see, obtaining bare-metal performance when running CPU-bound real-
time threads on a general-purpose OS such as Linux requires painstaking attention
to detail. Automation would of course help, and some automation has been applied,
but given the relatively small number of users, automation can be expected to appear
relatively slowly. Nevertheless, the ability to gain near-bare-metal performance while
running a general-purpose operating system promises to ease construction of some types
of real-time systems.
Weaker classes of NBS only guarantee that at least one thread will make progress in finite time.
9 This paper also introduces the notion of bounded minimal progress, a welcome step on the road towards real-time practice.
Recent theoretical work shows that real-time latencies can in fact be provided by locking, given a correctly configured environment, which includes:
3. No fail-stop bugs.
4. FIFO locking primitives with bounded acquisition, handoff, and release latencies.
Again, in the common case of a locking primitive that is FIFO within priorities,
the bounded latencies are provided only to the highest-priority threads.
8. Bounded time spent in any given critical section. Given a bounded number of
threads waiting on any given lock and a bounded critical-section duration, the
wait time will be bounded.
Quick Quiz 15.8: I couldn’t help but spot the word “includes” before this list. Are
there other constraints?
This result opens a vast cornucopia of algorithms and data structures for use in
real-time software—and validates long-standing real-time practice.
Of course, a careful and simple application design is also extremely important. The
best real-time components in the world cannot make up for a poorly thought-out design.
For parallel real-time applications, synchronization overheads clearly must be a key
component of the design.
1 if (clock_gettime(CLOCK_REALTIME, &timestart) != 0) {
2 perror("clock_gettime 1");
3 exit(-1);
4 }
5 if (nanosleep(&timewait, NULL) != 0) {
6 perror("nanosleep");
7 exit(-1);
8 }
9 if (clock_gettime(CLOCK_REALTIME, &timeend) != 0) {
10 perror("clock_gettime 2");
11 exit(-1);
12 }
One way of gaining much of the benefit of running on bare metal while still having
access to the full features and functions of a general-purpose operating system is to
use the Linux kernel’s NO_HZ_FULL capability, described in Section 15.4.1.2. This
support first became available in version 3.10 of the Linux kernel.
Suppose that we wish to verify the kernel's timer
functionality, perhaps using the test program shown in Figure 15.17. Unfortunately, if
we run this program, we can get unacceptable timer jitter, even in a -rt kernel.
One problem is that POSIX CLOCK_REALTIME is, oddly enough, not intended
for real-time use. Instead, it means “realtime” as opposed to the amount of CPU time
consumed by a process or thread. For real-time use, you should instead use CLOCK_
MONOTONIC. However, even with this change, results are still unacceptable.
Another problem is that the thread must be raised to a real-time priority by using the
sched_setscheduler() system call. But even this change is insufficient, because
we can still see page faults. We also need to use the mlockall() system call to pin
the application’s memory, preventing page faults. With all of these changes, results
might finally be acceptable.
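A minimal sketch of these adjustments, using the standard sched_setscheduler(), mlockall(), and clock_nanosleep() interfaces (the priority value is purely illustrative):

#include <sched.h>
#include <sys/mman.h>
#include <time.h>

/* Raise the calling thread to a real-time priority and pin its memory. */
int rt_setup(void)
{
	struct sched_param sp = { .sched_priority = 50 }; /* Illustrative value. */

	if (sched_setscheduler(0, SCHED_FIFO, &sp) != 0)
		return -1;
	if (mlockall(MCL_CURRENT | MCL_FUTURE) != 0)
		return -1;
	return 0;
}

/* Sleep using CLOCK_MONOTONIC rather than CLOCK_REALTIME. */
int rt_sleep(const struct timespec *timewait)
{
	return clock_nanosleep(CLOCK_MONOTONIC, 0, timewait, NULL);
}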
In other situations, further adjustments might be needed. It might be necessary to
pin time-critical threads onto their own CPUs, and it might also be necessary to
direct interrupts away from those CPUs. It might be necessary to carefully select
hardware and drivers, and it will very likely be necessary to carefully select kernel
configuration.
As can be seen from this example, real-time computing can be quite unforgiving.
1 struct calibration {
2 short a;
3 short b;
4 short c;
5 };
6 struct calibration default_cal = { 62, 33, 88 };
7 struct calibration *cur_cal = &default_cal;
8
9 short calc_control(short t, short h, short press)
10 {
11 struct calibration *p;
12
13 p = rcu_dereference(cur_cal);
14 return do_control(t, h, press, p->a, p->b, p->c);
15 }
16
17 bool update_cal(short a, short b, short c)
18 {
19 struct calibration *p;
20 struct calibration *old_p;
21
22 old_p = rcu_dereference(cur_cal);
23 p = malloc(sizeof(*p));
24 if (!p)
25 return false;
26 p->a = a;
27 p->b = b;
28 p->c = c;
29 rcu_assign_pointer(cur_cal, p);
30 if (old_p == &default_cal)
31 return true;
32 synchronize_rcu();
33 free(old_p);
34 return true;
35 }
If the answer to any of these questions is “yes”, you should choose real-fast over
real-time, otherwise, real-time might be for you.
Choose wisely, and if you do choose real-time, make sure that your hardware,
firmware, and operating system are up to the job!
Chapter 16
Ease of Use
1. It is impossible to get wrong. Although this is the standard to which all API
designers should strive, only the mythical dwim() command manages to come close.
2. The compiler or linker won’t let you get it wrong.
3. The compiler or linker will warn you if you get it wrong.
4. The simplest use is the correct one.
5. The name tells you how to use it.
6. Do it right or it will always break at runtime.
7. Follow common convention and you will get it right. The malloc() library
function is a good example. Although it is easy to get memory allocation wrong,
a great many projects do manage to get it right, at least most of the time. Using
malloc() in conjunction with Valgrind [The11] moves malloc() almost up
to the “do it right or it will always break at runtime” point on the scale.
8. Read the documentation and you will get it right.
9. Read the implementation and you will get it right.
10. Read the right mailing-list archive and you will get it right.
11. Read the right mailing-list archive and you will get it wrong.
12. Read the implementation and you will get it wrong. The original non-CONFIG_
PREEMPT implementation of rcu_read_lock() [McK07a] is an infamous
example of this point on the scale.
13. Read the documentation and you will get it wrong. For example, the DEC Alpha
wmb instruction’s documentation [SW95] fooled a number of developers into
thinking that this instruction had much stronger memory-order semantics
than it actually does. Later documentation clarified this point [Com01], moving
the wmb instruction up to the “read the documentation and you will get it right”
point on the scale.
14. Follow common convention and you will get it wrong. The printf() statement
is an example of this point on the scale because developers almost always fail to
check printf()’s error return.
15. Do it right and it will break at runtime.
16. The name tells you how not to use it.
17. The obvious use is wrong. The Linux kernel smp_mb() function is an exam-
ple of this point on the scale. Many developers assume that this function has
much stronger ordering semantics than it possesses. Section 14.2 contains the
information needed to avoid this mistake, as does the Linux-kernel source tree’s
Documentation directory.
18. The compiler or linker will warn you if you get it right.
19. The compiler or linker won’t let you get it right.
20. It is impossible to get right. The gets() function is a famous example of
this point on the scale. In fact, gets() can perhaps best be described as an
unconditional buffer-overflow security hole.
Such shaving may seem counterproductive. After all, if an algorithm works, why
shouldn’t it be used?
To see why at least some shaving is absolutely necessary, consider a locking design
that avoids deadlock, but in perhaps the worst possible way. This design uses a circular
doubly linked list, which contains one element for each thread in the system along with
a header element. When a new thread is spawned, the parent thread must insert a new
element into this list, which requires some sort of synchronization.
One way to protect the list is to use a global lock. However, this might be a bottleneck
if threads were being created and deleted frequently.3 Another approach would be to
use a hash table and to lock the individual hash buckets, but this can perform poorly
when scanning the list in order.
A third approach is to lock the individual list elements, and to require the locks for
both the predecessor and successor to be held during the insertion. Since both locks
must be acquired, we need to decide which order to acquire them in. Two conventional
approaches would be to acquire the locks in address order, or to acquire them in the
order that they appear in the list, so that the header is always acquired first when it is
one of the two elements being locked. However, both of these methods require special
checks and branches.
The to-be-shaven solution is to unconditionally acquire the locks in list order. But
what about deadlock?
Deadlock cannot occur.
2 Due to Josh Triplett.
3 Those of you with strong operating-system backgrounds, please suspend disbelief. If you are unable to
suspend disbelief, send us a better example.
To see this, number the elements in the list starting with zero for the header up to
N for the last element in the list (the one preceding the header, given that the list is
circular). Similarly, number the threads from zero to N − 1. If each thread attempts to
lock some consecutive pair of elements, at least one of the threads is guaranteed to be
able to acquire both locks.
Why?
Because there are not enough threads to reach all the way around the list. Suppose
thread 0 acquires element 0’s lock. To be blocked, some other thread must have already
acquired element 1’s lock, so let us assume that thread 1 has done so. Similarly, for
thread 1 to be blocked, some other thread must have acquired element 2’s lock, and so
on, up through thread N − 1, who acquires element N − 1’s lock. For thread N − 1 to be
blocked, some other thread must have acquired element N’s lock. But there are no more
threads, and so thread N − 1 cannot be blocked. Therefore, deadlock cannot occur.
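For concreteness, here is a minimal sketch of the to-be-shaven insertion protocol, which unconditionally acquires the two element locks in list order. The struct elem layout and the spinlock_t, spin_lock(), and spin_unlock() names are illustrative assumptions, not code from this book's CodeSamples.

  struct elem {
      spinlock_t lock;
      struct elem *next;
      struct elem *prev;
  };

  /* Insert new_elem between pred and pred's successor, unconditionally
   * taking the two element locks in list order. */
  void insert_after(struct elem *pred, struct elem *new_elem)
  {
      struct elem *succ;

      spin_lock(&pred->lock);            /* Predecessor first... */
      succ = pred->next;
      spin_lock(&succ->lock);            /* ...then its successor. */
      new_elem->next = succ;
      new_elem->prev = pred;
      pred->next = new_elem;
      succ->prev = new_elem;
      spin_unlock(&succ->lock);
      spin_unlock(&pred->lock);
  }

Note that there is no special case for the header: even when the successor is the header element, the predecessor's lock is still acquired first, which is exactly the arrangement whose deadlock-freedom was just argued.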
So why should we prohibit use of this delightful little algorithm?
The fact is that if you really want to use it, we cannot stop you. We can, however,
recommend against such code being included in any project that we care about.
But, before you use this algorithm, please think through the following Quick Quiz.
Quick Quiz 16.1: Can a similar algorithm be used when deleting elements?
The fact is that this algorithm is extremely specialized (it only works on certain
sized lists), and also quite fragile. Any bug that accidentally failed to add a node to the
list could result in deadlock. In fact, simply adding the node a bit too late could result in
deadlock.
In addition, the other algorithms described above are “good and sufficient”. For
example, simply acquiring the locks in address order is fairly simple and quick, while
allowing the use of lists of any size. Just be careful of the special cases presented by
empty lists and lists containing only one element!
Quick Quiz 16.2: Yetch! What ever possessed someone to come up with an
algorithm that deserves to be shaved as much as this one does???
In summary, we do not use algorithms simply because they happen to work. We
instead restrict ourselves to algorithms that are useful enough to make it worthwhile
learning about them. The more difficult and complex the algorithm, the more generally
useful it must be in order for the pain of learning it and fixing its bugs to be worthwhile.
Quick Quiz 16.3: Give an exception to this rule.
Exceptions aside, we must continue to shave the software “Mandelbrot set” so that
our programs remain maintainable, as shown in Figure 16.2.
Chapter 17
Conflicting Visions of the Future
This chapter presents some conflicting visions of the future of parallel programming.
It is not clear which of these will come to pass; in fact, it is not clear that any of them
will. They are nevertheless important because each vision has its devoted adherents, and
if enough people believe in something fervently enough, you will need to deal with at
least the shadow of that thing’s existence in the form of its influence on the thoughts,
words, and deeds of its adherents. Besides which, it is entirely possible that one or more
of these visions will actually come to pass. But most are bogus. Tell which is which and
you’ll be rich [Spi77]!
Therefore, the following sections give an overview of transactional memory, hard-
ware transactional memory, and parallel functional programming. But first, a cautionary
tale on prognostication taken from the early 2000s.
Unlikely indeed! But the larger software community was reluctant to accept the
fact that they would need to embrace parallelism, and so it was some time before this
community concluded that the “free lunch” of Moore’s-Law-induced CPU core-clock
frequency increases was well and truly finished. Never forget: belief is an emotion, not
necessarily the result of a rational technical thought process!
And we all know how this story has played out, with multiple multi-threaded cores
on a single die plugged into a single socket. The question then becomes whether or not
future shared-memory systems will always fit into a single socket.
This scenario actually represents a change, since to have more of the same,
interconnect performance must begin keeping up with the Moore’s-Law
increases in core CPU performance. In this scenario, overhead due to
pipeline stalls, memory latency, and contention remains significant, and
RCU retains the high level of applicability that it enjoys today.
And the change has been the ever-increasing levels of integration that Moore’s Law
is still providing. But longer term, which will it be? More CPUs per die? Or more I/O,
cache, and memory?
Servers seem to be choosing the former, while embedded systems on a chip (SoCs)
continue choosing the latter.
[Figure 17.5: Instructions per Local Memory Reference for Sequent Computers, plotted by year (1982-2002) on a log scale from 0.1 to 10,000]
contrast, systems with minor use of RCU will require increasingly high
degrees of read intensity for use of RCU to pay off, as shown in Figure 17.7.
As can be seen in this figure, if RCU is lightly used, increasing memory-
latency ratios put RCU at an increasing disadvantage compared to other
synchronization mechanisms. Since Linux has been observed with over
1,600 callbacks per grace period under heavy load [SM04], it seems safe to
say that Linux falls into the former category.
On the one hand, this passage failed to anticipate the cache-warmth issues that
RCU can suffer from in workloads with significant update intensity, in part because
it seemed unlikely that RCU would really be used for such workloads. In the event,
the SLAB_DESTROY_BY_RCU facility has been pressed into service in a number of instances.
[Figures: breakeven update fraction versus memory-latency ratio (1 to 1000, log scale) for spinlock, drw, and RCU]
Many computer users feel that input and output are not actually part of
“real programming,” they are merely things that (unfortunately) must be
done in order to get information in and out of the machine.
Whether we believe that input and output are “real programming” or not, the fact
is that for most computer systems, interaction with the outside world is a first-class
requirement. This section therefore critiques transactional memory’s ability to so
interact, whether via I/O operations, time delays, or persistent storage.
1. Restrict I/O within transactions to buffered I/O with in-memory buffers. These
buffers may then be included in the transaction in the same way that any other
memory location might be included. This seems to be the mechanism of choice, and it does work well in many common situations, such as stream I/O and mass-storage I/O (see the sketch following this list). However, special handling is required in cases where multiple
record-oriented output streams are merged onto a single file from multiple pro-
cesses, as might be done using the “a+” option to fopen() or the O_APPEND
flag to open(). In addition, as will be seen in the next section, common net-
working operations cannot be handled via buffering.
2. Prohibit I/O within transactions, so that any attempt to execute an I/O operation
aborts the enclosing transaction (and perhaps multiple nested transactions). This
approach seems to be the conventional TM approach for unbuffered I/O, but re-
quires that TM interoperate with other synchronization primitives that do tolerate
I/O.
3. Prohibit I/O within transactions, but enlist the compiler’s aid in enforcing this
prohibition.
4. Permit only one special irrevocable transaction [SMS08] to proceed at any given
time, thus allowing irrevocable transactions to contain I/O operations.1 This works
in general, but severely limits the scalability and performance of I/O operations.
Given that scalability and performance is a first-class goal of parallelism, this
approach’s generality seems a bit self-limiting. Worse yet, use of irrevocability
to tolerate I/O operations seems to prohibit use of manual transaction-abort
operations.2 Finally, if there is an irrevocable transaction manipulating a given
data item, any other transaction manipulating that same data item cannot have
non-blocking semantics.
5. Create new hardware and protocols such that I/O operations can be pulled into
the transactional substrate. In the case of input operations, the hardware would
need to correctly predict the result of the operation, and to abort the transaction if
the prediction failed.
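As a rough illustration of the first option above, the following sketch appends log messages to an ordinary in-memory buffer from within a transaction and performs the actual write() only after the transaction commits. It uses GCC's experimental -fgnu-tm __transaction_atomic extension merely as a stand-in for whatever TM system is at hand; the buffer layout and function names are illustrative.

  #include <stddef.h>
  #include <unistd.h>

  #define LOGBUF_SIZE 4096
  static char logbuf[LOGBUF_SIZE];
  static size_t logbuf_len;

  /* Append a message from within a transaction: the buffer is plain
   * memory, so the append is rolled back if the transaction aborts. */
  void buffered_log(const char *msg, size_t len)
  {
      __transaction_atomic {
          if (logbuf_len + len <= LOGBUF_SIZE) {
              for (size_t i = 0; i < len; i++)
                  logbuf[logbuf_len + i] = msg[i];
              logbuf_len += len;
          }
      }
  }

  /* The actual I/O happens outside of any transaction, after commit. */
  void log_flush(int fd)
  {
      write(fd, logbuf, logbuf_len);
      logbuf_len = 0;
  }

A real system would of course also need to serialize calls to log_flush(), for example with a lock, and to handle the merged-stream cases called out in the first option.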
I/O operations are a well-known weakness of TM, and it is not clear that the
problem of supporting I/O in transactions has a reasonable general solution, at least if
“reasonable” is to include usable performance and scalability. Nevertheless, continued
time and attention to this problem will likely produce additional progress.
The transaction’s memory footprint cannot be determined until after the RPC re-
sponse is received, and until the transaction’s memory footprint can be determined, it is
impossible to determine whether the transaction can be allowed to commit. The only
action consistent with transactional semantics is therefore to unconditionally abort the
transaction, which is, to say the least, unhelpful.
Here are some options available to TM:
1. Prohibit RPC within transactions, so that any attempt to execute an RPC opera-
tion aborts the enclosing transaction (and perhaps multiple nested transactions).
Alternatively, enlist the compiler to enforce RPC-free transactions. This approach
does work, but will require TM to interact with other synchronization primitives.
2. Permit only one special irrevocable transaction [SMS08] to proceed at any given
time, thus allowing irrevocable transactions to contain RPC operations. This
works in general, but severely limits the scalability and performance of RPC oper-
ations. Given that scalability and performance is a first-class goal of parallelism,
this approach’s generality seems a bit self-limiting. Furthermore, use of irrevo-
cable transactions to permit RPC operations rules out manual transaction-abort
operations once the RPC operation has started. Finally, if there is an irrevocable
transaction manipulating a given data item, any other transaction manipulating
that same data item cannot have non-blocking semantics.
3. Identify special cases where the success of the transaction may be determined be-
fore the RPC response is received, and automatically convert these to irrevocable
17.2.1.4 Persistence
There are many different types of locking primitives. One interesting distinction is
persistence, in other words, whether the lock can exist independently of the address
space of the process using the lock.
Non-persistent locks include pthread_mutex_lock(), pthread_rwlock_
rdlock(), and most kernel-level locking primitives. If the memory locations instanti-
ating a non-persistent lock’s data structures disappear, so does the lock. For typical use
of pthread_mutex_lock(), this means that when the process exits, all of its locks
vanish. This property can be exploited in order to trivialize lock cleanup at program
shutdown time, but makes it more difficult for unrelated applications to share locks, as
such sharing requires the applications to share memory.
Persistent locks help avoid the need to share memory among unrelated applications.
Persistent locking APIs include the flock family, lockf(), System V semaphores, or
the O_CREAT flag to open(). These persistent APIs can be used to protect large-scale
operations spanning runs of multiple applications, and, in the case of O_CREAT even
surviving operating-system reboot. If need be, locks can even span multiple computer
systems via distributed lock managers and distributed filesystems—and persist across
reboots of any or all of these computer systems.
Persistent locks can be used by any application, including applications written using
multiple languages and software environments. In fact, a persistent lock might well be
acquired by an application written in C and released by an application written in Python.
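For concreteness, here is a minimal sketch of the O_CREAT-based approach, in which the mere existence of a lock file embodies the lock. The path name is caller-supplied, error handling is simplified, and the function names are illustrative.

  #include <fcntl.h>
  #include <unistd.h>

  /* Returns 0 if the lock was acquired, -1 if some other application
   * (perhaps written in another language) already holds it. */
  int lockfile_acquire(const char *path)
  {
      int fd = open(path, O_CREAT | O_EXCL | O_RDWR, 0644);

      if (fd < 0)
          return -1;  /* Typically errno == EEXIST: already locked. */
      close(fd);      /* The file's existence is the lock. */
      return 0;
  }

  /* Any cooperating process may release the lock by removing the file. */
  void lockfile_release(const char *path)
  {
      unlink(path);
  }

Because the lock lives in the filesystem rather than in any process's address space, it survives process exit and, given persistent storage, even operating-system reboot, which is exactly the property for which transactional memory has no obvious analog.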
How could a similar persistent functionality be provided for TM?
Of course, the fact that it is called transactional memory should give us pause, as
the name itself conflicts with the concept of a persistent transaction. It is nevertheless
worthwhile to consider this possibility as an important test case probing the inherent
limitations of transactional memory.
1 pthread_mutex_lock(...);
2 for (i = 0; i < ncpus; i++)
3 pthread_create(&tid[i], ...);
4 for (i = 0; i < ncpus; i++)
5 pthread_join(tid[i], ...);
6 pthread_mutex_unlock(...);
4. Extend the transaction to cover the parent and all child threads. This approach
raises interesting questions about the nature of conflicting accesses, given that
the parent and children are presumably permitted to conflict with each other,
but not with other threads. It also raises interesting questions as to what should
happen if the parent thread does not wait for its children before committing the
transaction. Even more interesting, what happens if the parent conditionally
executes pthread_join() based on the values of variables participating in
the transaction? The answers to these questions are reasonably straightforward in
the case of locking. The answers for TM are left as an exercise for the reader.
The exec() system call is perhaps the strangest example of an obstacle to universal
TM applicability, as it is not completely clear what approach makes sense, and some
might argue that this is merely a reflection of the perils of interacting with execs in real
life. That said, the two options prohibiting exec() within transactions are perhaps the
most logical of the group.
Similar issues surround the exit() and kill() system calls.
1. Treat the dynamic linking and loading in a manner similar to a page fault, so that
the function is loaded and linked, possibly aborting the transaction in the process.
If the transaction is aborted, the retry will find the function already present, and
the transaction can thus be expected to proceed normally.
Options for part (b), the inability to detect TM-unfriendly operations in a not-yet-loaded function, include the following:
1. Just execute the code: if there are any TM-unfriendly operations in the function,
simply abort the transaction. Unfortunately, this approach makes it impossible for
the compiler to determine whether a given group of transactions may be safely
composed. One way to permit composability regardless is irrevocable transactions,
however, current implementations permit only a single irrevocable transaction to
proceed at any given time, which can severely limit performance and scalability.
Irrevocable transactions also seem to rule out use of manual transaction-abort
operations. Finally, if there is an irrevocable transaction manipulating a given
data item, any other transaction manipulating that same data item cannot have
non-blocking semantics.
3. As above, disallow dynamic linking and loading of functions from within transac-
tions.
I/O operations are of course a known weakness of TM, and dynamic linking and
loading can be thought of as yet another special case of I/O. Nevertheless, the proponents
of TM must either solve this problem, or resign themselves to a world where TM is but
one tool of several in the parallel programmer’s toolbox. (To be fair, a number of TM
proponents have long since resigned themselves to a world containing more than just
TM.)
1. Memory remapping is illegal within a transaction, and will result in all enclosing
transactions being aborted. This does simplify things somewhat, but also requires
that TM interoperate with synchronization primitives that do tolerate remapping
from within their critical sections.
2. Memory remapping is illegal within a transaction, and the compiler is enlisted to
enforce this prohibition.
3. Memory mapping is legal within a transaction, but aborts all other transactions
having variables in the region mapped over.
4. Memory mapping is legal within a transaction, but the mapping operation will
fail if the region being mapped overlaps with the current transaction’s footprint.
5. All memory-mapping operations, whether within or outside a transaction, check
the region being mapped against the memory footprint of all transactions in the
system. If there is overlap, then the memory-mapping operation fails.
17.2.2.5 Debugging
The usual debugging operations such as breakpoints work normally within lock-based
critical sections and from RCU read-side critical sections. However, in initial transactional-
memory hardware implementations [DLMN09] an exception within a transaction will
abort that transaction, which in turn means that breakpoints abort all enclosing transac-
tions.
So how can transactions be debugged?
4 This difference between mapping and unmapping was noted by Josh Triplett.
3. Use only software TM implementations, which are (very roughly speaking) more
tolerant of exceptions than are the simpler of the hardware TM implementations.
Of course, software TM tends to have higher overhead than hardware TM, so this
approach may not be acceptable in all situations.
4. Program more carefully, so as to avoid having bugs in the transactions in the first
place. As soon as you figure out how to do this, please do let everyone know the
secret!
There is some reason to believe that transactional memory will deliver productivity
improvements compared to other synchronization mechanisms, but it does seem quite
possible that these improvements could easily be lost if traditional debugging techniques
cannot be applied to transactions. This seems especially true if transactional memory is
to be used by novices on large transactions. In contrast, macho “top-gun” programmers
might be able to dispense with such debugging aids, especially for small transactions.
Therefore, if transactional memory is to deliver on its productivity promises to
novice programmers, the debugging problem does need to be solved.
17.2.3 Synchronization
If transactional memory someday proves that it can be everything to everyone, it will
not need to interact with any other synchronization mechanism. Until then, it will need
to work with synchronization mechanisms that can do what it cannot, or that work more
naturally in a given situation. The following sections outline the current challenges in
this area.
17.2.3.1 Locking
It is commonplace to acquire locks while holding other locks, which works quite well,
at least as long as the usual well-known software-engineering techniques are employed
to avoid deadlock. It is not unusual to acquire locks from within RCU read-side critical
sections, which eases deadlock concerns because RCU read-side primitives cannot
participate in lock-based deadlock cycles. But what happens when you attempt to
acquire a lock from within a transaction?
In theory, the answer is trivial: simply manipulate the data structure representing
the lock as part of the transaction, and everything works out perfectly. In practice, a
number of non-obvious complications [VGS08] can arise, depending on implementation
details of the TM system. These complications can be resolved, but at the cost of a 45%
increase in overhead for locks acquired outside of transactions and a 300% increase in
overhead for locks acquired within transactions. Although these overheads might be
acceptable for transactional programs containing small amounts of locking, they are
often completely unacceptable for production-quality lock-based programs wishing to
use the occasional transaction.
The fact that there could possibly be a problem interfacing TM and locking came as a
surprise to many, which underscores the need to try out new mechanisms and primitives
in real-world production software. Fortunately, the advent of open source means that a
huge quantity of such software is now freely available to everyone, including researchers.
high, (2) the memory overhead of per-CPU/thread locking can be prohibitive, and
(3) this transformation is available only when you have access to the source code
in question. Other more-recent scalable reader-writer locks [LLO09] might avoid
some or all of these problems.
2. Use TM only “in the small” when introducing TM to lock-based programs,
thereby avoiding read-acquiring reader-writer locks from within transactions.
3. Set aside locking-based legacy systems entirely, re-implementing everything
in terms of transactions. This approach has no shortage of advocates, but this
requires that all the issues described in this series be resolved. During the time
it takes to resolve these issues, competing synchronization mechanisms will of
course also have the opportunity to improve.
4. Use TM strictly as an optimization in lock-based systems, as was done by the
TxLinux [RHP+ 07] group. This approach seems sound, but leaves the locking
design constraints (such as the need to avoid deadlock) firmly in place. Further-
more, this approach can result in unnecessary transaction rollbacks when multiple
transactions attempt to read-acquire the same lock.
17.2.3.3 RCU
Because read-copy update (RCU) finds its main use in the Linux kernel, one might
be forgiven for assuming that there had been no academic work on combining RCU
and TM.5 However, the TxLinux group from the University of Texas at Austin had no
choice [RHP+ 07]. The fact that they applied TM to the Linux 2.6 kernel, which uses
RCU, forced them to integrate TM and RCU, with TM taking the place of locking for
RCU updates. Unfortunately, although the paper does state that the RCU implementa-
tion’s locks (e.g., rcu_ctrlblk.lock) were converted to transactions, it is silent
about what happened to locks used in RCU-based updates (e.g., dcache_lock).
It is important to note that RCU permits readers and updaters to run concurrently,
further permitting RCU readers to access data that is in the act of being updated. Of
course, this property of RCU, whatever its performance, scalability, and real-time-
response benefits might be, flies in the face of the underlying atomicity properties of
TM.
So how should TM-based updates interact with concurrent RCU readers? Some
possibilities are as follows:
1. RCU readers abort concurrent conflicting TM updates. This is in fact the approach
taken by the TxLinux project. This approach does preserve RCU semantics, and
also preserves RCU’s read-side performance, scalability, and real-time-response
properties, but it does have the unfortunate side-effect of unnecessarily aborting
conflicting updates. In the worst case, a long sequence of RCU readers could
potentially starve all updaters, which could in theory result in system hangs.
In addition, not all TM implementations offer the strong atomicity required to
implement this approach.
5 However, the in-kernel excuse is wearing thin with the advent of user-space RCU [Des09, DMS+ 12].
2. RCU readers that run concurrently with conflicting TM updates get old (pre-
transaction) values from any conflicting RCU loads. This preserves RCU se-
mantics and performance, and also prevents RCU-update starvation. However,
not all TM implementations can provide timely access to old values of vari-
ables that have been tentatively updated by an in-flight transaction. In particular,
log-based TM implementations that maintain old values in the log (thus mak-
ing for excellent TM commit performance) are not likely to be happy with this
approach. Perhaps the rcu_dereference() primitive can be leveraged to
permit RCU to access the old values within a greater range of TM implementa-
tions, though performance might still be an issue. Nevertheless, there are popular
TM implementations that can be easily and efficiently integrated with RCU in
this manner [PW07, HW11, HW13].
6. Prohibit use of TM in RCU updates. This is guaranteed to work, but seems a bit
restrictive.
It seems likely that additional approaches will be uncovered, especially given the
advent of user-level RCU implementations.6
6 Kudos to the TxLinux group, Maged Michael, and Josh Triplett for coming up with a number of the
above alternatives.
common example being statistical counters. The same thing is possible within RCU
read-side critical sections, and is in fact the common case.
Given mechanisms such as the so-called “dirty reads” that are prevalent in production
database systems, it is not surprising that extra-transactional accesses have received
serious attention from the proponents of TM, with the concepts of weak and strong
atomicity [BLM06] being but one case in point.
Here are some extra-transactional options available to TM:
4. Produce hardware extensions that permit some operations (for example, addition)
to be carried out concurrently on a single variable by multiple transactions.
17.2.4 Discussion
The obstacles to universal TM adoption lead to the following conclusions:
1. One interesting property of TM is the fact that transactions are subject to rollback
and retry. This property underlies TM’s difficulties with irreversible operations,
including unbuffered I/O, RPCs, memory-mapping operations, time delays, and
the exec() system call. This property also has the unfortunate consequence
of introducing all the complexities inherent in the possibility of failure into
synchronization primitives, often in a developer-visible manner.
primitives, including locking and RCU, maintain a clear separation between the
synchronization primitives and the data that they protect.
3. One of the stated goals of many workers in the TM area is to ease parallelization
of large sequential programs. As such, individual transactions are commonly
expected to execute serially, which might do much to explain TM’s issues with
multithreaded transactions.
earlier papers [MMW07, MMTW10], but focused on HTM rather than TM as a whole.7
Section 17.3.3 then describes HTM’s weaknesses with respect to the combination of
synchronization primitives used in the Linux kernel (and in some user-space applica-
tions). Section 17.3.4 looks at where HTM might best fit into the parallel programmer’s
toolbox, and Section 17.3.5 lists some events that might greatly increase HTM’s scope
and appeal. Finally, Section 17.3.6 presents concluding remarks.
7 And I gratefully acknowledge many stimulating discussions with the other authors, Maged Michael,
execution of the same instance of that synchronization primitive on some other CPU
will result in a cache miss. These communications cache misses severely degrade both
the performance and scalability of conventional synchronization mechanisms [ABD+ 97,
Section 4.2.3].
In contrast, HTM synchronizes by using the CPU’s cache, avoiding the need for a
synchronization data structure and resultant cache misses. HTM’s advantage is greatest
in cases where a lock data structure is placed in a separate cache line, in which case,
converting a given critical section to an HTM transaction can reduce that critical section’s
overhead by a full cache miss. These savings can be quite significant for the common
case of short critical sections, at least for those situations where the elided lock does not
share a cache line with an oft-written variable protected by that lock.
Quick Quiz 17.2: Why would it matter that oft-written variables shared the cache
line with the lock variable?
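To make the elided-lock idea concrete, here is a heavily simplified sketch using the Intel TSX RTM intrinsics from <immintrin.h> (compile with -mrtm). It illustrates the general technique only: it is not the glibc or Linux-kernel elision implementation, and it omits retries, adaptivity, and a number of memory-ordering and nesting subtleties.

  #include <immintrin.h>
  #include <pthread.h>

  static pthread_mutex_t elided_mutex = PTHREAD_MUTEX_INITIALIZER;
  static int mutex_held;  /* Written only while holding elided_mutex. */

  void elided_lock(void)
  {
      if (_xbegin() == _XBEGIN_STARTED) {
          if (!mutex_held)
              return;      /* Run the critical section as a transaction. */
          _xabort(0xff);   /* Lock really held: abort and fall back. */
      }
      pthread_mutex_lock(&elided_mutex);  /* Fallback path. */
      mutex_held = 1;
  }

  void elided_unlock(void)
  {
      if (mutex_held) {
          mutex_held = 0;
          pthread_mutex_unlock(&elided_mutex);
      } else {
          _xend();         /* Commit the elided critical section. */
      }
  }

Reading mutex_held inside the transaction adds it to the transaction's read set, so a thread that acquires the real lock conflicts with, and therefore aborts, any concurrent elided critical sections, which is what preserves mutual exclusion.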
proven preferable [Mil06], as will be discussed in Section 17.3.3. Given its avoidance
of synchronization cache misses, HTM is therefore a very real possibility for large
non-partitionable data structures, at least assuming relatively small updates.
Quick Quiz 17.3: Why are relatively small updates important to HTM performance
and scalability?
1. Lock elision for in-memory data access and update [MT01, RG02].
However, HTM also has some very real shortcomings, which will be discussed in
the next section.
1. Transaction-size limitations.
2. Conflict handling.
3. Aborts and rollbacks.
4. Lack of forward-progress guarantees.
5. Irrevocable operations.
6. Semantic differences.
Of course, modern CPUs tend to have large caches, and the data required for many
transactions would fit easily in a one-megabyte cache. Unfortunately, with caches, sheer
size is not all that matters. The problem is that most caches can be thought of as hash
tables implemented in hardware. However, hardware caches do not chain their buckets
(which are normally called sets), but rather provide a fixed number of cachelines per set.
The number of elements provided for each set in a given cache is termed that cache’s
associativity.
Although cache associativity varies, the eight-way associativity of the level-0 cache
on the laptop I am typing this on is not unusual. What this means is that if a given
transaction needed to touch nine cache lines, and if all nine cache lines mapped to
the same set, then that transaction cannot possibly complete, never mind how many
megabytes of additional space might be available in that cache. Yes, given randomly
selected data elements in a given data structure, the probability of that transaction being
able to commit is quite high, but there can be no guarantee.
There has been some research work to alleviate this limitation. Fully associative vic-
tim caches would alleviate the associativity constraints, but there are currently stringent
performance and energy-efficiency constraints on the sizes of victim caches. That said,
HTM victim caches for unmodified cache lines can be quite small, as they need to retain
only the address: The data itself can be written to memory or shadowed by other caches,
while the address itself is sufficient to detect a conflicting write [RD12].
Unbounded transactional memory (UTM) schemes [AAKL06, MBM+ 06] use
DRAM as an extremely large victim cache, but integrating such schemes into a
production-quality cache-coherence mechanism is still an unsolved problem. In addition,
use of DRAM as a victim cache may have unfortunate performance and energy-efficiency
consequences, particularly if the victim cache is to be fully associative. Finally, the
“unbounded” aspect of UTM assumes that all of DRAM could be used as a victim cache,
while in reality the large but still fixed amount of DRAM assigned to a given CPU would
limit the size of that CPU’s transactions. Other schemes use a combination of hardware
and software transactional memory [KCH+ 06] and one could imagine using STM as a
fallback mechanism for HTM.
However, to the best of my knowledge, currently available systems do not implement
any of these research ideas, and perhaps for good reason.
Transaction A      Transaction B
x = 1;             y = 2;
y = 3;             x = 4;
Suppose that each transaction executes concurrently on its own processor. If trans-
action A stores to x at the same time that transaction B stores to y, neither transaction
can progress. To see this, suppose that transaction A executes its store to y. Then trans-
action A will be interleaved within transaction B, in violation of the requirement that
transactions execute atomically with respect to each other. Allowing transaction B to
execute its store to x similarly violates the atomic-execution requirement. This situation
is termed a conflict, which happens whenever two concurrent transactions access the
same variable where at least one of the accesses is a store. The system is therefore
obligated to abort one or both of the transactions in order to allow execution to progress.
The choice of exactly which transaction to abort is an interesting topic that will very
likely retain the ability to generate Ph.D. dissertations for some time to come; see, for
example [ATC+ 11].9 For the purposes of this section, we can assume that the system
makes a random choice.
Another complication is conflict detection, which is comparatively straightforward,
at least in the simplest case. When a processor is executing a transaction, it marks
every cache line touched by that transaction. If the processor’s cache receives a request
involving a cache line that has been marked as touched by the current transaction,
a potential conflict has occurred. More sophisticated systems might try to order the
current processor’s transaction to precede that of the processor sending the request,
and optimization of this process will likely also retain the ability to generate Ph.D.
dissertations for quite some time. However, this section assumes a very simple conflict-
detection strategy.
However, for HTM to work effectively, the probability of conflict must be suitably
low, which in turn requires that the data structures be organized so as to maintain
a sufficiently low probability of conflict. For example, a red-black tree with simple
insertion, deletion, and search operations fits this description, but a red-black tree that
maintains an accurate count of the number of elements in the tree does not.10 For another
example, a red-black tree that enumerates all elements in the tree in a single transaction
will have high conflict probabilities, degrading performance and scalability. As a result,
many serial programs will require some restructuring before HTM can work effectively.
In some cases, practitioners will prefer to take the extra steps (in the red-black-tree case,
perhaps switching to a partitionable data structure such as a radix tree or a hash table),
and just use locking, particularly during the time before HTM is readily available on all
relevant architectures [Cli09].
Quick Quiz 17.4: How could a red-black tree possibly efficiently enumerate all
elements of the tree regardless of choice of synchronization mechanism???
Furthermore, the fact that conflicts can occur brings failure handling into the picture,
as discussed in the next section.
1 void boostee(void)
2 {
3 int i = 0;
4
5 acquire_lock(&boost_lock[i]);
6 for (;;) {
7 acquire_lock(&boost_lock[!i]);
8 release_lock(&boost_lock[i]);
9 i = i ^ 1;
10 do_something();
11 }
12 }
13
14 void booster(void)
15 {
16 int i = 0;
17
18 for (;;) {
19 usleep(1000); /* sleep 1 ms. */
20 acquire_lock(&boost_lock[i]);
21 release_lock(&boost_lock[i]);
22 i = i ^ 1;
23 }
24 }
One important semantic difference between locking and transactions is the priority
boosting that is used to avoid priority inversion in lock-based real-time programs. One
way in which priority inversion can occur is when a low-priority thread holding a lock
is preempted by a medium-priority CPU-bound thread. If there is at least one such
medium-priority thread per CPU, the low-priority thread will never get a chance to run.
If a high-priority thread now attempts to acquire the lock, it will block. It cannot acquire
the lock until the low-priority thread releases it, the low-priority thread cannot release
the lock until it gets a chance to run, and it cannot get a chance to run until one of the
medium-priority threads gives up its CPU. Therefore, the medium-priority threads are
in effect blocking the high-priority process, which is the rationale for the name “priority
inversion.”
One way to avoid priority inversion is priority inheritance, in which a high-priority
thread blocked on a lock temporarily donates its priority to the lock’s holder, which is
also called priority boosting. However, priority boosting can be used for things other
than avoiding priority inversion, as shown in Figure 17.12. Lines 1-12 of this figure
show a low-priority process that must nevertheless run every millisecond or so, while
lines 14-24 of this same figure show a high-priority process that uses priority boosting
to ensure that boostee() runs periodically as needed.
The boostee() function arranges this by always holding one of the two boost_
lock[] locks, so that lines 20-21 of booster() can boost priority as needed.
Quick Quiz 17.9: But the boostee() function in Figure 17.12 alternatively
acquires its locks in reverse order! Won’t this result in deadlock?
This arrangement requires that boostee() acquire its first lock on line 5 before
the system becomes busy, but this is easily arranged, even on modern hardware.
Unfortunately, this arrangement can break down in presence of transactional lock
elision. The boostee() function’s overlapping critical sections become one infinite
transaction, which will sooner or later abort, for example, on the first time that the thread
running the boostee() function is preempted. At this point, boostee() will fall
back to locking, but given its low priority and that the quiet initialization period is now
complete (which after all is why boostee() was preempted), this thread might never get another chance to run.
17.3.2.7 Summary
Although it seems likely that HTM will have compelling use cases, current imple-
mentations have serious transaction-size limitations, conflict-handling complications,
abort-and-rollback issues, and semantic differences that will require careful handling.
HTM’s current situation relative to locking is summarized in Table 17.1. As can be
seen, although the current state of HTM alleviates some serious shortcomings of lock-
ing,14 it does so by introducing a significant number of shortcomings of its own. These
shortcomings are acknowledged by leaders in the TM community [MS12].15
In addition, this is not the whole story. Locking is not normally used by itself, but is
instead typically augmented by other synchronization mechanisms, including reference
counting, atomic operations, non-blocking data structures, hazard pointers [Mic04,
HLM02], and read-copy update (RCU) [MS98a, MAK+ 01, HMBW07, McK12a]. The
next section looks at how such augmentation changes the equation.
used engineering solutions, including deadlock detectors [Cor06a], a wealth of data structures that have been
adapted to locking, and a long history of augmentation, as discussed in Section 17.3.3. In addition, if locking
really were as horrible as a quick skim of many academic papers might reasonably lead one to believe, where
did all the large lock-based parallel programs (both FOSS and proprietary) come from, anyway?
15 In addition, in early 2011, I was invited to deliver a critique of some of the assumptions underlying
transactional memory [McK11d]. The audience was surprisingly non-hostile, though perhaps they were taking
it easy on me due to the fact that I was heavily jet-lagged while giving the presentation.
[Table 17.1: Comparison of Locking and HTM (“+” is Advantage, “−” is Disadvantage, “⇓” is Strong Disadvantage)]
[Table 17.2: Comparison of Locking (Augmented by RCU or Hazard Pointers) and HTM (“+” is Advantage, “−” is Disadvantage, “⇓” is Strong Disadvantage)]
2. Read-side mechanisms such as hazard pointers and RCU can operate efficiently
on non-partitionable data.
3. Hazard pointers and RCU do not contend with each other or with updaters,
allowing excellent performance and scalability for read-mostly workloads.
4. Hazard pointers and RCU provide forward-progress guarantees (lock freedom
and wait-freedom, respectively).
5. Privatization operations for hazard pointers and RCU are straightforward.
16 It is quite ironic that strictly transactional mechanisms are appearing in shared-memory systems at
just about the time that NoSQL databases are relaxing the traditional database-application reliance on strict
transactions.
1. Forward-progress guarantees.
2. Transaction-size increases.
3. Improved debugging support.
4. Weak atomicity.
the extra-transactional accesses for all of these problems, but the folly of this line of
thinking is easily demonstrated by placing each of the extra-transactional accesses into
its own single-access transaction. It is the pattern of accesses that is the issue, not
whether or not they happen to be enclosed in a transaction.
Finally, any forward-progress guarantees for transactions also depend on the sched-
uler, which must let the thread executing the transaction run long enough to successfully
commit.
So there are significant obstacles to HTM vendors offering forward-progress guaran-
tees. However, the impact of any of them doing so would be enormous. It would mean
that HTM transactions would no longer need software fallbacks, which would mean that
HTM could finally deliver on the TM promise of deadlock elimination.
And as of late 2012, the IBM Mainframe announced an HTM implementation that
includes constrained transactions in addition to the usual best-effort HTM implementa-
tion [JSG12]. A constrained transaction starts with the tbeginc instruction instead of
the tbegin instruction that is used for best-effort transactions. Constrained transac-
tions are guaranteed to always complete (eventually), so if a transaction aborts, rather
than branching to a fallback path (as is done for best-effort transactions), the hardware
instead restarts the transaction at the tbeginc instruction.
The Mainframe architects needed to take extreme measures to deliver on this forward-
progress guarantee. If a given constrained transaction repeatedly fails, the CPU might
disable branch prediction, force in-order execution, and even disable pipelining. If
the repeated failures are due to high contention, the CPU might disable speculative
fetches, introduce random delays, and even serialize execution of the conflicting CPUs.
“Interesting” forward-progress scenarios involve as few as two CPUs or as many as one
hundred CPUs. Perhaps these extreme measures provide some insight as to why other
CPUs have thus far refrained from offering constrained transactions.
As the name implies, constrained transactions are in fact severely constrained:
1. The maximum data footprint is four blocks of memory, where each block can be
no larger than 32 bytes.
2. The maximum code footprint is 256 bytes.
3. If a given 4K page contains a constrained transaction’s code, then that page may
not contain that transaction’s data.
4. The maximum number of assembly instructions that may be executed is 32.
5. Backwards branches are forbidden.
Therefore, increasing the size of the guarantee also increases the usefulness of HTM,
thereby increasing the need for CPUs to either provide it or provide good-and-sufficient
workarounds.
Another inhibitor to transaction size is the need to debug the transactions. The problem
with current mechanisms is that a single-step exception aborts the enclosing transaction.
There are a number of workarounds for this issue, including emulating the processor
(slow!), substituting STM for HTM (slow and slightly different semantics!), playback
techniques using repeated retries to emulate forward progress (strange failure modes!),
and full support of debugging HTM transactions (complex!).
Should one of the HTM vendors produce an HTM system that allows straightforward
use of classical debugging techniques within transactions, including breakpoints, single
stepping, and print statements, this will make HTM much more compelling. Some
transactional-memory researchers are starting to recognize this problem as of 2013,
with at least one proposal involving hardware-assisted debugging facilities [GKP13].
Of course, this proposal depends on readily available hardware gaining such facilities.
Given that HTM is likely to face some sort of size limitations for the foreseeable future,
it will be necessary for HTM to interoperate smoothly with other mechanisms. HTM’s
interoperability with read-mostly mechanisms such as hazard pointers and RCU would
be improved if extra-transactional reads did not unconditionally abort transactions with
conflicting writes—instead, the read could simply be provided with the pre-transaction
value. In this way, hazard pointers and RCU could be used to allow HTM to handle
larger data structures and to reduce conflict probabilities.
This is not necessarily simple, however. The most straightforward way of imple-
menting this requires an additional state in each cache line and on the bus, which is a
non-trivial added expense. The benefit that goes along with this expense is permitting
large-footprint readers without the risk of starving updaters due to continual conflicts.
17.3.6 Conclusions
Although current HTM implementations appear to be poised to deliver real benefits,
they also have significant shortcomings. The most significant shortcomings appear
to be limited transaction sizes, the need for conflict handling, the need for aborts and
rollbacks, the lack of forward-progress guarantees, the inability to handle irrevocable
operations, and subtle semantic differences from locking.
Some of these shortcomings might be alleviated in future implementations, but it ap-
pears that there will continue to be a strong need to make HTM work well with the many
other types of synchronization mechanisms, as noted earlier [MMW07, MMTW10].
In short, current HTM implementations appear to be welcome and useful additions
to the parallel programmer’s toolbox, and much interesting and challenging work is
required to make use of them. However, they cannot be considered to be a magic wand
with which to wave away all parallel-programming problems.
1. Procedural languages often make heavy use of global variables, which can be
updated independently by different functions, or, worse yet, by multiple threads.
Note that Haskell’s monads were invented to deal with single-threaded global
state, and that multi-threaded access to global state requires additional violence
to the functional model.
Appendix A
Important Questions
The following sections discuss some important questions relating to SMP programming.
Each section also shows how to avoid having to worry about the corresponding question,
which can be extremely important if your goal is to simply get your SMP code working
as quickly and painlessly as possible—which is an excellent goal, by the way!
Although the answers to these questions are often quite a bit less intuitive than they
would be in a single-threaded setting, with a bit of work, they are not that difficult to
understand. If you managed to master recursion, there is nothing in here that should
pose an overwhelming challenge.
2. Consumer is preempted.
5. Consumer starts running again, and picks up the producer’s timestamp (Figure A.2,
line 14).
In this scenario, the producer’s timestamp might be an arbitrary amount of time after
the consumer’s timestamp.
How do you avoid agonizing over the meaning of “after” in your SMP code?
Simply use SMP primitives as designed.
In this example, the easiest fix is to use locking, for example, acquire a lock in the
producer before line 10 in Figure A.1 and in the consumer before line 13 in Figure A.2.
This lock must also be released after line 13 in Figure A.1 and after line 17 in Figure A.2.
These locks cause the code segments in lines 10-13 of Figure A.1 and in lines 13-17 of
Figure A.2 to exclude each other, in other words, to run atomically with respect to each
other. This is represented in Figure A.3: the locking prevents any of the boxes of code
from overlapping in time, so that the consumer’s timestamp must be collected after the
prior producer’s timestamp. The segments of code in each box in this figure are termed
“critical sections”; only one such critical section may be executing at a given time.
This addition of locking results in output as shown in Table A.2. Here there are no
instances of time going backwards; instead, there are only cases with more than 1,000
counts difference between consecutive reads by the consumer.
Quick Quiz A.2: How could there be such a large gap between successive consumer
reads? See timelocked.c for full code.
[Figure A.3: the producer’s updates (ss.t, ss.a, ss.b, ss.c) and the consumer’s snapshot (curssc) run in critical sections that cannot overlap in time]
In summary, if you acquire an exclusive lock, you know that anything you do while holding that lock will appear to happen after anything done by any prior holder of that lock. No need to worry about which CPU did or did not execute a memory barrier, no need to worry about the CPU or compiler reordering operations—life is simple. Of course, the fact that this locking prevents these two pieces of code from
running concurrently might limit the program’s ability to gain increased performance on
multiprocessors, possibly resulting in a “safe but slow” situation. Chapter 6 describes
ways of gaining performance and scalability in many situations.
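For reference, here is a minimal sketch of the locking approach just described, with the producer's update and the consumer's snapshot each running under the same pthread mutex. The structure and field names merely echo the figures and are not a verbatim copy of timelocked.c.

  #include <pthread.h>
  #include <sys/time.h>

  struct snap {
      struct timeval t;
      int a, b, c;
  };

  static struct snap ss;
  static pthread_mutex_t snap_lock = PTHREAD_MUTEX_INITIALIZER;

  void producer_update(void)
  {
      pthread_mutex_lock(&snap_lock);
      gettimeofday(&ss.t, NULL);
      ss.a = ss.c + 1;
      ss.b = ss.a + 1;
      ss.c = ss.b + 1;
      pthread_mutex_unlock(&snap_lock);
  }

  void consumer_snapshot(struct snap *cur, struct timeval *tc)
  {
      pthread_mutex_lock(&snap_lock);
      gettimeofday(tc, NULL);  /* Consumer's own timestamp. */
      *cur = ss;               /* Consistent copy of the producer's state. */
      pthread_mutex_unlock(&snap_lock);
  }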
However, in most cases, if you find yourself worrying about what happens before
or after a given piece of code, you should take this as a hint to make better use of the
standard primitives. Let these primitives do the worrying for you.
1 Yes, this does mean that parallel-computing programs are best-suited for sequential execution. Why did
you ask?
Worse yet, the thread reading the time might be interrupted or preempted. Further-
more, there will likely be some computation between reading out the time and the actual
use of the time that has been read out. Both of these possibilities further extend the
interval of uncertainty.
One approach is to read the time twice, and take the arithmetic mean of the two
readings, perhaps one on each side of the operation being timestamped. The difference
between the two readings is then a measure of uncertainty of the time at which the
intervening operation occurred.
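Here is a minimal sketch of this double-read technique, using gettimeofday() for consistency with the earlier examples; the helper names are illustrative.

  #include <sys/time.h>

  static double now(void)
  {
      struct timeval tv;

      gettimeofday(&tv, NULL);
      return tv.tv_sec + tv.tv_usec / 1e6;
  }

  /* Timestamp op() as the mean of readings taken on either side of it;
   * the spread between the two readings bounds the uncertainty. */
  void timestamped(void (*op)(void), double *when, double *uncertainty)
  {
      double t1 = now();

      op();
      *uncertainty = now() - t1;
      *when = t1 + *uncertainty / 2.0;
  }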
Of course, in many cases, the exact time is not necessary. For example, when
printing the time for the benefit of a human user, we can rely on slow human reflexes to
render internal hardware and software delays irrelevant. Similarly, if a server needs to
timestamp the response to a client, any time between the reception of the request and
the transmission of the response will do equally well.
Appendix B
Why Memory Barriers?
So what possessed CPU designers to cause them to inflict memory barriers on poor
unsuspecting SMP software designers?
In short, because reordering memory references allows much better performance,
and so memory barriers are needed to force ordering in things like synchronization
primitives whose correct operation depends on ordered memory references.
Getting a more detailed answer to this question requires a good understanding of
how CPU caches work, and especially what is required to make caches really work well.
The following sections:
1. present the structure of a cache,
2. describe how cache-coherency protocols ensure that CPUs agree on the value of
each location in memory, and, finally,
3. outline how store buffers and invalidate queues help caches and cache-coherency
protocols achieve high performance.
We will see that memory barriers are a necessary evil that is required to enable good
performance and scalability, an evil that stems from the fact that CPUs are orders of
magnitude faster than are both the interconnects between them and the memory they are
attempting to access.
1 It is standard practice to use multiple levels of cache, with a small level-one cache close to the CPU
with single-cycle access time, and a larger level-two cache with a longer access time, perhaps roughly ten
clock cycles. Higher-performance CPUs often have three or even four levels of cache.
[Figure: CPU 0 and CPU 1, each with its own cache, connected by an interconnect to memory]
When a given data item is first accessed by a given CPU, it will be absent from that
CPU’s cache, meaning that a “cache miss” (or, more specifically, a “startup” or “warmup”
cache miss) has occurred. The cache miss means that the CPU will have to wait (or be
“stalled”) for hundreds of cycles while the item is fetched from memory. However, the
item will be loaded into that CPU’s cache, so that subsequent accesses will find it in the
cache and therefore run at full speed.
After some time, the CPU’s cache will fill, and subsequent misses will likely need
to eject an item from the cache in order to make room for the newly fetched item. Such
a cache miss is termed a “capacity miss”, because it is caused by the cache’s limited
capacity. However, most caches can be forced to eject an old item to make room for a
new item even when they are not yet full. This is due to the fact that large caches are
implemented as hardware hash tables with fixed-size hash buckets (or “sets”, as CPU
designers call them) and no chaining, as shown in Figure B.2.
This cache has sixteen “sets” and two “ways” for a total of 32 “lines”, each entry
containing a single 256-byte “cache line”, which is a 256-byte-aligned block of memory.
This cache line size is a little on the large side, but it makes the hexadecimal arithmetic
much simpler. In hardware parlance, this is a two-way set-associative cache, and is
analogous to a software hash table with sixteen buckets, where each bucket’s hash
chain is limited to at most two elements. The size (32 cache lines in this case) and the
associativity (two in this case) are collectively called the cache’s “geometry”. Since this
cache is implemented in hardware, the hash function is extremely simple: extract four
bits from the memory address.
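A tiny sketch of this hash function for the example cache: with 256-byte lines, the low eight bits of the address are the offset within the line, and the next four bits select one of the sixteen sets. The function name is illustrative.

  #include <stdint.h>

  static inline unsigned int cache_set(uintptr_t addr)
  {
      return (addr >> 8) & 0xF;  /* Bits 8-11 select one of 16 sets. */
  }

  /* For example, cache_set(0x12345E00) == 0xE and
   * cache_set(0x43210E00) == 0xE, so those two lines share a set,
   * while cache_set(0x12345F00) == 0xF. */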
In Figure B.2, each box corresponds to a cache entry, which can contain a 256-byte
cache line. However, a cache entry can be empty, as indicated by the empty boxes in the
figure. The rest of the boxes are flagged with the memory address of the cache line that
they contain. Since the cache lines must be 256-byte aligned, the low eight bits of each
address are zero, and the choice of hardware hash function means that the next-higher
four bits match the hash line number.
The situation depicted in the figure might arise if the program’s code were located at
address 0x43210E00 through 0x43210EFF, and this program accessed data sequentially
from 0x12345000 through 0x12345EFF. Suppose that the program were now to access
location 0x12345F00. This location hashes to line 0xF, and both ways of this line are
empty, so the corresponding 256-byte line can be accommodated. If the program were
to access location 0x1233000, which hashes to line 0x0, the corresponding 256-byte
Way 0 Way 1
0x0 0x12345000
0x1 0x12345100
0x2 0x12345200
0x3 0x12345300
0x4 0x12345400
0x5 0x12345500
0x6 0x12345600
0x7 0x12345700
0x8 0x12345800
0x9 0x12345900
0xA 0x12345A00
0xB 0x12345B00
0xC 0x12345C00
0xD 0x12345D00
0xE 0x12345E00 0x43210E00
0xF
cache line can be accommodated in way 1. However, if the program were to access
location 0x1233E00, which hashes to line 0xE, one of the existing lines must be ejected
from the cache to make room for the new cache line. If this ejected line were accessed
later, a cache miss would result. Such a cache miss is termed an “associativity miss”.
Thus far, we have been considering only cases where a CPU reads a data item. What
happens when it does a write? Because it is important that all CPUs agree on the value
of a given data item, before a given CPU writes to that data item, it must first cause it
to be removed, or “invalidated”, from other CPUs’ caches. Once this invalidation has
completed, the CPU may safely modify the data item. If the data item was present in
this CPU’s cache, but was read-only, this process is termed a “write miss”. Once a given
CPU has completed invalidating a given data item from other CPUs’ caches, that CPU
may repeatedly write (and read) that data item.
Later, if one of the other CPUs attempts to access the data item, it will incur a cache
miss, this time because the first CPU invalidated the item in order to write to it. This
type of cache miss is termed a “communication miss”, since it is usually due to several
CPUs using the data items to communicate (for example, a lock is a data item that is
used to communicate among CPUs using a mutual-exclusion algorithm).
Clearly, much care must be taken to ensure that all CPUs maintain a coherent view
of the data. With all this fetching, invalidating, and writing, it is easy to imagine data
being lost or (perhaps worse) different CPUs having conflicting values for the same
data item in their respective caches. These problems are prevented by “cache-coherency
protocols”, described in the next section.
• Read: The “read” message contains the physical address of the cache line to be
read.
• Read Response: The “read response” message contains the data requested by an
earlier “read” message. This “read response” message might be supplied either
by memory or by one of the other caches. For example, if one of the caches has
the desired data in “modified” state, that cache must supply the “read response”
message.
• Invalidate: The “invalidate” message contains the physical address of the cache
line to be invalidated. All other caches must remove the corresponding data from
their caches and respond.
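The transitions described below also mention “read invalidate”, “invalidate acknowledge”, and “writeback” messages. As a compact summary, the full message set can be written as a C enumeration (a sketch; the names are illustrative only):

    enum mesi_message {
            MSG_READ,                   /* physical address of the cache line to read */
            MSG_READ_RESPONSE,          /* the requested data, from memory or another cache */
            MSG_INVALIDATE,             /* address of a line that other caches must discard */
            MSG_INVALIDATE_ACKNOWLEDGE, /* sent once the specified line has been discarded */
            MSG_READ_INVALIDATE,        /* a combined "read" plus "invalidate" */
            MSG_WRITEBACK,              /* address and data of a line being written back to memory */
    };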
Figure B.3: MESI Cache-Coherency State Diagram, showing the four MESI states and the transitions, labeled (a) through (l), that are described below.
• Transition (a): A cache line is written back to memory, but the CPU retains it
in its cache and further retains the right to modify it. This transition requires a
“writeback” message.
• Transition (b): The CPU writes to the cache line that it already had exclusive
access to. This transition does not require any messages to be sent or received.
• Transition (c): The CPU receives a “read invalidate” message for a cache line that
it has modified. The CPU must invalidate its local copy, then respond with both a
“read response” and an “invalidate acknowledge” message, both sending the data
to the requesting CPU and indicating that it no longer has a local copy.
• Transition (f): Some other CPU reads the cache line, and it is supplied from this
CPU’s cache, which retains a read-only copy, possibly also writing it back to
memory. This transition is initiated by the reception of a “read” message, and this
CPU responds with a “read response” message containing the requested data.
• Transition (g): Some other CPU reads a data item in this cache line, and it is
supplied either from this CPU’s cache or from memory. In either case, this CPU
retains a read-only copy. This transition is initiated by the reception of a “read”
message, and this CPU responds with a “read response” message containing the
requested data.
• Transition (h): This CPU realizes that it will soon need to write to some data item
in this cache line, and thus transmits an “invalidate” message. The CPU cannot
complete the transition until it receives a full set of “invalidate acknowledge”
responses. Alternatively, all other CPUs eject this cache line from their caches
via “writeback” messages (presumably to make room for other cache lines), so
that this CPU is the last CPU caching it.
• Transition (j): This CPU does a store to a data item in a cache line that was not
in its cache, and thus transmits a “read invalidate” message. The CPU cannot
complete the transition until it receives the “read response” and a full set of
“invalidate acknowledge” messages. The cache line will presumably transition to
“modified” state via transition (b) as soon as the actual store completes.
• Transition (k): This CPU loads a data item in a cache line that was not in its cache.
The CPU transmits a “read” message, and completes the transition upon receiving
the corresponding “read response”.
• Transition (l): Some other CPU does a store to a data item in this cache line, but
holds this cache line in read-only state due to its being held in other CPUs’ caches
(such as the current CPU’s cache). This transition is initiated by the reception of
an “invalidate” message, and this CPU responds with an “invalidate acknowledge”
message.
Quick Quiz B.5: How does the hardware handle the delayed transitions described
above?
Figure B.4: Writes See Unnecessary Stalls: CPU 0’s write stalls while it waits for CPU 1 to acknowledge the invalidate.
But there is no real reason to force CPU 0 to stall for so long—after all, regardless
of what data happens to be in the cache line that CPU 1 sends it, CPU 0 is going to
unconditionally overwrite it.
Figure B.5: CPUs With Store Buffers: each CPU has a private store buffer between it and its cache, and the caches communicate with each other and with memory over the interconnect.
Each store buffer is assigned to a single CPU: in Figure B.5, for example, CPU 0 cannot access CPU 1’s store buffer and vice versa. This restriction simplifies the hardware by separating
concerns: The store buffer improves performance for consecutive writes, while the
responsibility for communicating among CPUs (or cores, as the case may be) is fully
shouldered by the cache-coherence protocol. However, even given this restriction, there
are complications that must be addressed, which are covered in the next two sections.
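The code fragment under discussion is not reproduced in this excerpt, but the walkthrough below assumes something along the following lines, executed by CPU 0 with both variables initially zero, “a” initially cached only by CPU 1, and “b” owned by CPU 0 (a sketch, not the original listing):

    a = 1;
    b = a + 1;
    assert(b == 2);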
One would not expect the assertion to fail. However, if one were foolish enough to
use the very simple architecture shown in Figure B.5, one would be surprised. Such a
system could potentially see the following sequence of events:
5. CPU 1 receives the “read invalidate” message, and responds by transmitting the cache line and removing that cache line from its cache.
7. CPU 0 receives the cache line from CPU 1, which still has a value of zero for “a”.
8. CPU 0 loads “a” from its cache, finding the value zero.
9. CPU 0 applies the entry from its store buffer to the newly arrived cache line,
setting the value of “a” in its cache to one.
10. CPU 0 adds one to the value zero loaded for “a” above, and stores it into the cache
line containing “b” (which we will assume is already owned by CPU 0).
The problem is that we have two copies of “a”, one in the cache and the other in the
store buffer.
This example breaks a very important guarantee, namely that each CPU will always
see its own operations as if they happened in program order. Breaking this guarantee is
violently counter-intuitive to software types, so much so that the hardware guys took
pity and implemented “store forwarding”, where each CPU refers to (or “snoops”) its
store buffer as well as its cache when performing loads, as shown in Figure B.6. In other
words, a given CPU’s stores are directly forwarded to its subsequent loads, without
having to pass through the cache.
CPU 0 CPU 1
Store Store
Buffer Buffer
Cache Cache
Interconnect
Memory
With store forwarding in place, item 8 in the above sequence would have found the
correct value of 1 for “a” in the store buffer, so that the final value of “b” would have
been 2, as one would hope.
1 void foo(void)
2 {
3 a = 1;
4 b = 1;
5 }
6
7 void bar(void)
8 {
9 while (b == 0) continue;
10 assert(a == 1);
11 }
Suppose CPU 0 executes foo() and CPU 1 executes bar(). Suppose further that
the cache line containing “a” resides only in CPU 1’s cache, and that the cache line
containing “b” is owned by CPU 0. Then the sequence of operations might be as follows:
1. CPU 0 executes a = 1. The cache line is not in CPU 0’s cache, so CPU 0 places
the new value of “a” in its store buffer and transmits a “read invalidate” message.
3. CPU 0 executes b = 1. It already owns this cache line (in other words, the cache
line is already in either the “modified” or the “exclusive” state), so it stores the
new value of “b” in its cache line.
4. CPU 0 receives the “read” message, and transmits the cache line containing the
now-updated value of “b” to CPU 1, also marking the line as “shared” in its own
cache.
5. CPU 1 receives the cache line containing “b” and installs it in its cache.
7. CPU 1 executes the assert(a == 1), and, since CPU 1 is working with the
old value of “a”, this assertion fails.
8. CPU 1 receives the “read invalidate” message, and transmits the cache line
containing “a” to CPU 0 and invalidates this cache line from its own cache. But it
is too late.
9. CPU 0 receives the cache line containing “a” and applies the buffered store just
in time to fall victim to CPU 1’s failed assertion.
Quick Quiz B.8: In step 1 above, why does CPU 0 need to issue a “read invalidate”
rather than a simple “invalidate”?
The hardware designers cannot help directly here, since the CPUs have no idea
which variables are related, let alone how they might be related. Therefore, the hardware
designers provide memory-barrier instructions to allow the software to tell the CPU
about such relations. The program fragment must be updated to contain the memory
barrier:
1 void foo(void)
2 {
3 a = 1;
4 smp_mb();
5 b = 1;
6 }
7
8 void bar(void)
9 {
10 while (b == 0) continue;
11 assert(a == 1);
12 }
The memory barrier smp_mb() will cause the CPU to flush its store buffer before
applying each subsequent store to its variable’s cache line. The CPU could either simply
stall until the store buffer was empty before proceeding, or it could use the store buffer to
hold subsequent stores until all of the prior entries in the store buffer had been applied.
With this latter approach the sequence of operations might be as follows:
1. CPU 0 executes a = 1. The cache line is not in CPU 0’s cache, so CPU 0 places
the new value of “a” in its store buffer and transmits a “read invalidate” message.
3. CPU 0 executes smp_mb(), and marks all current store-buffer entries (namely,
the a = 1).
4. CPU 0 executes b = 1. It already owns this cache line (in other words, the cache
line is already in either the “modified” or the “exclusive” state), but there is a
marked entry in the store buffer. Therefore, rather than store the new value of “b”
in the cache line, it instead places it in the store buffer (but in an unmarked entry).
5. CPU 0 receives the “read” message, and transmits the cache line containing the
original value of “b” to CPU 1. It also marks its own copy of this cache line as
“shared”.
6. CPU 1 receives the cache line containing “b” and installs it in its cache.
7. CPU 1 can now load the value of “b”, but since it finds that the value of “b” is
still 0, it repeats the while statement. The new value of “b” is safely hidden in
CPU 0’s store buffer.
8. CPU 1 receives the “read invalidate” message, and transmits the cache line
containing “a” to CPU 0 and invalidates this cache line from its own cache.
9. CPU 0 receives the cache line containing “a” and applies the buffered store,
placing this line into the “modified” state.
10. Since the store to “a” was the only entry in the store buffer that was marked by
the smp_mb(), CPU 0 can also store the new value of “b”—except for the fact
that the cache line containing “b” is now in “shared” state.
12. CPU 1 receives the “invalidate” message, invalidates the cache line containing “b”
from its cache, and sends an “acknowledgement” message to CPU 0.
13. CPU 1 executes while (b == 0) continue, but the cache line containing
“b” is not in its cache. It therefore transmits a “read” message to CPU 0.
14. CPU 0 receives the “acknowledgement” message, and puts the cache line contain-
ing “b” into the “exclusive” state. CPU 0 now stores the new value of “b” into the
cache line.
15. CPU 0 receives the “read” message, and transmits the cache line containing the
new value of “b” to CPU 1. It also marks its own copy of this cache line as
“shared”.
16. CPU 1 receives the cache line containing “b” and installs it in its cache.
17. CPU 1 can now load the value of “b”, and since it finds that the value of “b” is 1,
it exits the while loop and proceeds to the next statement.
18. CPU 1 executes the assert(a == 1), but the cache line containing “a” is no
longer in its cache. Once it gets this cache line from CPU 0, it will be working with
the up-to-date value of “a”, and the assertion therefore passes.
As you can see, this process involves no small amount of bookkeeping. Even
something intuitively simple, like “load the value of a” can involve lots of complex steps
in silicon.
Figure B.7: CPUs With Invalidate Queues: each CPU now has both a store buffer and an invalidate queue between its cache and the interconnect.
Placing an entry into the invalidate queue is essentially a promise by the CPU to
process that entry before transmitting any MESI protocol messages regarding that cache
line. As long as the corresponding data structures are not highly contended, the CPU
will rarely be inconvenienced by such a promise.
However, the fact that invalidate messages can be buffered in the invalidate queue
provides additional opportunity for memory-misordering, as discussed in the next
section.
1 void foo(void)
2 {
3 a = 1;
4 smp_mb();
5 b = 1;
6 }
7
8 void bar(void)
9 {
10 while (b == 0) continue;
11 assert(a == 1);
12 }
When a given CPU executes a memory barrier, it marks all the entries currently in its invalidate queue, and forces any subsequent load to wait until all marked entries have been applied to the CPU’s cache.
Therefore, we can add a memory barrier to function bar as follows:
1 void foo(void)
2 {
3 a = 1;
4 smp_mb();
5 b = 1;
6 }
7
8 void bar(void)
9 {
10 while (b == 0) continue;
11 smp_mb();
12 assert(a == 1);
13 }
Quick Quiz B.10: Say what??? Why do we need a memory barrier here, given that
the CPU cannot possibly execute the assert() until after the while loop completes?
3. CPU 1 receives CPU 0’s “invalidate” message, queues it, and immediately re-
sponds to it.
4. CPU 0 receives the response from CPU 1, and is therefore free to proceed past
the smp_mb() on line 4 above, moving the value of “a” from its store buffer to
its cache line.
5. CPU 0 executes b = 1. It already owns this cache line (in other words, the cache
line is already in either the “modified” or the “exclusive” state), so it stores the
new value of “b” in its cache line.
6. CPU 0 receives the “read” message, and transmits the cache line containing the
now-updated value of “b” to CPU 1, also marking the line as “shared” in its own
cache.
7. CPU 1 receives the cache line containing “b” and installs it in its cache.
9. CPU 1 must now stall until it processes all pre-existing messages in its invalidation
queue.
10. CPU 1 now processes the queued “invalidate” message, and invalidates the cache
line containing “a” from its own cache.
11. CPU 1 executes the assert(a == 1), and, since the cache line containing “a”
is no longer in CPU 1’s cache, it transmits a “read” message.
12. CPU 0 responds to this “read” message with the cache line containing the new
value of “a”.
13. CPU 1 receives this cache line, which contains a value of 1 for “a”, so that the
assertion does not trigger.
With much passing of MESI messages, the CPUs arrive at the correct answer. This
section illustrates why CPU designers must be extremely careful with their cache-
coherence optimizations.
Some computers have even more flavors of memory barriers, but understanding
these three variants will provide a good introduction to memory barriers in general.
Quick Quiz B.11: Does the guarantee that each CPU sees its own memory accesses
in order also guarantee that each user-level thread will see its own memory accesses in
order? Why or why not?
Imagine a large non-uniform cache architecture (NUCA) system that, in order to
provide fair allocation of interconnect bandwidth to CPUs in a given node, provided
per-CPU queues in each node’s interconnect interface, as shown in Figure B.8. Although
a given CPU’s accesses are ordered as specified by memory barriers executed by that
CPU, the relative order of a given pair of CPUs’ accesses could be severely
reordered, as we will see.5
4 Readers preferring a detailed look at real hardware architectures are encouraged to consult CPU
vendors’ manuals [SW95, Adv02, Int02b, IBM94, LHF05, SPA94, Int04b, Int04a, Int04c], Gharachorloo’s
dissertation [Gha95], Peter Sewell’s work [Sew], or the excellent hardware-oriented primer by Sorin, Hill,
and Wood [SHW11].
5 Any real hardware architect or designer will no doubt be objecting strenuously, as they just might be
just a bit upset about the prospect of working out which queue should handle a message involving a cache line
that both CPUs accessed, to say nothing of the many races that this example poses. All I can say is “Give me
a better example”.
Figure B.8: Example Ordering-Hostile Architecture: two nodes, each containing two CPUs that share a cache, with per-CPU message queues in each node’s interconnect interface, and with the nodes connected to each other and to memory by an interconnect.
B.6.2 Example 1
Table B.2 shows three code fragments, executed concurrently by CPUs 0, 1, and 2. Each of “a”, “b”, and “c” is initially zero.
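Table B.2 itself is not reproduced in this excerpt; a reconstruction consistent with the discussion that follows might look roughly like this, with the temporaries x and z introduced purely for illustration:

    /* CPU 0 */
    a = 1;
    smp_wmb();
    b = 1;

    /* CPU 1 */
    while (b == 0)
            continue;
    c = 1;

    /* CPU 2 */
    z = c;
    smp_rmb();
    x = a;
    assert(z == 0 || x == 1);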
Suppose CPU 0 recently experienced many cache misses, so that its message queue
is full, but that CPU 1 has been running exclusively within the cache, so that its message
queue is empty. Then CPU 0’s assignment to “a” and “b” will appear in Node 0’s cache
immediately (and thus be visible to CPU 1), but will be blocked behind CPU 0’s prior
traffic. In contrast, CPU 1’s assignment to “c” will sail through CPU 1’s previously
empty queue. Therefore, CPU 2 might well see CPU 1’s assignment to “c” before it sees
CPU 0’s assignment to “a”, causing the assertion to fire, despite the memory barriers.
Therefore, portable code cannot rely on this assertion not firing, as both the compiler
and the CPU can reorder the code so as to trip the assertion.
Quick Quiz B.12: Could this code be fixed by inserting a memory barrier between
CPU 1’s “while” and assignment to “c”? Why or why not?
B.6.3 Example 2
Table B.3 shows three code fragments, executed concurrently by CPUs 0, 1, and 2. Both
“a” and “b” are initially zero.
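As before, Table B.3 is not reproduced here; a reconstruction consistent with the following discussion might be (the temporaries x and y are again purely illustrative):

    /* CPU 0 */
    a = 1;

    /* CPU 1 */
    while (a == 0)
            continue;
    smp_mb();
    b = 1;

    /* CPU 2 */
    y = b;
    smp_rmb();
    x = a;
    assert(y == 0 || x == 1);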
Again, suppose CPU 0 recently experienced many cache misses, so that its message
queue is full, but that CPU 1 has been running exclusively within the cache, so that its
message queue is empty. Then CPU 0’s assignment to “a” will appear in Node 0’s cache
immediately (and thus be visible to CPU 1), but will be blocked behind CPU 0’s prior
traffic. In contrast, CPU 1’s assignment to “b” will sail through CPU 1’s previously
empty queue. Therefore, CPU 2 might well see CPU 1’s assignment to “b” before it sees
CPU 0’s assignment to “a”, causing the assertion to fire, despite the memory barriers.
In theory, portable code should not rely on this example code fragment; however, as before, it does in practice work on most mainstream computer systems.
B.6.4 Example 3
Table B.4 shows three code fragments, executed concurrently by CPUs 0, 1, and 2. All
variables are initially zero.
Note that neither CPU 1 nor CPU 2 can proceed to line 5 until they see CPU 0’s
assignment to “b” on line 3. Once CPU 1 and 2 have executed their memory barriers on
line 4, they are both guaranteed to see all assignments by CPU 0 preceding its memory
barrier on line 2. Similarly, CPU 0’s memory barrier on line 8 pairs with those of CPUs 1
and 2 on line 4, so that CPU 0 will not execute the assignment to “e” on line 9 until after
its assignment to “a” is visible to both of the other CPUs. Therefore, CPU 2’s assertion
on line 9 is guaranteed not to fire.
Quick Quiz B.13: Suppose that lines 3-5 for CPUs 1 and 2 in Table B.4 are in an
interrupt handler, and that CPU 2’s line 9 is run at process level. What changes, if
any, are required to enable the code to work correctly, in other words, to prevent the
assertion from firing?
Quick Quiz B.14: If CPU 2 executed an assert(e==0||c==1) in the example
in Table B.4, would this assert ever trigger?
The Linux kernel’s synchronize_rcu() primitive uses an algorithm similar to
that shown in this example.
Table B.5: Summary of Memory Ordering. For each of Alpha, AMD64, ARMv7-A/R, IA64, MIPS, (PA-RISC), PA-RISC CPUs, POWER, (SPARC RMO), (SPARC PSO), SPARC TSO, x86, (x86 OOStore), and zSeries, the table indicates whether loads can be reordered after loads, whether loads can be reordered after stores, whether stores can be reordered after stores, whether stores can be reordered after loads, whether atomic instructions can be reordered with loads or with stores, and whether the instruction cache/pipeline can be incoherent. (The individual per-CPU entries are not reproduced here.)
B.7.1 Alpha
It may seem strange to say much of anything about a CPU whose end of life has been
announced, but Alpha is interesting because, with the weakest memory ordering model,
it reorders memory operations the most aggressively. It therefore has defined the Linux-
kernel memory-ordering primitives, which must work on all CPUs, including Alpha.
Understanding Alpha is therefore surprisingly important to the Linux kernel hacker.
The difference between Alpha and the other CPUs is illustrated by the code shown
in Figure B.9. The smp_wmb() on line 9 of this figure guarantees that the element initialization on lines 6-8 is executed before the element is added to the list on line 10,
so that the lock-free search will work correctly. That is, it makes this guarantee on all
CPUs except Alpha.
Alpha has extremely weak memory ordering such that the code on line 20 of
Figure B.9 could see the old garbage values that were present before the initialization
on lines 6-8.
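Figure B.9 itself is not included in this excerpt; the line numbers cited above refer to that figure. The code in question inserts an element into a linked list under a lock and searches the list without locking, roughly as follows (a sketch with illustrative type and variable names, shown without the original line numbering):

    struct el {
            struct el *next;
            long key;
            long data;
    };
    struct el head;     /* list header; head.next initially points back to &head */
    spinlock_t mutex;

    void insert(long key, long data)
    {
            struct el *p = kmalloc(sizeof(*p), GFP_ATOMIC);

            spin_lock(&mutex);
            p->next = head.next; /* element initialization ... */
            p->key = key;
            p->data = data;
            smp_wmb();           /* ... ordered before the publication below */
            head.next = p;       /* add the element to the list */
            spin_unlock(&mutex);
    }

    struct el *search(long key)
    {
            struct el *p = head.next;

            while (p != &head) {
                    /* BUG ON ALPHA: p->key may still be pre-initialization garbage. */
                    if (p->key == key)
                            return p;
                    p = p->next;
            }
            return NULL;
    }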
Figure B.10 shows how this can happen on an aggressively parallel machine with
partitioned caches, so that alternating cache lines are processed by the different partitions
of the caches. Assume that the list header head will be processed by cache bank 0, and
that the new element will be processed by cache bank 1. On Alpha, the smp_wmb()
will guarantee that the cache invalidates performed by lines 6-8 of Figure B.9 will reach
the interconnect before that of line 10 does, but makes absolutely no guarantee about
the order in which the new values will reach the reading CPU’s core. For example, it
is possible that the reading CPU’s cache bank 1 is very busy, but cache bank 0 is idle.
This could result in the cache invalidates for the new element being delayed, so that the
reading CPU gets the new value for the pointer, but sees the old cached values for the
new element. See the documentation [Com01] called out earlier for more information,
or, again, if you think that I am just making all this up.6
6 Of course, the astute reader will have already recognized that Alpha is nowhere near as mean and nasty as it could be, the (thankfully) mythical architecture in Section B.6.1 being a case in point.
One could place an smp_rmb() primitive between the pointer fetch and deref-
erence. However, this imposes unneeded overhead on systems (such as i386, IA64,
PPC, and SPARC) that respect data dependencies on the read side. A smp_read_
barrier_depends() primitive has been added to the Linux 2.6 kernel to eliminate
overhead on these systems. This primitive may be used as shown on line 19 of Fig-
ure B.11.
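Figure B.11 is likewise not reproduced here, but the fix amounts to placing the barrier between each pointer fetch and the dereference that follows it, along the lines of (continuing the sketch above):

    struct el *p = head.next;

    while (p != &head) {
            smp_read_barrier_depends(); /* order the dereferences after the pointer fetch */
            if (p->key == key)
                    return p;
            p = p->next;
    }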
It is also possible to implement a software barrier that could be used in place of
smp_wmb(), which would force all reading CPUs to see the writing CPU’s writes in
order. However, this approach was deemed by the Linux community to impose excessive overhead on extremely weakly ordered CPUs such as Alpha. This software barrier could
be implemented by sending inter-processor interrupts (IPIs) to all other CPUs. Upon
receipt of such an IPI, a CPU would execute a memory-barrier instruction, implementing
a memory-barrier shootdown. Additional logic is required to avoid deadlocks. Of course,
CPUs that respect data dependencies would define such a barrier to simply be smp_
wmb(). Perhaps this decision should be revisited in the future as Alpha fades off into
the sunset.
The Linux memory-barrier primitives took their names from the Alpha instructions,
so smp_mb() is mb, smp_rmb() is rmb, and smp_wmb() is wmb. Alpha is the
only CPU where smp_read_barrier_depends() is an smp_mb() rather than
a no-op.
Quick Quiz B.15: Why is Alpha’s smp_read_barrier_depends() an smp_
mb() rather than smp_rmb()?
For more detail on Alpha, see the reference manual [SW95].
B.7.2 AMD64
AMD64 is compatible with x86, and has updated its documented memory model [Adv07]
to enforce the tighter ordering that actual implementations have provided for some time.
The AMD64 implementation of the Linux smp_mb() primitive is mfence, smp_
rmb() is lfence, and smp_wmb() is sfence. In theory, these might be relaxed,
but any such relaxation must take SSE and 3DNOW instructions into account.
B.7.3 ARMv7-A/R
The ARM family of CPUs is extremely popular in embedded applications, particularly
for power-constrained applications such as cellphones. Its memory model is similar
to that of Power (see Section B.7.7), but ARM uses a different set of memory-barrier
instructions [ARM10]:
1. DMB (data memory barrier) causes the specified type of operations to appear to
have completed before any subsequent operations of the same type. The “type”
of operations can be all operations or can be restricted to only writes (similar to
the Alpha wmb and the POWER eieio instructions). In addition, ARM allows
cache coherence to have one of three scopes: single processor, a subset of the
processors (“inner”) and global (“outer”).
3. ISB (instruction synchronization barrier) flushes the CPU pipeline, so that all
instructions following the ISB are fetched only after the ISB completes. For
example, if you are writing a self-modifying program (such as a JIT), you should
execute an ISB between generating the code and executing it.
None of these instructions exactly match the semantics of Linux’s rmb() primitive,
which must therefore be implemented as a full DMB. The DMB and DSB instructions
have a recursive definition of accesses ordered before and after the barrier, which has an
effect similar to that of POWER’s cumulativity.
ARM also implements control dependencies, so that if a conditional branch depends
on a load, then any store executed after that conditional branch will be ordered after
the load. However, loads following the conditional branch will not be guaranteed to be
ordered unless there is an ISB instruction between the branch and the load. Consider
the following example:
1 r1 = x;
2 if (r1 == 0)
3   nop();
4 y = 1;
5 r2 = z;
6 ISB();
7 r3 = z;
In this example, load-store control dependency ordering causes the load from x on
line 1 to be ordered before the store to y on line 4. However, ARM does not respect
load-load control dependencies, so that the load on line 1 might well happen after the
load on line 5. On the other hand, the combination of the conditional branch on line 2
and the ISB instruction on line 6 ensures that the load on line 7 happens after the load
on line 1. Note that inserting an additional ISB instruction somewhere between lines 3
and 4 would enforce ordering between lines 1 and 5.
B.7.4 IA64
IA64 offers a weak consistency model, so that in absence of explicit memory-barrier
instructions, IA64 is within its rights to arbitrarily reorder memory references [Int02b].
IA64 has a memory-fence instruction named mf, but also has “half-memory fence”
modifiers to loads, stores, and to some of its atomic instructions [Int02a]. The acq
modifier prevents subsequent memory-reference instructions from being reordered
before the acq, but permits prior memory-reference instructions to be reordered after
the acq, as fancifully illustrated by Figure B.12. Similarly, the rel modifier prevents
prior memory-reference instructions from being reordered after the rel, but allows
subsequent memory-reference instructions to be reordered before the rel.
These half-memory fences are useful for critical sections, since it is safe to push
operations into a critical section, but can be fatal to allow them to bleed out. However,
as one of the only CPUs with this property,7 IA64 defines Linux’s semantics of memory
ordering associated with lock acquisition and release.
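A lock-based fragment makes the point concrete; the variable and lock names here are purely illustrative, and the comments describe what acquire and release semantics permit:

    x = 1;                 /* may be reordered past the acquire, into the critical section */
    spin_lock(&my_lock);   /* acquire: later accesses cannot move before this point */
    shared++;              /* must stay between the acquire and the release */
    spin_unlock(&my_lock); /* release: earlier accesses cannot move after this point */
    y = 1;                 /* may be reordered before the release, into the critical section */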
The IA64 mf instruction is used for the smp_rmb(), smp_mb(), and smp_
wmb() primitives in the Linux kernel. Oh, and despite rumors to the contrary, the “mf”
mnemonic really does stand for “memory fence”.
Finally, IA64 offers a global total order for “release” operations, including the “mf”
instruction. This provides the notion of transitivity, where if a given code fragment sees
a given access as having happened, any later code fragment will also see that earlier
access as having happened. Assuming, that is, that all the code fragments involved
correctly use memory barriers.
B.7.5 MIPS
The MIPS memory model [Ima15, Table 6.6] appears to resemble that of ARM, IA64,
and Power, being weakly ordered by default, but respecting dependencies. MIPS has a
wide variety of memory-barrier instructions, but ties them not to hardware considerations,
but rather to the use cases provided by the Linux kernel and the C++11 standard [Smi15]
in a manner similar to the ARM64 additions:
SYNC Full barrier for a number of hardware operations in addition to memory refer-
ences.
SYNC_WMB Write memory barrier, which can be used to implement the smp_wmb()
primitive in the Linux kernel.
SYNC_MB Full memory barrier, but only for memory operations. This may be used
to implement the Linux-kernel smp_mb() and the C++ atomic_thread_
fence(memory_order_seq_cst).
SYNC_RMB Read memory barrier, which can be used to implement the smp_rmb()
primitive in the Linux kernel.
Informal discussions with MIPS architects indicate that MIPS has a definition of
transitivity or cumulativity similar to that of ARM and Power. However, it appears that
different MIPS implementations can have different memory-ordering properties, so it is
important to consult the documentation for the specific MIPS implementation you are
using.
B.7.6 PA-RISC
Although the PA-RISC architecture permits full reordering of loads and stores, actual
CPUs run fully ordered [Kan96]. This means that the Linux kernel’s memory-ordering
primitives generate no code; however, they do use the gcc memory attribute to disable
compiler optimizations that would reorder code across the memory barrier.
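Such a compiler-only barrier might look as follows; this is the conventional GCC idiom rather than a quotation of the PA-RISC kernel sources:

    /* Emits no instructions, but tells GCC that memory may change across this
     * point, preventing it from reordering or caching memory accesses across it. */
    #define barrier() __asm__ __volatile__("" : : : "memory")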
1. sync causes all preceding operations to appear to have completed before any
subsequent operations are started. This instruction is therefore quite expensive.
2. lwsync (light-weight sync) orders loads with respect to subsequent loads and
stores, and also orders stores. However, it does not order stores with respect to
subsequent loads. Interestingly enough, the lwsync instruction enforces the
same ordering as does zSeries, and coincidentally, SPARC TSO. The lwsync
instruction may be used to implement load-acquire and store-release operations.
3. eieio (enforce in-order execution of I/O, in case you were wondering) causes
all preceding cacheable stores to appear to have completed before all subsequent
stores. However, stores to cacheable memory are ordered separately from stores
to non-cacheable memory, which means that eieio will not force an MMIO
store to precede a spinlock release.
4. isync forces all preceding instructions to appear to have completed before any
subsequent instructions start execution. This means that the preceding instruc-
tions must have progressed far enough that any traps they might generate have
either happened or are guaranteed not to happen, and that any side-effects of
these instructions (for example, page-table changes) are seen by the subsequent
instructions.
Unfortunately, none of these instructions line up exactly with Linux’s wmb() primi-
tive, which requires all stores to be ordered, but does not require the other high-overhead
actions of the sync instruction. But there is no choice: ppc64 versions of wmb()
and mb() are defined to be the heavyweight sync instruction. However, Linux’s
smp_wmb() instruction is never used for MMIO (since a driver must carefully order
MMIOs in UP as well as SMP kernels, after all), so it is defined to be the lighter-weight eieio instruction. This instruction may well be unique in having a five-vowel mnemonic. The smp_mb() primitive is also defined to be the sync instruction, but both smp_rmb() and rmb() are defined to be the lighter-weight lwsync instruction.
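Putting the mappings described in this paragraph together, the ppc64 definitions amount to roughly the following (illustrative only; the actual kernel macros involve additional plumbing):

    #define mb()      __asm__ __volatile__("sync"   : : : "memory")
    #define wmb()     __asm__ __volatile__("sync"   : : : "memory")
    #define smp_mb()  __asm__ __volatile__("sync"   : : : "memory")
    #define smp_wmb() __asm__ __volatile__("eieio"  : : : "memory")
    #define rmb()     __asm__ __volatile__("lwsync" : : : "memory")
    #define smp_rmb() __asm__ __volatile__("lwsync" : : : "memory")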
Power features “cumulativity”, which can be used to obtain transitivity. When
used properly, any code seeing the results of an earlier code fragment will also see the
accesses that this earlier code fragment itself saw. Much more detail is available from
McKenney and Silvera [MS09].
Power respects control dependencies in much the same way that ARM does, with the
exception that the Power isync instruction is substituted for the ARM ISB instruction.
Many members of the POWER architecture have incoherent instruction caches,
so that a store to memory will not necessarily be reflected in the instruction cache.
Thankfully, few people write self-modifying code these days, but JITs and compilers
do it all the time. Furthermore, recompiling a recently run program looks just like
self-modifying code from the CPU’s viewpoint. The icbi instruction (instruction
cache block invalidate) invalidates a specified cache line from the instruction cache, and
may be used in these situations.
• LoadLoad: order preceding loads before subsequent loads. (This option is used
by the Linux smp_rmb() primitive.)
• Sync: fully complete all preceding operations before starting any subsequent
operations.
The Linux smp_mb() primitive uses the first four options together, as in membar
#LoadLoad | #LoadStore | #StoreStore | #StoreLoad, thus fully or-
dering memory operations.
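In other words, the expansion looks roughly like the following (a sketch; the actual kernel definition differs in detail):

    #define smp_mb() \
            __asm__ __volatile__("membar #LoadLoad | #LoadStore | #StoreStore | #StoreLoad" \
                                 : : : "memory")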
So, why is membar #MemIssue needed? Because a membar #StoreLoad
could permit a subsequent load to get its value from a store buffer, which would be
disastrous if the write was to an MMIO register that induced side effects on the value
to be read. In contrast, membar #MemIssue would wait until the store buffers were
flushed before permitting the loads to execute, thereby ensuring that the load actually
gets its value from the MMIO register. Drivers could instead use membar #Sync, but
the lighter-weight membar #MemIssue is preferred in cases where the additional
functionality of the more-expensive membar #Sync is not required.
The membar #Lookaside is a lighter-weight version of membar #MemIssue,
which is useful when writing to a given MMIO register affects the value that will next
be read from that register. However, the heavier-weight membar #MemIssue must
be used when a write to a given MMIO register affects the value that will next be read
from some other MMIO register.
It is not clear why SPARC does not define wmb() to be membar #MemIssue and smp_wmb() to be membar #StoreStore, as the current definitions seem
vulnerable to bugs in some drivers. It is quite possible that all the SPARC CPUs
that Linux runs on implement a more conservative memory-ordering model than the
architecture would permit.
SPARC requires a flush instruction be used between the time that an instruction
is stored and executed [SPA94]. This is needed to flush any prior value for that location
from the SPARC’s instruction cache. Note that flush takes an address, and will flush
only that address from the instruction cache. On SMP systems, all CPUs’ caches are
flushed, but there is no convenient way to determine when the off-CPU flushes complete,
though there is a reference to an implementation note.
B.7.9 x86
Since the x86 CPUs provide “process ordering” so that all CPUs agree on the order
of a given CPU’s writes to memory, the smp_wmb() primitive is a no-op for the
CPU [Int04b]. However, a compiler directive is required to prevent the compiler
from performing optimizations that would result in reordering across the smp_wmb()
primitive.
On the other hand, x86 CPUs have traditionally given no ordering guarantees for
loads, so the smp_mb() and smp_rmb() primitives expand to lock;addl. This
atomic instruction acts as a barrier to both loads and stores.
Intel has also published a memory model for x86 [Int07]. It turns out that Intel’s
actual CPUs enforced tighter ordering than was claimed in the previous specifications,
so this model is in effect simply mandating the earlier de-facto behavior. Even more
recently, Intel published an updated memory model for x86 [Int11, Section 8.2], which
mandates a total global order for stores, although individual CPUs are still permitted
to see their own stores as having happened earlier than this total global order would
indicate. This exception to the total ordering is needed to allow important hardware
optimizations involving store buffers. In addition, memory ordering obeys causality,
so that if CPU 0 sees a store by CPU 1, then CPU 0 is guaranteed to see all stores that
CPU 1 saw prior to its store. Software may use atomic operations to override these
hardware optimizations, which is one reason that atomic operations tend to be more
expensive than their non-atomic counterparts. This total store order is not guaranteed
on older processors.
It is also important to note that atomic instructions operating on a given memory
location should all be of the same size [Int11, Section 8.1.2.2]. For example, if you write
a program where one CPU atomically increments a byte while another CPU executes a
4-byte atomic increment on that same location, you are on your own.
However, note that some SSE instructions are weakly ordered (clflush and
non-temporal move instructions [Int04a]). CPUs that have SSE can use mfence for
smp_mb(), lfence for smp_rmb(), and sfence for smp_wmb().
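On such SSE-capable CPUs, the definitions just described amount to roughly the following (illustrative only; older CPUs instead use the lock;addl form mentioned above, and smp_wmb() can remain a pure compiler barrier):

    #define smp_mb()  __asm__ __volatile__("mfence" : : : "memory")
    #define smp_rmb() __asm__ __volatile__("lfence" : : : "memory")
    #define smp_wmb() __asm__ __volatile__("sfence" : : : "memory")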
A few versions of the x86 CPU have a mode bit that enables out-of-order stores, and
for these CPUs, smp_wmb() must also be defined to be lock;addl.
Although newer x86 implementations accommodate self-modifying code without
any special instructions, to be fully compatible with past and potential future x86 imple-
mentations, a given CPU must execute a jump instruction or a serializing instruction
(e.g., cpuid) between modifying the code and executing it [Int11, Section 8.1.3].
B.7.10 zSeries
The zSeries machines make up the IBM™ mainframe family, previously known as
the 360, 370, and 390 [Int04c]. Parallelism came late to zSeries, but given that these
mainframes first shipped in the mid 1960s, this is not saying much. The bcr 15,0 in-
struction is used for the Linux smp_mb(), smp_rmb(), and smp_wmb() primitives.
It also has comparatively strong memory-ordering semantics, as shown in Table B.5,
which should allow the smp_wmb() primitive to be a nop (and by the time you read
this, this change may well have happened). The table actually understates the situation,
as the zSeries memory model is otherwise sequentially consistent, meaning that all
CPUs will agree on the order of unrelated stores from different CPUs.
As with most CPUs, the zSeries architecture does not guarantee a cache-coherent
instruction stream, hence, self-modifying code must execute a serializing instruction
between updating the instructions and executing them. That said, many actual zSeries
machines do in fact accommodate self-modifying code without serializing instructions.
The zSeries instruction set provides a large set of serializing instructions, including
compare-and-swap, some types of branches (for example, the aforementioned bcr
15,0 instruction), and test-and-set, among others.
architecture, there would be no need for memory barriers, because a given thread would
simply wait for all outstanding operations to complete before proceeding to the next
instruction. Because there would be potentially thousands of other threads, the CPU
would be completely utilized, so no CPU time would be wasted.
The argument against would cite the extremely limited number of applications
capable of scaling up to a thousand threads, as well as increasingly severe realtime
requirements, which are in the tens of microseconds for some applications. The realtime-
response requirements are difficult enough to meet as is, and would be even more difficult
to meet given the extremely low single-threaded throughput implied by the massive
multi-threaded scenarios.
Another argument in favor would cite increasingly sophisticated latency-hiding
hardware implementation techniques that might well allow the CPU to provide the
illusion of fully sequentially consistent execution while still providing almost all of the
performance advantages of out-of-order execution. A counter-argument would cite the
increasingly severe power-efficiency requirements presented both by battery-operated
devices and by environmental responsibility.
Who is right? We have no clue, so are preparing to live with either scenario.
the time the system gets around to dumping the offending input buffer, the DMA
will most likely have completed.
4. Inter-processor interrupts (IPIs) that ignore cache coherence.
This can be problematic if the IPI reaches its destination before all of the cache
lines in the corresponding message buffer have been committed to memory.
Appendix C
Answer:
In Appendix C starting on page 521.
Hey, I thought I owed you an easy one! q
Answer:
Indeed it is! Many are questions that Paul E. McKenney would probably have asked if he were a novice student in a class covering this material. It is worth noting that Paul
was taught most of this material by parallel hardware and software, not by professors.
In Paul’s experience, professors are much more likely to provide answers to verbal
questions than are parallel systems, Watson notwithstanding. Of course, we could
have a lengthy debate over which of professors or parallel systems provide the most
useful answers to these sorts of questions, but for the time being let’s just agree that
usefulness of answers varies widely across the population both of professors and of
parallel systems.
Other quizzes are quite similar to actual questions that have been asked during
conference presentations and lectures covering the material in this book. A few others
are from the viewpoint of the author. q
Answer:
Here are a few possible strategies:
1. Just ignore the Quick Quizzes and read the rest of the book. You might miss out
on the interesting material in some of the Quick Quizzes, but the rest of the book
has lots of good material as well. This is an eminently reasonable approach if
your main goal is to gain a general understanding of the material or if you are
skimming through the book to find a solution to a specific problem.
2. If you find the Quick Quizzes distracting but impossible to ignore, you can always
clone the LATEX source for this book from the git archive. You can then modify
Makefile and qqz.sty to eliminate the Quick Quizzes from the PDF output.
Alternatively, you could modify these two files so as to pull the answers inline,
immediately following the questions.
3. Look at the answer immediately rather than investing a large amount of time in
coming up with your own answer. This approach is reasonable when a given
Quick Quiz’s answer holds the key to a specific problem you are trying to solve.
This approach is also reasonable if you want a somewhat deeper understanding of
the material, but when you do not expect to be called upon to generate parallel
solutions given only a blank sheet of paper.
Note that as of mid-2016 the quick quizzes are hyperlinked to the answers and vice
versa. Click either the “Quick Quiz” heading or the small black square to move to the
beginning of the answer. From the answer, click on the heading or the small black
square to move to the beginning of the quiz, or, alternatively, click on the small white
square at the end of the answer to move to the end of the corresponding quiz. q
C.2 Introduction
Quick Quiz 2.1:
Come on now!!! Parallel programming has been known to be exceedingly hard for
many decades. You seem to be hinting that it is not so hard. What sort of game are you
playing?
Answer:
If you really believe that parallel programming is exceedingly hard, then you should
have a ready answer to the question “Why is parallel programming hard?” One could list
any number of reasons, ranging from deadlocks to race conditions to testing coverage,
but the real answer is that it is not really all that hard. After all, if parallel programming
was really so horribly difficult, how could a large number of open-source projects,
ranging from Apache to MySQL to the Linux kernel, have managed to master it?
A better question might be: “Why is parallel programming perceived to be so
difficult?” To see the answer, let’s go back to the year 1991. Paul McKenney was
walking across the parking lot to Sequent’s benchmarking center carrying six dual-
80486 Sequent Symmetry CPU boards, when he suddenly realized that he was carrying
several times the price of the house he had just purchased.1 This high cost of parallel
systems meant that parallel programming was restricted to a privileged few who worked
for an employer who either manufactured or could afford to purchase machines costing
upwards of $100,000—in 1991 dollars US.
1 Yes, this sudden realization did cause him to walk quite a bit more carefully. Why do you ask?
In contrast, in 2006, Paul finds himself typing these words on a dual-core x86 laptop.
Unlike the dual-80486 CPU boards, this laptop also contains 2GB of main memory,
a 60GB disk drive, a display, Ethernet, USB ports, wireless, and Bluetooth. And the
laptop is more than an order of magnitude cheaper than even one of those dual-80486
CPU boards, even before taking inflation into account.
Parallel systems have truly arrived. They are no longer the sole domain of a
privileged few, but something available to almost everyone.
The earlier restricted availability of parallel hardware is the real reason that parallel
programming is considered so difficult. After all, it is quite difficult to learn to program
even the simplest machine if you have no access to it. Since the age of rare and
expensive parallel machines is for the most part behind us, the age during which parallel
programming is perceived to be mind-crushingly difficult is coming to a close.2 q
Answer:
It depends on the programming environment. SQL [Int92] is an underappreciated
success story, as it permits programmers who know nothing about parallelism to keep a
large parallel system productively busy. We can expect more variations on this theme as
parallel computers continue to become cheaper and more readily available. For example,
one possible contender in the scientific and technical computing arena is MATLAB*P,
which is an attempt to automatically parallelize common matrix operations.
Finally, on Linux and UNIX systems, consider the following shell command:
get_input | grep "interesting" | sort
This shell pipeline runs the get_input, grep, and sort processes in parallel.
There, that wasn’t so hard, now was it?
In short, parallel programming is just as easy as sequential programming—at least
in those environments that hide the parallelism from the user! q
Answer:
These are important goals, but they are just as important for sequential programs as they
are for parallel programs. Therefore, important though they are, they do not belong on a
list specific to parallel programming. q
Answer:
Given that parallel programming is perceived to be much harder than sequential programming, productivity is paramount and therefore must not be omitted. Furthermore,
Answer:
From an engineering standpoint, the difficulty in proving correctness, either formally or
informally, would be important insofar as it impacts the primary goal of productivity.
So, in cases where correctness proofs are important, they are subsumed under the
“productivity” rubric. q
Answer:
Having fun is important as well, but, unless you are a hobbyist, would not normally be a
primary goal. On the other hand, if you are a hobbyist, go wild! q
Answer:
There certainly are cases where the problem to be solved is inherently parallel, for
example, Monte Carlo methods and some numerical computations. Even in these cases,
however, there will be some amount of extra work managing the parallelism.
Parallelism is also sometimes used for reliability. For but one example, triple-modular redundancy has three systems run in parallel and vote on the result. In extreme cases,
the three systems will be independently implemented using different algorithms and
technologies. q
Answer:
If you are a pure hobbyist, perhaps you don’t need to care. But even pure hobbyists
will often care about how much they can get done, and how quickly. After all, the
most popular hobbyist tools are usually those that are the best suited for the job, and an
important part of the definition of “best suited” involves productivity. And if someone
is paying you to write parallel code, they will very likely care deeply about your
productivity. And if the person paying you cares about something, you would be most
wise to pay at least some attention to it!
Besides, if you really didn’t care about productivity, you would be doing it by hand
rather than using a computer! q
Answer:
There are a number of answers to this question:
1. Given a large computational cluster of parallel machines, the aggregate cost of the
cluster can easily justify substantial developer effort, because the development
cost can be spread over the large number of machines.
2. Popular software that is run by tens of millions of users can easily justify substan-
tial developer effort, as the cost of this development can be spread over the tens of
millions of users. Note that this includes things like kernels and system libraries.
3. If the low-cost parallel machine is controlling the operation of a valuable piece of
equipment, then the cost of this piece of equipment might easily justify substantial
developer effort.
4. If the software for the low-cost parallel machine produces an extremely valuable
result (e.g., mineral exploration), then the valuable result might again justify
substantial developer cost.
5. Safety-critical systems protect lives, which can clearly justify very large developer
effort.
6. Hobbyists and researchers might seek knowledge, experience, fun, or glory rather
than gold.
So it is not the case that the decreasing cost of hardware renders software worthless, but
rather that it is no longer possible to “hide” the cost of software development within the
cost of the hardware, at least not unless there are extremely large quantities of hardware.
q
Answer:
This is eminently achievable. The cellphone is a computer that can be used to make
phone calls and to send and receive text messages with little or no programming or
configuration on the part of the end user.
This might seem to be a trivial example at first glance, but if you consider it carefully
you will see that it is both simple and profound. When we are willing to sacrifice
generality, we can achieve truly astounding increases in productivity. Those who
indulge in excessive generality will therefore fail to set the productivity bar high enough
to succeed near the top of the software stack. This fact of life even has its own acronym:
YAGNI, or “You Ain’t Gonna Need It.” q
Answer:
Exactly! And that is the whole point of using existing software. One team’s work can
be used by many other teams, resulting in a large decrease in overall effort compared to
all teams needlessly reinventing the wheel. q
Answer:
There are any number of potential bottlenecks:
2. Cache. If a single thread’s cache footprint completely fills any shared CPU
cache(s), then adding more threads will simply thrash those affected caches.
4. I/O bandwidth. If a single thread is I/O bound, adding more threads will simply
result in them all waiting in line for the affected I/O resource.
Specific hardware systems might have any number of additional bottlenecks. The
fact is that every resource which is shared between multiple CPUs or threads is a
potential bottleneck. q
Answer:
There are any number of potential limits on the number of threads:
1. Main memory. Each thread consumes some memory (for its stack if nothing else),
so that excessive numbers of threads can exhaust memory, resulting in excessive
paging or memory-allocation failures.
Specific applications and platforms may have any number of additional limiting
factors. q
Answer:
There are a great many other potential obstacles to parallel programming. Here are a
few of them:
1. The only known algorithms for a given project might be inherently sequential in
nature. In this case, either avoid parallel programming (there being no law saying
that your project has to run in parallel) or invent a new parallel algorithm.
2. The project allows binary-only plugins that share the same address space, such
that no one developer has access to all of the source code for the project. Because
many parallel bugs, including deadlocks, are global in nature, such binary-only
plugins pose a severe challenge to current software development methodologies.
This might well change, but for the time being, all developers of parallel code
sharing a given address space need to be able to see all of the code running in that
address space.
3. The project contains heavily used APIs that were designed without regard to
parallelism [AGH+ 11a, CKZ+ 13]. Some of the more ornate features of the
System V message-queue API form a case in point. Of course, if your project has
been around for a few decades, and its developers did not have access to parallel
hardware, it undoubtedly has at least its share of such APIs.
4. The project was implemented without regard to parallelism. Given that there are a
great many techniques that work extremely well in a sequential environment, but
that fail miserably in parallel environments, if your project ran only on sequential
hardware for most of its lifetime, then your project undoubtedly has at least its
share of parallel-unfriendly code.
6. The people who originally did the development on your project have since moved
on, and the people remaining, while well able to maintain it or add small features,
are unable to make “big animal” changes. In this case, unless you can work out a
very simple way to parallelize your project, you will probably be best off leaving
it sequential. That said, there are a number of simple approaches that you might
use to parallelize your project, including running multiple instances of it, using a
parallel implementation of some heavily used library function, or making use of
some other parallel project, such as a database.
One can argue that many of these obstacles are non-technical in nature, but that does
not make them any less real. In short, parallelization of a large body of code can be a
large and complex effort. As with any large and complex effort, it makes sense to do
your homework beforehand. q
Answer:
It might well be easier to ignore the detailed properties of the hardware, but in most cases
it would be quite foolish to do so. If you accept that the only purpose of parallelism is
to increase performance, and if you further accept that performance depends on detailed
properties of the hardware, then it logically follows that parallel programmers are going
to need to know at least a few hardware properties.
This is the case in most engineering disciplines. Would you want to use a bridge
designed by an engineer who did not understand the properties of the concrete and
steel making up that bridge? If not, why would you expect a parallel programmer to be
able to develop competent parallel software without at least some understanding of the
underlying hardware? q
Answer:
One answer to this question is that it is often possible to pack multiple elements of data
into a single machine word, which can then be manipulated atomically.
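For example, two small counters can share a single 32-bit word so that both are updated by one atomic operation (a sketch using C11 atomics; the field layout is arbitrary):

    #include <stdatomic.h>
    #include <stdint.h>

    static _Atomic uint32_t packed; /* upper 16 bits: counter A; lower 16 bits: counter B */

    static void inc_both(void)
    {
            uint32_t old = atomic_load(&packed);
            uint32_t new;

            do { /* compute both updated halves, then attempt a single atomic swap */
                    new = ((old + 0x10000) & 0xffff0000) | ((old + 1) & 0x0000ffff);
            } while (!atomic_compare_exchange_weak(&packed, &old, new));
    }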
A more trendy answer would be machines supporting transactional memory [Lom77].
As of early 2014, several mainstream systems provide limited hardware transactional
memory implementations, which is covered in more detail in Section 17.3. The jury
is still out on the applicability of software transactional memory [MMW07, PW07,
RHP+ 07, CBM+ 08, DFGG11, MS12]. Additional information on software transac-
tional memory may be found in Section 17.2. q
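For concreteness, here is a minimal sketch (not taken from the book's CodeSamples) of the word-packing approach: two 16-bit counters are packed into a single 32-bit word so that both can be updated by one compare-and-swap, using the gcc __sync_bool_compare_and_swap() builtin.

#include <stdint.h>

static inline uint32_t pack(uint16_t a, uint16_t b)
{
	return ((uint32_t)a << 16) | b;
}

static inline void unpack(uint32_t w, uint16_t *a, uint16_t *b)
{
	*a = (uint16_t)(w >> 16);
	*b = (uint16_t)(w & 0xffff);
}

/* Atomically increment both packed counters with a single CAS. */
static void atomic_inc_both(uint32_t *wp)
{
	uint32_t old, new;
	uint16_t a, b;

	do {
		old = *wp;
		unpack(old, &a, &b);
		new = pack(a + 1, b + 1);
	} while (!__sync_bool_compare_and_swap(wp, old, new));
}

The retry loop simply repeats the update if some other thread changed the word in the meantime, which is the usual CAS idiom.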
Answer:
Unfortunately, not so much. There has been some reduction given constant numbers of
CPUs, but the finite speed of light and the atomic nature of matter limit their ability
to reduce cache-miss overhead for larger systems. Section 3.3 discusses some possible
avenues for possible future progress. q
Answer:
This sequence ignored a number of possible complications, including:
Answer:
If the cacheline was not flushed from CPU 7’s cache, then CPUs 0 and 7 might have
different values for the same set of variables in the cacheline. This sort of incoherence
would greatly complicate parallel software, and so hardware architects have been
convinced to avoid it. q
Answer:
The hardware designers have been working on this problem, and have consulted with
no less a luminary than the physicist Stephen Hawking. Hawking’s observation was that
the hardware designers have two basic problems [Gar07]: the finite speed of light, and the atomic nature of matter.
The first problem limits raw speed, and the second limits miniaturization, which
in turn limits frequency. And even this sidesteps the power-consumption issue that is
currently holding production frequencies to well below 10 GHz.
Nevertheless, some progress is being made, as may be seen by comparing Table C.1
with Table 3.1 on page 29. Integration of hardware threads in a single core and multiple
cores on a die have improved latencies greatly, at least within the confines of a single
core or single die. There has been some improvement in overall system latency, but only
by about a factor of two. Unfortunately, neither the speed of light nor the atomic nature
of matter has changed much in the past few years.

Table C.1: Performance of synchronization mechanisms.

Operation                  Cost (ns)   Ratio (cost/clock)
Clock period                     0.4                  1.0
"Best-case" CAS                 12.2                 33.8
Best-case lock                  25.6                 71.2
Single cache miss               12.9                 35.8
CAS cache miss                   7.0                 19.4
Off-Core
  Single cache miss             31.2                 86.6
  CAS cache miss                31.2                 86.5
Off-Socket
  Single cache miss             92.4                256.7
  CAS cache miss                95.9                266.4
Comms Fabric                 2,600.0              7,220.0
Global Comms           195,000,000.0        542,000,000.0
Section 3.3 looks at what else hardware designers might be able to do to ease the
plight of parallel programmers. q
Answer:
Get a roll of toilet paper. In the USA, each roll will normally have somewhere around
350-500 sheets. Tear off one sheet to represent a single clock cycle, setting it aside.
Now unroll the rest of the roll.
The resulting pile of toilet paper will likely represent a single CAS cache miss.
For the more-expensive inter-system communications latencies, use several rolls (or
multiple cases) of toilet paper to represent the communications latency.
Important safety tip: make sure to account for the needs of those you live with when
appropriating toilet paper! q
Answer:
Electron drift velocity tracks the long-term movement of individual electrons. It turns
out that individual electrons bounce around quite randomly, so that their instantaneous
speed is very high, but over the long term, they don’t move very far. In this, electrons
resemble long-distance commuters, who might spend most of their time traveling at
full highway speed, but over the long term going nowhere. These commuters’ speed
might be 70 miles per hour (113 kilometers per hour), but their long-term drift velocity
relative to the planet’s surface is zero.
Therefore, we should pay attention not to the electrons’ drift velocity, but to their
instantaneous velocities. However, even their instantaneous velocities are nowhere
near a significant fraction of the speed of light. Nevertheless, the measured velocity of
electric waves in conductors is a substantial fraction of the speed of light, so we still
have a mystery on our hands.
The other trick is that electrons interact with each other at significant distances (from
an atomic perspective, anyway), courtesy of their negative charge. This interaction is
carried out by photons, which do move at the speed of light. So even with electricity’s
electrons, it is photons doing most of the fast footwork.
Extending the commuter analogy, a driver might use a smartphone to inform other
drivers of an accident or congestion, thus allowing a change in traffic flow to propagate
much faster than the instantaneous velocity of the individual cars. Summarizing the
analogy between electricity and traffic flow:
1. The (very low) drift velocity of an electron is similar to the long-term velocity of
a commuter, both being very nearly zero.
2. The (still rather low) instantaneous velocity of an electron is similar to the instan-
taneous velocity of a car in traffic. Both are much higher than the drift velocity,
but quite small compared to the rate at which changes propagate.
Answer:
There are a number of reasons:
1. Shared-memory multiprocessor systems have strict size limits. If you need more
than a few thousand CPUs, you have no choice but to use a distributed system.
It is likely that continued work on parallel applications will increase the number
of embarrassingly parallel applications that can run well on machines and/or clusters
having long communications latencies. That said, greatly reduced hardware latencies
would be an extremely welcome development. q
Answer:
Because it is often the case that only a small fraction of the program is performance-
critical. Shared-memory parallelism allows us to focus distributed-programming tech-
niques on that small fraction, allowing simpler shared-memory techniques to be used on
the non-performance-critical bulk of the program. q
Answer:
They look that way because they are in fact low-level synchronization primitives. And,
as such, they are the fundamental tools for building low-level concurrent software.
q
Answer:
Because you should never forget the simple stuff!
Please keep in mind that the title of this book is “Is Parallel Programming Hard,
And, If So, What Can You Do About It?”. One of the most effective things you can do
about it is to avoid forgetting the simple stuff! After all, if you choose to do parallel
programming the hard way, you have no one but yourself to blame. q
Answer:
One straightforward approach is the shell pipeline, for example:

    grep $pattern1 | sed -e 's/a/b/' | sort

For a sufficiently large input file, grep will pattern-match in parallel with sed
editing and with the input processing of sort. See the file parallel.sh for a
demonstration of shell-script parallelism and pipelining. q
Answer:
In fact, it is quite likely that a very large fraction of parallel programs in use today are
script-based. However, script-based parallelism does have its limitations:
4. Scripting languages are often too slow, but are often quite useful when coordi-
nating execution of long-running programs written in lower-level programming
languages.
Answer:
Some parallel applications need to take special action when specific children exit, and
therefore need to wait for each child individually. In addition, some parallel applications
need to detect the reason that the child died. As we saw in Figure 4.3, it is not hard to
build a waitall() function out of the wait() function, but it would be impossible
to do the reverse. Once the information about a specific child is lost, it is lost. q
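As a rough illustration, a waitall() along the lines described above might look as follows. This is a sketch rather than the book's Figure 4.3: it simply calls wait() until no children remain, discarding each child's status.

#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>

static void waitall(void)
{
	int status;

	for (;;) {
		if (wait(&status) < 0) {
			if (errno == ECHILD)	/* no children left */
				break;
			perror("wait");
			exit(EXIT_FAILURE);
		}
		/* Per-child status is discarded here, illustrating the
		 * information loss discussed above. */
	}
}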
Answer:
Indeed there is, and it is quite possible that this section will be expanded in future
versions to include messaging features (such as UNIX pipes, TCP/IP, and shared file
I/O) and memory mapping (such as mmap() and shmget()). In the meantime, there
are any number of textbooks that cover these primitives in great detail, and the truly
motivated can read manpages, existing parallel applications using these primitives, as
well as the source code of the Linux-kernel implementations themselves.
It is important to note that the parent process in Figure 4.4 waits until after the child
terminates to do its printf(). Using printf()’s buffered I/O concurrently to the
same file from multiple processes is non-trivial, and is best avoided. If you really need
to do concurrent buffered I/O, consult the documentation for your OS. For UNIX/Linux
systems, Stewart Weiss’s lecture notes provide a good introduction with informative
examples [Wei13]. q
Answer:
In this simple example, there is no reason whatsoever. However, imagine a more
complex example, where mythread() invokes other functions, possibly separately
compiled. In such a case, pthread_exit() allows these other functions to end the
thread’s execution without having to pass some sort of error return all the way back up
to mythread(). q
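A minimal sketch of this pattern (the function name is hypothetical, not from the book's code):

#include <pthread.h>

/* Called, perhaps through several levels of separately compiled code,
 * from mythread().  On error it terminates the whole thread without
 * propagating a return value back up the call chain. */
static void deeply_nested_helper(int err)
{
	if (err)
		pthread_exit((void *)1);
}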
Answer:
Ah, but the Linux kernel is written in a carefully selected superset of the C language
that includes special gcc extensions, such as asms, that permit safe execution even
in the presence of data races. In addition, the Linux kernel does not run on a number of
platforms where data races would be especially problematic. For an example, consider
embedded systems with 32-bit pointers and 16-bit busses. On such a system, a data
race involving a store to and a load from a given pointer might well result in the load
returning the low-order 16 bits of the old value of the pointer concatenated with the
high-order 16 bits of the new value of the pointer.
Nevertheless, even in the Linux kernel, data races can be quite dangerous and should
be avoided where feasible [Cor12]. q
Answer:
The first thing you should do is to ask yourself why you would want to do such a thing.
If the answer is “because I have a lot of data that is read by many threads, and only
occasionally updated”, then POSIX reader-writer locks might be what you are looking
for. These are introduced in Section 4.2.4.
Another way to get the effect of multiple threads holding the same lock is for one
thread to acquire the lock, and then use pthread_create() to create the other
threads. The question of why this would ever be a good idea is left to the reader. q
Answer:
Because we will need to pass lock_reader() to pthread_create(). Although
we could cast the function when passing it to pthread_create(), function casts
are quite a bit uglier and harder to get right than are simple pointer casts. q
Answer:
Indeed! And for that reason, the pthread_mutex_lock() and pthread_mutex_
unlock() primitives are normally wrapped in functions that do this error check-
ing. Later on, we will wrap them with the Linux kernel spin_lock() and spin_
unlock() APIs. q
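For example, a pair of error-checking wrappers along the lines described above might look as follows. This is a sketch in the spirit of the spin_lock()/spin_unlock() wrappers mentioned above, not the book's exact code.

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

static void spin_lock(pthread_mutex_t *lp)
{
	int en = pthread_mutex_lock(lp);

	if (en != 0) {
		fprintf(stderr, "pthread_mutex_lock: %s\n", strerror(en));
		abort();
	}
}

static void spin_unlock(pthread_mutex_t *lp)
{
	int en = pthread_mutex_unlock(lp);

	if (en != 0) {
		fprintf(stderr, "pthread_mutex_unlock: %s\n", strerror(en));
		abort();
	}
}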
Answer:
No. The reason that “x = 0” was output was that lock_reader() acquired the lock
first. Had lock_writer() instead acquired the lock first, then the output would
have been “x = 3”. However, because the code fragment started lock_reader()
first and because this run was performed on a multiprocessor, one would normally
expect lock_reader() to acquire the lock first. However, there are no guarantees,
especially on a busy system. q
Answer:
Although it is sometimes possible to write a program using a single global lock that both
performs and scales well, such programs are exceptions to the rule. You will normally
need to use multiple locks to attain good performance and scalability.
One possible exception to this rule is “transactional memory”, which is currently
a research topic. Transactional-memory semantics can be loosely thought of as those
of a single global lock with optimizations permitted and with the addition of roll-
back [Boe09]. q
Answer:
No. On a busy system, lock_reader() might be preempted for the entire dura-
tion of lock_writer()’s execution, in which case it would not see any of lock_
writer()’s intermediate states for x. q
Answer:
See line 3 of Figure 4.6. Because the code in Figure 4.7 ran first, it could rely on the
compile-time initialization of x. The code in Figure 4.8 ran next, so it had to re-initialize
x. q
Answer:
A volatile declaration is in fact a reasonable alternative in this particular case.
However, use of READ_ONCE() has the benefit of clearly flagging to the reader that
goflag is subject to concurrent reads and updates. In addition, READ_ONCE() is
especially useful in cases where most of the accesses are protected by a lock (and thus
not subject to change), but where a few of the accesses are made outside of the lock.
Using a volatile declaration in this case would make it harder for the reader to note the
special accesses outside of the lock, and would also make it harder for the compiler to
generate good code under the lock. q
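The pattern described above might look roughly like this. The names are illustrative, and the macro definitions use gcc's typeof; the volatile-cast form matches the classic ACCESS_ONCE()/READ_ONCE() idiom.

#define READ_ONCE(x)     (*(volatile typeof(x) *)&(x))
#define WRITE_ONCE(x, v) do { *(volatile typeof(x) *)&(x) = (v); } while (0)

static int goflag;	/* 0: wait, 1: go; written by one thread, read by others */

static void wait_for_go(void)
{
	while (!READ_ONCE(goflag))
		continue;	/* spin; READ_ONCE() flags the concurrent access */
}

static void signal_go(void)
{
	WRITE_ONCE(goflag, 1);
}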
Answer:
No, memory barriers are not needed and won’t help here. Memory barriers only enforce
ordering among multiple memory references: They do absolutely nothing to expedite
the propagation of data from one part of the system to another. This leads to a quick
rule of thumb: You do not need memory barriers unless you are using more than one
variable to communicate between multiple threads.
But what about nreadersrunning? Isn’t that a second variable used for com-
munication? Indeed it is, and there really are the needed memory-barrier instructions
Answer:
It depends. If the per-thread variable was accessed only from its thread, and never from
a signal handler, then no. Otherwise, it is quite possible that READ_ONCE() is needed.
We will see examples of both situations in Section 5.4.4.
This leads to the question of how one thread can gain access to another thread’s
__thread variable, and the answer is that the second thread must store a pointer to
its __thread pointer somewhere that the first thread has access to. One common
approach is to maintain a linked list with one element per thread, and to store the address
of each thread’s __thread variable in the corresponding element. q
Answer:
Not at all. In fact, this comparison was, if anything, overly lenient. A more bal-
anced comparison would be against single-CPU throughput with the locking primitives
commented out. q
Answer:
If the data being read never changes, then you do not need to hold any locks while ac-
cessing it. If the data changes sufficiently infrequently, you might be able to checkpoint
execution, terminate all threads, change the data, then restart at the checkpoint.
Another approach is to keep a single exclusive lock per thread, so that a thread
read-acquires the larger aggregate reader-writer lock by acquiring its own lock, and
write-acquires by acquiring all the per-thread locks [HW92]. This can work quite well
for readers, but causes writers to incur increasingly large overheads as the number of
threads increases.
Some other ways of handling very small critical sections are described in Chapter 9.
q
Answer:
Your first clue is that 64 CPUs is exactly half of the 128 CPUs on the machine. The
difference is an artifact of hardware threading. This system has 64 cores with two
hardware threads per core. As long as fewer than 64 threads are running, each can run
in its own core. But as soon as there are more than 64 threads, some of the threads
must share cores. Because the pair of threads in any given core share some hardware
resources, the throughput of two threads sharing a core is not quite as high as that of
two threads each in their own core. So the performance of the 100M trace is limited
not by the reader-writer lock, but rather by the sharing of hardware resources between
hardware threads in a single core.
This can also be seen in the 10M trace, which deviates gently from the ideal line up
to 64 threads, then breaks sharply down, parallel to the 100M trace. Up to 64 threads,
the 10M trace is limited primarily by reader-writer lock scalability, and beyond that,
also by sharing of hardware resources between hardware threads in a single core. q
Answer:
In general, newer hardware is improving. However, it will need to improve more than
two orders of magnitude to permit reader-writer locking to achieve ideal performance
on 128 CPUs. Worse yet, the greater the number of CPUs, the larger the required
performance improvement. The performance problems of reader-writer locking are
therefore very likely to be with us for quite some time to come. q
Answer:
Strictly speaking, no. One could implement any member of the second set using the
corresponding member of the first set. For example, one could implement __sync_
nand_and_fetch() in terms of __sync_fetch_and_nand() as follows:
tmp = v;                              /* v is the value to be NANDed into *p */
ret = __sync_fetch_and_nand(p, tmp);  /* returns the old value of *p */
ret = ~(ret & tmp);                   /* reconstruct the new value per NAND semantics */
Answer:
Unfortunately, no. See Chapter 5 for some stark counterexamples. q
Answer:
They don’t really exist. All tasks executing within the Linux kernel share memory, at
least unless you want to do a huge amount of memory-mapping work by hand. q
Answer:
On CPUs with load-store architectures, incrementing counter might compile into
something like the following:
LOAD counter,r0
INC r0
STORE r0,counter
On such machines, two threads might simultaneously load the value of counter,
each increment it, and each store the result. The new value of counter will then only
be one greater than before, despite two threads each incrementing it. q
Answer:
One approach would be to create an array indexed by smp_thread_id(), and
another would be to use a hash table to map from smp_thread_id() to an array
index—which is in fact what this set of APIs does in pthread environments.
Another approach would be for the parent to allocate a structure containing fields
for each desired per-thread variable, then pass this to the child during thread creation.
However, this approach can impose large software-engineering costs in large systems.
To see this, imagine if all global variables in a large system had to be declared in a single
file, regardless of whether or not they were C static variables! q
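A minimal sketch of the array approach described above (hypothetical names; the hash-table variant is omitted):

#define NR_THREADS 128

static long counter_array[NR_THREADS];

/* Each thread increments the slot selected by its small-integer ID
 * (as returned by smp_thread_id()), playing the role of a per-thread
 * variable. */
static void inc_count(int tid)
{
	counter_array[tid]++;
}

static long read_count(int nthreads)
{
	long sum = 0;
	int t;

	for (t = 0; t < nthreads; t++)
		sum += counter_array[t];
	return sum;
}

In real code, each slot would normally be padded out to a cacheline to avoid false sharing between adjacent threads' counters.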
Answer:
It might well do that; however, checking is left as an exercise for the reader. But in the
meantime, I hope that we can agree that vfork() is a variant of fork(), so that we
can use fork() as a generic term covering both. q
C.5 Counting
Quick Quiz 5.1:
Why on earth should efficient and scalable counting be hard? After all, computers have
special hardware for the sole purpose of doing counting, addition, subtraction, and lots
more besides, don’t they???
Answer:
Because the straightforward counting algorithms, for example, atomic operations on
a shared counter, either are slow and scale badly, or are inaccurate, as will be seen in
Section 5.1. q
Answer:
Hint: The act of updating the counter must be blazingly fast, but because the counter is
read out only about once in five million updates, the act of reading out the counter can be
quite slow. In addition, the value read out normally need not be all that accurate—after
all, since the counter is updated a thousand times per millisecond, we should be able
to work with a value that is within a few thousand counts of the “true value”, whatever
“true value” might mean in this context. However, the value read out should maintain
roughly the same absolute error over time. For example, a 1% error might be just fine
when the count is on the order of a million or so, but might be absolutely unacceptable
once the count reaches a trillion. See Section 5.2. q
Answer:
Hint: The act of updating the counter must again be blazingly fast, but the counter is
read out each time that the counter is increased. However, the value read out need not be
accurate except that it must distinguish approximately between values below the limit
and values greater than or equal to the limit. See Section 5.3. q
always at least one structure in use, and suppose further still that it is necessary to know
exactly when this counter reaches zero, for example, in order to free up some memory
that is not required unless there is at least one structure in use.
Answer:
Hint: The act of updating the counter must once again be blazingly fast, but the counter
is read out each time that the counter is increased. However, the value read out need not
be accurate except that it absolutely must distinguish perfectly between values between
the limit and zero on the one hand, and values that either are less than or equal to zero
or are greater than or equal to the limit on the other hand. See Section 5.4. q
Answer:
Hint: Yet again, the act of updating the counter must be blazingly fast and scalable
in order to avoid slowing down I/O operations, but because the counter is read out
only when the user wishes to remove the device, the counter read-out operation can
be extremely slow. Furthermore, there is no need to be able to read out the counter at
all unless the user has already indicated a desire to remove the device. In addition, the
value read out need not be accurate except that it absolutely must distinguish perfectly
between non-zero and zero values, and even then only when the device is in the process
of being removed. However, once it has read out a zero value, it must act to keep the
value at zero until it has taken some action to prevent subsequent threads from gaining
access to the device being removed. See Section 5.5. q
Answer:
Although the ++ operator could be atomic, there is no requirement that it be so. And
indeed, gcc often chooses to load the value to a register, increment the register, then
store the value to memory, which is decidedly non-atomic. q
Answer:
Not only are there very few trivial parallel programs, but most days I am not so sure
that there are many trivial sequential programs, either.
No matter how small or simple the program, if you haven’t tested it, it does not
work. And even if you have tested it, Murphy’s Law says that there will be at least a
few bugs still lurking.
Furthermore, while proofs of correctness certainly do have their place, they never
will replace testing, including the counttorture.h test setup used here. After all,
proofs are only as good as the assumptions that they are based on. Furthermore, proofs
can have bugs just as easily as programs can! q
Answer:
Because of the overhead of the atomic operation. The dashed line on the x axis
represents the overhead of a single non-atomic increment. After all, an ideal algorithm
would not only scale linearly, it would also incur no performance penalty compared to
single-threaded code.
This level of idealism may seem severe, but if it is good enough for Linus Torvalds,
it is good enough for you. q
Answer:
In many cases, atomic increment will in fact be fast enough for you. In those cases,
you should by all means use atomic increment. That said, there are many real-world
situations where more elaborate counting algorithms are required. The canonical
example of such a situation is counting packets and bytes in highly optimized networking
stacks, where it is all too easy to find much of the execution time going into these sorts
of accounting tasks, especially on large multiprocessors.
In addition, as noted at the beginning of this chapter, counting provides an excellent
view of the issues encountered in shared-memory parallel programs. q
Answer:
It might well be possible to do this in some cases. However, there are a few complica-
tions:
1. If the value of the variable is required, then the thread will be forced to wait for
the operation to be shipped to the data, and then for the result to be shipped back.
2. If the atomic increment must be ordered with respect to prior and/or subsequent
operations, then the thread will be forced to wait for the operation to be shipped
to the data, and for an indication that the operation completed to be shipped back.
3. Shipping operations among CPUs will likely require more lines in the system
interconnect, which will consume more die area and more electrical power.
But what if neither of the first two conditions holds? Then you should think carefully
about the algorithms discussed in Section 5.2, which achieve near-ideal performance on
commodity hardware.
If either or both of the first two conditions hold, there is some hope for improved
hardware. One could imagine the hardware implementing a combining tree, so that the
increment requests from multiple CPUs are combined by the hardware into a single
addition when the combined request reaches the hardware. The hardware could also
apply an order to the requests, thus returning to each CPU the return value corresponding
to its particular atomic increment. This results in instruction latency that varies as
O(logN), where N is the number of CPUs, as shown in Figure C.1. And CPUs with this
sort of hardware optimization are starting to appear as of 2011.
This is a great improvement over the O(N) performance of current hardware shown
in Figure 5.4, and it is possible that hardware latencies might decrease further if innova-
tions such as three-dimensional fabrication prove practical. Nevertheless, we will see
that in some important special cases, software can do much better. q
Answer:
No, because modulo addition is still commutative and associative. At least as long as
you use unsigned integers. Recall that in the C standard, overflow of signed integers
results in undefined behavior, never mind the fact that machines that do anything other
than wrap on overflow are quite rare these days. Unfortunately, compilers frequently
carry out optimizations that assume that signed integers will not overflow, so if your code
allows signed integers to overflow, you can run into trouble even on twos-complement
hardware.
That said, one potential source of additional complexity arises when attempting
to gather (say) a 64-bit sum from 32-bit per-thread counters. Dealing with this added
complexity is left as an exercise for the reader, for whom some of the techniques
introduced later in this chapter could be quite helpful. q
Answer:
It can, and in this toy implementation, it does. But it is not that hard to come up with
an alternative implementation that permits an arbitrary number of threads, for example,
using the gcc __thread facility, as shown in Section 5.2.4. q
Answer:
According to the C standard, the effects of fetching a variable that might be concurrently
modified by some other thread are undefined. It turns out that the C standard really has
no other choice, given that C must support (for example) eight-bit architectures which
are incapable of atomically loading a long. An upcoming version of the C standard
aims to fill this gap, but until then, we depend on the kindness of the gcc developers.
Alternatively, use of volatile accesses such as those provided by ACCESS_ONCE() [Cor12]
can help constrain the compiler, at least in cases where the hardware is capable of ac-
cessing the value with a single memory-reference instruction. q
Answer:
The C standard specifies that the initial value of global variables is zero, unless they
are explicitly initialized. So the initial value of all the instances of counter will be
zero. Furthermore, in the common case where the user is interested only in differences
between consecutive reads from statistical counters, the initial value is irrelevant. q
Answer:
Indeed, this toy example does not support more than one counter. Modifying it so that it
can provide multiple counters is left as an exercise to the reader. q
Answer:
Let’s do worst-case analysis first, followed by a less conservative analysis.
In the worst case, the read operation completes immediately, but is then delayed for
∆ time units before returning, in which case the worst-case error is simply r∆.
This worst-case behavior is rather unlikely, so let us instead consider the case where
the reads from each of the N counters are spaced equally over the time period ∆. There
will be N + 1 intervals of duration ∆/(N + 1) between the N reads. The error due to the delay
after the read from the last thread's counter will be given by r∆/(N(N + 1)), the second-to-last
thread's counter by 2r∆/(N(N + 1)), the third-to-last by 3r∆/(N(N + 1)), and so on. The total error is
given by the sum of the errors due to the reads from each thread's counter, which is:

    \frac{r\Delta}{N(N+1)} \sum_{i=1}^{N} i        (C.1)

Expressing the summation in closed form yields:

    \frac{r\Delta}{N(N+1)} \, \frac{N(N+1)}{2}        (C.2)

Cancelling yields the intuitively expected result:

    \frac{r\Delta}{2}        (C.3)
It is important to remember that error continues accumulating as the caller executes
code making use of the count returned by the read operation. For example, if the caller
spends time t executing some computation based on the result of the returned count, the
worst-case error will have increased to r (∆ + t).
The expected error will have similarly increased to:

    r \left( \frac{\Delta}{2} + t \right)        (C.4)
Of course, it is sometimes unacceptable for the counter to continue incrementing
during the read operation. Section 5.5 discusses a way to handle this situation.
Thus far, we have been considering a counter that is only increased, never decreased.
If the counter value is being changed by r counts per unit time, but in either direction,
we should expect the error to reduce. However, the worst case is unchanged because
although the counter could move in either direction, the worst case is when the read
operation completes immediately, but then is delayed for ∆ time units, during which
time all the changes in the counter’s value move it in the same direction, again giving us
an absolute error of r∆.
There are a number of ways to compute the average error, based on a variety of
assumptions about the patterns of increments and decrements. For simplicity, let’s
assume that the f fraction of the operations are decrements, and that the error of
interest is the deviation from the counter’s long-term trend line. Under this assumption,
if f is less than or equal to 0.5, each decrement will be cancelled by an increment,
so that 2 f of the operations will cancel each other, leaving 1 − 2 f of the operations
being uncancelled increments. On the other hand, if f is greater than 0.5, 1 − f of
the decrements are cancelled by increments, so that the counter moves in the negative
direction by −1 + 2 (1 − f ), which simplifies to 1 − 2 f , so that the counter moves an
average of 1 − 2f per operation in either case. Therefore, the long-term movement
of the counter is given by (1 − 2f)r. Plugging this into Equation C.3 yields:

    \frac{(1 - 2f)\, r\Delta}{2}        (C.5)
All that aside, in most uses of statistical counters, the error in the value returned by
read_count() is irrelevant. This irrelevance is due to the fact that the time required
for read_count() to execute is normally extremely small compared to the time
interval between successive calls to read_count(). q
Answer:
Because one of the two threads only reads, and because the variable is aligned and
machine-sized, non-atomic instructions suffice. That said, the ACCESS_ONCE() macro
is used to prevent compiler optimizations that might otherwise prevent the counter
updates from becoming visible to eventual() [Cor12].
An older version of this algorithm did in fact use atomic instructions; kudos to
Ersoy Bayramoglu for pointing out that they are in fact unnecessary. That said, atomic
instructions would be needed in cases where the per-thread counter variables were
smaller than the global global_count. However, note that on a 32-bit system, the
per-thread counter variables might need to be limited to 32 bits in order to sum them
accurately, but with a 64-bit global_count variable to avoid overflow. In this case,
it is necessary to zero the per-thread counter variables periodically in order to avoid
overflow. It is extremely important to note that this zeroing cannot be delayed too long
or overflow of the smaller per-thread variables will result. This approach therefore
imposes real-time requirements on the underlying system, and in turn must be used with
extreme care.
In contrast, if all variables are the same size, overflow of any variable is harmless
because the eventual sum will be modulo the word size. q
Answer:
In this case, no. What will happen instead is that as the number of threads increases,
the estimate of the counter value returned by read_count() will become more
inaccurate. q
Answer:
Yes. If this proves problematic, one fix is to provide multiple eventual() threads,
each covering its own subset of the other threads. In more extreme cases, a tree-like
hierarchy of eventual() threads might be required. q
updates have extremely low overhead and are extremely scalable, why would anyone
bother with the implementation described in Section 5.2.2, given its costly read-side
code?
Answer:
The thread executing eventual() consumes CPU time. As more of these eventually-
consistent counters are added, the resulting eventual() threads will eventually
consume all available CPUs. This implementation therefore suffers a different sort
of scalability limitation, with the scalability limit being in terms of the number of
eventually consistent counters rather than in terms of the number of threads or CPUs.
Of course, it is possible to make other tradeoffs. For example, a single thread could
be created to handle all eventually-consistent counters, which would limit the overhead
to a single CPU, but would result in increasing update-to-read latencies as the number
of counters increased. Alternatively, that single thread could track the update rates
of the counters, visiting the frequently-updated counters more frequently. In addition,
the number of threads handling the counters could be set to some fraction of the total
number of CPUs, and perhaps also adjusted at runtime. Finally, each counter could
specify its latency, and deadline-scheduling techniques could be used to provide the
required latencies to each counter.
There are no doubt many other tradeoffs that could be made. q
Answer:
Why indeed?
To be fair, gcc faces some challenges that the Linux kernel gets to ignore. When
a user-level thread exits, its per-thread variables all disappear, which complicates the
problem of per-thread-variable access, particularly before the advent of user-level RCU
(see Section 9.5). In contrast, in the Linux kernel, when a CPU goes offline, that CPU’s
per-CPU variables remain mapped and accessible.
Similarly, when a new user-level thread is created, its per-thread variables suddenly
come into existence. In contrast, in the Linux kernel, all per-CPU variables are mapped
and initialized at boot time, regardless of whether the corresponding CPU exists yet, or
indeed, whether the corresponding CPU will ever exist.
A key limitation that the Linux kernel imposes is a compile-time maximum bound
on the number of CPUs, namely, CONFIG_NR_CPUS, along with a typically tighter
boot-time bound of nr_cpu_ids. In contrast, in user space, there is no hard-coded
upper limit on the number of threads.
Of course, both environments must handle dynamically loaded code (dynamic
libraries in user space, kernel modules in the Linux kernel), which increases the com-
plexity of per-thread variables.
These complications make it significantly harder for user-space environments to
provide access to other threads’ per-thread variables. Nevertheless, such access is highly
useful, and it is hoped that it will someday appear. q
Answer:
This is a reasonable strategy. Checking for the performance difference is left as an
exercise for the reader. However, please keep in mind that the fastpath is not read_
count(), but rather inc_count(). q
Answer:
Remember, when a thread exits, its per-thread variables disappear. Therefore, if we
attempt to access a given thread’s per-thread variables after that thread exits, we will get
a segmentation fault. The lock coordinates summation and thread exit, preventing this
scenario.
Of course, we could instead read-acquire a reader-writer lock, but Chapter 9 will
introduce even lighter-weight mechanisms for implementing the required coordination.
Another approach would be to use an array instead of a per-thread variable, which,
as Alexey Roytman notes, would eliminate the tests against NULL. However, array
accesses are often slower than accesses to per-thread variables, and use of an array
would imply a fixed upper bound on the number of threads. Also, note that neither tests
nor locks are needed on the inc_count() fastpath. q
Answer:
This lock could in fact be omitted, but better safe than sorry, especially given that this
function is executed only at thread startup, and is therefore not on any critical path. Now,
if we were testing on machines with thousands of CPUs, we might need to omit the lock,
but on machines with “only” a hundred or so CPUs, there is no need to get fancy. q
Answer:
Remember, the Linux kernel’s per-CPU variables are always accessible, even if the
corresponding CPU is offline—even if the corresponding CPU never existed and never
will exist.
One workaround is to ensure that each thread continues to exist until all threads
are finished, as shown in Figure C.2 (count_tstat.c). Analysis of this code is
left as an exercise to the reader; however, please note that it does not fit well into the
counttorture.h test setup used here. q
Answer:
When counting packets, the counter is only incremented by the value one. On the other
hand, when counting bytes, the counter might be incremented by largish numbers.
Why does this matter? Because in the increment-by-one case, the value returned
will be exact in the sense that the counter must necessarily have taken on that value at
some point in time, even if it is impossible to say precisely when that point occurred.
In contrast, when counting bytes, two different threads might return values that are
inconsistent with any global ordering of operations.
To see this, suppose that thread 0 adds the value three to its counter, thread 1 adds
the value five to its counter, and threads 2 and 3 sum the counters. If the system is
“weakly ordered” or if the compiler uses aggressive optimizations, thread 2 might find
the sum to be three and thread 3 might find the sum to be five. The only possible global
orders of the sequence of values of the counter are 0,3,8 and 0,5,8, and neither order is
consistent with the results obtained.
If you missed this one, you are not alone. Michael Scott used this question to stump
Paul E. McKenney during Paul’s Ph.D. defense. q
Answer:
One approach would be to maintain a global approximation to the value. Readers
would increment their per-thread variable, but when it reached some predefined limit,
atomically add it to a global variable, then zero their per-thread variable. This would
permit a tradeoff between average increment overhead and accuracy of the value read
out.
The reader is encouraged to think up and try out other approaches, for example,
using a combining tree. q
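A rough sketch of the first approach follows. The names and flush threshold are illustrative, and gcc's __thread storage class and __sync_fetch_and_add() builtin are assumed.

#define FLUSH_LIMIT 1000UL

static unsigned long global_approx;
static __thread unsigned long local_count;

static void inc_count(void)
{
	if (++local_count >= FLUSH_LIMIT) {
		__sync_fetch_and_add(&global_approx, local_count);
		local_count = 0;
	}
}

static unsigned long read_approx(void)
{
	/* May understate the true total by up to FLUSH_LIMIT - 1 per thread. */
	return *(volatile unsigned long *)&global_approx;
}

Raising FLUSH_LIMIT reduces the average increment overhead but increases the possible error in the value read out, which is exactly the tradeoff described above.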
Answer:
Because structures come in different sizes. Of course, a limit counter corresponding to a
specific size of structure might still be able to use inc_count() and dec_count().
q
Answer:
Two words. “Integer overflow.”
Try the above formulation with counter equal to 10 and delta equal to ULONG_
MAX. Then try it again with the code shown in Figure 5.12.
A good understanding of integer overflow will be required for the rest of this
example, so if you have never dealt with integer overflow before, please try several
examples to get the hang of it. Integer overflow can sometimes be more difficult to get
right than parallel algorithms! q
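The exact formulation from the Quick Quiz is not reproduced here, but the following self-contained example (using a hypothetical naive limit check) shows how unsigned wraparound can make a wildly over-limit addition appear acceptable:

#include <limits.h>
#include <stdio.h>

int main(void)
{
	unsigned long counter = 10;
	unsigned long delta = ULONG_MAX;
	unsigned long limit = 1000;

	/* Naive check: 10 + ULONG_MAX wraps around to 9, which falsely
	 * appears to be within the limit. */
	if (counter + delta <= limit)
		printf("naive check wrongly passes: sum wrapped to %lu\n",
		       counter + delta);

	/* Overflow-safe check: compare against the remaining headroom. */
	if (delta <= limit - counter)
		printf("safe check passes\n");
	else
		printf("safe check correctly rejects the addition\n");

	return 0;
}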
Answer:
That is in fact what an earlier version of this code did. But addition and subtraction are
extremely cheap, and handling all of the special cases that arise is quite complex. Again,
feel free to try it yourself, but beware of integer overflow! q
Answer:
The globalreserve variable tracks the sum of all threads’ countermax vari-
ables. The sum of these threads’ counter variables might be anywhere from zero to
globalreserve. We must therefore take a conservative approach, assuming that all
threads’ counter variables are full in add_count() and that they are all empty in
sub_count().
But remember this question, as we will come back to it later. q
Answer:
Indeed it will! In many cases, this will be a problem, as discussed in Section 5.3.3, and
in those cases the algorithms from Section 5.4 will likely be preferable. q
Answer:
Given that add_count() takes an unsigned long as its argument, it is going to
be a bit tough to pass it a negative number. And unless you have some anti-matter
memory, there is little point in allowing negative numbers when counting the number of
structures in use! q
Answer:
First, it really is reserving countermax counts (see line 14); however, it adjusts so
that only half of these are actually in use by the thread at the moment. This allows the
thread to use the fastpath for both increments and decrements.
Answer:
The reason this happened is that thread 0’s counter was set to half of its countermax.
Thus, of the quarter assigned to thread 0, half of that quarter (one eighth) came from
globalcount, leaving the other half (again, one eighth) to come from the remaining
count.
There are two purposes for taking this approach: (1) To allow thread 0 to use the
fastpath for decrements as well as increments, and (2) To reduce the inaccuracies if all
threads are monotonically incrementing up towards the limit. To see this last point, step
through the algorithm and watch what it does. q
Answer:
This might well be possible, but great care is required. Note that removing counter
without first zeroing countermax could result in the corresponding thread increasing
counter immediately after it was zeroed, completely negating the effect of zeroing
the counter.
The opposite ordering, namely zeroing countermax and then removing counter,
can also result in a non-zero counter. To see this, consider the following sequence of
events:
4. Thread A, having found that its countermax is non-zero, proceeds to add to its
counter, resulting in a non-zero value for counter.
Answer:
It assumes eight bits per byte. This assumption does hold for all current commodity
microprocessors that can be easily assembled into shared-memory multiprocessors, but
certainly does not hold for all computer systems that have ever run C code. (What could
you do instead in order to comply with the C standard? What drawbacks would it have?)
q
Answer:
There is only one ctrandmax variable per thread. Later, we will see code that needs
to pass other threads’ ctrandmax variables to split_ctrandmax(). q
Answer:
Later, we will see that we need the int return to pass to the atomic_cmpxchg()
primitive. q
Answer:
Replacing the goto with a break would require keeping a flag to determine whether
or not line 15 should return, which is not the sort of thing you want on a fastpath. If
you really hate the goto that much, your best bet would be to pull the fastpath into a
separate function that returned success or failure, with “failure” indicating a need for
the slowpath. This is left as an exercise for goto-hating readers. q
Answer:
Later, we will see how the flush_local_count() function in Figure 5.20 might
update this thread’s ctrandmax variable concurrently with the execution of the fast-
path on lines 8-14 of Figure 5.18. q
Answer:
This other thread cannot refill its ctrandmax until the caller of flush_local_
count() releases the gblcnt_mutex. By that time, the caller of flush_local_
count() will have finished making use of the counts, so there will be no problem with
this other thread refilling—assuming that the value of globalcount is large enough
to permit a refill. q
Answer:
Nothing. Consider the following three cases:
Answer:
The callers of both balance_count() and flush_local_count() hold gblcnt_
mutex, so only one may be executing at a given time. q
Answer:
No. If the signal handler is migrated to another CPU, then the interrupted thread is also
migrated along with it. q
Answer:
To indicate that only the fastpath is permitted to change the theft state, and that if the
thread remains in this state for too long, the thread running the slowpath will resend the
POSIX signal. q
Answer:
Reasons why collapsing the REQ and ACK states would be a very bad idea include:
1. The slowpath uses the REQ and ACK states to determine whether the signal
should be retransmitted. If the states were collapsed, the slowpath would have no
choice but to send redundant signals, which would have the unhelpful effect of
needlessly slowing down the fastpath.
The basic problem here is that the combined REQACK state can be referenced by
both the signal handler and the fastpath. The clear separation maintained by the
four-state setup ensures orderly state transitions.
That said, you might well be able to make a three-state setup work correctly. If you
do succeed, compare carefully to the four-state setup. Is the three-state solution really
preferable, and why or why not? q
Answer:
The first one (on line 11) can be argued to be unnecessary. The last two (lines 14 and 16)
are important. If these are removed, the compiler would be within its rights to rewrite
lines 14-17 as follows:
14 theft = THEFT_READY;
15 if (counting) {
16 theft = THEFT_ACK;
17 }
This would be fatal, as the slowpath might see the transient value of THEFT_READY,
and start stealing before the corresponding thread was ready. q
Answer:
Because the other thread is not permitted to change the value of its countermax
variable unless it holds the gblcnt_mutex lock. But the caller has acquired this lock,
so it is not possible for the other thread to hold it, and therefore the other thread is not
permitted to change its countermax variable. We can therefore safely access it—but
not change it. q
Answer:
There is no need for an additional check. The caller of flush_local_count() has
already invoked globalize_count(), so the check on line 28 will have succeeded,
skipping the later pthread_kill(). q
Answer:
The theft variable must be of type sig_atomic_t to guarantee that it can be safely
shared between the signal handler and the code interrupted by the signal. q
Answer:
Because many operating systems over several decades have had the property of losing
the occasional signal. Whether this is a feature or a bug is debatable, but irrelevant. The
obvious symptom from the user’s viewpoint will not be a kernel bug, but rather a user
application hanging.
Answer:
One approach is to use the techniques shown in Section 5.2.3, summarizing an approx-
imation to the overall counter value in a single variable. Another approach would be
to use multiple threads to carry out the reads, with each such thread interacting with a
specific subset of the updating threads. q
Answer:
One simple solution is to overstate the upper limit by the desired amount. The limiting
case of such overstatement results in the upper limit being set to the largest value that
the counter is capable of representing. q
Answer:
You had better have set the upper limit to be large enough to accommodate the bias, the
expected maximum number of accesses, and enough “slop” to allow the counter to work
efficiently even when the number of accesses is at its maximum. q
Answer:
Strange, perhaps, but true! Almost enough to make you think that the name “reader-
writer lock” was poorly chosen, isn’t it? q
Answer:
A huge number!
Here are a few to start with:
1. There could be any number of devices, so that the global variables are inappropri-
ate, as is the lack of arguments to functions like do_io().
2. Polling loops can be problematic in real systems. In many cases, it is far better to
have the last completing I/O wake up the device-removal thread.
3. The I/O might fail, and so do_io() will likely need a return value.
4. If the device fails, the last I/O might never complete. In such cases, there might
need to be some sort of timeout to allow error recovery.
5. Both add_count() and sub_count() can fail, but their return values are
not checked.
6. Reader-writer locks do not scale well. One way of avoiding the high read-
acquisition costs of reader-writer locks is presented in Chapters 7 and 9.
7. The polling loops result in poor energy efficiency. An event-driven design is
preferable.
Answer:
The read-side code must scan the entire fixed-size array, regardless of the number of
threads, so there is no difference in performance. In contrast, in the last two algorithms,
readers must do more work when there are more threads. In addition, the last two
algorithms interpose an additional level of indirection because they map from integer
thread ID to the corresponding __thread variable. q
Answer:
“Use the right tool for the job.”
As can be seen from Figure 5.3, single-variable atomic increment need not apply
for any job involving heavy use of parallel updates. In contrast, the algorithms shown
in Table 5.1 do an excellent job of handling update-heavy situations. Of course, if you
have a read-mostly situation, you should use something else, for example, an eventually
consistent design featuring a single atomically incremented variable that can be read out
using a single load, similar to the approach used in Section 5.2.3. q
Answer:
That depends on the workload. Note that on a 64-core system, you need more than one
hundred non-atomic operations (with roughly a 40-nanosecond performance gain) to
make up for even one signal (with almost a 5-microsecond performance loss). Although
there is no shortage of workloads with far greater read intensity, you will need to
consider your particular workload.
In addition, although memory barriers have historically been expensive compared
to ordinary instructions, you should check this on the specific hardware you will be
running. The properties of computer hardware do change over time, and algorithms
must change accordingly. q
Answer:
One approach is to give up some update-side performance, as is done with scalable
non-zero indicators (SNZI) [ELLM07]. There are a number of other ways one might
go about this, and these are left as exercises for the reader. Any number of approaches
that apply hierarchy, which replace frequent global-lock acquisitions with local lock
acquisitions corresponding to lower levels of the hierarchy, should work quite well. q
Answer:
In the C++ language, you might well be able to use ++ on a 1,000-digit number,
assuming that you had access to a class implementing such numbers. But as of 2010,
the C language does not permit operator overloading. q
Answer:
Indeed, multiple processes with separate address spaces can be an excellent way to
exploit parallelism, as the proponents of the fork-join methodology and the Erlang
language would be very quick to tell you. However, there are also some advantages to
shared-memory parallelism:
1. Only the most performance-critical portions of the application must be partitioned,
and such portions are usually a small fraction of the application.
2. Although cache misses are quite slow compared to individual register-to-register
instructions, they are typically considerably faster than inter-process-communication
primitives, which in turn are considerably faster than things like TCP/IP network-
ing.
3. Shared-memory multiprocessors are readily available and quite inexpensive, so,
in stark contrast to the 1990s, there is little cost penalty for use of shared-memory
parallelism.
(Figure C.3: the five philosophers, P1 through P5, seated around the table.)
Answer:
One such improved solution is shown in Figure C.3, where the philosophers are simply
provided with an additional five forks. All five philosophers may now eat simultaneously,
and there is never any need for philosophers to wait on one another. In addition, this
approach offers greatly improved disease control.
This solution might seem like cheating to some, but such “cheating” is key to finding
good solutions to many concurrency problems. q
Answer:
Inman was working with protocol stacks, which are normally depicted vertically, with
the application on top and the hardware interconnect on the bottom. Data flows up and
down this stack. “Horizontal parallelism” processes packets from different network con-
nections in parallel, while “vertical parallelism” handles different protocol-processing
steps for a given packet in parallel.
“Vertical parallelism” is also called “pipelining”. q
Answer:
In this case, simply dequeue an item from the non-empty queue, release both locks, and
return. q
Answer:
The best way to answer this is to run lockhdeq.c on a number of different multipro-
cessor systems, and you are encouraged to do so in the strongest possible terms. One
reason for concern is that each operation on this implementation must acquire not one
but two locks.
The first well-designed performance study will be cited.3 Do not forget to compare
to a sequential implementation! q
Answer:
It is optimal in the case where data flow switches direction only rarely. It would of
course be an extremely poor choice if the double-ended queue was being emptied from
both ends concurrently. This of course raises another question, namely, in what possible
universe emptying from both ends concurrently would be a reasonable thing to do.
Work-stealing queues are one possible answer to this question. q
Answer:
The need to avoid deadlock by imposing a lock hierarchy forces the asymmetry, just
as it does in the fork-numbering solution to the Dining Philosophers Problem (see
Section 6.1.1). q
Answer:
This retry is necessary because some other thread might have enqueued an element
between the time that this thread dropped d->rlock on line 25 and the time that it
reacquired this same lock on line 27. q
3 The studies by Dalessandro et al. [DCW+ 11] and Dice et al. [DLM+ 10] are good starting points.
Answer:
It would be possible to use spin_trylock() to attempt to acquire the left-hand
lock when it was available. However, the failure case would still need to drop the
right-hand lock and then re-acquire the two locks in order. Making this transformation
(and determining whether or not it is worthwhile) is left as an exercise for the reader. q
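A hypothetical sketch of that transformation, using POSIX spinlocks rather than the Linux-kernel primitives, and assuming the caller already holds the right-hand lock:

#include <pthread.h>

struct pdeq {
	pthread_spinlock_t llock;
	pthread_spinlock_t rlock;
	/* ... queue elements omitted ... */
};

/* Caller holds d->rlock and now needs d->llock as well. */
static void acquire_left_also(struct pdeq *d)
{
	if (pthread_spin_trylock(&d->llock) != 0) {
		/* Trylock failed: fall back to the deadlock-free order. */
		pthread_spin_unlock(&d->rlock);
		pthread_spin_lock(&d->llock);
		pthread_spin_lock(&d->rlock);
		/* Queue state may have changed while rlock was dropped,
		 * so the caller must recheck, as discussed above. */
	}
}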
Answer:
There are actually at least three. The third, by Dominik Dingel, makes interesting use of
reader-writer locking, and may be found in lockrwdeq.c. q
Answer:
The hashed double-ended queue’s locking design only permits one thread at a time at
each end, and further requires two lock acquisitions for each operation. The tandem
double-ended queue also permits one thread at a time at each end, and in the common
case requires only one lock acquisition per operation. Therefore, the tandem double-
ended queue should be expected to outperform the hashed double-ended queue.
Can you create a double-ended queue that allows multiple concurrent operations at
each end? If so, how? If not, why not? q
Answer:
One approach is to transform the problem to be solved so that multiple double-ended
queues can be used in parallel, allowing the simpler single-lock double-ended queue to
be used, and perhaps also replace each double-ended queue with a pair of conventional
single-ended queues. Without such “horizontal scaling”, the speedup is limited to 2.0. In
contrast, horizontal-scaling designs can achieve very large speedups, and are especially
attractive if there are multiple threads working either end of the queue, because in this
multiple-thread case the dequeue simply cannot provide strong ordering guarantees.
After all, the fact that a given thread removed an item first in no way implies that it will
process that item first [HKLP12]. And if there are no guarantees, we may as well obtain
the performance benefits that come with refusing to provide these guarantees.
Regardless of whether or not the problem can be transformed to use multiple queues,
it is worth asking whether work can be batched so that each enqueue and dequeue oper-
ation corresponds to larger units of work. This batching approach decreases contention
on the queue data structures, which increases both performance and scalability, as will
be seen in Section 6.3. After all, if you must incur high synchronization overheads, be
sure you are getting your money’s worth.
Other researchers are working on other ways to take advantage of limited ordering
guarantees in queues [KLP12]. q
Answer:
Although non-blocking synchronization can be very useful in some situations, it is no
panacea. Also, non-blocking synchronization really does have critical sections, as noted
by Josh Triplett. For example, in a non-blocking algorithm based on compare-and-swap
operations, the code starting at the initial load and continuing to the compare-and-swap
is in many ways analogous to a lock-based critical section. q
Answer:
Here are a few possible solutions to this existence guarantee problem:
1. Provide a statically allocated lock that is held while the per-structure lock is being
acquired, which is an example of hierarchical locking (see Section 6.4.2). Of
course, using a single global lock for this purpose can result in unacceptably high
levels of lock contention, dramatically reducing performance and scalability.
6. Use transactional memory (TM) [HM93, Lom77, ST95], so that each reference
and modification to the data structure in question is performed atomically. Al-
though TM has engendered much excitement in recent years, and seems likely
to be of some use in production software, developers should exercise some cau-
tion [BLM05, BLM06, MMW07], particularly in performance-critical code. In
particular, existence guarantees require that the transaction cover the full path
from a global reference to the data elements being updated.
Answer:
The matmul.c program creates the specified number of worker threads, so even the
single-worker-thread case incurs thread-creation overhead. Making the changes required
to optimize away thread-creation overhead in the single-worker-thread case is left as an
exercise to the reader. q
Answer:
I am glad that you are paying attention! This example serves to show that although
data parallelism can be a very good thing, it is not some magic wand that automatically
wards off any and all sources of inefficiency. Linear scaling at full performance, even to
“only” 64 threads, requires care at all phases of design and implementation.
In particular, you need to pay careful attention to the size of the partitions. For
example, if you split a 64-by-64 matrix multiply across 64 threads, each thread gets
only 64 floating-point multiplies. The cost of a floating-point multiply is minuscule
compared to the overhead of thread creation.
Moral: If you have a parallel program with variable input, always include a check
for the input size being too small to be worth parallelizing. And when it is not helpful to
parallelize, it is not helpful to incur the overhead required to spawn a thread, now is it?
q
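To make the moral concrete, here is a minimal sketch of such a check. The function names and the threshold value are illustrative assumptions, not taken from matmul.c:

#define PARALLEL_THRESHOLD 256  /* Illustrative; tune for your system. */

void matmul_seq(double *a, double *b, double *c, int n);
void matmul_parallel(double *a, double *b, double *c, int n, int nthreads);

void matmul(double *a, double *b, double *c, int n, int nthreads)
{
  /* Too little work per thread?  Then don't spawn threads at all. */
  if (n < PARALLEL_THRESHOLD || nthreads < 2)
    matmul_seq(a, b, c, n);
  else
    matmul_parallel(a, b, c, n, nthreads);
}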
Answer:
If the comparison on line 31 of Figure 6.26 were replaced by a much heavier-weight
Answer:
This is due to the per-CPU target value being three. A run length of 12 must acquire the
global-pool lock twice, while a run length of 13 must acquire the global-pool lock three
times. q
Answer:
This solution is adapted from one put forward by Alexey Roytman. It is based on the
following definitions:
i Number of blocks left in the initializing thread’s per-thread pool. (This is one reason
you needed to look at the code!)
p Per-thread maximum block consumption, including both the blocks actually allocated
and the blocks remaining in the per-thread pool.
The values g, m, and n are given. The value for p is m rounded up to the next
multiple of s, as follows:
p = s \left\lfloor \frac{m + s - 1}{s} \right\rfloor \qquad \text{(C.6)}
The value for i is as follows:
i = \begin{cases} 2s & \text{if } g \bmod 2s = 0 \\ g \bmod 2s & \text{if } g \bmod 2s \neq 0 \end{cases} \qquad \text{(C.7)}
The relationships between these quantities are shown in Figure C.4. The global pool
is shown on the top of this figure, and the “extra” initializer thread’s per-thread pool
and per-thread allocations are the left-most pair of boxes. The initializer thread has no
blocks allocated, but has i blocks stranded in its per-thread pool. The rightmost two
pairs of boxes are the per-thread pools and per-thread allocations of threads holding
the maximum possible number of blocks, while the second-from-left pair of boxes
represents the thread currently trying to allocate.
The total number of blocks is g, and adding up the per-thread allocations and
per-thread pools, we see that the global pool contains g − i − p(n − 1) blocks. If the
allocating thread is to be successful, it needs at least m blocks in the global pool, in
other words:
g − i − p(n − 1) ≥ m (C.8)
The question has g = 40, s = 3, and n = 2. Equation C.7 gives i = 4, and Equa-
tion C.6 gives p = 18 for m = 18 and p = 21 for m = 19. Plugging these into Equa-
tion C.8 shows that m = 18 will not overflow, but that m = 19 might well do so.
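As a quick check of the arithmetic, plugging these values into Equation C.8 gives 40 − 4 − 18(2 − 1) = 18 ≥ 18 for m = 18, but only 40 − 4 − 21(2 − 1) = 15 < 19 for m = 19.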
The presence of i could be considered to be a bug. After all, why allocate memory
only to have it stranded in the initialization thread’s cache? One way of fixing this
would be to provide a memblock_flush() function that flushed the current thread’s
pool into the global pool. The initialization thread could then invoke this function after
freeing all of the blocks. q
C.7 Locking
Quick Quiz 7.1:
Just how can serving as a whipping boy be considered to be in any way honorable???
Answer:
The reason locking serves as a research-paper whipping boy is because it is heavily used
in practice. In contrast, if no one used or cared about locking, most research papers
would not bother even mentioning it. q
and waiting on another lock that was held by some thread. How do you know that there
is a cycle?
Answer:
Suppose that there is no cycle in the graph. We would then have a directed acyclic graph
(DAG), which would have at least one leaf node.
If this leaf node was a lock, then we would have a thread that was waiting on a lock
that wasn’t held by any thread, which violates the definition. (And in this case the thread
would immediately acquire the lock.)
On the other hand, if this leaf node was a thread, then we would have a thread that
was not waiting on any lock, again violating the definition. (And in this case, the thread
would either be running or be blocked on something that is not a lock.)
Therefore, given this definition of deadlock, there must be a cycle in the correspond-
ing graph. q
Answer:
Indeed there are! Here are a few of them:
1. If one of the library function’s arguments is a pointer to a lock that this library
function acquires, and if the library function holds one of its locks while acquiring
the caller’s lock, then we could have a deadlock cycle involving both caller and
library locks.
2. If one of the library functions returns a pointer to a lock that is acquired by the
caller, and if the caller acquires one of its locks while holding the library’s lock,
we could again have a deadlock cycle involving both caller and library locks.
3. If one of the library functions acquires a lock and then returns while still holding
it, and if the caller acquires one of its locks, we have yet another way to create a
deadlock cycle involving both caller and library locks.
4. If the caller has a signal handler that acquires locks, then the deadlock cycle can
involve both caller and library locks. In this case, however, the library’s locks are
innocent bystanders in the deadlock cycle. That said, please note that acquiring a
lock from within a signal handler is a no-no in most environments—it is not just
a bad idea, it is unsupported.
Answer:
By privatizing the data elements being compared (as discussed in Chapter 8) or through
use of deferral mechanisms such as reference counting (as discussed in Chapter 9). q
Answer:
Locking primitives, of course! q
Answer:
Absolutely not!
Consider a program that acquires mutex_a, and then mutex_b, in that order, and
then passes mutex_a to pthread_cond_wait. Now, pthread_cond_wait
will release mutex_a, but will re-acquire it before returning. If some other thread
acquires mutex_a in the meantime and then blocks on mutex_b, the program will
deadlock. q
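The following minimal sketch, with illustrative names, shows one way this deadlock can arise: Thread A holds mutex_b across the wait, Thread B holds mutex_a while blocking on mutex_b, and the awakened Thread A can never re-acquire mutex_a.

#include <pthread.h>
#include <stddef.h>

pthread_mutex_t mutex_a = PTHREAD_MUTEX_INITIALIZER;
pthread_mutex_t mutex_b = PTHREAD_MUTEX_INITIALIZER;
pthread_cond_t cond = PTHREAD_COND_INITIALIZER;

void *thread_A(void *arg)
{
  pthread_mutex_lock(&mutex_a);
  pthread_mutex_lock(&mutex_b);
  /* Releases mutex_a while waiting, but continues to hold mutex_b. */
  pthread_cond_wait(&cond, &mutex_a);
  pthread_mutex_unlock(&mutex_b);
  pthread_mutex_unlock(&mutex_a);
  return NULL;
}

void *thread_B(void *arg)
{
  pthread_mutex_lock(&mutex_a);  /* Succeeds while A is in cond_wait(). */
  pthread_cond_signal(&cond);    /* A now needs mutex_a in order to return. */
  pthread_mutex_lock(&mutex_b);  /* Blocks: mutex_b is held by A. */
  pthread_mutex_unlock(&mutex_b);
  pthread_mutex_unlock(&mutex_a);
  return NULL;
}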
Answer:
Absolutely not!
This transformation assumes that the layer_2_processing() function is
idempotent, given that it might be executed multiple times on the same packet when the
layer_1() routing decision changes. Therefore, in real life, this transformation can
become arbitrarily complex. q
Answer:
Maybe.
If the routing decision in layer_1() changes often enough, the code will always
retry, never making forward progress. This is termed “livelock” if no thread makes any
forward progress or “starvation” if some threads make forward progress but others do
not (see Section 7.1.2). q
Answer:
Provide an additional global lock. If a given thread has repeatedly tried and failed to
acquire the needed locks, then have that thread unconditionally acquire the new global
lock, and then unconditionally acquire any needed locks. (Suggested by Doug Lea.) q
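A hedged sketch of this approach, using pthreads and hypothetical names, might look as follows; after repeated conditional-locking failures, the thread serializes on the global lock and then acquires the needed locks unconditionally:

#include <pthread.h>

#define MAX_TRIES 10  /* Illustrative retry limit. */

pthread_mutex_t global_lock = PTHREAD_MUTEX_INITIALIZER;

void acquire_pair(pthread_mutex_t *lock1, pthread_mutex_t *lock2)
{
  int i;

  for (i = 0; i < MAX_TRIES; i++) {
    pthread_mutex_lock(lock1);
    if (pthread_mutex_trylock(lock2) == 0)
      return;                    /* Got both locks. */
    pthread_mutex_unlock(lock1); /* Back off and retry. */
  }
  /* Repeated failure: take the global lock, then acquire unconditionally. */
  pthread_mutex_lock(&global_lock);
  pthread_mutex_lock(lock1);
  pthread_mutex_lock(lock2);
  pthread_mutex_unlock(&global_lock);
}

Because the retry loop never blocks while holding a lock, a thread in the fallback path cannot deadlock with threads still in the retry loop.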
Answer:
Because this would lead to deadlock. Given that Lock A is held outside of a signal
handler without blocking signals, a signal might be handled while holding this lock.
The corresponding signal handler might then acquire Lock B, so that Lock B is acquired
while holding Lock A. Therefore, if we also acquire Lock A while holding Lock B as
called out in the question, we will have a deadlock cycle.
Therefore, it is illegal to acquire a lock that is acquired outside of a signal handler
without blocking signals while holding another lock that is acquired within a signal
handler. q
Answer:
One of the simplest and fastest ways to do so is to use the sa_mask field of the
struct sigaction that you pass to sigaction() when setting up the signal. q
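For example, the following sketch (with illustrative signal numbers and handler) arranges for SIGUSR2 to be blocked while SIGUSR1's handler runs, so that a lock acquired in the SIGUSR1 handler cannot be interrupted by a SIGUSR2 handler that acquires that same lock:

#include <signal.h>
#include <string.h>

static void sigusr1_handler(int sig)
{
  /* ... handler body, possibly acquiring a signal-safe lock ... */
}

void install_sigusr1_handler(void)
{
  struct sigaction sa;

  memset(&sa, 0, sizeof(sa));
  sa.sa_handler = sigusr1_handler;
  sigemptyset(&sa.sa_mask);
  sigaddset(&sa.sa_mask, SIGUSR2); /* Blocked while the handler runs. */
  sigaction(SIGUSR1, &sa, NULL);
}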
Answer:
Because these same rules apply to the interrupt handlers used in operating-system
kernels and in some embedded applications.
In many application environments, acquiring locks in signal handlers is frowned
upon [Ope97]. However, that does not stop clever developers from (usually unwisely)
fashioning home-brew locks out of atomic operations. And atomic operations are in
many cases perfectly legal in signal handlers. q
Answer:
There are a number of approaches:
1. In the case of parametric search via simulation, where a large number of sim-
ulations will be run in order to converge on (for example) a good design for a
mechanical or electrical device, leave the simulation single-threaded, but run many
instances of the simulation in parallel. This retains the object-oriented design,
and gains parallelism at a higher level, and likely also avoids synchronization
overhead.
4 Also known as “object-oriented spaghetti code.”
2. Partition the objects into groups such that there is no need to operate on ob-
jects in more than one group at a given time. Then associate a lock with each
group. This is an example of a single-lock-at-a-time design, which is discussed in
Section 7.1.1.7.
3. Partition the objects into groups such that threads can all operate on objects in the
groups in some groupwise ordering. Then associate a lock with each group, and
impose a locking hierarchy over the groups.
4. Impose an arbitrarily selected hierarchy on the locks, and then use conditional
locking if it is necessary to acquire a lock out of order, as was discussed in
Section 7.1.1.5.
5. Before carrying out a given group of operations, predict which locks will be
acquired, and attempt to acquire them before actually carrying out any updates.
If the prediction turns out to be incorrect, drop all the locks and retry with an
updated prediction that includes the benefit of experience. This approach was
discussed in Section 7.1.1.6.
Answer:
Figure 7.10 provides some good hints. In many cases, livelocks are a hint that you
should revisit your locking design. Or visit it in the first place if your locking design
“just grew”.
That said, one good-and-sufficient approach due to Doug Lea is to use conditional
locking as described in Section 7.1.1.5, but combine this with acquiring all needed
locks first, before modifying shared data, as described in Section 7.1.1.6. If a given
critical section retries too many times, unconditionally acquire a global lock, then
unconditionally acquire all the needed locks. This avoids both deadlock and livelock,
and scales reasonably assuming that the global lock need not be acquired too often. q
Answer:
Here are a couple:
1. A one-second wait is way too long for most uses. Wait intervals should begin
with roughly the time required to execute the critical section, which will normally
be in the microsecond or millisecond range.
2. The code does not check for overflow. On the other hand, this bug is nullified by
the previous bug: 32 bits worth of seconds is more than 50 years.
Answer:
It would be better in some sense, but there are situations where it can be appropriate to
use designs that sometimes result in high lock contentions.
For example, imagine a system that is subject to a rare error condition. It might
well be best to have a simple error-handling design that has poor performance and
scalability for the duration of the rare error condition, as opposed to a complex and
difficult-to-debug design that is helpful only when one of those rare error conditions is
in effect.
That said, it is usually worth putting some effort into attempting to produce a design
that is both simple and efficient during error conditions, for example by partitioning
the problem. q
Answer:
If the data protected by the lock is in the same cache line as the lock itself, then attempts
by other CPUs to acquire the lock will result in expensive cache misses on the part of
the CPU holding the lock. This is a special case of false sharing, which can also occur if
a pair of variables protected by different locks happen to share a cache line. In contrast,
if the lock is in a different cache line than the data that it protects, the CPU holding the
lock will usually suffer a cache miss only on first access to a given variable.
Of course, the downside of placing the lock and data into separate cache lines is that
the code will incur two cache misses rather than only one in the uncontended case. q
Answer:
This usage is rare, but is occasionally used. The point is that the semantics of exclu-
sive locks have two components: (1) the familiar data-protection semantic and (2) a
messaging semantic, where releasing a given lock notifies a waiting acquisition of
that same lock. An empty critical section uses the messaging component without the
data-protection component.
The rest of this answer provides some example uses of empty critical sections;
however, these examples should be considered “gray magic.”5 As such, empty critical
sections are almost never used in practice. Nevertheless, pressing on into this gray area
...
One historical use of empty critical sections appeared in the networking stack of the
2.4 Linux kernel. This usage pattern can be thought of as a way of approximating the
effects of read-copy update (RCU), which is discussed in Section 9.5.
The empty-lock-critical-section idiom can also be used to reduce lock contention in
some situations. For example, consider a multithreaded user-space application where
each thread processes units of work maintained in a per-thread list, and where threads
are prohibited from touching each other's lists. There could also be updates that require that
all previously scheduled units of work have completed before the update can progress.
One way to handle this is to schedule a unit of work on each thread, so that when all of
these units of work complete, the update may proceed.
In some applications, threads can come and go. For example, each thread might
correspond to one user of the application, and thus be removed when that user logs
out or otherwise disconnects. In many applications, threads cannot depart atomically:
They must instead explicitly unravel themselves from various portions of the application
using a specific sequence of actions. One specific action will be refusing to accept
further requests from other threads, and another specific action will be disposing of any
remaining units of work on its list, for example, by placing these units of work in a
global work-item-disposal list to be taken by one of the remaining threads. (Why not
just drain the thread’s work-item list by executing each item? Because a given work
item might generate more work items, so that the list could not be drained in a timely
fashion.)
If the application is to perform and scale well, a good locking design is required.
One common solution is to have a global lock (call it G) protecting the entire process
of departing (and perhaps other things as well), with finer-grained locks protecting the
individual unraveling operations.
Now, a departing thread must clearly refuse to accept further requests before dis-
posing of the work on its list, because otherwise additional work might arrive after
the disposal action, which would render that disposal action ineffective. So simplified
pseudocode for a departing thread might be as follows:
1. Acquire lock G.
2. Acquire the lock guarding communications.
3. Refuse further communications from other threads.
8. Release lock G.
Of course, a thread that needs to wait for all pre-existing work items will need to take
departing threads into account. To see this, suppose that this thread starts waiting for all
pre-existing work items just after a departing thread has refused further communications
from other threads. How can this thread wait for the departing thread’s work items to
complete, keeping in mind that threads are not allowed to access each others’ lists of
work items?
One straightforward approach is for this thread to acquire G and then the lock
guarding the global work-item-disposal list, then move the work items to its own list.
The thread then releases both locks, places a work item on the end of its own list, and then
waits for all of the work items that it placed on each thread's list (including its own) to
complete.
This approach does work well in many cases, but if special processing is required
for each work item as it is pulled in from the global work-item-disposal list, the result
could be excessive contention on G. One way to avoid that contention is to acquire G
and then immediately release it. Then the process of waiting for all prior work items
looks something like the following:
2. Send a message to all threads to cause them to atomically increment the global
counter, and then to enqueue a work item. The work item will atomically decre-
ment the global counter, and if the result is zero, it will set a condition variable to
one.
3. Acquire G, which will wait on any currently departing thread to finish departing.
Because only one thread may depart at a time, all the remaining threads will have
already received the message sent in the preceding step.
4. Release G.
6. Move all work items from the global work-item-disposal list to this thread’s list,
processing them as needed along the way.
8. Enqueue an additional work item onto this thread’s list. (As before, this work
item will atomically decrement the global counter, and if the result is zero, it will
set a condition variable to one.)
Once this procedure completes, all pre-existing work items are guaranteed to have
completed. The empty critical sections are using locking for messaging as well as for
protection of data. q
Answer:
There are in fact several. One way would be to use the null, protected-read, and exclusive
modes. Another way would be to use the null, protected-read, and concurrent-write
modes. A third way would be to use the null, concurrent-read, and exclusive modes. q
Answer:
Conditionally acquiring a single global lock does work very well, but only for relatively
small numbers of CPUs. To see why it is problematic in systems with many hundreds
of CPUs, look at Figure 5.3 and extrapolate the delay from eight to 1,000 CPUs. q
Answer:
How indeed? This just shows that in concurrency, just as in life, one should take care to
learn exactly what winning entails before playing the game. q
Answer:
Because this default initialization does not apply to locks allocated as auto variables
within the scope of a function. q
Answer:
Suppose that the lock is held and that several threads are attempting to acquire the lock.
In this situation, if these threads all loop on the atomic exchange operation, they will
ping-pong the cache line containing the lock among themselves, imposing load on the
interconnect. In contrast, if these threads are spinning in the inner loop on lines 7-8,
they will each spin within their own caches, putting negligible load on the interconnect.
q
Answer:
This can be a legitimate implementation, but only if this store is preceded by a memory
barrier and makes use of ACCESS_ONCE(). The memory barrier is not required when
the xchg() operation is used because this operation implies a full memory barrier due
to the fact that it returns a value. q
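A hedged sketch of the pattern discussed in this and the preceding answer, assuming the xchg(), ACCESS_ONCE(), and smp_mb() primitives used elsewhere in this book (and not necessarily identical to the figure under discussion), might look like this:

typedef struct {
  int lockword;  /* 0: lock is free, 1: lock is held. */
} xchglock_t;

static void xchg_lock(xchglock_t *lp)
{
  while (xchg(&lp->lockword, 1) == 1) {
    /* Spin on ordinary loads in the local cache until the lock
     * looks free, then retry the atomic exchange. */
    while (ACCESS_ONCE(lp->lockword) == 1)
      continue;
  }
}

static void xchg_unlock(xchglock_t *lp)
{
  /* A plain store suffices only when preceded by a memory barrier,
   * as noted above; using xchg() instead would supply the barrier
   * implicitly, because it returns a value. */
  smp_mb();
  ACCESS_ONCE(lp->lockword) = 0;
}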
Answer:
In the C language, the following macro correctly handles this:
#define ULONG_CMP_LT(a, b) \
(ULONG_MAX / 2 < (a) - (b))
Although it is tempting to simply subtract two signed integers, this should be avoided
because signed overflow is undefined in the C language. For example, if the compiler
knows that one of the values is positive and the other negative, it is within its rights
to simply assume that the positive number is greater than the negative number, even
though subtracting the negative number from the positive number might well result in
overflow and thus a negative number.
How could the compiler know the signs of the two numbers? It might be able to
deduce it based on prior assignments and comparisons. In this case, if the per-CPU
counters were signed, the compiler could deduce that they were always increasing in
value, and then might assume that they would never go negative. This assumption could
well lead the compiler to generate unfortunate code [McK12c, Reg10]. q
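The following standalone check (a sketch, not from the book's code samples) illustrates the macro's wraparound behavior:

#include <assert.h>
#include <limits.h>

#define ULONG_CMP_LT(a, b) (ULONG_MAX / 2 < (a) - (b))

int main(void)
{
  assert(ULONG_CMP_LT(1UL, 2UL));        /* 1 is "before" 2. */
  assert(!ULONG_CMP_LT(2UL, 1UL));
  assert(ULONG_CMP_LT(ULONG_MAX, 2UL));  /* Wraparound: ULONG_MAX is
                                          * "just before" 2. */
  return 0;
}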
Answer:
The flag approach will normally suffer fewer cache misses, but a better answer is to try
both and see which works best for your particular workload. q
Answer:
Here are some bugs resulting from improper use of implicit existence guarantees:
1. A program writes the address of a global variable to a file, then a later instance
of that same program reads that address and attempts to dereference it. This can
fail due to address-space randomization, to say nothing of recompilation of the
program.
2. A module can record the address of one of its variables in a pointer located in
some other module, then attempt to dereference that pointer after the module has
been unloaded.
3. A function can record the address of one of its on-stack variables into a global
pointer, which some other function might attempt to dereference after that function
has returned.
Answer:
This is a very simple hash table with no chaining, so the only element in a given bucket
is the first element. The reader is invited to adapt this example to a hash table with full
chaining. q
Answer:
Consider the following sequence of events:
1. Thread 0 invokes delete(0), and reaches line 10 of the figure, acquiring the
lock.
2. Thread 1 concurrently invokes delete(0), reaching line 10, but spins on the
lock because Thread 0 holds it.
3. Thread 0 executes lines 11-14, removing the element from the hashtable, releasing
the lock, and then freeing the element.
4. Thread 0 continues execution, and allocates memory, getting the exact block of
memory that it just freed.
5. Thread 0 then initializes this block of memory as some other type of structure.
6. Thread 1’s spin_lock() operation fails due to the fact that what it believes to
be p->lock is no longer a spinlock.
Because there is no existence guarantee, the identity of the data element can change
while a thread is attempting to acquire that element’s lock on line 10! q
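C.8 Data Ownership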
Answer:
Use of auto variables in functions. By default, these are private to the thread executing
the current function. q
Answer:
The creation of the threads via the sh & operator and the joining of threads via the sh
wait command.
Of course, if the processes explicitly share memory, for example, using the shmget()
or mmap() system calls, explicit synchronization might well be needed when accessing
or updating the shared memory. The processes might also synchronize using any of
the following interprocess communications mechanisms:
1. System V semaphores.
2. System V message queues.
3. UNIX-domain sockets.
4. Networking protocols, including TCP/IP, UDP, and a whole host of others.
5. File locking.
6. Use of the open() system call with the O_CREAT and O_EXCL flags (see the sketch below).
7. Use of the rename() system call.
A complete list of possible synchronization mechanisms is left as an exercise to the
reader, who is warned that it will be an extremely long list. A surprising number of
unassuming system calls can be pressed into service as synchronization mechanisms. q
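As one concrete illustration of item 6 above, the following sketch (with an illustrative retry policy) presses open() with O_CREAT and O_EXCL into service as a crude inter-process lock:

#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

int acquire_file_lock(const char *path)
{
  int fd;

  for (;;) {
    fd = open(path, O_CREAT | O_EXCL | O_RDWR, 0600);
    if (fd >= 0)
      return fd;     /* We created the file: lock held. */
    if (errno != EEXIST)
      return -1;     /* Unexpected error. */
    usleep(1000);    /* Held by another process; retry. */
  }
}

void release_file_lock(const char *path, int fd)
{
  close(fd);
  unlink(path);      /* Removing the file releases the lock. */
}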
Answer:
That is a philosophical question.
Those wishing the answer “no” might argue that processes by definition do not share
memory.
Those wishing to answer “yes” might list a large number of synchronization mecha-
nisms that do not require shared memory, note that the kernel will have some shared
state, and perhaps even argue that the assignment of process IDs (PIDs) constitutes shared
data.
Such arguments are excellent intellectual exercise, and are also a wonderful way
of feeling intelligent, scoring points against hapless classmates or colleagues, and
(especially!) avoiding getting anything useful done. q
Answer:
Amazingly enough, yes. One example is a simple message-passing system where
threads post messages to other threads’ mailboxes, and where each thread is responsible
for removing any message it sent once that message has been acted on. Implementation
of such an algorithm is left as an exercise for the reader, as is the task of identifying
other algorithms with similar ownership patterns. q
Answer:
There is a very large number of such mechanisms, including:
3. Shared-memory mailboxes.
4. UNIX-domain sockets.
Answer:
The key phrase is “owns the rights to the data”. In this case, the rights in question are
the rights to access the per-thread counter variable defined on line 1 of the figure.
This situation is similar to that described in Section 8.2.
However, there really is data that is owned by the eventual() thread, namely the
t and sum variables defined on lines 17 and 18 of the figure.
For other examples of designated threads, look at the kernel threads in the Linux
kernel, for example, those created by kthread_create() and kthread_run().
q
Answer:
Yes. One approach is for read_count() to add the value of its own per-thread
variable. This maintains full ownership and performance, but provides only a slight improvement
in accuracy, particularly on systems with very large numbers of threads.
Another approach is for read_count() to use function shipping, for example,
in the form of per-thread signals. This greatly improves accuracy, but at a significant
performance cost for read_count().
However, both of these methods have the advantage of eliminating cache-line
bouncing for the common case of updating counters. q
Answer:
To greatly increase the probability of finding bugs. A small torture-test program
(routetorture.h) that allocates and frees only one type of structure can toler-
ate a surprisingly large amount of use-after-free misbehavior. See Figure 11.4 on
page 284 and the related discussion in Section 11.6.4 starting on page 287 for more on
the importance of increasing the probability of finding bugs. q
Answer:
Because the traversal is already protected by the lock, so no additional protection is
required. q
Answer:
The stair-steps are due to hyperthreading. On this particular system, the hardware
threads in a given core have consecutive CPU numbers. In addition, this particular
pointer-following low-cache-miss-rate workload seems to allow a single hardware thread
to consume most of the relevant resources within its core. Workloads featuring heavier
computational loads should be expected to gain greater benefit from each core’s second
hardware thread. q
Answer:
Given the horrible scalability of reference counting, who needs more than eight CPUs?
Four CPUs would have sufficed to make the point! However, people wanting more
CPUs are urged to refer to Chapter 10. q
Answer:
That sentence did say “reduced the usefulness”, not “eliminated the usefulness”, now
didn’t it?
Please see Section 13.2, which discusses some of the techniques that the Linux
kernel uses to take advantage of reference counting in a highly concurrent environment.
q
Answer:
Because hp_record() must check for concurrent modifications. To do that job, it
needs a pointer to a pointer to the element, so that it can check for a modification to the
pointer to the element. q
Answer:
It might be inefficient in some sense, but the fact is that such restarting is absolutely
required for correctness. To see this, consider a hazard-pointer-protected linked list
containing elements A, B, and C that is subjected to the following sequence of events:
2. Thread 1 removes element B from the list, which sets the pointer from element B
to element C to a special HAZPTR_POISON value in order to mark the deletion.
Because Thread 0 has a hazard pointer to element B, it cannot yet be freed.
3. Thread 1 removes element C from the list. Because there are no hazard pointers
referencing element C, it is immediately freed.
Which is a very good thing, because otherwise Thread 0 would have attempted
to access the now-freed element C, which might have resulted in arbitrarily horrible
memory corruption, especially if the memory for element C had since been re-allocated
for some other purpose.
All that aside, please understand that hazard pointers' restarting allows them to maintain
a minimal memory footprint. Any object not currently referenced by some hazard pointer
may be immediately freed. In contrast, Section 9.5 will discuss a mechanism that avoids
read-side retries (and minimizes read-side overhead), but has a much larger memory
footprint. q
Answer:
The published implementations of hazard pointers used non-blocking synchronization
techniques for insertion and deletion. These techniques require that readers traversing
the data structure “help” updaters complete their updates, which in turn means that
readers need to look at the successor of a deleted element.
In contrast, we will be using locking to synchronize updates, which does away with
the need for readers to help updaters complete their updates, which in turn allows us to
leave pointers’ bottom bits alone. This approach allows read-side code to be simpler
and faster. q
Answer:
These restrictions apply only to reference-counting mechanisms whose reference acqui-
sition can fail. q
Answer:
First, Figure 9.9 has a linear y-axis, while most of the graphs in the “Structured Deferral”
paper have logscale y-axes. Next, that paper uses lightly-loaded hash tables, while
Figure 9.9 uses a 10-element simple linked list, which means that hazard pointers face
a larger memory-barrier penalty in this workload than in that of the “Structured Deferral”
paper. Finally, that paper used a larger and older x86 system, while a newer but smaller
system was used to generate the data shown in Figure 9.9.
As always, your mileage may vary. Given the difference in performance, it is clear
that hazard pointers give their best performance either for very large data
structures (where the memory-barrier overhead will at least partially overlap cache-miss
penalties) or for data structures such as hash tables where a lookup operation needs only
a minimal number of hazard pointers. q
Answer:
The sequence-lock mechanism is really a combination of two separate synchronization
mechanisms, sequence counts and locking. In fact, the sequence-count mechanism is
available separately in the Linux kernel via the write_seqcount_begin() and
write_seqcount_end() primitives.
However, the combined write_seqlock() and write_sequnlock() prim-
itives are used much more heavily in the Linux kernel. More importantly, many more
people will understand what you mean if you say “sequence lock” than if you say
“sequence count”.
So this section is entitled “Sequence Locks” so that people will understand what
it is about just from the title, and it appears in the “Deferred Processing” chapter because
(1) the emphasis is on the “sequence count” aspect of “sequence locks” and (2) a
“sequence lock” is much more than merely a lock. q
Answer:
That would be a legitimate implementation. However, if the workload is read-mostly, it
would likely increase the overhead of the common-case successful read, which could
be counter-productive. However, given a sufficiently large fraction of updates and
sufficiently high-overhead readers, having the check internal to read_seqbegin()
might be preferable. q
Answer:
If it was omitted, both the compiler and the CPU would be within their rights to
move the critical section preceding the call to read_seqretry() down below this
function. This would prevent the sequence lock from protecting the critical section. The
smp_mb() primitive prevents such reordering. q
Answer:
In older versions of the Linux kernel, no.
In very new versions of the Linux kernel, line 16 could use smp_load_acquire()
instead of ACCESS_ONCE(), which in turn would allow the smp_mb() on line 17 to
be dropped. Similarly, line 41 could use an smp_store_release(), for example,
as follows:
smp_store_release(&slp->seq, ACCESS_ONCE(slp->seq) + 1);
This would allow the smp_mb() on line 40 to be dropped. q
Answer:
Nothing. This is one of the weaknesses of sequence locking, and as a result, you
should use sequence locking only in read-mostly situations. Unless of course read-side
starvation is acceptable in your situation, in which case, go wild with the sequence-
locking updates! q
Answer:
In this case, the ->lock field could be omitted, as it is in seqcount_t in the Linux
kernel. q
Answer:
Not at all. The Linux kernel has a number of special attributes that allow it to ignore the
following sequence of events:
2. Thread 0 starts executing its read-side critical section, but is then preempted for a
long time.
4. Thread 0 resumes execution, completing its read-side critical section with incon-
sistent data.
The Linux kernel uses sequence locking for things that are updated rarely, with
time-of-day information being a case in point. This information is updated at most
once per millisecond, so that seven weeks would be required to overflow the counter.
If a kernel thread was preempted for seven weeks, the Linux kernel’s soft-lockup code
would be emitting warnings every two minutes for that entire time.
In contrast, with a 64-bit counter, more than five centuries would be required to
overflow, even given an update every nanosecond. Therefore, this implementation uses
a type for ->seq that is 64 bits on 64-bit systems. q
Answer:
One trivial way of accomplishing this is to surround all accesses, including the read-only
accesses, with write_seqlock() and write_sequnlock(). Of course, this
solution also prohibits all read-side parallelism, resulting in massive lock contention,
and furthermore could just as easily be implemented using simple locking.
If you do come up with a solution that uses read_seqbegin() and read_
seqretry() to protect read-side accesses, make sure that you correctly handle the
following sequence of events:
1. CPU 0 is traversing the linked list, and picks up a pointer to list element A.
3. CPU 2 allocates an unrelated data structure, and gets the memory formerly
occupied by element A. In this unrelated data structure, the memory previously
used for element A’s ->next pointer is now occupied by a floating-point number.
4. CPU 0 picks up what used to be element A’s ->next pointer, gets random bits,
and therefore gets a segmentation fault.
One way to protect against this sort of problem requires use of “type-safe memory”,
which will be discussed in Section 9.5.3.7. But in that case, you would be using some
other synchronization mechanism in addition to sequence locks! q
Answer:
Yes and no. Although seqlock readers can run concurrently with seqlock writers,
whenever this happens, the read_seqretry() primitive will force the reader to
retry. This means that any work done by a seqlock reader running concurrently with a
seqlock updater will be discarded and redone. So seqlock readers can run concurrently
with updaters, but they cannot actually get any work done in this case.
In contrast, RCU readers can perform useful work even in presence of concurrent
RCU updaters. q
Answer:
On all systems running Linux, loads from and stores to pointers are atomic, that is, if a
store to a pointer occurs at the same time as a load from that same pointer, the load will
return either the initial value or the value stored, never some bitwise mashup of the two.
In addition, the list_for_each_entry_rcu() always proceeds forward through
the list, never looking back. Therefore, the list_for_each_entry_rcu() will
either see the element being added by list_add_rcu() or it will not, but either way,
it will see a valid well-formed list. q
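A hedged sketch of this situation, using the Linux-kernel list-RCU API with illustrative structure and variable names, might look as follows:

#include <linux/list.h>
#include <linux/rculist.h>
#include <linux/rcupdate.h>
#include <linux/spinlock.h>

struct foo {
  struct list_head list;
  int key;
};
static LIST_HEAD(foo_head);
static DEFINE_SPINLOCK(foo_lock);

void foo_add(struct foo *p)                  /* Updater. */
{
  spin_lock(&foo_lock);
  list_add_rcu(&p->list, &foo_head);         /* Atomically publishes p. */
  spin_unlock(&foo_lock);
}

int foo_present(int key)                     /* Concurrent reader. */
{
  struct foo *p;
  int found = 0;

  rcu_read_lock();
  list_for_each_entry_rcu(p, &foo_head, list) {
    if (p->key == key) {
      found = 1;
      break;
    }
  }
  rcu_read_unlock();
  return found;  /* Either sees p or not, but always a valid list. */
}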
Answer:
One way of accomplishing this is as shown in Figure C.5.
Note that this means that multiple concurrent deletions might be waiting in synchronize_
rcu(). q
1 spin_lock(&mylock);
2 p = search(head, key);
3 if (p == NULL)
4 spin_unlock(&mylock);
5 else {
6 list_del_rcu(&p->list);
7 spin_unlock(&mylock);
8 synchronize_rcu();
9 kfree(p);
10 }
Answer:
That depends on the synchronization design. If a semaphore protecting the update is
held across the grace period, then there can be at most two versions, the old and the new.
However, suppose that only the search, the update, and the list_replace_
rcu() were protected by a lock, so that the synchronize_rcu() was outside of
that lock, similar to the code shown in Figure C.5. Suppose further that a large number
of threads undertook an RCU replacement at about the same time, and that readers are
also constantly traversing the data structure.
Then the following sequence of events could occur, starting from the end state of
Figure 9.29:
2. Thread B replaces the 5,2,3 element with a new 5,2,4 element, then waits for its
synchronize_rcu() call to return.
4. Thread D replaces the 5,2,4 element with a new 5,2,5 element, then waits for its
synchronize_rcu() call to return.
6. Thread F replaces the 5,2,5 element with a new 5,2,6 element, then waits for its
synchronize_rcu() call to return.
8. And the previous two steps repeat quickly, so that all of them happen before any
of the synchronize_rcu() calls return.
Thus, there can be an arbitrary number of versions active, limited only by memory
and by how many updates could be completed within a grace period. But please note
that data structures that are updated so frequently probably are not good candidates for
RCU. That said, RCU can handle high update rates when necessary. q
Answer:
The modifications undertaken by a given RCU updater will cause the corresponding CPU
to invalidate cache lines containing the data, forcing the CPUs running concurrent RCU
readers to incur expensive cache misses. (Can you design an algorithm that changes
a data structure without inflicting expensive cache misses on concurrent readers? On
subsequent readers?) q
Answer:
The rcu_dereference() primitive does constrain the compiler’s optimizations
somewhat, which can result in slightly slower code. This effect would normally be
insignificant, but each search is taking on average about 13 nanoseconds, which is
short enough for small differences in code generation to make their presence felt. The
difference ranges from about 1.5% to about 11.1%, which is quite small when you
consider that the RCU QSBR code can handle concurrent updates and the “ideal” code
cannot.
It is hoped that C11 memory_order_consume loads [Smi15] might someday
allow rcu_dereference() to provide the needed protection at lower cost. q
Answer:
Because RCU QSBR places constraints on the overall application that might not be
tolerable, for example, requiring that each and every thread in the application regularly
pass through a quiescent state. Among other things, this means that RCU QSBR is
not helpful to library writers, who might be better served by other flavors of userspace
RCU [MDJ13c]. q
Answer:
First, consider that the inner loop used to take this measurement is as follows:
1 for (i = 0; i < CSCOUNT_SCALE; i++) {
2 rcu_read_lock();
3 rcu_read_unlock();
4 }
Consider also that the compiler does simple optimizations, allowing it to replace the
loop with:
i = CSCOUNT_SCALE;
Answer:
Because the contention on the underlying rwlock_t decreases as the critical-section
overhead increases. However, the rwlock overhead will not quite drop to that on a single
CPU because of cache-thrashing overhead. q
Answer:
One way to cause a deadlock cycle involving RCU read-side primitives is via the
following (illegal) sequence of statements:
rcu_read_lock();
synchronize_rcu();
rcu_read_unlock();
Answer:
It really does work. After all, if it didn’t work, the Linux kernel would not run. q
Answer:
This is an effect of the Law of Toy Examples: beyond a certain point, the code fragments
look the same. The only difference is in how we think about the code. However, this
difference can be extremely important. For but one example of the importance, consider
that if we think of RCU as a restricted reference counting scheme, we would never be
fooled into thinking that the updates would exclude the RCU read-side critical sections.
It nevertheless is often useful to think of RCU as a replacement for reader-writer
locking, for example, when you are replacing reader-writer locking with RCU. q
Answer:
Most likely NUMA effects. However, there is substantial variance in the values measured
for the refcnt line, as can be seen by the error bars. In fact, standard deviations range in
excess of 10% of measured values in some cases. The dip in overhead therefore might
well be a statistical aberration. q
Answer:
As with Figure 7.17, this is a very simple hash table with no chaining, so the only
element in a given bucket is the first element. The reader is again invited to adapt this
example to a hash table with full chaining. q
Answer:
First, please note that the second check on line 14 is necessary because some other CPU
might have removed this element while we were waiting to acquire the lock. However,
the fact that we were in an RCU read-side critical section while acquiring the lock
guarantees that this element could not possibly have been re-allocated and re-inserted
into this hash table. Furthermore, once we acquire the lock, the lock itself guarantees
the element’s existence, so we no longer need to be in an RCU read-side critical section.
The question as to whether it is necessary to re-check the element’s key is left as an
exercise to the reader. q
Answer:
Suppose we reverse the order of these two lines. Then this code is vulnerable to the
following sequence of events:
1. CPU 0 invokes delete(), and finds the element to be deleted, executing through
line 15. It has not yet actually deleted the element, but is about to do so.
3. CPU 0 executes lines 16 and 17, and blocks at line 18 waiting for CPU 1 to exit
its RCU read-side critical section.
4. CPU 1 now acquires the lock, but the test on line 14 fails because CPU 0 has
already removed the element. CPU 1 now executes line 22 (which we switched
with line 23 for the purposes of this Quick Quiz) and exits its RCU read-side
critical section.
5. CPU 0 can now return from synchronize_rcu(), and thus executes line 19,
sending the element to the freelist.
6. CPU 1 now attempts to release a lock for an element that has been freed, and,
worse yet, possibly reallocated as some other type of data structure. This is a fatal
memory-corruption error.
Answer:
There could certainly be an arbitrarily long period of time during which at least one
thread is always in an RCU read-side critical section. However, the key words in the
description in Section 9.5.3.7 are “in-use” and “pre-existing”. Keep in mind that a
given RCU read-side critical section is conceptually only permitted to gain references
to data elements that were in use at the beginning of that critical section. Furthermore,
remember that a slab cannot be returned to the system until all of its data elements have
been freed, in fact, the RCU grace period cannot start until after they have all been freed.
1 struct profile_buffer {
2 long size;
3 atomic_t entry[0];
4 };
5 static struct profile_buffer *buf = NULL;
6
7 void nmi_profile(unsigned long pcvalue)
8 {
9 struct profile_buffer *p;
10
11 rcu_read_lock();
12 p = rcu_dereference(buf);
13 if (p == NULL) {
14 rcu_read_unlock();
15 return;
16 }
17 if (pcvalue >= p->size) {
18 rcu_read_unlock();
19 return;
20 }
21 atomic_inc(&p->entry[pcvalue]);
22 rcu_read_unlock();
23 }
24
25 void nmi_stop(void)
26 {
27 struct profile_buffer *p = buf;
28
29 if (p == NULL)
30 return;
31 rcu_assign_pointer(buf, NULL);
32 synchronize_rcu();
33 kfree(p);
34 }
Figure C.6: Using RCU to Wait for Mythical Preemptible NMIs to Finish
Therefore, the slab cache need only wait for those RCU read-side critical sections
that started before the freeing of the last element of the slab. This in turn means that any
RCU grace period that begins after the freeing of the last element will do—the slab may
be returned to the system after that grace period ends. q
Answer:
One approach would be to use rcu_read_lock() and rcu_read_unlock() in
nmi_profile(), and to replace the synchronize_sched() with synchronize_
rcu(), perhaps as shown in Figure C.6. q
Answer:
The API members with exclamation marks (rcu_read_lock(), rcu_read_unlock(),
and call_rcu()) were the only members of the Linux RCU API that Paul E. McKen-
ney was aware of back in the mid-90s. During this timeframe, he was under the mistaken
impression that he knew all that there is to know about RCU. q
Answer:
There is no need to do anything to prevent RCU read-side critical sections from indefi-
nitely blocking a synchronize_rcu() invocation, because the synchronize_
rcu() invocation need wait only for pre-existing RCU read-side critical sections. So
as long as each RCU read-side critical section is of finite duration, there should be no
problem. q
Answer:
Absolutely not! And especially not when using preemptible RCU! You instead want
synchronize_irq(). Alternatively, you can place calls to rcu_read_lock()
and rcu_read_unlock() in the specific interrupt handlers that you want synchronize_
rcu() to wait for. q
Answer:
If there happened to be no RCU read-side critical sections delimited by rcu_read_
lock_bh() and rcu_read_unlock_bh() at the time call_rcu_bh() was
invoked, RCU would be within its rights to invoke the callback immediately, possibly
freeing a data structure still being used by the RCU read-side critical section! This
is not merely a theoretical possibility: a long-running RCU read-side critical section
delimited by rcu_read_lock() and rcu_read_unlock() is vulnerable to this
failure mode.
However, the rcu_dereference() family of functions apply to all flavors of
RCU. (There was an attempt to have per-flavor variants of rcu_dereference(),
but it was just too messy.) q
Answer:
Absolutely not! And especially not when using preemptible RCU! If you need to access
“rcu_bh”-protected data structures in an interrupt handler, you need to provide explicit
calls to rcu_read_lock_bh() and rcu_read_unlock_bh(). q
Answer:
In a non-PREEMPT or a PREEMPT kernel, mixing these two works “by accident”
because in those kernel builds, RCU Classic and RCU Sched map to the same imple-
mentation. However, this mixture is fatal in PREEMPT_RT builds using the -rt patchset,
due to the fact that Realtime RCU’s read-side critical sections can be preempted, which
would permit synchronize_sched() to return before the RCU read-side critical
section reached its rcu_read_unlock() call. This could in turn result in a data
structure being freed before the read-side critical section was finished with it, which
could in turn greatly increase the actuarial risk experienced by your kernel.
In fact, the split between RCU Classic and RCU Sched was inspired by the need for
preemptible RCU read-side critical sections. q
Answer:
That is correct! Because -rt Linux uses threaded interrupt handlers, there can be context
switches in the middle of an interrupt handler. Because synchronize_sched()
waits only until each CPU has passed through a context switch, it can return before a
given interrupt handler completes.
If you need to wait for a given interrupt handler to complete, you should instead
use synchronize_irq() or place explicit RCU read-side critical sections in the
interrupt handlers that you wish to wait on. q
Answer:
A single task could register SRCU callbacks very quickly. Given that SRCU allows
readers to block for arbitrary periods of time, this could consume an arbitrarily large
quantity of memory. In contrast, given the synchronous synchronize_srcu()
interface, a given task must finish waiting for a given grace period before it can start
waiting for the next one. q
Answer:
In principle, you can use synchronize_srcu() with a given srcu_struct
within an SRCU read-side critical section that uses some other srcu_struct. In
practice, however, doing this is almost certainly a bad idea. In particular, the code shown
in Figure C.7 could still result in deadlock.
q
1 idx = srcu_read_lock(&ssa);
2 synchronize_srcu(&ssb);
3 srcu_read_unlock(&ssa, idx);
4
5 /* . . . */
6
7 idx = srcu_read_lock(&ssb);
8 synchronize_srcu(&ssa);
9 srcu_read_unlock(&ssb, idx);
Answer:
Poisoning the next pointer would interfere with concurrent RCU readers, who must
use this pointer. However, RCU readers are forbidden from using the prev pointer, so
it may safely be poisoned. q
Answer:
One such exception is when a multi-element linked data structure is initialized as a unit
while inaccessible to other CPUs, and then a single rcu_assign_pointer() is
used to plant a global pointer to this data structure. The initialization-time pointer assign-
ments need not use rcu_assign_pointer(), though any such assignments that
happen after the structure is globally visible must use rcu_assign_pointer().
However, unless this initialization code is on an impressively hot code-path, it
is probably wise to use rcu_assign_pointer() anyway, even though it is in
theory unnecessary. It is all too easy for a “minor” change to invalidate your cherished
assumptions about the initialization happening privately. q
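A minimal sketch of this exception, with illustrative names and assuming the usual Linux-kernel allocation primitives, might look as follows:

#include <linux/rcupdate.h>
#include <linux/slab.h>

struct foo {
  int a;
  int b;
};
static struct foo __rcu *gp;  /* RCU-protected pointer, initially NULL. */

int init_and_publish(void)
{
  struct foo *p = kmalloc(sizeof(*p), GFP_KERNEL);

  if (!p)
    return -ENOMEM;
  p->a = 1;  /* Plain assignments: p is not yet globally visible. */
  p->b = 2;
  rcu_assign_pointer(gp, p);  /* The single publication point. */
  return 0;
}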
Answer:
It can sometimes be difficult for automated code checkers such as “sparse” (or indeed for
human beings) to work out which type of RCU read-side critical section a given RCU
traversal primitive corresponds to. For example, consider the code shown in Figure C.8.
Is the rcu_dereference() primitive in an RCU Classic or an RCU Sched
critical section? What would you have to do to figure this out? q
1 rcu_read_lock();
2 preempt_disable();
3 p = rcu_dereference(global_pointer);
4
5 /* . . . */
6
7 preempt_enable();
8 rcu_read_unlock();
Answer:
Suppose the functions foo() and bar() in Figure C.9 are invoked concurrently from
different CPUs. Then foo() will acquire my_lock() on line 3, while bar() will
acquire rcu_gp_lock on line 13. When foo() advances to line 4, it will attempt
to acquire rcu_gp_lock, which is held by bar(). Then when bar() advances to
line 14, it will attempt to acquire my_lock, which is held by foo().
Each function is then waiting for a lock that the other holds, a classic deadlock.
Other RCU implementations neither spin nor block in rcu_read_lock(), hence
avoiding deadlocks. q
Answer:
One could in fact use reader-writer locks in this manner. However, textbook reader-
writer locks suffer from memory contention, so that the RCU read-side critical sections
would need to be quite long to actually permit parallel execution [McK03].
On the other hand, use of a reader-writer lock that is read-acquired in rcu_read_
lock() would avoid the deadlock condition noted above. q
Answer:
Making this change would re-introduce the deadlock, so no, it would not be cleaner. q
Answer:
One deadlock is where a lock is held across synchronize_rcu(), and that same
lock is acquired within an RCU read-side critical section. However, this situation could
deadlock any correctly designed RCU implementation. After all, the synchronize_
rcu() primitive must wait for all pre-existing RCU read-side critical sections to
complete, but if one of those critical sections is spinning on a lock held by the thread
executing the synchronize_rcu(), we have a deadlock inherent in the definition
of RCU.
Another deadlock happens when attempting to nest RCU read-side critical sections.
This deadlock is peculiar to this implementation, and might be avoided by using recursive
locks, or by using reader-writer locks that are read-acquired by rcu_read_lock()
and write-acquired by synchronize_rcu().
However, if we exclude the above two cases, this implementation of RCU does not
introduce any deadlock situations. This is because the only time some other thread's lock
is acquired is when executing synchronize_rcu(), and in that case, the lock is
immediately released, prohibiting a deadlock cycle that does not involve a lock held
across the synchronize_rcu(), which is the first case above. q
Answer:
This is indeed an advantage, but do not forget that rcu_dereference() and rcu_
assign_pointer() are still required, which means volatile manipulation for
rcu_dereference() and memory barriers for rcu_assign_pointer(). Of
course, many Alpha CPUs require memory barriers for both primitives. q
Answer:
Indeed, this would deadlock any legal RCU implementation. But is rcu_read_
lock() really participating in the deadlock cycle? If you believe that it is, then
please ask yourself this same question when looking at the RCU implementation in
Section 9.5.5.9. q
Answer:
The update-side test was run in absence of readers, so the poll() system call was
never invoked. In addition, the actual code has this poll() system call commented
out, the better to evaluate the true overhead of the update-side code. Any production
uses of this code would be better served by using the poll() system call, but then
again, production uses would be even better served by other implementations shown
later in this section. q
Answer:
Although this would in fact eliminate the starvation, it would also mean that rcu_
read_lock() would spin or block waiting for the writer, which is in turn waiting on
readers. If one of these readers is attempting to acquire a lock that the spinning/blocking
rcu_read_lock() holds, we again have deadlock.
In short, the cure is worse than the disease. See Section 9.5.5.4 for a proper cure. q
Answer:
The spin-lock acquisition only guarantees that the spin-lock’s critical section will not
“bleed out” to precede the acquisition. It in no way guarantees that code preceding the
spin-lock acquisition won’t be reordered into the critical section. Such reordering could
cause a removal from an RCU-protected list to be reordered to follow the complementing
of rcu_idx, which could allow a newly starting RCU read-side critical section to see
the recently removed data element.
Exercise for the reader: use a tool such as Promela/spin to determine which (if any)
of the memory barriers in Figure 9.52 are really needed. See Chapter 12 for information
on using these tools. The first correct and complete response will be credited. q
Answer:
Both flips are absolutely required. To see this, consider the following sequence of
events:
6. The grace period that started in step 5 has been allowed to end, despite the fact
that the RCU read-side critical section that started beforehand in step 4 has not
completed. This violates RCU semantics, and could allow the update to free a
data element that the RCU read-side critical section was still referencing.
Answer:
Using non-atomic operations would cause increments and decrements to be lost, in turn
causing the implementation to fail. See Section 9.5.5.5 for a safe way to use non-atomic
operations in rcu_read_lock() and rcu_read_unlock(). q
Answer:
The atomic_read() primitive does not actually execute atomic machine instruc-
tions, but rather does a normal load from an atomic_t. Its sole purpose is to keep the
compiler’s type-checking happy. If the Linux kernel ran on 8-bit CPUs, it would also
need to prevent “store tearing”, which could happen due to the need to store a 16-bit
pointer with two eight-bit accesses on some 8-bit systems. But thankfully, it seems that
no one runs Linux on 8-bit systems. q
Answer:
Keep in mind that we only wait for a given thread if that thread is still in a pre-existing
RCU read-side critical section, and that waiting for one hold-out thread gives all the
other threads a chance to complete any pre-existing RCU read-side critical sections
that they might still be executing. So the only way that we would wait for 2N intervals
would be if the last thread still remained in a pre-existing RCU read-side critical section
despite all the waiting for all the prior threads. In short, this implementation will not
wait unnecessarily.
However, if you are stress-testing code that uses RCU, you might want to comment
out the poll() statement in order to better catch bugs that incorrectly retain a reference
to an RCU-protected data element outside of an RCU read-side critical section. q
Answer:
Special-purpose uniprocessor implementations of RCU can attain this ideal [McK09a].
q
Answer:
Assigning zero (or any other even-numbered constant) would in fact work, but assigning
the value of rcu_gp_ctr can provide a valuable debugging aid, as it gives the devel-
oper an idea of when the corresponding thread last exited an RCU read-side critical
section. q
Answer:
These memory barriers are required because the locking primitives are only guaranteed
to confine the critical section. The locking primitives are under absolutely no obligation
to keep other code from bleeding into the critical section. The pair of memory barriers
is therefore required to prevent this sort of code motion, whether performed by the
compiler or by the CPU. q
Answer:
Indeed it could, with a few modifications. This work is left as an exercise for the reader.
q
Answer:
It is a real problem: there is a sequence of events leading to failure, and there are a
number of possible ways of addressing it. For more details, see the Quick Quizzes near
the end of Section 9.5.5.8. The reason for locating the discussion there is (1) to give you
more time to think about it, and (2) that the nesting support added in that section
greatly reduces the time required to overflow the counter. q
Answer:
The apparent simplicity of the separate per-thread variable is a red herring. This approach
incurs much greater complexity in the guise of careful ordering of operations, especially
if signal handlers are to be permitted to contain RCU read-side critical sections. But
don’t take my word for it, code it up and see what you end up with! q
Answer:
One way would be to replace the magnitude comparison on lines 33 and 34 with
an inequality check of the per-thread rcu_reader_gp variable against rcu_gp_
ctr+RCU_GP_CTR_BOTTOM_BIT. q
Answer:
It can indeed be fatal. To see this, consider the following sequence of events:
3. Thread 0 now starts running again, and stores into its per-thread rcu_reader_
gp variable. The value it stores is RCU_GP_CTR_BOTTOM_BIT+1 greater than
that of the global rcu_gp_ctr.
5. Thread 1 now removes the data element A that thread 0 just acquired a reference
to.
Note that this scenario can also occur in the implementation presented in Section 9.5.5.7.
One strategy for fixing this problem is to use 64-bit counters so that the time required
to overflow them would exceed the useful lifetime of the computer system. Note that
non-antique members of the 32-bit x86 CPU family allow atomic manipulation of 64-bit
counters via the cmpxchg8b instruction.
Another strategy is to limit the rate at which grace periods are permitted to occur in
order to achieve a similar effect. For example, synchronize_rcu() could record
the last time that it was invoked, and any subsequent invocation would then check this
time and block as needed to force the desired spacing. For example, if the low-order
four bits of the counter were reserved for nesting, and if grace periods were permitted
to occur at most ten times per second, then it would take more than 300 days for the
counter to overflow. However, this approach is not helpful if there is any possibility that
the system will be fully loaded with CPU-bound high-priority real-time threads for the
full 300 days. (A remote possibility, perhaps, but best to consider it ahead of time.)
A third approach is to administratively abolish real-time threads from the system in
question. In this case, the preempted process will age up in priority, thus getting to run
long before the counter had a chance to overflow. Of course, this approach is less than
helpful for real-time applications.
A final approach would be for rcu_read_lock() to recheck the value of the
global rcu_gp_ctr after storing to its per-thread rcu_reader_gp counter, retrying
if the new value of the global rcu_gp_ctr is inappropriate. This works, but introduces
non-deterministic execution time into rcu_read_lock(). On the other hand, if your
application is being preempted long enough for the counter to overflow, you have no
hope of deterministic execution time in any case! q
Answer:
Indeed it does! An application using this implementation of RCU should therefore
invoke rcu_quiescent_state sparingly, instead using rcu_read_lock() and
rcu_read_unlock() most of the time.
However, this memory barrier is absolutely required so that other threads will see
the store on lines 12-13 before any subsequent RCU read-side critical sections executed
by the caller. q
Answer:
The memory barrier on line 19 ensures that any RCU read-side critical sections that might
precede the call to rcu_thread_offline() will not be reordered by either the compiler
or the CPU to follow the assignment on lines 20-21. The memory barrier on line 22
is, strictly speaking, unnecessary, as it is illegal to have any RCU read-side critical
sections following the call to rcu_thread_offline(). q
Answer:
Since the measurement loop contains a pair of empty functions, the compiler opti-
mizes it away. The measurement loop takes 1,000 passes between each call to rcu_
quiescent_state(), so this measurement is roughly one thousandth of the over-
head of a single call to rcu_quiescent_state(). q
Answer:
A library function has absolutely no control over the caller, and thus cannot force
the caller to invoke rcu_quiescent_state() periodically. On the other hand, a
library function that made many references to a given RCU-protected data structure
might be able to invoke rcu_thread_online() upon entry, rcu_quiescent_
state() periodically, and rcu_thread_offline() upon exit. q
how can a primitive that generates absolutely no code possibly participate in a deadlock
cycle?
Answer:
Please note that the RCU read-side critical section is in effect extended beyond the
enclosing rcu_read_lock() and rcu_read_unlock(), out to the previous and
next call to rcu_quiescent_state(). This rcu_quiescent_state() can be
thought of as an rcu_read_unlock() immediately followed by an rcu_read_
lock().
Even so, the actual deadlock itself will involve the lock acquisition in the RCU read-
side critical section and the synchronize_rcu(), never the rcu_quiescent_
state(). q
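A minimal sketch of that deadlock (mylock, reader(), and updater() are illustrative names, not code taken from any of the book's implementations):

spinlock_t mylock;	/* assumed initialized elsewhere */

void reader(void)
{
	rcu_read_lock();
	spin_lock(&mylock);	/* blocks: updater() holds mylock ...    */
	/* ... read-side work ... */
	spin_unlock(&mylock);
	rcu_read_unlock();	/* ... so the reader never reaches here  */
}

void updater(void)
{
	spin_lock(&mylock);
	synchronize_rcu();	/* blocks waiting for reader(): deadlock */
	spin_unlock(&mylock);
}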
Answer:
This situation is one reason for the existence of asynchronous grace-period primitives
such as call_rcu(). This primitive may be invoked within an RCU read-side critical
section, and the specified RCU callback will in turn be invoked at a later time, after a
grace period has elapsed.
The ability to perform an RCU update while within an RCU read-side critical
section can be extremely convenient, and is analogous to a (mythical) unconditional
read-to-write upgrade for reader-writer locking. q
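A sketch of such a deferred update, using Linux-kernel-style names (struct foo, gp, replace_foo(), and foo_reclaim() are illustrative, and mutual exclusion against other updaters is omitted for brevity):

struct foo {
	struct rcu_head rh;
	/* ... application data ... */
};

static struct foo *gp;	/* RCU-protected pointer */

static void foo_reclaim(struct rcu_head *rhp)
{
	kfree(container_of(rhp, struct foo, rh));
}

void replace_foo(struct foo *newp)
{
	struct foo *oldp;

	rcu_read_lock();
	oldp = rcu_dereference(gp);
	rcu_assign_pointer(gp, newp);	   /* publish the replacement           */
	call_rcu(&oldp->rh, foo_reclaim);  /* free oldp only after a grace period */
	rcu_read_unlock();		   /* legal: call_rcu() does not block   */
}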
Answer:
Hint: place the global variable finalcount and the array counterp[] into a single
RCU-protected struct. At initialization time, this structure would be allocated and set to
all zero and NULL.
The inc_count() function would be unchanged.
The read_count() function would use rcu_read_lock() instead of acquir-
ing final_mutex, and would need to use rcu_dereference() to acquire a
reference to the current structure.
The count_register_thread() function would set the array element corre-
sponding to the newly created thread to reference that thread’s per-thread counter
variable.
The count_unregister_thread() function would need to allocate a new
structure, acquire final_mutex, copy the old structure to the new one, add the out-
going thread’s counter variable to the total, NULL the pointer to this same counter
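Putting the pieces of this hint together, the read path might look roughly as follows (the countarray structure and the NR_THREADS bound are assumptions; this shows the general shape the hint suggests rather than a finished implementation):

struct countarray {
	unsigned long total;
	unsigned long *counterp[NR_THREADS];
};

static struct countarray *countarrayp;	/* RCU-protected */

unsigned long read_count(void)
{
	struct countarray *cap;
	unsigned long sum;
	int t;

	rcu_read_lock();			/* replaces final_mutex */
	cap = rcu_dereference(countarrayp);
	sum = cap->total;
	for (t = 0; t < NR_THREADS; t++)
		if (cap->counterp[t] != NULL)
			sum += *cap->counterp[t];
	rcu_read_unlock();
	return sum;
}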
Answer:
Hint: replace the read-acquisitions of the reader-writer lock with RCU read-side critical
sections, then adjust the device-removal code fragment to suit.
See Section 13.3.2 on Page 364 for one solution to this problem. q
Answer:
Chained hash tables are completely partitionable, and thus well-suited to concurrent
use. There are other completely-partitionable hash tables, for example, split-ordered
list [SS06], but they are considerably more complex. We therefore start with chained
hash tables. q
Answer:
Indeed it is! However, hash tables quite frequently store information with keys such
as character strings that do not necessarily fit into an unsigned long. Simplifying the
hash-table implementation for the case where keys always fit into unsigned longs is left
as an exercise for the reader. q
Answer:
The answer depends on a great many things. If the hash table has a large number
of elements per bucket, it would clearly be better to increase the number of hash
buckets. On the other hand, if the hash table is lightly loaded, the answer depends on the
hardware, the effectiveness of the hash function, and the workload. Interested readers
are encouraged to experiment. q
Answer:
You can do just that! In fact, you can extend this idea to large clustered systems,
running one copy of the application on each node of the cluster. This practice is called
“sharding”, and is heavily used in practice by large web-based retailers [DHJ+ 07].
However, if you are going to shard on a per-socket basis within a multisocket system,
why not buy separate smaller and cheaper single-socket systems, and then run one shard
of the database on each of those systems? q
Answer:
Yes it can! This is why hashtab_lookup() must be invoked within an RCU read-
side critical section, and it is why hashtab_add() and hashtab_del() must
also use RCU-aware list-manipulation primitives. Finally, this is why the caller of
hashtab_del() must wait for a grace period (e.g., by calling synchronize_
rcu()) before freeing the deleted element. q
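A sketch of the resulting caller-side deletion pattern (htp, htep, key, and the hashtab_lock()/hashtab_unlock() names are illustrative):

hashtab_lock(htp, key);		/* update-side exclusion, e.g., per-bucket lock */
hashtab_del(htep);		/* RCU-aware removal from the hash chain        */
hashtab_unlock(htp, key);
synchronize_rcu();		/* wait for all pre-existing readers            */
free(htep);			/* only now is it safe to free the element      */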
Answer:
It isn’t any safer, and a useful exercise would be to run these programs on larger sys-
tems. That said, other testing has shown that RCU read-side primitives offer consistent
performance and scalability up to at least 1024 CPUs. q
Answer:
The reason is that the old and new hash tables might have completely different hash
functions, so that a hash computed for the old table might be completely irrelevant to
the new table. q
Answer:
It does not provide any such protection. That is instead the job of the update-side
concurrency-control functions described next. q
Answer:
This approach allows the hashtorture.h testing infrastructure to be reused. That
said, a production-quality resizable hash table would likely be optimized to avoid this
double computation. Carrying out this optimization is left as an exercise for the reader.
q
Answer:
The second resize operation will not be able to move beyond the bucket into which the
insertion is taking place due to the insertion holding the lock on one of the hash buckets
in the new hash table (the second hash table of three in this example). Furthermore, the
insertion operation takes place within an RCU read-side critical section. As we will see
when we examine the hashtab_resize() function, this means that the first resize
operation will use synchronize_rcu() to wait for the insertion’s read-side critical
section to complete. q
Answer:
Suppose that a resize operation begins and distributes half of the old table’s buckets to
the new table. Suppose further that a thread adds a new element that goes into one of the
already-distributed buckets, and that this same thread now looks up this newly added
element. If lookups unconditionally traversed only the old hash table, this thread would
get a lookup failure for the element that it just added, which certainly sounds like a bug
to me! q
from the old hash table. Doesn’t this mean that readers might access this newly removed
element after it has been freed?
Answer:
No. The hashtab_del() function omits removing the element from the old hash
table only if the resize operation has already progressed beyond the bucket containing the
just-deleted element. But this means that new hashtab_lookup() operations will
use the new hash table when looking up that element. Therefore, only old hashtab_
lookup() operations that started before the hashtab_del() might encounter the
newly removed element. This means that hashtab_del() need only wait for an
RCU grace period to avoid inconveniencing hashtab_lookup() operations. q
Answer:
The synchronize_rcu() on line 30 of Figure 10.27 ensures that all pre-existing
RCU readers have completed between the time that we install the new hash-table
reference on line 29 and the time that we update ->ht_resize_cur on line 36. This
means that any reader that sees a non-negative value of ->ht_resize_cur cannot
have started before the assignment to ->ht_new, and thus must be able to see the
reference to the new hash table. q
Answer:
It probably could, and doing so would benefit all of the per-bucket-locked hash tables
presented in this chapter. Making this modification is left as an exercise for the reader.
q
Answer:
The answer to the first question is left as an exercise to the reader. Try specializing the
resizable hash table and see how much performance improvement results. The second
question cannot be answered in general, but must instead be answered with respect to a
specific use case. Some use cases are extremely sensitive to performance and scalability,
while others are less so. q
C.11 Validation
Quick Quiz 11.1:
When in computing is the willingness to follow a fragmentary plan critically important?
Answer:
There are any number of situations, but perhaps the most important situation is when
no one has ever created anything resembling the program to be developed. In this case,
the only way to create a credible plan is to implement the program, create the plan, and
implement it a second time. But whoever implements the program for the first time
has no choice but to follow a fragmentary plan because any detailed plan created in
ignorance cannot survive first contact with the real world.
And perhaps this is one reason why evolution has favored insanely optimistic human
beings who are happy to follow fragmentary plans! q
The script is required to check its input for errors, and to give appropriate diagnostics
if fed erroneous time output. What test inputs should you provide to this program to
test it for use with time output generated by single-threaded programs?
Answer:
1. Do you have a test case in which all the time is consumed in user mode by a
CPU-bound program?
2. Do you have a test case in which all the time is consumed in system mode by a
CPU-bound program?
3. Do you have a test case in which all three times are zero?
4. Do you have a test case in which the “user” and “sys” times sum to more than the
“real” time? (This would of course be completely legitimate in a multithreaded
program.)
5. Do you have a set of test cases in which one of the times uses more than one
second?
6. Do you have a set of test cases in which one of the times uses more than ten
seconds?
7. Do you have a set of test cases in which one of the times has non-zero minutes?
(For example, “15m36.342s”.)
8. Do you have a set of test cases in which one of the times has a seconds value of
greater than 60?
9. Do you have a set of test cases in which one of the times overflows 32 bits of
milliseconds? 64 bits of milliseconds?
10. Do you have a set of test cases in which one of the times is negative?
11. Do you have a set of test cases in which one of the times has a positive minutes
value but a negative seconds value?
12. Do you have a set of test cases in which one of the times omits the “m” or the
“s”?
13. Do you have a set of test cases in which one of the times is non-numeric? (For
example, “Go Fish”.)
14. Do you have a set of test cases in which one of the lines is omitted? (For example,
where there is a “real” value and a “sys” value, but no “user” value.)
15. Do you have a set of test cases where one of the lines is duplicated? Or duplicated,
but with a different time value for the duplicate?
16. Do you have a set of test cases where a given line has more than one time value?
(For example, “real 0m0.132s 0m0.008s”.)
18. In all test cases involving invalid input, did you generate all permutations?
19. For each test case, do you have an expected outcome for that test?
If you did not generate test data for a substantial number of the above cases, you
will need to cultivate a more destructive attitude in order to have a chance of generating
high-quality tests.
Of course, one way to economize on destructiveness is to generate the tests with
the to-be-tested source code at hand, which is called white-box testing (as opposed to
black-box testing). However, this is no panacea: You will find that it is all too easy to
find your thinking limited by what the program can handle, thus failing to generate truly
destructive inputs. q
Answer:
If it is your project, for example, a hobby, do what you like. Any time you waste will
be your own, and you have no one else to answer to for it. And there is a good chance
that the time will not be completely wasted. For example, if you are embarking on a
first-of-a-kind project, the requirements are in some sense unknowable anyway. In this
case, the best approach might be to quickly prototype a number of rough solutions, try
them out, and see what works best.
On the other hand, if you are being paid to produce a system that is broadly similar
to existing systems, you owe it to your users, your employer, and your future self to
validate early and often. q
Answer:
If you don’t mind having a WARN_ON_ONCE() that will sometimes warn twice or
three times, simply maintain a static variable that is initialized to zero. If the condition
triggers, check the static variable, and if it is non-zero, return. Otherwise, set it to one,
print the message, and return.
If you really need the message to never appear more than once, perhaps because it is
huge, you can use an atomic exchange operation in place of “set it to one” above. Print
the message only if the atomic exchange operation returns zero. q
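A sketch of the print-at-most-once variant using an atomic exchange (my_warn_on_once() is an illustrative name, not the kernel's actual WARN_ON_ONCE() implementation):

#define my_warn_on_once(cond, msg)				\
	do {							\
		static atomic_t warned = ATOMIC_INIT(0);	\
		if ((cond) && !atomic_xchg(&warned, 1))		\
			printk(KERN_WARNING "%s\n", (msg));	\
	} while (0)

Because atomic_xchg() returns the old value, only the single caller that observes zero prints the message.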
Answer:
If you are worried about transcription errors, please allow me to be the first to introduce
you to a really cool tool named diff. In addition, carrying out the copying can be quite
valuable:
1. If you are copying a lot of code, you are probably failing to take advantage of an
opportunity for abstraction. The act of copying code can provide great motivation
for abstraction.
2. Copying the code gives you an opportunity to think about whether the code really
works in its new setting. Is there some non-obvious constraint, such as the need
to disable interrupts or to hold some lock?
3. Copying the code also gives you time to consider whether there is some better
way to get the job done.
So, yes, copy the code! q
Answer:
Indeed, repeatedly copying code by hand is laborious and slow. However, when com-
bined with heavy-duty stress testing and proofs of correctness, this approach is also
extremely effective for complex parallel code where ultimate performance and reliability
are required and where debugging is difficult. The Linux-kernel RCU implementation
is a case in point.
On the other hand, if you are writing a simple single-threaded shell script to manipu-
late some data, then you would be best-served by a different methodology. For example,
you might enter each command one at a time into an interactive shell with a test data set
to make sure that it did what you wanted, then copy-and-paste the successful commands
into your script. Finally, test the script as a whole.
If you have a friend or colleague who is willing to help out, pair programming can
work very well, as can any number of formal design- and code-review processes.
And if you are writing code as a hobby, then do whatever you like.
In short, different types of software need different development methodologies. q
Answer:
This approach might well be a valuable addition to your validation arsenal. But it does
have a few limitations:
1. Some bugs have extremely low probabilities of occurrence, but nevertheless need
to be fixed. For example, suppose that the Linux kernel’s RCU implementation
had a bug that is triggered only once per century of machine time on average. A
century of CPU time is hugely expensive even on the cheapest cloud platforms,
but we could expect this bug to result in more than 2,000 failures per day on the
more than 100 million Linux instances in the world as of 2011.
2. The bug might well have zero probability of occurrence on your test setup, which
means that you won’t see it no matter how much machine time you burn testing it.
Of course, if your code is small enough, formal validation may be helpful, as discussed
in Chapter 12. But beware: formal validation of your code will not find errors in your
assumptions, misunderstanding of the requirements, misunderstanding of the software
or hardware primitives you use, or errors that you did not think to construct a proof for.
q
Answer:
You are right, that makes no sense at all.
Remember that a probability is a number between zero and one, so that you need to
divide a percentage by 100 to get a probability. So 10% is a probability of 0.1, which,
when plugged into the formula for the probability of at least one failure, yields 0.4095,
which rounds to 41% and quite sensibly matches the
earlier result. q
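For concreteness, the arithmetic behind the quoted 0.4095, assuming five runs each with a 10% per-run failure probability (values inferred from the quoted result), is:
\[ 1 - (1 - 0.1)^5 = 1 - 0.9^5 = 1 - 0.59049 \approx 0.4095 \]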
Answer:
It does not matter. You will get the same answer no matter what base of logarithms you
use because the result is a pure ratio of logarithms. The only constraint is that you use
the same base for both the numerator and the denominator. q
Answer:
We set n to 3 and P to 99.9 in Equation 11.11, resulting in:
\[ T = -\frac{1}{3} \log \frac{100 - 99.9}{100} = 2.3 \tag{C.9} \]
If the test runs without failure for 2.3 hours, we can be 99.9% certain that the fix
reduced the probability of failure. q
Answer:
One approach is to use the open-source symbolic manipulation program named “max-
ima”. Once you have installed this program, which is a part of many Debian-based
Linux distributions, you can run it and give the load(distrib); command fol-
lowed by any number of bfloat(cdf_poisson(m,l)); commands, where the m
is replaced by the desired value of m and the l is replaced by the desired value of λ .
In particular, the bfloat(cdf_poisson(2,24)); command results in 1.181617112359357b-8,
which matches the value given by Equation 11.13.
Alternatively, you can use the rough-and-ready method described in Section 11.6.2.
q
Answer:
Indeed it should. And it does.
To see this, note that $e^{-\lambda}$ does not depend on i, which means that it can be pulled
out of the summation as follows:
\[ e^{-\lambda} \sum_{i=0}^{\infty} \frac{\lambda^i}{i!} \tag{C.10} \]
\[ e^{-\lambda} e^{\lambda} \tag{C.11} \]
The two exponentials are reciprocals, and therefore cancel, resulting in exactly 1, as
required. q
Answer:
Indeed, that can happen. Many CPUs have hardware-debugging facilities that can help
you locate that unrelated pointer. Furthermore, if you have a core dump, you can search
the core dump for pointers referencing the corrupted region of memory. You can also
look at the data layout of the corruption, and check pointers whose type matches that
layout.
You can also step back and test the modules making up your program more in-
tensively, which will likely confine the corruption to the module responsible for it. If
this makes the corruption vanish, consider adding additional argument checking to the
functions exported from each module.
Nevertheless, this is a hard problem, which is why I used the words “a bit of a dark
art”. q
Answer:
A huge commit? Shame on you! This is but one reason why you are supposed to keep
the commits small.
And that is your answer: Break up the commit into bite-sized pieces and bisect the
pieces. In my experience, the act of breaking up the commit is often sufficient to make
the bug painfully obvious. q
Answer:
There are locking algorithms that depend on conditional-locking primitives telling them
the truth. For example, if conditional-lock failure signals that some other thread is
already working on a given job, spurious failure might cause that job to never get done,
possibly resulting in a hang. q
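For example, a work-dispatch pattern along these lines (job_lock and do_the_job() are illustrative) depends on the conditional acquisition telling the truth:

if (spin_trylock(&job_lock)) {
	do_the_job();		/* we won the race, so the job is ours */
	spin_unlock(&job_lock);
}
/* Otherwise, some other thread presumably holds job_lock and will do
 * the job.  A spurious trylock failure breaks that presumption: nobody
 * does the job, and the system can hang waiting for it. */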
Answer:
This question fails to consider the option of choosing not to compute the answer at all,
and in doing so, also fails to consider the costs of computing the answer. For example,
consider short-term weather forecasting, for which accurate models exist, but which
require large (and expensive) clustered supercomputers, at least if you want to actually
run the model faster than the weather.
And in this case, any performance bug that prevents the model from running faster
than the actual weather prevents any forecasting. Given that the whole purpose of
purchasing the large clustered supercomputers was to forecast weather, if you cannot
run the model faster than the weather, you would be better off not running the model at
all.
More severe examples may be found in the area of safety-critical real-time comput-
ing. q
Answer:
Although I do heartily salute your spirit and aspirations, you are forgetting that there
may be high costs due to delays in the program’s completion. For an extreme example,
suppose that a 40% performance shortfall from a single-threaded application is causing
one person to die each day. Suppose further that in a day you could hack together a
quick and dirty parallel program that ran 50% faster on an eight-CPU system than the
sequential version, but that an optimal parallel program would require four months of
painstaking design, coding, debugging, and tuning.
It is safe to say that more than 100 people would prefer the quick and dirty version.
q
Answer:
Changes in memory layout can indeed result in unrealistic decreases in execution time.
For example, suppose that a given microbenchmark almost always overflows the L0
cache’s associativity, but with just the right memory layout, it all fits. If this is a real
concern, consider running your microbenchmark using huge pages (or within the kernel
or on bare metal) in order to completely control the memory layout. q
Answer:
Indeed it might, although in most microbenchmarking efforts you would extract the
code under test from the enclosing application. Nevertheless, if for some reason you
must keep the code under test within the application, you will very likely need to use
the techniques discussed in Section 11.7.6. q
Answer:
Because mean and standard deviation were not designed to do this job. To see this, try
applying mean and standard deviation to the following data set, given a 1% relative
error in measurement:
The problem is that mean and standard deviation do not rest on any sort of measurement-
error assumption, and they will therefore see the difference between the values near
49,500 and those near 49,900 as being statistically significant, when in fact they are well
within the bounds of estimated measurement error.
Of course, it is possible to create a script similar to that in Figure 11.7 that uses
standard deviation rather than absolute difference to get a similar effect, and this is left
as an exercise for the interested reader. Be careful to avoid divide-by-zero errors arising
from strings of identical data values! q
Answer:
Indeed it will! But if your performance measurements often produce a value of exactly
zero, perhaps you need to take a closer look at your performance-measurement code.
Note that many approaches based on mean and standard deviation will have similar
problems with this sort of dataset. q
Answer:
The locker process is an infinite loop, so control never reaches the end of this process.
However, since there are no monotonically increasing variables, Promela is able to
model this infinite loop with a small number of states. q
Answer:
There are several:
1. The declaration of sum should be moved to within the init block, since it is not
used anywhere else.
2. The assertion code should be moved outside of the initialization loop. The
initialization loop can then be placed in an atomic block, greatly reducing the
state space (by how much?).
3. The atomic block covering the assertion code should be extended to include the
initialization of sum and j, and also to cover the assertion. This also reduces the
state space (again, by how much?).
Answer:
Yes. Replace it with if-fi and remove the two break statements. q
Answer:
Because those operations are for the benefit of the assertion only. They are not part of
the algorithm itself. There is therefore no harm in marking them atomic, and so marking
them greatly reduces the state space that must be searched by the Promela model. q
Answer:
Yes. To see this, delete these lines and run the model.
Alternatively, consider the following sequence of steps:
1. One process is within its RCU read-side critical section, so that the value of
ctr[0] is zero and the value of ctr[1] is two.
2. An updater starts executing, and sees that the sum of the counters is two so that
the fastpath cannot be executed. It therefore acquires the lock.
3. A second updater starts executing, and fetches the value of ctr[0], which is
zero.
4. The first updater adds one to ctr[0], flips the index (which now becomes zero),
then subtracts one from ctr[1] (which now becomes one).
5. The second updater fetches the value of ctr[1], which is now one.
6. The second updater now incorrectly concludes that it is safe to proceed on the
fastpath, despite the fact that the original reader has not yet completed.
Answer:
There is always room for doubt. In this case, it is important to keep in mind that the
two proofs of correctness preceded the formalization of real-world memory models,
raising the possibility that these two proofs are based on incorrect memory-ordering
assumptions. Furthermore, since both proofs were constructed by the same person, it is
quite possible that they contain a common error. Again, there is always room for doubt.
q
Answer:
Relax, there are a number of lawful answers to this question:
2. Work out a pencil-and-paper proof, perhaps starting with the comments in the
code in the Linux kernel.
3. Devise careful torture tests, which, though they cannot prove the code correct,
can find hidden bugs.
5. Wait for memory sizes of affordable systems to expand to fit your problem.
Answer:
This fails in presence of NMIs. To see this, suppose an NMI was received just after
rcu_irq_enter() incremented rcu_update_flag, but before it incremented
dynticks_progress_counter. The instance of rcu_irq_enter() invoked
by the NMI would see that the original value of rcu_update_flag was non-zero,
and would therefore refrain from incrementing dynticks_progress_counter.
This would leave the RCU grace-period machinery no clue that the NMI handler was
executing on this CPU, so that any RCU read-side critical sections in the NMI handler
would lose their RCU protection.
The possibility of NMI handlers, which by definition cannot be masked, does
complicate this code. q
Answer:
Not if we interrupted a running task! In that case, dynticks_progress_counter
would have already been incremented by rcu_exit_nohz(), and there would be no
need to increment it again. q
Answer:
Read the next section to see if you were correct. q
Answer:
Promela assumes sequential consistency, so it is not necessary to model memory barriers.
In fact, one must instead explicitly model lack of memory barriers, for example, as
shown in Figure 12.13 on page 312. q
Answer:
It probably would be more natural, but we will need this particular order for the liveness
checks that we will add later. q
Answer:
Because the grace-period code processes each CPU’s dynticks_progress_counter
and rcu_dyntick_snapshot variables separately, we can collapse the state onto
a single CPU. If the grace-period code were instead to do something special given
specific values on specific CPUs, then we would indeed need to model multiple CPUs.
But fortunately, we can safely confine ourselves to two CPUs, the one running the
grace-period processing and the one entering and leaving dynticks-idle mode. q
Answer:
Recall that Promela and spin trace out every possible sequence of state changes. There-
fore, timing is irrelevant: Promela/spin will be quite happy to jam the entire rest of the
model between those two statements unless some state variable specifically prohibits
doing so. q
Answer:
The easiest thing to do would be to put each such statement in its own EXECUTE_
MAINLINE() statement. q
Answer:
One approach, as we will see in a later section, is to use explicit labels and “goto”
statements. For example, the construct:
if
:: i == 0 -> a = -1;
:: else -> a = -2;
fi;
However, it is not clear that the macro is helping much in the case of the “if”
statement, so these sorts of situations will be open-coded in the following sections. q
Answer:
These lines of code pertain to controlling the model, not to the code being modeled, so
there is no reason to model them non-atomically. The motivation for modeling them
atomically is to reduce the size of the state space. q
Answer:
One such property is nested interrupts, which are handled in the following section. q
Answer:
Not always, but more and more frequently. In this case, Paul started with the smallest
slice of code that included an interrupt handler, because he was not sure how best to
model interrupts in Promela. Once he got that working, he added other features. (But if
he was doing it again, he would start with a “toy” handler. For example, he might have
the handler increment a variable twice and have the mainline code verify that the value
was always even.)
Why the incremental approach? Consider the following, attributed to Brian W.
Kernighan:
Debugging is twice as hard as writing the code in the first place. Therefore,
if you write the code as cleverly as possible, you are, by definition, not
smart enough to debug it.
This means that any attempt to optimize the production of code should place at
least 66% of its emphasis on optimizing the debugging process, even at the expense
of increasing the time and effort spent coding. Incremental coding and testing is one
way to optimize the debugging process, at the expense of some increase in coding effort.
Paul uses this approach because he rarely has the luxury of devoting full days (let alone
weeks) to coding and debugging. q
Answer:
This cannot happen within the confines of a single CPU. The first irq handler cannot
complete until the NMI handler returns. Therefore, if each of the dynticks and
dynticks_nmi variables have taken on an even value during a given time interval,
the corresponding CPU really was in a quiescent state at some time during that interval.
q
each CPU that is in dyntick-idle mode, clearing the bit when entering an irq or NMI
handler, and setting it upon exit?
Answer:
Although this approach would be functionally correct, it would result in excessive irq
entry/exit overhead on large machines. In contrast, the approach laid out in this section
allows each CPU to touch only per-CPU data on irq and NMI entry/exit, resulting in
much lower irq entry/exit overhead, especially on large machines. q
Answer:
Actually, academics consider the x86 memory model to be weak because it can allow
prior stores to be reordered with subsequent loads. From an academic viewpoint, a
strong memory model is one that allows absolutely no reordering, so that all threads
agree on the order of all operations visible to them. q
Answer:
Either way works. However, in general, it is better to use initialization than explicit
instructions. The explicit instructions are used in this example to demonstrate their
use. In addition, many of the litmus tests available on the tool’s web site (http://
www.cl.cam.ac.uk/~pes20/ppcmem/) were automatically generated, and the
generation process emits explicit initialization instructions. q
Answer:
The PowerPC implementation of atomic_add_return() loops when
the stwcx instruction fails, which it communicates by setting non-zero status in the
condition-code register, which in turn is tested by the bne instruction. Because actually
modeling the loop would result in state-space explosion, we instead branch to the Fail:
label, terminating the model with the initial value of 2 in P0’s r3 register, which will
not trigger the exists assertion.
There is some debate about whether this trick is universally applicable, but I have
not seen an example where it fails. q
Answer:
ARM does not have this particular bug because it places smp_mb() before and
Answer:
Unfortunately, no.
The first full verification of the L4 microkernel was a tour de force, with a large
number of Ph.D. students hand-verifying code at a very slow per-student rate. This level
of effort could not be applied to most software projects because the rate of change is just
too great. Furthermore, although the L4 microkernel is a large software artifact from
the viewpoint of formal verification, it is tiny compared to a great number of projects,
including LLVM, gcc, the Linux kernel, Hadoop, MongoDB, and a great many others.
Although formal verification is finally starting to show some promise, including
more-recent L4 verifications involving greater levels of automation, it currently has no
chance of completely displacing testing in the foreseeable future. And although I would
dearly love to be proven wrong on this point, please note that such a proof will be in the
form of a real tool that verifies real software, not in the form of a large body of rousing
rhetoric. q
Answer:
Although this can resolve the race between the release of the last reference and acquisi-
tion of a new reference, it does absolutely nothing to prevent the data structure from
being freed and reallocated, possibly as some completely different type of structure.
It is quite likely that the “simple compare-and-swap operation” would give undefined
results if applied to the differently typed structure.
In short, use of atomic operations such as compare-and-swap absolutely requires
either type-safety or existence guarantees. q
Answer:
Because a CPU must already hold a reference in order to legally acquire another
reference. Therefore, if one CPU releases the last reference, there cannot possibly
be any CPU that is permitted to acquire a new reference. This same fact allows the
non-atomic check in line 22 of Figure 13.2. q
Answer:
This cannot happen if these functions are used correctly. It is illegal to invoke kref_
get() unless you already hold a reference, in which case the kref_sub() could not
possibly have decremented the counter to zero. q
Answer:
The caller cannot rely on the continued existence of the object unless it knows that at
least one reference will continue to exist. Normally, the caller will have no way of
knowing this, and must therefore carefully avoid referencing the object after the call to
kref_sub(). q
Answer:
Because the kref structure normally is embedded in a larger structure, and it is neces-
sary to free the entire structure, not just the kref field. This is normally accomplished
by defining a wrapper function that does a container_of() and then a kfree().
q
Answer:
Suppose that the “if” condition completed, finding the reference counter value equal to
one. Suppose that a release operation executes, decrementing the reference counter to
zero and therefore starting cleanup operations. But now the “then” clause can increment
the counter back to a value of one, allowing the object to be used after it has been
cleaned up. q
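In other words, the check-then-act sequence at issue looks roughly like this (sp and its refcount field are illustrative):

if (atomic_read(&sp->refcount) == 1) {
	/* A concurrent release can decrement the counter to zero right
	 * here, starting cleanup ... */
	atomic_inc(&sp->refcount);	/* ... yet this "then" clause revives
					 * the object after cleanup began */
}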
Answer:
It might well seem that way, but in situations where no other CPU has access to the
atomic variable in question, the overhead of an actual atomic instruction would be
wasteful. Two examples where no other CPU has access are during initialization and
cleanup. q
Answer:
A given thread’s __thread variables vanish when that thread exits. It is therefore nec-
essary to synchronize any operation that accesses other threads’ __thread variables
with thread exit. Without such synchronization, accesses to the __thread variables of a
just-exited thread will result in segmentation faults. q
Answer:
Refer to Figure 5.9 on Page 62. Clearly, if there are no concurrent invocations of
inc_count(), read_count() will return an exact result. However, if there are
concurrent invocations of inc_count(), then the sum is in fact changing as read_
count() performs its summation. That said, because thread creation and exit are
excluded by final_mutex, the pointers in counterp remain constant.
Let’s imagine a mythical machine that is able to take an instantaneous snapshot
of its memory. Suppose that this machine takes such a snapshot at the beginning of
read_count()’s execution, and another snapshot at the end of read_count()’s
execution. Then read_count() will access each thread’s counter at some time
between these two snapshots, and will therefore obtain a result that is bounded by those
of the two snapshots, inclusive. The overall sum will therefore be bounded by the pair of
sums that would have been obtained from each of the two snapshots (again, inclusive).
The expected error is therefore half of the difference between the pair of sums
that would have been obtained from each of the two snapshots, that is to say, half of
the execution time of read_count() multiplied by the number of expected calls to
inc_count() per unit time.
Or, for those who prefer equations:
\[ \epsilon = \frac{T_r R_i}{2} \tag{C.12} \]
where ε is the expected error in read_count()’s return value, Tr is the time that
read_count() takes to execute, and Ri is the rate of inc_count() calls per unit
time. (And of course, Tr and Ri should use the same units of time: microseconds and
calls per microsecond, seconds and calls per second, or whatever, as long as they are the
same units.) q
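As a purely illustrative instance of Equation C.12 (the numbers below are invented for the example, not measured), a read_count() that takes 10 microseconds to execute while inc_count() is being invoked 5 times per microsecond would have an expected error of:
\[ \epsilon = \frac{T_r R_i}{2} = \frac{10\,\mu\mathrm{s} \times 5\ \mathrm{calls}/\mu\mathrm{s}}{2} = 25\ \mathrm{counts} \]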
struct measurement {
	double meas_1;
	double meas_2;
	double meas_3;
};

struct animal {
	char name[40];
	double age;
	struct measurement *mp;
	struct measurement meas;
	char photo[0]; /* large bitmap. */
};
Answer:
Indeed I did say that. And it would be possible to make count_register_thread()
allocate a new structure, much as count_unregister_thread() currently does.
But this is unnecessary. Recall the derivation of the error bounds of read_
count() that was based on the snapshots of memory. Because new threads start
with initial counter values of zero, the derivation holds even if we add a new thread
partway through read_count()’s execution. So, interestingly enough, when adding
a new thread, this implementation gets the effect of allocating a new structure, but
without actually having to do the allocation. q
Answer:
This of course needs to be decided on a case-by-case basis. If you need an implemen-
tation of read_count() that scales linearly, then the lock-based implementation
shown in Figure 5.9 simply will not work for you. On the other hand, if calls to read_
count() are sufficiently rare, then the lock-based version is simpler and might thus be
better, although much of the size difference is due to the structure definition, memory
allocation, and NULL return checking.
Of course, a better question is “Why doesn’t the language implement cross-thread
access to __thread variables?” After all, such an implementation would make
both the locking and the use of RCU unnecessary. This would in turn enable an
implementation that was even simpler than the one shown in Figure 5.9, but with all the
scalability and performance benefits of the implementation shown in Figure 13.5! q
Answer:
Indeed it can.
One way to avoid this cache-miss overhead is shown in Figure C.10: Simply embed
an instance of a measurement structure named meas into the animal structure,
and point the ->mp field at this ->meas field.
Measurement updates can then be carried out as follows:
1. Allocate a new measurement structure and place the new measurements into
it.
3. Wait for a grace period to elapse, for example using either synchronize_
rcu() or call_rcu().
4. Copy the measurements from the new measurement structure into the embed-
ded ->meas field.
6. After another grace period elapses, free up the new measurement structure.
This approach uses a heavier weight update procedure to eliminate the extra cache
miss in the common case. The extra cache miss will be incurred only while an update is
actually in progress. q
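A sketch of that update procedure (field names follow Figure C.10; the pointer switches between the numbered steps are assumptions consistent with the text, and free() stands in for whatever reclamation the application uses):

void animal_update_meas(struct animal *ap, struct measurement *newp)
{
	/* Caller has already filled *newp with the new measurements. */
	rcu_assign_pointer(ap->mp, newp);	/* readers now use the new values */
	synchronize_rcu();			/* wait for readers of ->meas     */
	ap->meas = *newp;			/* copy into the embedded field   */
	rcu_assign_pointer(ap->mp, &ap->meas);	/* switch readers back            */
	synchronize_rcu();			/* wait for readers of *newp      */
	free(newp);				/* now safe to free               */
}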
Answer:
True, resizable hash tables as described in Section 10.4 cannot be fully scanned while
being resized. One simple way around this is to acquire the hashtab structure’s
->ht_lock while scanning, but this prevents more than one scan from proceeding
concurrently.
Another approach is for updates to mutate the old hash table as well as the new one
while resizing is in progress. This would allow scans to find all elements in the old hash
table. Implementing this is left as an exercise for the reader. q
Answer:
The key point that the intuitive analysis missed is that there is nothing preventing the
assignment to C from overtaking the assignment to A as both race to reach thread2().
This is explained in the remainder of this section. q
Answer:
The easiest fix is to replace each of the barrier()s on line 12 and line 20 with an
smp_mb().
Of course, some hardware is more forgiving than other hardware. For example, on
x86 the assertion on line 21 of Figure 14.3 on page 373 cannot trigger. On PowerPC,
only the barrier() on line 20 need be replaced with smp_mb() to prevent the
assertion from triggering. q
Answer:
The code assumes that as soon as a given CPU stops seeing its own value, it will
immediately see the final agreed-upon value. On real hardware, some of the CPUs
might well see several intermediate results before converging on the final value. q
Answer:
Many CPUs have write buffers that record the values of recent writes, which are applied
once the corresponding cache line makes its way to the CPU. Therefore, it is quite
possible for each CPU to see a different value for a given variable at a single point in
time—and for main memory to hold yet another value. One of the reasons that memory
barriers were invented was to allow software to deal gracefully with situations like this
one. q
Answer:
CPUs 2 and 3 are a pair of hardware threads on the same core, sharing the same cache
hierarchy, and therefore have very low communications latencies. This is a NUMA, or,
more accurately, a NUCA effect.
This leads to the question of why CPUs 2 and 3 ever disagree at all. One possible
reason is that they each might have a small amount of private cache in addition to a
larger shared cache. Another possible reason is instruction reordering, given the short
10-nanosecond duration of the disagreement and the total lack of memory barriers in
the code fragment. q
Answer:
MMIO registers are a special case because they appear in uncached regions of physical
memory. Memory barriers do unconditionally force ordering of loads and stores to
uncached memory, as discussed in Section 14.2.8. q
Answer:
The scenario is as follows, with A and B both initially zero:
CPU 0: A=1; smp_mb(); r1=B;
CPU 1: B=1; smp_mb(); r2=A;
If neither of the loads see the corresponding store, when both CPUs finish, both r1
and r2 will be equal to zero. Let’s suppose that r1 is equal to zero. Then we know that
CPU 0’s load from B happened before CPU 1’s store to B: After all, we would have
had r1 equal to one otherwise. But given that CPU 0’s load from B happened before
CPU 1’s store to B, memory-barrier pairing guarantees that CPU 0’s store to A happens
before CPU 1’s load from A, which in turn guarantees that r2 will be equal to one, not
zero.
Therefore, at least one of r1 and r2 must be nonzero, which means that at least one
of the loads saw the value from the corresponding store, as claimed. q
Answer:
For combination 2, if CPU 1’s load from B sees a value prior to CPU 2’s store to B, then
we know that CPU 2’s load from A will return the same value as CPU 1’s load from A,
or some later value.
For combination 4, if CPU 2’s load from B sees the value from CPU 1’s store to B,
then we know that CPU 2’s load from A will return the same value as CPU 1’s load
from A, or some later value.
For combination 8, if CPU 2’s load from A sees CPU 1’s store to A, then we know
that CPU 1’s load from B will return the same value as CPU 2’s load from A, or some
later value. q
Answer:
If the CPU is not required to see all of its loads and stores in order, then the b=1+a
might well see an old version of the variable “a”.
This is why it is so very important that each CPU or thread see all of its own loads
and stores in program order. q
Answer:
Only the first execution of the critical section should see p==NULL. However, if there
is no global ordering of critical sections for mylock, then how can you say that a
particular one was first? If several different executions of that critical section thought
that they were first, they would all see p==NULL, and they would all allocate memory.
All but one of those allocations would be leaked.
This is why it is so very important that all the critical sections for a given exclusive
lock appear to execute in some well-defined order. q
Answer:
Suppose that the counter started out with the value zero, and that three executions of the
critical section had therefore brought its value to three. If the fourth execution of the
critical section is not constrained to see the most recent store to this variable, it might
well see the original value of zero, and therefore set the counter to one, which would be
going backwards.
This is why it is so very important that loads from a given variable in a given critical
section see the last store from the last prior critical section to store to that variable. q
Answer:
Absolutely none. This barrier would ensure that the assignments to “a” and “b” hap-
pened before any subsequent assignments, but it does nothing to enforce any order of
assignments to “a” and “b” themselves. q
Answer:
A series of two back-to-back LOCK-UNLOCK operations, or, somewhat less conven-
tionally, an UNLOCK operation followed by a LOCK operation. q
Answer:
Itanium is one example. The identification of any others is left as an exercise for the
reader. q
Answer:
2. Legitimate, the lock acquisition was executed concurrently with the last assign-
ment preceding the critical section.
4. Illegitimate, the LOCK must complete before any operation in the critical sec-
tion. However, the UNLOCK may legitimately be executed concurrently with
subsequent operations.
5. Legitimate, the assignment to “A” precedes the UNLOCK, as required, and all
other operations are in order.
8. Legitimate, all assignments are ordered with respect to the LOCK and UNLOCK
operations.
Answer:
All CPUs must see the following ordering constraints:
q
Answer:
Sooner or later, either the battery must be recharged, which requires energy to flow into
the system, or the system will stop operating. q
Answer:
Yes, but . . .
Those queueing-theory results assume infinite “calling populations”, which in the
Linux kernel might correspond to an infinite number of tasks. As of mid-2016, no
real system supports an infinite number of tasks, so results assuming infinite calling
populations should be expected to have less-than-infinite applicability.
Other queueing-theory results have finite calling populations, which feature sharply
bounded response times [HL86]. These results better model real systems, and these
models do predict reductions in both average and worst-case response times as utiliza-
tions decrease. These results can be extended to model concurrent systems that use
synchronization mechanisms such as locking [Bra11].
In short, queueing-theory results that accurately describe real-world real-time sys-
tems show that worst-case response time decreases with decreasing utilization. q
Answer:
Perhaps this situation is just a theoretician’s excuse to avoid diving into the messy world
of real software? Perhaps more constructively, the following advances are required:
1. Formal verification needs to handle larger software artifacts. The largest verifica-
tion efforts have been for systems of only about 10,000 lines of code, and those
have been verifying much simpler properties than real-time latencies.
2. Hardware vendors will need to publish formal timing guarantees. This used to be
common practice back when hardware was much simpler, but today’s complex
hardware results in excessively complex expressions for worst-case performance.
Unfortunately, energy-efficiency concerns are pushing vendors in the direction of
even more complexity.
All that said, there is hope, given recent work formalizing the memory models of
real computer systems [AMP+ 11, AKNT13]. q
Answer:
This distinction is admittedly unsatisfying from a strictly theoretical perspective. But
on the other hand, it is exactly what the developer needs in order to decide whether
the application can be cheaply and easily developed using standard non-real-time
approaches, or whether the more difficult and expensive real-time approaches are
required. In other words, theory is quite important; however, for those of us who like to
get things done, theory supports practice, never the other way around. q
Answer:
Indeed it is, other than the API. And the API is important because it allows the Linux
kernel to offer real-time capabilities without having the -rt patchset grow to ridiculous
sizes.
However, this approach clearly and severely limits read-side scalability. The Linux
kernel’s -rt patchset has been able to live with this limitation for several reasons: (1) Real-
time systems have traditionally been relatively small, (2) Real-time systems have gen-
erally focused on process control, thus being unaffected by scalability limitations in
the I/O subsystems, and (3) Many of the Linux kernel’s reader-writer locks have been
converted to RCU.
All that aside, it is quite possible that the Linux kernel will some day permit limited
read-side parallelism for reader-writer locks subject to priority boosting. q
Answer:
That is a real problem, and it is solved in RCU’s scheduler hook. If that scheduler
hook sees that the value of t->rcu_read_lock_nesting is negative, it invokes
rcu_read_unlock_special() if needed before allowing the context switch to
complete. q
Answer:
Yes and no.
Yes in that non-blocking algorithms can provide fault tolerance in the face of fail-
stop bugs, but no in that this is grossly insufficient for practical fault tolerance. For
example, suppose you had a wait-free queue, and further suppose that a thread has just
dequeued an element. If that thread now succumbs to a fail-stop bug, the element it
has just dequeued is effectively lost. True fault tolerance requires way more than mere
non-blocking properties, and is beyond the scope of this book. q
Answer:
Indeed there are, and lots of them. However, they tend to be specific to a given situation,
and many of them can be thought of as refinements of some of the constraints listed
above. For example, the many constraints on choices of data structure will help meet
the “Bounded time spent in any given critical section” constraint. q
Answer:
In early 2016, environments forbidding runtime memory allocation also tended to have
little enthusiasm for multithreaded computing, so runtime memory allocation poses no
additional obstacle to safety criticality. q
Answer:
Indeed you do, and you could use any of a number of techniques discussed earlier in
this book. q
Answer:
Yes. However, since each thread must hold the locks of three consecutive elements to
delete the middle one, if there are N threads, there must be 2N + 1 elements (rather than
just N + 1) in order to avoid deadlock. q
Answer:
That would be Paul.
He was considering the Dining Philosopher’s Problem, which involves a rather
unsanitary spaghetti dinner attended by five philosophers. Given that there are five
plates but only five forks on the table, and given that each philosopher requires two forks
at a time to eat, one is supposed to come up with a fork-allocation algorithm that avoids
deadlock. Paul’s response was “Sheesh! Just get five more forks!”
This in itself was OK, but Paul then applied this same solution to circular linked
lists.
This would not have been so bad either, but he had to go and tell someone about it!
q
Answer:
One exception would be a difficult and complex algorithm that was the only one known
to work in a given situation. Another exception would be a difficult and complex
algorithm that was nonetheless the simplest of the set known to work in a given situation.
However, even in these cases, it may be very worthwhile to spend a little time trying to
come up with a simpler algorithm! After all, if you managed to invent the first algorithm
to do some task, it shouldn’t be that hard to go on to invent a simpler one. q
Answer:
If the exec()ed program maps those same regions of memory, then this program could
in principle simply release the lock. The question as to whether this approach is sound
from a software-engineering viewpoint is left as an exercise for the reader. q
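For example, a minimal sketch of the idea using a process-shared POSIX mutex kept in a file-backed mapping; the path, the helper name, and the omission of error handling are all illustrative, and the mutex is assumed to have been initialized with the PTHREAD_PROCESS_SHARED attribute:

    #include <fcntl.h>
    #include <pthread.h>
    #include <sys/mman.h>

    /* Map (or, after exec(), re-map) the shared region holding the lock. */
    static pthread_mutex_t *map_shared_lock(const char *path)
    {
        int fd = open(path, O_RDWR);
        void *p = mmap(NULL, sizeof(pthread_mutex_t),
                       PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        return p;  /* error checking omitted for brevity */
    }

The exec()ed program could then re-map the region and invoke pthread_mutex_unlock() on the result, releasing the lock that its predecessor acquired.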
Answer:
If the lock is in the same cacheline as some of the variables that it is protecting, then
writes to those variables by one CPU will invalidate that cache line for all the other
CPUs. These invalidations will generate large numbers of conflicts and retries, perhaps
even degrading performance and scalability compared to locking. q
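One common mitigation, sketched below under the assumption of 64-byte cache lines, is to lay out the structure so that the lock and the data it protects occupy different cache lines:

    #include <pthread.h>

    #define CACHE_LINE_SIZE 64  /* assumed; query the hardware in real code */

    struct protected_data {
        pthread_mutex_t lock;
        /* Aligning the first field pushes the data onto its own line(s). */
        unsigned long a __attribute__((aligned(CACHE_LINE_SIZE)));
        unsigned long b;
    };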
Answer:
The larger the updates, the greater the probability of conflict, and thus the greater
probability of retries, which degrade performance. q
Answer:
In many cases, the enumeration need not be exact. In these cases, hazard pointers or
RCU may be used to protect readers with low probability of conflict with any given
insertion or deletion. q
Answer:
This scheme might work with reasonably high probability, but it can fail in ways that
would be quite surprising to most users. To see this, consider the following transaction:
1 begin_trans();
2 if (a) {
3 do_one_thing();
4 do_another_thing();
5 } else {
6 do_a_third_thing();
7 do_a_fourth_thing();
8 }
9 end_trans();
Suppose that the user sets a breakpoint at line 3, which triggers, aborting the
transaction and entering the debugger. Suppose that between the time that the breakpoint
triggers and the debugger gets around to stopping all the threads, some other thread sets
the value of a to zero. When the poor user attempts to single-step the program, surprise!
The program is now in the else-clause instead of the then-clause.
This is not what I call an easy-to-use debugger. q
Answer:
See the answer to Quick Quiz 7.18 in Section 7.2.1.
Answer:
It could do so, but this would be both unnecessary and insufficient.
It would be unnecessary in cases where the empty critical section was due to
conditional compilation. Here, it might well be that the only purpose of the lock was to
protect data, so eliding it completely would be the right thing to do. In fact, leaving the
empty lock-based critical section would degrade performance and scalability.
On the other hand, it is possible for a non-empty lock-based critical section to be
relying not only on the data-protection semantics of locking, but also on its time-based and messaging semantics.
Using transactional lock elision in such a case would be incorrect, and would result in
bugs. q
Answer:
The short answer is that on commonplace commodity hardware, synchronization designs
based on any sort of fine-grained timing are foolhardy and cannot be expected to operate
correctly under all conditions.
That said, there are systems designed for hard real-time use that are much more
deterministic. In the (very unlikely) event that you are using such a system, here is
a toy example showing how time-based synchronization can work. Again, do not try
this on commodity microprocessors, as they have highly nondeterministic performance
characteristics.
This example uses multiple worker threads along with a control thread. Each worker
thread corresponds to an outbound data feed, and records the current time (for example,
from the clock_gettime() system call) in a per-thread my_timestamp variable
after executing each unit of work. The real-time nature of this example results in the
following set of constraints:
1. It is a fatal error for a given worker thread to fail to update its timestamp for a
time period of more than MAX_LOOP_TIME.
3. Locks are granted in strict FIFO order within a given thread priority.
When worker threads complete their feed, they must disentangle themselves from
the rest of the application and place a status value in a per-thread my_status variable
that is initialized to −1. Threads do not exit; they instead are placed on a thread pool to
accommodate later processing requirements. The control thread assigns (and re-assigns)
worker threads as needed, and also maintains a histogram of thread statuses. The control
thread runs at a real-time priority no higher than that of the worker threads.
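As a rough sketch, a worker thread's main loop consistent with the above description might look as follows; the helper functions (continue_working(), get_work(), do_work(), and get_return_status()) are hypothetical:

    /* Hypothetical sketch of a worker thread. */
    while (continue_working()) {
        wp = get_work();                    /* one unit of work */
        do_work(wp);
        my_timestamp = clock_gettime(...);  /* per-thread heartbeat */
    }

    /* Feed complete: disentangle from the rest of the application. */
    acquire_lock(&departing_thread_lock);
    my_status = get_return_status();
    release_lock(&departing_thread_lock);

    /* Do not exit; await repurposing by the control thread. */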
The control thread’s code is as follows:
1 for (;;) {
2 for_each_thread(t) {
3 ct = clock_gettime(...);
4 d = ct - per_thread(my_timestamp, t);
5 if (d >= MAX_LOOP_TIME) {
6 /* thread departing. */
7 acquire_lock(&departing_thread_lock);
8 release_lock(&departing_thread_lock);
9 i = per_thread(my_status, t);
10 status_hist[i]++; /* Bug if TLE! */
11 }
12 }
13 /* Repurpose threads as needed. */
14 }
Line 5 uses the passage of time to deduce that the thread has exited, executing
lines 6-10 if so. The empty lock-based critical section on lines 7 and 8 guarantees that
any thread in the process of exiting completes (remember that locks are granted in FIFO
order!).
Once again, do not try this sort of thing on commodity microprocessors. After all, it
is difficult enough to get right on systems specifically designed for hard real-time use!
q
Answer:
No deadlock will result. To arrive at deadlock, two different threads must each acquire
the two locks in opposite orders, which does not happen in this example. However,
deadlock detectors such as lockdep [Cor06a] will flag this as a false positive. q
Answer:
At least they accomplished something useful! And perhaps there will be additional
HTM progress over time. q
Answer:
Answer:
3. The producer might also be running on a faster CPU than is the consumer (for
example, one of the CPUs might have had to decrease its clock frequency due to
heat-dissipation or power-consumption constraints).
Answer:
Yes. q
Answer:
The people who would like to arbitrarily subdivide and interleave the workload. Of
course, an arbitrary subdivision might end up separating a lock acquisition from the
corresponding lock release, which would prevent any other thread from acquiring that
lock. If the locks were pure spinlocks, this could even result in deadlock. q
Answer:
The writeback message originates from a given CPU, or in some designs from a given
level of a given CPU’s cache—or even from a cache that might be shared among several
CPUs. The key point is that a given cache does not have room for a given data item, so
some other piece of data must be ejected from the cache to make room. If there is some
other piece of data that is duplicated in some other cache or in memory, then that piece
of data may be simply discarded, with no writeback message required.
On the other hand, if every piece of data that might be ejected has been modified
so that the only up-to-date copy is in this cache, then one of those data items must be
copied somewhere else. This copy operation is undertaken using a “writeback message”.
The destination of the writeback message has to be something that is able to store
the new value. This might be main memory, but it also might be some other cache. If it
is a cache, it is normally a higher-level cache for the same CPU, for example, a level-1
cache might write back to a level-2 cache. However, some hardware designs permit
cross-CPU writebacks, so that CPU 0’s cache might send a writeback message to CPU 1.
This would normally be done if CPU 1 had somehow indicated an interest in the data,
for example, by having recently issued a read request.
In short, a writeback message is sent from some part of the system that is short of
space, and is received by some other part of the system that can accommodate the data.
q
Answer:
One of the CPUs gains access to the shared bus first, and that CPU “wins”. The other
CPU must invalidate its copy of the cache line and transmit an “invalidate acknowledge”
message to the winning CPU.
Of course, the losing CPU can be expected to immediately issue a “read invalidate”
transaction, so the winning CPU’s victory will be quite ephemeral. q
Answer:
It might, if large-scale multiprocessors were in fact implemented that way. Larger
multiprocessors, particularly NUMA machines, tend to use so-called “directory-based”
cache-coherence protocols to avoid this and other problems. q
Answer:
There has been quite a bit of controversy on this topic over the past few decades. One
answer is that the cache-coherence protocols are quite simple, and therefore can be
implemented directly in hardware, gaining bandwidths and latencies unattainable by
software message passing. Another answer is that the real truth is to be found in
economics due to the relative prices of large SMP machines and that of clusters of
smaller SMP machines. A third answer is that the SMP programming model is easier to
use than that of distributed systems, but a rebuttal might note the appearance of HPC
clusters and MPI. And so the argument continues. q
Answer:
Usually by adding additional states, though these additional states need not be actually
stored with the cache line, due to the fact that only a few lines at a time will be
transitioning. The need to delay transitions is but one issue that results in real-world
cache coherence protocols being much more complex than the over-simplified MESI
protocol described in this appendix. Hennessy and Patterson’s classic introduction to
computer architecture [HP95] covers many of these issues. q
Answer:
There is no such sequence, at least in the absence of special “flush my cache” instructions
in the CPU’s instruction set. Most CPUs do have such instructions. q
Answer:
Because the purpose of store buffers is not just to hide acknowledgement latencies in
multiprocessor cache-coherence protocols, but to hide memory latencies in general.
Because memory is much slower than is cache on uniprocessors, store buffers on
uniprocessors can help to hide write-miss latencies. q
Answer:
Because the cache line in question contains more than just the variable a. q
Answer:
CPU 0 already has the values of these variables, given that it has a read-only copy of the
cache line containing “a”. Therefore, all CPU 0 need do is to cause the other CPUs to
discard their copies of this cache line. An “invalidate” message therefore suffices. q
Answer:
CPUs are free to speculatively execute, which can have the effect of executing the
assertion before the while loop completes. Furthermore, compilers normally assume
that only the currently executing thread is updating the variables, and this assumption
allows the compiler to hoist the load of a to precede the loop.
In fact, some compilers would transform the loop to a branch around an infinite loop
as follows:
1 void foo(void)
2 {
3 a = 1;
4 smp_mb();
5 b = 1;
6 }
7
8 void bar(void)
9 {
10 if (b == 0)
11 for (;;)
12 continue;
13 smp_mb();
14 assert(a == 1);
15 }
Given this optimization, the assertion could clearly fire. You should use volatile
casts or (where available) C++ relaxed atomics to prevent the compiler from optimizing
your parallel code into oblivion.
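For example, a minimal C11 sketch of corrected code, using relaxed atomic accesses to keep the compiler from hoisting the load, with atomic_thread_fence() standing in for smp_mb(), might look like this:

    #include <assert.h>
    #include <stdatomic.h>

    atomic_int a, b;

    void foo(void)
    {
        atomic_store_explicit(&a, 1, memory_order_relaxed);
        atomic_thread_fence(memory_order_seq_cst); /* plays the role of smp_mb() */
        atomic_store_explicit(&b, 1, memory_order_relaxed);
    }

    void bar(void)
    {
        while (atomic_load_explicit(&b, memory_order_relaxed) == 0)
            continue;  /* the load can no longer be hoisted out of the loop */
        atomic_thread_fence(memory_order_seq_cst); /* plays the role of smp_mb() */
        assert(atomic_load_explicit(&a, memory_order_relaxed) == 1);
    }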
In short, both compilers and CPUs are quite aggressive about optimizing, so you
must clearly communicate your constraints to them, using compiler directives and
memory barriers. q
Answer:
No. Consider the case where a thread migrates from one CPU to another, and where
the destination CPU perceives the source CPU’s recent memory operations out of order.
To preserve user-mode sanity, kernel hackers must use memory barriers in the context-
switch path. However, the locking already required to safely do a context switch should
automatically provide the memory barriers needed to cause the user-level task to see
its own accesses in order. That said, if you are designing a super-optimized scheduler,
either in the kernel or at user level, please keep this scenario in mind! q
Answer:
No. Such a memory barrier would only force ordering local to CPU 1. It would have no
effect on the relative ordering of CPU 0’s and CPU 1’s accesses, so the assertion could
still fail. However, all mainstream computer systems provide one mechanism or another
to provide “transitivity”, which provides intuitive causal ordering: if B saw the effects
of A’s accesses, and C saw the effects of B’s accesses, then C must also see the effects
of A’s accesses. In short, hardware designers have taken at least a little pity on software
developers. q
Answer:
The assertion will need to be written to ensure that the load of “e” precedes that of “a”. In
the Linux kernel, the barrier() primitive may be used to accomplish this in much the
same way that the memory barrier was used in the assertions in the previous examples.
q
Answer:
The result depends on whether the CPU supports “transitivity”. In other words, CPU 0
stored to “e” after seeing CPU 1’s store to “c”, with a memory barrier between CPU 0’s
load from “c” and store to “e”. If some other CPU sees CPU 0’s store to “e”, is it also
guaranteed to see CPU 1’s store?
All CPUs I am aware of claim to provide transitivity. q
Answer:
First, Alpha has only mb and wmb instructions, so smp_rmb() would be implemented
by the Alpha mb instruction in either case.
More importantly, smp_read_barrier_depends() must order subsequent
stores. For example, consider the following code:
1 p = global_pointer;
2 smp_read_barrier_depends();
3 if (do_something_with(p->a, p->b) == 0)
4 p->hey_look = 1;
Here the store to p->hey_look must be ordered, not just the loads from p->a
and p->b. q
Appendix D
Glossary and Bibliography
Dictionaries are inherently circular in nature.
Associativity: The number of cache lines that can be held simultaneously in a given
cache, when all of these cache lines hash identically in that cache. A cache
that could hold four cache lines for each possible hash value would be termed a
“four-way set-associative” cache, while a cache that could hold only one cache
line for each possible hash value would be termed a “direct-mapped” cache. A
cache whose associativity was equal to its capacity would be termed a “fully
associative” cache. Fully associative caches have the advantage of eliminating
associativity misses, but, due to hardware limitations, fully associative caches
are normally quite limited in size. The associativity of the large caches found on
modern microprocessors typically ranges from two-way to eight-way.
Associativity Miss: A cache miss incurred because the corresponding CPU has recently
accessed more data hashing to a given set of the cache than will fit in that set.
Fully associative caches are not subject to associativity misses (or, equivalently,
in fully associative caches, associativity and capacity misses are identical).
Cache: In modern computer systems, CPUs have caches in which to hold frequently
used data. These caches can be thought of as hardware hash tables with very
simple hash functions, but in which each hash bucket (termed a “set” by hardware
types) can hold only a limited number of data items. The number of data items that
can be held by each of a cache’s hash buckets is termed the cache’s “associativity”.
These data items are normally called “cache lines”, which can be thought of as
fixed-length blocks of data that circulate among the CPUs and memory.
Cache Coherence: A property of most modern SMP machines where all CPUs will
observe a sequence of values for a given variable that is consistent with at least
one global order of values for that variable. Cache coherence also guarantees that
at the end of a group of stores to a given variable, all CPUs will agree on the final
value for that variable. Note that cache coherence applies only to the series of
values taken on by a single variable. In contrast, the memory consistency model
for a given machine describes the order in which loads and stores to groups of
variables will appear to occur. See Section 14.2.4.2 for more information.
Cache Coherence Protocol: A communications protocol, normally implemented in
hardware, that enforces memory consistency and ordering, preventing different
CPUs from seeing inconsistent views of data held in their caches.
Cache Geometry: The size and associativity of a cache is termed its geometry. Each
cache may be thought of as a two-dimensional array, with rows of cache lines
(“sets”) that have the same hash value, and columns of cache lines (“ways”)
in which every cache line has a different hash value. The associativity of a
given cache is its number of columns (hence the name “way”—a two-way set-
associative cache has two “ways”), and the size of the cache is its number of rows
multiplied by its number of columns.
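For example, a cache with 4,096 sets (rows) and 8 ways (columns) holds 4,096 × 8 = 32,768 cache lines, which at 64 bytes per cache line amounts to a 2 MB cache.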
Cache Line: (1) The unit of data that circulates among the CPUs and memory, usually
a moderate power of two in size. Typical cache-line sizes range from 16 to 256
bytes.
(2) A physical location in a CPU cache capable of holding one cache-line unit of
data.
(3) A physical location in memory capable of holding one cache-line unit of data,
but that is also aligned on a cache-line boundary. For example, the address of the
first word of a cache line in memory will end in 0x00 on systems with 256-byte
cache lines.
Cache Miss: A cache miss occurs when data needed by the CPU is not in that CPU’s
cache. The data might be missing because of a number of reasons, including: (1)
this CPU has never accessed the data before (“startup” or “warmup” miss), (2)
this CPU has recently accessed more data than would fit in its cache, so that some
of the older data had to be removed (“capacity” miss), (3) this CPU has recently
accessed more data in a given set1 than that set could hold (“associativity” miss),
(4) some other CPU has written to the data (or some other data in the same cache
line) since this CPU has accessed it (“communication miss”), or (5) this CPU
attempted to write to a cache line that is currently read-only, possibly due to that
line being replicated in other CPUs’ caches.
Capacity Miss: A cache miss incurred because the corresponding CPU has recently
accessed more data than will fit into the cache.
Code Locking: A simple locking design in which a “global lock” is used to protect
a set of critical sections, so that access by a given thread to that set is granted
or denied based only on the set of threads currently occupying the set of critical
sections, not based on what data the thread intends to access. The scalability of a
code-locked program is limited by the code; increasing the size of the data set
will normally not increase scalability (in fact, will typically decrease scalability
by increasing “lock contention”). Contrast with “data locking”.
Communication Miss: A cache miss incurred because some other CPU has written
to the cache line since the last time this CPU accessed it.
1 In hardware-cache terminology, the word “set” is used in the same way that the word “bucket” is used for software hash tables.
Store Buffer: A small set of internal registers used by a given CPU to record pending
stores while the corresponding cache lines are making their way to that CPU.
Also called “store queue”.
Store Forwarding: An arrangement where a given CPU refers to its store buffer as
well as its cache so as to ensure that the software sees the memory operations
performed by this CPU as if they were carried out in program order.
Unteachable: A topic, concept, method, or mechanism that the teacher does not understand well, and is therefore uncomfortable teaching.
Vector CPU: A CPU that can apply a single instruction to multiple items of data
concurrently. In the 1960s through the 1980s, only supercomputers had vector
capabilities, but the advent of MMX in x86 CPUs and VMX in PowerPC CPUs
brought vector processing to the masses.
Write Miss: A cache miss incurred because the corresponding CPU attempted to write
to a cache line that is read-only, most likely due to its being replicated in other
CPUs’ caches.
Write-Side Critical Section: A section of code guarded by write-acquisition of some
reader-writer synchronization mechanism. For example, if one set of critical
sections are guarded by write-acquisition of a given global reader-writer lock,
while a second set of critical sections are guarded by read-acquisition of that same
reader-writer lock, then the first set of critical sections will be the write-side
critical sections for that lock. Only one thread may execute in the write-side
critical section at a time, and even then only if no threads are executing
concurrently in any of the corresponding read-side critical sections.
Bibliography
[ABD+ 97] Jennifer M. Anderson, Lance M. Berc, Jeffrey Dean, Sanjay Ghemawat,
Monika R. Henzinger, Shun-Tak A. Leung, Richard L. Sites, Mark T.
Vandevoorde, Carl A. Waldspurger, and William E. Weihl. Continuous
profiling: Where have all the cycles gone? In Proceedings of the 16th
ACM Symposium on Operating Systems Principles, pages 1–14, New
York, NY, October 1997.
[ACHS13] Dan Alistarh, Keren Censor-Hillel, and Nir Shavit. Are lock-
free concurrent algorithms practically wait-free?, December 2013.
ArXiv:1311.3200v2.
[Ada11] Andrew Adamatzky. Slime mould solves maze in one pass . . . assisted
by gradient of chemo-attractants, August 2011. http://arxiv.org/
abs/1108.4956.
[AGH+ 11a] Hagit Attiya, Rachid Guerraoui, Danny Hendler, Petr Kuznetsov,
Maged M. Michael, and Martin Vechev. Laws of order: Expensive syn-
chronization in concurrent algorithms cannot be eliminated. In 38th ACM
SIGACT-SIGPLAN Symposium on Principles of Programming Languages,
pages 487–498, New York, NY, USA, 2011. ACM.
[AGH+ 11b] Hagit Attiya, Rachid Guerraoui, Danny Hendler, Petr Kuznetsov,
Maged M. Michael, and Martin Vechev. Laws of order: Expensive
synchronization in concurrent algorithms cannot be eliminated. SIGPLAN
Not., 46(1):487–498, January 2011.
[AHS+ 03] J. Appavoo, K. Hui, C. A. N. Soules, R. W. Wisniewski, D. M. Da Silva,
O. Krieger, M. A. Auslander, D. J. Edelsohn, B. Gamsa, G. R. Ganger,
P. McKenney, M. Ostrowski, B. Rosenburg, M. Stumm, and J. Xenidis.
Enabling autonomic behavior in systems software with hot swapping.
IBM Systems Journal, 42(1):60–76, January 2003.
[AKNT13] Jade Alglave, Daniel Kroening, Vincent Nimal, and Michael Tautschnig.
Software verification for weak memory via program transformation. In
Proceedings of the 22nd European conference on Programming Lan-
guages and Systems, ESOP’13, pages 512–532, Berlin, Heidelberg, 2013.
Springer-Verlag.
[AKT13] Jade Alglave, Daniel Kroening, and Michael Tautschnig. Partial orders for
efficient Bounded Model Checking of concurrent software. In Computer
Aided Verification (CAV), volume 8044 of LNCS, pages 141–157. Springer,
2013.
[Ale79] Christopher Alexander. The Timeless Way of Building. Oxford University
Press, New York, 1979.
[Alg13] Jade Alglave. Weakness is a virtue. In (EC)2 2013: 6th International
Workshop on Exploiting Concurrency Efficiently and Correctly, page 3,
2013.
[Amd67] Gene Amdahl. Validity of the single processor approach to achieving
large-scale computing capabilities. In AFIPS Conference Proceedings,
pages 483–485, Washington, DC, USA, 1967. IEEE Computer Society.
[AMP+ 11] Jade Alglave, Luc Maranget, Pankaj Pawan, Susmit Sarkar, Peter Sewell,
Derek Williams, and Francesco Zappa Nardelli. PPCMEM/ARM-
MEM: A tool for exploring the POWER and ARM memory mod-
els, June 2011. http://www.cl.cam.ac.uk/~pes20/ppc-
supplemental/pldi105-sarkar.pdf.
[AMT14] Jade Alglave, Luc Maranget, and Michael Tautschnig. Herding cats:
Modelling, simulation, testing, and data-mining for weak memory. In
Proceedings of the 35th ACM SIGPLAN Conference on Programming
Language Design and Implementation, PLDI ’14, pages 40–40, New
York, NY, USA, 2014. ACM.
[And90] T. E. Anderson. The performance of spin lock alternatives for shared-
memory multiprocessors. IEEE Transactions on Parallel and Distributed
Systems, 1(1):6–16, January 1990.
[Boh01] Kristoffer Bohmann. Response time still matters, July 2001. URL:
http://www.bohmann.dk/articles/response_time_
still_matters.html [broken, November 2016].
[Bow06] Maggie Bowman. Dividing the sheep from the goats, February 2006.
http://www.cs.kent.ac.uk/news/2006/RBornat/.
[CBM+ 08] Calin Cascaval, Colin Blundell, Maged Michael, Harold W. Cain, Peng
Wu, Stefanie Chiras, and Siddhartha Chatterjee. Software transactional
memory: Why is it only a research toy? ACM Queue, September 2008.
[Cli09] Cliff Click. And now some hardware transactional memory com-
ments..., February 2009. http://www.azulsystems.com/blog/
cliff-click/2009-02-25-and-now-some-hardware-
transactional-memory-comments.
[CLRS01] T.H. Cormen, C.E. Leiserson, R.L. Rivest, and C. Stein. Introduction to
Algorithms, Second Edition. MIT electrical engineering and computer
science series. MIT Press, 2001.
[Cor06a] Jonathan Corbet. The kernel lock validator, May 2006. Available: http:
//lwn.net/Articles/185666/ [Viewed: March 26, 2010].
[Cor06b] Jonathan Corbet. Priority inheritance in the kernel, April 2006. Available:
http://lwn.net/Articles/178253/ [Viewed June 29, 2009].
[Cor13] Jonathan Corbet. (nearly) full tickless operation in 3.10, May 2013.
http://lwn.net/Articles/549580/.
[Cra93] Travis Craig. Building FIFO and priority-queuing spin locks from atomic
swap. Technical Report 93-02-02, University of Washington, Seattle,
Washington, February 1993.
[CSG99] David E. Culler, Jaswinder Pal Singh, and Anoop Gupta. Parallel Com-
puter Architecture: a Hardware/Software Approach. Morgan Kaufman,
1999.
[DBA09] Saeed Dehnadi, Richard Bornat, and Ray Adams. Meta-analysis of the
effect of consistency on success in early learning of programming. In
PPIG 2009, pages 1–13, University of Limerick, Ireland, June 2009.
Psychology of Programming Interest Group.
[DCW+ 11] Luke Dalessandro, Francois Carouge, Sean White, Yossi Lev, Mark Moir,
Michael L. Scott, and Michael F. Spear. Hybrid NOrec: A case study in
the effectiveness of best effort hardware transactional memory. In Pro-
ceedings of the 16th International Conference on Architectural Support
for Programming Languages and Operating Systems (ASPLOS), ASPLOS
’11, pages ???–???, New York, NY, USA, 2011. ACM.
[DMS+ 12] Mathieu Desnoyers, Paul E. McKenney, Alan Stern, Michel R. Dagenais,
and Jonathan Walpole. User-level implementations of read-copy update.
IEEE Transactions on Parallel and Distributed Systems, 23:375–382,
2012.
[Dov90] Ken F. Dove. A high capacity TCP/IP in parallel STREAMS. In UKUUG
Conference Proceedings, London, June 1990.
[Dre11] Ulrich Drepper. Futexes are tricky. Technical Report FAT2011, Red Hat,
Inc., Raleigh, NC, USA, November 2011.
[DSS06] Dave Dice, Ori Shalev, and Nir Shavit. Transactional locking II. In Proc.
International Symposium on Distributed Computing. Springer Verlag,
2006. Available: http://www.springerlink.com/content/
5688h5q0w72r54x0/ [Viewed March 10, 2008].
[Duf10a] Joe Duffy. A (brief) retrospective on transactional memory, January
2010. http://joeduffyblog.com/2010/01/03/a-brief-
retrospective-on-transactional-memory/.
[Duf10b] Joe Duffy. More thoughts on transactional memory, May
2010. http://joeduffyblog.com/2010/05/16/more-
thoughts-on-transactional-memory/.
[Edg13] Jake Edge. The future of realtime Linux, November 2013. URL: http:
//lwn.net/Articles/572740/.
[Edg14] Jake Edge. The future of the realtime patch set, October 2014. URL:
http://lwn.net/Articles/617140/.
[EGCD03] T. A. El-Ghazawi, W. W. Carlson, and J. M. Draper. UPC language
specifications v1.1, May 2003. Available: http://upc.gwu.edu
[Viewed September 19, 2008].
[EGMdB11] Stephane Eranian, Eric Gouriou, Tipp Moseley, and Willem de Bruijn.
Linux kernel profiling with perf, June 2011. https://perf.wiki.
kernel.org/index.php/Tutorial.
[Ell80] Carla Schlatter Ellis. Concurrent search and insertion in AVL trees. IEEE
Transactions on Computers, C-29(9):811–817, September 1980.
[ELLM07] Faith Ellen, Yossi Lev, Victor Luchangco, and Mark Moir. Snzi: scalable
nonzero indicators. In Proceedings of the twenty-sixth annual ACM
symposium on Principles of distributed computing, PODC ’07, pages
13–22, New York, NY, USA, 2007. ACM.
[Eng68] Douglas Engelbart. The demo, December 1968. Avail-
able: http://video.google.com/videoplay?docid=-
8734787622017763097 [Viewed November 28, 2008].
[ENS05] Ryan Eccles, Blair Nonneck, and Deborah A. Stacey. Exploring parallel
programming knowledge in the novice. In HPCS ’05: Proceedings of the
19th International Symposium on High Performance Computing Systems
and Applications, pages 97–102, Washington, DC, USA, 2005. IEEE
Computer Society.
[ES90] Margaret A. Ellis and Bjarne Stroustrup. The Annotated C++ Reference
Manual. Addison Wesley, 1990.
[ES05] Ryan Eccles and Deborah A. Stacey. Understanding the parallel program-
mer. In HPCS ’05: Proceedings of the 19th International Symposium on
High Performance Computing Systems and Applications, pages 156–160,
Washington, DC, USA, 2005. IEEE Computer Society.
[ETH11] ETH Zurich. Parallel solver for a perfect maze, March 2011.
http://nativesystems.inf.ethz.ch/pub/Main/
WebHomeLecturesParallelProgrammingExercises/
pp2011hw04.pdf.
[Fos10] Ron Fosner. Scalable multithreaded programming with tasks. MSDN Mag-
azine, 2010(11):60–69, November 2010. http://msdn.microsoft.
com/en-us/magazine/gg309176.aspx.
[FRK02] Hubertus Francke, Rusty Russell, and Matthew Kirkwood. Fuss, futexes
and furwocks: Fast userlevel locking in linux. In Ottawa Linux Sympo-
sium, pages 479–495, June 2002. Available: http://www.kernel.
org/doc/ols/2002/ols2002-pages-479-495.pdf [Viewed
May 22, 2011].
[GC96] Michael Greenwald and David R. Cheriton. The synergy between non-
blocking synchronization and operating system structure. In Proceedings
of the Second Symposium on Operating Systems Design and Implementa-
tion, pages 123–136, Seattle, WA, October 1996. USENIX Association.
[GDZE10] Olga Golovanevsky, Alon Dayan, Ayal Zaks, and David Edelsohn. Trace-
based data layout optimizations for multi-core processors. In Proceedings
of the 5th International Conference on High Performance Embedded Ar-
chitectures and Compilers, HiPEAC’10, pages 81–95, Berlin, Heidelberg,
2010. Springer-Verlag.
[GHJV95] Erich Gamma, Richard Helm, Ralph Johnson, and John Vlissides. Design
Patterns: Elements of Reusable Object-Oriented Software. Addison-
Wesley, 1995.
[GKAS99] Ben Gamsa, Orran Krieger, Jonathan Appavoo, and Michael Stumm.
Tornado: Maximizing locality and concurrency in a shared memory mul-
tiprocessor operating system. In Proceedings of the 3rd Symposium on
Operating System Design and Implementation, pages 87–100, New Or-
leans, LA, February 1999.
[GKP13] Justin Gottschlich, Rob Knauerhase, and Gilles Pokam. But how do we
really debug transactional memory? In 5th USENIX Workshop on Hot
Topics in Parallelism (HotPar 2013), San Jose, CA, USA, June 2013.
[GKPS95] Ben Gamsa, Orran Krieger, E. Parsons, and Michael Stumm. Performance
issues for multiprocessor operating systems, November 1995. Technical
Report CSRI-339, Available: ftp://ftp.cs.toronto.edu/pub/
reports/csri/339/339.ps.
[Gle10] Thomas Gleixner. Realtime linux: academia v. reality, July 2010. URL:
http://lwn.net/Articles/397422/.
[Gle12] Thomas Gleixner. Linux -rt kvm guest demo, December 2012. Personal
communication.
[GPB+ 07] Brian Goetz, Tim Peierls, Joshua Bloch, Joseph Bowbeer, David Holmes,
and Doug Lea. Java: Concurrency in Practice. Addison Wesley, Upper
Saddle River, NJ, USA, 2007.
[Gri00] Scott Griffen. Internet pioneers: Doug englebart, May 2000. Available:
http://www.ibiblio.org/pioneers/englebart.html
[Viewed November 28, 2008].
[Gro01] The Open Group. Single UNIX specification, July 2001. http://www.
opengroup.org/onlinepubs/007908799/index.html.
[HCS+ 05] Lorin Hochstein, Jeff Carver, Forrest Shull, Sima Asgari, and Victor
Basili. Parallel programmer productivity: A case study of novice parallel
programmers. In SC ’05: Proceedings of the 2005 ACM/IEEE confer-
ence on Supercomputing, page 35, Washington, DC, USA, 2005. IEEE
Computer Society.
[HHK+ 13] A. Haas, T.A. Henzinger, C.M. Kirsch, M. Lippautz, H. Payer, A. Sezgin,
and A. Sokolova. Distributed queues in shared memory—multicore
performance and scalability through quantitative relaxation. In Proc.
International Conference on Computing Frontiers. ACM, 2013.
[HLM02] Maurice Herlihy, Victor Luchangco, and Mark Moir. The repeat offender
problem: A mechanism for supporting dynamic-sized, lock-free data
structures. In Proceedings of 16th International Symposium on Distributed
Computing, pages 339–353, October 2002.
[HMBW07] Thomas E. Hart, Paul E. McKenney, Angela Demke Brown, and Jonathan
Walpole. Performance of memory reclamation for lockless synchroniza-
tion. J. Parallel Distrib. Comput., 67(12):1270–1285, 2007.
[HMDZ06] David Howells, Paul E. McKenney, Will Deacon, and Peter Zijlstra. Linux
kernel memory barriers, March 2006. https://www.kernel.org/
doc/Documentation/memory-barriers.txt.
[Hol03] Gerard J. Holzmann. The Spin Model Checker: Primer and Reference
Manual. Addison-Wesley, Boston, MA, USA, 2003.
[HS08] Maurice Herlihy and Nir Shavit. The Art of Multiprocessor Programming.
Morgan Kaufmann, Burlington, MA, USA, 2008.
[Inm07] Bill Inmon. Time value of information, January 2007. URL: http:
//www.b-eye-network.com/view/3365.
[Jon11] Dave Jones. Trinity: A system call fuzzer. In Proceedings of the 13th
Ottawa Linux Symposium, pages ???–???, Ottawa, Canada, June 2011.
[JSG12] Christian Jacobi, Timothy Slegel, and Dan Greiner. Transactional mem-
ory architecture and implementation for IBM System z, December
2012. The 45th Annual IEEE/ACM International Symposium on Mi-
croArchitecture http://www.microsymposia.org/micro45/
talks-posters/3-jacobi-presentation.pdf.
[KCH+ 06] Sanjeev Kumar, Michael Chu, Christopher J. Hughes, Partha Kumar,
and Anthony Nguyen. Hybrid transactional memory. In Proceedings
of the ACM SIGPLAN 2006 Symposium on Principles and Practice of
Parallel Programming. ACM SIGPLAN, 2006. http://princeton.
kumarbhope.com/papers/PPoPP06/ppopp06.pdf.
[KLP12] Christoph M. Kirsch, Michael Lippautz, and Hannes Payer. Fast and
scalable k-fifo queues. Technical Report 2012-04, University of Salzburg,
Salzburg, Austria, June 2012.
[Kni08] John U. Knickerbocker. 3D chip technology. IBM Journal of Research
and Development, 52(6), November 2008. Available: http://www.
research.ibm.com/journal/rd52-6.html [Viewed: January
1, 2009].
[Knu73] Donald Knuth. The Art of Computer Programming. Addison-Wesley,
1973.
[KS08] Daniel Kroening and Ofer Strichman. Decision Procedures: An Algorith-
mic Point of View. Springer Publishing Company, Incorporated, 1 edition,
2008.
[KWS97] Leonidas Kontothanassis, Robert W. Wisniewski, and Michael L. Scott.
Scheduler-conscious synchronization. Communications of the ACM,
15(1):3–40, January 1997.
[LA94] Beng-Hong Lim and Anant Agarwal. Reactive synchronization algo-
rithms for multiprocessors, March 1994. 03/28/94 FTP hing.lcs.mit.edu
/pub/papers/reactive.ps.Z.
[Lam74] Leslie Lamport. A new solution of Dijkstra’s concurrent programming
problem. Communications of the ACM, 17(8):453–455, August 1974.
[Lea97] Doug Lea. Concurrent Programming in Java: Design Principles and
Patterns. Addison Wesley Longman, Reading, MA, USA, 1997.
[LHF05] Michael Lyons, Bill Hay, and Brad Frey. PowerPC storage model
and AIX programming, November 2005. http://www.ibm.com/
developerworks/systems/articles/powerpc.html.
[LLO09] Yossi Lev, Victor Luchangco, and Marek Olszewski. Scalable reader-
writer locks. In SPAA ’09: Proceedings of the twenty-first annual sym-
posium on Parallelism in algorithms and architectures, pages 101–110,
New York, NY, USA, 2009. ACM.
[Loc02] Doug Locke. Priority inheritance: The real story, July 2002.
Available: http://www.linuxdevices.com/articles/
AT5698775833.html [Viewed June 29, 2005].
[Lom77] D. B. Lomet. Process structuring, synchronization, and recovery
using atomic actions. SIGSOFT Softw. Eng. Notes, 2(2):128–137,
1977. Available: http://portal.acm.org/citation.cfm?
id=808319# [Viewed June 27, 2008].
[LR80] Butler W. Lampson and David D. Redell. Experience with processes and
monitors in Mesa. Communications of the ACM, 23(2):105–117, 1980.
[LS86] Vladimir Lanin and Dennis Shasha. A symmetric concurrent b-tree al-
gorithm. In ACM ’86: Proceedings of 1986 ACM Fall joint computer
conference, pages 380–389, Los Alamitos, CA, USA, 1986. IEEE Com-
puter Society Press.
[LS11] Yujie Liu and Michael Spear. Toxic transactions. In TRANSACT 2011.
ACM SIGPLAN, June 2011.
[MAK+ 01] Paul E. McKenney, Jonathan Appavoo, Andi Kleen, Orran Krieger,
Rusty Russell, Dipankar Sarma, and Maneesh Soni. Read-copy update.
In Ottawa Linux Symposium, July 2001. Available: http://www.
linuxsymposium.org/2001/abstracts/readcopy.php
http://www.rdrop.com/users/paulmck/RCU/rclock_
OLS.2001.05.01c.pdf [Viewed June 23, 2004].
[Mas92] H. Massalin. Synthesis: An Efficient Implementation of Fundamental
Operating System Services. PhD thesis, Columbia University, New York,
NY, 1992.
[Mat13] Norm Matloff. Programming on Parallel Machines. University of Cali-
fornia, Davis, Davis, CA, USA, 2013.
[MBM+ 06] Kevin E. Moore, Jayaram Bobba, Michelle J. Moravan, Mark D. Hill, and
David A. Wood. LogTM: Log-based transactional memory. In Proceed-
ings of the 12th Annual International Symposium on High Performance
Computer Architecture (HPCA-12), Washington, DC, USA, 2006. IEEE.
Available: http://www.cs.wisc.edu/multifacet/papers/
hpca06_logtm.pdf [Viewed December 21, 2006].
[MBWW12] Paul E. McKenney, Silas Boyd-Wickizer, and Jonathan Walpole. RCU
usage in the linux kernel: One decade later, September 2012. Technical re-
port paulmck.2012.09.17, http://rdrop.com/users/paulmck/
techreports/survey.2012.09.17a.pdf.
[McK90a] Paul E. McKenney. Stochastic fairness queuing. Technical Report ITSTD-
7186-PA-89-11, SRI International, Menlo Park, CA, March 1990. To
appear in INFOCOM’90.
[McK90b] Paul E. McKenney. Stochastic fairness queuing. In IEEE INFO-
COM’90 Proceedings, pages 733–740, San Francisco, June 1990.
The Institute of Electrical and Electronics Engineers, Inc. Re-
vision available: http://www.rdrop.com/users/paulmck/
scalability/paper/sfq.2002.06.04.pdf [Viewed May 26,
2008].
[McK91] Paul E. McKenney. Stochastic fairness queuing. Internetworking: Theory
and Experience, 2:113–131, 1991.
[McK95] Paul E. McKenney. Differential profiling. In MASCOTS 1995, pages
237–241, Toronto, Canada, January 1995.
[McK96a] Paul E. McKenney. Pattern Languages of Program Design, vol-
ume 2, chapter 31: Selecting Locking Designs for Parallel Pro-
grams, pages 501–531. Addison-Wesley, June 1996. Available:
http://www.rdrop.com/users/paulmck/scalability/
paper/mutexdesignpat.pdf [Viewed February 17, 2005].
[McK96b] Paul E. McKenney. Selecting locking primitives for parallel programs.
Communications of the ACM, 39(10):75–82, October 1996.
[McK03] Paul E. McKenney. Using RCU in the Linux 2.5 kernel. Linux
Journal, 1(114):18–26, October 2003. Available: http://
www.linuxjournal.com/article/6993 [Viewed November 14,
2007].
[McK07d] Paul E. McKenney. RCU and unloadable modules, January 2007. Avail-
able: http://lwn.net/Articles/217484/ [Viewed November
22, 2007].
[McK07e] Paul E. McKenney. Using Promela and Spin to verify parallel algorithms,
August 2007. Available: http://lwn.net/Articles/243851/
[Viewed September 8, 2007].
[McK14b] Paul E. McKenney. The RCU API, 2014 edition, September 2014. http:
//lwn.net/Articles/609904/.
[MGM+ 09] Paul E. McKenney, Manish Gupta, Maged M. Michael, Phil Howard,
Joshua Triplett, and Jonathan Walpole. Is parallel programming hard,
and if so, why? Technical Report TR-09-02, Portland State University,
Portland, OR, USA, February 2009. Available: http://www.cs.pdx.
edu/pdfs/tr0902.pdf [Viewed February 19, 2009].
[MHS12] Milo M. K. Martin, Mark D. Hill, and Daniel J. Sorin. Why on-chip
coherence is here to stay. Communications of the ACM, 55(7):78–89, July
2012.
[Mic04] Maged M. Michael. Hazard pointers: Safe memory reclamation for lock-
free objects. IEEE Transactions on Parallel and Distributed Systems,
15(6):491–504, June 2004.
[Mil06] David S. Miller. Re: [PATCH, RFC] RCU : OOM avoidance and lower la-
tency, January 2006. Available: https://lkml.org/lkml/2006/
1/7/22 [Viewed February 29, 2012].
[MLH94] Peter Magnusson, Anders Landin, and Erik Hagersten. Efficient software
synchronization on large cache coherent multiprocessors. Technical
Report T94:07, Swedish Institute of Computer Science, Kista, Sweden,
February 1994.
[MM00] Ingo Molnar and David S. Miller. brlock, March 2000. Available:
http://www.tm.kernel.org/pub/linux/kernel/v2.
3/patch-html/patch-2.3.49/linux_include_linux_
brlock.h.html [Viewed September 3, 2004].
[MMW07] Paul E. McKenney, Maged Michael, and Jonathan Walpole. Why the
grass may not be greener on the other side: A comparison of locking
vs. transactional memory. In Programming Languages and Operating
Systems, pages 1–5, New York, NY, USA, October 2007. ACM SIGOPS.
[MOZ09] Nicholas Mc Guire, Peter Odhiambo Okech, and Qingguo Zhou. Analysis
of inherent randomness of the linux kernel. In Eleventh Real Time Linux
Workshop, Dresden, Germany, September 2009.
[MPA+ 06] Paul E. McKenney, Chris Purcell, Algae, Ben Schumin, Gaius Cor-
nelius, Qwertyus, Neil Conway, Sbw, Blainster, Canis Rufus, Zoicon5,
Anome, and Hal Eisen. Read-copy update, July 2006. http://en.
wikipedia.org/wiki/Read-copy-update.
[MPI08] MPI Forum. Message passing interface forum, September 2008. Available:
http://www.mpi-forum.org/ [Viewed September 9, 2008].
[MR08] Paul E. McKenney and Steven Rostedt. Integrating and validating dynticks
and preemptable RCU, April 2008. Available: http://lwn.net/
Articles/279077/ [Viewed April 24, 2008].
[MS93] Paul E. McKenney and Jack Slingwine. Efficient kernel memory allo-
cation on shared-memory multiprocessors. In USENIX Conference Pro-
ceedings, pages 295–306, Berkeley CA, February 1993. USENIX Asso-
ciation. Available: http://www.rdrop.com/users/paulmck/
scalability/paper/mpalloc.pdf [Viewed January 30, 2005].
[MS95] Maged M. Michael and Michael L. Scott. Correction of a memory man-
agement method for lock-free data structures, December 1995. Technical
Report TR599.
[MS96] M. M. Michael and M. L. Scott. Simple, fast, and practical non-blocking
and blocking concurrent queue algorithms. In Proc of the Fifteenth ACM
Symposium on Principles of Distributed Computing, pages 267–275, May
1996. Available: http://www.research.ibm.com/people/m/
michael/podc-1996.pdf [Viewed January 26, 2009].
[MS98a] Paul E. McKenney and John D. Slingwine. Read-copy update: Using exe-
cution history to solve concurrency problems. In Parallel and Distributed
Computing and Systems, pages 509–518, Las Vegas, NV, October 1998.
[MS98b] Maged M. Michael and Michael L. Scott. Nonblocking algorithms and
preemption-safe locking on multiprogrammed shared memory multipro-
cessors. J. Parallel Distrib. Comput., 51(1):1–26, 1998.
[MS08] MySQL AB and Sun Microsystems. MySQL Downloads, November
2008. Available: http://dev.mysql.com/downloads/ [Viewed
November 26, 2008].
[MS09] Paul E. McKenney and Raul Silvera. Example power imple-
mentation for c/c++ memory model, February 2009. Available:
http://www.rdrop.com/users/paulmck/scalability/
paper/N2745r.2009.02.27a.html [Viewed: April 5, 2009].
[MS12] Alexander Matveev and Nir Shavit. Towards a fully pessimistic STM
model. In TRANSACT 2012. ACM SIGPLAN, February 2012.
[MS14] Paul E. McKenney and Alan Stern. Axiomatic validation of memory
barriers and atomic instructions, August 2014. http://lwn.net/
Articles/608550/.
[MSK01] Paul E. McKenney, Jack Slingwine, and Phil Krueger. Experience with
an efficient parallel kernel memory allocator. Software – Practice and
Experience, 31(3):235–257, March 2001.
[MSM05] Timothy G. Mattson, Beverly A. Sanders, and Berna L. Massingill. Pat-
terns for Parallel Programming. Addison Wesley, Boston, MA, USA,
2005.
[MSS04] Paul E. McKenney, Dipankar Sarma, and Maneesh Soni. Scaling dcache
with RCU. Linux Journal, 1(118):38–46, January 2004.
[MSS12] Luc Maranget, Susmit Sarkar, and Peter Sewell. A tutorial in-
troduction to the ARM and POWER relaxed memory models,
October 2012. https://www.cl.cam.ac.uk/~pes20/ppc-
supplemental/test7.pdf.
[MT01] Jose F. Martinez and Josep Torrellas. Speculative locks for con-
current execution of critical sections in shared-memory multiproces-
sors. In Workshop on Memory Performance Issues, International Sym-
posium on Computer Architecture, Gothenburg, Sweden, June 2001.
Available: http://iacoma.cs.uiuc.edu/iacoma-papers/
wmpi_locks.pdf [Viewed June 23, 2004].
[Nes06b] Oleg Nesterov. Re: [rfc, patch 1/2] qrcu: "quick" srcu implementation,
November 2006. Available: http://lkml.org/lkml/2006/11/
29/330 [Viewed November 26, 2008].
[ON06] Robert Olsson and Stefan Nilsson. TRASH: A dynamic LC-trie and
hash data structure, August 2006. http://www.nada.kth.se/
~snilsson/publications/TRASH/trash.pdf.
[ONH+ 96] Kunle Olukotun, Basem A. Nayfeh, Lance Hammond, Ken Wilson, and
Kunyung Chang. The case for a single-chip multiprocessor. In ASPLOS
VII, October 1996.
[Pat10] David Patterson. The trouble with multicore. IEEE Spectrum, 2010:28–32,
52–53, July 2010.
[Pig06] Nick Piggin. [patch 3/3] radix-tree: RCU lockless readside, June 2006.
Available: http://lkml.org/lkml/2006/6/20/238 [Viewed
March 25, 2008].
[Pug90] William Pugh. Concurrent maintenance of skip lists. Technical Report CS-
TR-2222.1, Institute of Advanced Computer Science Studies, Department
of Computer Science, University of Maryland, College Park, Maryland,
June 1990.
[Pul00] Geoffrey K. Pullum. How Dr. Seuss would prove the halting problem
undecidable. Mathematics Magazine, 73(4):319–320, 2000. http:
//www.lel.ed.ac.uk/~gpullum/loopsnoop.html.
[Ray99] Eric S. Raymond. The Cathedral and the Bazaar: Musings on Linux and
Open Source by an Accidental Revolutionary. O’Reilly, 1999.
[RD12] Ravi Rajwar and Martin Dixon. Intel transactional synchronization exten-
sions, September 2012. Intel Developer Forum (IDF) 2012 ARCS004.
[Reg10] John Regehr. A guide to undefined behavior in c and c++, part 1, July
2010. http://blog.regehr.org/archives/213.
[RG01] Ravi Rajwar and James R. Goodman. Speculative lock elision: Enabling
highly concurrent multithreaded execution. In Proceedings of the 34th
Annual ACM/IEEE International Symposium on Microarchitecture, pages
294–305, Austin, TX, December 2001. The Institute of Electrical and
Electronics Engineers, Inc.
[RH02] Zoran Radović and Erik Hagersten. Efficient synchronization for nonuni-
form communication architectures. In Proceedings of the 2002 ACM/IEEE
Conference on Supercomputing, pages 1–13, Baltimore, Maryland, USA,
November 2002. The Institute of Electrical and Electronics Engineers,
Inc.
[RH03] Zoran Radović and Erik Hagersten. Hierarchical backoff locks for nonuni-
form communication architectures. In Proceedings of the Ninth Interna-
tional Symposium on High Performance Computer Architecture (HPCA-
9), pages 241–252, Anaheim, California, USA, February 2003.
[RHP+ 07] Chistopher J. Rossbach, Owen S. Hofmann, Donald E. Porter, Hany E. Ra-
madan, Aditya Bhandari, and Emmett Witchel. TxLinux: Using and man-
aging hardware transactional memory in an operating system. In SOSP’07:
Twenty-First ACM Symposium on Operating Systems Principles. ACM
SIGOPS, October 2007. Available: http://www.sosp2007.org/
papers/sosp056-rossbach.pdf [Viewed October 21, 2007].
[Ros10a] Steven Rostedt. tracing: Harry Potter and the Deathly Macros, De-
cember 2010. Available: http://lwn.net/Articles/418710/
[Viewed: August 28, 2011].
[Ros10b] Steven Rostedt. Using the TRACE_EVENT() macro (part 1), March 2010.
Available: http://lwn.net/Articles/379903/ [Viewed: Au-
gust 28, 2011].
[Ros10c] Steven Rostedt. Using the TRACE_EVENT() macro (part 2), March 2010.
Available: http://lwn.net/Articles/381064/ [Viewed: Au-
gust 28, 2011].
[Ros10d] Steven Rostedt. Using the TRACE_EVENT() macro (part 3), April 2010.
Available: http://lwn.net/Articles/383362/ [Viewed: Au-
gust 28, 2011].
[Ros11] Steven Rostedt. lockdep: How to read its cryptic output, September
2011. http://www.linuxplumbersconf.org/2011/ocw/
sessions/153.
[Rus03] Rusty Russell. Hanging out with smart people: or... things I learned
being a kernel monkey, July 2003. 2003 Ottawa Linux Symposium
Keynote http://ozlabs.org/~rusty/ols-2003-keynote/
ols-keynote-2003.html.
[SAH+ 03] Craig A. N. Soules, Jonathan Appavoo, Kevin Hui, Dilma Da Silva, Gre-
gory R. Ganger, Orran Krieger, Michael Stumm, Robert W. Wisniewski,
Marc Auslander, Michal Ostrowski, Bryan Rosenburg, and Jimi Xeni-
dis. System support for online reconfiguration. In Proceedings of the
2003 USENIX Annual Technical Conference, pages 141–154. USENIX
Association, June 2003.
[SATG+ 09] Tatiana Shpeisman, Ali-Reza Adl-Tabatabai, Robert Geva, Yang Ni, and
Adam Welc. Towards transactional memory semantics for c++. In SPAA
’09: Proceedings of the twenty-first annual symposium on Parallelism in
algorithms and architectures, pages 49–58, New York, NY, USA, 2009.
ACM.
[Sha11] Nir Shavit. Data structures in the multicore age. Commun. ACM, 54(3):76–
84, March 2011.
[SHW11] Daniel J. Sorin, Mark D. Hill, and David A. Wood. A Primer on Memory
Consistency and Cache Coherence. Synthesis Lectures on Computer
Architecture. Morgan & Claypool, 2011.
[SM04] Dipankar Sarma and Paul E. McKenney. Making RCU safe for deep
sub-millisecond response realtime applications. In Proceedings of the
2004 USENIX Annual Technical Conference (FREENIX Track), pages
182–191. USENIX Association, June 2004.
[Smi15] Richard Smith. Working draft, standard for programming language C++,
May 2015. http://www.open-std.org/jtc1/sc22/wg21/
docs/papers/2015/n4527.pdf.
[SMS08] Michael Spear, Maged Michael, and Michael Scott. Inevitability mecha-
nisms for software transactional memory. In 3rd ACM SIGPLAN Work-
shop on Transactional Computing, New York, NY, USA, February 2008.
ACM. Available: http://www.cs.rochester.edu/u/scott/
papers/2008_TRANSACT_inevitability.pdf [Viewed Jan-
uary 10, 2009].
[Spi77] Keith R. Spitz. Tell which is which and you’ll be rich, 1977. Inscription
on wall of dungeon.
[Spr01] Manfred Spraul. Re: RFC: patch to allow lock-free traversal of lists with
insertion, October 2001. Available: http://marc.theaimsgroup.
com/?l=linux-kernel&m=100264675012867&w=2 [Viewed
June 23, 2004].
[Spr08] Manfred Spraul. [RFC, PATCH] state machine based rcu, August 2008.
Available: http://lkml.org/lkml/2008/8/21/336 [Viewed
December 8, 2008].
[SR84] Z. Segall and L. Rudolf. Dynamic decentralized cache schemes for MIMD
parallel processors. In 11th Annual International Symposium on Computer
Architecture, pages 340–347, June 1984.
[SRL90a] L. Sha, R. Rajkumar, and J. P. Lehoczky. Priority inheritance proto-
cols: An approach to real-time synchronization. IEEE Trans. Comput.,
39(9):1175–1185, 1990.
[SRL90b] Lui Sha, Ragunathan Rajkumar, and John P. Lehoczky. Priority inheri-
tance protocols: An approach to real-time synchronization. IEEE Trans-
actions on Computers, 39(9):1175–1185, 1990.
[SS94] Duane Szafron and Jonathan Schaeffer. Experimentally assessing the
usability of parallel programming systems. In IFIP WG10.3 Programming
Environments for Massively Parallel Distributed Systems, pages 19.1–
19.7, 1994.
[SS06] Ori Shalev and Nir Shavit. Split-ordered lists: Lock-free extensible hash
tables. J. ACM, 53(3):379–405, May 2006.
[SSHT93] Janice S. Stone, Harold S. Stone, Philip Heidelberger, and John Turek.
Multiple reservations and the Oklahoma update. IEEE Parallel and
Distributed Technology Systems and Applications, 1(4):58–71, November
1993.
[SSRB00] Douglas C. Schmidt, Michael Stal, Hans Rohnert, and Frank Buschmann.
Pattern-Oriented Software Architecture Volume 2: Patterns for Concur-
rent and Networked Objects. Wiley, Chichester, West Sussex, England,
2000.
[SSVM02] S. Swaminathan, John Stultz, Jack Vogel, and Paul E. McKenney. Fair-
locks – a high performance fair locking scheme. In Proceedings of the
14th IASTED International Conference on Parallel and Distributed Com-
puting and Systems, pages 246–251, Cambridge, MA, USA, November
2002.
[ST87] William E. Snaman and David W. Thiel. The VAX/VMS distributed lock
manager. Digital Technical Journal, 5:29–44, September 1987.
[ST95] Nir Shavit and Dan Touitou. Software transactional memory. In Proceed-
ings of the 14th Annual ACM Symposium on Principles of Distributed
Computing, pages 204–213, Ottawa, Ontario, Canada, August 1995.
[Ste92] W. Richard Stevens. Advanced Programming in the UNIX Environment.
Addison Wesley, 1992.
[Sut08] Herb Sutter. Effective concurrency, 2008. Series in Dr. Dobbs Journal.
[SW95] Richard L. Sites and Richard T. Witek. Alpha AXP Architecture. Digital
Press, second edition, 1995.
[The08] The Open MPI Project. Open MPI software downloads, November 2008. Avail-
able: http://www.open-mpi.org/software/ [Viewed Novem-
ber 26, 2008].
[TMW11] Josh Triplett, Paul E. McKenney, and Jonathan Walpole. Resizable, scal-
able, concurrent hash tables via relativistic programming. In Proceedings
of the 2011 USENIX Annual Technical Conference, pages 145–158, Port-
land, OR USA, June 2011. The USENIX Association.
[Tor01] Linus Torvalds. Re: [Lse-tech] Re: RFC: patch to allow lock-free traversal
of lists with insertion, October 2001. Available: http://lkml.org/
lkml/2001/10/13/105 [Viewed August 21, 2004].
[TS93] Hiroaki Takada and Ken Sakamura. A bounded spin lock algorithm with
preemption. Technical Report 93-02, University of Tokyo, Tokyo, Japan,
1993.
[Xu10] Herbert Xu. bridge: Add core IGMP snooping support, February
2010. Available: http://marc.info/?t=126719855400006&
r=1&w=2 [Viewed March 20, 2011].
[Yod04a] Victor Yodaiken. Against priority inheritance, September 2004. Available:
http://www.yodaiken.com/papers/inherit.pdf [Viewed
May 26, 2007].
[Yod04b] Victor Yodaiken. Temporal inventory and real-time synchronization in
RTLinuxPro, September 2004. URL: http://www.yodaiken.com/
papers/sync.pdf.
Appendix E
Credits
If I have seen further it is by standing on the shoulders of giants.
E.1 Authors
E.2 Reviewers
• Alan Stern (Section 14.2).
Reviewers whose feedback took the extremely welcome form of a patch are credited
in the git logs.