Contents
1 Introduction
1.1 Authors
2 Background
2.1 Systems Architecture
2.1.1 Assembly
2.1.2 Atomic Operations
2.1.3 Caching
2.1.4 Interrupts
2.1.5 Optional: Hyperthreading
2.2 Debugging and Environments
2.2.1 ssh
2.2.2 git
2.2.3 Editors
2.2.4 Clean Code
2.2.5 Asserts
2.3 Valgrind
2.3.1 TSAN
2.4 GDB
2.4.1 Involved gdb example
2.4.2 Shell
2.4.3 Undefined Behavior Sanitizer
2.4.4 Clang Static Build Tools
2.4.5 strace and ltrace
2.4.6 printfs
2.5 Extra: Compiling and Linking
2.6 Homework 0
2.6.1 So you want to master System Programming? And get a better grade than B?
2.6.2 Watch the videos and write up your answers to the following questions
2.6.3 Chapter 1
2.6.4 Chapter 2
2.6.5 Chapter 3
2.6.6 Chapter 4
2.6.7 Chapter 5
2.6.8 C Development
2.6.9 Optional: Just for fun
2.7 UIUC Specific Guidelines
2.7.1 Piazza
4 Processes
4.1 File Descriptors
4.2 Processes
6 Threads
6.1 Processes vs threads
6.2 Thread Internals
6.3 Simple Usage
6.4 Pthread Functions
6.5 Race Conditions
6.5.1 Don't Cross the Streams
7 Synchronization
7.1 Mutex
7.1.1 Mutex Lifetime
7.1.2 Mutex Usages
7.1.3 Mutex Implementation
7.1.4 Extra: Implementing a Mutex with hardware
7.1.5 Semaphore
7.2 Condition Variables
7.2.1 Extra: Why do Condition Variables also need a mutex?
7.2.2 Condition Wait Example
7.3 Thread-Safe Data Structures
7.3.1 Using Semaphores
7.4 Software Solutions to the Critical Section
7.4.1 Naive Solutions
7.4.2 Turn-based solutions
7.4.3 Turn and Flag solutions
7.5 Working Solutions
7.5.1 Peterson's Solution
7.5.2 Extra: Implementing Software Mutex
7.6 Implementing Counting Semaphore
7.6.1 Other semaphore considerations
7.6.2 Extra: Implementing CVs with Mutexes Alone
7.7 Barriers
7.7.1 Reader Writer Problem
7.7.2 Attempt #1
7.7.3 Attempt #2
7.7.4 Attempt #3
7.7.5 Starving writers
7.7.6 Attempt #4
7.8 Ring Buffer
7.8.1 Ring Buffer Gotchas
7.8.2 Multithreaded Correctness
7.8.3 Analysis
7.8.4 Another Analysis
7.8.5 Correct implementation of a ring buffer
7.9 Extra: Process Synchronization
7.9.1 Interruption
7.9.2 Solution
7.10 Extra: Higher Order Models of Synchronization
8 Deadlock
8.1 Resource Allocation Graphs
8.2 Coffman Conditions
8.3 Approaches to Solving Livelock and Deadlock
8.3.1 Extra: Banker's Algorithm
8.4 Dining Philosophers
8.4.1 Failed Solutions
8.5 Viable Solutions
8.5.1 Leaving the Table (Stallings' Solution)
8.5.2 Partial Ordering (Dijkstra's Solution)
8.5.3 Extra: Clean/Dirty Forks (Chandy/Misra Solution)
8.5.4 Extra: Actor Model
8.6 Topics
8.7 Questions
10 Scheduling
10.1 High Level Scheduler Overview
10.2 Measurements
10.2.1 What is preemption?
10.2.2 Why might a process (or thread) be placed on the ready queue?
10.3 Measures of Efficiency
10.3.1 Convoy Effect
10.3.2 Extra: Linux Scheduling
10.4 Scheduling Algorithms
10.4.1 Shortest Job First (SJF)
10.4.2 Preemptive Shortest Job First (PSJF)
10.4.3 First Come First Served (FCFS)
10.4.4 Round Robin (RR)
10.4.5 Priority
10.5 Extra: Scheduling Conceptually
10.5.1 First Come First Served
10.5.2 Round Robin or Processor Sharing
10.5.3 Non Preemptive Priority
10.5.4 Shortest Job First
10.5.5 Preemptive Priority
10.5.6 Preemptive Shortest Job First
10.6 Topics
10.7 Questions
11 Networking
11.1 The OSI Model
11.2 Layer 3: The Internet Protocol
11.2.1 Extra: In-depth IPv4 Specification
11.2.2 Extra: Routing
11.2.3 Extra: Fragmentation/Reassembly
11.2.4 Extra: IP Multicast
11.2.5 What's the deal with IPv6?
11.2.6 What's My Address?
11.3 Layer 4: TCP and Client
11.3.1 Note on network orders
11.3.2 TCP Client
11.3.3 Sending some data
11.4 Layer 4: TCP Server
11.4.1 Example Server
11.4.2 Sorry To Interrupt
11.5 Layer 4: UDP
11.5.1 UDP Attributes
11.5.2 UDP Client
11.5.3 UDP Server
11.6 Layer 7: HTTP
12 Filesystems
12.1 What is a filesystem?
12.1.1 The File API
12.2 Storing data on disk
12.2.1 File Contents
12.2.2 Directory Implementation
12.2.3 UNIX Directory Conventions
12.2.4 Directory API
12.2.5 Linking
12.2.6 Pathing
12.2.7 Metadata
12.3 Permissions and bits
12.3.1 User ID / Group ID
12.3.2 Reading / Changing file permissions
12.3.3 Understanding the 'umask'
12.3.4 The 'setuid' bit
12.3.5 The 'sticky' bit
12.4 Virtual filesystems and other filesystems
12.4.1 Managing files and filesystems
12.4.2 Obtaining Random Data
12.4.3 Copying Files
12.4.4 Updating Modification Time
12.4.5 Managing Filesystems
12.5 Memory Mapped IO
12.6 Reliable Single Disk Filesystems
12.6.1 RAID - Redundant Array of Inexpensive Disks
12.6.2 Higher Levels of RAID
12.6.3 Solutions
12.7 Simple Filesystem Model
12.7.1 File Size vs Space on Disk
12.7.2 Performing Reads
12.7.3 Performing Writes
12.7.4 Adding Deletes
12.8 Extra: Modern Filesystems
13 Signals
13.1 The Deep Dive of Signals
13.2 Sending Signals
13.3 Handling Signals
13.3.1 Sigaction
13.4 Blocking Signals
13.4.1 Sigwait
13.5 Signals in Child Processes and Threads
13.6 Topics
13.7 Questions
14 Security
14.1 Security Terminology and Ethics
14.1.1 CIA Triad
14.2 Security in C Programs
14.2.1 Stack Smashing
14.2.2 Buffer Overflow
14.2.3 Out of order instructions & Spectre
14.2.4 Operating Systems Security
14.2.5 Virtualization Security
14.2.6 Extra: Security through scrubbing
14.3 Cyber Security
14.3.1 Security at the TCP Level
14.3.2 Security at the DNS Level
14.4 Topics
14.5 Review
15 Review
15.1 C
15.1.1 Memory and Strings
15.1.2 Printing
15.1.3 Input parsing
15.2 Processes
15.3 Memory
15.4 Threading and Synchronization
15.5 Deadlock
15.6 IPC
15.7 Filesystems
15.8 Networking
15.9 Security
15.10 Signals
17 Appendix
17.1 Shell
17.1.1 Shell tricks and tips
17.1.2 What's a terminal?
17.1.3 Common Utilities
17.1.4 Syntactic
17.1.5 What are environment variables?
17.2 Stack Smashing
17.3 Assorted Man Pages
17.3.1 Malloc
17.4 System Programming Jokes
17.4.1 Light bulb jokes
17.4.2 Groaners
17.4.3 System Programmer (Definition)
Introduction
To thy happy children of the future, those of the past send greetings.
Alma Mater
At the University of Illinois at Urbana-Champaign, I fundamentally believe that we have a right to make the
university better for all future students. It is a message etched into our Alma Mater and makes up the DNA of
our course staff. As such, we created the coursebook. The coursebook is a free and open systems programming
textbook that anyone can read, contribute to, and modify for now and forever. We don’t think information should
be behind a walled garden, and we truly believe that complex concepts can be explained simply and fully, for
anyone to understand. The goal of this book is to teach you the basics and give you some intuition into the
complexities of systems programming.
Like any good book, it isn’t complete. We still have plenty of examples, ideas, typos, and chapters to work on.
If you find any issues, please file an issue or email a list of typos to CS 241 course staff, and we’ll be happy to
work on it. We are constantly trying to make the book better for students a year and ten years from now.
This work is based on the original coursebook located at this url. All these people's hard work is included in the section below.
Thanks again and happy reading!
– Bhuvy
Authors
Thomas Liu <[email protected]>
Johnny Chang <[email protected]>
goldcase <[email protected]>
vassimladenov <[email protected]>
SurtaiHan <[email protected]>
Brandon Chong <[email protected]>
Ben Kurtovic <[email protected]>
dprorok2 <[email protected]>
anchal-agrawal <[email protected]>
Lawrence Angrave <[email protected]>
daeyun <[email protected]>
bchong95 <[email protected]>
rushingseas8 <[email protected]>
lukspdev <[email protected]>
hilalh <[email protected]>
dimyr7 <[email protected]>
Azrakal <[email protected]>
G. Carl Evans <[email protected]>
Cornel Punga <[email protected]>
vikasagartha <[email protected]>
dyarbrough93 <[email protected]>
berwin7996 <[email protected]>
Sudarshan Govindaprasad <[email protected]>
NMyren <[email protected]>
Ankit Gohel <[email protected]>
vha-weh-shh <[email protected]>
sasankc <[email protected]>
rishabhjain2795 <[email protected]>
nickgarfield <[email protected]>
by700git <[email protected]>
bw-vbnm <[email protected]>
Navneeth Jayendran <[email protected]>
Joe Benassi <[email protected]>
Harpreet Singh <[email protected]>
FenixFeather <[email protected]>
EntangledLight <[email protected]>
Bliss Chapman <[email protected]>
zikaeroh <[email protected]>
time bandit <[email protected]>
paultgibbons <[email protected]>
kevinwang <[email protected]>
cPolaris <[email protected]>
Zecheng () <[email protected]>
Wieschie <[email protected]>
WeiL <[email protected]>
Graham Dyer <[email protected]>
Arun Prakash Jana <[email protected]>
Ankit Goel <[email protected]>
Allen Kleiner <[email protected]>
Abhishek Deep Nigam <[email protected]>
zmmille2 <[email protected]>
sidewallme <[email protected]>
raych05 <[email protected]>
mmahes <[email protected]>
mass <[email protected]>
kovaka <[email protected]>
gmag23 <[email protected]>
ejian2 <[email protected]>
cerutii <[email protected]>
briantruong777 <[email protected]>
adevar <[email protected]>
Yuxuan Zou (Sean) <[email protected]>
Xikun Zhang <[email protected]>
Vishal Disawar <[email protected]>
Taemin Shin <[email protected]>
Sujay Patwardhan <[email protected]>
SufeiZ <[email protected]>
Sufei Zhang <[email protected]>
Steven Shang <[email protected]>
Steve Zhu <[email protected]>
Sibo Wang <[email protected]>
Shane Ryan <[email protected]>
Scott Bigelow <[email protected]>
Riyad Shauk <[email protected]>
Nathan Somers <[email protected]>
LieutenantChips <[email protected]>
Jacob K LaGrou <[email protected]>
George <[email protected]>
David Levering <[email protected]>
Bernard Lim <[email protected]>
zwang180 <[email protected]>
xuanwang91 <[email protected]>
xin-0 <[email protected]>
wchill <[email protected]>
vishnui <[email protected]>
tvarun2013 <[email protected]>
sstevenshang <[email protected]>
ssquirrel <[email protected]>
smeenai <[email protected]>
shrujancheruku <[email protected]>
ruiqili2 <[email protected]>
rchwlsk2 <[email protected]>
ralphchung <[email protected]>
nikioftime <[email protected]>
mosaic0123 <[email protected]>
majiasheng <[email protected]>
m <[email protected]>
li820970 <[email protected]>
kuck1 <[email protected]>
kkgomez2 <[email protected]>
jjames34 <[email protected]>
jargals2 <[email protected]>
hzding621 <[email protected]>
hzding621 <[email protected]>
hsingh23 <[email protected]>
denisdemaisbr <[email protected]>
daishengliang <[email protected]>
cucumbur <[email protected]>
codechao999 <[email protected]>
chrisshroba <[email protected]>
cesarcastmore <[email protected]>
briantruong777 <[email protected]>
botengHY <[email protected]>
blapalp <[email protected]>
bchhetri1 <[email protected]>
anadella96 <[email protected]>
akleiner2 <[email protected]>
aRatnam12 <[email protected]>
Yash Sharma <[email protected]>
Xiangbin Hu <[email protected]>
WininWin <[email protected]>
William Klock <[email protected]>
WenhanZ <[email protected]>
Vivek Pandya <[email protected]>
Vineeth Puli <[email protected]>
Vangelis Tsiatsianas <[email protected]>
Vadiml1024 <[email protected]>
Utsav2 <[email protected]>
Thirumal Venkat <[email protected]>
TheEntangledLight <[email protected]>
SudarshanGp <[email protected]>
Sudarshan Konge <[email protected]>
Slix <[email protected]>
Sasank Chundi <[email protected]>
SachinRaghunathan <[email protected]>
Rémy Léone <[email protected]>
RusselLuo <[email protected]>
Roman Vaivod <[email protected]>
Rohit Sarathy <[email protected]>
Rick Sheahan <[email protected]>
Rakhim Davletkaliyev <[email protected]>
Punitvara <[email protected]>
Phillip Quy Le <[email protected]>
Pavle Simonovic <[email protected]>
Paul Hindt <[email protected]>
Nishant Maniam <[email protected]>
Mustafa Altun <[email protected]>
Mohammed Sadik P. K <[email protected]>
Mingchao Zhang <[email protected]>
Michael Vanderwater <[email protected]>
Maxiwell Luo <[email protected]>
LunaMystic <[email protected]>
Liam Monahan <[email protected]>
Joshua Wertheim <[email protected]>
John Pham <[email protected]>
Johannes Scheuermann <[email protected]>
Joey Bloom <[email protected]>
Jimmy Zhang <[email protected]>
Jeffrey Foster <[email protected]>
James Daniel <[email protected]>
Jake Bailey <[email protected]>
JACKHAHA363 <[email protected]>
Hydrosis <[email protected]>
Hong <[email protected]>
Grant Wu <[email protected]>
EvanFabry <[email protected]>
EddieVilla <[email protected]>
Deepak Nagaraj <[email protected]>
Daniel Meir Doron <[email protected]>
Daniel Le <[email protected]>
Daniel Jamrozik <[email protected]>
Daniel Carballal <[email protected]>
Daniel <[email protected]>
Daeyun Shin <[email protected]>
Creyslz <[email protected]>
Christian Cygnus <[email protected]>
CharlieMartell <[email protected]>
Caleb Bassi <[email protected]>
Brian Kurek <[email protected]>
Brendan Wilson <[email protected]>
Bo Liu <[email protected]>
Ayush Ranjan <[email protected]>
Atul kumar Agrawal <[email protected]>
Artur Sak <[email protected]>
Ankush Agarwal <[email protected]>
Angelino <[email protected]>
Andrey Zaytsev <[email protected]>
Alex Yang <[email protected]>
Alex Cusack <[email protected]>
Aidan Epstein <[email protected]>
Ace Nassri <[email protected]>
Abdullahi Abdalla <[email protected]>
Aneesh Durg <[email protected]>
Assassin Eclipse <[email protected]>
Eric Cao <[email protected]>
Raphael Long <[email protected]>
WeiL <[email protected]>
williamsentosa95 <[email protected]>
Pradyumna Shome <[email protected]>
Benjamin West Pollak <[email protected]>
Background
Sometimes the journey of a thousand steps begins by learning to
walk
Bhuvy
Systems Architecture
This section is a short review of System Architecture topics that you’ll need for System Programming.
Assembly
What is assembly? Assembly is the lowest level that you'll get to machine language without writing 1s and 0s. Each
computer has an architecture, and that architecture has an associated assembly language. Each assembly command
has a 1:1 mapping to a set of 1s and 0s that tells the computer exactly what to do. For example, the following
instruction in the widely used x86 assembly language adds one to the memory address 20 [13] – you can also look in [8]
Section 2A under the add instruction, though it is more verbose.
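In Intel syntax, such an instruction might look like the following sketch (the DWORD operand width is an illustrative choice):

add DWORD PTR [20], 1 ; read the value at address 20, add one, write it back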
Why do we mention this? Because although you are going to be doing most of this class in C, this is what your
code is translated into. Serious implications arise for race conditions and atomic operations.
Atomic Operations
An operation is atomic if no other processor can interrupt it partway through. Take, for example, the above assembly code to
add one to a memory address. In the architecture, it may actually take a few different steps on the circuit. The operation
may start by fetching the value of the memory from the RAM chip, then storing it in the cache or a register,
and then finally writing it back [12] – see the description of fetch-and-add, though your micro-architecture may
vary. Or, depending on optimizations, it may keep that value in a cache or in a register which is local to
that processor – try dumping the -O2 optimized assembly of incrementing a variable. The problem comes in if two
processors try to do it at the same time. The two processors could simultaneously copy the value of the memory
address, add one, and store the same result back, resulting in the value only being incremented once. That is why
we have a special set of instructions on modern systems called atomic operations. If an instruction is atomic, it
makes sure that only one processor or thread performs any intermediate step at a time. With x86 this is done by
the lock prefix [8, p. 1120].
Why don't we do this for everything? It makes instructions slower! If every time a computer does something it
has to make sure that the other cores or processors aren't doing anything, it'll be much slower. Most of the time
we differentiate atomic operations with special consideration, meaning we will tell you when we use something like this.
Most of the time you can assume the instructions are unlocked.
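In C you rarely write the lock prefix yourself; C11's <stdatomic.h> exposes atomic operations directly. A minimal sketch (the counter and function names are ours):

#include <stdatomic.h>

atomic_int counter;

void increment(void) {
    // Compiles to a locked read-modify-write (e.g. lock xadd on x86),
    // so concurrent increments are never lost.
    atomic_fetch_add(&counter, 1);
}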
Caching
Ah yes, caching: one of computer science's greatest problems. The caching we are referring to here is processor caching.
If a particular address is already in the cache when reading or writing, the processor will perform the operation
on the cache (adding, for example) and update the actual memory later, because updating memory is slow [9, Section
3.4]. If it isn't, the processor requests a chunk of memory from the memory chip and stores it in the cache, kicking
out the least recently used line – this depends on the caching policy, but Intel's does use this. This is done because
the L3 processor cache is roughly three times faster to reach than memory [11, p. 22], though
exact speeds will vary with clock speed and architecture. Naturally, this leads to problems because there
are two different copies of the same value – in the cited paper, this is called an unshared line. This isn't a class
about caching, but know how it could impact your code. A short but incomplete list:
1. Race conditions! If a value is stored in two different processor caches, then that value should be accessed
by a single thread.
2. Speed. With a cache, your program may mysteriously look faster. Just assume that reads and writes that
either happened recently or are next to each other in memory are fast.
3. Side effects. Every read or write affects the cache state. While most of the time this doesn't help or hurt, it
is important to know. Check the Intel programmer guide on the lock prefix for more information.
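Point 2 is easy to see for yourself. In the sketch below (names and size are illustrative), both functions add up the same array, but the row-major loop walks memory sequentially and mostly hits the cache, while the column-major loop strides a whole row of bytes between accesses and misses far more often:

#include <stdio.h>

#define N 1024
static int grid[N][N];

long sum_row_major(void) {
    long sum = 0;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            sum += grid[i][j]; // next element usually shares a cache line
    return sum;
}

long sum_col_major(void) {
    long sum = 0;
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            sum += grid[i][j]; // each access lands on a different cache line
    return sum;
}

int main(void) {
    printf("%ld %ld\n", sum_row_major(), sum_col_major());
    return 0;
}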
Interrupts
Interrupts are an important part of system programming. An interrupt is internally an electrical signal that is
delivered to the processor when something happens – this is a hardware interrupt [3]. The hardware then
decides if this is something that it should handle (e.g. keyboard or mouse input for older keyboards
and mice) or if it should pass it to the operating system. The operating system then decides if this is something
that it should handle (e.g. paging a memory table from disk) or something the application should handle (e.g. a
SEGFAULT). If the operating system decides that this is something the process or program should take care of,
it sends a software fault, and that software fault is then propagated. The application then decides if it is an error
(SEGFAULT) or not (SIGPIPE, for example) and reports to the user. Applications can send signals to the kernel
and to the hardware as well. This is an oversimplification, because there are certain hardware faults that can't be
ignored or masked away, but this class isn't about teaching you to build an operating system.
An important application of this is how system calls are served! There is a well-established set of
registers that the arguments go in, defined by the kernel, as well as a system call "number", again defined by the
kernel. The program then triggers an interrupt, which the kernel catches, and serves the system call [7].
Operating system developers and instruction set developers alike didn't like the overhead of causing an
interrupt on every system call. Now, systems use SYSENTER and SYSEXIT, which provide a cleaner way of transferring
control safely to the kernel and safely back. What "safely" means is obviously out of the scope for this class, but the
mechanism persists.
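On Linux, you can poke at this machinery yourself through the syscall(2) wrapper, which places the call number and arguments where the kernel expects them and traps into the kernel. A minimal sketch:

#include <stdio.h>
#include <sys/syscall.h>
#include <unistd.h>

int main(void) {
    // Ask the kernel for our process ID by raw system call number –
    // exactly what the getpid() wrapper does for us.
    long pid = syscall(SYS_getpid);
    printf("pid = %ld\n", pid);
    return 0;
}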
Optional: Hyperthreading
Hyperthreading is a new technology and is in no way, shape, or form multithreading. Hyperthreading allows
one physical core to appear as many virtual cores to the operating system [8, P.51]. The operating system can
then schedule processes on these virtual cores, and one core will execute them. Each core interleaves processes
or threads: while the core is waiting for one memory access to complete, it may perform a few instructions of
another process' thread. The overall result is more instructions executed in a shorter time. This potentially means
that you can cut down the number of cores you need to power smaller devices.
There be dragons here, though. With hyperthreading, you must be wary of optimizations. A famous hyper-
threading bug caused programs to crash if at least two processes were scheduled on a physical core, using
specific registers, in a tight loop. The actual problem is better explained through an architecture lens, but the
bug was found by systems programmers working on OCaml's mainline [10].
Debugging and Environments
I'm going to tell you a secret about this course: it is about working smarter, not harder. The course can be
time-consuming, but the reason so many people see it as such (and why so many students don't)
is their relative familiarity with their tools. Let's go through some of the common tools that you'll be
working with and need to be familiar with.
ssh
ssh is short for the Secure Shell [2]. It is a network protocol that allows you to spawn a shell on a remote machine.
Most of the time in this class you will need to ssh into your VM like this (substitute your NetID and your VM's address):
$ ssh <netid>@<vm-address>
If you don't want to type your password out every time, you can generate an ssh key pair that uniquely identifies
your machine. If you already have a key pair, you can skip to the ssh-copy-id step.
> ssh-keygen -t rsa -b 4096
# Do whatever keygen tells you
# Don't feel like you need a passcode if your login password is secure
> ssh-copy-id <netid>@<vm-address>
# Enter your password for maybe the final time
> ssh <netid>@<vm-address>
If you still think that that is too much typing, you can always alias hosts in your ssh config file (~/.ssh/config,
available on Linux and Mac). You may need to restart your VM or reload sshd for this to take effect. For Windows, you'll
have to use the Windows Subsystem for Linux or configure any aliases in PuTTY.
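For example, an entry like this in ~/.ssh/config (the host name and NetID are placeholders) lets you type ssh vm instead of the full address:

Host vm
    HostName <vm-address>
    User <netid>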
git
What is git? Git is a version control system, which means git stores the entire history of a directory. We
refer to the directory as a repository. There are a few things you need to know. First, create your repository
with the repo creator. If you haven't already signed into enterprise GitHub, make sure to do so, otherwise your
repository won't be created for you. After that, your repository exists on the server. Git is a
decentralized version control system, meaning that you'll need to get a copy of the repository onto your VM. We can do this
with a clone. Whatever you do, do not go through the README.md tutorial.
$ git clone https://github-dev.cs.illinois.edu/cs241-fa18/<netid>.git
This will create a local repository. The workflow is you make a change on your local repository, add the
changes to the current commit, actually commit, and push the changes to the server.
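In commands, that cycle might look like this (the file name and message are illustrative):

$ git add main.c
$ git commit -m "Finish malloc"
$ git push origin master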
There are plenty of other git commands that you may find useful; look them up as you need them:
1. git-cherry-pick
2. git-pack
3. git-gc
4. git-clean
5. git-rebase
6. git-stash/git-apply/git-pop
7. git-branch
A git status will tell you the state your repository is in:

$ git status
On branch master
Your branch is up-to-date with 'origin/master'.
nothing to commit, working directory clean

or

$ git status
On branch master
Your branch is up-to-date with 'origin/master'.
Changes not staged for commit:
(use "git add <file>..." to update what will be committed)
(use "git checkout -- <file>..." to discard changes in working directory)
modified: <FILE>
...
no changes added to commit (use "git add" and/or "git commit -a")

But if you see something like this:

$ git status
HEAD detached at 4bc4426
nothing to commit, working directory clean
Don't panic, but your repository may be in an unworkable state. If you aren't nearing a deadline, come to office
hours or ask your question on Piazza, and we'd be happy to help. In an emergency scenario, delete your repository
and re-clone (you'll have to add the release as above). This will lose any local uncommitted changes, so make
sure to copy any files you were working on to outside the directory, remove the repository, and copy them back in after re-cloning.
If you want to learn more about git, there is an almost endless number of tutorials and resources online that
can help you. Here are some links that can help you out:
1. https://git-scm.com/docs/gittutorial
2. https://www.atlassian.com/git/tutorials/what-is-version-control
3. https://thenewstack.io/tutorial-git-for-absolutely-everyone/
Editors
Some people take this as an opportunity to learn a new editor; others, not so much. The first part is for those of you
who want to learn a new editor. In the editor war that spans decades, we have come to the battle of vim vs emacs.
Vim is a text editor and a Unix-like utility. You enter vim by typing vim [file]. This takes you into the
editor. You start off in normal mode. In this mode, you can move around with many keys, the most common
ones being h, j, k, and l. To exit vim from this mode, you need to type :q, which quits. If you have any unsaved edits,
you must either save them with :w, save and quit with :wq, or discard changes with :q!. To make edits, you can either type i to
change into insert mode, or a to change into insert mode after the cursor. These are the basics when it comes to vim.
Emacs is more of a way of life, and I don't mean that figuratively. A lot of people say that emacs is a powerful
operating system lacking a decent text editor. This means emacs can house a terminal, a gdb session, an ssh session,
code, and a whole lot more. It would not be fitting to introduce you to gnu-emacs any other
way than through the gnu docs https://www.gnu.org/software/emacs/tour/. Just note that emacs is insanely
powerful. You can do almost anything with it. There are a fair number of students who like the IDE aspect of
other programming languages. Know that you can set up emacs to be an IDE, but you have to learn a bit of Lisp
http://martinsosic.com/development/emacs/2017/12/09/emacs-cpp-ide.html.
Then there are those of you who like to use your own editors. That is completely fine. For this, we require
sshfs, which has ports for many different platforms:
1. Windows https://github.com/billziss-gh/sshfs-win
2. Mac https://github.com/osxfuse/osxfuse/wiki/SSHFS
3. Linux https://help.ubuntu.com/community/SSHFS
At that point, the files on your VM are synced with the files on your machine, and any edits you make are
synced as well.
At the time of writing, Bhuvy (that's me!) likes to use spacemacs http://spacemacs.org/, which marries
both vim and emacs and both of their difficulties. I'll give my soapbox for why I like it, but be warned that if you
are starting from absolutely no vim or emacs experience, the learning curve along with this course may be too
much.
1. Extensible. Spacemacs has a clean design written in Lisp. There are hundreds of packages ready to be installed
by editing your spacemacs config and reloading, which do everything from syntax checking to automatic static
analysis.
2. Most of the good parts of vim and emacs. Emacs is good at doing everything but being a fast editor.
Vim is good at making fast edits and moving around. Spacemacs is the best of both worlds, allowing vim
keybindings on top of all the emacs goodness underneath.
3. Lots of preconfiguration done. As opposed to a fresh emacs install, a lot of the configuration for
languages and projects is done for you, like neotree, helm, and various language layers. All that you have to
do is navigate neotree to the base of your project, and emacs will turn into an IDE for that programming
language.
But obviously, to each his or her own. Many people will argue that editor gurus spend more time editing their
editors than actually editing.
Clean Code
Make your code modular using helper functions. If there is a repeated task (getting the pointers to contiguous
blocks in the malloc MP, for example), make it a helper function. And make sure each function does one thing
well, so that you don't have to debug twice. Let's say that we are doing selection sort by finding the minimum
element each iteration.
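A minimal sketch of such a sort, assuming a swap helper (the bug to find is in the inner loop's bounds):

#include <stddef.h>

static void swap(int *a, int *b) {
    int tmp = *a;
    *a = *b;
    *b = tmp;
}

void selection_sort(int *a, size_t len) {
    for (size_t i = 0; i + 1 < len; ++i) {
        size_t min_index = i;
        // Bug: j starts at 0 instead of i, so the already-sorted prefix
        // gets rescanned and can be swapped back out of place.
        for (size_t j = 0; j < len; ++j) {
            if (a[j] < a[min_index]) {
                min_index = j;
            }
        }
        swap(&a[i], &a[min_index]);
    }
}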
Many can see the bug in the code, but it can help to refactor the above method into smaller functions.
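A sketch of that refactoring, with the same assumed swap helper:

static size_t find_min_index(const int *a, size_t start, size_t len) {
    size_t min_index = start;
    for (size_t j = start; j < len; ++j) {
        if (a[j] < a[min_index]) {
            min_index = j;
        }
    }
    return min_index;
}

void selection_sort(int *a, size_t len) {
    for (size_t i = 0; i + 1 < len; ++i) {
        size_t min_index = find_min_index(a, i, len);
        swap(&a[i], &a[min_index]);
    }
}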
And the error is specifically in one function. In the end, this class is about writing system programs, not a class
about refactoring/debugging your code. In fact, most kernel code is so atrocious that you don’t want to read it –
the defense there is that it needs to be. But for the sake of debugging, it may benefit you in the long run to adopt
some of these practices.
Asserts
Use assertions to make sure your code works up to a certain point – and, importantly, to make sure you
don't break it later. For example, if your data structure is a doubly-linked list, you can do something like
assert(node == node->next->prev) to assert that the next node has a pointer to the current node. You
can also check that a pointer points to an expected range of memory addresses, is non-null, that ->size is reasonable, etc.
Defining the NDEBUG macro will disable all assertions, so don't forget to set that once you finish debugging [1].
Here is a quick example with an assert. Let's say that we are writing code using memcpy. We would want to
put an assert before it that checks whether the two memory regions overlap. If they do overlap, memcpy runs into
undefined behavior, so we want to catch that problem earlier rather than later.
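A sketch of such a check wrapped around memcpy (the wrapper name is ours; the assert is the standard interval-overlap test):

#include <assert.h>
#include <string.h>

void *checked_memcpy(void *dest, const void *src, size_t n) {
    char *d = (char *)dest;
    const char *s = (const char *)src;
    // [dest, dest+n) and [src, src+n) must not overlap – memcpy on
    // overlapping regions is undefined behavior.
    assert(d + n <= s || s + n <= d);
    return memcpy(dest, src, n);
}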
This check can be turned off at compile-time, but will save you tons of trouble debugging!
Valgrind
Valgrind is a suite of debugging and profiling tools that make your programs more
correct and detect some runtime issues [4]. The most used of these tools is Memcheck, which can detect many
memory-related errors that are common in C and C++ programs and that can lead to crashes and unpredictable
behavior (for example, unfreed memory buffers). To run Valgrind on your program:
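The invocation looks like this (myprogram and its arguments are placeholders for your own):

$ valgrind ./myprogram arg1 arg2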
Arguments are optional, and the default tool that will run is Memcheck. The output reports the number of
allocations, frees, and errors. Suppose we have a simple program like this:
#include <stdlib.h>

void dummy_function() {
    int *x = malloc(10 * sizeof(int));
    x[10] = 0; // error 1: out-of-bounds write – x[10] is one past the end
}              // error 2: memory leak – x is still allocated at function exit

int main(void) {
    dummy_function();
    return 0;
}
This program compiles and runs with no errors. Let’s see what Valgrind will output.
Invalid write: It detected our heap block overrun, writing outside of an allocated block.
Definitely lost: Memory leak — you probably forgot to free a memory block.
Valgrind is an effective tool for checking errors at runtime. C is special when it comes to such behavior, so after
compiling your program, you can use Valgrind to fix errors that your compiler may miss and that usually happen
while your program is running.
For more information, you can refer to the manual [4].
TSAN
ThreadSanitizer is a tool from Google, built into clang and gcc, to help you detect race conditions in your code
[5]. Note that running with tsan will slow your code down a bit. Consider the following code.
#include <pthread.h>
#include <stdio.h>

int global;

// Thread body reconstructed for illustration: any unsynchronized access
// to global here races with the write in main.
void *Thread1(void *arg) {
    global++;
    return NULL;
}

int main() {
    pthread_t t[2];
    pthread_create(&t[0], NULL, Thread1, NULL);
    global = 100;
    pthread_join(t[0], NULL);
}

// compile with gcc -fsanitize=thread -pie -fPIC -ltsan -g simple_race.c
We can see that there is a race condition on the variable global: both the main thread and the created thread
will try to change the value at the same time. But does ThreadSanitizer catch it?
$ ./a.out
==================
WARNING: ThreadSanitizer: data race (pid=28888)
Read of size 4 at 0x7f73ed91c078 by thread T1:
#0 Thread1 /home/zmick2/simple_race.c:7 (exe+0x000000000a50)
#1 :0 (libtsan.so.0+0x00000001b459)
If we compiled with the debug flag, then it would give us the variable name as well.
GDB
GDB is short for the GNU Debugger. GDB is a program that helps you track down errors by interactively debugging
them [6]. It can start and stop your program, look around, and put in ad hoc constraints and checks. Here are a
few examples.
Setting breakpoints programmatically A breakpoint is a line of code where you want the execution
to stop and give control back to the debugger. A useful trick when debugging complex C programs with GDB is
setting breakpoints in the source code.
int main() {
int val = 1;
val = 42;
asm("int $3"); // set a breakpoint here
val = 7;
}
You can also set breakpoints from within GDB. Assume that we have no optimizations and the line numbers
are as follows:
1. int main() {
2. int val = 1;
3. val = 42;
4. val = 7;
5. }
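With those line numbers, a session might go like this (a sketch: breaking on line 4 stops before val = 7 executes, so val still holds 42):

$ gdb ./a.out
(gdb) break 4
(gdb) run
...
(gdb) print val
$1 = 42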
Checking memory content We can also use gdb to check the content of different pieces of memory. For
example,
#include <stdio.h>
int main() {
    char bad_string[3] = {'C', 'a', 't'};
    printf("%s", bad_string);
}
After compiling, we can use gdb to look at specific bytes of the string and reason about when the program
should've stopped running.
(gdb) l
1 #include <stdio.h>
2 int main() {
3 char bad_string[3] = {'C', 'a', 't'};
4 printf("%s", bad_string);
5 }
(gdb) b 4
Breakpoint 1 at 0x100000f57: file main.c, line 4.
(gdb) r
[...]
Breakpoint 1, main () at main.c:4
4 printf("%s", bad_string);
(gdb) x/16xb bad_string
0x7fff5fbff9cd: 0x63 0x61 0x74 0xe0 0xf9 0xbf 0x5f 0xff
0x7fff5fbff9d5: 0x7f 0x00 0x00 0xfd 0xb5 0x23 0x89 0xff
(gdb)
Here, by using the x command with parameters 16xb, we can see that starting at memory address 0x7fff5fbff9cd
(the value of bad_string), printf would actually see the above sequence of bytes as a string, because we
provided a malformed string without a null terminator.
Involved gdb example
Now let's try something a little more involved: a program that converts degrees to radians, with a couple of bugs. The
convert_to_radians below is a plausible sketch, written to be consistent with the gdb session that follows (note the
integer division):

#include <stdio.h>

// Sketch of the conversion: 31415 / 1000 is integer division, which
// truncates pi's digits to 31 – one of the bugs explored below.
double convert_to_radians(int deg) {
    return deg * (31415 / 1000) / 180;
}

int main() {
    for (int deg = 0; deg > 360; ++deg) {
        double radians = convert_to_radians(deg);
        printf("%d. %f\n", deg, radians);
    }
    return 0;
}
Set a breakpoint inside the loop and run the code: the breakpoint didn't even trigger, meaning the loop body
never executed. That's because of the comparison! Okay, flip the sign; it should work now, right?
(gdb) run
350. 60.000000
351. 60.000000
352. 60.000000
353. 60.000000
354. 60.000000
355. 61.000000
356. 61.000000
357. 61.000000
358. 61.000000
359. 61.000000
(gdb) break 14 if deg == 359 # Let’s check the last iteration only
(gdb) run
...
(gdb) print/x deg # print the hex value of degree
$1 = 0x167
(gdb) print (31415/1000)
$2 = 31
(gdb) print (31415/1000.0)
$3 = 31.414999999999999
(gdb) print (31415.0/10000.0)
$4 = 3.1414999999999999
That was only the bare minimum, though most of you will get by with that. There are a whole load more
resources on the web; here are a few specific ones that can help you get started.
1. Introduction to gdb
2. Memory Content
3. CppCon 2015: Greg Law "Give me 15 minutes and I'll change your view of GDB"
Shell
What do you actually use to run your program? A shell! A shell is a programming language that is running inside
your terminal. A terminal is merely a window to input commands. Now, on POSIX systems we usually have one shell
called sh that is linked to a POSIX-compliant shell called dash. Most of the time, you use a shell called bash, which
is somewhat POSIX-compliant but has some nifty built-in features. If you want to be even more advanced, zsh
has some more powerful features like tab completion on programs and fuzzy patterns.
Undefined Behavior Sanitizer
Also, please, please read Chris Lattner's 3-part blog post on undefined behavior. It can shed light on debug builds
and the mystery of compiler optimization.
http://blog.llvm.org/2011/05/what-every-c-programmer-should-know.html
Clang Static Build Tools
Clang also ships static analysis tools; you can run them over an entire project by prefixing your build command with scan-build:
$ scan-build make
And in addition to the make output, you will get static build warnings.
strace and ltrace
strace and ltrace trace, respectively, the system calls and the library calls that a running program makes.
Debugging with ltrace can be as simple as figuring out the return value of the last library call that failed.
#include <stdio.h>

int main() {
    FILE *fp = fopen("I don't exist", "r");
    fprintf(fp, "a"); // fp is NULL here, which ltrace makes obvious
    fclose(fp);
    return 0;
}
ltrace output can clue you in to weird things your program is doing live. Unfortunately, ltrace can't be used
to inject faults, meaning that ltrace can tell you what is happening, but it can't tamper with what is already
happening.
strace, on the other hand, can modify your program. Debugging with strace is amazing. The basic usage is
running strace with a program, and it'll get you a complete list of system calls and their parameters.
Newer versions of strace can actually inject faults into your program. This is useful when you want to
occasionally make reads and writes fail, for example in a networking application, which your program should
handle. The problem is that as of early 2019, that version is missing from the Ubuntu repositories, meaning that you'll
have to install it from source.
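For example (a sketch assuming a recent strace with fault-injection support; the program name is a placeholder):

$ strace ./myprogram                          # log every system call and its parameters
$ strace -e inject=read:error=EIO ./myprogram # make each read(2) fail with EIO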
printfs
When all else fails, print! Each of your functions should have an idea of what it is going to do. You want to test
that each of your functions is doing what it set out to do and see exactly where your code breaks. In the case of race conditions, tsan may be able to help, but having each thread print out data at certain times could help you identify the race condition.
To make printfs useful, try to have a macro that fills in the context by which the printf was called – a log
statement if you will. A simple useful but untested log statement could be as follows. Try to make a test and
figure out something that is going wrong, then log the state of your variables.
#include <execinfo.h>
#include <stdio.h>
#include <stdlib.h>
#include <stdarg.h>
#include <unistd.h>

// bt means "also print a backtrace along with the message"
static const int num_stack = 10;

int __log(int line, const char *file, int bt, const char *fmt, ...) {
    if (bt) {
        void *raw_trace[num_stack];
        int size = backtrace(raw_trace, sizeof(raw_trace) / sizeof(raw_trace[0]));
        char **syms = backtrace_symbols(raw_trace, size);
        for (int i = 0; i < size; i++) {
            fprintf(stderr, "|%s:%d| %s\n", file, line, syms[i]);
        }
        free(syms);
    }
    // Print the file and line for context, then the caller's formatted message
    int ret = fprintf(stderr, "|%s:%d| ", file, line);
    va_list args;
    va_start(args, fmt);
    ret += vfprintf(stderr, fmt, args);
    va_end(args);
    ret += fprintf(stderr, "\n");
    return ret;
}

#ifdef DEBUG
#define log(...) __log(__LINE__, __FILE__, 0, __VA_ARGS__)
#define bt(...) __log(__LINE__, __FILE__, 1, __VA_ARGS__)
#else
#define log(...)
#define bt(...)
#endif

int main() {
    log("Hello Log");
    bt("Hello Backtrace");
}
Extra: Compiling and Linking
This is a high-level overview of what happens from the time you compile your program to the time you run it. Compiling a program often seems easy: you run it through an IDE or a terminal, and it just works.
$ cat main.c
#include <stdio.h>
int main() {
printf("Hello World!\n");
return 0;
}
$ gcc main.c -o main
$ ./main
Hello World!
$
1. Preprocessing: The preprocessor expands directives like #include and #define, producing a single expanded source file.
2. Parsing: The compiler parses the text file for function declarations, variable declarations, etc.
3. Assembly Generation: The compiler then generates assembly code for all the functions after some optimiza-
tions if enabled.
4. Assembling: The assembler turns the assembly into 0s and 1s and creates an object file. This object file
maps names to pieces of code.
5. Static Linking: The linker then takes a series of objects and static libraries and resolves references of variables and functions from one object file to another. The linker then finds the main method and makes that the entry point for the program. The linker also notices when a function is meant to be dynamically linked, and creates a section in the executable that tells the operating system that these functions need addresses right before running.
6. Dynamic Linking: As the program is getting ready to be executed, the operating system looks at what
libraries that the program needs and links those functions to the dynamic library.
Further classes will teach you about parsing and assembly – preprocessing is an extension of parsing. Most classes won't teach you about the two different types of linking though. Static linking a library is similar to combining object files. To create a static library, an archiver bundles different object files into a single archive; a static library is literally an archive of object files. These libraries are useful when you want your executable to be secure – you know all the code that is being included in your executable – and portable – all the code is bundled with your executable, meaning no additional installs.
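As a sketch with gcc and the ar archiver (the file names here are hypothetical):
$ gcc -c foo.c bar.c            # produce foo.o and bar.o
$ ar rcs libfoo.a foo.o bar.o   # archive them into a static library
$ gcc main.c -L. -lfoo -o main  # link the library into the executable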
The other type is a dynamic library. Typically, dynamic libraries are installed user-wide or system-wide and
are accessible by most programs. Dynamic libraries’ functions are filled in right before they are run. There are a
number of benefits to this.
• Lower code footprint for common libraries like the C standard library
• Late binding means more generalized code and less reliance on specific behavior.
• Differentiation means that the shared library can be updated while keeping the executable the same.
There are a number of drawbacks as well.
• All the code is no longer bundled into your program. This means that users have to install something else.
• There could be security flaws in the other code leading to security exploits in your program.
• Standard Linux allows you to "replace" dynamic libraries, leading to possible social engineering attacks.
• This adds additional complexity to your application. Two identical binaries with different shared libraries
could lead to different results.
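Building a dynamic library follows a similar sketch (again with hypothetical file names); -fPIC makes the object code position independent so it can be loaded anywhere in memory:
$ gcc -fPIC -c foo.c
$ gcc -shared -o libfoo.so foo.o
$ gcc main.c -L. -lfoo -o main
$ LD_LIBRARY_PATH=. ./main      # tell the loader where to find libfoo.so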
Homework 0
// First, can you guess which lyrics have been transformed into this C-like system code?
char q[] = "Do you wanna build a C99 program?";
#define or "go debugging with gdb?"
static unsigned int i = sizeof(or) != strlen(or);
char* ptr = "lathe";
size_t come = fprintf(stdout,"%s door", ptr+2);
int away = ! (int) * "";
if(!fork()) {
execlp("man","man","-3","ftell", (char*)0); perror("failed");
}
if(!fork()) {
    execlp("make","make", "snowman", (char*)0);
    execlp("make","make", (char*)0);
}
exit(0);
So you want to master System Programming? And get a better grade than
B?
Watch the videos and write up your answers to the following questions
Important!
The virtual machine-in-your-browser and the videos you need for HW0 are here:
http://cs-education.github.io/sys/
Questions? Comments? Use the current semester’s CS241 Piazza: https://piazza.com/
The in-browser virtual machine runs entirely in JavaScript and is fastest in Chrome. Note that the VM and any code you write is reset when you reload the page, so copy your code to a separate document. The post-video challenges are not part of Homework 0, but you learn the most by doing rather than passively watching, so have some fun with each end-of-video challenge.
HW0 questions are below. Copy your answers into a text document because you’ll need to submit them later
in the course.
Chapter 1
In which our intrepid hero battles standard out, standard error, file descriptors and writing to files
1. Hello, World! (system call style) Write a program that uses write() to print out "Hi! My name is <Your
Name>".
2. Hello, Standard Error Stream! Write a function to print out a triangle of height n to standard error.
Your function should have the signature void write_triangle(int n) and should use write(). The
triangle should look like this, for n = 3:
*
**
***
3. Writing to files Take your program from "Hello, World!" and modify it to write to a file called hello_world.txt. Make sure to use the correct flags and a correct mode for open() (man 2 open is your friend).
4. Not everything is a system call Take your program from "Writing to files" and replace write() with
printf(). Make sure to print to the file instead of standard out!
Chapter 2
Sizing up C types and their limits, int and char arrays, and incrementing pointers
3. How many bytes are the following on your machine? int, double, float, long, and long long
4. On a machine with 8 byte integers, the declaration for the variable data is int data[8]. If the address
of data is 0x7fbd9d40, then what is the address of data+2?
5. What is data[3] equivalent to in C? Hint: what does C convert data[3] to before dereferencing the
address? Remember, the type of a string constant "abc" is an array.
10. Give an example of Y such that sizeof(Y) might be 4 or 8 depending on the machine.
Chapter 3
Program arguments, environment variables, and working with character arrays (strings)
3. Where are the pointers to environment variables stored (on the stack, the heap, somewhere else)?
4. On a machine where pointers are 8 bytes, and with the following code:
Chapter 4
Heap and stack memory, and working with structs
1. If I want to use data after the lifetime of the function it was created in ends, where should I put it? How do
I put it there?
4. Fill in the blank: "In a good C program, for every malloc, there is a ___".
free(ptr);
free(ptr);
free(ptr);
printf("%s\n", ptr);
9. How can one avoid the previous two mistakes?
10. Create a struct that represents a Person. Then make a typedef, so that struct Person can be
replaced with a single word. A person should contain the following information: their name (a string), their
age (an integer), and a list of their friends (stored as a pointer to an array of pointers to Persons).
11. Now, make two persons on the heap, "Agent Smith" and "Sonny Moore", who are 128 and 256 years old
respectively and are friends with each other. Create functions to create and destroy a Person (Person’s and
their names should live on the heap).
12. create() should take a name and age. The name should be copied onto the heap. Use malloc to reserve sufficient memory for everyone having up to ten friends. Be sure to initialize all fields (why?).
13. destroy() should free up both the memory of the person struct and all of its attributes that are stored on the heap. Destroying one person should not destroy any others.
Chapter 5
Text input and output and parsing using getchar, gets, and getline.
1. What functions can be used for getting characters from stdin and writing them to stdout?
3. Write code that parses the string "Hello 5 World" and initializes 3 variables to "Hello", 5, and "World".
5. Write a C program to print out the content of a file line-by-line using getline().
C Development
These are general tips for compiling and developing using a compiler and git. Some web searches will be useful here.
2. You fix a problem in the Makefile and type make again. Explain why this may be insufficient to generate a
new build.
3. Are tabs or spaces used to indent the commands after the rule in a Makefile?
4. What does git commit do? What’s a sha in the context of git?
6. What does git status tell you and how would the contents of .gitignore change its output?
7. What does git push do? Why is it insufficient to commit with git commit -m ’fixed all bugs’ ?
8. What does a non-fast-forward error git push reject mean? What is the most common way of dealing with
this?
Optional: Just for fun
• Convert a song lyrics into System Programming and C code covered in this wiki book and share on Piazza.
• Find, in your opinion, the best and worst C code on the web and post the link to Piazza.
• Write a short C program with a deliberate subtle C bug and post it on Piazza to see if others can spot your
bug.
• Do you have any cool/disastrous system programming bugs you’ve heard about? Feel free to share with
your peers and the course staff on Piazza.
Piazza
TAs and student assistants get a ton of questions. Some are well-researched, and some are not. This is a handy
guide that’ll help you move away from the latter and towards the former. Oh, and did I mention that this is an
easy way to score points with your internship managers? Ask yourself...
6. Did I Google the error message and a few permutations thereof if necessary? How about StackOverflow?
7. Did I try commenting out, printing, and/or stepping through parts of the code bit by bit to find out precisely
where the error occurs?
8. Did I commit my code to git in case the TAs need more context?
9. Did I include the console/GDB/Valgrind output **AND** code surrounding the bug in my Piazza post?
10. Have I fixed other segmentation faults unrelated to the issue I’m having?
11. Am I following good programming practice? (i.e. encapsulation, functions to limit repetition, etc)
The biggest tip that we can give you when asking a question on Piazza, if you want a swift answer, is to ask your question as if you were trying to answer it. That is, before you ask a question, try to answer it yourself. A post that merely says something is broken and asks for help sounds good and courteous, but course staff would much, much prefer a post resembling the following.
Hi, I recently failed test X, Y, Z which is about half the tests on this current assignment. I noticed that
they all have something to do with networking and epoll, but couldn’t figure out what was linking
them together, or I may be completely off track. So to test my idea, I tried spawning 1000 clients
with various get and put requests and verifying the files matched their originals. I couldn’t get it to
fail while running normally, the debug build, or valgrind or tsan. I have no warnings and none of
the pre-syntax checks showed me anything. Could you tell me if my understanding of the failure is
correct and what I could do to modify my tests to better reflect X, Y, Z? netid: bvenkat2
You don’t need to be as courteous, though we’d appreciate it, this will get a faster response time hand over
foot. If you were trying to answer this question, you’d have everything you need in the question body.
Bibliography
[6] GDB: The GNU Project Debugger, Feb 2019. URL https://www.gnu.org/software/gdb/.
[7] Manu Garg. Sysenter based system call mechanism in Linux 2.6, 2006. URL http://articles.manugarg.com/systemcallinlinux2_6.html.
[8] Intel Corporation. Intel 64 and IA-32 Architectures Software Developer's Manual, Volume 3B: System Programming Guide.
[9] Intel Corporation. Improving real-time performance by utilizing cache allocation technology, April 2015.
[10] Xavier Leroy. How I found a bug in Intel Skylake processors, Jul 2017. URL http://gallium.inria.fr/blog/intel-skylake-bug/.
[11] David Levinthal. Performance analysis guide for Intel Core i7 processor and Intel Xeon 5500 processors. Intel Performance Analysis Guide, 30:18, 2009.
[12] Hermann Schweizer, Maciej Besta, and Torsten Hoefler. Evaluating the cost of atomic operations on modern architectures. In 2015 International Conference on Parallel Architecture and Compilation (PACT), pages 445–456. IEEE, 2015.
[13] Wikibooks. X86 Assembly — Wikibooks, the free textbook project, 2018. URL https://en.wikibooks.org/w/index.php?title=X86_Assembly&oldid=3477563. [Online; accessed 19-March-2019].
The C Programming Language
If you want to teach systems, don’t drum up the programmers, sort
the issues, and make PRs. Instead, teach them to yearn for the vast and
endless C.
Antoine de Saint-Exupéry (With edits from Bhuvy)
Note: This chapter is long and goes into a lot of detail. Feel free to gloss over parts with which you already have experience.
C is the de-facto programming language for serious systems programming. Why? Most kernels make their interfaces accessible through C. The Linux kernel [7] and the XNU kernel [4], on which macOS is based, are written in C and expose a C API - Application Programming Interface. The Windows kernel uses C++, but doing systems programming on Windows is much harder for novice system programmers. C doesn't have niceties like classes and RAII to clean up memory. C also gives you much more of an opportunity to shoot yourself in the foot, but it lets you do things at a much more fine-grained level.
History of C
C was developed by Dennis Ritchie at Bell Labs back in 1973 [8]. Back then, we had gems of programming languages like FORTRAN and LISP. The goal of C was two-fold. Firstly, it was made to target the most popular computers at the time, such as the PDP-11. Secondly, it tried to remove some of the lower-level constructs (managing registers, and programming assembly for each machine), and create a language that had the power to express programs procedurally (as opposed to mathematically like LISP) with readable code. All this while still having the ability to interface with the operating system. It sounded like a tough feat. At first, it was only used internally at Bell Labs along with the UNIX operating system.
The first "real" standardization was with Brian Kernighan and Dennis Ritchie's book [6]. It is still widely regarded today as the only portable set of C instructions. The K&R book is known as the de-facto standard for learning C. There were different standards of C from ANSI to ISO, though ISO largely won out as a language specification. What we will be mainly focusing on is the POSIX C library, which extends ISO. Now to get the elephant out of the room, Linux fails to be POSIX compliant. Mostly, this is so because the Linux developers didn't want to pay the fee for compliance. It is also because they did not want to be fully compliant with a multitude of different standards, because that meant increased development costs to maintain compliance.
We will aim to use C99, as it is the standard that most computers recognize, but we will sometimes use some of the newer C11 features. We will also talk about some off-hand features like getline, because they are so widely used. We'll begin by providing a fairly comprehensive overview of the language and its facilities. Feel free to gloss over this if you have already worked with a C-based language.
Features
• Speed. There is little separating a program and the system.
• Simplicity. C and its standard library comprise a simple set of portable functions.
• Manual Memory Management. C gives a program the ability to manage its memory. However, this can be a
downside if a program has memory errors.
• Ubiquity. Through foreign function interfaces (FFI) and language bindings of various types, most other
languages can call C functions and vice versa. The standard library is also everywhere. C has stood the test
of time as a popular language, and it doesn’t look like it is going anywhere.
The canonical way to start learning C is with the hello world program. The original example that Kernighan and Ritchie proposed way back when hasn't changed.
#include <stdio.h>
int main(void) {
    printf("Hello World\n");
    return 0;
}
1. The #include directive takes the file stdio.h (which stands for standard input and output) located
somewhere in your operating system, copies the text, and substitutes it where the #include was.
2. The int main(void) is a function declaration. The first word int tells the compiler the return type of the function. The part before the parenthesis (main) is the function name. In C, no two functions can have the same name in a single compiled program, although shared libraries may be able to get around this. Then, the parameter list comes after. When we provide the parameter list for regular functions, (void) means that the compiler should produce an error if the function is called with a non-zero number of arguments. For regular functions, having a declaration like void func() means that the function can be called like func(1, 2, 3), because an empty parameter list leaves the arguments unspecified. main is a special function. There are many ways of declaring main, but the standard ones are int main(void), int main(), and int main(int argc, char *argv[]).
3. printf("Hello World\n"); is a function call. printf is defined as a part of stdio.h. The function has been compiled and lives somewhere else on our machine - the location of the C standard library. Just remember to include the header and call the function with the appropriate parameters (a string literal "Hello World\n"). If the newline isn't included, the buffer may not be flushed (i.e. the write may not complete immediately).
4. return 0. main has to return an integer. By convention, return 0 means success and anything else
means failure. Here are some exit codes / statuses with special meaning: http://tldp.org/LDP/abs/
html/exitcodes.html. In general, assume 0 means success.
1. gcc is short for the GNU Compiler Collection, which has a host of compilers ready for use. The compiler infers from the extension that you are trying to compile a .c file.
2. ./main tells your shell to execute the program in the current directory called main. The program then
prints out "hello world".
If systems programming was as easy as writing hello world though, our jobs would be much easier.
Preprocessor
What is the preprocessor? Preprocessing is a copy and paste operation that the compiler performs before actually compiling the program. The following is an example of substitution.
// Before preprocessing
#define MAX_LENGTH 10
char buffer[MAX_LENGTH];

// After preprocessing
char buffer[10];
There are side effects to the preprocessor though. One problem is that the preprocessor needs to be able to tokenize the code properly, meaning trying to redefine the internals of the C language with a preprocessor may be impossible. Another problem is that macros can't be expanded infinitely - there is a bounded depth where expansion needs to stop. Macros are also simple text substitutions, without semantics. For example, look at what can happen if a macro tries to perform an inline modification.
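As a sketch, consider a min macro written without parentheses (this is also the definition the expansion further below assumes):
#define min(X, Y) X < Y ? X : Y

int x = 4;
if (min(x++, 5)) // expands to: x++ < 5 ? x++ : 5, so x++ runs twice
    printf("%d\n", x);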
In this case, it is opaque what gets printed out, but it will be 6, because x++ is evaluated twice. Can you try to figure out why? Also, consider the edge case when operator precedence comes into play.
int x = 99;
int r = 10 + min(99, 100); // r is 100!
// This is what it is expanded to
int r = 10 + 99 < 100 ? 99 : 100;
// Which means
int r = (10 + 99) < 100 ? 99 : 100;
There are also logical problems with the flexibility of certain parameters. One common source of confusion is
with static arrays and the sizeof operator.
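The macro under discussion is presumably something along these lines (a sketch):
#define ARRAY_LENGTH(a) (sizeof(a) / sizeof((a)[0]))

int arr[10];
int *ptr = arr;
// ARRAY_LENGTH(arr) is 10, but ARRAY_LENGTH(ptr) is sizeof(int *) / sizeof(int)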
What is wrong with the macro? Well, it works if an array is passed in, because sizeof on a static array returns the number of bytes that array takes up, and dividing it by sizeof(an_element) gives the number of entries. But if passed a pointer to a piece of memory, taking the sizeof the pointer and dividing it by the size of the first entry won't always give us the size of the array.
// foo.h
int bar();

// foo.c, unpreprocessed
#include "foo.h"
int bar() {
    // ...
}

// foo.c, preprocessed
int bar();
int bar() {
    // ...
}

The preprocessor can also conditionally include code. Consider the following.
int main() {
#ifdef __GNUC__
return 1;
#else
return 0;
#endif
}
Using gcc, your compiler would preprocess the source to the following.
int main() {
    return 1;
}
Using a non-GNU compiler, it would instead be preprocessed to this.
int main() {
    return 0;
}
Language Facilities
Keywords
C has an assortment of keywords. Here are some constructs that you should know briefly as of C99.
1. break is a keyword that is used in case statements or looping statements. When used in a case statement,
the program jumps to the end of the block.
switch(1) {
    case 1: /* Goes to this case */
        puts("1");
        break; /* Jumps to the end of the block */
    case 2: /* Doesn't run this */
        puts("2");
        break;
} /* Continues here */
In the context of a loop, using it breaks out of the inner-most loop. The loop can be either a for, while, or do-while construct.
while(1) {
while(2) {
break; /* Breaks out of while(2) */
} /* Jumps here */
break; /* Breaks out of while(1) */
} /* Continues here */
2. const is a language level construct that tells the compiler that this data should remain constant. If one tries to change a const variable, the program will fail to compile. const works a little differently when put before the type: the compiler re-orders the type and const (const int * is the same as int const *). Then the compiler uses a left associativity rule, meaning that whatever is to the left of the star is what's constant. This is known as const-correctness. But, it is important to know that this is a compiler imposed restriction only. There are ways of getting around this, and the program will run fine with defined behavior. In systems programming, the only type of memory that you can't write to is system write-protected memory.
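As a quick illustration of the left-of-the-star rule:
const int *a;       // the pointed-to int is constant: *a can't be assigned
int const *b;       // same as above, after re-ordering
int *const c;       // the pointer itself is constant: c can't be reassigned
const int *const d; // both are constant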
3. continue is a control flow statement that exists only in loop constructions. continue will skip the rest of the loop body and jump back to the loop's condition check (in a for loop, the update expression runs first).
int i = 10;
while(i--) {
if(1) continue; /* This gets triggered */
*((int *)NULL) = 0;
} /* Then reaches the end of the while loop */
4. do {} while(); is another loop construct. These loops execute the body and then check the condition at
the bottom of the loop. If the condition is zero, the next statement is executed – the program counter is set
to the first instruction after the loop. Otherwise, the loop body is executed.
int i = 1;
do {
printf("%d\n", i--);
} while (i > 10); /* Only executed once */
5. enum declares an enumeration. An enumeration is a type that can take on one of finitely many values. If you have an enum and don't specify any numerics, the C compiler will generate a unique number for each enumerator (within the context of the current enum) and use that for comparisons. The syntax to declare an instance of an enum is enum <type> varname. The added benefit is that the compiler can type check these expressions to make sure that you are only comparing alike types.
It is completely possible to assign enum values to either be different or the same. If you assign numbers yourself, it is not advisable to rely on the compiler for consistent numbering. If you are going to use this abstraction, try not to break it.
enum day {
    monday = 0,
    tuesday = 0,
    wednesday = 0,
    thursday = 1,
    friday = 10,
    saturday = 10,
    sunday = 0
};
6. extern is a special keyword that tells the compiler that the variable may be defined in another object file or a library, so the program compiles even though the variable is missing here; the reference will be resolved to a variable in the system or another file at link time.
// file1.c
extern int panic;
void foo() {
if (panic) {
printf("NONONONONO");
} else {
printf("This is fine");
}
}
//file2.c
int panic = 1;
7. for is a keyword that allows you to iterate with an initialization statement, a loop condition, and an update expression. This is meant to be equivalent to a while loop, but with differing syntax.
// Typically
int i;
for (i = 0; i < 10; i++) {
//...
}
As of the C89 standard, one cannot declare variables inside the for loop initialization block. This is because there was a disagreement in the standard about how the scoping rules of a variable defined in the loop would work. It has since been resolved with more recent standards, so people can use the for loop that they know and love today.
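For example, since C99 the loop variable can live in the initializer:
for (int i = 0; i < 10; i++) {
    // i is scoped to the loop
}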
8. goto is a keyword that allows you to do conditional jumps. Do not use goto in your programs. The reason is that it makes your code infinitely harder to understand when strung together in multiple chains, which is called spaghetti code. It is acceptable to use in some contexts though, for example, error checking code in the Linux kernel. The keyword is usually used in kernel contexts when adding another stack frame for cleanup isn't a good idea. The canonical example of kernel cleanup is as below.
void setup(void) {
Doe *deer;
Ray *drop;
Mi *myself;
if (!setupdoe(deer)) {
goto finish;
}
if (!setupray(drop)) {
goto cleanupdoe;
}
if (!setupmi(myself)) {
goto cleanupray;
}
cleanupray:
cleanup(drop);
cleanupdoe:
cleanup(deer);
finish:
return;
}
9. if else else-if are control flow keywords. There are a few ways to use these: (1) a bare if, (2) an if with an else, (3) an if with an else-if, (4) an if with an else-if and an else. Note that an else is matched with the most recent if; a subtle bug related to a mismatched if and else statement is the dangling else problem. The conditions are evaluated in order from the if to the else. If any of the intermediate conditions is true, that block performs its action and control goes to the end of the whole chain.
// (1)
if (connect(...))
return -1;
// (2)
if (connect(...)) {
exit(-1);
} else {
printf("Connected!");
}
// (3)
if (connect(...)) {
exit(-1);
} else if (bind(..)) {
exit(-2);
}
// (4)
if (connect(...)) {
exit(-1);
} else if (bind(..)) {
exit(-2);
} else {
printf("Successfully bound!");
}
10. inline is a compiler keyword that tells the compiler it's okay to omit the C function call procedure and "paste" the function body directly into the calling function. Explicitly asking for this is not usually recommended, as the compiler is smart enough to know when to inline a function for you.
#include <stdio.h>

static inline int max(int a, int b) {
    return a < b ? b : a;
}

int main() {
    int a = 3, b = 4;
    printf("Max %d", max(a, b));
    // The compiler may expand the call to:
    // printf("Max %d", a < b ? b : a);
}
11. restrict is a keyword that tells the compiler that this particular memory region won't overlap with any other memory region passed to the function. The use case is to tell users of the function that it is undefined behavior if the memory regions overlap. Note that memcpy has undefined behavior when memory regions overlap. If this might be the case in your program, consider using memmove.
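The C99 prototype of memcpy shows it in action:
void *memcpy(void *restrict dest, const void *restrict src, size_t n);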
12. return is a keyword that exits the current function, optionally handing a value back to the caller.
int process() {
    if (connect(...)) {
        return -1;
    } else if (bind(...)) {
        return -2;
    }
    return 0;
}
13. signed is a modifier which is rarely used, but it forces a type to be signed instead of unsigned. The reason that this is so rarely used is because types are signed by default and need the unsigned modifier to be made unsigned. It may be useful in cases where you want the compiler to guarantee a signed type, such as below.
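For instance, whether a plain char is signed is implementation-defined, so one might write:
signed char delta = -1; // guaranteed to hold negative values on any platform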
14. sizeof is an operator that is evaluated at compile-time and evaluates to the number of bytes that the expression's type occupies. Because only the type matters, the operand is not evaluated, and the compiler changes the following code as shown.
char a = 0;
printf("%zu", sizeof(a++)); // a is not actually incremented
// The compiler treats this as
char a = 0;
printf("%zu", 1);
Which then the compiler is allowed to operate on further. The compiler must have a complete definition of
the type at compile-time - not link time - or else you may get an odd error. Consider the following
// file.c
struct person;
printf("%zu", sizeof(struct person)); // error: incomplete type

// file2.c
struct person {
    // Declarations
};
This code will not compile, because the compiler cannot compile file.c without knowing the full declaration of the person struct. That is typically why programmers either put the full declaration in a header file or abstract the creation and the interaction away so that users cannot access the internals of the struct.
Additionally, if the compiler knows the full length of an array object, it will use that in the expression instead
of having it decay into a pointer.
15. static is a keyword whose meaning depends on context:
(a) When used with a global variable or function declaration, it means that the scope of the variable or the function is limited to the file.
(b) When used with a function variable, it declares that the variable has static allocation – meaning that the variable is allocated once at program startup, not every time the function is called, and its lifetime is extended to that of the program.
char *print_time(void) {
    static char buffer[200]; // Shared across every call to the function
    // ...
}
16. struct is a keyword that allows you to pair multiple types together into a new structure. C-structs are
contiguous regions of memory that one can access specific elements of each memory as if they were separate
variables. Note that there might be padding between elements, such that each variable is memory-aligned
(starts at a memory address that is a multiple of its size).
struct hostname {
const char *port;
const char *name;
const char *resource;
}; // You need the semicolon at the end
// Assign each individually
struct hostname facebook;
facebook.port = "80";
facebook.name = "www.facebook.com";
facebook.resource = "/";
17. switch case default Switches are essentially glorified jump statements: you take either a byte or an integer, and the control flow of the program jumps to the matching label. Note that the various cases of a switch statement fall through. This means that if execution starts in one case, the flow of control will continue to all subsequent cases until a break statement.
switch(2) {
case 1: puts("1"); /* Doesn’t run this */
case 2: puts("2"); /* Runs this */
case 3: puts("3"); /* Also runs this */
}
One of the more famous examples of this is Duff’s device which allows for loop unrolling. You don’t need to
understand this code for the purposes of this class, but it is fun to look at [2].
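Here is the classic formulation (as published by Tom Duff; the destination is a memory-mapped register, so to is deliberately not incremented):
void send(short *to, short *from, int count) {
    int n = (count + 7) / 8;
    switch (count % 8) {
    case 0: do { *to = *from++;
    case 7:      *to = *from++;
    case 6:      *to = *from++;
    case 5:      *to = *from++;
    case 4:      *to = *from++;
    case 3:      *to = *from++;
    case 2:      *to = *from++;
    case 1:      *to = *from++;
               } while (--n > 0);
    }
}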
This piece of code highlights that switch statements are goto statements, and you can put any code on the
other end of a switch case. Most of the time it doesn’t make sense, some of the time it just makes too much
sense.
18. typedef declares an alias for a type. Often used with structs to reduce the visual clutter of having to write
‘struct’ as part of the type.
In this class, we regularly typedef functions. A typedef for a function can be this for example
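typedef int (*comparator)(void *, void *);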
This declares a function type comparator that accepts two void* params and returns an integer.
19. union is a new type specifier. A union is one piece of memory that many variables occupy; all members share the same bytes. It is used to maintain consistency while having the flexibility to switch between types without writing functions to keep track of the bits. Consider an example where we have different views of pixel values.
union pixel {
struct values {
char red;
char blue;
char green;
char alpha;
} values;
uint32_t encoded;
}; // Ending semicolon needed
union pixel a;
// When modifying or reading
a.values.red;
a.values.blue = 0x0;
20. unsigned is a type modifier that forces unsigned behavior in the variables they modify. Unsigned can only
be used with primitive int types (like int and long). There is a lot of behavior associated with unsigned
arithmetic. For the most part, unless your code involves bit shifting, it isn’t essential to know the difference
in behavior with regards to unsigned and signed arithmetic.
21. void is a double meaning keyword. When used in terms of function or parameter definition, it means
that the function explicitly returns no value or accepts no parameter, respectively. The following declares a
function that accepts no parameters and returns nothing.
void foo(void);
The other use of void is when you are declaring a pointer. A void * pointer is just a memory address. It is specified as an incomplete type, meaning that you cannot dereference it, but it can be converted at any time to any other pointer type. Pointer arithmetic with this pointer is undefined behavior.
void *void_ptr = malloc(10);
int *array = void_ptr; // No cast needed in C
22. volatile is a compiler keyword. It means that the compiler should not optimize away reads or writes of the variable's value. Consider the following simple snippet.
int flag = 1;
pass_flag(&flag);
while(flag) {
// Do things unrelated to flag
}
The compiler may, since the internals of the while loop have nothing to do with the flag, optimize it to the
following even though a function may alter the data.
while(1) {
// Do things unrelated to flag
}
If you use the volatile keyword, the compiler is forced to keep the variable and perform the check on every iteration. This is useful for cases where you are writing multi-process or multi-threaded programs, so that one sequence of execution can affect another.
23. while represents the traditional while loop. There is a condition at the top of the loop, which is checked
before every execution of the loop body. If the condition evaluates to a non-zero value, the loop body will
be run.
C data types
There are many data types in C. As you may realize, all of them are either integers or floating point numbers and
other types are variations of these.
1. char Represents exactly one byte of data. The number of bits in a byte might vary, though the rest of these types assume 8-bit bytes. unsigned char and signed char occupy exactly the same single byte of storage; they differ only in how the value is interpreted (and whether plain char is signed is implementation-defined). A char must be aligned on a byte boundary (meaning you cannot start one in between two addresses).
2. short (short int) must be at least two bytes. This is aligned on a two byte boundary, meaning that
the address must be divisible by two.
3. int must be at least two bytes. Again aligned to a two byte boundary [5, P. 34]. On most machines this will
be 4 bytes.
4. long (long int) must be at least four bytes, which are aligned to a four byte boundary. On some
machines this can be 8 bytes.
5. long long must be at least eight bytes, aligned to an eight byte boundary.
6. float represents an IEEE-754 single precision floating point number tightly specified by IEEE [1]. This will
be four bytes aligned to a four byte boundary on most machines.
7. double represents an IEEE-754 double precision floating point number specified by the same standard,
which is aligned to the nearest eight byte boundary.
If you want a fixed width integer type, for more portable code, you may use the types defined in stdint.h, which are of the form [u]intwidth_t, where the optional u makes the type unsigned, and width is any of 8, 16, 32, and 64.
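For example, a short sketch:
#include <stdint.h>
uint8_t byte = 255; // exactly 8 bits, unsigned, on every platform
int64_t big = -1;   // exactly 64 bits, signed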
Operators
Operators are language constructs in C that are defined as part of the grammar of the language. These operators
are listed in order of precedence.
• [] is the subscript operator. a[n] == *(a + n), where n is an integer type and a is a pointer type.
• -> is the structure dereference (or arrow) operator. If you have a pointer to a struct *p, you can use this to
access one of its elements. p->element.
• . is the structure reference operator. If you have an object a then you can access an element a.element.
• +a/-a are the unary plus and minus operators. They either keep or negate the sign, respectively, of the integer or float type underneath.
• *a is the dereference operator. If you have a pointer a, you can use this to access the element located at this memory address. If you are reading, you get a value of the underlying type stored at that address. If you are writing, sizeof(type) bytes are written to that address.
• &a is the address-of operator. This takes an element and returns its address.
• ++ is the increment operator. You can use it as a prefix or postfix, meaning that it can come either before or after the variable being incremented. The prefix form yields the value after the increment; the postfix form yields the value before it: a = 0; ++a === 1 and a = 0; a++ === 0.
• -- is the decrement operator. This has the same semantics as the increment operator, except that it decreases the value of the variable by one.
• sizeof is the sizeof operator, that is evaluated at the time of compilation. This is also mentioned in the
keywords section.
• a <mop> b where <mop> in {+, -, *, %, /} are the arithmetic binary operators. If the operands
are both number types, then the operations are plus, minus, times, modulo, and division respectively. If the
left operand is a pointer and the right operand is an integer type, then only plus or minus may be used and
the rules for pointer arithmetic are invoked.
• >>/<< are the bit shift operators. The operand on the right has to be an integer type; its signedness is ignored unless its value is negative, in which case the behavior is undefined. The operand on the left decides a lot of semantics. If we are left shifting, there will always be zeros introduced on the right. If we are right shifting, there are a few different cases
– If the operand on the left is signed, then the integer is sign-extended. This means that if the number
has the sign bit set, then any shift right will introduce ones on the left. If the number does not have
the sign bit set, any shift right will introduce zeros on the left.
– If the operand is unsigned, zeros will be introduced on the left either way.
Note that shifting by the word size (e.g. by 64 in a 64-bit architecture) results in undefined behavior.
• <=/>= are the less than or equal to/greater than or equal to relational operators. They work as their name implies.
• </> are the less than/greater than relational operators. They again do as the name implies.
• ==/!= are the equal/not equal to relational operators. They once again do as the name implies.
• && is the logical AND operator. If the first operand is zero, the second won't be evaluated, and the expression evaluates to 0. Otherwise, it yields 1 if the second operand is non-zero, else 0.
• || is the logical OR operator. If the first operand is non-zero, the second won't be evaluated, and the expression evaluates to 1. Otherwise, it yields 1 if the second operand is non-zero, else 0.
• ! is the logical NOT operator. If the operand is zero, then this will return 1. Otherwise, it will return 0.
• & is the bitwise AND operator. If a bit is set in both operands, it is set in the output. Otherwise, it is not.
• | is the bitwise OR operator. If a bit is set in either operand, it is set in the output. Otherwise, it is not.
• ~ is the bitwise NOT operator. If a bit is set in the input, it will not be set in the output, and vice versa.
• ?: is the ternary/conditional operator. You put a boolean condition before the ?, and if it evaluates to non-zero, the element before the colon is returned, otherwise the element after is: 1 ? a : b === a and 0 ? a : b === b.
• a, b is the comma operator. a is evaluated and then b is evaluated and b is returned. In a sequence of
multiple statements delimited by commas, all statements are evaluated from left to right, and the right-most
expression is returned.
Up until this point, we've covered C's language fundamentals. We'll now focus our attention on C and the POSIX variety of functions available to us to interact with the operating system. We will talk about portable functions, for example fwrite and printf, evaluating their internals and scrutinizing them under the POSIX model, and more specifically GNU/Linux. There are several parts of that philosophy that make the rest of this easier to know, so we'll put those things here.
Everything is a file
One POSIX mantra is that everything is a file. Although that has become somewhat outdated, and arguably wrong in places, it is the convention we still use today. What this statement means is that almost everything is accessed through a file descriptor, which is an integer. For example, a file object, a network socket, and even some kernel objects are all file descriptors. These are all references to records in the kernel's file descriptor table.
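As a sketch in code (the calls are standard POSIX/Linux; the specific objects are just examples):
#include <fcntl.h>
#include <sys/socket.h>
#include <sys/epoll.h>

int main(void) {
    int file_fd = open("file.txt", O_RDONLY);      // a file object
    int sock_fd = socket(AF_INET, SOCK_STREAM, 0); // a network socket
    int epoll_fd = epoll_create1(0);               // a kernel object (Linux)
    return 0;
}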
And operations on those objects are done through system calls. One last thing to note before we move on is
that the file descriptors are merely pointers. Imagine that each of the file descriptors in the example actually refers
to an entry in a table of objects that the operating system picks and chooses from (that is, the file descriptor table).
Objects can be allocated and deallocated, closed and opened, etc. The program interacts with these objects by
using the API specified through system calls, and library functions.
System Calls
Before we dive into common C functions, we need to know what a system call is. If you are a student and have
completed HW0, feel free to gloss over this section.
A system call is an operation that the kernel carries out. First, the program prepares the arguments for a system call. Next, the kernel executes the system call to the best of its ability in kernel space; executing a system call is a privileged operation.
In the previous example, we got access to a file descriptor object. We can now also write some bytes to the file
descriptor object that represents a file, and the operating system will do its best to get the bytes written to the disk.
When we say the kernel tries its best, this includes the possibility that the operation could fail for several
reasons. Some of them are: the file is no longer valid, the hard drive failed, the system was interrupted etc. The
way that a programmer communicates with the outside system is with system calls. An important thing to note is
that system calls are expensive. Their cost in terms of time and CPU cycles has recently been decreased, but try to
use them as sparingly as possible.
C System Calls
Many C functions that will be discussed in the next sections are abstractions that call the correct underlying
system call, based on the current platform. Their Windows implementation, for example, may be entirely different
from that of other operating systems. Nevertheless, we will be studying these in the context of their Linux
implementation.
Common C Functions
To find more information about any function, please use the man pages. Note that the man pages are organized into sections. Section 2 contains system calls. Section 3 contains C library functions. On the web, Google man 2 open. In the shell, man -S2 open or man -S3 printf.
Handling Errors
Before we get into the nitty gritty of all the functions, know that most functions in C handle errors in a return-oriented fashion. This is at odds with programming languages like C++ or Java, where errors are handled with exceptions. There are a number of arguments against exceptions; one is that exception oriented languages need to keep stack traces and maintain jump tables.
Whatever the pros and cons are, we use the former because of backwards compatibility with languages like FORTRAN [3, P. 84].
FORTRAN [3, P. 84]. Each thread will get a copy of errno because it is stored at the top of each thread’s stack –
more on threads later. One makes a call to a function that could return an error and if that function returns an
error according to the man pages, it is up to the programmer to check errno.
#include <errno.h>
int s = getnameinfo(...);
if (0 != s) {
fprintf(stderr, "getnameinfo: %s\n", gai_strerror(s));
}
Input / Output
In this section we will cover all the basic input and output functions in the standard library, with references to system calls. Every process has three streams of data when it starts execution: standard input (for program input), standard output (for program output), and standard error (for error and debug messages). Usually, standard input is sourced from the terminal in which the program is being run, and standard output is the same terminal. However, a programmer can use redirection such that their program can send output and/or receive input, to and from a file, or other programs.
They are designated by the file descriptors 0 and 1 respectively. 2 is reserved for standard error which by
library convention is unbuffered (i.e. IO operations are performed immediately).
Standard output or stdout oriented streams are streams whose only option is to write to stdout. printf is the function with which most people are familiar in this category. The first parameter is a format string that includes placeholders for the data to be printed. A common format specifier is the following
1. %s treat the argument as a C string pointer, and keep printing all characters until the NUL character is reached
For performance, printf buffers data until its buffer is full or a newline is printed. Streams can be buffered in one of three ways.
• Unbuffered, where the contents of the stream reach their destination as soon as possible.
• Line Buffered, where the contents of the stream reach their destination as soon as a newline is provided.
• Fully Buffered, where the contents of the stream reach their destination as soon as the buffer is full.
Standard Error is defined as “not fully buffered” [5, P. 279]. Standard Output and Input are merely defined to
be fully buffered if and only if the stream destination is not an interactive device. Usually, standard error will be
unbuffered, standard input and output will be line buffered if the output is a terminal otherwise fully buffered.
This relates to printf because printf merely uses the abstraction provided by the FILE interface and uses the above
semantics to determine when to write. One can force a write by calling fflush() on the stream.
To print strings and single characters, use puts(const char *s) and putchar(int c).
Other streams
To print to other file streams, use fprintf( _file_ , "Hello %s, score: %d", name, score); Where
_file_ is either predefined (‘stdout’ or ‘stderr’) or a FILE pointer that was returned by fopen or fdopen. There is a
printf equivalent that works with file descriptors, called dprintf. Just use dprintf(int fd, char* format_string, ...);.
To print data into a C string, use sprintf or, better, snprintf. snprintf returns the number of characters written, excluding the terminating byte. We would use sprintf only when we know the size of the printed string is less than the provided buffer – think about printing an integer; it will never be more than 11 characters with the NUL byte. If printf is dealing with input of unknown length, it is safer to use snprintf, as shown in the following snippet.
// Fixed
char int_string[20];
sprintf(int_string, "%d", integer);
// Variable length
char result[200];
int len = snprintf(result, sizeof(result), "%s:%d", name, score);
Note that, unlike gets, fgets copies the newline into the buffer. On the other hand, one of the advantages of getline is that it will automatically allocate and reallocate a buffer on the heap of sufficient size.
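A minimal getline sketch; the function allocates and grows the buffer for us, so we free it once at the end:
#include <stdio.h>
#include <stdlib.h>

int main(void) {
    char *line = NULL;
    size_t capacity = 0;
    while (getline(&line, &capacity, stdin) != -1) {
        printf("%s", line); // the newline is kept in the buffer
    }
    free(line);
    return 0;
}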
In addition to those functions, we have perror. Let's say that a function call failed using the errno convention. perror(const char *message) will print your message, followed by the English version of the error, to stderr.
#include <fcntl.h>
#include <stdio.h>

int main() {
    int ret = open("IDoNotExist.txt", O_RDONLY);
    if (ret < 0) {
        perror("Opening IDoNotExist"); // perror appends ": <error>" itself
    }
    //...
    return 0;
}
To have a library function parse input in addition to reading it, use scanf (or fscanf or sscanf) to get
input from the default input stream, an arbitrary file stream or a C string, respectively. All of those functions will
return how many items were parsed. It is a good idea to check if the number is equal to the amount expected.
Also, naturally, like printf, the scanf functions require valid pointers. In addition to pointing to valid memory, the memory needs to be writable. It's a common source of error to pass in an incorrect pointer value. For example,
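consider the following sketch:
char c;
int *data = malloc(sizeof(int));
sscanf("a 10", "%c %d", &c, &data); // Bug: should be data, not &data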
We wanted to write the character value into c and the integer value into the malloc’d memory. However, we
passed the address of the data pointer, not what the pointer is pointing to! So sscanf will change the pointer
itself. The pointer will now point to address 10 so this code will later fail when free(data) is called.
Now, scanf will keep reading characters until the string ends. To stop scanf from causing a buffer overflow, use a width in the format specifier. Make sure the width is one less than the size of the buffer.
char buffer[10];
scanf("%9s", buffer); // reads at most 9 characters, leaving room for the terminating byte
One last thing to note is if system calls are expensive, the scanf family is much more expensive due to
compatibility reasons. Since it needs to be able to process all of the printf specifiers correctly, the code isn’t
efficient TODO: citation needed. For highly performant programs, one should write the parsing themselves. If it is
a one-off program or script, feel free to use scanf.
string.h
The string.h functions deal with manipulating and checking pieces of memory. Most of them deal with C-strings. A C-string is a series of bytes delimited by a NUL character, which is equal to the byte 0x00. More information about all of these functions can be found in the man pages. Any behavior missing from the documentation, such as the result of strlen(NULL), is considered undefined behavior.
• int strlen(const char *s) returns the length of the string.
• int strcmp(const char *s1, const char *s2) returns an integer determining the lexicographic order of the strings. If s1 were to come before s2 in a dictionary, then a negative value is returned. If the two strings are equal, then 0. Else, a positive value.
• char *strcpy(char *dest, const char *src) Copies the string at src to dest. This function
assumes dest has enough space for src otherwise undefined behavior
• char *strcat(char *dest, const char *src) Concatenates the string at src to the end of desti-
nation. This function assumes that there is enough space for src at the end of destination including
the NUL byte
• char *strchr(const char *haystack, int needle) Returns a pointer to the first occurrence of
needle in the haystack. If none found, NULL is returned.
• char *strstr(const char *haystack, const char *needle) Same as above but this time a
string!
#include <stdio.h>
#include <string.h>
int main(){
char* upped = strdup("strtok,is,tricky,!!");
char* start = strtok(upped, ",");
do{
printf("%s\n", start);
}while((start = strtok(NULL, ",")));
return 0;
}
Output
strtok
is
tricky
!!
Why is it tricky? Well what happens when upped is changed to the following?
char* upped = strdup("strtok,is,tricky,,,!!");
• For integer parsing use long int strtol(const char *nptr, char **endptr, int base); or
long long int strtoll(const char *nptr, char **endptr, int base);.
What these functions do is take the pointer to your string *nptr and a base (i.e. binary, octal, decimal,
hexadecimal etc) and an optional pointer endptr and returns a parsed value.
int main(){
const char *nptr = "1A2436";
char* endptr;
long int result = strtol(nptr, &endptr, 16);
return 0;
}
Be careful though! Error handling is tricky, because the function won't return an error code. If passed an invalid number string, it will return 0, so the caller has to distinguish a valid 0 from an error. This often involves an errno trampoline, as shown below.
int main(){
const char *input = "0"; // or "!##@" or ""
char* endptr;
int saved_errno = errno;
errno = 0;
long int parsed = strtol(input, &endptr, 10);
if(parsed == 0 && errno != 0){
// Definitely an error
}
errno = saved_errno;
return 0;
}
• void *memcpy(void *dest, const void *src, size_t n) copies n bytes starting at src to dest. Be careful: there is undefined behavior when the memory regions overlap. This is one of the classic "This works on my machine!" examples, because many times Valgrind won't be able to pick it up, since it will look like it works on your machine. Consider the safer version, memmove.
• void *memmove(void *dest, const void *src, size_t n) does the same thing as above, but if the memory regions overlap, then it is guaranteed that all the bytes will get copied over correctly. Note that memcpy and memmove are both in string.h, even though they operate on arbitrary memory.
C Memory Model
The C memory model is probably unlike most that you’ve seen before. Instead of allocating an object with type
safety, we either use an automatic variable or request a sequence of bytes with malloc or another family member
and later we free it.
Structs
In low-level terms, a struct is a piece of contiguous memory, nothing more. Just like an array, a struct has enough
space to keep all of its members. But unlike an array, it can store different types. Consider the contact struct
declared below.
struct contact {
char firstname[20];
char lastname[20];
unsigned int phone;
};
We will often use the following typedef, so we can use the struct name as the full type.
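typedef struct contact contact;
contact bhuvan;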
If you compile the code without any optimizations and reordering, you can expect the addresses of each of the
variables to look like this.
&bhuvan // 0x100
&bhuvan.firstname // 0x100 = 0x100+0x00
&bhuvan.lastname // 0x114 = 0x100+0x14
&bhuvan.phone // 0x128 = 0x100+0x28
All your compiler does is say "reserve this much space". Whenever a read or write occurs in the code, the compiler calculates the offset of the variable. The offset is where the variable starts. The phone variable starts at byte 0x128 and continues for sizeof(unsigned int) bytes with this compiler. Offsets don't determine where the variable ends though. Consider the following hack seen in a lot of kernel code.
typedef struct {
int length;
char c_str[0];
} string;
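The discussion below assumes an allocation along these lines (a sketch): one malloc call reserves space for the struct header plus the characters of the string.
const char *to_convert = "bhuvan";
int length = strlen(to_convert);
string *bhuvan_name = malloc(sizeof(string) + length + 1);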
Currently, our memory is a row of empty boxes, one per allocated byte, with nothing in them yet. So what happens when we assign length? The first four boxes are filled with the value of the variable length. The rest of the space is left untouched. We will assume that our machine is big endian, meaning that the least significant byte comes last.
bhuvan_name->length = length;
Now, we can write a string to the end of our struct with the following call.
strcpy(bhuvan_name->c_str, to_convert);
Figure 3.3: Struct pointing to 11 boxes, 4 filled with 0006, 7 with the string "bhuvan"
We can even do a sanity check to make sure that the strings are equal.
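assert(strcmp(bhuvan_name->c_str, to_convert) == 0); // a sketch; include <assert.h>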
What that zero length array does is mark the end of the struct: the compiler leaves room for all of the other elements, calculated with respect to their sizes on that platform (ints, chars, etc.), while the zero length array itself takes up no bytes of space. Since structs are contiguous pieces of memory, we can allocate more space than required and use the extra space as a place to store extra bytes. Although this seems like a parlor trick, it is an important optimization, because to have a variable length string any other way, one would need two different memory allocation calls. That is highly inefficient for something as common in programming as string manipulation.
Extra: Struct packing
Structs may require something called padding. We do not expect you to pack structs in this course, but know that compilers perform padding. This is because, in the early days (and even now), loading an address in memory happens in 32-bit or 64-bit blocks. This also meant that requested addresses had to be multiples of block sizes.
struct picture {
    int height;
    pixel **data;
    int width;
    char *encoding;
};
You think the picture looks like this. One box is four bytes.
h data w encoding
In reality, on a 64-bit machine the compiler inserts padding ("slop") so that each pointer starts on an 8-byte boundary, as if you had written this.
struct picture {
    int height;
    char slop1[4];
    pixel **data;
    int width;
    char slop2[4];
    char *encoding;
};
h ? data w ? encoding
This padding is common on a 64-bit system. Other times, a processor supports unaligned access, leaving the compiler able to pack structs. What does this mean? A variable can start at a non-64-bit boundary, and the processor will figure out the rest. To enable this, set an attribute.
struct __attribute__((packed)) picture {
    int height;
    int width;
    pixel **data;
    char *encoding;
};
Strings in C
In C, we have null terminated strings rather than length prefixed strings, for historical reasons. For everyday programmers: remember to NUL terminate your strings! A string in C is defined as a bunch of bytes ended by '\0', the NUL byte. A string literal is naturally constant; any write to it will cause the operating system to produce a SEGFAULT.
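For example, with two pointers to identical literals (a sketch):
char *str1 = "hello";
char *str2 = "hello";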
The strings pointed to by str1 and str2 may actually reside in the same location in memory.
Char arrays, however, contain a copy of the literal value, copied from the code segment into either the stack or static memory; two such arrays reside in different memory locations. Here are some common ways to initialize a string. Where do they reside in memory?
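A few possibilities (a sketch; strdup allocates a copy on the heap):
char *ptr = "ABC";            // points at a read-only literal
char ary[] = "ABC";           // a mutable copy on the stack
char *heaped = strdup("ABC"); // a mutable copy on the heap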
We can also easily print out the pointer and the contents of a C-string. Here is some boilerplate code to illustrate this.
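printf("%p : %s\n", (void *) ptr, ptr); // the address, then the contents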
As mentioned before, the char array is mutable, so we can change its contents. Be careful to write within
the bounds of the array. C does not do bounds checking at compile-time, but invalid reads/writes can get your
program to crash.
strcpy(ary, "World"); // OK
strcpy(ptr, "World"); // NOT OK - segmentation fault (crashes by default, unless SIGSEGV is blocked)
Unlike the array, however, we can change ptr to point to another piece of memory (ptr = "World"; is legal even though strcpy(ptr, "World") is not).
Pointers may hold addresses of variables on the heap or the stack, but a pointer assigned a string literal points to read-only memory located in the data section of the program. This means that pointers are more flexible than arrays, even though the name of an array acts as a pointer to its starting address.
In a more common case, pointers will point to heap memory in which case the memory referred to by the
pointer can be modified.
Pointers
Pointers are variables that hold addresses. These addresses have a numeric value, but usually, programmers are
interested in the value of the contents at that memory address. In this section, we will try to take you through a
basic introduction to pointers.
Pointer Basics
Declaring a Pointer
A pointer refers to a memory address. The type of the pointer is useful – it tells the compiler how many bytes
need to be read/written and delineates the semantics for pointer arithmetic (addition and subtraction).
int *ptr1;
char *ptr2;
Due to C’s syntax, an int* or any pointer type is not actually its own type in a declaration. You have to precede each pointer variable with an asterisk. As a common gotcha, the following declares only one pointer, as sketched below.
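A sketch of the gotcha (names hypothetical):

int* ptr3, ptr4;  // ptr3 is an int*, but ptr4 is a plain int!
int *ptr5, *ptr6; // both are pointers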
Keep this in mind for structs as well. If one declares a struct pointer without a typedef, the asterisk still goes before each variable name (struct User *user;).
Let’s say that int *ptr was declared. For the sake of discussion, let us assume that ptr contains the memory
address 0x1000. To write to the pointer, it must be dereferenced and assigned a value.
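The elided write presumably looked like this one-liner:

*ptr = 0; // writes sizeof(int) zero bytes starting at address 0x1000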
What C does is take the type of the pointer which is an int and write sizeof(int) bytes from the start of
the pointer, meaning that bytes 0x1000, 0x1001, 0x1002, 0x1003 will all be zero. The number of bytes written
depends on the pointer type. It is the same for all primitive types but structs are a little different.
Reading works roughly the same way, except the dereferenced pointer appears wherever a value is needed.
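For example, a sketch:

int value = *ptr; // reads sizeof(int) bytes from the pointed-to address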
Reading and writing to non-primitive types gets tricky. The compilation unit - usually the file or a header -
needs to have the size of the data structure readily available. This means that opaque data structures can’t be
copied. Here is an example of assigning a struct pointer:
#include <stdio.h>
typedef struct {
int a1;
int a2;
} pair;
int main() {
pair obj;
pair zeros;
zeros.a1 = 0;
zeros.a2 = 0;
pair *ptr = &obj;
obj.a1 = 1;
obj.a2 = 2;
*ptr = zeros;
printf("a1: %d, a2: %d\n", ptr->a1, ptr->a2);
return 0;
}
As for reading structure pointers, don’t do it directly. Instead, programmers create abstractions for creating,
copying, and destroying structs. If this sounds familiar, it is what C++ originally intended to do before the
standards committee went off the deep end.
Pointer Arithmetic
Like integers, pointers can be added to. However, the pointer’s type determines how far the pointer actually moves: a pointer is moved over by the value added times the size of the underlying type. For char pointers, this is trivial because characters are always one byte.
If an int is 4 bytes then ptr+1 points to 4 bytes after whatever ptr is pointing at.
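The example the next paragraph dissects is elided; it likely resembled this sketch (the name bna comes from the surrounding text, and we ignore alignment and strict-aliasing rules for illustration):

#include <stdio.h>

int main() {
    char buffer[] = "ABCDEFGH";
    int *bna = (int *) buffer;
    bna += 1; // advances by sizeof(int) == 4 bytes, not 1
    char *ch = (char *) bna;
    printf("%s\n", ch); // prints "EFGH"
    return 0;
}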
Notice how only ’EFGH’ is printed. Why is that? As mentioned above, when performing bna += 1 we are increasing the integer pointer by 1, which translates to 4 bytes on most systems, which is equivalent to 4 characters (each character is only 1 byte). Because pointer arithmetic in C is always automatically scaled by the size of the type that is pointed to, the ISO C standard forbids arithmetic on void pointers. Having said that, compilers will often treat the underlying type as char. Here is the mechanical translation; the following two pointer arithmetic operations are equal:
int *ptr1 = ...;
// 1
int *offset = ptr1 + 4;
// 2
char *temp_ptr1 = (char*) ptr1;
int *offset2 = (int*)(temp_ptr1 + sizeof(int)*4); // same address as offset
Every time you do pointer arithmetic, take a deep breath and make sure that you are shifting over the
number of bytes you think you are shifting over.
C automatically converts void* to the appropriate pointer type on assignment. gcc and clang, however, are not totally ISO C compliant, meaning that they will permit arithmetic on a void pointer, treating it as a char pointer. Do not do this, because it is not portable – it is not guaranteed to work with all compilers!
Common Bugs
Nul Bytes
What’s wrong with this code?
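The buggy snippet is elided; based on the description that follows, it presumably looked something like this sketch:

void mystrcpy(char *dest, char *src) {
    while (*src) { dest = src; src++; dest++; } // assigns the pointer, not the byte
}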
In the above code, the loop body simply changes the dest pointer to point into the source string rather than copying any bytes, and the NUL byte is never copied. Here is a better version:
while( *src ) {*dest = *src; src ++; dest++; }
*dest = *src;
Note that it is also common to see the following kind of implementation, which does everything inside the expression test, including copying the NUL byte. However, many consider this bad style because it performs multiple operations in the same line:
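A sketch of that idiom:

while (*dest++ = *src++); // copies every byte, including the final NUL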
Double Frees
A double free error occurs when a program accidentally attempts to free the same allocation twice.
int *p = malloc(sizeof(int));
free(p);
free(p); // Double free!
The fix is first to write correct programs! Second, it is a good habit to set pointers to NULL once the memory has been freed (free(NULL) is guaranteed to do nothing). This ensures that the pointer cannot be used incorrectly without the program crashing.
int *f() {
int result = 42;
static int imok;
return &imok; // OK - static variables are not on the stack
return &result; // Not OK
}
Automatic variables are bound to stack memory only for the lifetime of the function. After the function returns,
it is an error to continue to use the memory.
struct User {
    char name[100];
};
typedef struct User user_t;

user_t *user = (user_t *) malloc(sizeof(user)); // Bug: the size of a pointer, not the struct (the elided line presumably looked like this)
In the above example, we needed to allocate enough bytes for the struct. Instead, we allocated enough bytes
to hold a pointer. Once we start using the user pointer we will corrupt memory. The correct code is shown below.
struct User {
    char name[100];
};
typedef struct User user_t;

user_t *user = (user_t *) malloc(sizeof(user_t)); // or sizeof(*user)
#define N (10)
int i = N, array[N];
for( ; i >= 0; i--) array[i] = i;
C fails to check if pointers are valid. The above example writes into array[10] which is outside the array
bounds. This can cause memory corruption because that memory location is probably being used for something
else. In practice, this can be harder to spot because the overflow/underflow may occur in a library call. Here is
our old friend gets.
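gets has no way to limit how much it reads, so any long line of input overflows the buffer (the function was removed in C11 for exactly this reason); a sketch:

char buf[8];
gets(buf); // input longer than 7 characters overflows buf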
int myfunction() {
    int x;
    int y = x + 2;
    ...
}
Automatic variables hold garbage – whatever bit pattern happened to be in memory or in a register. It is an error to assume that they will always be initialized to zero.
void myfunct() {
    char array[10];
    char *p = malloc(10);
}
Automatic (temporary variables) and heap allocations may contain random bytes or garbage.
Logic and Program flow mistakes
These are a set of mistakes that may let the program compile but perform unintended functionality.
More confusingly, if we forget an equals sign in the equality operator, we will end up assigning to that variable. Most of the time this isn’t what we want to do; see the sketch below.
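A hedged example of the mistake:

int x = 0;
if (x = 3) { // assigns 3 to x and tests the result: always true here
    // this branch always runs
}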
The quick way to avoid this is to get in the habit of putting constants first, so that if (3 == x) fails to compile when an equals sign is dropped. The mistake is common enough in while loop conditions that most modern compilers warn about assigning a variable in a condition unless the assignment is wrapped in an extra set of parentheses.
The piece of code below calls getline and assigns its return value – the number of bytes read – to nread; in the same expression it also checks whether that value is -1 and, if so, terminates the loop. It is always good practice to put parentheses around any assignment used as a condition.
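A sketch of the described loop (variable names assumed):

char *line = NULL;
size_t len = 0;
ssize_t nread;
while ((nread = getline(&line, &len, stdin)) != -1) {
    // process nread bytes in line
}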
Extra Semicolons
This is a pretty simple one: don’t put semicolons where they aren’t needed, as in the sketch below.
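For instance, a stray semicolon after a loop header silently becomes the loop body:

for (int i = 0; i < 5; i++); {
    printf("This prints once, not five times!\n");
}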
Extra semicolons between statements are harmless, because the C language uses semicolons (;) to separate statements: if there is no statement in between semicolons, there is nothing to do and the compiler moves on to the next statement. After an if or a loop header, however, the empty statement silently becomes the body. To save a lot of confusion, always use braces. It increases the number of lines of code, which is a great productivity metric.
Topics
• C-strings representation
• C-strings as pointers
• sizeof char
• sizeof(x) vs sizeof(x*)
• Dereferencing pointers
• Address-of operator
• Pointer arithmetic
• String duplication
• String truncation
• double-free error
• String literals
• Print formatting.
• static memory
• Buffering of stdout
Questions/Exercises
• What does the following program print, and why might the order change when stdout is redirected to a file? (Recall the buffering of stdout.)
int main(){
fprintf(stderr, "Hello ");
fprintf(stdout, "It’s a small ");
fprintf(stderr, "World\n");
fprintf(stdout, "place\n");
return 0;
}
• What are the differences between the following two declarations? What does sizeof return for one of them?
int *ptr;
sizeof(ptr);
sizeof(*ptr);
• Code up a simple my_strcmp. How about my_strcat, my_strcpy, or my_strdup? Bonus: Code the functions while only going through the strings once.
• What is malloc? How is it different from calloc? Once memory is allocated, how can we use realloc?
• Pointer Arithmetic. Assume the following addresses. What are the following shifts?
– ptr + 2
– ptr + 4
– ptr[0] + 4
– ptr[1] + 2000
– *((int *)(ptr + 1)) + 3
• Write a function that accepts a path as a string, opens that file, and prints the file contents 40 bytes at a time, but with every other print reversing the string (try using the POSIX API for this).
• What are some differences between the POSIX file descriptor model and C’s FILE* (i.e. what function calls
are used and which is buffered)? Does POSIX use C’s FILE* internally or vice versa?
Rapid Fire: Pointer Arithmetic
Pointer arithmetic is important! Take a deep breath and figure out how many bytes each operation moves a
pointer. The following is a rapid fire section. We’ll use the following definitions:
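Judging by the names used below, the (elided) definitions were presumably pointers like these; the answers further down assume int is 4 bytes, short is 2, and long is 8:

int *int_;
long *long_;
short *short_;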
How many bytes are moved over from the following additions?
1. int_ + 1
2. long_ + 7
3. short_ - 6
4. short_ - sizeof(long)
Answers:
1. 4
2. 56
3. -12
4. -16
5. 0
6. -16
7. 72
Bibliography
[1] IEEE Standard for Floating-Point Arithmetic. IEEE Std 754-2008, pages 1–70, Aug 2008. doi: 10.1109/IEEESTD.2008.4610935.
[2] Tom Duff. Tom Duff on Duff’s Device. URL https://www.lysator.liu.se/c/duffs-device.html.
[3] FORTRAN IV Programmer’s Reference Manual. Digital Equipment Corporation, Maynard, Massachusetts, May 1972. URL http://www.bitsavers.org/www.computer.museum.uq.edu.au/pdf/DEC-10-AFDO-D%20decsystem10%20FORTRAN%20IV%20Programmer%27s%20Reference%20Manual.pdf.
[5] ISO/IEC 9899 (N1124). ISO C Standard. International Organization for Standardization, Geneva, CH, March 2005. URL http://www.open-std.org/jtc1/sc22/wg14/www/docs/n1124.pdf.
[6] B.W. Kernighan and D.M. Ritchie. The C Programming Language. Prentice-Hall software series. Prentice Hall, 1988. ISBN 9780131103627. URL https://books.google.com/books?id=161QAAAAMAAJ.
[7] Robert Love. Linux Kernel Development. Addison-Wesley Professional, 3rd edition, 2010. ISBN 9780672329463.
[8] Dennis M. Ritchie. The Development of the C Language. SIGPLAN Not., 28(3):201–208, March 1993. ISSN 0362-1340. doi: 10.1145/155360.155580. URL http://doi.acm.org/10.1145/155360.155580.
Processes
4
Who needs process isolation?
Intel Marketing on Meltdown and Spectre
To understand what a process is, you need to understand what an operating system is. An operating system
is a program that provides an interface between hardware and user software as well as providing a set of tools
that the software can use. The operating system manages hardware and gives user programs a uniform way
of interacting with hardware as long as the operating system can be installed on that hardware. Although this
idea sounds like it is the end-all, we know that there are many different operating systems with their own quirks
and standards. As a solution to that, there is another layer of abstraction: POSIX, the Portable Operating System Interface. This is a standard (or a family of standards now) that an operating system must implement to be POSIX compatible – most systems that we’ll be studying are only almost POSIX compatible, due more to political reasons than technical ones.
Before we talk about POSIX systems, we should understand the general idea of a kernel. In an operating system (OS), there are two spaces: kernel space and user space. Kernel space is a privileged operating mode that allows the system to interact with the hardware and has the potential to destroy your machine. User space is where most applications run, because they don’t need this level of power for every operation. When a user space program needs additional power, it interacts with the hardware through a system call that is conducted by the kernel. This adds a layer of security so that normal user programs can’t destroy your entire operating system.
For the purposes of our class, we’ll talk about single-machine, multiple-user operating systems, where there is a central clock on a standard laptop or desktop. Other OSes relax the central-clock requirement (distributed systems) or the “standardness” of the hardware (embedded systems); still others add invariants that make sure events happen at particular times (real-time systems).
The operating system is made up of many different pieces. There may be a program running to handle
incoming USB connections, another one to stay connected to the network, etc. The most important one is the
kernel – although it might be a set of processes – which is the heart of the operating system. The kernel has many
important tasks. The first of which is booting.
1. The computer hardware executes code from read-only memory, called firmware.
2. The firmware executes a bootloader, which often conforms to the Extensible Firmware Interface (EFI),
which is an interface between the system firmware and the operating system.
3. The bootloader’s boot manager loads the operating system kernels, based on the boot settings.
5. The kernel executes startup scripts like starting networking and USB handling.
6. The kernel executes userland scripts like starting a desktop, and you get to use your computer!
When a program is executing in user space, the kernel provides some important services to it:
• Managing virtual memory and low-level binary devices such as USB drivers
• Managing filesystems
The kernel creates the first process, init (a common alternative is systemd). init boots up programs such as graphical user interfaces, terminals, etc. – by default, this is the only process explicitly created by the system. All other processes are instantiated by using the system calls fork and exec, starting from that single process.
File Descriptors
Although these were mentioned in the last chapter, we are going to give a quick reminder about file descriptors. A
zine from Julia Evans gives some more details [8].
The kernel keeps track of each process’ file descriptors and what they point to. Later we will see that file descriptors point to more than just files.
Notice that file descriptor numbers may be reused between processes, but inside of a process they are unique. File descriptors may have a notion of position; these are known as seekable streams. A program can read a file on disk completely because the OS keeps track of the position in the file, and that position belongs to your process as well. Other file descriptors point to network sockets and various other sources of data that are unseekable streams.
Processes
A process is an instance of a computer program that may be running. Processes have many resources at their disposal. When a program starts, it gets one process, but a program can create more processes. A program consists of the following:
• A binary format: This tells the operating system about the various sections of bits in the binary – which
parts are executable, which parts are constants, which libraries to include etc.
• Constants
int secrets;
int main() {
    secrets++;
    printf("%d\n", secrets);
    return 0;
}
If this program were run in two different terminals, both would print out 1, not 2. Even if we changed the code to attempt to affect other process instances, there would be no way to change another process’ state unintentionally. However, there are intentional ways to change the program states of other processes.
Process Contents
Memory Layout
When a process starts, it gets its own address space. Each process gets the following.
• A Stack. The stack is the place where automatically allocated variables and function call return addresses
are stored. Every time a new variable is declared, the program moves the stack pointer down to reserve
space for the variable. This segment of the stack is writable but not executable. This behavior is controlled
by the no-execute (NX) bit, sometimes called the W^X (write XOR execute) bit, which helps prevent malicious
code, such as shellcode from being run on the stack.
If the stack grows too far – meaning that it either grows beyond a preset boundary or intersects the heap – the program will encounter a stack overflow error, most likely resulting in a SEGFAULT. The stack is statically sized by default; there is only a certain amount of space to which one can write.
• A Heap. The heap is a contiguous, expanding region of memory [5]. If a program wants to allocate an
object whose lifetime is manually controlled or whose size cannot be determined at compile-time, it would
want to create a heap variable.
The heap starts at the end of the data segment and grows upward, meaning malloc may push the heap boundary – called the program break – upward.
We will explore this in more depth in our chapter on memory allocation. This area is also writable but
not executable. One can run out of heap memory if the system is constrained or if a program runs out of addresses, a phenomenon that is more common on a 32-bit system.
• A Data Segment
This segment contains two parts, an initialized data segment, and an uninitialized segment. Furthermore,
the initialized data segment is divided into a readable and writable section.
– Initialized Data Segment This contains all of a program’s globals and any other static variables with initializers. This section starts at the end of the text segment and is constant in size because the number of globals is known at compile time. The end of the data segment is called the program break and can be extended via the use of brk / sbrk.
This section is writable [10, P. 124]. Most notably, this section contains variables that were initialized
with a static initializer, as follows:
int global = 1;
– Uninitialized Data Segment / BSS BSS stands for an old assembler operator known as Block Started
by Symbol.
This contains all of your globals and any other static duration variables that are implicitly zeroed out.
Example:
int assumed_to_be_zero;
This variable will be zeroed; otherwise, we would have a security risk involving isolation from other
processes.
They get put in a different section to speed up process start up time.
This section starts at the end of the data segment and is also static in size because the amount of
globals is known at compile time.
Currently, both the initialized and BSS data segments are combined and referred to as the data segment
[10, P. 124], despite being somewhat different in purpose.
• A Text Segment. This is where all executable instructions are stored; it is readable (think function pointers) but not writable.
The program counter moves through this segment executing instructions one after the other.
It is important to note that this is the only executable section of the program, by default.
If a program writes to its own code while it’s running, the program most likely will SEGFAULT.
There are ways around it, but we will not be exploring these in this course.
Why doesn’t the address space always start at zero? This is because of a security feature called address space layout randomization (ASLR).
The reasons for and explanation of this are outside the scope of this class, but it is good to know that it exists. Having said that, this address can be made constant if a program is compiled with the DEBUG flag.
(Figure: the address space from top to bottom – Stack, Heap, Data, Text.)
Other Contents
To keep track of all these processes, your operating system gives each process a number called the process ID
(PID). Processes are also given the PID of their parent process, called the parent process ID (PPID). Every process has a parent; that parent could be init.
Processes could also contain the following information:
• Running State - Whether a process is getting ready, running, stopped, terminated, etc. (more on this is
covered in the chapter on Scheduling).
• File Descriptors - A list of mappings from integers to real devices (files, USB flash drives, sockets)
• Permissions - What user the process is running as and what group the process belongs to. The process can
then only perform operations based on the permissions given to the user or group, such as accessing files.
There are tricks to make a program run as a different user than the one who started it, e.g. sudo takes a program that a user starts and executes it as root. More specifically, a process has a real user ID (identifying the owner of the process), an effective user ID (used for non-privileged users trying to access files only accessible by superusers), and a saved user ID (used when privileged users perform non-privileged actions).
• Arguments - a list of strings that tell your program what parameters to run under.
• Environment Variables - a list of key-value pair strings in the form NAME=VALUE that one can modify. These
are often used to specify paths to libraries and binaries, program configuration settings, etc.
According to the POSIX specification, a process only needs a thread and an address space, but most kernel developers and users know that these alone aren’t enough [6].
Intro to Fork
A word of warning
Process forking is a powerful and dangerous tool. If you make a mistake resulting in a fork bomb, you can bring
down an entire system. To reduce the chances of this, limit your maximum number of processes to a small
number, e.g. 40, by typing ulimit -u 40 into a command line. Note that this limit applies per user, which means that if you fork bomb, you won’t be able to kill all the created processes, since calling killall requires your shell to fork(). Quite unfortunate. One solution is to spawn another shell instance as another user (for example root) beforehand and kill processes from there. Another is to use the shell’s built-in exec command to replace your shell with a kill command – since exec doesn’t fork, you only have one attempt at this. Finally, you could reboot the system.
When testing fork() code, ensure that you have either root and/or physical access to the machine involved. If
you must work on fork() code remotely, remember that kill -9 -1 will save you in the event of an emergency. Fork
can be extremely dangerous if you aren’t prepared for it. You have been warned.
Fork Functionality
The fork system call clones the current process to create a new process, called a child process. This occurs by
duplicating the state of the existing process with a few minor differences.
• The child process resumes execution on the line after the fork(), just as the parent process does.
• Just as a side remark, in older UNIX systems, the entire address space of the parent process was directly
copied regardless of whether the resource was modified or not. The current behavior is for the kernel
to perform a copy-on-write, which saves a lot of resources, while being time efficient [7, Copy-on-write
section].
Here is a simple example of this address space cloning. The following program may print out 42 twice – but the fork() is after the printf! Why?
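The program itself is elided; here is a sketch consistent with the discussion below:

#include <stdio.h>
#include <unistd.h>

int main() {
    printf("42"); // no newline: "42" stays in the stdout buffer
    fork();       // the buffer is duplicated into the child
    return 0;     // both processes flush "42" on exit
}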
The printf line is executed only once; however, notice that the printed contents are not flushed to standard out. There’s no newline printed, we didn’t call fflush, and we didn’t change the buffering mode. The output text is therefore still in process memory, waiting to be sent. When fork() is executed, the entire process memory is duplicated, including the buffer. Thus, the child process starts with a non-empty output buffer which may be flushed when the program exits. We say may because the contents may remain unwritten given a bad program exit.
To write code that is different for the parent and child process, check the return value of fork(). If fork()
returns -1, that implies something went wrong in the process of creating a new child. One should check the value
stored in errno to determine what kind of error occurred. Common errors include EAGAIN and ENOMEM, which are essentially “try again – resource temporarily unavailable” and “cannot allocate memory”.
Similarly, a return value of 0 indicates that we are operating in the context of the child process, whereas a
positive integer shows that we are in the context of the parent process.
The positive value returned by fork() is the process id (pid) of the child.
A way to remember what the return value of fork represents: the child process can find its parent – the original process that was duplicated – by calling getppid(), so it does not need any additional return information from fork(). The parent process, however, may have many child processes, and therefore needs to be explicitly informed of its children’s PIDs.
According to the POSIX standard, every process only has a single parent process.
The parent process can only know the PID of the new child process from the return value of fork:
pid_t id = fork();
if (id == -1) exit(1); // fork failed
if (id > 0) {
// Original parent
// A child process with id ’id’
// Use waitpid to wait for the child to finish
} else { // returned zero
// Child Process
}
A slightly silly example is shown below. What will it print? Try running this program with multiple arguments.
#include <unistd.h>
#include <stdio.h>
#include <sys/wait.h>
int main(int argc, char **argv) {
pid_t id;
int status;
while (--argc && (id=fork())) {
waitpid(id,&status,0); /* Wait for child*/
}
printf("%d:%s\n", argc, argv[argc]);
return 0;
}
(Each iteration forks a child; the child leaves the loop and prints argv[argc] for the already decremented argc while the parent waits for it, so the arguments are printed in reverse order, ending with 0:argv[0].)
Another example is below: the amazing, parallel, apparently-O(N) sleepsort is today’s silly winner. First published on 4chan in 2011, a version of this awful but amusing sorting algorithm is shown below. Note that this sorting algorithm may fail to produce the correct output.
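The code is elided; here is one plausible sketch of sleepsort (the original evidently printed its output differently, per the transcript below):

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/wait.h>

int main(int argc, char **argv) {
    while (--argc > 0) {
        if (fork() == 0) {
            int n = atoi(argv[argc]); // one child per number
            sleep(n);                 // sleep proportionally to the value
            printf("%d\n", n);
            exit(0);
        }
    }
    while (waitpid(-1, NULL, 0) > 0); // wait for all children
    return 0;
}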
$ ./ssort 1 3 2 4
exit()
exit()
exit()
exit()
The algorithm isn’t actually O(N) because of how the system scheduler works. In essence, this program
outsources the actual sorting to the operating system.
Fork Bomb
A ‘fork bomb’ is what we warned you about earlier. This occurs when there is an attempt to create an infinite
number of processes. This will often bring a system to a near-standstill, as it attempts to allocate CPU time
and memory to a large number of processes that are ready to run. System administrators don’t like them and
may set upper limits on the number of processes each user can have, or revoke login rights because they create
disturbances in the Force for other users’ programs. A program can limit the number of child processes created by
using setrlimit().
Fork bombs are not necessarily malicious – they occasionally occur due to programming errors. Below is a simple example that is not malicious, merely accidental.
#include <unistd.h>
#define HELLO_NUMBER 10
int main(){
pid_t children[HELLO_NUMBER];
int i;
for(i = 0; i < HELLO_NUMBER; i++){
pid_t child = fork();
if(child == -1) {
break;
}
if(child == 0) {
// Child
execlp("ehco", "echo", "hello", NULL);
}
else{
// Parent
children[i] = child;
}
}
int j;
for(j = 0; j < i; j++){
waitpid(children[j], NULL, 0);
}
return 0;
}
We misspelled ehco, so the exec call fails. What does this mean? Each child whose exec fails falls through and keeps running the loop, forking children of its own, so the number of processes doubles at every iteration: instead of creating 10 processes, we created up to 2^10 = 1024 processes, fork bombing our machine. How could we prevent this? Add an exit right after exec, so that if exec fails, we won’t end up calling fork an unbounded number of times. There are various other failure modes to consider: what if we removed the echo binary? What if the binary itself creates a fork bomb?
Signals
We won’t fully explore signals until the end of the course, but it is relevant to broach the subject now because the semantics of fork and various other function calls depend on what state a signal is in. A signal can be thought of as a software interrupt: a process that receives a signal pauses the execution of its current program and responds to the signal.
There are various signals defined by the operating system, two of which you may already know: SIGSEGV and
SIGINT. The first is caused by an illegal memory access, and the second is sent by a user wanting to terminate a
program. In each case, the program jumps from the current line being executed to the signal handler. If no signal
handler is supplied by the program, a default handler is executed – such as terminating the program, or ignoring
the signal.
Here is an example of a simple user-defined signal handler:
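The handler itself is elided; a hedged sketch (sigaction is the modern interface, but signal keeps the sketch short):

#include <signal.h>
#include <stdlib.h>
#include <unistd.h>

void handler(int signum) {
    // printf is not async-signal-safe, so use write instead
    write(1, "Caught SIGINT!\n", 15);
    exit(1);
}

int main() {
    signal(SIGINT, handler);
    while (1) pause(); // wait for signals forever
}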
A signal has four stages in its life cycle: generated, pending, blocked, and received. These refer to when a process generates a signal, when the kernel is about to deliver the signal, when the signal is blocked, and when the kernel actually delivers the signal, each of which takes some time to complete. Read more in the introduction to the Signals chapter.
The terminology is important because fork and exec require different operations based on the state a signal is
in.
To note, it is generally poor programming practice to use signals in program logic, which is to send a signal to
perform a certain operation. The reason: signals have no time frame of delivery and no assurance that they will
be delivered. There are better ways to communicate between two processes.
If you want to read more, feel free to skip ahead to the chapter on POSIX signals and read it over. It isn’t long
and gives you the long and short about how to deal with signals in processes.
2. A child will inherit any open file descriptors of the parent. That means that if a parent reads half of a file and forks, the child will start at that offset. A read on the child’s end will shift the parent’s offset by the same amount. Any other file descriptor flags are also carried over.
3. Pending signals are not inherited. This means that if a parent has a pending signal and creates a child, the
child will not receive that signal unless another process signals the child.
4. The process will be created with one thread (more on that later. The general consensus is to not create
processes and threads at the same time).
5. Since we have copy on write (COW), read-only memory addresses are shared between processes.
6. If a program sets up certain regions of memory, they can be shared between processes.
8. The process’ current working directory (often abbreviated to CWD) is inherited but can be changed.
• The parent is notified via a signal, SIGCHLD, when the child process finishes but not vice versa.
• The child does not inherit pending signals or timer alarms. For a complete list, see the fork man page.
One process will read one part of the file, and the other process will read another part of the file. In the following example, there are two separate open file descriptions, created by two different open calls:
if(!fork()) {
int file = open(...);
read(file, ...);
} else {
int file = open(...);
read(file, ...);
}
size_t buffer_cap = 0;
char * buffer = NULL;
ssize_t nread;
FILE * file = fopen("test.txt", "r");
int count = 0;
while((nread = getline(&buffer, &buffer_cap, file)) != -1) {
printf("%s", buffer);
if(fork() == 0) {
exit(0);
}
wait(NULL);
}
The initial thought may be that it prints the file line by line with some extra forking. It is actually undefined behavior, because we didn’t prepare the file descriptors before forking. To make a long story short, here is what to do to avoid this kind of bug.
1. You as the programmer need to make sure that all of your file descriptors are prepared before forking.
3. If the FILE* is open for reading and has been read fully, it is already prepared.
5. If the file descriptor is prepared, it must be inactive in the parent process while the child process is using it, or vice versa. A process is using a descriptor if it reads or writes it, or if that process calls exit for any reason. If one process uses it while the other process is doing so as well, the whole application’s behavior is undefined.
So how would we fix the code? We would have to flush the file before forking and refrain from using it until
after the wait call – more on the specifics of this next section.
size_t buffer_cap = 0;
char * buffer = NULL;
ssize_t nread;
FILE * file = fopen("test.txt", "r");
int count = 0;
while((nread = getline(&buffer, &buffer_cap, file)) != -1) {
printf("%s", buffer);
fflush(file);
if(fork() == 0) {
exit(0);
}
wait(NULL);
}
What if the parent process and the child process need to run asynchronously and need to keep the file handle open? Due to event ordering, we need to make sure the parent process knows when the child is finished, using wait. We’ll talk about inter-process communication in a later chapter, but for now we can use the double fork method.
//...
fflush(file);
pid_t child = fork();
if(child == 0) {
fclose(file);
if (fork() == 0) {
// Do asynchronous work
// Safe exit, this child doesn’t know about
// the file descriptor
exit(0);
}
exit(0);
}
waitpid(child, NULL, 0);
To parse the POSIX documentation, we’ll have to go deep into the terminology. The sentence that sets the expectation is the following:
The result of function calls involving any one handle (the "active handle") is defined elsewhere in this
volume of POSIX.1-2008, but if two or more handles are used, and any one of them is a stream, the
application shall ensure that their actions are coordinated as described below. If this is not done, the
result is undefined.
What this means is that if we don’t follow POSIX to the letter when using two file descriptors that refer to the
same description across processes, we get undefined behavior. To be technical, the file descriptor must have a
“position” meaning that it needs to have a beginning and an end like a file, not like an arbitrary stream of bytes.
POSIX then goes on to introduce the idea of an active handle, where a handle may be a file descriptor or a FILE*
pointer. File handles don’t have a flag called “active”. An active file descriptor is one that is currently being used
for reading, writing, and other operations (such as exit). The standard says that before a fork, the application – your code – must execute a series of steps to prepare the state of the file. In simplified terms, the descriptor needs to be closed, flushed, or read to its entirety – the gory details are explained later.
For a handle to become the active handle, the application shall ensure that the actions below are
performed between the last use of the handle (the current active handle) and the first use of the
second handle (the future active handle). The second handle then becomes the active handle. All
activity by the application affecting the file offset on the first handle shall be suspended until it again
becomes the active file handle. (If a stream function has as an underlying function one that affects
the file offset, the stream function shall be considered to affect the file offset.)
Summarizing: if two file handles are actively used at once, the behavior is undefined. The other note is that after a fork, the library code must prepare the file descriptor as if the other process could make the file active at any time. The last block quote concerns how a process prepares a file descriptor in our case.
If the stream is open with a mode that allows reading and the underlying open file description refers
to a device that is capable of seeking, the application shall either perform an fflush(), or the stream
shall be closed.
The documentation says that the child needs to perform an fflush or close the stream because the file descriptor
needs to be prepared in case the parent process needs to make it active. glibc is in a no-win situation if it closes
a file descriptor that the parent may expect to be open, so it’ll opt for the fflush on exit because exit in POSIX
terminology counts as accessing a file. That means that for our parent process, this clause gets triggered.
If any previous active handle has been used by a function that explicitly changed the file offset,
except as required above for the first handle, the application shall perform an lseek() or fseek() (as
appropriate to the type of handle) to an appropriate location.
Since the child calls fflush and the parent didn’t prepare, the operating system chooses where the file offset gets reset. Different file systems will do different things, which the standard permits. The OS may look at modification times and conclude that the file hasn’t changed, so no reset is needed, or it may conclude that exit denotes a change and rewind the file back to the beginning.
If the parent process wants to wait for the child to finish, it must use waitpid (or wait), both of which wait for
a child to change process states, which can be one of the following:
Note that waitpid can be set to be non-blocking, which means it will return immediately, letting a program know whether the child has exited.
wait is a simpler version of waitpid. wait accepts a pointer to an integer and waits on any child process.
After the first one changes state, wait returns. Here is the behavior of waitpid:
1. A program can wait on a specific process, or it can pass in special values for the pid to do different things
(check the man pages).
2. The last parameter to waitpid is an option parameter. The options are listed below:
4. WNOWAIT - Wait, but leave the child wait-able by another wait call
Exit statuses or the value stored in the integer pointer for both of the calls above are explained below.
Exit statuses
To find the return value of main() (or the value passed to exit()), use the wait macros – typically a program will use WIFEXITED and WEXITSTATUS. See the wait/waitpid man page for more information.
int status;
pid_t child = fork();
if (child == -1) {
return 1; //Failed
}
if (child > 0) {
// Parent, wait for child to finish
pid_t pid = waitpid(child, &status, 0);
if (pid != -1 && WIFEXITED(status)) {
int exit_status = WEXITSTATUS(status);
printf("Process %d returned %d" , pid, exit_status);
}
} else {
// Child, do something interesting
execl("/bin/ls", "/bin/ls", ".", (char *) NULL); // "ls ."
}
A process can only have 256 distinct return values; the rest of the bits in the status are informational, and the information is extracted with bit shifting. The kernel has an internal way of keeping track of signaled, exited, or stopped processes, and this API is abstracted so that kernel developers are free to change it at will. Remember: these macros only make sense if their precondition is met. For example, a process’ stop signal won’t be defined if the process wasn’t actually stopped. The macros will not do the checking for the program, so it’s up to the programmer to make sure the logic is correct: a program should use WIFSTOPPED to check whether a process was stopped and then WSTOPSIG to find the signal that stopped it. As such, there is no need to memorize the following. This is a high-level overview of how information is stored inside the status variables, from the sys/wait.h of an old Berkeley Software Distribution (BSD) kernel [1]:
There is a convention about exit codes. If the process exited normally and everything was successful, then zero should be returned. Beyond that, there aren’t many widely accepted conventions. If a program documents return codes for certain conditions, callers can make more sense of the 256 possible codes – for example, a program could return 1 if it reached stage 1 (like writing to a file), 2 if it did something else, etc. Usually, though, UNIX programs are not designed to follow such fine-grained policies, for the sake of simplicity.
int main() {
pid_t child = fork();
if (child == 0) {
// Do background stuff e.g. call exec
} else { /* I’m the parent! */
sleep(4); // so we can see the cleanup
puts("Parent is done");
}
return 0;
}
1. More than one child may have finished but the parent will only get one SIGCHLD signal (signals are not
queued)
2. SIGCHLD signals can be sent for other reasons (e.g. a child process has temporarily stopped)
3. It uses the deprecated signal code, instead of the more portable sigaction.
exec
To make the child process execute another program, use one of the exec functions after forking. The exec family of functions replaces the process image with that of the specified program. This means that any lines of code after the exec call are replaced with those of the executed program, so any other work a program wants the child process to do should be done before the exec call. The naming scheme can be decoded mnemonically:
1. e – An array of pointers to environment variables is explicitly passed to the new process image.
3. p – Uses the PATH environment variable to find the file named in the file argument to be executed.
Note that if the information is passed via an array, the last element must be followed by a NULL element to
terminate the array.
An example of this is below. This code executes ls:
#include <unistd.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <stdlib.h>
#include <stdio.h>
#include <fcntl.h>    // for open's O_* flags
#include <sys/stat.h> // for the S_I* mode bits

// Other versions of exec pass in arguments as arrays.
// Remember the first arg is the program name, and the
// last arg must be a char pointer to NULL.

int main() {
    close(1); // close standard out
    open("log.txt", O_RDWR | O_CREAT | O_APPEND, S_IRUSR | S_IWUSR);
    puts("Captain's log");
    chdir("/usr/include");
    // execl(executable, arguments for executable including
    // program name, and NULL at the end)
    execl("/bin/ls", "/bin/ls", ".", (char *) NULL); // reconstructed call
    return 0;
}
The example writes "Captain's log" to a file and then prints everything in /usr/include to the same file. There’s no error checking in the above code (we assume close, open, chdir, etc. work as expected).
1. open – uses the lowest available file descriptor (i.e. 1), so standard out (stdout) is now redirected to the log file.
3. execl – replaces the program image with /bin/ls and calls its main() method.
5. We need the "return 0;" because compilers complain if we don’t have it.
1. File descriptors are preserved after an exec. That means that if a program opens a file and doesn’t close it, the file remains open in the new program image. This is a problem because usually the new program doesn’t know about those file descriptors. Nevertheless, they take up a slot in the file descriptor table and could possibly prevent other processes from accessing the file. The one exception is a file descriptor with the close-on-exec flag set (O_CLOEXEC) – we will go over setting flags later.
2. Various signal semantics: the executed process preserves the signal mask and the pending signal set, but does not preserve the signal handlers, since it is a different program.
4. The operating system may reopen file descriptors 0, 1, 2 – stdin, stdout, stderr – if they are closed at exec time, but most of the time it leaves them closed.
5. The executed process runs as the same PID and has the same parent and process group as the previous
process.
6. The executed process runs as the same user and group, with the same working directory.
Shortcuts
system packages up the above code [9, P. 371]. The following is a snippet of how to use system.
#include <unistd.h>
#include <stdlib.h>
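The snippet itself is elided; here is a sketch of the kind of usage the security warning below has in mind (argv[1] flows straight into a shell):

#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    if (argc < 2) return 1;
    char command[256];
    // Danger: argv[1] is interpreted by the shell!
    snprintf(command, sizeof(command), "ls %s", argv[1]);
    system(command);
    return 0;
}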
The system call will fork, execute the command passed as a parameter, and the original parent process will wait for it to finish. This also means that system is a blocking call: the parent process can’t continue until the process started by system exits. Also, system actually creates a shell that is then given the string, which is more overhead than using exec directly. The standard shell will use the PATH environment variable to search for a filename that matches the command. Using system will usually be sufficient for many simple run-this-command problems, but it can quickly become limiting for more complex or subtle problems, and it hides the mechanics of the fork-exec-wait pattern, so we encourage you to learn and use fork, exec, and waitpid instead. It also tends to be a huge security risk: by allowing someone to access a shell version of the environment, the program can run into all sorts of problems:
Passing something along the lines of argv[1] = "; sudo su" is a huge security risk called privilege escalation.
A common programming pattern is to call fork followed by exec and wait. The original process calls fork, which
creates a child process. The child process then uses exec to start the execution of a new program. Meanwhile, the
parent uses wait (or waitpid) to wait for the child process to finish.
(Diagram: the parent calls fork() and later wait(); the child calls exec().)
#include <unistd.h>
#include <stdlib.h>
#include <sys/wait.h>
int main() {
pid_t pid = fork();
if (pid < 0) {// fork failure
exit(1);
} else if (pid > 0) {
int status;
waitpid(pid, &status, 0);
} else {
execl("/bin/ls", "/bin/ls", NULL);
exit(1); // For safety.
}
}
Why not execute ls directly? The reason is that now we have a monitor program – the parent – that can do other things. It can proceed and execute another function, modify the state of the system, or read the output of the child.
Environment Variables
Environment variables are variables that the system keeps for all processes to use. Your system has these set up right now! In Bash, some are already defined:
$ echo $HOME
/home/bhuvy
$ echo $PATH
/usr/local/sbin:/usr/bin:...
How would a program alter these in C? It can call the getenv and setenv functions respectively.
char* home = getenv("HOME"); // Will return /home/bhuvy
setenv("HOME", "/home/bhuvan", 1 /*set overwrite to true*/ );
Environment variables are important because they are inherited between processes and can be used to specify
a standard set of behaviors [2], although you don’t need to memorize the options. Another security-related concern is that environment variables cannot be read by an outside process, whereas argv can be.
Further Reading
Read the man pages and the POSIX groups above! Here are some guiding questions. Note that we aren’t expecting
you to memorize the man page.
• fork
• exec
• wait
Topics
• Correct use of fork, exec and waitpid
• Understanding what fork and exec and waitpid do. E.g. how to use their return values.
• Process memory layout (where is the heap, stack etc; invalid memory addresses).
• getpid vs getppid
Questions/Exercises
• What is the difference between exec calls with a p and without a p? What does the operating system use the PATH environment variable for?
• How does a program pass in command line arguments to execl*? How about execv*? What should be
the first command line argument by convention?
• What is the int *status pointer passed into wait? When does wait fail?
• What are some differences between SIGKILL, SIGSTOP, SIGCONT, SIGINT? What are the default behav-
iors? Which ones can a program set up a signal handler for?
• My terminal is anchored to PID = 1337 and has become unresponsive. Write me the terminal command
and the C code to send SIGQUIT to it.
• Can one process alter another process’ memory through normal means? Why or why not?
• Where is the heap, stack, data, and text segment? Which segments can a program write to? What are
invalid memory addresses?
• What is an orphan? How does it become a zombie? What should a parent do to avoid this?
• Don’t you hate it when your parents tell you that you can’t do something? Write a program that sends
SIGSTOP to a parent process.
• Write a function that fork exec waits an executable, and using the wait macros tells me if the process exited
normally or if it was signaled. If the process exited normally, then print that with the return value. If not,
then print the signal number that caused the process to terminate.
Bibliography
[7] Daniel Bovet and Marco Cesati. Understanding The Linux Kernel. Oreilly & Associates Inc, 2005. ISBN
0596005652.
[9] Larry Jones. WG14 N1539 Committee Draft ISO/IEC 9899:201x, 2010.
[10] Peter Van der Linden. Expert C programming: deep C secrets. Prentice Hall Professional, 1994.
Memory Allocators
5
Memory memory everywhere but not an allocation to be made
A fragmented heap
Introduction
Memory allocation is important! Allocating and deallocating heap memory is one of the most common operations
in any application. The heap at the system level is a contiguous series of addresses that the program can expand or contract and use as it sees fit [2]. In POSIX, the end of this region is called the system break, and we use sbrk to move it. Most programs don’t interact directly with this call; they use a memory allocation system around it to handle chunking up memory and keeping track of which memory is allocated and which is freed.
We will mainly be looking into simple allocators. Just know that there are other ways of dividing up memory
like with mmap or other allocation schemes and methods like jemalloc.
• malloc(size_t bytes) is a C library call used to reserve a contiguous block of memory that may be uninitialized [4, P. 348]. Unlike stack memory, the memory remains allocated until free is called with the same pointer. malloc either returns a pointer to at least as much free space as was requested, or NULL; note that malloc can return NULL even if some space is available, just not enough contiguous space. Robust programs should check the return value. If your code assumes malloc succeeds and it does not, then your program will likely crash (segfault) when it tries to write to address 0. Also, malloc leaves garbage in memory for performance reasons – check your code to make sure that all program values are initialized before use.
• realloc(void *space, size_t bytes) allows a program to resize an existing memory allocation
that was previously allocated on the heap (via malloc, calloc, or realloc) [4, P. 349]. The most common use
of realloc is to resize memory used to hold an array of values. There are two gotchas with realloc. One, a
new pointer may be returned. Two, it can fail. A naive but readable version of realloc is suggested below
with sample usage.
void * realloc(void * ptr, size_t newsize) {
    // Simple implementation always reserves more memory
    // and has no error checking
    void *result = malloc(newsize);
    size_t oldsize = ... // (depends on the allocator's internal data structure)
    if (ptr) memcpy(result, ptr, newsize < oldsize ? newsize : oldsize);
    free(ptr);
    return result;
}
int main() {
    // 1
    int *array = malloc(sizeof(int) * 2);
    array[0] = 10; array[1] = 20;
    // Oops need a bigger array - so use realloc..
    array = realloc(array, 3 * sizeof(int));
    array[2] = 30;
}
The above code is fragile. If realloc fails then the program leaks memory. Robust code checks for the
return value and only reassigns the original pointer if not NULL.
int main() {
// 1
int *array = malloc(sizeof(int) * 2);
array[0] = 10; array[1] = 20;
void *tmp = realloc(array, 3 * sizeof(int));
if (tmp == NULL) {
// Nothing to do here.
} else if (tmp == array) {
// realloc returned same space
array[2] = 30;
} else {
// realloc returned different space
array = tmp;
array[2] = 30;
}
}
• calloc(size_t nmemb, size_t size) initializes memory contents to zero and takes two arguments: the number of items and the size in bytes of each item. Programmers often use calloc rather than explicitly calling memset after malloc to set the memory contents to zero, because certain performance shortcuts can be taken (for example, freshly mapped pages are already zero). Note that calloc(x, y) behaves identically to calloc(y, x), but you should follow the conventions of the manual. A naive implementation of calloc is sketched below.
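A minimal sketch, assuming malloc and memset are available (a production calloc must also check that nmemb * size doesn't overflow):

#include <stdlib.h>
#include <string.h>

void *calloc(size_t nmemb, size_t size) {
    size_t total = nmemb * size; // naive: unchecked multiplication
    void *result = malloc(total);
    if (result) memset(result, 0, total);
    return result;
}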
• free takes a pointer to the start of a piece of memory and makes it available for use in subsequent calls to the other allocation functions. This is important because we don’t want our process to hold on to memory it no longer needs. Once we are done using memory, we release it with free. A simple usage is below.
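A sketch of that usage:

int *ptr = malloc(sizeof(int));
if (ptr) {
    *ptr = 42;
    free(ptr);  // hand the memory back to the allocator
    ptr = NULL; // habit: prevents accidental use after free
}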
Note that memory newly obtained from the operating system must be zeroed out: if the operating system left the contents of physical RAM as-is, it might be possible for one process to learn about the memory of another process that had previously used it. This would be a security leak. As a consequence, memory returned by malloc requests made before any memory has been freed is often zero. This is unfortunate because many programmers mistakenly write C programs that assume allocated memory will always be zero.
Intro to Allocating
Let’s try to write malloc. Here is our first attempt at it – the naive version, sketched below.
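A sketch of the naive allocator that the bullets below critique (an assumed reconstruction):

#include <unistd.h>

void *malloc(size_t size) {
    // Ask the OS to move the program break up by size bytes, every single time.
    void *p = sbrk(size);
    if (p == (void *) -1) return NULL; // out of address space
    return p;
}

void free(void *ptr) {
    // Do nothing: memory is never reused!
}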
• System calls are slow compared to library calls. We should reserve a large amount of memory and only
occasionally ask for more from the system.
• No reuse of freed memory. Our program never re-uses heap memory - it keeps asking for a bigger heap.
If this allocator was used in a typical program, the process would quickly exhaust all available memory.
Instead, we need an allocator that can efficiently use heap space and only ask for more memory when necessary.
Some programs do use this type of allocator effectively. Consider a video game allocating objects to load the next scene: it is considerably faster to do the above and then throw the entire block of memory away than it is to run the placement strategies described below.
Placement Strategies
During program execution, memory is allocated and deallocated, so gaps (holes) appear in heap memory that can be re-used for future memory requests. The memory allocator needs to keep track of which parts of the heap are currently allocated and which parts are available. Suppose our current heap size is 64KiB, and it contains several allocated regions and three free holes, the last of which is exactly 2KiB.
If a new malloc request for 2KiB is executed (malloc(2048)), where should malloc reserve the memory? It
could use the last 2KiB hole, which happens to be the perfect size! Or it could split one of the other two free holes.
These choices represent different placement strategies. Whichever hole is chosen, the allocator will need to split the hole into two: the newly allocated space, which will be returned to the program, and a smaller hole, if there is spare space left over. A perfect-fit strategy finds the smallest hole that is of sufficient size (at least 2KiB):
(Diagram: the resulting 2KiB allocation and the remaining free space after the perfect-fit choice.)
A first-fit strategy finds the first available hole that is of sufficient size, so it breaks the 16KiB hole into two. We don’t even have to look through the entire heap!
(Diagram: the 16KiB hole split into a 2KiB allocation and a 14KiB free hole.)
One thing to keep in mind is that placement strategies don’t have to split the block at all. For example, our first-fit allocator could have returned the original 16KiB block unbroken. Notice that this would leave about 14KiB of space unused by both the user and the allocator. We call this internal fragmentation.
In contrast, external fragmentation means that even though we have enough total free memory in the heap, that memory may be divided up in such a way that no contiguous block of the needed size is available. In our previous example, of the 64KiB of heap memory, 17KiB is allocated and 47KiB is free, yet the largest available block is only 30KiB because our available unallocated heap memory is fragmented into smaller pieces.
• Fiddly implementation – lots of pointer manipulation using linked lists and pointer arithmetic.
• Both fragmentation and performance depend on the application’s allocation profile, which can be evaluated but not predicted. In practice, under specific usage conditions, a special-purpose allocator can often out-perform a general-purpose implementation.
• The allocator doesn’t know the program’s memory allocation requests in advance. Even if we did, this is the
Knapsack problem which is known to be NP-hard!
Different strategies affect the fragmentation of heap memory in non-obvious ways, which only are discovered
by mathematical analysis or careful simulations under real-world conditions (for example simulating the memory
allocation requests of a database or webserver).
First, we will have a more mathematical, one-shot approach to each of these algorithms [3]. The paper
describes a scenario where you have a certain number of bins and a certain number of allocations, and you are
trying to fit the allocations in as few bins as possible, hence using as little memory as possible. The paper discusses
theoretical implications and puts a nice limit on the long-run ratio between actual memory usage and ideal memory usage. For those who are interested, the paper concludes that this ratio, as the number of bins increases – the bins can have any distribution – is about 1.7 for first fit and lower bounded by 1.7 for best fit. The problem with this analysis is that few real-world applications need this
type of one-shot allocation. Video game object allocations will typically designate a different subheap for each
level and fill up that subheap if they need a quick memory allocation scheme that they can throw away.
In practice, we’ll be using the result from a more rigorous survey conducted in 2005 [7]. The survey makes
sure to note that memory allocation is a moving target. A good allocation scheme to one program may not be a
good allocation scheme for another program. Programs don’t uniformly follow the distribution of allocations. The
survey talks about all the allocation schemes that we have introduced as well as a few extra ones. Here are some
summarized takeaways
1. Best fit may have problems when a block is chosen that is almost the right size and the remaining space is split so small that a program probably won’t use it. A way to get around this could be to set a threshold for splitting. This small splitting isn’t observed as frequently under regular workloads. Also, the worst-case behavior of best fit is bad, but it doesn’t usually happen [7, p. 43].
2. The survey also makes an important distinction about first fit: there are multiple notions of ‘first’. Blocks could be ordered by the time they were free’d, by the addresses of the start of each block, or by the time of last free – first being least recently used. The survey didn’t go too in-depth into the performance of each but did note that address-ordered and least recently used (LRU) lists ended up with better performance than most-recently-used-first lists.
3. The survey concludes by noting that under simulated random workloads (uniform at random), best fit and first fit do equally well. Even in practice, both best fit and address-ordered first fit do about equally well with a splitting threshold and coalescing. The reasons why aren’t entirely known.
1. Best fit may take less time than a full heap scan. When a block of perfect size or perfect size within a
threshold is found, that can be returned, depending on what edge-case policy you have.
2. Worst fit follows this as well. Your heap could be represented with a max-heap data structure, and each allocation call could simply pop the top off, re-heapify, and possibly insert a split memory block. Using Fibonacci heaps, however, could be extremely inefficient.
3. First fit needs to have a block order. Most of the time programmers will default to linked lists, which is a fine choice. There aren’t too many improvements you can make with least recently used and most recently used linked list policies, but with address-ordered linked lists you can speed up insertion from O(n) to O(log(n)) by using a randomized skip list in conjunction with your singly-linked list. An insert would use the skip list as shortcuts to find the right place to insert the block, and removal would go through the list as normal.
4. There are many placement strategies that we haven’t talked about. One is next fit, which is first fit starting from where the last search left off. This adds deterministic randomness – pardon the oxymoron. You won’t be expected to know this algorithm, but know, as you are implementing a memory allocator as part of a machine problem, that there are more strategies than these.
A memory allocator needs to keep track of which bytes are currently allocated and which are available for use.
This section introduces the implementation and conceptual details of building an allocator, or the actual code that
implements malloc and free.
Conceptually, we are thinking about creating linked lists and lists of blocks! Please enjoy the following ASCII
art. bt is short for boundary tag.
We will have implicit pointers to the next block, meaning that we can get from one block to another using pointer addition. This is in contrast to an explicit metadata *next field in our meta block.
char *p        p + sizeof(meta)               p + sizeof(meta) + p->size
   |                 |                                  |
   v                 v                                  v
   +------------+---------------------------+----------+
   |  Metadata  |       Usable Space        |   BTag   |
   +------------+---------------------------+----------+
                                                        ^
                       next block begins at p + sizeof(meta) + p->size + sizeof(BTag)
One can grab the next block by finding the end of the current one. That is what we mean by “implicit list”.
The actual spacing may be different. The metadata can contain different things. A minimal metadata
implementation would simply have the size of the block.
Since we write integers and pointers into memory that we already control, we can later consistently hop from one address to the next. This internal information represents some overhead, meaning that even if we had requested 1024 KiB of contiguous memory from the system, an allocation of that full size will fail.
Our heap memory is a list of blocks where each block is either allocated or unallocated. Thus there is
conceptually a list of free blocks, but it is implicit in the form of block size information that we store as part of
each block. Let’s think of it in terms of a simple implementation.
typedef struct {
    size_t block_size;
    char data[0];
} block;

block *p = sbrk(100);
p->block_size = 100 - sizeof(*p) - sizeof(size_t); /* assuming a size_t boundary tag */
// Other block allocations
We could navigate from one block to the next block by adding the block’s size.
Make sure to get your casting right! Otherwise, the pointer will move over an unexpected number of bytes.
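To make the pointer arithmetic concrete, here is a minimal sketch of hopping to the next block. The helper name next_block and the size_t-sized boundary tag are our assumptions for illustration, not part of the original text.

// Advance to the next block in the implicit list. Casting to char * makes
// the arithmetic byte-granular rather than sizeof(block)-granular.
block *next_block(block *b) {
    char *raw = (char *) b;
    return (block *) (raw + sizeof(block) + b->block_size + sizeof(size_t));
}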
The calling program never sees these values. They are internal to the implementation of the memory allocator.
As an example, suppose your allocator is asked to reserve 80 bytes (malloc(80)) and requires 8 bytes of internal header data. The allocator would need to find an unallocated space of at least 88 bytes. After updating the heap data, it would return a pointer to the block. However, the returned pointer points to the usable space, not the internal data! So we would return the start of the block + 8 bytes. In the implementation, remember that pointer arithmetic depends on type. For example, p += 8 adds 8 * sizeof(*p), not necessarily 8 bytes!
typedef struct {
    size_t block_size;
    int is_free;
    char data[0];
} block;

block *p = sbrk(100);
p->block_size = 100 - sizeof(*p) - sizeof(size_t); /* assuming a size_t boundary tag */
// Other block allocations
If the program wants certain bits to hold different pieces of information, use bit fields!
typedef struct {
    unsigned int block_size : 7;
    unsigned int is_free : 1;
} size_free;

typedef struct {
    size_free info;
    char data[0];
} block;
The compiler will handle the shifting. After setting up your fields, finding a block becomes a matter of looping through each of the blocks and checking the appropriate fields, as sketched below.
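Here is a minimal first-fit scan over such blocks; heap_start, heap_end, and next_block are assumed helpers and globals for illustration, not part of the original text.

// Return the first free block large enough for the request, or NULL.
block *find_first_fit(size_t needed) {
    for (block *b = heap_start; b < heap_end; b = next_block(b)) {
        if (b->info.is_free && b->info.block_size >= needed) {
            return b;
        }
    }
    return NULL;
}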
Here is a visual representation of what happens. Assume that we have a free block spanning addresses 0x0 to 0x34, and the allocation is, let’s say, 16 bytes. The split we’ll have to do is the following.
[Figure: the block at 0x0–0x34 is split into an allocated 16-byte block and a smaller free remainder.]
The block that malloc gives you is guaranteed to be aligned so that it can hold any type of data. On GNU systems, the address is always a multiple of eight on most systems and a multiple of 16 on 64-bit systems. For example, if you need to calculate how many 16-byte units are required, don’t forget to round up.
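The rounding snippet the next sentence refers to appears to have been dropped in extraction; it would look something like this sketch:

// Round the request up to whole 16-byte units; the +15 handles partial units.
size_t units = (size + 15) / 16;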
The additional constant ensures incomplete units are rounded up. Note, real code is more likely to use symbolic sizes, e.g. sizeof(x) - 1, rather than hard-coding the numerical constant 15. Here’s a great article on memory alignment, if you are further interested.
Internal fragmentation also happens when the given block is larger than the requested allocation size. Let’s say that we have a free block of size 16B (not including metadata). If a program allocates 7 bytes, the allocator may want to round up to 16B and return the entire block. This gets sinister when implementing coalescing and splitting. If the allocator doesn’t implement either, it may end up returning a block of size 64B for a 7B allocation! There is a lot of overhead for that allocation, which is what we are trying to avoid.
Implementing free
When free is called, we need to re-apply the offset to get back to the ‘real’ start of the block – to where we stored the size information. A naive implementation would simply mark the block as unused. If we are storing the block allocation status in a bitfield, then we need to set the bit:
p->info.is_free = 1;
However, we have a bit more work to do. If the current block and the next block (if it exists) are both free, we need to coalesce these blocks into a single block. Similarly, we also need to check the previous block. If that exists and represents unallocated memory, then we need to coalesce the blocks into a single large block.
To be able to coalesce a free block with a previous free block we will also need to find the previous block, so
we store the block’s size at the end of the block, too. These are called “boundary tags” [5]. These are Knuth’s
solution to the coalescing problem both ways. As the blocks are contiguous, the end of one block sits right next to
the start of the next block. So the current block (apart from the first one) can look a few bytes further back to
look up the size of the previous block. With this information, the allocator can now jump backward!
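A minimal sketch of that backward jump; the layout (a size_t tag at the very end of each block) and the helper name prev_block are our assumptions for illustration.

// The previous block's boundary tag sits immediately before our metadata.
block *prev_block(block *b) {
    size_t prev_size = *(size_t *) ((char *) b - sizeof(size_t));
    return (block *) ((char *) b - sizeof(size_t) - prev_size - sizeof(block));
}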
Take, for example, a double coalesce. If we wanted to free the middle block, we would need to turn the surrounding blocks into one big block.
[Figure: free() on the middle block merges it with the free blocks on both sides into one block with a single meta tag.]
typedef struct block {
    size_t info;
    struct block *next;
    char data[0];
} block;
Here is what that would look like along with our implicit linked list.
Where do we store the pointers of our linked list? A simple trick is to realize that the block itself is not being used and to store the next and previous pointers as part of the block, though you have to ensure that the free blocks are always sufficiently large to hold two pointers. We still need to implement boundary tags, so we can correctly free blocks and coalesce them with their two neighbors. Consequently, explicit free lists require more code and complexity. With explicitly linked lists, a fast and simple ‘find-first’ algorithm is used to find the first sufficiently large link. However, since the link order can be modified, this corresponds to different placement strategies. For example, if the links are maintained from largest to smallest, then this produces a ‘worst-fit’ placement strategy.
There are edge cases, though. Consider how to maintain your free list while also double coalescing. We’ve included a figure with a common mistake.
[Figure: a common free-list mistake when double coalescing leaves stale meta tags behind.]
We recommend when trying to implement malloc that you draw out all the cases conceptually and then write
the code.
The newly deallocated block can be inserted easily into two possible positions: at the beginning of the list or in address order. Inserting at the beginning creates a LIFO (last-in, first-out) policy: the most recently deallocated spaces will be reused. Studies suggest fragmentation is worse than using address order [7].
Inserting in address order (“address-ordered policy”) inserts deallocated blocks so that the blocks are visited in increasing address order. This policy requires more time to free a block because the boundary tags (size data) must be used to find the next and previous unallocated blocks. However, there is less fragmentation.
A segregated allocator is one that divides the heap into different areas that are handled by different sub-allocators depending on the size of the allocation request. Sizes are grouped into powers of two, each size is handled by a different sub-allocator, and each size maintains its own free list.
A well-known allocator of this type is the buddy allocator [6, p. 85]. We’ll discuss the binary buddy allocator, which splits allocations into blocks of size 2^n (n = 1, 2, 3, ...) times some base unit number of bytes, but others also exist, like the Fibonacci split, where the allocation is rounded up to the next Fibonacci number. The basic concept is simple: if there are no free blocks of size 2^n, go to the next level, steal that block, and split it into two. If two neighboring blocks of the same size become unallocated, they can coalesce back together into a single large block of twice the size.
Buddy allocators are fast because the neighboring blocks to coalesce with can be calculated from the deallocated
block’s address, rather than traversing the size tags. Ultimate performance often requires a small amount of
assembler code to use a specialized CPU instruction to find the lowest non-zero bit.
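To make that concrete, here is a hedged sketch: in a binary buddy scheme where block offsets are measured from the start of the heap and every block is aligned to its own power-of-two size, the buddy differs from the block in exactly one address bit. The name buddy_of is ours, not from the text.

#include <stdint.h>

// Offset of the buddy that a freed block of block_size bytes can merge with.
uintptr_t buddy_of(uintptr_t block_offset, size_t block_size) {
    return block_offset ^ block_size;
}

Note also that GCC and Clang expose the ‘lowest set bit’ operation mentioned above through builtins such as __builtin_ctzl, so hand-written assembly is rarely needed nowadays.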
The main disadvantage of the buddy allocator is that it suffers from internal fragmentation, because allocations are rounded up to the nearest block size. For example, a 68-byte allocation will require a 128-byte block.
The SLUB allocator is a slab allocator that serves the different needs of the Linux kernel. Imagine you are creating an allocator for the kernel. What are your requirements? Here is a hypothetical shortlist.
1. First and foremost is you want a low memory footprint to have the kernel be able to be installed on all types
of hardware: embedded, desktop, supercomputer, etc.
2. Then, you want the actual memory to be as contiguous as possible to make good use of caching. Every time a system call is performed, the kernel’s pages need to be loaded into memory. This means that if they are all contiguous, the processor will be able to cache them more efficiently.
Enter the SLUB allocator, which backs kmalloc. The SLUB allocator is a segregated-list allocator with minimal splitting and coalescing. The difference here is that the segregated lists focus on more realistic allocation sizes, instead of powers of two. SLUB also focuses on a low overall memory footprint while keeping pages in the cache. There are blocks of different sizes, and the kernel rounds up each allocation request to the lowest block size that satisfies it. One of the big differences between this allocator and the others is that it usually conforms to page sizes. We’ll talk about virtual memory and pages in another chapter, but the kernel works with direct memory pages in spans of 4KiB or 4096 bytes.
Further Reading
Guiding questions
• Does realloc accept, as its argument, the number of elements or space (in bytes)?
• Slab Allocation
• Best Fit
• Worst Fit
• First Fit
• Buddy Allocator
• Internal Fragmentation
• External Fragmentation
• sbrk
• Natural Alignment
• Boundary Tag
• Coalescing
• Splitting
Questions/Exercises
• What is a Best Fit placement strategy? How is it with External Fragmentation? Time Complexity?
• What is a Worst Fit placement strategy? Is it any better with External Fragmentation? Time Complexity?
• What is the First Fit Placement strategy? It’s a little bit better with Fragmentation, right? Expected Time
Complexity?
• Let’s say that we are using a buddy allocator with a new slab of 64KiB. How does it go about allocating 1.5KiB?
• What is Coalescing/Splitting? How do they increase/decrease fragmentation? When can you coalesce or
split?
• How do boundary tags work? How can they be used to coalesce or split?
Bibliography
[1] Virtual memory allocation and paging, May 2001. URL https://ftp.gnu.org/old-gnu/Manuals/glibc-2.2.3/html_chapter/libc_3.html.
[3] M. R. Garey, R. L. Graham, and J. D. Ullman. Worst-case analysis of memory allocation algorithms. In Proceedings of the Fourth Annual ACM Symposium on Theory of Computing, STOC ’72, pages 143–150, New York, NY, USA, 1972. ACM. doi: 10.1145/800152.804907. URL http://doi.acm.org/10.1145/800152.804907.
[4] Larry Jones. WG14 N1539 Committee Draft ISO/IEC 9899:201x, 2010.
[5] D.E. Knuth. The Art of Computer Programming: Fundamental Algorithms. Number v. 1-2 in Addison-Wesley
series in computer science and information processing. Addison-Wesley, 1973. ISBN 9780201038217. URL
https://books.google.com/books?id=dC05RwAACAAJ.
[6] C.P. Rangan, V. Raman, and R. Ramanujam. Foundations of Software Technology and Theoretical Computer
Science: 19th Conference, Chennai, India, December 13-15, 1999 Proceedings. FOUNDATIONS OF SOFTWARE
TECHNOLOGY AND THEORETICAL COMPUTER SCIENCE. Springer, 1999. ISBN 9783540668367. URL
https://books.google.com/books?id=0uHME7EfjQEC.
[7] Paul R. Wilson, Mark S. Johnstone, Michael Neely, and David Boles. Dynamic storage allocation: A survey and critical review. In Henry G. Baker, editor, Memory Management, pages 1–116, Berlin, Heidelberg, 1995. Springer Berlin Heidelberg. ISBN 978-3-540-45511-0.
Threads
6
If you thought your programs were crashing before, wait until they crash ten times as fast
Bhuvy
A thread is short for ‘thread-of-execution’. It represents the sequence of instructions that the CPU has executed and will execute. To remember how to return from function calls, and to store the values of automatic variables and parameters, a thread uses a stack. Somewhat weirdly, a thread is a process, meaning that creating a thread is similar to fork, except there is no copying, and therefore no copy-on-write. What this allows is for a process to share the same address space, variables, heap, file descriptors, and so on. The actual system call to create a thread is similar to fork: it’s clone. We won’t go into the specifics, but you can read the man pages, keeping in mind that they are outside the direct scope of this course. LWPs, or lightweight processes (threads), are preferred to forking in a lot of scenarios because there is much less overhead in creating them. But in some cases – notably, Python takes this approach – multiprocessing is the way to make your code faster.
Processes vs threads
• When more security is desired. For example, Chrome browser uses different processes for different tabs.
• When running an existing and complete program then a new process is required, for example starting ‘gcc’.
• When you are running into synchronization primitives and each process is operating on something in the
system.
• When you have too many threads – the kernel tries to schedule all the threads near each other which could
cause more harm than good.
• If one thread blocks in a task (say IO) then all threads block. Processes don’t have that same restriction.
• When the amount of communication is minimal enough that simple IPC needs to be used.
• You want to leverage the power of a multi-core system to do one task
Thread Internals
Your main function, like other functions, has automatic variables. We store them in memory using a stack and keep track of how large the stack is by using a simple pointer (the “stack pointer”). If the thread calls another function, we move our stack pointer down, so that we have more space for parameters and automatic variables. Once the function returns, we move the stack pointer back up to its previous value. We keep a copy of the old stack pointer value – on the stack! This is why returning from a function is quick. It’s easy to ‘free’ the memory used by automatic variables because the program only needs to change the stack pointer.
In a multi-threaded program, there are multiple stacks but only one address space. The pthread library
allocates some stack space and uses the clone function call to start the thread at that stack address.
[Figure: the process address space, with reserved space set aside for the new thread’s stack.]
A program can have more than one thread running inside a process. The program gets the first thread for free! It runs the code you write inside ‘main’. If a program needs more threads, it can call pthread_create to create a new thread using the pthread library. You’ll need to pass a pointer to a function so that the thread knows where to start.
The threads all live inside the same virtual memory because they are part of the same process. Thus they can
all see the heap, the global variables, and the program code.
[Figure: two thread stacks inside the reserved space of a single address space; both threads can reach the shared heap, for example through an int *b pointer.]
Thus, a program can have two (or more) CPUs working on it at the same time, inside the same process. It’s up to the operating system to assign the threads to CPUs. If a program has more active threads than CPUs, the kernel will assign a thread to a CPU for a short duration (or until it runs out of things to do) and then will automatically switch the CPU to work on another thread. For example, one CPU might be processing the game AI while another thread is computing the graphics output.
Simple Usage
To use pthreads, include pthread.h and compile and link with the -pthread (or -lpthread) compiler option. This option tells the compiler that your program requires threading support. To create a thread, use the function pthread_create. This function takes four arguments:
• The first is a pointer to a pthread_t, where the thread id will be stored.
• The second is a pointer to attributes that we can use to tweak and tune some of the advanced features of pthreads.
• The third is a pointer to the function, start_routine, that the new thread will run.
• The fourth is the void * argument that will be handed to start_routine.
The argument void *(*start_routine) (void *) is difficult to read! It means a pointer to a function that takes a void * pointer and returns a void * pointer. It looks like a function declaration except that the name of the function is wrapped with (* ... ).
#include <stdio.h>
#include <pthread.h>
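// The body of this example appears to have been lost in extraction; below is
// a minimal reconstruction, consistent with the discussion that follows,
// using an illustrative function named busy.
void *busy(void *ptr) {
    // ptr points to the string "Hi" passed at pthread_create
    puts("Hello World");
    return NULL;
}

int main() {
    pthread_t id;
    pthread_create(&id, NULL, busy, "Hi");
    void *result;
    pthread_join(id, &result); // result is NULL because busy returned NULL
    return 0;
}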
In the above example, the result will be NULL because the busy function returned NULL. We need to pass the
address-of result because pthread_join will be writing into the contents of our pointer.
In the man pages, it warns that programmers should treat pthread_t as an opaque type and not look at the internals. We often ignore that, though.
Pthread Functions
• pthread_create. Creates a new thread. Every thread gets a new stack. If a program calls pthread_create twice, your process will contain three stacks – one for each thread. The first thread is created when the process starts, the other two when pthread_create is called. Actually, there can be more stacks than this, but let’s keep it simple. The important idea is that each thread requires a stack because the stack contains automatic variables and the saved CPU program counter register, so that the thread can go back to executing the calling function after the called function is finished.
• pthread_cancel stops a thread. Note the thread may still continue for a while; for example, it may only be terminated when the thread next makes an operating system call (e.g. write). In practice, pthread_cancel is rarely used because a canceled thread won’t clean up open resources like files. An alternative implementation is to use a boolean (int) variable whose value is used to inform other threads that they should finish and clean up.
• pthread_exit(void *) stops the calling thread meaning the thread never returns after calling pthread_exit.
The pthread library will automatically finish the process if no other threads are running. pthread_exit(...)
is equivalent to returning from the thread’s function; both finish the thread and also set the return value
(void *pointer) for the thread. Calling pthread_exit in the main thread is a common way for simple
programs to ensure that all threads finish. For example, in the following program, the myfunc threads
will probably not have time to get started. On the other hand, exit() exits the entire process and sets the process’ exit value. This is equivalent to returning from the main function: all threads inside the process are stopped. Note the pthread_exit version creates thread zombies; however, this is not a long-running process, so we don’t care.
int main() {
    pthread_t tid1, tid2;
    pthread_create(&tid1, NULL, myfunc, "Jabberwocky");
    pthread_create(&tid2, NULL, myfunc, "Vorpel");
    if (keep_threads_going) { /* illustrative flag */
        pthread_exit(NULL);
    } else {
        exit(42); // or return 42;
    }
}
• pthread_join() waits for a thread to finish and records its return value. Finished threads that have not been joined will continue to consume resources. Eventually, if enough threads are created, pthread_create will fail. In practice, this is only an issue for long-running processes; it is not an issue for simple, short-lived processes, as all thread resources are automatically freed when the process exits. Not joining is equivalent to turning your children into zombies, so keep this in mind for long-running processes. In the exit example, we could instead wait on all the threads:
// ...
void* result;
pthread_join(tid1, &result);
pthread_join(tid2, &result);
return 42;
// ...
Race Conditions
Race conditions are whenever the outcome of a program is determined by its sequence of events determined by
the processor. This means that the execution of the code is non-deterministic. Meaning that the same program
can run multiple times and depending on how the kernel schedules the threads could produce inaccurate results.
The following is the canonical race condition.
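The thread function appears to have been lost in extraction; here is a reconstruction, consistent with the instruction traces below, in which each thread loads the shared integer, doubles it, and stores it back (thread_main is the name the code below expects).

#include <pthread.h>
#include <stdio.h>

void *thread_main(void *p) {
    int x = *(int *) p; // load:  x = data
    x += x;             // add:   x += x
    *(int *) p = x;     // store: *p = x
    return NULL;
}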
int main() {
    int data = 1;
    pthread_t one, two;
    pthread_create(&one, NULL, thread_main, &data);
    pthread_create(&two, NULL, thread_main, &data);
    pthread_join(one, NULL);
    pthread_join(two, NULL);
    printf("%d\n", data);
    return 0;
}
Breaking this down into assembly, there are many different accesses to the shared variable. With no optimization, the increment compiles to three steps: load the value into a register (we will assume eax), add, and store it back. Writing each thread’s register copy as x and the shared int data as *p, one serial ordering is:

Thread 2   x = 1   x += x   *p = x
Thread 1                              x = 2   x += x   *p = x
int data     1       1        2         2       2        4
This access pattern will cause the variable data to be 4. The problem is when the instructions are executed in
parallel.
Thread 2   x = 1   x += x   *p = x
Thread 1   x = 1   x += x   *p = x
int data     1       1        2      2
This access pattern will cause the variable data to be 2. This is a race condition, and the unsynchronized access is undefined behavior. What we want is for only one thread to execute this section of the code at a time.
But when compiled with -O2, the increment becomes a single assembly instruction. Shouldn’t that fix it, since a single instruction can’t be interleaved? It doesn’t, because the hardware itself may still experience a race condition: we as programmers didn’t tell the hardware to perform the operation atomically. The easiest way is to add the lock prefix [1, p. 1120].
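Compilers expose this without hand-written assembly. As a sketch (this illustration is ours, not from the original text), the GCC/Clang builtin below emits a lock-prefixed instruction on x86:

__atomic_fetch_add(&data, 1, __ATOMIC_SEQ_CST); // atomically performs data += 1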
But we don’t want to be coding in assembly! We need to come up with a software solution to this problem.
Here is another small race condition. The following code is supposed to start ten threads with the integers 0 through 9 inclusive. However, when run, it prints out something like 1 7 8 8 8 8 8 8 8 10! Only seldom does it print what we expect. Can you see why?
#include <pthread.h>
#include <stdio.h>

void *myfunc(void *ptr) {
    int i = *((int *) ptr);
    printf("%d ", i);
    return NULL;
}

int main() {
    // Each thread gets a different value of i to process
    int i;
    pthread_t tid;
    for (i = 0; i < 10; i++) {
        pthread_create(&tid, NULL, myfunc, &i); // ERROR
    }
    pthread_exit(NULL);
}
The above code suffers from a race condition – the value of i is changing. The new threads start later; in the example output, the last thread started after the loop had finished. To overcome this race condition, we could give each thread a pointer to its own data area. For example, for each thread we may want to store the id, a starting value, and an output value. Alternatively, we can treat i as a value and cast it to and from a pointer (note that myfunc must then cast the value back instead of dereferencing):
#include <stdint.h> /* for intptr_t */

void *myfunc(void *ptr) {
    int i = (int) (intptr_t) ptr; // cast the value back; do not dereference!
    printf("%d ", i);
    return NULL;
}

int main() {
    // Each thread gets a different value of i to process
    int i;
    pthread_t tid;
    for (i = 0; i < 10; i++) {
        pthread_create(&tid, NULL, myfunc, (void *) (intptr_t) i);
    }
    pthread_exit(NULL);
}
Race conditions aren’t only in our code; they can exist in provided code, too. Some functions, like asctime, getenv, strtok, and strerror, are not thread-safe. Let’s look at a simple function that is also not ‘thread-safe’. The result buffer is stored in global memory. This is fine in a single-threaded program. We wouldn’t want to return a pointer to an invalid address on the stack, but there’s only one result buffer in the entire memory. If two threads were to use it at the same time, one would corrupt the other.
char *to_message(int num) {
    static char result[256];
    if (num < 10) sprintf(result, "%d : blah blah", num);
    else strcpy(result, "Unknown");
    return result;
}
There are ways around this, like using synchronization locks, but first let’s fix it by design. How would you fix the function above? You can change any of the parameters and any return types. One valid solution, sketched below, is that instead of making the function responsible for the memory, we make the caller responsible! A lot of programs, and hopefully your programs, need minimal communication between threads. Often a malloc call is less work than locking a mutex or sending a message to another thread.
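The original solution snippet appears to have been lost in extraction; here is one reconstruction in that spirit (the name to_message_r and the exact signature are our choices, assuming <stdio.h> is included):

// Caller-allocated buffer: each thread can pass its own, so there is no
// shared state to corrupt.
int to_message_r(int num, char *buf, size_t nbytes) {
    if (num < 10) {
        snprintf(buf, nbytes, "%d : blah blah", num);
    } else {
        snprintf(buf, nbytes, "Unknown");
    }
    return 0;
}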
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

/* The thread function was lost in extraction; this reconstruction matches
   the calls below. */
void *sleepnprint(void *ptr) {
    printf("%d:%s starting up...\n", getpid(), (char *) ptr);
    sleep(1);
    printf("%d:%s\n", getpid(), (char *) ptr);
    return NULL;
}

int main() {
    pthread_t tid1, tid2;
    pthread_create(&tid1, NULL, sleepnprint, "New Thread One");
    pthread_create(&tid2, NULL, sleepnprint, "New Thread Two");

    pid_t child = fork();
    printf("%d:%s\n", getpid(), "fork()ing complete");
    sleep(3);

    pthread_exit(NULL);
    return 0; /* Never executes */
}
In practice, creating threads before forking can lead to unexpected errors because (as demonstrated above) the other threads are immediately terminated when forking. Another thread might have locked a mutex (for example, internally inside malloc) and never unlock it again in the child. Advanced users may find pthread_atfork useful; however, we suggest a program avoid creating threads before forking unless you fully understand the limitations and difficulties of this approach.
With your new understanding of threads, all you need to do is create a thread for the left half, and one for the right half. Given that your CPU has multiple real cores, you will see a speedup in accordance with Amdahl’s Law. The time complexity analysis gets interesting here as well: the parallel algorithm runs in O(log^3(n)) time, though the analysis assumes that we have a large number of cores.
In practice though, we typically make two changes. One, once the array gets small enough, we ditch the parallel merge sort algorithm and do a conventional sort that works fast on small arrays, since cache behavior usually dominates at that level. The other thing that we know is that CPUs don’t have infinite cores. To get around that, we typically keep a worker pool. You won’t see the speedup right away because of things like cache coherency and the scheduling of extra threads. Over the bigger pieces of code, though, you will start to see speedups.
Another embarrassingly parallel problem is parallel map. Say we want to apply a function to an entire array,
one element at a time.
Since none of the elements depend on any other element, how would you go about parallelizing this? What do you think would be the best way to split up the work between threads? One static approach is sketched below.
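This is a minimal sketch of a statically scheduled parallel map with pthreads; all names (map_worker, NTHREADS, the doubling function) are illustrative, and N is assumed to divide evenly among the threads.

#include <pthread.h>

#define NTHREADS 4
#define N 1024

static double arr[N];

typedef struct { size_t start, end; } range_t;

// Apply the function (here, doubling) to one contiguous chunk.
void *map_worker(void *p) {
    range_t *r = p;
    for (size_t i = r->start; i < r->end; i++) {
        arr[i] = arr[i] * 2.0;
    }
    return NULL;
}

int main() {
    pthread_t tids[NTHREADS];
    range_t ranges[NTHREADS];
    for (int t = 0; t < NTHREADS; t++) {
        ranges[t] = (range_t) { t * (N / NTHREADS), (t + 1) * (N / NTHREADS) };
        pthread_create(&tids[t], NULL, map_worker, &ranges[t]);
    }
    for (int t = 0; t < NTHREADS; t++) {
        pthread_join(tids[t], NULL);
    }
    return 0;
}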
Extra: Scheduling
There are a few ways to split up the work. These are common to the OpenMP framework [2].
• static scheduling breaks up the problems into fixed-size chunks (predetermined) and have each thread
work on each of the chunks. This works well when each of the subproblems takes roughly the same time
because there is no additional overhead. All you need to do is write a loop and give the map function to
each sub-array.
• dynamic scheduling hands a thread a new chunk of the problem as the thread becomes available. This is useful when you don’t know how long each chunk will take.
• guided scheduling is a mix of the above, with a mix of the benefits and tradeoffs. You start with static scheduling and move slowly toward dynamic scheduling as needed.
• runtime scheduling You have absolutely no idea how long the problems are going to take. Instead of
deciding it yourself, let the program decide what to do!
No need to memorize any of the scheduling routines, though. OpenMP is a standard that is an alternative to pthreads. For example, here is how to parallelize a for loop.
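The original snippet appears to be missing; a minimal OpenMP sketch would look like this (compile with -fopenmp; f, input, result, and n are illustrative names). Changing the schedule clause selects among the strategies described above:

#pragma omp parallel for schedule(static)
for (int i = 0; i < n; i++) {
    result[i] = f(input[i]); // each iteration is independent
}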
Static scheduling divides the problem into fixed-size chunks. Dynamic scheduling gives a thread a new job once its current job is over. Guided scheduling is dynamic scheduling with shrinking chunk sizes. Runtime scheduling is a whole bag of worms.
Other Problems
From Wikipedia
• The Mandelbrot set, Perlin noise, and similar images, where each point is calculated independently.
• Rendering of computer graphics. In computer animation, each frame may be rendered independently (see
parallel rendering).
• Notable real-world examples include distributed.net and proof-of-work systems used in cryptocurrency.
• BLAST searches in bioinformatics for multiple queries (but not for individual large queries)
• Large scale facial recognition systems that compare thousands of arbitrary acquired faces (e.g., a security
or surveillance video via closed-circuit television) with a similarly large number of previously stored faces
(e.g., a rogues gallery or similar watch list).
• Sieving step of the quadratic sieve and the number field sieve.
So why does this course use POSIX threads instead of the C standard’s threads.h API?
1. C standard threads are pretty new. Even though the standard came out in roughly 2011, POSIX threads have been around forever, and a lot of their quirks have been ironed out.
2. You lose expressivity. This is a concept that we’ll talk about in later chapters, but when you make something
portable, you lose some expressivity with the host hardware. That means that the threads.h library is pretty
bare bones. It is hard to set CPU affinities. Schedule threads together. Efficiently look at the internals for
performance reasons.
3. A lot of legacy code is already written with POSIX threads in mind. Other libraries like OpenMP, CUDA, MPI
will either use POSIX processes or POSIX threads with a begrudging port to Windows.
#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <stdlib.h>
// 8 MiB stacks
#define STACK_SIZE (8 * 1024 * 1024)

// Minimal thread body (the original clone call was lost; this is a sketch)
int child_fn(void *arg) { return 0; }

int main() {
    // Allocate stack space for the child
    char *child_stack = malloc(STACK_SIZE);
    // Remember stacks work by growing down, so we need
    // to give the top of the stack
    char *stack_top = child_stack + STACK_SIZE;
    clone(child_fn, stack_top, CLONE_VM | SIGCHLD, NULL); // share the address space
    return 0;
}
It seems pretty simple, right? Why not use this functionality? First, there is a decent bit of boilerplate code. In addition, pthreads are part of the POSIX standard and have well-defined functionality. Pthreads let a program set various attributes – some that resemble the options in clone – to customize your thread. But as we mentioned earlier, with each layer of abstraction added for portability reasons, we lose some functionality. clone can do some neat things like keeping different parts of your heap the same while creating copies of other pages. A program also has finer control of scheduling because the result is a process with the same mappings.
At no time in this course should you be using clone. But in the future, know that it is a perfectly viable
alternative to fork. You have to be careful and research edge cases.
Further Reading
Guiding questions
• What are a few things that threads share in a process? What are a few things that each thread has separately?
• What are some examples of non thread safe library functions? Why might they not be thread safe?
• man page
• Concise third party sample code explaining create, join and exit
Topics
• pthread life-cycle
• Using pthread_join
• Using pthread_create
• Using pthread_exit
Questions
• How does a program get a return value given a pthread_t? What are the ways a thread can set that return
value? What happens if a program discards the return value?
• What does pthread_exit do if it is not the last thread? What other functions are called after calling pthread_exit?
• Give me three conditions under which a multi-threaded process will exit. Are there any more?
Bibliography
[1] Intel Corporation. Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3B: System Programming Guide.
[2] A. Silberschatz, P.B. Galvin, and G. Gagne. Operating System Concepts. Wiley, 2005. ISBN 9780471694663.
URL https://books.google.com/books?id=FH8fAQAAIAAJ.
Synchronization
7
When multithreading gets interesting
Bhuvy
Synchronization coordinates various tasks so that they all finish in the correct state. In C, we have a series of mechanisms to control which operations threads are allowed to perform at a given point. Most of the time, the threads can progress without having to communicate, but every so often two or more threads may want to access a critical section. A critical section is a section of code that can only be executed by one thread at a time if the program is to function correctly. If two threads (or processes) were to execute code inside the critical section at the same time, it is possible that the program may no longer have the correct behavior.
As we said in the previous chapter, race conditions happen when an operation touches a piece of memory
at the same time as another thread. If the memory location is only accessible by one thread, for example the
automatic variable i below, then there is no possibility of a race condition and no Critical Section associated with
i. However, the sum variable is a global variable and accessed by two threads. It is possible that two threads may
attempt to increment the variable at the same time.
#include <stdio.h>
#include <pthread.h>

int sum = 0;

// Reconstruction: the original countgold body was lost in extraction
void *countgold(void *param) {
    int i; // automatic variable, local to each thread
    for (i = 0; i < 10000000; i++) {
        sum += 1; // race: unsynchronized access to the shared sum
    }
    return NULL;
}

int main() {
    pthread_t tid1, tid2;
    pthread_create(&tid1, NULL, countgold, NULL);
    pthread_create(&tid2, NULL, countgold, NULL);
    // Wait for both threads to finish:
    pthread_join(tid1, NULL);
    pthread_join(tid2, NULL);
    printf("ARGGGH sum is %d\n", sum);
    return 0;
}
A typical output of the above code is ARGGGH sum is <some number less than expected> because there is a race condition. The code allows two threads to read and write sum at the same time. For example, both threads copy the current value of sum into the CPU that runs each thread (let’s pick 123). Both threads add one to their own copy. Both threads write back the value (124). If the threads had accessed the sum at different times, then the count would have been 125. A few of the possible orderings are below.
[Figure: three interleavings of the two read-increment-write sequences – a permissible pattern (no overlap), a partial overlap, and a full overlap.]
We would like executions to follow the first pattern, where the accesses are mutually exclusive. This leads us to our first synchronization primitive, a mutex.
Mutex
To ensure that only one thread at a time can access a global variable, use a mutex – short for mutual exclusion. If one thread is currently inside a critical section, we would like another thread to wait until the first thread is complete. A mutex isn’t a primitive in the truest sense, though it is one of the smallest building blocks that has a useful threading API. A mutex also isn’t a specific data structure; it is an abstract data type. There are many ways to implement a mutex, and we’ll give a few in this chapter. For right now, let’s use the black box that the pthread library gives us. Here is how we declare a mutex.
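The declaration snippet itself appears to have been dropped in extraction; declaring and using a pthread mutex looks like this:

pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;

pthread_mutex_lock(&m);   // start of the critical section
// ... critical section ...
pthread_mutex_unlock(&m); // end of the critical section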
Mutex Lifetime
There are a few ways of initializing a mutex. A program can use the macro PTHREAD_MUTEX_INITIALIZER, but only for global (‘static’) variables. m = PTHREAD_MUTEX_INITIALIZER is functionally equivalent to the more general-purpose pthread_mutex_init(&m, NULL). The init version includes options to trade performance for additional error-checking and advanced sharing options. The init version also makes sure that the mutex is correctly initialized as soon as the call returns; global mutexes using the macro are initialized on the first lock. A program can also call the init function for a mutex located on the heap.
Once we are finished with the mutex, we should also call pthread_mutex_destroy(&m). Note, a program can only destroy an unlocked mutex; destroying a locked mutex is undefined behavior. Things to keep in mind about init and destroy:
1. A program doesn’t need to destroy a mutex created with the global initializer.
3. Keep to the pattern of one and only one thread initializing a mutex.
4. Copying the bytes of the mutex to a new memory location and then using the copy is not supported. To reference a mutex, a program must have a pointer to that memory address.
Mutex Usages
How does one use a mutex? Here is a complete example in the spirit of the earlier piece of code.
#include <stdio.h>
#include <pthread.h>

pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
int sum = 0;

void *countgold(void *param) { // reconstruction of the lost body
    int i;
    pthread_mutex_lock(&m);
    // Other threads that call lock will have to wait until we call unlock
    for (i = 0; i < 10000000; i++) {
        sum += 1;
    }
    pthread_mutex_unlock(&m);
    return NULL;
}

int main() {
    pthread_t tid1, tid2;
    pthread_create(&tid1, NULL, countgold, NULL);
    pthread_create(&tid2, NULL, countgold, NULL);
    pthread_join(tid1, NULL);
    pthread_join(tid2, NULL);
    printf("sum is %d\n", sum);
    return 0;
}
In the code above, the thread gets the lock to the counting house before entering. The critical section is only the sum += 1, so the following version is also correct.
void *countgold(void *param) {
    int i;
    for (i = 0; i < 10000000; i++) {
        pthread_mutex_lock(&m);
        sum += 1;
        pthread_mutex_unlock(&m);
    }
    return NULL;
}
This version runs slower because we lock and unlock the mutex ten million times, which is expensive – at least compared with incrementing a variable. In this simple example, we didn’t need threads – we could have just added up twice! A faster multi-threaded example would accumulate into an automatic (local) variable and only then add it to the shared total after the calculation loop has finished:
void *countgold(void *param) {
    int i;
    int local = 0;
    for (i = 0; i < 10000000; i++) {
        local += 1;
    }
    pthread_mutex_lock(&m);
    sum += local;
    pthread_mutex_unlock(&m);
    return NULL;
}
If you know the Gaussian sum, you can avoid race conditions altogether, but this is for illustration.
Starting with the gotchas: first, C mutexes do not lock variables. A mutex is a simple data structure. It works with code, not data. If a mutex is locked, other threads that don’t touch that mutex will continue. It’s only when a thread attempts to lock a mutex that is already locked that the thread has to wait. As soon as the original thread unlocks the mutex, the second (waiting) thread will acquire the lock and be able to continue. The following code creates a mutex that effectively does nothing.
int a;
pthread_mutex_t m1 = PTHREAD_MUTEX_INITIALIZER,
                m2 = PTHREAD_MUTEX_INITIALIZER;
// later
// Thread 1
pthread_mutex_lock(&m1);
a++;
pthread_mutex_unlock(&m1);
// Thread 2
pthread_mutex_lock(&m2);
a++;
pthread_mutex_unlock(&m2);
1. Don’t cross the streams! If using threads, don’t fork in the middle of your program. This means any time
after your mutexes have been initialized.
2. The thread that locks a mutex is the only thread that can unlock it.
3. Each program can have multiple mutex locks. A thread-safe design might include a lock with each data structure, one lock per heap, or one lock per set of data structures. If a program has only one lock, then there may be significant contention for the lock. If two threads are updating two different counters, it isn’t necessary to use the same lock.
5. There will always be a small amount of overhead of calling pthread_mutex_lock and pthread_mutex_unlock.
However, this is the price to pay for correctly functioning programs!
8. Using an uninitialized mutex, or using a mutex that has already been destroyed, is undefined behavior.
10. Deadlock
Mutex Implementation
So we have this cool data structure. How do we implement it? A naive, incorrect implementation is shown below.
The unlock function simply unlocks the mutex and returns. The lock function first checks to see if the lock is
already locked. If it is currently locked, it will keep checking again until another thread has unlocked the mutex.
For the time being, we’ll avoid the condition that other threads are able to unlock a lock they don’t own and focus
on the mutual exclusion aspect.
// Version 1 (Incorrect!); mutex_t is an illustrative type holding a 'locked' flag
void lock(mutex_t *m) {
    while (m->locked) { /* spin until the mutex appears unlocked */ }
    m->locked = 1;
}
#define UNLOCKED 0
#define LOCKED 1
#define UNASSIGNED_OWNER 0
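The structure and init function this paragraph describes appear to have been lost in extraction; here is a reconstruction consistent with the description, using C11 atomics (the names mutex and mutex_init are our choices):

#include <stdatomic.h>
#include <pthread.h>

typedef struct {
    atomic_int lock;
    pthread_t owner;
} mutex;

void mutex_init(mutex *mtx) {
    mtx->lock = UNLOCKED;
    mtx->owner = UNASSIGNED_OWNER;
}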
This is the initialization code; nothing fancy here. We set the state of the mutex to unlocked and set the owner to unassigned.
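The lock implementation the next paragraph walks through is likewise missing; here is a sketch matching the description – a weak compare-and-exchange in a loop, then recording the owner (sched_yield requires <sched.h>):

void mutex_lock(mutex *mtx) {
    int zero = UNLOCKED;
    while (!atomic_compare_exchange_weak_explicit(&mtx->lock, &zero, LOCKED,
            memory_order_seq_cst, memory_order_seq_cst)) {
        zero = UNLOCKED; // the failed exchange wrote the observed value here
        sched_yield();   // 'sleep for a little while' before retrying
    }
    mtx->owner = pthread_self();
}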
What does this code do? It initializes a local variable that we keep set to the unlocked state. Atomic compare and exchange is an instruction supported by most modern architectures (on x86 it’s lock cmpxchg). The pseudocode for this operation looks like this:
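A C-style sketch of the compare-and-exchange semantics:

// Pretend this entire function body happens in one uninterruptible step.
int compare_exchange_pseudo(int *addr, int *expected, int desired) {
    if (*addr == *expected) {
        *addr = desired;
        return 1; // success: the swap happened
    }
    *expected = *addr; // failure: report what the value actually was
    return 0;
}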
Except it is all done atomically, meaning in one uninterruptible operation. What does the weak part mean? These atomic functions come in two versions, a strong one and a weak one. The strong version guarantees the exchange succeeds exactly when the comparison does, while the weak version may fail spuriously, reporting failure even when the values were equal. These are the same kind of spurious failures that you’ll see in condition variables below. We are using weak because weak is faster, and we are in a loop! That means we are okay if it fails a little more often, because we will keep spinning around anyway.
Inside the while loop, we have failed to grab the lock! We reset zero to unlocked and sleep for a little while.
When we wake up we try to grab the lock again. Once we successfully swap, we are in the critical section! We set
the mutex’s owner to the current thread for the unlock method and return successfully.
How does this guarantee mutual exclusion? When working with atomics, we can’t always be sure! But in this simple example we can, because the one thread that successfully finds the lock in the UNLOCKED (0) state and swaps it to the LOCKED (1) state is considered the winner. How do we implement unlock?
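The unlock implementation is also missing from the text; here is a sketch matching the description that follows (owner check first, then a strong compare-and-exchange):

void mutex_unlock(mutex *mtx) {
    if (!pthread_equal(pthread_self(), mtx->owner)) {
        return; // only the owner may unlock
    }
    mtx->owner = UNASSIGNED_OWNER; // the critical section is effectively over
    int one = LOCKED;
    // Strong exchange: we don't want a spurious failure here.
    atomic_compare_exchange_strong_explicit(&mtx->lock, &one, UNLOCKED,
            memory_order_seq_cst, memory_order_seq_cst);
}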
To satisfy the API, a thread can’t unlock the mutex unless the thread is the one who owns it. Then we unassign the mutex owner, because the critical section is effectively over before the atomic operation. We want a strong exchange because we don’t want to spin. We expect the mutex to be LOCKED, and we swap it to UNLOCKED. If the swap was successful, we unlocked the mutex. If the swap wasn’t, that means that the mutex was already UNLOCKED, and we tried to switch it from UNLOCKED to UNLOCKED, preserving the behavior of unlock.
What is this memory order business? We were talking about memory fences earlier; here they are! We won’t go into detail because it is outside the scope of this course. We need sequential consistency to make sure no loads or stores are reordered before or after the exchange. A program can also create dependency chains for more efficient orderings.
Semaphore
A semaphore is another synchronization primitive. It is initialized to some value. Threads can call sem_wait or sem_post, which decrement or increment the value, respectively. If the value reaches zero and a wait is called, the thread will be blocked until a post is called.
Using a semaphore is as easy as using a mutex. First, decide on the initial value, for example the number of remaining spaces in an array. Unlike a pthread mutex, there are no shortcuts to creating a semaphore – use sem_init.
#include <semaphore.h>

sem_t s;
int main() {
    sem_init(&s, 0, 10); // returns -1 (=FAILED) on OS X
    sem_wait(&s);        // Could do this 10 times without blocking
    sem_post(&s);        // Announce that we've finished (and one more resource item is available; increment count)
    sem_destroy(&s);     // release resources of the semaphore
}
When using a semaphore, wait and post can be called from different threads! Unlike a mutex, the increment
and decrement can be from different threads.
This becomes especially useful if you want to use a semaphore to implement a mutex. A mutex is a semaphore that always waits before it posts. Some textbooks will refer to a mutex as a binary semaphore. You do have to be careful never to add more than one to a semaphore, or otherwise your mutex abstraction breaks. That is usually why a mutex is used to implement a semaphore, and not vice versa.
sem_t s;
sem_init(&s, 0, 1);
sem_wait(&s);
// Critical Section
sem_post(&s);
But be warned, it isn’t the same! A mutex handles what we call lock inversion well, meaning the following code fails safely with a traditional mutex but produces a race condition with semaphores.
// Thread 1
sem_wait(&s);
// Critical Section
sem_post(&s);
// Thread 2
// Some threads want to see the world burn
sem_post(&s);
// Thread 3
sem_wait(&s);
// Not thread-safe!
sem_post(&s);
// Thread 1
mutex_lock(&s);
// Critical Section
mutex_unlock(&s);
// Thread 2
// Foiled!
mutex_unlock(&s);
// Thread 3
mutex_lock(&s);
// Now it’s thread-safe
mutex_unlock(&s);
Also, binary semaphores are different from mutexes because a semaphore can be posted from a thread other than the one that waited on it, whereas a mutex must be unlocked by the thread that locked it.
Signal Safety
Also, sem_post is one of a handful of functions that can be correctly used inside a signal handler; pthread_mutex_unlock is not. We can release a waiting thread that can then make all of the calls that we are not allowed to make inside the signal handler itself, e.g. printf. Here is some code that utilizes this:
#include <stdio.h>
#include <pthread.h>
#include <signal.h>
#include <semaphore.h>
#include <unistd.h>

sem_t s;

/* Reconstruction: the handler and thread function were lost in extraction */
void handler(int signal) {
    sem_post(&s); /* Release the waiting thread */
}

void *singsong(void *param) {
    sem_wait(&s);
    printf("Got the signal\n"); /* Would not be safe inside the handler itself */
    return NULL;
}

int main() {
    int ok = sem_init(&s, 0, 0 /* Initial value of zero */);
    if (ok == -1) {
        perror("Could not create unnamed semaphore");
        return 1;
    }
    signal(SIGINT, handler); // Too simple! See Signals chapter

    pthread_t tid;
    pthread_create(&tid, NULL, singsong, NULL);
    pthread_exit(NULL); /* Process will exit when there are no more threads */
}
Other uses for semaphores are keeping track of empty spaces in arrays. We will discuss these in the thread-safe
data structures section.
Condition Variables
Condition variables allow a set of threads to sleep until woken up. The API allows either one or all threads to
be woken up. If a program only wakes one thread, the operating system will decide which thread to wake up.
Threads don’t wake other threads directly (for example, by id). Instead, a thread ‘signals’ the condition variable, which then wakes up one (or all) of the threads that are sleeping inside the condition variable.
Condition variables are also used with a mutex and with a loop, so when woken up they have to check a condition
in a critical section. If a thread needs to be woken up outside of a critical section, there are other ways to do
this in POSIX. Threads sleeping inside a condition variable are woken up by calling pthread_cond_broadcast
(wake up all) or pthread_cond_signal (wake up one). Note despite the function name, this has nothing to do
with POSIX signals!
Occasionally, a waiting thread may appear to wake up for no reason. This is called a spurious wakeup. If you read the mutex implementation section above, this is similar to the atomic failure of the same name.
Why do spurious wakeups happen? For performance. On multi-CPU systems, it is possible that a race condition
could cause a wake-up (signal) request to be unnoticed. The kernel may not detect this lost wake-up call but can
detect when it might occur. To avoid the potentially lost signal, the thread is woken up so that the program code
can test the condition again.
// Thread 1
while (answer < 42) pthread_cond_wait(cv);

// Thread 2
answer = 42;
pthread_cond_signal(cv);
Table 7.4: Signaling without a mutex

Thread 1                    Thread 2
while (answer < 42)
                            answer++
                            pthread_cond_signal(cv)
pthread_cond_wait(cv)
The problem here is that a programmer expects the signal to wake up the waiting thread. Since instructions are
allowed to be interleaved without a mutex, this causes an interleaving that is confusing to application designers.
Note that technically the API of the condition variable is satisfied. The wait call happens-after the call to signal,
and signal is only required to release at most a single thread whose call to wait happened-before.
Another problem is the need to satisfy real-time scheduling concerns which we only outline here. In a time-
critical application, the waiting thread with the highest priority should be allowed to continue first. To satisfy this re-
quirement the mutex must also be locked before calling pthread_cond_signal or pthread_cond_broadcast.
For the curious, here is a longer, historical discussion.
Condition variables are always used with a mutex lock. Before calling wait, the mutex lock must be locked
and wait must be wrapped with a loop.
pthread_cond_t cv;
pthread_mutex_t m;
int count;

// Initialize
pthread_cond_init(&cv, NULL);
pthread_mutex_init(&m, NULL);
count = 0;

// Thread 1
pthread_mutex_lock(&m);
while (count < 10) {
    pthread_cond_wait(&cv, &m);
    /* Remember that cond_wait unlocks the mutex before blocking (waiting)! */
    /* After unlocking, other threads can claim the mutex. */
    /* When this thread is later woken it will */
    /* re-lock the mutex before returning */
}
pthread_mutex_unlock(&m);

// Thread 2:
while (1) {
    pthread_mutex_lock(&m);
    count++;
    pthread_cond_signal(&cv);
    /* Even though the other thread is woken up it cannot return */
    /* from pthread_cond_wait until we have unlocked the mutex. This is */
    /* a good thing! In fact, it is usually the best practice to call */
    /* cond_signal or cond_broadcast before unlocking the mutex */
    pthread_mutex_unlock(&m);
}
This is a pretty naive example, but it shows that we can tell threads to wake up in a standardized manner. In
the next section, we will use these to implement efficient blocking data structures.
Naturally, we want our data structures to be thread-safe as well! We can use mutexes and synchronization primitives to make that happen. First, a few definitions. Atomicity is what makes an operation thread-safe. The hardware provides atomic instructions via the lock prefix,
lock ...
but atomicity also applies to higher-level operations. We say a data structure operation is atomic if it happens all at once and successfully, or not at all.
As such, we can use synchronization primitives to make our data structures thread-safe. For the most part,
we will be using mutexes because they carry more semantic meaning than a binary semaphore. Note, this is an
introduction. Writing high-performance thread-safe data structures requires its own book! Take for example the
following thread-unsafe stack.
// Version 1; values and count are assumed globals
double values[10];
int count = 0;

void push(double v) {
    values[count++] = v;
}

double pop() {
    return values[--count];
}

int is_empty() {
    return count == 0;
}
Version 1 of the stack is thread-unsafe, because if two threads call push or pop at the same time, then the results, or the stack itself, can be inconsistent. For example, if two threads call pop at the same time, then both threads may read the same value, since both may read the original count value.
To turn this into a thread-safe data structure, we need to identify the critical sections of our code, meaning we need to ask which section(s) of the code must only have one thread at a time. In the above example, the push, pop, and is_empty functions access the same memory, and all are critical sections for the stack. While push (or pop) is executing, the data structure is in an inconsistent state; for example, the count may not yet have been written to, so it may still contain the original value. By wrapping these methods with a mutex, we can ensure that only one thread at a time can update (or read) the stack. A candidate ‘solution’ is shown below. Is it correct? If not, how will it fail?
pthread_mutex_t m1 = PTHREAD_MUTEX_INITIALIZER;
pthread_mutex_t m2 = PTHREAD_MUTEX_INITIALIZER;

void push(double v) {
    pthread_mutex_lock(&m1);
    values[count++] = v;
    pthread_mutex_unlock(&m1);
}

double pop() {
    pthread_mutex_lock(&m2);
    double v = values[--count];
    pthread_mutex_unlock(&m2);
    return v;
}

int is_empty() {
    pthread_mutex_lock(&m1);
    return count == 0;
    pthread_mutex_unlock(&m1);
}
Version 2 contains at least one error. Take a moment to see if you can find the error(s) and work out the consequence(s).
If three threads called push() at the same time, the lock m1 ensures that only one thread at a time manipulates the stack on push or is_empty – two threads will need to wait until the first thread completes. A similar argument applies to concurrent calls to pop. However, Version 2 does not prevent push and pop from running at the same time, because push and pop use two different mutex locks. The fix is simple in this case – use the same mutex lock for both the push and pop functions.
The code has a second error: is_empty returns after the comparison and leaves the mutex locked. However, the error would not be spotted immediately. For example, suppose one thread calls is_empty and a second thread later calls push. This thread would mysteriously stop. Using a debugger, you would discover that the thread is stuck at the lock() call inside the push method, because the lock was never released by the earlier is_empty call. Thus an oversight in one thread led to problems much later, in an arbitrary other thread. Let’s try to rectify these problems.
pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;

void push(double v) {
    pthread_mutex_lock(&m);
    values[count++] = v;
    pthread_mutex_unlock(&m);
}

double pop() {
    pthread_mutex_lock(&m);
    double v = values[--count];
    pthread_mutex_unlock(&m);
    return v;
}

int is_empty() {
    pthread_mutex_lock(&m);
    int result = count == 0;
    pthread_mutex_unlock(&m);
    return result;
}
Version 3 is thread-safe. We have ensured mutual exclusion for all of the critical sections. There are a few
things to note.
• is_empty is thread-safe but its result may already be out-of-date. The stack may no longer be empty by
the time the thread gets the result! This is usually why in thread-safe data structures, functions that return
sizes are removed or deprecated.
• There is no protection against underflow (popping on an empty stack) or overflow (pushing onto an
already-full stack)
The last point can be fixed using counting semaphores. The implementation so far assumes a single stack. A more general-purpose version might include the mutex as part of the memory structure and use pthread_mutex_init to initialize the mutex. For example:
int main() {
    stack_t *s1 = stack_create(10 /* Max capacity */);
    stack_t *s2 = stack_create(10);
    push(s1, 3.141);
    push(s2, pop(s1));
    stack_destroy(s2);
    stack_destroy(s1);
}
Before we fix the problems with semaphores, how would we fix the problems with condition variables? Try it out, using the condition variables from the previous section, before you look at the code. We need to wait in push and pop if our stack is full or empty, respectively. Attempted solution:
Does the following solution work? Take a second before looking at the answer to spot the errors.
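The attempted solution itself appears to have been lost in extraction; this reconstruction is consistent with the errors discussed below, reusing the mutex m and condition variable cv from earlier:

double pop() {
    pthread_mutex_lock(&m);
    while (count == 0) {
        pthread_cond_wait(&cv, &m);
    }
    double v = values[--count];
    pthread_mutex_unlock(&m);
    return v;
}

void push(double v) {
    pthread_mutex_lock(&m);
    while (count == 0) { // suspicious: is zero the right condition here?
        pthread_cond_wait(&cv, &m);
    }
    values[count++] = v;
    pthread_mutex_unlock(&m);
}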
So did you catch all of them?
1. The first one is a simple one. In push, our check should be against the total capacity, not zero.
3. We never signal any of the threads! Threads could get stuck waiting indefinitely.
Let’s fix those errors. Does this solution work?
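Again a reconstruction of the fixed-but-still-broken attempt, consistent with the discussion that follows (MAX_CAPACITY is an assumed constant):

double pop() {
    pthread_mutex_lock(&m);
    while (count == 0) {
        pthread_cond_wait(&cv, &m);
    }
    double v = values[--count];
    pthread_cond_signal(&cv); // wakes exactly one thread -- but which one?
    pthread_mutex_unlock(&m);
    return v;
}

void push(double v) {
    pthread_mutex_lock(&m);
    while (count == MAX_CAPACITY) {
        pthread_cond_wait(&cv, &m);
    }
    values[count++] = v;
    pthread_cond_signal(&cv);
    pthread_mutex_unlock(&m);
}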
This solution doesn’t work either! The problem is with the signal. Can you see why? What would you do to fix
it?
Now, how would we use counting semaphores to prevent over and underflow? Let’s discuss it in the next
section.
Using Semaphores
Let’s use a counting semaphore to keep track of how many spaces remain, and another semaphore to track the number of items in the stack. We will call these two semaphores sremain and sitems. Remember, sem_wait will block if the semaphore’s count has been decremented to zero, until another thread calls sem_post.
// Sketch #1
sem_t sitems;
sem_t sremain;

void stack_init() {
    sem_init(&sitems, 0, 0);
    sem_init(&sremain, 0, 10);
}

double pop() {
    // Wait until there's at least one item
    sem_wait(&sitems);
    ...
}

void push(double v) {
    // Wait until there's at least one space
    sem_wait(&sremain);
    ...
}
Sketch #2 has implemented the post too early. Another thread waiting in push can erroneously attempt to
write into a full stack. Similarly, a thread waiting in the pop() is allowed to continue too early.
// Sketch #2 (Error!)
double pop() {
    // Wait until there's at least one item
    sem_wait(&sitems);
    sem_post(&sremain); // error! wakes up a push()ing thread too early
    return values[--count];
}

void push(double v) {
    // Wait until there's at least one space
    sem_wait(&sremain);
    sem_post(&sitems); // error! wakes up a pop()ping thread too early
    values[count++] = v;
}
Sketch 3 implements the correct semaphore logic, but can you spot the error?
// Sketch #3 (Error!)
double pop() {
// Wait until there’s at least one item
sem_wait(&sitems);
double v = values[--count];
sem_post(&sremain);
return v;
}
void push(double v) {
// Wait until there’s at least one space
sem_wait(&sremain);
values[count++] = v;
sem_post(&sitems);
}
Sketch 3 correctly enforces buffer full and buffer empty conditions using semaphores. However, there is no mutual exclusion. Two threads can be in the critical section at the same time, which would corrupt the data structure or at least lead to data loss. The fix is to wrap a mutex around the critical section:
void init() {
sem_init(&sitems, 0, 0);
sem_init(&sremain, 0, SPACES); // 10 spaces
}
double pop() {
pthread_mutex_lock(&m);
// Wait until there's at least one item
sem_wait(&sitems);
double v = values[--count];
pthread_mutex_unlock(&m);
sem_post(&sremain);
return v;
}
void push(double v) {
// Wait until there's at least one space
sem_wait(&sremain);
pthread_mutex_lock(&m);
values[count++] = v;
pthread_mutex_unlock(&m);
sem_post(&sitems);
}
Rather than giving you the answer, we’ll let you think about this. Is this a permissible way to lock and unlock?
Is there a series of operations that could cause a race condition? How about deadlock? If there is, provide it. If there isn't, provide a short justification of why it cannot happen.
As already discussed, there are critical parts of our code that can only be executed by one thread at a time. We
describe this requirement as ‘mutual exclusion’. Only one thread (or process) may have access to the shared
resource. In multi-threaded programs, we wrap a critical section with mutex lock and unlock calls.
How would we implement these lock and unlock calls? Can we create a pure software algorithm that assures
mutual exclusion? Here is our attempt from earlier.
pthread_mutex_lock(p_mutex_t *m) {
while(m->lock) ;
m->lock = 1;
}
pthread_mutex_unlock(p_mutex_t *m) {
m->lock = 0;
}
As we touched on earlier, this implementation does not satisfy mutual exclusion, even granting that threads may unlock other threads' locks. Let's take a close look at this 'implementation' from the perspective of two threads running at around the same time.
To simplify the discussion, we consider only two threads. Note these arguments work for threads and processes
and the classic CS literature discusses these problems in terms of two processes that need exclusive access to
a critical section or shared resource. Raising a flag represents a thread/process’s intention to enter the critical
section.
There are three main properties that we desire in a solution to the critical section problem.
1. Mutual Exclusion. The thread/process gets exclusive access. Others must wait until it exits the critical
section.
2. Bounded Wait. A thread/process cannot be superseded by another thread an infinite number of times.
3. Progress. If no thread/process is inside the critical section, the thread/process should be able to proceed
without having to wait.
With these ideas in mind, let's examine another candidate solution that uses a turn-based flag, consulted only when two threads both require access at the same time.
Naive Solutions
Remember that the pseudo-code outlined below is part of a larger program. The thread or process will typically
need to enter the critical section many times during the lifetime of the process. So, imagine each example as
wrapped inside a loop where for a random amount of time the thread or process is working on something else.
Is there anything wrong with the candidate solution described below?
// Candidate #1
wait until your flag is lowered
raise my flag
// Do Critical Section stuff
lower my flag
Answer: Candidate solution #1 also suffers from a race condition because both threads/processes could read
each other’s flag value as lowered and continue.
This suggests we should raise the flag before checking the other thread’s flag, which is candidate solution #2
below.
// Candidate #2
raise my flag
wait until your flag is lowered
// Do Critical Section stuff
lower my flag
Candidate #2 satisfies mutual exclusion. It is impossible for two threads to be inside the critical section at the
same time. However, this code suffers from deadlock! Suppose two threads wish to enter the critical section at
the same time.
Both processes are now waiting for the other one to lower their flags. Neither one will enter the critical section
as both are now stuck forever! This suggests we should use a turn-based variable to try to resolve who should
proceed.
Turn-based solutions
The following candidate solution #3 uses a turn-based variable to politely allow one thread and then the other to
continue
// Candidate #3
wait until my turn is myid
// Do Critical Section stuff
turn = yourid
Candidate #3 satisfies mutual exclusion. Each thread or process gets exclusive access to the Critical Section.
However, both threads/processes must take a strict turn-based approach to use the critical section. They are forced
into an alternating critical section access pattern. If thread 1 wishes to read a hash table every millisecond, but
another thread writes to a hash table every second, then the reading thread would have to wait another 999ms
before being able to read from the hash table again. This ‘solution’ is ineffective because our threads should be
able to make progress and enter the critical section if no other thread is currently in the critical section.
Analyzing these solutions is tricky. Even peer-reviewed papers on this specific subject have contained incorrect solutions! At first glance, Candidate #4 appears to satisfy Mutual Exclusion, Bounded Wait and Progress: the turn-based flag is only used in the event of a tie, so Progress and Bounded Wait appear to hold, and mutual exclusion appears to be satisfied. Perhaps you can find a counter-example?
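Here is a sketch of such a candidate, in the same pseudo-code style as the earlier candidates:

// Candidate #4 (sketch)
raise my flag
if your flag is raised, wait until my turn
// Do Critical Section stuff
turn = yourid
lower my flag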
Candidate #4 fails because a thread does not wait until the other thread lowers its flag. After some thought or
inspiration, the following scenario can be created to demonstrate how Mutual Exclusion is not satisfied.
Imagine the first thread runs this code twice. The turn flag now points to the second thread. While the first
thread is still inside the Critical Section, the second thread arrives. The second thread can immediately continue
into the Critical Section!
Working Solutions
The first provably correct solution was Dekker's Algorithm (1962). It appeared in an unpublished paper, so it was not widely circulated until later [1] (an English transcription was released in 1965). A version of the algorithm is below.
raise my flag
while (your flag is raised) :
if it is your turn to win :
lower my flag
wait while your turn
raise my flag
// Do Critical Section stuff
set your turn to win
lower my flag
Notice how the process's flag is always raised during the critical section, no matter whether the loop is iterated zero, one or more times. Further, the flag can be interpreted as an immediate intent to enter the critical section. Only if the other process has also raised its flag will a process defer, lower its intent flag, and wait. Let's check the conditions.
1. Mutual Exclusion. Let's sketch a simple proof. The loop invariant is that at each check of the loop condition, the thread's own flag is raised – this holds by exhaustion of the cases. Since the only way a thread can leave the loop is by having the condition be false, its flag must be raised for the entirety of the critical section. Since the loop prevents a thread from proceeding while the other thread's flag is raised, and a thread keeps its flag raised throughout the critical section, the other thread can't enter the critical section at the same time.
2. Bounded Wait. Assuming that the critical section ends in finite time, a thread cannot immediately re-acquire the critical section after leaving it. The reason is that the turn variable is set to the other thread, meaning that thread now has priority. Thus a thread cannot be superseded infinitely by another thread.
3. Progress. If the other thread isn't in the critical section, this thread will continue after a simple check. We make no claims about threads being arbitrarily paused by the system scheduler; this is an idealized scenario where threads keep executing instructions.
Peterson’s Solution
Peterson published his novel and surprisingly simple solution in 1981 [2]. A version of his algorithm is shown
below that uses a shared variable turn.
// Candidate #5
raise my flag
turn = other_thread_id
while (your flag is up and turn is other_thread_id)
loop
// Do Critical Section stuff
lower my flag
This solution satisfies Mutual Exclusion, Bounded Wait and Progress. Suppose thread #2 is currently inside the critical section. Thread #1 arrives, raises its flag and sets turn to 2, and now waits until thread #2 lowers its flag.
1. Mutual Exclusion. Let's sketch a simple proof again. A thread doesn't get past the loop until either the other thread's flag is down or the turn variable is set to this thread. If the other thread's flag is down, it isn't trying to enter the critical section: raising the flag is the first action a thread takes and lowering it is the last action it undoes. If instead the turn variable is set to this thread, the other thread has yielded control. Since this thread's flag is raised and the turn variable is set, the other thread has to wait in the loop until the current thread is done.
2. Bounded Wait. After one thread lowers its flag, a thread waiting in the while loop will leave, because the first condition is broken. This means that one thread cannot win all the time.
3. Progress. If no other thread is contending, the other thread's flag is down. That means a thread can move past the while loop and do its critical section work.
An efficient compiler would infer that the flag2 variable is never changed inside the loop, so the test can be optimized to while(true). Using volatile goes some way toward preventing compiler optimizations of this kind.
Let's say that we solved this by telling the compiler not to optimize. Independent instructions can still be re-ordered by an optimizing compiler, or at runtime by the CPU's out-of-order execution. A related challenge is that CPU cores include a data cache to store recently read or modified main-memory values. Modified values may not be written back to main memory or re-read from memory immediately. Thus data changes, such as the state of the flag and turn variables in the above example, may not be shared between two CPU cores.
But there is a happy ending. Modern hardware addresses these issues using 'memory fences', also known as memory barriers. A fence prevents instructions from being reordered across the barrier. There is a performance cost, but it is needed for correct programs!
Also, there are CPU instructions to ensure that main memory and the CPU's cache are in a reasonable and coherent state. Higher-level synchronization primitives, such as pthread_mutex_lock, call these CPU instructions as part of their implementation. Thus, in practice, surrounding critical sections with mutex lock and unlock calls is sufficient to ignore these lower-level problems.
For further reading, we suggest the following web post, which discusses implementing Peterson's algorithm on an x86 processor, and the Linux documentation on memory barriers.
1. Memory Fences
2. Memory Barriers
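To make the connection concrete, here is a hedged sketch of Peterson's algorithm in C11. The sequentially consistent ordering of the default atomic operations supplies the fences discussed above; without them, compiler and CPU reordering would break the algorithm. The function names lock and unlock are our own.

#include <stdatomic.h>
#include <stdbool.h>

atomic_bool flag[2];
atomic_int turn;

void lock(int me) {
    int other = 1 - me;
    atomic_store(&flag[me], true); // raise my flag (seq_cst by default)
    atomic_store(&turn, other);    // let the other thread win a tie
    while (atomic_load(&flag[other]) && atomic_load(&turn) == other)
        ; // spin until it is safe to enter
}

void unlock(int me) {
    atomic_store(&flag[me], false); // lower my flag
}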
Now that we have a solution to the critical section problem, we can reasonably implement a mutex. How would we implement other synchronization primitives? Let's start with a semaphore. To implement a semaphore with efficient CPU usage, we will assume that we have already implemented a condition variable. Implementing an O(1) space condition variable using only a mutex is not trivial, or at least an O(1) heap condition variable is not trivial. We don't want to call malloc while implementing a primitive, or we may deadlock!
• We can implement a counting semaphore using condition variables. The structure holds a count, a mutex m and a condition variable cv. Here is sem_init:

int sem_init(sem_t *s, int pshared, int value) {
    s->count = value;
    pthread_mutex_init(&s->m, NULL);
    pthread_cond_init(&s->cv, NULL);
    return 0;
}
Our implementation of sem_post needs to increment the count. We will also wake up any threads sleeping
inside the condition variable. Notice we lock and unlock the mutex so only one thread can be inside the critical
section at a time.
void sem_post(sem_t *s) {
    pthread_mutex_lock(&s->m);
    s->count++;
    pthread_cond_signal(&s->cv); // wake up a sleeping thread, if any
    pthread_mutex_unlock(&s->m);
}
Our implementation of sem_wait may need to sleep if the semaphore’s count is zero. Just like sem_post,
we wrap the critical section using the lock, so only one thread can be executing our code at a time. Notice if
the thread does need to wait then the mutex will be unlocked, allowing another thread to enter sem_post and
awaken us from our sleep!
Also notice that even if a thread is woken up before it returns from pthread_cond_wait, it must re-acquire
the lock, so it will have to wait until sem_post finishes.
That is a complete implementation of a counting semaphore. Notice that we call pthread_cond_signal every single time. In practice, this means sem_post would unnecessarily call pthread_cond_signal even if there are no waiting threads. A more efficient implementation would only call pthread_cond_signal when necessary, i.e. only when there are waiting threads.
• An advanced use of sem_init allows semaphores to be shared across processes. Our implementation only works for threads inside the same process. We could fix this by setting the condition variable and mutex attributes.
This is all the boring definitional stuff. The interesting stuff is below.
pthread_mutex_lock(m);
remove_from_list(cv, &my_node);
So how does this work? Instead of allocating space, which could lead to deadlock, we keep the linked list nodes on each thread's stack. The linked list in the wait function is created while the thread holds the mutex lock. This is important because we may have a race condition on insertion and removal. A more robust implementation would have a mutex per condition variable.
What is the note about (dynamic)? In the pthread man pages, wait creates a runtime binding to a mutex. This means that after the first call to wait, a mutex is associated with the condition variable for as long as there is still a thread waiting on that condition variable. Each new thread coming in must pass the same mutex, and it must be locked. Hence, the beginning and end of wait (everything besides the while loop) are mutually exclusive. After the last thread leaves, meaning when head is NULL, the binding is lost.
The signal and broadcast functions merely tell either one thread or all threads, respectively, that they should be woken up. They don't modify the linked list, because there is no mutex to prevent corruption if two threads call signal or broadcast at the same time.
Now an advanced point. Do you see how a broadcast could cause a spurious wakeup in this case? Consider
this series of events.
4. Another thread calls wait on the condition variable and adds itself to the queue.
With a high-performance mutex, there is no assurance about the ordering between when broadcast was called and when threads were added. The ways to prevent this behavior are to include Lamport timestamps or to require that broadcast be called with the mutex in question held. That way, something that happens-before the broadcast call doesn't get signaled after it. The same argument applies to signal too.
Did you also notice something else? This is why we ask you to signal or broadcast before you unlock. If
you broadcast after you unlock, the time that broadcast takes could be infinite!
2. First thread is freed, broadcast thread is frozen. Since the mutex is unlocked, it locks and continues.
4. With our implementation of a condition variable, this would be terminated. If you had an implementation that appended to the tail of the list and iterated from the head to the tail, this could go on infinitely many times.
In high-performance systems, we want to make sure that each thread that calls wait isn’t passed by another
thread that calls wait. With the current API that we have, we can’t assure that. We’d have to ask users to pass in a
mutex or use a global mutex. Instead, we tell programmers to always signal or broadcast before unlocking.
Barriers
Suppose we wanted to perform a multi-threaded calculation that has two stages, but we don’t want to advance
to the second stage until the first stage is completed. We could use a synchronization method called a barrier.
When a thread reaches a barrier, it will wait at the barrier until all the threads reach the barrier, and then they’ll
all proceed together.
Think of it like being out for a hike with some friends. You make a mental note of how many friends you have
and agree to wait for each other at the top of each hill. Say you’re the first one to reach the top of the first hill.
You’ll wait there at the top for your friends. One by one, they’ll arrive at the top, but nobody will continue until
the last person in your group arrives. Once they do, you’ll all proceed.
Pthreads has a function pthread_barrier_wait() that implements this. You’ll need to declare a pthread_barrier_t
variable and initialize it with pthread_barrier_init(). pthread_barrier_init() takes the number of
threads that will be participating in the barrier as an argument. Here is a sample program using barriers.
#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <pthread.h>
#include <time.h>
#define THREAD_COUNT 4
pthread_barrier_t mybarrier;
void *threadFn(void *id_ptr) {
// Each worker waits a moment, then synchronizes at the barrier
int thread_id = *(int *) id_ptr;
int wait_sec = 1 + rand() % 5;
printf("thread %d: Wait for %d seconds.\n", thread_id, wait_sec);
sleep(wait_sec);
printf("thread %d: I'm ready...\n", thread_id);
pthread_barrier_wait(&mybarrier);
printf("thread %d: going!\n", thread_id);
return NULL;
}
int main() {
int i;
pthread_t ids[THREAD_COUNT];
int short_ids[THREAD_COUNT];
srand(time(NULL));
pthread_barrier_init(&mybarrier, NULL, THREAD_COUNT + 1);
for (i=0; i < THREAD_COUNT; i++) {
short_ids[i] = i;
pthread_create(&ids[i], NULL, threadFn, &short_ids[i]);
}
printf("main() is ready.\n");
pthread_barrier_wait(&mybarrier);
printf("main() is going!\n");
pthread_barrier_destroy(&mybarrier);
return 0;
}
Now let’s implement our own barrier and use it to keep all the threads in sync in a large calculation. Here is
our thought process,
2. Barrier! Wait for all threads to finish first calculation before continuing
// double data[256][8192]
Our main thread will create the 16 threads, and we will divide each calculation into 16 separate pieces. Each
thread will be given a unique value (0,1,2,..15), so it can work on its own block. Since a (void*) type can hold
small integers, we will pass the value of i by casting it to a void pointer.
#define N (16)
double data[256][8192] ;
int main() {
pthread_t ids[N];
for(int i = 0; i < N; i++) {
pthread_create(&ids[i], NULL, calc, (void *) i);
}
//...
}
Note, we will never dereference this pointer value as an actual memory location.
We will cast it straight back to an integer.
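A hedged sketch of the receiving side; the body of calc is elided, since only the cast matters here:

void *calc(void *ptr) {
    int id = (int)(long) ptr; // cast straight back; never dereferenced
    // ... perform this thread's share (block 'id') of calculation 1 ...
    return NULL;
}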
After calculation 1 completes, we need to wait for the slower threads, unless we are the last thread! So, we keep track of the number of threads that have arrived at our barrier 'checkpoint'.
// Global:
int remain = N;
However, the code has a few flaws. One is that two threads might try to decrement remain at the same time. The other is that the loop is a busy loop. We can do better! Let's use a condition variable, and then we will use the broadcast/signal functions to wake up the sleeping threads.
A reminder: a condition variable is similar to a house! Threads go there to sleep (pthread_cond_wait). A thread can choose to wake up one thread (pthread_cond_signal) or all of them (pthread_cond_broadcast). If there are no threads currently waiting, then these two calls have no effect.
A condition variable version is usually similar to the incorrect busy-loop solution – as we will show next. First, let's add mutex and condition global variables, and don't forget to initialize them in main.
//global variables
pthread_mutex_t m;
pthread_cond_t cv;
int main() {
pthread_mutex_init(&m, NULL);
pthread_cond_init(&cv, NULL);
We will use the mutex to ensure that only one thread modifies remain at a time. The last arriving thread needs to wake up all sleeping threads, so we will use pthread_cond_broadcast(&cv), not pthread_cond_signal.
pthread_mutex_lock(&m);
remain--;
if (remain == 0) {
pthread_cond_broadcast(&cv);
}
else {
while(remain != 0) {
pthread_cond_wait(&cv, &m);
}
}
pthread_mutex_unlock(&m);
When a thread enters pthread_cond_wait, it releases the mutex and sleeps. Later, the thread will be woken up. Once a thread is brought back from its sleep, it must re-acquire the mutex before returning. Notice that even if a sleeping thread wakes up early, it will check the while loop condition and re-enter wait if necessary.
The above barrier is not reusable. Meaning, if we stick it into any old calculation loop, there is a good chance that the code will either deadlock or let a thread race ahead one iteration. Why is that? Because of an ambitious thread.
We will assume that one thread is much faster than all the other threads. With the barrier API, this thread
should be waiting, but it may not be. To make it concrete, let’s look at this code
4. This single thread performs its calculation before any other threads even wake up
All the other threads that should have woken up never do, and our implementation deadlocks. How would you go about solving this? Hint: if multiple threads call barrier_wait in a loop, how can one guarantee that they are all on the same iteration? One common answer is sketched below.
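A hedged sketch of that answer: add a generation counter, so that a fast thread re-entering the barrier waits on the new generation instead of consuming wakeups from the old one. The names m, cv, remain and generation are assumed globals, and N is the number of participating threads as above.

pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
pthread_cond_t cv = PTHREAD_COND_INITIALIZER;
int remain = N;          // threads still to arrive this iteration
unsigned generation = 0; // how many times the barrier has been used

void barrier_wait() {
    pthread_mutex_lock(&m);
    unsigned my_gen = generation;
    if (--remain == 0) {
        remain = N;       // reset for the next iteration
        generation++;     // release threads waiting on this generation
        pthread_cond_broadcast(&cv);
    } else {
        while (my_gen == generation)
            pthread_cond_wait(&cv, &m);
    }
    pthread_mutex_unlock(&m);
}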
Let's turn to another classic synchronization problem: the Reader Writer problem. Many readers should be able to read a shared resource at the same time, but a writer needs exclusive access.
Attempt #1
void read() {
lock(&m)
// do read stuff
unlock(&m)
}
void write() {
lock(&m)
// do write stuff
unlock(&m)
}
At least our first attempt does not suffer from data corruption. Readers must wait while a writer is writing and
vice versa! However, readers must also wait for other readers. Let’s try another implementation.
Attempt #2:
void read() {
while(writing) {/*spin*/}
reading = 1
// do read stuff
reading = 0
}
void write() {
while(reading || writing) {/*spin*/}
writing = 1
// do write stuff
writing = 0
}
Our second attempt suffers from a race condition. Imagine if two threads both called read and write, or both called write, at the same time. Both threads would be able to proceed! Secondly, we can have multiple readers and multiple writers, so let's keep track of the total number of readers or writers. Which brings us to Attempt #3.
Attempt #3
Remember that pthread_cond_wait performs three actions. First, it atomically unlocks the mutex. Second, it sleeps until it is woken by pthread_cond_signal or pthread_cond_broadcast. Third, the awoken thread must re-acquire the mutex lock before returning. Thus only one thread can actually be running inside the critical section defined by the lock and unlock calls.
Implementation #3 below ensures that a reader will enter the cond_wait if any writers are writing.
read() {
lock(&m)
while (writing)
cond_wait(&cv, &m)
reading++;
/* Read here! */
reading--
cond_signal(&cv)
unlock(&m)
}
However, only one reader at a time can read, because Attempt #3 did not unlock the mutex. A better version unlocks before reading.
read() {
lock(&m);
while (writing)
cond_wait(&cv, &m)
reading++;
unlock(&m)
/* Read here! */
lock(&m)
reading--
cond_signal(&cv)
unlock(&m)
}
Does this mean that a writer and a reader could read and write at the same time? No! First of all, remember that cond_wait requires the thread to re-acquire the mutex lock before returning. Thus only one thread can be executing code inside the critical section (marked with **) at a time!
read() {
lock(&m);
** while (writing)
** cond_wait(&cv, &m)
** reading++;
unlock(&m)
/* Read here! */
lock(&m)
** reading--
** cond_signal(&cv)
unlock(&m)
}
Writers must wait for everyone. Mutual exclusion is assured by the lock.
write() {
lock(&m);
** while (reading || writing)
** cond_wait(&cv, &m);
** writing++;
**
** /* Write here! */
** writing--;
** cond_signal(&cv);
unlock(&m);
}
Attempt #3 above also uses pthread_cond_signal. This will only wake up one thread. If many readers are waiting for the writer to complete, only one sleeping reader will be awoken from their slumber. The reader and writer should use cond_broadcast, so that all threads wake up and check their while-loop condition.
Starving writers
Attempt #3 above suffers from starvation. If readers are constantly arriving, then a writer will never be able to proceed (the 'reading' count never reduces to zero). This is known as starvation and would be discovered under heavy loads. Our fix is to implement a bounded wait for the writer: if a writer arrives, they will still need to wait for existing readers, but future readers must be placed in a "holding pen" and wait for the writer to finish.
The “holding pen” can be implemented using a variable and a condition variable so that we can wake up the
threads once the writer has finished.
The plan is that when a writer arrives, and before waiting for current readers to finish, it registers its intent to write by incrementing a counter 'writer':
write() {
lock()
writer++
Incoming readers will not be allowed to continue while writer is nonzero. Notice 'writer' indicates a writer has arrived, while the 'reading' and 'writing' counters indicate there is an active reader or writer.
read() {
lock()
// readers that arrive *after* the writer arrived will have to wait here!
while(writer)
cond_wait(&cv,&m)
Attempt #4
Below is our first working solution to the Reader-Writer problem. Note if you continue to read about the “Reader
Writer problem” then you will discover that we solved the “Second Reader Writer problem” by giving writers
preferential access to the lock. This solution is not optimal. However, it satisfies our original problem of N active
readers, single active writer, and avoiding starvation of the writer if there is a constant stream of readers.
Can you identify any improvements? For example, how would you improve the code so that we only wake up readers, or only one writer?
reader() {
lock(&m)
while (writers)
cond_wait(&turn, &m)
// No need to wait while(writing) here, because we can only exit
// the above loop when writing is zero
reading++
unlock(&m)
lock(&m)
reading--
cond_broadcast(&turn)
unlock(&m)
}
writer() {
lock(&m)
writers++
while (reading || writing)
cond_wait(&turn, &m)
writing++
unlock(&m)
// perform writing here
lock(&m)
writing--
writers--
cond_broadcast(&turn)
unlock(&m)
}
Ring Buffer
A ring buffer is a simple, usually fixed-sized, storage mechanism where contiguous memory is treated as if it is
circular, and two index counters keep track of the current beginning and end of the queue. As array indexing is
not circular, the index counters must wrap around to zero when moved past the end of the array. As data is added
(enqueued) to the front of the queue or removed (dequeued) from the tail of the queue, the current items in the
buffer form a train that appears to circle the track
A simple (single-threaded) implementation is shown below. Note, enqueue and dequeue do not guard against underflow or overflow: it's possible to add an item when the queue is full and to remove an item when the queue is empty. If we added 20 integers (1, 2, 3, ..., 20) to the queue and did not dequeue any items, then values 17, 18, 19, 20 would overwrite 1, 2, 3, 4. We won't fix this problem right now; instead, when we create the multi-threaded version, we will ensure that enqueue-ing and dequeue-ing threads are blocked while the ring buffer is full or empty, respectively.
(Figure: a 16-entry ring buffer. As items are enqueued at buffer[in] and dequeued at buffer[out], the items currently in the buffer form a train that appears to circle the track.)
void *buffer[16];
unsigned int in = 0, out = 0;
void enqueue(void *value) { /* Add one item to the front of the queue */
buffer[in] = value;
in++; /* Advance the index for next time */
if (in == 16) in = 0; /* Wrap around! */
}
void *dequeue() { /* Remove one item from the end of the queue. */
void *result = buffer[out];
out++;
if (out == 16) out = 0;
return result;
}
This buffer does not yet prevent overwrites. For that, we’ll turn to our multi-threaded attempt that will block a
thread until there is space or there is at least one item to remove.
Multithreaded Correctness
The following code is an incorrect implementation. What will happen? Will enqueue and/or dequeue block? Is
mutual exclusion satisfied? Can the buffer underflow? Can the buffer overflow? For clarity, pthread_mutex is
shortened to p_m and we assume sem_wait cannot be interrupted.
#define N 16
void *b[N]
int in = 0, out = 0
p_m_t lock
sem_t s1,s2
void init() {
p_m_init(&lock, NULL)
sem_init(&s1, 0, 16)
sem_init(&s2, 0, 0)
}
enqueue(void *value) {
p_m_lock(&lock)
Analysis
Before reading on, see how many mistakes you can find. Then determine what would happen if threads called the
enqueue and dequeue methods.
• The enqueue method waits and posts on the same semaphore (s1), and similarly dequeue waits and posts on (s2). In other words, we decrement the value and then immediately increment the value, so by the end of the function the semaphore value is unchanged!
• The initial value of s1 is 16, so the semaphore will never be reduced to zero – enqueue will not block even if the ring buffer is full – so overflow is possible.
• The initial value of s2 is zero, so calls to dequeue will always block and never return!
• The order of mutex lock and sem_wait will need to be swapped; however, this example is so broken that
this bug has no effect!
Another Analysis
The following code is an incorrect implementation. What will happen? Will enqueue and/or dequeue block? Is
mutual exclusion satisfied? Can the buffer underflow? Can the buffer overflow? For clarity pthread_mutex is
shortened to p_m and we assume sem_wait cannot be interrupted.
#define N 16
void *b[N]
int in = 0, out = 0
p_m_t lock
sem_t s1, s2
void init() {
sem_init(&s1,0,16)
sem_init(&s2,0,0)
}
enqueue(void *value){
sem_wait(&s2)
p_m_lock(&lock)
b[ (in++) & (N-1) ] = value
p_m_unlock(&lock)
sem_post(&s1)
}
void *dequeue(){
sem_wait(&s1)
p_m_lock(&lock)
void *result = b[(out++) & (N-1)]
p_m_unlock(&lock)
sem_post(&s2)
return result;
}
• The initial value of s2 is 0. Thus enqueue will block on the first call to sem_wait even though the buffer is
empty!
• The initial value of s1 is 16. Thus dequeue will not block on the first call to sem_wait even though the buffer
is empty - Underflow! The dequeue method will return invalid data.
• The code does not satisfy mutual exclusion. Two threads can modify in or out at the same time! The code appears to use a mutex lock; unfortunately, the lock was never initialized with pthread_mutex_init() or PTHREAD_MUTEX_INITIALIZER, so the lock may not work (pthread_mutex_lock may simply do nothing).
#include <pthread.h>
#include <semaphore.h>
// N must be 2^i
#define N (16)
void *b[N]
int in = 0, out = 0
p_m_t lock = PTHREAD_MUTEX_INITIALIZER
sem_t countsem, spacesem
void init() {
sem_init(&countsem, 0, 0)
sem_init(&spacesem, 0, 16)
}
1. The lock is only held during the critical section (access to the data structure).
2. A complete implementation would need to guard against early returns from sem_wait due to POSIX signals.
enqueue(void *value){
// wait if there is no space left:
sem_wait( &spacesem )
p_m_lock(&lock)
b[ (in++) & (N-1) ] = value
p_m_unlock(&lock)
// signal that there is one more item:
sem_post( &countsem )
}
The dequeue implementation is shown below. Notice the symmetry of the synchronization calls to enqueue.
In both cases, the functions first wait if the count of spaces or count of items is zero.
void *dequeue(){
// Wait if there are no items in the buffer
sem_wait(&countsem)
p_m_lock(&lock)
void *result = b[(out++) & (N-1)]
p_m_unlock(&lock)
// signal that there is one more space:
sem_post(&spacesem)
return result
}
• What would happen if the order of pthread_mutex_unlock and sem_post calls were swapped?
• What would happen if the order of sem_wait and pthread_mutex_lock calls were swapped?
You thought that since you were using different processes, you don't have to synchronize? Think again! You may not have race conditions within a process, but what if your process needs to interact with the system around it? Let's consider a motivating example:
#include <fcntl.h>
#include <string.h>
#include <sys/wait.h>
#include <unistd.h>

void write_string(const char *data) {
    int fd = open("my_file.txt", O_WRONLY | O_CREAT | O_APPEND, 0644);
    write(fd, data, strlen(data));
    close(fd);
}

int main() {
    if (!fork()) {
        write_string("key1: value1\n");
    } else {
        write_string("key2: value2\n");
        wait(NULL); // parent waits for the child to finish
    }
    return 0;
}
If none of the system calls fail, then given the file was empty to begin with, we should get one of the following two interleavings.
key1: value1
key2: value2
or
key2: value2
key1: value1
Interruption
But there is a hidden nuance. Most system calls can be interrupted, meaning that the operating system can stop an ongoing system call because it needs the process to do something else. So, barring fork, wait, open, and close failing – they typically run to completion – what happens if write fails? If write fails and no bytes are written, we can get something like key1: value1 alone or key2: value2 alone. This is data loss, which is incorrect, but it won't corrupt the file. What happens if write gets interrupted after a partial write? We get all sorts of madness – for example, fragments of one record spliced into the middle of another. To guard against this, we can serialize the writes with a mutex shared between the processes:
// (also #include <pthread.h> and <sys/mman.h>)
pthread_mutex_t *pmutex;
pthread_mutexattr_t attrmutex;

int main() {
    pthread_mutexattr_init(&attrmutex);
    pthread_mutexattr_setpshared(&attrmutex, PTHREAD_PROCESS_SHARED);
    pmutex = mmap(NULL, sizeof(pthread_mutex_t),
                  PROT_READ | PROT_WRITE, MAP_SHARED | MAP_ANON, -1, 0);
    pthread_mutex_init(pmutex, &attrmutex);
    if (!fork()) {
        write_string("key1: value1\n");
    } else {
        write_string("key2: value2\n");
        wait(NULL);
        pthread_mutex_destroy(pmutex);
        pthread_mutexattr_destroy(&attrmutex);
        munmap((void *)pmutex, sizeof(*pmutex));
    }
    return 0;
}
What the code does in main is initialize a process-shared mutex in a piece of shared memory. You will find out what this call to mmap does later – for the time being, assume that it creates memory that is shared between processes. We can initialize a pthread_mutex_t in that special piece of memory and use it as normal. To counter write failing, we put the write call inside a while loop that keeps writing so long as there are bytes left to write. Now, if all the other system calls function, there should be no more race conditions.
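A hedged sketch of that retry loop; it reuses pmutex from above, reuses the O_CREAT | O_APPEND flags, and needs <errno.h> for EINTR:

void write_string(const char *data) {
    int fd = open("my_file.txt", O_WRONLY | O_CREAT | O_APPEND, 0644);
    size_t len = strlen(data);
    size_t written = 0;
    pthread_mutex_lock(pmutex); // serialize whole records across processes
    while (written < len) {
        ssize_t ret = write(fd, data + written, len - written);
        if (ret >= 0) written += ret;     // partial write: keep going
        else if (errno != EINTR) break;   // a real error: give up
    }
    pthread_mutex_unlock(pmutex);
    close(fd);
}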
Most programs try to avoid this problem entirely by writing to separate files, but it is good to know that there are mutexes that work across processes, and they are useful. A program can use all of the primitives that were mentioned previously! Barriers, semaphores, and condition variables can all be initialized in a shared piece of memory and used in similar ways to their multithreading counterparts.
• You don't have to worry about arbitrary memory addresses becoming race condition candidates. Only areas that are specifically mapped are in danger.
• You get the nice isolation of processes, so if one process fails, the rest of the system can remain intact.
• When you have a lot of threads, creating a separate process might ease the system load.
When using atomics, you need to specify the right memory-ordering model to ensure a program behaves correctly. You can read more about them on the GCC wiki; these examples are adapted from there.
Sequentially Consistent
Sequential consistency is the simplest, least error-prone and most expensive model. This model says that whenever a change becomes visible, all changes before it are also visible to all threads.
-Thread 1-                  -Thread 2-
1.0 atomic_store(x, 1)
1.1 y = 10                  2.1 if (atomic_load(x) == 0)
1.2 atomic_store(x, 0)      2.2     y != 10 && abort();
This program will never abort (assuming line 1.0 sets x to 1 before thread 2 starts). Either the store on line 1.2 happens before the load in thread 2, in which case y is already 10, or the load happens first, in which case it sees x equal to 1 and the body of the if is skipped.
Relaxed
Relaxed is a simple memory order providing for more optimizations. This means that only a particular operation
needs to be atomic. One can have stale reads and writes, but after reading the new value, it won’t become old.
-Thread 1-                -Thread 2-
atomic_store(x, 1);       printf("%d\n", x) // 1
atomic_store(x, 0);       printf("%d\n", x) // could be 1 or 0
                          printf("%d\n", x) // could be 1 or 0
But that means that previous loads and stores don’t need to affect other threads. In the previous example, the
code can now fail.
Acquire/Release
The order of writes to different atomic variables doesn't need to be consistent across threads – meaning if atomic variable y is assigned 10 and then atomic variable x is assigned 0, those writes don't need to propagate in order, and a thread could get stale reads. Non-atomic variables have to be updated in all threads, though.
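A minimal C11 sketch of acquire/release in action; the names ready and payload are our own. The release store publishes the earlier non-atomic write, and any thread whose acquire load observes it is guaranteed to see the payload:

#include <stdatomic.h>
#include <stdio.h>

atomic_int ready = 0; // synchronizes the two threads
int payload = 0;      // plain, non-atomic data

void *producer(void *arg) {
    payload = 42; // happens-before the release store below
    atomic_store_explicit(&ready, 1, memory_order_release);
    return NULL;
}

void *consumer(void *arg) {
    while (!atomic_load_explicit(&ready, memory_order_acquire))
        ; // spin until the release store is visible
    printf("%d\n", payload); // guaranteed to print 42
    return NULL;
}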
Consume
Imagine the same as above, except non-atomic variables don't need to be updated in all threads. This model was introduced so that there can be an Acquire/Release/Consume model without mixing in Relaxed, because Consume is similar to Relaxed.
There are a lot of other methods of concurrency besides the ones described in this book. POSIX threads are the finest-grained thread construct, allowing for tight control of the threads and the CPU. Other languages have their own abstractions. We'll talk about Go (golang), a language similar to C in simplicity and design. For a five-minute introduction, feel free to read the "Learn X in Y minutes" guide for Go. Here is how we create a "thread" in Go.
package main

import "fmt"

func hello(out string) {
	fmt.Println(out)
}

func main() {
	to_print := "Hello World!"
	go hello(to_print)
	// Note: main may exit before the goroutine gets to run
}
This actually creates what is known as a goroutine. A goroutine can be thought of as a lightweight thread.
Internally, it is a worker pool of threads that executes instructions of all the running goroutines. When a goroutine
needs to be stopped, it is frozen and "context switched" to another thread. Context switch is in quotes because
this is done at the run time level versus real context switching which is done at the operating system level.
The advantage of gofuncs is pretty self-explanatory: there is no boilerplate code, no joining, and no odd casting of void *.
We can still use mutexes in go to perform our end result. Consider the counting example as before.
var counter = 0
var mut sync.Mutex
var wg sync.WaitGroup

func plus() {
	mut.Lock()
	counter += 1
	mut.Unlock()
	wg.Done()
}

func main() {
	num := 10
	wg.Add(num)
	for i := 0; i < num; i++ {
		go plus()
	}
	wg.Wait()
	fmt.Printf("%d\n", counter)
}
But that’s boring and error prone. Instead, let’s use the actor model. Let’s designate two actors. One is the
main actor that will be performing the main instruction set. The other actor will be the counter. The counter is
responsible for adding numbers to an internal variable. We’ll send messages between the threads when we want
to add and see the value.
const (
	addRequest = iota
	outputRequest
)
func counterActor(requestChannel chan int, outputChannel chan int) {
	counter := 0
	for {
		req := <-requestChannel
		if req == addRequest {
			counter += 1
		} else if req == outputRequest {
			outputChannel <- counter
		}
	}
}
func main() {
// Set up the actor
requestChannel := make(chan int)
outputChannel := make(chan int)
go counterActor(requestChannel, outputChannel)
num := 10
for i := 0; i < num; i++ {
requestChannel <- addRequest
}
requestChannel <- outputRequest
new_count := <- outputChannel
fmt.Printf("%d\n", new_count);
}
Although there is a bit more boilerplate code, we don't have mutexes anymore! If we wanted to scale this operation and do other things, like increment by a number or write to a file, we could have that particular actor take care of it. This differentiation of responsibilities is important to make sure your design scales well. There are even libraries that handle all of the boilerplate code for you.
External Resources
• Can a thread copy the underlying bytes of a mutex instead of using a pointer?
• sem_init
• sem_wait
• sem_post
• sem_destroy
Topics
• Atomic operations
• Critical Section
• Implementing a barrier
• Implementing a ring buffer
• Using pthread_mutex
Questions
• What are some downsides to atomic operations? What would be faster: keeping a local variable or many
atomic operations?
• Once you have identified a critical section, what is one way of assuring that only one thread will be in the
section at a time?
struct linked_list;
struct node;
void add_linked_list(linked_list *ll, void *elem) {
    node *packaged = new_node(elem);
    packaged->next = ll->head;
    ll->head = packaged;
    ll->size++;
}
• What is the producer consumer problem? How might the code in the above section be framed as a producer consumer problem? How is the producer consumer problem related to the reader writer problem?
• What is a condition variable? Why is there an advantage to using one over a while loop?
if(not_ready){
pthread_cond_wait(&cv, &mtx);
}
• What is a counting semaphore? Give me an analogy to a cookie jar/pizza box/limited food item.
• Give me an implementation of a reader-writer lock with condition variables, make a struct with whatever
you need, it needs to be able to support the following functions
typedef struct {
} rw_lock_t;
The only specification is that in between reader_lock and reader_unlock, no writers can write. In between the writer lock and unlock, only one writer may be writing at a time.
• Write code to implement a producer consumer using ONLY three counting semaphores. Assume there can
be more than one thread calling enqueue and dequeue. Determine the initial value of each semaphore.
• Write code to implement a producer consumer using condition variables and a mutex. Assume there can be
more than one thread calling enqueue and dequeue.
• Use CVs to implement add(unsigned int) and subtract(unsigned int) blocking functions that never allow the
global value to be greater than 100.
void main() {
pthread_mutex_t mutex;
pthread_cond_t cond;
pthread_mutex_init(&mutex, NULL);
pthread_cond_init(&cond, NULL);
pthread_cond_broadcast(&cond);
pthread_cond_wait(&cond,&mutex);
return 0;
}
pthread_mutex_lock(&m);
money -= amount;
pthread_mutex_unlock(&m);
}
• Sketch how to use a binary semaphore as a mutex. Remember, in addition to mutual exclusion, a mutex can only ever be unlocked by the thread that locked it.
sem_t sem;
void lock() {
void unlock() {
[1] T.J. Dekker and Edsger Dijkstra. Over de sequentialiteit van procesbeschrijvingen, 1965. URL http://www.cs.utexas.edu/users/EWD/transcriptions/EWD00xx/EWD35.html.
[2] Gary L. Peterson. Myths about the mutual exclusion problem. Inf. Process. Lett., 12:115–116, 1981.
8 Deadlock
No, you can’t always get what you want
You can’t always get what you want
You can’t always get what you want
But if you try sometimes you find
You get what you need
The philosophers Jagger & Richards
Deadlock is defined as a state in which a system cannot make forward progress. For the rest of the chapter, we define a system as a set of rules by which a set of processes can move from one state to another, where a state is either working or waiting for a particular resource. Forward progress means that at least one process is working, or that we can award a waiting process the resource it is waiting for. In a lot of systems, deadlock is avoided by ignoring the entire concept [5, P.237]. Have you heard of "turn it off and on again"? For products where the stakes are low (consumer operating systems, phones), it may be more efficient to allow deadlock. But in cases where "failure is not an option" – Apollo 13 – you need a system that tracks, breaks, or prevents deadlocks. Apollo 13 didn't fail because of deadlock, but it wouldn't be good to restart the system on liftoff.
Mission-critical operating systems need this guarantee formally because playing the odds with people’s lives
isn’t a good idea. Okay so how do we do this? We model the problem. Even though it is a common statistical
phrase that all models are wrong, the more accurate the model is to the system the higher the chance the method
will work.
Resource Allocation Graphs
One such way is modeling the system with a resource allocation graph (RAG). A resource allocation graph tracks which resource is held by which process and which process is waiting for a resource of a particular type. It is a simple yet powerful tool to illustrate how interacting processes can deadlock. If a process is using a resource, an arrow is drawn from the resource node to the process node. If a process is requesting a resource, an arrow is drawn from the process node to the resource node. If there is a cycle in the resource allocation graph and each resource in the cycle provides only one instance, then the processes will deadlock. For example, if process 1 holds resource A, process 2 holds resource B, process 1 is waiting for B, and process 2 is waiting for A, then processes 1 and 2 will be deadlocked (Figure 8.1). We'll make the distinction that the system is, by definition, in deadlock if all workers cannot perform any operation other than waiting. We can detect a deadlock by traversing the graph and searching for a cycle using a graph traversal algorithm, such as Depth First Search (DFS). The RAG is a directed graph, and we can treat both the processes and the resources as nodes.
typedef struct Graph {
    int node_id;                    // Node in this particular graph
    struct Graph **reachable_nodes; // List of nodes that can be reached from this node
    int size_reachable_nodes;       // Size of the list
} Graph;
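A hedged sketch of the DFS cycle check over this structure; the three-color state array is our own convention (0 = unvisited, 1 = on the current DFS path, 2 = fully explored), indexed by node_id and sized by the caller to the number of nodes:

int has_cycle(Graph *node, int *state) {
    if (state[node->node_id] == 1) return 1; // back edge: cycle found
    if (state[node->node_id] == 2) return 0; // already explored, no cycle here
    state[node->node_id] = 1;
    for (int i = 0; i < node->size_reachable_nodes; i++)
        if (has_cycle(node->reachable_nodes[i], state))
            return 1;
    state[node->node_id] = 2;
    return 0;
}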
Coffman Conditions
Surely cycles in RAGs happen all the time in an OS, so why doesn't it grind to a halt? You may not see deadlock because the OS may preempt some processes, breaking the cycle, but there is still a chance that your three lonely processes could deadlock.
There are four necessary and sufficient conditions for deadlock – meaning if these conditions hold then there
is a non-zero probability that the system will deadlock at any given iteration. These are known as the Coffman
Conditions [2].
• Mutual Exclusion: No two processes can obtain a resource at the same time.
• Circular Wait: There exists a cycle in the Resource Allocation Graph, or there exists a set of processes {P1, P2, ...} such that P1 is waiting for resources held by P2, which is waiting for P3, ..., which is waiting for P1.
• Hold and Wait: Once a resource is obtained, a process keeps the resource locked while waiting for others.
• No Preemption: Nothing can force a process to give up a resource.
Proof: Deadlock can happen if and only if the four Coffman conditions are satisfied.
→ If the system is deadlocked, the four Coffman conditions are apparent.
• For contradiction, assume that there is no circular wait. Then the resource allocation graph is acyclic, meaning that there is at least one process that is not waiting on any resource to be freed. Since the system can move forward, the system is not deadlocked.
• For contradiction, assume that there is no mutual exclusion. Then no process is ever waiting on another process for a resource. This breaks circular wait, and the previous argument proves correctness.
• For contradiction, assume that processes don't hold and wait, but our system still deadlocks. Since we have circular wait from the first condition, at least one process must be waiting on another process. If processes don't hold and wait, then some process must eventually let go of a resource. Since the system has moved forward, it cannot be deadlocked.
• For contradiction, assume that we have preemption, but the system cannot be un-deadlocked. Have one process, or create one process, that recognizes the circular wait that must be apparent from above and preempts one of the links. By the first branch, the system must not be deadlocked.
← If the four conditions hold, the system can be deadlocked. Equivalently, we will show that if the system is not deadlocked, the four conditions cannot all hold. Though this proof is not formal, let us build a system with the three requirements other than circular wait. Assume that there is a set of processes P = {p_1, p_2, ..., p_n} and a set of resources R = {r_1, r_2, ..., r_m}. For simplicity, a process can only request one resource at a time, but the proof can be generalized to multiple. Assume that at time t the system is in a state described by the tuple (h_t, w_t) of two functions: h_t : R → P ∪ {unassigned}, which maps each resource to the process that owns it or to unassigned (the fact that this is a function encodes mutual exclusion), and w_t : P → R ∪ {satisfied}, which maps each process to the resource it is requesting, or to satisfied. If a process is satisfied, we consider its work trivial; the process exits, releasing all its resources – this too can be generalized. Let L_t ⊆ P × R be the list of requests through which processes release resources at any given time. The system evolves as follows at every time step:
• If the requested resource is available, give it to the requesting process, generating a new state (h_{t+1}, w_{t+1}), and exit the current iteration.
• Else, find another process and try the same resource allocation procedure as in the previous step.
If all processes have been surveyed, all are requesting a resource, and none can be granted one, consider the system deadlocked. More formally, the system is deadlocked if ∃t_0 such that ∀t ≥ t_0, ∀p ∈ P, w_t(p) ≠ satisfied and ∃q, q ≠ p, such that h_t(w_t(p)) = q (which is what we need to prove).
Mutual exclusion and no preemption are encoded into the system. Circular wait implies the second condition: each requested resource is owned by another process, meaning at this state ∀p ∈ P, ∃q ≠ p such that h_t(w_t(p)) = q. Circular wait also implies that at this state no process is satisfied, meaning ∀p ∈ P, w_t(p) ≠ satisfied. Hold and wait implies that from this point onward, the system will not change, which is all that we needed to show.
If a system breaks any of the conditions, it cannot have deadlock! Consider the scenario where two students each need both a pen and paper, and there is only one of each. Breaking mutual exclusion means that the students share the pen and paper. Breaking circular wait could mean that the students agree to grab the pen before the paper. As proof by contradiction, say that deadlock occurs under this rule. That would mean one student is waiting on the pen while holding the paper, and the other is waiting on the paper while holding the pen. We have a contradiction, because the first student grabbed the paper without first grabbing the pen, so deadlock fails to occur. Breaking hold and wait could mean that a student tries to grab the pen and then the paper, and if they fail to grab the paper, they release the pen. This introduces a new problem called livelock, which will be discussed later. Breaking no-preemption means that if the two students are in deadlock, the teacher can come in and break up the deadlock by giving one of the students a held item, or by telling both students to put the items down.
Livelock relates to deadlock. Consider the hold-and-wait-breaking solution above. Though deadlock is avoided, if each party picks up and puts down the same item again and again in the same pattern, no work will be done. Livelock is generally harder to detect, because livelocked processes look like they are working to the operating system, whereas in deadlock the operating system generally knows when two processes are waiting on a system-wide resource. Another problem is that there are necessary conditions for livelock (i.e. deadlock fails to occur) but no sufficient conditions – meaning there is no set of rules under which livelock must occur. You must formally prove the absence of livelock in a system using what is known as an invariant: enumerate each of the steps of a system, and if each of the steps eventually – after some finite number of steps – leads to forward progress, the system fails to livelock. There are even better systems that prove bounded waits: a system can only be livelocked for at most n cycles, which may be important for something like stock exchanges.
Ignoring deadlock is the most obvious approach. Quite humorously, this approach is called the Ostrich Algorithm. Though there is no apparent source, the name comes from the image of an ostrich sticking its head in the sand. When the operating system detects deadlock, it does nothing out of the ordinary, and any deadlock usually goes away. The operating system preempts processes when stopping them for context switches, it can interrupt any system call – potentially breaking a deadlock scenario – and it makes some files read-only, thus making those resources shareable. What the algorithm concedes is that an adversary who specifically crafts a program – or equivalently a user who writes a program poorly – can deadlock under the OS. For everyday life, this tends to be fine. When it is not, we can turn to the following method.
Deadlock detection allows the system to enter a deadlocked state. After entering, the system uses the information it has to break the deadlock. As an example, consider multiple processes accessing files. The operating system can keep track of all of the files/resources through file descriptors, at some level, either abstracted through an API or directly. If the operating system detects a directed cycle in its file descriptor table, it may break one process's hold – through scheduling, for example – and let the system proceed. This is a popular choice in this realm because there is no way of knowing which resources a program will acquire without running the program. This is an extension of Rice's theorem [4], which says that we cannot decide any semantic property without running the program (semantic meaning, for example, which files it tries to open). So, theoretically, detection is sound. The problem then introduced is that we could reach a livelock scenario if we preempt the same set of resources again and again. The way around this is mostly probabilistic: the operating system chooses a random resource on which to break hold-and-wait. Even though a user could craft a program where breaking hold-and-wait on each resource results in livelock, in practice this doesn't happen often on real machines, or any livelock that does happen only lasts a few cycles. These systems are good for products that need to maintain a non-deadlocked state but can tolerate a small chance of livelock for a short time.
• If person j is now either satisfied (l_{t+1,j} == a_j), or min_i(a_i − l_{t+1,i}) ≤ p – in other words, we still have enough money left to satisfy at least one other person – consider the transaction safe and give them the money.
Why does this work? At the start, we are in a safe state, defined as a state in which we have enough money to satisfy at least one person. Each of these "loans" results in another safe state. If we have exhausted our reserve, at least one person is working, and they will give us money greater than or equal to our previous "loan", putting us in a safe state again. Since we can always make one additional move, the system can never deadlock. Now, there is no guarantee that the system won't livelock: if the process whose resources we are counting on never finishes, no work will be done – but not due to deadlock. This analogy expands to larger systems, but it requires that either a process can do its work entirely, or there exists a process whose combination of resources can be satisfied, which makes the algorithm a little trickier (an additional for loop) but nothing too bad. There are some notable downsides.
• The algorithm first needs to know how much of each resource a process will need. A lot of the time that is impossible, or the process requests the wrong amount because the programmer didn't foresee it.
• We know that in most systems the resources vary – pipes and sockets, for example. The runtime of the algorithm could be slow for systems with millions of resources.
• Also, it can't keep track of resources that come and go. A process may delete a resource as a side effect or create a resource. The algorithm assumes a static allocation, and that each process performs non-destructive operations.
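For a single resource type, the banker's safety check can be sketched as follows; avail, alloc and max are our own names for the reserve, the current loans and the maximum needs:

#include <stdbool.h>

bool is_safe(int avail, const int *alloc, const int *max, int n) {
    bool done[n];
    for (int i = 0; i < n; i++) done[i] = false;
    for (int finished = 0; finished < n; ) {
        bool progressed = false;
        for (int i = 0; i < n; i++) {
            // Can process i's worst-case remaining need be satisfied?
            if (!done[i] && max[i] - alloc[i] <= avail) {
                avail += alloc[i]; // run it to completion, reclaim its loan
                done[i] = true;
                finished++;
                progressed = true;
            }
        }
        if (!progressed) return false; // no process can finish: unsafe
    }
    return true;
}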
Dining Philosophers
The Dining Philosophers problem is a classic synchronization problem. Imagine we invite n (let's say 6) philosophers to a meal. We will sit them at a round table with 6 chopsticks, one between each pair of philosophers. A philosopher alternates between wanting to eat and wanting to think. To eat, the philosopher must pick up the two chopsticks on either side of their position. The original problem required each philosopher to have two forks, but since one can eat with a single fork, we use chopsticks instead. Each chopstick, however, is shared with a neighbor.
Is it possible to design an efficient solution such that all philosophers get to eat? Or will some philosophers starve, never obtaining a second chopstick? Or will all of them deadlock? For example, imagine each guest picks up the chopstick on their left and then waits for the chopstick on their right to be free. Oops – our philosophers have deadlocked! Each philosopher must be essentially the same, meaning each philosopher runs the same instruction set; you can't tell every even philosopher to do one thing and every odd philosopher to do another.
Failed Solutions
So now you are thinking about breaking one of the Coffman conditions – let's break hold-and-wait! Now our philosopher picks up the left chopstick and tries to grab the right one. If it's available, they eat; if not, they put the left chopstick down and try again. No deadlock! But there is a problem: what if all the philosophers pick up their left at the same time, try to grab their right, put their left down, pick up their left, try to grab their right, and so on? Here is what a time evolution of the system would look like.
Figure 8.5: Livelock Failure
We have now livelocked our solution! Our poor philosophers are still starving, so let’s give them some proper
solutions.
Viable Solutions
The naive arbitrator solution has one arbitrator, a mutex for example. Each philosopher asks the arbitrator for permission to eat, or trylocks the arbitrator mutex. This solution allows one philosopher to eat at a time; when they are done, another philosopher can ask for permission to eat. This prevents deadlock because there is no circular wait – no philosopher has to wait on any other philosopher. A sketch follows below. The advanced arbitrator solution is to implement a class that determines whether the philosopher's chopsticks are in the arbitrator's possession. If they are, the arbitrator gives them to the philosopher, lets them eat, and takes the chopsticks back. This has the bonus of allowing multiple philosophers to eat at the same time.
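Here is a minimal sketch of the naive arbitrator, with an invented meal count; a single mutex and the philosopher threads are the only moving parts.

#include <pthread.h>
#include <stdio.h>

#define N 6

// One global mutex grants permission to eat, so at most one
// philosopher eats at a time -- no circular wait is possible.
pthread_mutex_t arbitrator = PTHREAD_MUTEX_INITIALIZER;

void *philosopher(void *arg) {
    long i = (long)arg;
    for (int meal = 0; meal < 3; meal++) {
        pthread_mutex_lock(&arbitrator); // ask permission to eat
        // While holding the arbitrator, picking up both chopsticks
        // is trivially safe: no one else is reaching for them.
        printf("Philosopher %ld is eating\n", i);
        pthread_mutex_unlock(&arbitrator); // done; the next may eat
        // think ...
    }
    return NULL;
}

int main() {
    pthread_t t[N];
    for (long i = 0; i < N; i++)
        pthread_create(&t[i], NULL, philosopher, (void *)i);
    for (int i = 0; i < N; i++)
        pthread_join(t[i], NULL);
    return 0;
}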
There are a lot of problems with these solutions. One is that they are slow and have a single point of failure. Assuming that all the philosophers are good-willed, the arbitrator also needs to be fair; in practical systems, the arbitrator tends to give chopsticks to the same processes because of scheduling or pseudo-randomness. Another important thing to note is that this prevents deadlock for the entire system, but in our model of dining philosophers, the philosopher has to release the lock themselves. Consider, then, the case of a malicious philosopher (let's say Descartes, because of his evil demons) who holds on to the arbitrator forever. He would make forward progress, and the system would make forward progress, but there is no way of ensuring that every process makes forward progress without assuming something about the processes or having true preemption – meaning that a higher authority (let's say Steve Jobs) forcibly tells them to stop eating.
Proof: Stallings' Solution Doesn't Deadlock. Let's number the philosophers $\{p_0, p_1, \ldots, p_{n-1}\}$ and the resources $\{r_0, r_1, \ldots, r_{n-1}\}$. A philosopher $p_i$ needs resources $r_{(i-1) \bmod n}$ and $r_{(i+1) \bmod n}$. Without loss of generality, let us take $p_i$ out of the picture. Each resource previously had exactly two philosophers that could use it; now resources $r_{(i-1) \bmod n}$ and $r_{(i+1) \bmod n}$ have only one philosopher waiting on them. Even if hold-and-wait, no-preemption, and mutual exclusion are all present, those resources can never enter a state where one philosopher requests them while they are held by another, because only one philosopher can request them. Since there is no other way to generate a cycle, circular wait cannot hold, and since circular wait cannot hold, deadlock cannot happen.
Here is a visualization of the worst-case. The system is about to deadlock, but the approach resolves it.
Topics
• Coffman Conditions
• Dining Philosophers
• Failed DP Solutions
• Livelocking DP Solutions
Questions
• What does each of the Coffman conditions mean? Define each one.
• Give a real-life example of breaking each Coffman condition in turn. A situation to consider: painters, paint, paint brushes, etc. How would you ensure that work gets done?
• Suppose Thread 1 has already locked m1 and now calls pthread_mutex_lock(m2), while Thread 2 runs the following:

// Thread 2
pthread_mutex_lock(m2); // success
pthread_mutex_lock(m1); // blocks

What happens and why? What happens if a third thread calls pthread_mutex_lock(m1)?
• How many processes are blocked? As usual, assume that a process can complete if it can acquire all of the
resources listed below.
– P1 acquires R1
– P2 acquires R2
– P1 acquires R3
– P2 waits for R3
– P3 acquires R5
– P1 waits for R4
– P3 waits for R1
– P4 waits for R5
– P5 waits for R1
Bibliography
[1] K. M. Chandy and J. Misra. The drinking philosophers problem. ACM Trans. Program. Lang. Syst., 6(4):
632–646, October 1984. ISSN 0164-0925. doi: 10.1145/1780.1804. URL http://doi.acm.org/10.
1145/1780.1804.
[2] Edward G Coffman, Melanie Elphick, and Arie Shoshani. System deadlocks. ACM Computing Surveys (CSUR),
3(2):67–78, 1971.
[3] Edsger W. Dijkstra. Hierarchical ordering of sequential processes. EWD-310, n.d. URL http://www.cs.utexas.edu/users/EWD/ewd03xx/EWD310.PDF.
[4] H. G. Rice. Classes of recursively enumerable sets and their decision problems. Transactions of the American
Mathematical Society, 74(2):358–366, 1953. ISSN 00029947. URL http://www.jstor.org/stable/
1990888.
[5] A. Silberschatz, P.B. Galvin, and G. Gagne. OPERATING SYSTEM PRINCIPLES, 7TH ED. Wiley student edition.
Wiley India Pvt. Limited, 2006. ISBN 9788126509621. URL https://books.google.com/books?id=
WjvX0HmVTlMC.
[6] William Stallings. Operating Systems: Internals and Design Principles, 7th Ed. (International Economy Edition). PE, 2011. ISBN 9332518807. URL https://www.amazon.com/Operating-Systems-Internals-Principles-International/dp/9332518807.
9 Virtual Memory and Interprocess Communication
Abbott: Now you’ve got it.
Costello: I throw the ball to Naturally.
Abbott: You don’t! You throw it to Who!
Costello: Naturally.
Abbott: Well, that’s it - say it that way.
Costello: That’s what I said.
Abbott and Costello on Effective Communication
In simple embedded systems and early computers, processes directly access memory – “Address 1234” cor-
responds to a particular byte stored in a particular part of physical memory. For example, the IBM 709 had to
read and write directly to tape with no level of abstraction [3, P. 65]. Even in systems after that, it was hard to
adopt virtual memory because virtual memory required the whole fetch cycle to be altered through hardware
– a change many manufacturers still thought was expensive. In the PDP-10, a workaround was used by using
different registers for each process and then virtual memory was added later [1]. In modern systems, this is no
longer the case. Instead, each process is isolated, and there is a translation process between the address of a
particular CPU instruction or piece of data of a process and the actual byte of physical memory ("RAM"). Memory addresses no longer map directly to physical addresses; the process runs inside virtual memory. Virtual memory keeps
processes safe because one process cannot directly read or modify another process’s memory. Virtual memory also
allows the system to efficiently allocate and reallocate portions of memory to different processes. The modern
process of translating memory is as follows.
1. A process makes a memory request
2. The hardware checks whether the translation for the address's page is cached in the Translation Lookaside Buffer (TLB). If it is found, it skips to the reading/writing phase; otherwise, the request goes to the MMU.
3. The Memory Management Unit (MMU) performs the address translation. If the translation succeeds, the
page gets pulled from RAM – conceptually the entire page isn’t loaded up. The result is cached in the TLB.
4. The CPU performs the operation by either reading from the physical address or writing to the address.
Translating Addresses
The Memory Management Unit is part of the CPU, and it converts a virtual memory address into a physical address.
First, we’ll talk about what the virtual memory abstraction is and how to translate addresses
To illustrate, consider a 32-bit machine, meaning pointers are 32 bits wide. They can address $2^{32}$ different locations, or 4GB of memory, where one address is one byte. Imagine we had a large table for every possible address where we store the 'real', i.e. physical, address. Each physical address needs 4 bytes to hold its 32 bits. This scheme would require 16 GiB to store all of the entries – it should be painfully obvious that such a lookup scheme would consume more than all of the memory we could buy for our 4GB machine. Our lookup table should be smaller than the memory we have; otherwise, we will have no space left for our actual programs and operating system data. The solution is to chunk memory into small regions called 'pages' and 'frames' and use one lookup entry per page.
Terminology
A page is a block of virtual memory. A typical page size on Linux is 4KiB, or $2^{12}$ addresses, though one can find examples of larger pages. Rather than talking about individual bytes, we can talk about blocks of 4KiB, each block called a page. We can also number our pages ("Page 0", "Page 1", etc.). Let's do a sample calculation of how many pages there are, assuming a page size of 4KiB.
For a 32-bit machine, $2^{32} / 2^{12} = 2^{20}$ pages.
A frame – sometimes called a 'page frame' – is a block of physical memory or RAM (Random Access Memory). A frame is the same number of bytes as a virtual page, or 4KiB on our machine. It stores the bytes of interest. To access a particular byte in a frame, the MMU goes to the start of the frame and adds the offset, discussed later.
A page table is a map from a number to a particular frame. For example Page 1 might be mapped to frame
45, page 2 mapped to frame 30. Other frames might be currently unused or assigned to other running processes
or used internally by the operating system. Implied from the name, imagine a page table as a table.
Page Number    Frame Number
1              45
2              30
In practice, we will omit the first column because it will always be sequentially 0, 1, 2, etc and instead we’ll
use the offset from the start of the table as the entry number.
Now to go through the actual calculations. We will assume that our 32-bit machine has 4KiB pages. To cover the whole address space, there are $2^{20}$ frames. Since there are $2^{20}$ possible frames, we need 20 bits to number all of them, meaning each frame number must be 2.5 bytes long. In practice, we round that up to 4 bytes and do something interesting with the spare bits. With 4 bytes per entry × $2^{20}$ entries, 4 MiB of physical memory are required to hold the page table for one process.
Remember our page table maps pages to frames, but each frame is a block of contiguous addresses. How
do we calculate which particular byte to use inside a particular frame? The solution is to re-use the lowest
bits of the virtual memory address directly. For example, suppose our process is reading from the following address:

VirtualAddress = 11110000111100001111000010101010 (binary)
So to give an example say we have the virtual address above. How would we split it up using a one-page table
to frame scheme?
Page Number: 11110000111100001111 (top 20 bits)
Offset: 000010101010 (bottom 12 bits)
We can imagine the steps to dereference as one process. In general, it looks like the following.
The top 20 bits index the page table, and the final 12 bits are the offset. With page-table entries such as:

11110000111100001110 → frame 421
11110000111100001111 → frame 357
11110000111100010000 → frame 361

our page number, 11110000111100001111, selects frame 357, and the offset 000010101010 selects the byte within that frame.
And if we were reading from it, 'return' that value. This sounds like a perfect solution. Map each virtual address through the table in sequential order: the process believes its addresses are contiguous, while the top 20 bits are used to figure out the page number, which lets us find the frame number, find the frame, add the offset – derived from the last 12 bits – and do the read or write.
There are other ways to split an address as well. On a machine with a page size of 256 bytes, the lowest 8 bits (10101010) would be used as the offset, and the remaining upper bits would be the page number (111100001111000011110000). The offset is treated as a binary number and is added to the start of the frame once we get it.
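Concretely, for the 4KiB split above, the page number and offset fall out of two bit operations; the example address is the one from the text.

#include <stdint.h>
#include <stdio.h>

int main() {
    // 11110000111100001111000010101010 in binary
    uint32_t vaddr = 0xF0F0F0AA;
    uint32_t page_num = vaddr >> 12;   // top 20 bits
    uint32_t offset   = vaddr & 0xFFF; // bottom 12 bits
    printf("page number = 0x%05x, offset = 0x%03x\n", page_num, offset);
    return 0;
}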
We do have a problem with 64-bit operating systems. For a 64-bit machine with 4KiB pages, there are $2^{52}$ possible entries, and each entry needs 52 bits for the frame number, rounded up to 8 bytes. That's $2^{55}$ bytes (roughly 36 petabytes) for a single page table – far too large. In 64-bit architectures, memory addresses are sparse, so we need a mechanism to reduce the page table size, given that most of the entries will never be used. We'll talk about this below. There is one last piece of terminology that needs to be covered.
A virtual address is now split into three pieces: the top 10 bits (Index 1) index the base directory, the next 10 bits (Index 2) index a subpage table, and the final 12 bits are the offset. The subpage-table entry supplies the frame number, and the offset selects the byte.
Following our example, here is what the dereference would look like.
Base directory (indexed by the top 10 bits; 1111000011 selects Subpage Table #173):

1111000010 → 125
1111000011 → 173
1111000100 → 126

Subpage Table #173 (indexed by the next 10 bits; 1100001111 selects frame 241):

1100001110 → 233
1100001111 → 241
1100010000 → 374

The offset 000010101010 then selects the byte within frame 241.
Now some calculations on size. Each index into the base directory is 10 bits wide, because there are only $2^{10}$ possible sub-tables, so we need 10 bits to store each directory index; we'll round up to 2 bytes per entry for the sake of reasoning. If 2 bytes are used for each entry in the top-level table, and there are only $2^{10}$ entries, we only need 2KiB to store this entire first-level page table. Each subtable points to physical frames, and each of its entries needs the full 4 bytes to be able to address all the frames, as mentioned earlier. However, for processes with only tiny memory needs, we only need to specify entries for low memory addresses (the heap and program code) and high memory addresses (the stack).
Thus, the total memory overhead for our multi-level page table has shrunk from 4MiB for the single-level implementation to three page tables totalling 10KiB: 2KiB for the top-level table and 4KiB for each of the two sub-tables. Here's why: we need at least one frame for the high-level directory and two frames for two sub-tables. One sub-table is necessary for the low addresses – program code, constants and possibly a tiny heap. The other sub-table is for higher addresses used by the environment and stack. In practice, real programs will likely need more sub-table entries, as each subtable can only reference 1024 × 4KiB = 4MiB of address space. The main point still stands: we have significantly reduced the memory overhead required to perform page table lookups.
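Written out, the overhead arithmetic is

$2\,\text{KiB (top level)} + 2 \times 4\,\text{KiB (sub-tables)} = 10\,\text{KiB} \ll 4\,\text{MiB}$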
MMU Algorithm
There is a sort of pseudocode associated with the MMU. We will assume that this is for a single-level page table; a runnable sketch in C follows the list.
1. Receive address
4. Otherwise,
(a) If the TLB contains the physical frame, fetch it from the TLB and perform the read or write.
(b) If the page exists in memory, check whether the process has permission to perform the operation on the page, meaning the process has access to the page and is reading from, or writing to, a page in a way it is allowed to.
i. If so, perform the dereference with the translated address and cache the result in the TLB.
ii. Otherwise, trigger a hardware interrupt. The kernel will most likely send a SIGSEGV, or Segmentation Violation.
(c) If the page doesn't exist in memory, generate an interrupt.
i. The kernel could realize that the page is either not yet allocated or currently on disk. If it fits the mapping, allocate the page and try the operation again.
ii. Otherwise, this is an invalid access and the kernel will most likely send a SIGSEGV to the process.
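The following is a minimal, hypothetical C sketch of this single-level lookup. The tiny page table, its contents, and the fault handling (printing and exiting rather than raising hardware interrupts) are illustrative only.

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#define PAGE_SIZE   4096u
#define OFFSET_BITS 12
#define NUM_PAGES   16 // tiny illustrative address space

// One page-table entry: a frame number plus permission bits.
typedef struct {
    uint32_t frame;
    int present;
    int writable;
} pte_t;

static pte_t page_table[NUM_PAGES];

// Translate a virtual address, simulating the MMU's checks.
uint32_t translate(uint32_t vaddr, int is_write) {
    uint32_t page_num = vaddr >> OFFSET_BITS;
    uint32_t offset   = vaddr & (PAGE_SIZE - 1);

    if (page_num >= NUM_PAGES || !page_table[page_num].present) {
        fprintf(stderr, "page fault at 0x%x\n", vaddr);
        exit(1); // kernel would allocate, swap in, or deliver SIGSEGV
    }
    if (is_write && !page_table[page_num].writable) {
        fprintf(stderr, "protection fault at 0x%x\n", vaddr);
        exit(1); // kernel would deliver SIGSEGV
    }
    return page_table[page_num].frame * PAGE_SIZE + offset;
}

int main() {
    // Map page 1 to frame 45, as in the earlier example table
    page_table[1] = (pte_t){.frame = 45, .present = 1, .writable = 1};
    printf("0x%x -> 0x%x\n", 0x1ABC, translate(0x1ABC, 0));
    return 0;
}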
This is heavily dependent on the chipset. We will include some bits that have historically been popular in chipsets.
1. The read-only bit marks the page as read-only; attempts to write to the page will cause a page fault, which is then handled by the kernel. Two examples of read-only pages include sharing the C standard library between multiple processes (for security, you wouldn't want one process to be able to modify the library) and copy-on-write, where the cost of duplicating a page can be delayed until the first write occurs.
2. The execution bit defines whether bytes in a page can be executed as CPU instructions. Processors may merge the write and execute bits and deem a page either writable or executable, but never both. This is useful because it prevents stack-overflow and code-injection attacks that write user data into the heap or the stack: those pages are writable, and thus not executable.
3. The dirty bit allows for performance optimization. A page exclusively read from can be discarded without
syncing to disk, since the page hasn’t changed. However, if the page was written to after it’s paged in, its
dirty bit will be set, indicating that the page must be written back to the backing store. This strategy requires
that the backing store retain a copy of the page after it is paged into memory. When a dirty bit is omitted,
the backing store need only be as large as the instantaneous total size of all paged-out pages at any moment.
When a dirty bit is used, at all times some pages will exist in both physical memory and the backing store.
4. There are plenty of other bits. Take a look at your favorite architecture and see what other bits are associated!
Page Faults
A page fault happens when a process accesses an address in a page that is missing from memory. There are three types of page faults.
1. Minor When there is no mapping yet for the page, but it is a valid address. This could be memory asked for by sbrk(2) but not written to yet, meaning that the operating system can wait for the first write before allocating space (if the page is only read from first, the operating system can short-circuit the operation and return 0). The OS simply makes the page, loads it into memory, and moves on.
2. Major When the mapping to the page is exclusively on disk. The operating system will swap the page into memory and swap another page out. If this happens frequently enough, your program is said to thrash.
3. Invalid When a program tries to write to a non-writable memory address or read from a non-readable memory address. The MMU generates an invalid fault, and the OS will usually deliver a SIGSEGV, meaning segmentation violation – the program touched memory outside the segments it is allowed to use.
mmap is a virtual memory trick: instead of mapping a page to a frame of physical memory, the page can be backed by a file on disk, or the frame can be shared among processes. We can use this to read from a file on disk efficiently or to sync changes back to the file. One of the big optimizations is that a file may be lazily mapped into memory. Take the following code for example.
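A minimal sketch of the idea – the file name data.txt is illustrative, error checking is omitted, and the file is assumed to exist and be non-empty:

#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main() {
    int fd = open("data.txt", O_RDWR);
    struct stat s;
    fstat(fd, &s); // get the file's length

    // Map the whole file; writes to addr[...] become writes to the file.
    char *addr = mmap(NULL, s.st_size, PROT_READ | PROT_WRITE,
                      MAP_SHARED, fd, 0);
    addr[0] = 'X'; // changes the first byte of data.txt

    munmap(addr, s.st_size);
    close(fd);
    return 0;
}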
The kernel sees that the program wants to mmap the file into memory, so it reserves some space in the address space that is the length of the file. That means when the program writes to addr[0], it writes to the first byte of the file. The kernel can do some optimizations, too: instead of loading the whole file into memory, it may load only a few pages at a time, since a program may only ever access 3 or 4 pages, making loading the entire file a waste of time. Page faults are so powerful because they let the operating system take control of when a file is used.
mmap Definitions
mmap does more than take a file and map it to memory: it is the general interface for creating shared memory among processes. Currently, it only supports regular files and POSIX shared memory [2]. Naturally, you can read all about it in the reference above, which points to the current working-group POSIX standard. Some other options of note on that page follow.
The options to mmap come in two arguments: prot takes the PROT_* memory-protection values below, and flags takes the MAP_* mapping options.
1. PROT_READ This means the process can read the memory. This isn’t the only flag that gives the process
read permission, however! The underlying file descriptor, in this case, must be opened with read privileges.
2. PROT_WRITE This means the process can write to the memory. This has to be supplied for a process to write to a mapping. The underlying file descriptor, in this case, must either be opened with write privileges, or a private mapping must be supplied (see MAP_PRIVATE below).
3. PROT_EXEC This means the process can execute this piece of memory. Although this is not stated in POSIX documents, this shouldn't be supplied together with PROT_WRITE (which would be invalid under the NX, write-xor-execute, policy) or with PROT_NONE (which forbids any access at all).
4. PROT_NONE This means the process can't do anything with the mapping. This could be useful for implementing guard pages for security: if you surround critical data with pages that can't be accessed, you decrease the chance of various attacks.
5. MAP_SHARED This mapping will be synchronized to the underlying file object. The file descriptor must’ve
been opened with write permissions in this case.
6. MAP_PRIVATE This mapping will only be visible to the process itself; changes are not carried through to the underlying file, so the rest of the system is left undisturbed.
Remember that once a program is done with an mmap'ed region, it must munmap it to tell the operating system that it is no longer using those pages, so the OS can write them back to disk and hand back the addresses in case another mmap needs to occur. There is an accompanying call, msync, that takes a piece of mmap'ed memory and syncs the changes back to the filesystem, though we won't cover it in depth. The other parameters to mmap are described in the annotated walkthrough below.
off_t offset;
size_t length;
We’ll assume that all system calls succeed. First, we have to open the file and get the size.
Then, we need to introduce another variable known as page_offset. mmap doesn't let the program pass in an arbitrary value as an offset; it needs to be a multiple of the page size, so we round down to page_offset. The call then maps length + offset - page_offset bytes starting from page_offset – the "rest" of the file, beginning at our requested offset.
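Putting the pieces together, here is a hedged sketch of the walkthrough; the file name and the 5000-byte offset are made up, and all calls are assumed to succeed.

#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main() {
    off_t offset = 5000; // where in the file we want to start (illustrative)
    int fd = open("data.txt", O_RDONLY);
    struct stat s;
    fstat(fd, &s);
    size_t length = s.st_size - offset; // map from offset to end of file

    // mmap requires the file offset to be page-aligned: round down.
    long page_size = sysconf(_SC_PAGESIZE);
    off_t page_offset = (offset / page_size) * page_size;

    // Map length + (offset - page_offset) bytes starting at page_offset,
    // then index past the rounding slack to reach the requested offset.
    char *mapping = mmap(NULL, length + offset - page_offset,
                         PROT_READ, MAP_PRIVATE, fd, page_offset);
    char *start = mapping + (offset - page_offset);
    printf("first requested byte: %c\n", *start);

    munmap(mapping, length + offset - page_offset);
    close(fd);
    return 0;
}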
MMAP Communication
So how would we use mmap to communicate across processes? Conceptually, it would be the same as using threading. Let's go through a broken-down example. First, we need to allocate some space; we can do that with the mmap call. We'll also allocate space for 100 integers.
Then, we need to fork and perform some communication. Our parent will store some values, and our child
will read those values.
Now, there is no assurance that the values will be communicated, because the process used sleep rather than a mutex – but most of the time this will work.
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <sys/mman.h> /* mmap() is defined in this header */
#include <fcntl.h>
#include <unistd.h>
#include <errno.h>
#include <string.h>
int main() {
    int size = 100 * sizeof(int);
    // Shared, anonymous mapping: not backed by a file, visible to children
    void *addr = mmap(NULL, size, PROT_READ | PROT_WRITE,
                      MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    int *shared = addr;

    pid_t child = fork();
    if (child > 0) {
        // Parent: store two values in the shared frame
        shared[0] = 10;
        shared[1] = 20;
    } else {
        sleep(1); // crude synchronization, see the caveat below
        printf("%d and %d\n", shared[0], shared[1]);
    }

    munmap(addr, size);
    return 0;
}
This piece of code allocates space for 100 integers and creates a piece of memory that is shared between all
processes. The code then forks. The parent process writes two integers to the first two slots. To avoid a data race,
the child sleeps for a second and then prints out the stored values. This is an imperfect way to protect against
data races. We could use a mutex across the processes mentioned in the synchronization section. But for this
simple example, it works fine. Note that each process should call munmap when done using the piece of memory.
Sharing anonymous memory is an efficient form of inter-process communication because there is no copying,
system call, or disk-access overhead - the two processes share the same physical frame of main memory. On the
other hand, shared memory, like in a multithreading context, creates room for data races. Processes that share
writable memory might need to use synchronization primitives like mutexes to prevent these from happening.
Pipes
You’ve seen the virtual memory way of IPC, but there are more standard versions of IPC that are provided by the
kernel. One of the big utilities is POSIX pipes. A pipe simply takes in a stream of bytes and spits out a sequence of
bytes.
One of the big starting points of pipes was way back in the PDP-10 days. In those days, a write to the disk or even to your terminal was slow, as it might literally have to be printed out. The Unix programmers still wanted to create small, portable programs that did one thing well and could be composed. As such, pipes were invented to take the output of one program and feed it to the input of another, though they have other uses today – you can read more at the Wikipedia page on pipes. Consider typing the following into your terminal:

ls -1 | cut -f1 -d. | sort | uniq | tee dir_contents
What does this pipeline do? First, it lists the current directory; the -1 means that ls outputs one entry per line. The cut command then takes everything before the first period. sort sorts all the input lines, and uniq makes sure all the lines are unique. Finally, tee outputs the contents to the file dir_contents and to the terminal for your perusal. The important part is that bash creates 5 separate processes and connects their standard outs/standard ins with pipes; the trail looks something like this.
ls (1) → (0) cut (1) → (0) sort (1) → (0) uniq (1) → (0) tee
The numbers are the file descriptors in each process, and each arrow shows where the output of the pipe is going: each process's standard out (fd 1) feeds the next process's standard in (fd 0). A POSIX pipe is almost like its real counterpart – a program can stuff bytes down one end, and they will appear at the other end in the same order. Unlike real pipes, however, the flow is always in the same direction: one file descriptor is used for reading and the other for writing. The pipe system call is used to create a pipe, and the resulting file descriptors can be used with read and write. A common method of using pipes is to create the pipe before forking, in order to communicate with a child process:
int filedes[2];
pipe (filedes);
pid_t child = fork();
if (child > 0) {/* I must be the parent */
char buffer[80];
int bytesread = read(filedes[0], buffer, sizeof(buffer));
// do something with the bytes read
} else {
write(filedes[1], "done", 4);
}
There are two file descriptors that pipe creates: filedes[0] contains the read end and filedes[1] contains the write end. The way your friendly neighborhood TAs remember it is that one can read before they can write, or reading comes before writing. You can groan all you want at it, but it is helpful for remembering which is the read end and which is the write end.
One can use pipes inside of the same process, but there tends to be no added benefit. Here’s an example
program that sends a message to itself.
#include <unistd.h>
#include <stdlib.h>
#include <stdio.h>
int main() {
int fh[2];
pipe(fh);
FILE *reader = fdopen(fh[0], "r");
FILE *writer = fdopen(fh[1], "w");
// Hurrah now I can use printf
printf("Writing...\n");
fprintf(writer,"%d %d %d\n", 10, 20, 30);
fflush(writer);
printf("Reading...\n");
int results[3];
int ok = fscanf(reader,"%d %d %d", results, results + 1,
results + 2);
printf("%d values parsed: %d %d %d\n", ok, results[0],
results[1], results[2]);
return 0;
}
The problem with using a pipe in this fashion is that writing to a pipe can block: the pipe has only a limited buffering capacity, and once it is full, writes wait for a reader. The maximum size of the buffer is system-dependent; typical values range from 4KiB up to 128KiB, though they can be changed. Since nothing ever reads from the pipe in the following program, the loop eventually stops making progress:
#include <stdio.h>
#include <unistd.h>

int main() {
    int fh[2];
    pipe(fh);
    int b = 0;
    #define MESG "..............................."
    while (1) {
        printf("%d\n", b);
        write(fh[1], MESG, sizeof(MESG)); // blocks once the pipe is full
        b += sizeof(MESG);
    }
    return 0;
}
Pipe Gotchas
Here’s a complete example that doesn’t work! The child reads one byte at a time from the pipe and prints it out -
but we never see the message! Can you see why?
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <signal.h>
int main() {
int fd[2];
pipe(fd);
//You must read from fd[0] and write from fd[1]
printf("Reading from %d, writing to %d\n", fd[0], fd[1]);
pid_t p = fork();
    if (p > 0) {
        /* I have a child, therefore I am the parent */
        write(fd[1], "Hi Child!", 9);
    } else {
        /* Child: read one byte at a time and print it */
        char buf;
        while (read(fd[0], &buf, 1) > 0) {
            putchar(buf);
        }
    }
    return 0;
}
The parent sends the bytes H,i,(space),C...! into the pipe. The child starts reading the pipe one byte at a time. In the above case, the child process will read and print each character. However, it never leaves the while loop! When there are no characters left to read, it simply blocks and waits for more, unless all the writers close their ends. Another solution is to exit the loop by checking for an end-of-message marker. From the man page:

If all file descriptors referring to the read end of a pipe have been closed,
then a write(2) will cause a SIGPIPE signal to be generated for the calling process.

Tip: Notice that only a writer (not a reader) can receive this signal. To inform a reader that a writer is closing its end of the pipe, a program could write a special byte (e.g. 0xff) or a message ("Bye!").
Here’s an example of catching this signal that fails! Can you see why?
#include <stdio.h>
#include <unistd.h>
#include <signal.h>

void no_one_listening(int signal) {
    write(1, "No one is listening!\n", 21);
}

int main() {
signal(SIGPIPE, no_one_listening);
int filedes[2];
pipe(filedes);
pid_t child = fork();
if (child > 0) {
/* This process is the parent. Close the listening end of the
pipe */
close(filedes[0]);
} else {
/* Child writes messages to the pipe */
write(filedes[1], "One", 3);
sleep(2);
// Will this write generate SIGPIPE ?
write(filedes[1], "Two", 3);
write(1, "Done\n", 5);
}
return 0;
}
The mistake in the above code is that there is still a reader for the pipe! The child still has the pipe's read end (its first file descriptor) open – and remember the specification? All the read ends must be closed before a write generates SIGPIPE.
When forking, It is common practice to close the unnecessary (unused) end of each pipe in the child and parent
process. For example, the parent might close the reading end and the child might close the writing end.
The last addendum is that a program can arrange for write to return an error when there is no one listening, instead of receiving SIGPIPE, because by default SIGPIPE terminates your program. The reason this is the default behavior is that it makes shell pipelines work. Consider this useless use of cat:

cat /dev/urandom | head -n 20

This grabs 20 lines of input from /dev/urandom. head will terminate after 20 newline characters have been read. What about cat? cat needs to receive a SIGPIPE informing it that it tried to write to a pipe that no one is listening on.
Consider the earlier pipeline, ls -1 | cut -f1 -d. – it takes the output of ls -1, which lists the contents of the current directory one entry per line, and pipes it to cut. cut takes a delimiter (in this case, a dot) and a field position (in our case, 1) and outputs, per line, the nth field split by that delimiter. At a high level, this grabs the file names of the current directory without their extensions. Underneath the hood, this is how bash does it internally.
#define _GNU_SOURCE
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <stdlib.h>
#include <errno.h>

int main() {
    int pipe_fds[2];

    // Call with the O_CLOEXEC flag to prevent any commands from blocking:
    // the pipe's own descriptors close automatically on exec, while the
    // dup2'ed copies of them below survive.
    pipe2(pipe_fds, O_CLOEXEC);

    if (!fork()) {
        // Child: the stdout of the process is the write end
        dup2(pipe_fds[1], 1);
        execlp("ls", "ls", "-1", NULL);
        exit(errno);
    }

    // Same here, except the stdin of the process is the read end
    dup2(pipe_fds[0], 0);
    execlp("cut", "cut", "-f1", "-d.", NULL);
    exit(errno);
}
The results of the shell pipeline and of the program above should be the same. Remember, as you encounter more complicated examples of piping processes together, that a program needs to close all unused ends of pipes, otherwise the program will deadlock waiting for its processes to finish.
Pipe Conveniences
If the program already has a file descriptor, it can 'wrap' it into a FILE pointer using fdopen.
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <stdio.h>

int main() {
    char *name = "Fred";
    int score = 123;
    int filedes = open("mydata.txt", O_WRONLY | O_CREAT,
                       S_IWUSR | S_IRUSR);

    FILE *f = fdopen(filedes, "w");
    fprintf(f, "%s %d\n", name, score);
    fclose(f); // flushes the FILE buffer and closes the descriptor
    return 0;
}
For writing to files, this is unnecessary – fopen does the same as open followed by fdopen. However, for pipes we already have a file descriptor, so this is a great time to use fdopen.
Here’s a complete example using pipes that almost works! Can you spot the error? Hint: The parent never
prints anything!
#include <unistd.h>
#include <stdlib.h>
#include <stdio.h>
int main() {
int fh[2];
pipe(fh);
FILE *reader = fdopen(fh[0], "r");
FILE *writer = fdopen(fh[1], "w");
pid_t p = fork();
if (p > 0) {
int score;
fscanf(reader, "Score %d", &score);
printf("The child says the score is %d\n", score);
} else {
fprintf(writer, "Score %d", 10 + 10);
fflush(writer);
}
return 0;
}
Note the unnamed pipe resource will disappear once both the child and parent have exited. In the above
example, the child will send the bytes and the parent will receive the bytes from the pipe. However, no end-of-line
character is ever sent, so fscanf will continue to ask for bytes because it is waiting for the end of the line i.e. it
will wait forever! The fix is to ensure we send a newline character so that fscanf will return.
Named Pipes
An alternative to unnamed pipes is named pipes, created using mkfifo – from the command line, mkfifo <name>; from C, int mkfifo(const char *pathname, mode_t mode);. Give it the pathname and the operation mode, and it will be ready to go! Named pipes take up virtually no space on a file system: the actual contents of the pipe aren't written to and read back from the file. The operating system simply gives you an unnamed pipe that is referred to by the named pipe's path – there is no additional magic. Named pipes exist purely for programming convenience when processes are started without forking, since without a fork there is no way to hand the file descriptor of an unnamed pipe to the child process.
1$ mkfifo fifo
1$ echo Hello > fifo
# This will hang until the following command is run on another
terminal or another process
2$ cat fifo
Hello
When open is called on a named pipe, the kernel blocks until another process calls the opposite open. Meaning, echo calls open(.., O_WRONLY), but that blocks until cat calls open(.., O_RDONLY); then both programs are allowed to continue.
Race condition with named pipes
What is wrong with the following program?
//Program 1
#include <fcntl.h>
#include <unistd.h>

int main() {
    int fd = open("fifo", O_RDWR | O_TRUNC);
    write(fd, "Hello!", 6);
    close(fd);
    return 0;
}

//Program 2
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main() {
    char buffer[7];
    int fd = open("fifo", O_RDONLY);
    read(fd, buffer, 6);
    buffer[6] = '\0';
    printf("%s\n", buffer);
    return 0;
}
This may never print hello because of a race condition. Since the first program opened the pipe with both read and write permissions, open won't wait for a reader – the program told the operating system that it is a reader itself! Sometimes it looks like it works, because the interleaved execution of the two programs can hide the race.
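One way to avoid the race – a sketch under the assumption that the writer should block until a reader appears – is for the first program to open the FIFO write-only:

//Program 1, fixed: open the FIFO for writing only, so this open
//blocks until some reader opens the other end.
#include <fcntl.h>
#include <unistd.h>

int main() {
    int fd = open("fifo", O_WRONLY);
    write(fd, "Hello!", 6);
    close(fd);
    return 0;
}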
On Linux, there are two abstractions with files. The first is the Linux fd level abstraction.
• open takes a path to a file and creates a file descriptor entry in the process table. If the file is inaccessible, it
errors out.
• read asks the kernel for a certain number of bytes and copies them into a user-space buffer. If the file is not open in read mode, this will break.
• write outputs a certain number of bytes to a file descriptor. If the file is not open in write mode, this will
break. This may be buffered internally.
• close removes a file descriptor from a process’ file descriptors. This always succeeds for a valid file
descriptor.
• lseek takes a file descriptor and moves its file offset to a certain position. It can fail if the seek is out of bounds.
• fcntl is the catch-all function for file descriptors. Set file locks, read, write, edit permissions, etc.
The Linux interface is powerful and expressive, but sometimes we need portability, for example if we are writing for macOS or Windows. This is where C's abstraction comes into play. On different operating systems, C uses each system's low-level functions to create a portable wrapper around files used everywhere, meaning that C on Linux uses the calls above.
• fopen opens a file and returns an object. NULL is returned if the program doesn't have permission for the file.
• fread reads a certain number of bytes from a file. It returns a short count at the end of the file, at which point the program must call feof() to check whether it attempted to read past the end of the file.
But programs don't get the expressiveness that Linux gives with system calls. A program can convert back and forth between the two with int fileno(FILE* stream) and FILE* fdopen(int fd, ...). Also, C files are buffered, meaning that their contents may be written to the backing store only some time after the call returns. You can change that with C's buffering options.
Danger: with portability, you lose something important – the ability to tell precisely when something goes wrong. A program can fdopen a file descriptor and get a FILE* object, but it won't behave the same as a regular file, meaning that certain calls will fail or act weirdly. The C API reduces this weirdness, but, for example, a program cannot fseek to a part of a pipe or control the pipe's buffering. The problem is that the API won't give much warning, because C needs to maintain compatibility with other operating systems. To keep things simple, use the C API when dealing with a file on disk, and it will work fine; otherwise, be in for a rough ride for portability's sake.
Determining File Length
For files whose size fits in a long, using fseek and ftell is a simple way to accomplish this: move to the end of the file and find out the current position.
fseek(f, 0, SEEK_END);
long pos = ftell(f);
This tells us the current position in the file in bytes - i.e. the length of the file!
fseek can also be used to set the absolute position.
All future reads and writes in the parent or child processes will honor this position, since the file offset is shared across a fork. Note that writing to or reading from the file will change the current position. See the man pages for fseek and ftell for more information.
Okay, so now you have a list of tools in your toolbox to tackle communicating between processes – what should you use? There is no hard answer, though this is the most interesting question. Generally, we have retained pipes for legacy reasons, meaning we only use them to redirect stdin, stdout, and stderr for the collection of logs and for similar programs. You may find processes trying to communicate with unnamed or named pipes as well, but most of the time you won't be dealing with that interaction directly.
Files are used almost all the time as a form of IPC. Hadoop is a great example where processes will write to
append-only tables and then other processes will read from those tables. We generally use files under a few cases.
One case is if we want to save the intermediate results of an operation to a file for future use. Another case is if
putting it in memory would cause an out of memory error. On Linux, file operations are generally pretty cheap, so
most programmers use it for larger intermediate storage.
mmap is used in two scenarios. One is a linear or near-linear read-through of a file, meaning a program reads the file front to back or back to front. The key is that the program doesn't jump around too much; jumping around causes thrashing and loses all the benefits of using mmap. The other usage is direct-memory inter-process communication, meaning a program can store structures in a piece of mmap'ed memory and share them between two processes. Python and Ruby use such mappings all the time to take advantage of copy-on-write semantics.
Topics
1. Virtual Memory
2. Page Table
3. MMU/TLB
4. Address Translation
5. Page Faults
6. Frames/Pages
9. Pipes
3. What is a page table? How about a physical frame? Does a page always need to point to a physical frame?
4. What is a page fault? What are the types? When does it result in a SEGFAULT?
5. What are the advantages to a single-level page table? Disadvantages? How about a multi-level table?
7. How do you determine how many bits are used in the page offset?
8. Given a 64-bit address space, 4kb pages and frames, and a 3 level page table, how many bits are the Virtual
page number 1, VPN2, VPN3 and the offset?
11. Under what conditions will calling read() on a pipe block? Under what conditions will read() immediately return 0?
12. What is the difference between a named pipe and an unnamed pipe?
14. Write a function that uses fseek and ftell to replace the middle character of a file with an ’X’
15. Write a function that creates a pipe and uses write to send 5 bytes, "HELLO" to the pipe. Return the read file
descriptor of the pipe.
17. Why is getting the file size with ftell not recommended? How should you do it instead?
Bibliography
CPU scheduling is the problem of efficiently selecting which process to run on a system's CPU cores. In a busy system, there will be more ready-to-run processes than there are CPU cores, so the system kernel must evaluate which processes should be scheduled to run now and which should be executed later. The system must also decide whether it should take a particular process and pause its execution, along with any associated threads. The balance comes from stopping processes often enough that you have a responsive computer, but infrequently enough that the programs themselves spend minimal time context switching. It is a hard balance to get right.
The additional complexity of multi-threading and multiple CPU cores is considered a distraction to this initial exposition, so it is ignored here. Another gotcha for non-native speakers is the dual meaning of "time": the word can refer to both a clock instant and an elapsed duration, as in "The arrival time of the first process was 9:00am" and "The running time of the algorithm is 3 seconds".
One clarification that we will make is that our scheduling deals mainly with short-term or CPU scheduling, meaning we will assume that the processes are in memory and ready to go. The other types of scheduling are long- and medium-term. Long-term schedulers act as gatekeepers to the processing world: when a process requests another process to be executed, the long-term scheduler can say yes, no, or wait. The medium-term scheduler deals with moving a process from the paused state in memory to the paused state on disk when there are too many processes, or when some processes are known to use an insignificant number of CPU cycles – think of a process that only checks something once an hour.
High Level Scheduler Overview
Schedulers are pieces of software. In fact, you can implement schedulers yourself! Given a list of commands to exec, a program can schedule them with SIGSTOP and SIGCONT, as in the sketch below. These are called user-space schedulers. Hadoop and Python's celery may do some of this user-space scheduling themselves or delegate to the operating system.
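Here is a toy, assumption-laden user-space round-robin scheduler: the command list, the one-second quantum, and the stop-before-exec trick are all invented for the sketch.

#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define NPROC 3
#define QUANTUM 1 /* seconds */

int main() {
    // Hypothetical list of commands to schedule
    char *commands[NPROC][3] = {
        {"sleep", "2", NULL}, {"sleep", "2", NULL}, {"sleep", "2", NULL}};
    pid_t pids[NPROC];

    // Launch each command stopped, so the scheduler decides who runs first
    for (int i = 0; i < NPROC; i++) {
        if ((pids[i] = fork()) == 0) {
            raise(SIGSTOP); // wait to be scheduled
            execvp(commands[i][0], commands[i]);
            exit(1); // exec failed
        }
    }

    int alive = NPROC;
    while (alive > 0) {
        for (int i = 0; i < NPROC; i++) {
            if (pids[i] == 0) continue; // already finished
            kill(pids[i], SIGCONT); // schedule: let it run one quantum
            sleep(QUANTUM);
            kill(pids[i], SIGSTOP); // preempt
            int status;
            if (waitpid(pids[i], &status, WNOHANG) == pids[i]) {
                printf("process %d finished\n", (int)pids[i]);
                pids[i] = 0;
                alive--;
            }
        }
    }
    return 0;
}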
At the operating system level, you generally have this type of flowchart, described in words first below. Note,
please don’t memorize all the states.
1. New is the initial state. A process has been requested to schedule. All process requests come from fork or
clone. At this point the operating system knows it needs to create a new process.
2. A process moves from the new state to the ready state, meaning any structs needed in the kernel have been allocated. From there, it can go into ready suspended or running.
3. Running is the state that we hope most of our processes are in, meaning they are doing useful work. A
process could either get preempted, blocked, or terminate. Preemption brings the process back to the ready
state. If a process is blocked, that means it could be waiting on a mutex lock, or it could’ve called sleep –
either way, it willingly gave up control.
4. From the blocked state, the operating system can either make the process ready again or move it into a deeper state called blocked suspended.
5. There are so-called deep slumber states called blocked suspended and blocked ready. You don’t need to
worry about these.
We will try to pick a scheme that decides when a process should move to the running state, and when it should
be moved back to the ready state. We won’t make much mention of how to factor in voluntarily blocked states
and when to switch to deep slumber states.
Measurements
Scheduling affects the performance of the system, specifically the latency and throughput of the system. The
throughput might be measured by a system value, for example, the I/O throughput - the number of bits written
per second, or the number of small processes that can complete per unit time. The latency might be measured by the response time – the elapsed time before a process starts to respond – or by the wait time or turnaround time – the elapsed time to complete a task. Different schedulers offer different optimization trade-offs that may be
appropriate for desired use. There is no optimal scheduler for all possible environments and goals. For example,
Shortest Job First will minimize total wait time across all jobs but in interactive (UI) environments it would be
preferable to minimize response time at the expense of some throughput, while FCFS seems intuitively fair and
easy to implement but suffers from the Convoy Effect. Arrival time is the time at which a process first arrives at
the ready queue, and is ready to start executing. If a CPU is idle, the arrival time would also be the starting time
of execution.
What is preemption?
Without preemption, processes run until they are unable to utilize the CPU any further. For example, the following conditions would remove a process from the CPU and leave the CPU available to be scheduled for other processes: the process terminates due to a signal, is blocked waiting for a concurrency primitive, or exits normally. Thus, once a process is scheduled, it will continue running even if another process with a higher priority appears on the ready queue.
With preemption, the existing processes may be removed immediately if a more preferred process is added
to the ready queue. For example, suppose at t=0 with a Shortest Job First scheduler there are two processes
(P1 P2) with 10 and 20 ms execution times. P1 is scheduled. P1 immediately creates a new process P3, with
execution time of 5 ms, which is added to the ready queue. Without preemption, P3 will run 10ms later (after P1
has completed). With preemption, P1 will be immediately evicted from the CPU and instead placed back in the
ready queue, and P3 will be executed instead by the CPU.
Any scheduler that doesn't use some form of preemption can result in starvation, because earlier processes may never be scheduled to run (assigned a CPU). For example, with SJF, longer jobs may never be scheduled if the system continues to receive many short jobs. It all depends on the type of scheduler. A blocked process can also become ready again and compete for the CPU when, for example:
• A process was blocked waiting for a read from storage or socket to complete and data is now available.
• A process thread was blocked on a synchronization primitive (condition variable, semaphore, mutex lock)
but is now able to continue.
• A process is blocked waiting for a system call to complete but a signal has been delivered and the signal
handler needs to run.
Measures of Efficiency
1. start_time is the wall-clock start time of the process (when the CPU starts working on it)
2. end_time is the wall-clock end time of the process (when the CPU finishes the process)
3. run_time is the total amount of CPU time the process requires
4. arrival_time is the time the process enters the scheduler (after which the CPU may start working on it)
1. Turnaround Time is the total time from when the process arrives to when it ends. end_time - arrival_time
2. Response Time is the total latency (time) that it takes from when the process arrives to when the CPU
actually starts working on it. start_time - arrival_time
3. Wait Time is the total wait time or the total time that a process is on the ready queue. A common mistake
is to believe it is only the initial waiting time in the ready queue. If a CPU intensive process with no I/O
takes 7 minutes of CPU time to complete but required 9 minutes of wall-clock time to complete we can
conclude that it was placed on the ready-queue for 2 minutes. For those 2 minutes, the process was ready
to run but had no CPU assigned. It does not matter when the job was waiting, the wait time is 2 minutes.
end_time - arrival_time - run_time
Convoy Effect
The convoy effect occurs when one process takes up a lot of CPU time, leaving all other processes, with potentially smaller resource needs, following like a convoy behind it.
Suppose the CPU is currently assigned to a CPU intensive task and there is a set of I/O intensive processes
that are in the ready queue. These processes require a tiny amount of CPU time but they are unable to proceed
because they are waiting for the CPU-intensive task to be removed from the processor. These processes are starved
until the CPU bound process releases the CPU. But, the CPU will rarely be released. For example, in the case of an
FCFS scheduler, we must wait until the CPU-bound process is blocked by an I/O request. The I/O-intensive processes can then finally satisfy their CPU needs, which they do quickly because those needs are small, and the CPU is assigned back to the CPU-intensive process again. Thus, the I/O performance of the whole system suffers through an indirect effect of starvation of the CPU needs of all processes.
This effect is usually discussed in the context of FCFS scheduler; however, a Round Robin scheduler can also
exhibit the Convoy Effect for long time-quanta.
The Completely Fair Scheduler

• The scheduler keeps a red-black tree keyed by each process's virtual runtime (runtime / nice_value), plus a sleeper-fairness flag – if the process is waiting on something, give it the CPU when it is done waiting.
• Nice values are the kernel's way of giving priority to certain processes: the lower the nice value, the higher the priority.
• The kernel chooses the process with the lowest virtual runtime and schedules it to run next, taking it off the queue. Since the red-black tree is self-balancing, this operation is guaranteed O(log(n)) (selecting the minimum process is the same runtime).
Although it is called the Fair Scheduler, there are a fair number of problems with it.
• Groups of processes that are scheduled together may have imbalanced loads, and the scheduler only roughly distributes the load. When another CPU becomes free, it can only look at the average load of a scheduling group, not at the individual cores, so the free CPU may not take work from a CPU that is burning, so long as the average is fine.
• If a group of processes is running on non-adjacent cores, there is a bug: if the two cores are more than a hop away, the load-balancing algorithm won't even consider the other core. Meaning, if a CPU is free and a CPU doing more work is more than a hop away, the free CPU won't take the work (this may have been patched).
• After a thread goes to sleep on a subset of cores, when it wakes up it can only be scheduled on the cores that
it was sleeping on. If those cores are now busy, the thread will have to wait on them, wasting opportunities
to use other idle cores.
Scheduling Algorithms
Shortest Job First (SJF)

[Gantt chart: process number (1–5) versus time in seconds (0–16)]
• P1 Arrival: 0ms
• P2 Arrival: 0ms
• P3 Arrival: 0ms
• P4 Arrival: 0ms
• P5 Arrival: 0ms
The processes all arrive at the start, and the scheduler schedules the job with the shortest total CPU time. The glaring problem is that this scheduler needs to know how long each program will run before running it.
Technical Note: A realistic SJF implementation would not use the total execution time of the process but the
burst time or the number of CPU cycles needed to finish a program. The expected burst time can be estimated by
using an exponentially decaying weighted rolling average based on the previous burst time [3, Chapter 6]. For
this exposition, we will simplify this discussion to use the total running time of the process as a proxy for the burst
time.
Advantages
Disadvantages
2. The scheduler needs to estimate the burstiness of each process, which is harder than estimating, say, traffic in a computer network
Preemptive Shortest Job First (PSJF)

• P2 at 0ms
• P1 at 1000ms
• P5 at 3000ms
• P4 at 4000ms
• P3 at 5000ms
Here's what our algorithm does. It runs P2 because it is the only process available. Then P1 comes in at 1000ms; P2's total time is 2000ms, so its remaining time equals P1's, and our scheduler preemptively stops P2 and lets P1 run all the way through – this tie-break is completely up to the algorithm, because the times are equal. Then P5 comes in; since no process is running, the scheduler runs P5. P4 comes in, and since the remaining runtimes are equal, the scheduler stops P5 and runs P4. Finally, P3 comes in, preempts P4, and runs to completion. Then P4 runs, then P5 runs.
Advantages
Disadvantages
First Come First Served (FCFS)

• P2 at 0ms
• P1 at 1000ms
• P5 at 3000ms
• P4 at 4000ms
• P3 at 5000ms
Processes are scheduled in the order of arrival. One advantage of FCFS is that the scheduling algorithm is simple: the ready queue is a FIFO (first in, first out) queue. However, FCFS suffers from the convoy effect. Here, P2 arrives, then P1, then P5, then P4, then P3. You can see the convoy effect on P5.
Advantages
Disadvantages
Round Robin
[Gantt chart: process number (1–5) versus time in seconds (0–16)]
• P1 Arrival: 0ms
• P2 Arrival: 0ms
• P3 Arrival: 0ms
• P4 Arrival: 0ms
• P5 Arrival: 0ms
Quantum = 1000ms
Here, all processes arrive at the same time. P1 is run for one quantum and is finished. P2 runs for one quantum and is then stopped for P3. After all the other processes have run for a quantum, we cycle back to P2, and so on until all the processes are finished.
Advantages
Disadvantages
This section could be useful for those who like to analyze these algorithms mathematically. If a co-worker asked you which scheduling algorithm to use, you might not have the tools to analyze each one. So let's think about scheduling algorithms at a high level and break them down by their times. We will evaluate them in the context of random process timing, meaning that each process takes a random but finite amount of time to finish.
Just as a refresher, the terms are defined above, and different use cases will be discussed afterward. Let the amount of time that a process runs be bounded by S. We will also assume that there is a finite number c of processes running at any given time. Here are some concepts from queueing theory that will help simplify the theory.
1. Queueing theory involves a random variable controlling the interarrival time – the time between two processes arriving. We won't name this random variable, but we will assume (1) that it has a mean of $\lambda$ and (2) that it is Poisson-distributed. This means the probability of getting a process $t$ units after getting another process is

$\lambda^t \frac{\exp(-\lambda)}{t!}$

where $t!$ can be approximated by the gamma function when dealing with real values.
2. We will denote the service time S, and derive the waiting time W and the response time R – more specifically, the expected values of all of those variables, E[S], E[W], and E[R]; turnaround time is then simply S + W. For clarity, we will introduce another variable, N, the number of processes currently in the queue. A famous result in queueing theory is Little's Law, which states E[N] = λE[W], meaning that the number of processes waiting is the arrival rate times the expected waiting time (assuming the queue is in a steady state).
3. We won't make many assumptions about how much time it takes to run each process, except that each takes a finite amount of time – otherwise this gets almost impossible to evaluate. We denote by $1/\mu$ the mean service time, and define the squared coefficient of variation

$C^2 = \frac{\mathrm{var}(S)}{E[S]^2}$

to help us control for processes that take a while to finish. An important note is that when $C > 1$, we say the running times of the processes are variadic; we will see below that this rockets up the wait and response times for FCFS quadratically.
4. $\rho = \frac{\lambda}{\mu} < 1$; otherwise, our queue would become infinitely long.
5. We will assume that there is one processor. This is known as an M/G/1 queue in queueing theory.
6. We'll leave the service time as an expectation E[S]; otherwise, we may run into over-simplifications with the algebra. Plus, it is easier to compare different queueing disciplines with a common factor of service time.
First Come First Served. The expected wait time is

$E[W] = \frac{1 + C^2}{2} \cdot \frac{\rho}{1 - \rho} \cdot E[S]$
What does this say? As $\rho \to 1$, i.e. the mean job arrival rate approaches the mean job processing rate, the wait times get long. Also, as the variance of the jobs increases, the wait times go up.
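For example, with exponential service times ($C = 1$) and a load of $\rho = 0.9$ (the numbers are chosen purely for illustration):

$E[W] = \frac{1 + 1}{2} \cdot \frac{0.9}{1 - 0.9} \cdot E[S] = 9\,E[S]$

so the average job waits nine times its own service time, while at $\rho = 0.5$ the same factor drops to 1.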
The response time is simple to calculate: it is the expected number of processes ahead in the queue times the expected time to service each of those processes. Using Little's Law to substitute for the queue length, and since we already know the value of the waiting time, we can reason that

$E[R] = E[N] \cdot E[S] = \lambda E[W] \cdot E[S]$
3. A discussion of the results shows something cool discovered by Conway et al. [1]: any scheduling discipline that isn't preemptive and doesn't take into account the run time of the process or a priority will have the same wait, response, and turnaround time. We will often use this as a baseline.
Round Robin or Processor Sharing. Under a strict analysis, where the processor works on every queued job simultaneously, the wait time is

$E[W] = 0$
Under a non-strict analysis of processor sharing, though, the wait time is best approximated by counting the service periods a job must wait through. A job needs about $\frac{E[S]}{Q}$ service periods, where $Q$ is the quantum, and there is about $E[N] \cdot Q$ time between those periods, leading to an average wait of

$E[W] \approx \frac{E[S]}{Q} \cdot E[N] \cdot Q = E[N] \cdot E[S]$

The reason this proof is non-rigorous is that we can't assume there will always be $E[N] \cdot Q$ time between cycles, because that depends on the state of the system; we would need to factor in various variations in processing delay. We also can't use Little's Law in this case because the system has no real steady state – otherwise, we'd be able to prove some weird things.
Interestingly, we don't have to worry about the convoy effect or any new processes coming in. The total wait time remains bounded by the number of people in the queue. For those of you familiar with tail inequalities: since processes arrive according to a Poisson distribution, the probability that we'll get many processes drops off exponentially due to Chernoff bounds (all arrivals are independent of other arrivals). Roughly, this means we can assume low variance on the number of processes. As long as the service time is reasonable on average, the wait time will be too.
Under strict processor sharing, the response time is 0 because all jobs are worked on immediately. In practice, the response time is

E[R] = E[N] \cdot Q

where Q is the quanta. Using Little's Law again, we can find that

E[R] = \lambda E[W] \cdot Q
3. A different variable of interest is the amount of service time. Let the service time for processor sharing be defined as S_PS. The slowdown is

E[S_{PS}] = \frac{E[S]}{1-\rho}

which means that as the mean arrival rate approaches the mean processing rate, the jobs will take asymptotically as long to finish. In the non-strict analysis of processor sharing, we assume that

E[S_{RR}] = E[S] + Q \cdot \varepsilon, \quad \varepsilon > 0
4. That naturally leads to the comparison: which is better? Comparing the non-strict versions, the response time is roughly the same and the wait time is roughly the same, but notice that nothing about the variation of the jobs appears. That's because RR doesn't have to deal with the convoy effect and the variance associated with it; otherwise FCFS is faster in a strict sense. Jobs also take more time to finish under RR, but the overall turnaround time is lower under high variance loads.
Non Preemptive Priority
We will introduce the notation that there are k different priorities and ρ_i > 0 is the average load contribution for priority i. We are constrained by \sum_{i=0}^{k} \rho_i = \rho. We will also denote \rho(x) = \sum_{i=0}^{x} \rho_i, which is the load contribution of all processes of higher and similar priority to x. The last bit of notation is that we will assume that the probability of getting a process of priority i is p_i, and naturally \sum_{j=0}^{k} p_j = 1.
E[W_x] = \frac{1 + C^2}{2} \cdot \frac{\rho}{(1 - \rho(x))(1 - \rho(x-1))} \cdot E[S_x]
The full derivation is, as always, in the book. A more useful inequality is

E[W_x] \le \frac{1 + C^2}{2} \cdot \frac{\rho}{(1 - \rho(x))^2} \cdot E[S_x]
because the addition of ρ_x can only increase the sum, decrease the denominator, and therefore increase the overall value. This means that if a process is priority 0, it only needs to wait for the other priority-0 processes that arrived before it, processed in FCFS order. Then the next priority has to wait for all the others, and so on and so forth.
The expected overall wait time is now

E[W] = \sum_{i=0}^{k} E[W_i] \cdot p_i
Now that we have notational soup, let's factor out the important terms: the average wait under priority scheduling scales with

\sum_{i=0}^{k} \frac{p_i}{(1 - \rho(i))^2}

whereas the FCFS wait scales with

\frac{1}{1 - \rho}
In words – you can work this out by experimenting with distributions – if the system has a lot of low priority processes that don't contribute a lot to the average load, your average wait time becomes much lower.
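For a concrete, made-up example with k = 1 and total load ρ = 0.8: suppose the high priority class carries p_0 = 0.9 of the jobs but only ρ_0 = 0.2 of the load (lots of small, urgent jobs), so ρ(0) = 0.2 and ρ(1) = 0.8. Then

\sum_{i=0}^{k} \frac{p_i}{(1-\rho(i))^2} = \frac{0.9}{(0.8)^2} + \frac{0.1}{(0.2)^2} \approx 3.9 < 5 = \frac{1}{1-\rho}

so when most of the probability mass sits on classes with a small cumulative load ρ(i), the priority scheme beats the FCFS factor.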
E[R_i] = \sum_{j=0}^{i} E[N_j] \cdot E[S_j]
This says that the scheduler needs to wait for all jobs with a higher or equal priority to go before the process can go. Imagine a series of FCFS queues in which a process needs to wait its turn. Using Little's Law for the different colored jobs and the formula above, we can simplify this to
E[R_i] = \sum_{j=0}^{i} \lambda_j E[W_j] \cdot E[S_j]
And we can find the average response time by looking at the distribution of jobs:

E[R] = \sum_{i=0}^{k} p_i \left[ \sum_{j=0}^{i} \lambda_j E[W_j] \cdot E[S_j] \right]
This means that we are tied to the wait times and service times of all other processes. If we break down this equation, we see again that if we have a lot of high priority jobs that don't contribute a lot to the load, then our entire sum goes down. We won't make too many assumptions about the service time for a job, because that would interfere with our analysis from FCFS, where we left it as an expression.
3. As for a comparison with FCFS in the average case, priority scheduling usually does better, assuming that we have a smooth probability distribution – i.e. the probability of getting any particular priority is zero. In all of our formulas, we still have some probability mass to put on lower priority processes, bringing the expectation down. This statement doesn't hold for all smooth distributions, but it does hold for most real-world distributions (which tend to be smooth).
4. This isn't even to mention the idea of utility. If we gain an amount of happiness by having certain jobs finish, priority and preemptive priority maximize that happiness while balancing out other measures of efficiency.
1. Let

\rho(x) = \int_0^x \rho_u \, du

2. The probability constraint becomes

\int_0^k p_u \, du = 1
4. The only notational difference is that we don't have to make any assumptions about the service times of the jobs, because they are denoted by subscripted service times; all other analyses are the same.
5. This means if you want low wait times on average compared to FCFS, your distribution needs to be
right-skewed.
Preemptive Priority
We will describe priority and SJF’s preemptive version in the same section because it is essentially the same as
we’ve shown above. We’ll use the same notation as before. We will also introduce an additional term Ci which
denotes the variation among a particular class
C_i^2 = \frac{var(S_i)}{E[S_i]^2}
E[R_i] = \frac{\sum_{j=0}^{i} \frac{1+C_j^2}{2}}{(1 - \rho(i))(1 - \rho(i-1))} \cdot E[S_i]
If this looks familiar, it should. This is the average wait time in the nonpreemptive case with a small change.
Instead of using the variance of the entire distribution, we are looking at the variance of each job coming in.
The whole response times are

E[R] = \sum_{i=0}^{k} p_i \cdot E[R_i]
If lower priority jobs come in with a higher service time variance, our average response times could still go down, unless those jobs make up most of the jobs that come in. Think of the extreme cases. If 99% of the jobs are high priority and the rest make up the other percent, then the low priority jobs will get frequently interrupted, but high priority jobs make up most of the jobs, so the expectation is still low. The other extreme is if one percent of jobs are high priority and they come in with a low variance. That means the chance of the system getting a high priority job that will take a long time is low, keeping our response times lower on average. We only run into trouble if high priority jobs make up a non-negligible amount and they have a high variance in service times. That drives up response times as well as wait times.
2. Waiting Time

E[W_i] = E[R_i] + \frac{E[S_i]}{1 - \rho(i)}

E[W] = \sum_{i=0}^{k} p_i \left( E[R_i] + \frac{E[S_i]}{1 - \rho(i)} \right)
We can simplify to

E[W] = E[R] + \sum_{i=0}^{k} \frac{E[S_i] \, p_i}{1 - \rho(i)}
We incur the same cost on response time, and then we have to suffer an additional cost based on the probability of other jobs coming in and kicking this job out. That is what we call the average interruption time. This follows the same laws as before. Since we have a variadic, pyramid summation, if we have a lot of jobs with small service times, then the wait time goes down for both additive pieces. It can be analytically shown that this is better given certain probability distributions. For example, try the uniform distribution versus FCFS or the non-preemptive version. What happens? As always, the proof is left to the reader.
3. Turnaround time is the same formula, E[T] = E[S] + E[W]. This means that given a distribution of jobs that has low waiting time as described above, we will get low turnaround time – we can't control the distribution of service times.
Preemptive Shortest Job First
Unfortunately, we can’t use the same trick as before because an infinitesimal point doesn’t have a controlled
variance. Imagine the comparisons though as the same as the previous section.
Topics
• Scheduling Algorithms
• Measures of Efficiency
Questions
• What is scheduling?
• Do preemptive algorithms do better on average response time compared to non preemptive? How about
turnaround/wait time?
Bibliography
[1] R.W. Conway, W.L. Maxwell, and L.W. Miller. Theory of scheduling. Addison-Wesley Pub. Co., 1967. URL
https://books.google.com/books?id=CSozAAAAMAAJ.
[2] M. Harchol-Balter. Performance Modeling and Design of Computer Systems: Queueing Theory in Action.
Performance Modeling and Design of Computer Systems: Queueing Theory in Action. Cambridge University
Press, 2013. ISBN 9781107027503. URL https://books.google.com/books?id=75SbigDGK0kC.
[3] A. Silberschatz, P.B. Galvin, and G. Gagne. Operating System Concepts. Wiley, 2005. ISBN 9780471694663.
URL https://books.google.com/books?id=FH8fAQAAIAAJ.
Networking has become arguably the most important use of computers in the past 10-20 years. Most of us
nowadays can’t stand a place without WiFi or any connectivity, so it is crucial as programmers that you have
an understanding of networking and how to program to communicate across networks. Although it may sound
complicated, POSIX has defined nice standards that make connecting to the outside world easy. POSIX also lets
you peer underneath the hood and optimize all the little parts of each connection to write highly performant
programs.
As an addendum that you'll read more about in the next chapter, we will be strict in our notation for sizes. That means that when we refer to the SI prefixes Kilo-, Mega-, etc., we are always referring to a power of 10. A kilobyte is one thousand bytes, a megabyte is a thousand kilobytes, and so on. If we need to refer to 1024 bytes, we will use the more accurate term Kibibyte. Mebibyte and Gibibyte are the analogs of Megabyte and Gigabyte respectively. We make this distinction to make sure that we aren't off by 24. The reasons for this misnomer will be explained in the filesystems chapter.
The Open Systems Interconnection 7 layer model (OSI Model) is a sequence of segments that define standards for both infrastructure and protocols for forms of telecommunication, in our case the Internet. The 7 layer model is as follows.
1. Layer 1: The Physical Layer. These are the actual waves that carry the bauds across the wire. As an aside,
bits don’t cross the wire because in most mediums you can alter two characteristics of a wave – the amplitude
and the frequency – and get more bits per clock cycle.
2. Layer 2: The Link Layer. This is how each of the agents reacts to certain events (error detection, noisy
channels, etc). This is where Ethernet and WiFi live.
3. Layer 3: The Network Layer. This is the heart of the Internet. The bottom two layers deal with communication between two different computers that are directly connected. This layer deals with routing packets from one endpoint to another.
4. Layer 4: The Transport Layer. This layer specifies how the slices of data are received. The bottom three
layers make no guarantee about the order that packets are received and what happens when a packet is
dropped. Using different protocols, this layer can.
5. Layer 5: The Session Layer. This layer makes sure that if a connection in the previous layers is dropped, a
new connection in the lower layers can be established, and it looks like nothing happened to the end-user.
6. Layer 6: The Presentation Layer. This layer deals with encryption, compression, and data translation. For
example, portability between different operating systems like translating newlines to windows newlines.
7. Layer 7: The Application Layer. Hyper Text Transfer Protocol and File Transfer Protocol are both defined
at this level. This is typically where we define protocols across the Internet. As programmers, we only go
lower when we think we can create algorithms that are more suited to our needs than all of the below.
This book won’t cover networking in depth. We will focus on some aspects of layers 3, 4, and 7 because they
are essential to know if you are going to be doing something with the Internet, which at some point in your
career you will be. As for another definition, a protocol is a set of specifications put forward by the Internet
Engineering Task Force that govern how implementers of a protocol have their program or circuit behave under
specific circumstances.
The following is a short introduction to the Internet Protocol (IP), the primary way to send datagrams of information from one machine to another. “IPv4”, or more precisely Internet Protocol Version 4, describes how to send packets of information across a network from one machine to another. Even as of 2018, IPv4 still dominates Internet traffic, but Google reports that 24 countries now supply 15% of their traffic through IPv6 [2]. A significant limitation of IPv4 is that source and destination addresses are limited to 32 bits. IPv4 was designed at a time when the idea of 4 billion devices connected to the same network was unthinkable, or at least not worth making the packet size larger for. IPv4 addresses are typically written as a sequence of four octets delimited by periods, "255.255.255.0" for example.
Each IPv4 datagram includes a small header – typically 20 octets – that includes a source and destination address. Conceptually, the source and destination addresses can be split into two parts: the upper bits represent a network number, and the lower bits represent a particular host number on that network.
A newer packet protocol, Internet Protocol Version 6, solves many of the limitations of IPv4, such as simpler routing tables and 128-bit addresses. However, comparatively little web traffic is IPv6 as of 2018 [2]. We write IPv6 addresses as a sequence of eight groups of four hexadecimal digits, delimited by colons, like "1F45:0000:0000:0000:0000:0000:0000:0000". Since that can get unruly, we can omit the zeros: "1F45::". A machine can have both an IPv6 address and an IPv4 address.
There are special IP addresses. One such address in IPv4 is 127.0.0.1, written in IPv6 as 0:0:0:0:0:0:0:1 or ::1, also known as localhost. Packets sent to 127.0.0.1 will never leave the machine; the address is specified to be the same machine. There are a lot of other special addresses, denoted by certain octets being zeros or 255, the maximum value. You won't need to know all the terminology; just keep in mind that the actual number of IP addresses a machine can have globally over the Internet is smaller than the number of “raw” addresses. This book covers how IP deals with routing, fragmenting, and reassembling upper-level protocols. A more in-depth aside follows.
Extra: In-depth IPv4 Specification
The Internet Protocol deals with routing, fragmentation, and reassembly of fragments. Datagrams are formatted
as such
[Figure: IPv4 datagram header layout – Version, Header Length, Service Type, Total Length, Source Address, Destination Address, Options, Padding]
2. The next octet is how long the header is. Although it may seem that the header is a constant size, you can
include optional parameters to augment the path that is taken or other instructions.
3. The next two octets specify the total length of the datagram. This means this is the header, the data, the
footer, and the padding. This is given in multiple of octets, meaning that a value of 20 means 20 octets.
4. The next two octets are the identification number. IP handles taking packets that are too big to be sent over the physical wire and chunks them up. As such, this number identifies which datagram this fragment originally belonged to.
6. The next octet and a half is the fragment number. If this packet was fragmented, this is the number that this fragment represents.
7. The next octet is the time to live: the number of "hops" (trips over a wire) a packet is allowed to make. This is set because different routing protocols could cause packets to go in circles, so the packets must be dropped at some point.
8. The next octet is the protocol number. Although protocols between different layers of the OSI model are supposed to be black boxes, this is included so that hardware can peer into the underlying protocol efficiently. Take for example IP over IP (yes, you can do that!). Your ISP wraps IPv4 packets sent from your computer to the ISP in another IP layer and sends the packet off to be delivered to the website. On the reverse trip, the packet is "unwrapped" and the original IP datagram is sent to your computer. This was done because we ran out of IP addresses; it adds additional overhead, but it is a necessary fix. Other common protocols are TCP, UDP, etc.
9. The next two octets are an Internet checksum, calculated to make sure that a wide variety of bit errors are detected.
10. The source address is what people generally refer to as the IP address. There is no verification of this, so one host can pretend to be any IP address possible.
11. The destination address is where you want the packet to be sent to. Destinations are crucial to the routing
process.
13. Footer: A bit of padding to make sure your data is a multiple of 4 octets.
14. After: Your data! All data of higher-order protocols are put following the header.
Extra: Routing
The Internet Protocol routing is an amazing intersection of theory and application. We can imagine the entire
Internet as a set of graphs. Most peers are connected to what we call "peering points" – these are the WiFi routers
and Ethernet ports that one finds at home, at work, and in public. These peering points are then connected to a
wired network of routers, switches, and servers that all route themselves. At a high level there are two types of
routing
1. Internal Routing Protocols. Internal protocols handle routing within an ISP's network – communication between two routers owned by the same ISP. These protocols are meant to be fast and more trusting, because all computers, switches, and routers are part of the ISP.
2. External Routing Protocols. These are typically ISP-to-ISP protocols. Certain routers are designated as border routers. These routers talk to routers from ISPs that have different policies for accepting or receiving packets. If an evil ISP is trying to dump all network traffic onto your ISP, these border routers would deal with that. These protocols also deal with gathering information about the outside world for each router. In most routing protocols using link state, such as OSPF, a router must calculate the shortest path to the destination. This means it needs information about the "foreign" routers, which is disseminated according to these protocols.
These two types of protocols have to interplay with each other nicely to make sure that packets are mostly delivered. Also, ISPs need to be nice to each other. Theoretically, an ISP can handle a smaller load by forwarding all packets to another ISP. If everyone does that, then no packets get delivered at all, which won't make customers happy. These protocols need to be fair so that the whole system works.
If you want to read more about this, look at the Wikipedia page on Routing.
Extra: Fragmentation/Reassembly
Lower layers like WiFi and Ethernet have maximum transmission sizes. The reasons are as follows.
2. If an error occurs, we want some sort of "progress bar" on how far the communication has gone instead of
retransmitting the entire stream.
3. There are physical limitations, keeping a laser beam in optics working continuously may cause bit errors.
If the Internet Protocol receives a packet that is too big for the maximum size, it must chunk it up. TCP calculates how many datagrams it needs to construct a packet and ensures that they are all transmitted and reconstructed at the end receiver. The reason that we barely use this feature is that if any fragment is lost, the entire packet is lost. Assuming each fragment is lost independently with some probability, the probability of successfully sending a packet drops off exponentially as packet size increases.
As such, TCP slices its packets so that each fits inside one IP datagram. The only time fragmentation applies is when sending UDP packets that are too big, but most people who use UDP optimize and set an appropriate packet size as well.
Extra: IP Multicast
A little-known feature is that using the IP protocol, one can send a datagram to all devices connected to a router in what is called a multicast. Multicasts can also be configured with groups, so one can slice up all connected routers and send a piece of information to all of them efficiently. To access this in a higher protocol, you need to use UDP and specify a few more options. Note that this will cause undue stress on the network, so a series of multicasts could flood the network fast.
What’s the deal with IPv6?
[Figure: IPv6 datagram header layout – Source Address, Destination Address]
One of the big features of IPv6 is the address space. The world ran out of IP addresses a while ago and has been
using hacks to get around that. With IPv6 there are enough internal and external addresses so even if we discover
alien civilizations, we probably won’t run out. The other benefit is that these addresses are leased not bought,
meaning that if something drastic happens in let’s say the Internet of things and there needs to be a change in the
block addressing scheme, it can be done.
Another big feature is security through IPsec. IPv4 was designed with little to no security in mind. As such,
now there is a key exchange similar to TLS in higher layers that allows you to encrypt communication.
Another feature is simplified processing. To make the Internet fast, IPv4 and IPv6 headers are verified in
hardware. That means that all header options are processed in circuits as they come in. The problem is that as the
IPv4 spec grew to include a copious amount of headers, the hardware had to become more and more advanced to
support those headers. IPv6 reorders the headers so that packets can be dropped and routed with fewer hardware
cycles. In the case of the Internet, every cycle matters when trying to route the world’s traffic.
What’s My Address?
To obtain a linked list of IP addresses of the current machine use getifaddrs which will return a linked list of
IPv4 and IPv6 IP addresses among other interfaces as well. We can examine each entry and use getnameinfo
to print the host’s IP address. The ifaddrs struct includes the family but does not include the sizeof the struct.
Therefore we need to manually determine the struct sized based on the family.
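A sketch of that approach is below; this is our own minimal version, with error handling abbreviated.

#include <stdio.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <ifaddrs.h>
#include <netdb.h>
#include <netinet/in.h>

int main(void) {
    struct ifaddrs *head, *ifa;
    if (getifaddrs(&head) == -1) {
        perror("getifaddrs");
        return 1;
    }
    for (ifa = head; ifa; ifa = ifa->ifa_next) {
        if (!ifa->ifa_addr)
            continue;
        int family = ifa->ifa_addr->sa_family;
        if (family != AF_INET && family != AF_INET6)
            continue;
        // The struct size must be derived from the family by hand
        socklen_t len = (family == AF_INET) ? sizeof(struct sockaddr_in)
                                            : sizeof(struct sockaddr_in6);
        char host[NI_MAXHOST];
        if (getnameinfo(ifa->ifa_addr, len, host, sizeof(host),
                        NULL, 0, NI_NUMERICHOST) == 0)
            printf("%s\t%s\n", ifa->ifa_name, host);
    }
    freeifaddrs(head);
    return 0;
}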
To get your IP address from the command line, use ifconfig (or ipconfig on Windows). However, this command generates a lot of output for each interface, so we can filter the output using grep.
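For example, one possible invocation (the exact output format depends on your platform's ifconfig):

ifconfig | grep inet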
Example output:
inet6 fe80::1%lo0 prefixlen 64 scopeid 0x1
inet 127.0.0.1 netmask 0xff000000
inet6 ::1 prefixlen 128
inet6 fe80::7256:81ff:fe9a:9141%en1 prefixlen 64 scopeid 0x5
inet 192.168.1.100 netmask 0xffffff00 broadcast 192.168.1.255
To grab the IP address of a remote website, the function getaddrinfo can convert a human-readable domain name (e.g. www.illinois.edu) into an IPv4 or IPv6 address. It will return a linked list of addrinfo structs:
struct addrinfo {
int ai_flags;
int ai_family;
int ai_socktype;
int ai_protocol;
socklen_t ai_addrlen;
struct sockaddr *ai_addr;
char *ai_canonname;
struct addrinfo *ai_next;
};
For example, suppose you wanted to find out the numeric IPv4 address of a web server at www.bbc.com.
We do this in two stages. First, use getaddrinfo to build a linked-list of possible connections. Secondly, use
getnameinfo to convert the binary address of one of those into a readable form.
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <netdb.h>
int main() {
    struct addrinfo hints = {0}, *infoptr, *p;
    hints.ai_family = AF_INET; // AF_INET means IPv4 only addresses
    // Stage 1: build a linked list of candidate addresses
    int result = getaddrinfo("www.bbc.com", NULL, &hints, &infoptr);
    if (result) {
        fprintf(stderr, "getaddrinfo: %s\n", gai_strerror(result));
        exit(1);
    }
    // Stage 2: convert each binary address into readable form
    char host[256];
    for (p = infoptr; p != NULL; p = p->ai_next) {
        getnameinfo(p->ai_addr, p->ai_addrlen, host, sizeof(host), NULL, 0, NI_NUMERICHOST);
        puts(host);
    }
    freeaddrinfo(infoptr);
    return 0;
}
Possible output.
212.58.244.70
212.58.244.71
One can accept either IPv4 or IPv6 with AF_UNSPEC. Just replace the ai_family attribute in the above code with the following.

hints.ai_family = AF_UNSPEC;
If you are wondering how the computer maps a hostname to an address, we will talk about that in Layer 7. Spoiler: it is a service called the Domain Name Service. Before we move on to the next section, it is important to note that a single website can have multiple IP addresses. This may be done to be efficient with machines. If Google or Facebook had a single server routing all of their incoming requests to other computers, they'd have to spend massive amounts of money on that computer or data center. Instead, they can give different regions different IP addresses and let a computer pick one. It isn't bad to access a website through a non-preferred IP address; the page may just load slower.
Layer 4: TCP and Client
[Figure: TCP header layout – Source Port, Destination Port, Sequence Number, Acknowledgment Number, Options, Padding]
Most services on the Internet today use the Transmission Control Protocol because it efficiently hides the complexity of the lower, packet-level nature of the Internet. TCP is a connection-based protocol built on top of IPv4 and IPv6, and therefore is often described as “TCP/IP” or “TCP over IP”. TCP creates a pipe between two machines and abstracts away the low-level packet nature of the Internet. Thus, under most conditions, bytes sent over a TCP connection are delivered and uncorrupted – though robust, high-performance code won't even assume that!
TCP has many features that set it apart from the other transport protocol UDP.
1. Port. With IP alone, you are only allowed to send packets to a machine. If you want one machine to handle multiple flows of data, you have to do it manually with IP. TCP gives the programmer a set of virtual sockets. Clients specify the port that they want a packet sent to, and the TCP protocol makes sure that applications waiting for packets on that port receive them. A process can listen for incoming packets on a particular port. However, only processes with root access can listen on ports less than 1024. Any process can listen on ports 1024 or higher. A frequently used port is number 80: it is used for unencrypted HTTP requests, or web pages. For example, if a web browser connects to http://www.bbc.com/, then it will be connecting to port 80.
2. Packets can get dropped due to network errors or congestion. As such, they need to be retransmitted. At the same time, the retransmission shouldn't cause even more packets to be dropped. This needs to balance the tradeoff between flooding the network and speed.
3. Out of order packets. Packets may get routed more favorably due to various reasons in IP. If a later packet
arrives before another packet, the protocol should detect and reorder them.
4. Duplicate packets. Packets can arrive twice. As such, a protocol needs to be able to differentiate between two packets given a sequence number, subject to overflow.
5. Error correction. There is a TCP checksum that handles bit errors. This is rarely used though.
6. Flow Control. Flow control is performed on the receiver side. This is done so that a slow receiver doesn't get overwhelmed with packets. Servers that handle 10,000 or 10 million concurrent connections may need to tell senders to slow down but remain connected, due to load. There is also the problem of making sure the local network's traffic is stable.
7. Congestion control. Congestion control is performed on the sender's side. Its purpose is to keep a sender from flooding the network with too many packets. This is important to make sure that each TCP connection is treated fairly, meaning that two connections leaving a computer to google and youtube receive the same bandwidth and ping as each other. One could easily define a protocol that takes all the bandwidth and leaves other protocols in the dust, but this tends to be malicious, because many times limiting a computer to a single TCP connection will yield the same result.
8. Connection-oriented/lifecycle-oriented. You can imagine a TCP connection as a series of bytes sent through a pipe. There is a “lifecycle” to a TCP connection, though. TCP handles setting up the connection through SYN, SYN-ACK, ACK. The client sends a SYNchronization packet that tells TCP what starting sequence number to use. Then the receiver sends a SYN-ACK message acknowledging the synchronization number. Finally, the client ACKnowledges that with one last packet, and the connection is open for both reading and writing on both ends. TCP will send data, and the receiver of the data will acknowledge that it received a packet. Every so often, if a packet has not been sent, TCP will trade zero-length packets to make sure the connection is still alive. At any point, the client or server can send a FIN packet, meaning that the sender of the FIN will no longer transmit. This packet can be altered with bits that only close the read or write end of a particular connection. When all ends are closed, the connection is over.
1. Security. Connecting to an IP address claiming to be a certain website does not verify the claim (unlike TLS, which does). You could be sending packets to a malicious computer.
2. Encryption. Anybody can listen in on plain TCP. The packets in transport are in plain text. Important things
like your passwords could easily be skimmed by onlookers.
3. Session Reconnection. If a TCP connection dies then a whole new one must be created, and the transmission
has to be started over again. This is handled by a higher protocol.
4. Delimiting Requests. TCP is naturally connection-oriented. Applications communicating over TCP need to find a unique way of telling each other that a request or response is over. HTTP delimits the header with two carriage returns, and for the body either uses a length field or keeps listening until the connection closes.
TCP Client
There are three basic system calls to connect to a remote machine.
1. int getaddrinfo(const char *node, const char *service, const struct addrinfo *hints, struct addrinfo **res);

The getaddrinfo call, if successful, creates a linked list of addrinfo structs and sets the given pointer to point to the first one.
Also, you can use the hints struct to only grab certain entries, like certain IP protocols. The addrinfo structure passed into getaddrinfo defines the kind of connection you'd like. For example, to specify stream-based protocols over IPv6, you can use the following snippet.
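A minimal sketch of such a hints setup:

struct addrinfo hints;
memset(&hints, 0, sizeof(hints)); // requires <string.h>
hints.ai_family = AF_INET6;       // only IPv6 entries
hints.ai_socktype = SOCK_STREAM;  // stream-based protocols (TCP)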
The other modes for ai_family are AF_INET (IPv4) and AF_UNSPEC (unspecified). The latter could be useful if you are searching for a service and aren't entirely sure which IP version it uses. Naturally, you get the actual version back in the ai_family field of the results if you specified AF_UNSPEC.
Error handling with getaddrinfo is a little different. The return value is the error code. To convert to a
human-readable error use gai_strerror to get the equivalent short English error text.
2. int socket(int domain, int socket_type, int protocol);

Create a socket endpoint, exactly as described in the server section below.

3. int connect(int sockfd, const struct sockaddr *addr, socklen_t addrlen);

Connect to the remote host:

// Pull out the socket address info from the addrinfo struct:
connect(sockfd, p->ai_addr, p->ai_addrlen);
4. (Optional) To clean up code call freeaddrinfo(struct addrinfo *ai) on the first level addrinfo
struct.
There is an old function, gethostbyname, that is deprecated. It's the old way to convert a hostname into an IP address, and the port still needs to be manually set using the htons function. It's much easier to write code supporting both IPv4 and IPv6 using the newer getaddrinfo.
This is all that is needed to create a simple TCP client. However, network communications offer many different levels of abstraction and several attributes and options that can be set at each level. For example, we haven't talked about setsockopt, which can manipulate options for the socket. You can also mess around with lower-level protocols, as the kernel provides primitives for this. Note that you need to be root to create a raw socket, you need a lot of “set up” or starter code, and you should be prepared to have your datagrams dropped due to bad form as well. For more information, see this guide.
Sending some data
Once we have a successful connection we can read or write like any old file descriptor. Keep in mind if you are
connected to a website, you want to conform to the HTTP protocol specification to get any sort of meaningful
results back. There are libraries to do this. Usually, you don’t connect at the socket level. The number of bytes
read or written may be smaller than expected. Thus, it is important to check the return value of read and write.
A simple HTTP client that sends a request to a compliant URL is below. First, we’ll start with the boring stuff and
the parsing code.
return 0;
}
The code that sends the request is below. The first thing that we have to do is connect to an address.
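The original listing is elided here; a minimal sketch of that step might look like the following, where info->host and info->port are hypothetical fields filled in by the parsing code above (error checking omitted):

struct addrinfo hints = {0}, *result;
hints.ai_family = AF_UNSPEC;     // IPv4 or IPv6
hints.ai_socktype = SOCK_STREAM; // TCP
getaddrinfo(info->host, info->port, &hints, &result);
int sock_fd = socket(result->ai_family, result->ai_socktype, result->ai_protocol);
connect(sock_fd, result->ai_addr, result->ai_addrlen);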
freeaddrinfo(result);
The next piece of code sends the request. Here is what each header means.
1. "GET %s HTTP/1.0" This is the request verb interpolated with the path. This means to perform the GET
verb on the path using the HTTP/1.0 method.
2. "Connection: close" Means that as soon as the request is over, please close the connection. This line won’t
be used for any other connections. This is a little redundant given that HTTP 1.0 doesn’t allow you to send
multiple requests, but it is better to be explicit given there are non-conformant technologies.
3. "Accept: */*" This means that the client is willing to accept anything.
A more robust piece of code would also check if the write fails or if the call was interrupted.
char *buffer;
asprintf(&buffer,
"GET %s HTTP/1.0\r\n"
"Connection: close\r\n"
"Accept: */*\r\n\r\n",
info->resource);
The last piece of code is the driver code that sends the request. Feel free to use the following code if you
want to open the file descriptor as a FILE object for convenience functions. Just be careful not to forget to set the
buffering to zero otherwise you may double buffer the input, which would lead to performance problems.
ret = handle_okay(sock_file);
fclose(sock_file);
close(sock_fd);
}
The example above demonstrates a request to the server using the HyperText Transfer Protocol. In general,
there are six parts
6. The actual body of the request, delimited by two new lines. The body continues either for the length specified in a header or until the sender closes the connection.
The server’s first response line describes the HTTP version used and whether the request is successful using a
3 digit response code.
HTTP/1.1 200 OK
If the client had requested a non-existent path, e.g. GET /nosuchfile.html HTTP/1.0, then the first line includes the well-known 404 response code.
For more information, RFC 7231 has the most current specifications on the most common HTTP method today
[4].
The four system calls required to create a minimal TCP server are socket, bind, listen, and accept. Each has a specific purpose and should be called in roughly that order.
1. int socket(int domain, int socket_type, int protocol)
To create an endpoint for networking communication. A new socket by itself does very little: though we've specified either a packet- or stream-based connection, it is unbound to a particular network interface or port. Instead, socket returns a network descriptor that can be used with later calls to bind, listen, and accept. As one gotcha, these sockets must be declared passive. Passive server sockets do not connect out to another host; instead, they wait for incoming connections. Additionally, server sockets remain open when a peer disconnects. Instead, the client communicates with a separate active socket on the server that is specific to that connection.
Since a TCP connection is defined by the sender address and port along with the receiver address and port, for a particular server port there can be one passive server socket but multiple active sockets – one for each currently open connection. The server's operating system maintains a lookup table that associates a unique tuple with each active socket, so that incoming packets can be routed to the correct socket.
int optval = 1;
setsockopt(sfd, SOL_SOCKET, SO_REUSEPORT, &optval,
sizeof(optval));
bind(...);
We’ve already seen getaddrinfo that can build a linked list of addrinfo entries and each one of these
can include socket configuration data. What if we wanted to turn socket data into IP and port addresses?
Enter getnameinfo that can be used to convert local or remote socket information into a domain name
or numeric IP. Similarly, the port number can be represented as a service name. For example, port 80 is
commonly used as the incoming connection port for incoming HTTP requests. In the example below, we
request numeric versions for the client IP address and client port number.
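The example itself is elided here; a minimal sketch, assuming addr and addrlen were filled in by an earlier accept call:

char host[NI_MAXHOST], port[NI_MAXSERV];
int result = getnameinfo((struct sockaddr *) &addr, addrlen,
                         host, sizeof(host), port, sizeof(port),
                         NI_NUMERICHOST | NI_NUMERICSERV);
if (result == 0)
    printf("Client: %s port %s\n", host, port);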
One can use the macros NI_MAXHOST to denote the maximum length of a hostname, and NI_MAXSERV to
denote the maximum length of a port. NI_NUMERICHOST gets the hostname as a numeric IP address and
similarly NI_NUMERICSERV for the port, although the port is usually numeric to begin with. The OpenBSD man pages have more information.
• bind uses the socket descriptor of the passive server socket (described above).
• The bind call will fail if the port is currently in use. Ports are per machine – not per process or user. In other words, you cannot use port 1234 while another process is using that port. Worse, ports are by default ‘tied up’ after a process has finished.
Example Server
A working simple server example is shown below. Note: this example is incomplete. For example, the socket file
descriptor remains open and memory created by getaddrinfo remains allocated. First, we get the address info
for our current machine.
if (listen(sock_fd, 10) != 0) {
perror("listen()");
exit(1);
}
We are finally ready to listen for connections, so we’ll tell the user and accept our first client.
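That listing is elided here; a minimal sketch, where client_fd is the descriptor used below:

printf("Waiting for a connection...\n");
struct sockaddr_storage client_addr;
socklen_t addr_size = sizeof(client_addr);
int client_fd = accept(sock_fd, (struct sockaddr *) &client_addr, &addr_size);
if (client_fd == -1) {
    perror("accept()");
    exit(1);
}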
After that, we can treat the new file descriptor as a stream of bytes much like a pipe.
char buffer[1000];
// Could get interrupted; check the return value before using it
ssize_t len = read(client_fd, buffer, sizeof(buffer) - 1);
if (len == -1) {
    perror("read()");
    exit(1);
}
buffer[len] = '\0';
Sorry To Interrupt
One concept that we need to make clear is that you need to handle interrupts in your networking code. That means the reads and writes on your sockets or accepted file descriptors may have their calls interrupted – most of the time you will get an interrupt or two. In reality, any of your system calls could get interrupted. The reason we bring this up now is that you are usually waiting on the network, which is orders of magnitude slower than the processor, meaning a higher probability of getting interrupted mid-call.
How would you handle interrupts? Let’s try a quick example.
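One common pattern – a sketch, not the only approach – is to retry the call while errno is EINTR (requires <errno.h>):

ssize_t nread;
do {
    // Restart the read if a signal interrupted it before any data arrived
    nread = read(client_fd, buffer, sizeof(buffer) - 1);
} while (nread == -1 && errno == EINTR);
if (nread == -1)
    perror("read()"); // a real error, not an interruption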
Layer 4: UDP
UDP is a connectionless protocol that is built on top of IPv4 and IPv6. It’s simple to use. Decide the destination
address and port and send your data packet! However, the network makes no guarantee about whether the
packets will arrive. Packets may be dropped if the network is congested. Packets may be duplicated or arrive out
of order.
A typical use case for UDP is when receiving up-to-date data is more important than receiving all of the data. For example, a game may send continuous updates of player positions, and a streaming video signal may send picture updates using UDP.
UDP Attributes
• Unreliable Datagram Protocol Packets sent through UDP may be dropped on their way to the destination.
This can especially be confusing because if you only test on your loop-back device – this is localhost or
127.0.0.1 for most users – then packets will seldom be lost because no network packets are sent.
• Simple The UDP protocol is supposed to have much less fluff than TCP. Where TCP has a lot of configurable parameters and a lot of edge cases in the implementation, UDP is fire and forget.
• Stateless/Transaction The UDP protocol is stateless. This makes the protocol simpler and lets it represent simple transactions like requesting or responding to queries. There is also less overhead to sending a UDP message because there is no three-way handshake.
• Manual Flow/Congestion Control You have to manually manage flow and congestion control, which is a double-edged sword. On one hand, you have full control over everything. On the other hand, TCP has decades of optimization, meaning your protocol needs to be that much more efficient for its use case for rolling your own to be beneficial.
• Multicast This is one thing that you can only do with UDP. This means that you can send a message to every
peer connected to a particular router that is part of a particular group.
The previous code grabs a hostent entry that matches by hostname. Even though this isn't portable, it gets the job done. The first step is to create the socket and make it reusable – the same as for a TCP socket. Note that we pass SOCK_DGRAM instead of SOCK_STREAM.
Then, we can copy over our hostent struct into the sockaddr_in struct. Full definitions are provided in
the man pages so it is safe to copy them over.
A final useful feature of UDP is that we can time out receiving a packet, as opposed to TCP, because UDP isn't connection-oriented. The snippet to do that is below.
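A minimal sketch, with a hypothetical one-second timeout:

struct timeval tv = { .tv_sec = 1, .tv_usec = 0 };
setsockopt(sock_fd, SOL_SOCKET, SO_RCVTIMEO, &tv, sizeof(tv));
// From now on, recvfrom fails with EAGAIN/EWOULDBLOCK if no packet
// arrives within one second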
Now, the socket is connected and ready to use. We can use sendto to send a packet. We should also check
the return value. Note that we won’t get an error if the packet isn’t delivered because that is a part of the UDP
protocol. We will, however, get error codes for invalid structs, bad addresses, etc.
char *to_send = "Hello!";
int send_ret = sendto(sock_fd, // Socket
to_send, // Data
strlen(to_send), // Length of data
0, // Flags
(struct sockaddr *)&ipaddr, // Address
sizeof(ipaddr)); // How long the address is
The above code simply sends “Hello!” over UDP. There is no indication of whether the packet arrives, is processed, etc.
UDP Server
There are a variety of function calls available to send UDP sockets. We will use the newer getaddrinfo to
help set up a socket structure. Remember that UDP is a simple packet-based (‘datagram’) protocol. There is no
connection to set up between the two hosts. First, initialize the hints addrinfo struct to request an IPv6, passive
datagram socket.
memset(&hints, 0, sizeof(hints));
hints.ai_family = AF_INET6;
hints.ai_socktype = SOCK_DGRAM;
hints.ai_flags = AI_PASSIVE;
Next, use getaddrinfo to specify the port number. We don't need to specify a host, as we are creating a server socket, not sending a packet to a remote host. Be careful not to pass “localhost” or any other synonym for the loop-back address; we may end up trying to passively listen to ourselves, resulting in bind errors.
The port number in the example is less than 1024, so the program will need root privileges. We could have also specified a service name instead of a numeric port value.
So far, the calls have been similar to a TCP server. For a stream-based service, we would call listen and
accept. For our UDP-server, the program can start waiting for the arrival of a packet.
The addr struct will hold the sender (source) information about the arriving packet. Note the sockaddr_storage
type is sufficiently large enough to hold all possible types of socket addresses – IPv4, IPv6 or any other Internet
Protocol. The full UDP server code is below.
#include <string.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <netdb.h>
#include <unistd.h>
#include <arpa/inet.h>

int main() {
    // Passive IPv6 datagram socket, as described above
    struct addrinfo hints, *result;
    memset(&hints, 0, sizeof(hints));
    hints.ai_family = AF_INET6;
    hints.ai_socktype = SOCK_DGRAM;
    hints.ai_flags = AI_PASSIVE;
    getaddrinfo(NULL, "300", &hints, &result); // port < 1024 needs root

    int sockfd = socket(result->ai_family, result->ai_socktype, result->ai_protocol);
    bind(sockfd, result->ai_addr, result->ai_addrlen);

    struct sockaddr_storage addr;
    socklen_t addrlen = sizeof(addr);
    while (1) {
        char buf[1024];
        ssize_t byte_count = recvfrom(sockfd, buf, sizeof(buf) - 1, 0,
                                      (struct sockaddr *) &addr, &addrlen);
        buf[byte_count] = '\0';
        printf("Read %zd chars: %s\n", byte_count, buf);
    }
    return 0;
}
Note that if you perform a partial read from a packet, the rest of that data is discarded; one call to recvfrom reads one packet. To make sure that you have enough space, use 64 KiB of storage, enough for the largest possible UDP datagram.
Layer 7: HTTP
Layer 7 of the OSI model deals with application-level interfaces, meaning that you can ignore everything below this layer and treat the Internet as a way of communicating with another computer that can be made secure and whose session may reconnect. Common layer 7 protocols are the following.
1. HTTP(S) - HyperText Transfer Protocol. Sends arbitrary data and executes remote actions on a web server. The S stands for secure, where the TCP connection uses the TLS protocol to ensure that the communication can't be read easily by an onlooker.
2. FTP - File Transfer Protocol. Transfers a file from one computer to another
3. TFTP - Trivial File Transfer Protocol. Same as above but using UDP.
5. SMTP - Simple Mail Transfer Protocol. Allows one to send plain text emails to an email server
6. SSH - Secure SHell. Allows one computer to connect to another computer and execute commands remotely.
9. NTP - Network Time Protocol. This protocol helps keep your computer’s clock synced with the outside world
What’s my name?
Remember when we were talking before about converting a website name to an IP address? A system called “DNS” (Domain Name Service) is used. If the IP address is missing from a machine's cache, then the machine sends a UDP packet to a local DNS server. This server may query other upstream DNS servers.
DNS by itself is fast but insecure. DNS requests are unencrypted and susceptible to ‘man-in-the-middle’ attacks. For example, a coffee shop internet connection could easily subvert your DNS requests and send back different IP addresses for a particular domain. The way this is usually mitigated is that after the IP address is obtained, a connection is made over HTTPS. HTTPS uses TLS (formerly known as SSL) to secure transmissions and to verify that the hostname is recognized by a Certificate Authority. Certificate Authorities do get hacked, so be careful of equating a green lock with security. Even with this added layer of security, the United States government has issued a request for everyone to upgrade their DNS to DNSSEC, which includes additional security-focused technologies to verify with high probability that an IP address is truly associated with a hostname.
Digression aside, DNS works like this in a nutshell:
2. If that DNS server has the response cached, return the result.
3. If not, ask higher-level DNS servers for the answer, then cache and send back the result.
4. If a request is not answered within some timeout, resend it.
If you want the full bits and pieces, feel free to look at the Wikipedia page. In essence, there is a hierarchy of DNS servers. First, there is the dot hierarchy, which resolves top-level domains: .edu, .gov, etc. Next, it resolves the next level, i.e. illinois.edu. Then local resolvers can resolve any number of subdomains. For example, the Illinois DNS server handles both cs.illinois.edu and cs241.cs.illinois.edu. There is a limit on how many subdomains you can have, but subdomains are often used to route requests to different servers, to avoid having to buy many high-performance servers to route requests.
Non-Blocking IO
When you call read(), if the data is unavailable, it will wait until the data is ready before the function returns. When you're reading data from a disk, that delay is short, but when you're reading from a slow network connection, requests take a long time – and the data may never arrive, leading to an unexpected close.
POSIX lets you set a flag on a file descriptor such that any call to read() on that file descriptor will return
immediately, whether it has finished or not. With your file descriptor in this mode, your call to read() will start
the read operation, and while it’s working you can do other useful work. This is called “non-blocking” mode since
the call to read() doesn’t block.
To set a file descriptor to be non-blocking.
// fd is my file descriptor
int flags = fcntl(fd, F_GETFL, 0);
fcntl(fd, F_SETFL, flags | O_NONBLOCK);
For a socket, you can create it in non-blocking mode by adding SOCK_NONBLOCK to the second argument to
socket():
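For example, a sketch for an IPv4 TCP socket (SOCK_NONBLOCK is Linux-specific):

int fd = socket(AF_INET, SOCK_STREAM | SOCK_NONBLOCK, 0);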
When a file is in non-blocking mode and you call read(), it will return immediately with whatever bytes are
available. Say 100 bytes have arrived from the server at the other end of your socket and you call read(fd, buf, 150).
read() will return immediately with a value of 100, meaning it read 100 of the 150 bytes you asked for. Say
you tried to read the remaining data with a call to read(fd, buf+100, 50), but the last 50 bytes still hadn’t
arrived yet. read() would return -1 and set the global error variable errno to either EAGAIN or EWOULDBLOCK.
That’s the system’s way of telling you the data isn’t ready yet.
write() also works in non-blocking mode. Say you want to send 40,000 bytes to a remote server using a
socket. The system can only send so many bytes at a time. In non-blocking mode, write(fd, buf, 40000)
would return the number of bytes it was able to send immediately, or about 23,000. If you called write() right
away again, it would return -1 and set errno to EAGAIN or EWOULDBLOCK. That’s the system’s way of telling you
that it’s still busy sending the last chunk of data and isn’t ready to send more yet.
There are a few ways to check that your IO has arrived. Let's see how to do it using select and epoll. The first interface is select. It isn't preferred by many in the POSIX community when they have an alternative, and in most cases there is an alternative.
Given three sets of file descriptors, select() will wait for any of those file descriptors to become ‘ready’.
1. readfds - a file descriptor in readfds is ready when there is data that can be read or EOF has been
reached.
2. writefds - a file descriptor in writefds is ready when a call to write() will succeed.
select() returns the total number of ready file descriptors. If none of them become ready during the time
defined by timeout, it will return 0. After select() returns, the caller will need to loop through the file descriptors
in readfds and/or writefds to see which ones are ready. As readfds and writefds act as both input and output
parameters, when select() indicates that there are ready file descriptors, it would have overwritten them to
reflect only the ready file descriptors. Unless the caller intends to call select() only once, it would be a good
idea to save a copy of readfds and writefds before calling it. Here is a comprehensive snippet.
fd_set readfds, writefds;
FD_ZERO(&readfds);
FD_ZERO(&writefds);
for (int i = 0; i < read_fd_count; i++)
    FD_SET(my_read_fds[i], &readfds);
for (int i = 0; i < write_fd_count; i++)
    FD_SET(my_write_fds[i], &writefds);

struct timeval timeout = { .tv_sec = 3, .tv_usec = 0 };
// max_fd is the largest descriptor in either set
int num_ready = select(max_fd + 1, &readfds, &writefds, NULL, &timeout);

if (num_ready < 0) {
    perror("error in select()");
} else if (num_ready == 0) {
    printf("timeout\n");
} else {
    for (int i = 0; i < read_fd_count; i++)
        if (FD_ISSET(my_read_fds[i], &readfds))
            printf("fd %d is ready for reading\n", my_read_fds[i]);
    for (int i = 0; i < write_fd_count; i++)
        if (FD_ISSET(my_write_fds[i], &writefds))
            printf("fd %d is ready for writing\n", my_write_fds[i]);
}
For more information, see the select() man page. The problem with select – and the reason a lot of users avoid it and poll – is that select must linearly go through each of the objects. If at any point while going through the objects the previous objects change state, select must restart. This is highly inefficient if we have a large number of file descriptors in each of our sets. There is an alternative, though it isn't much better.
epoll
epoll is not part of POSIX, but it is supported by Linux. It is a more efficient way to wait for many file descriptors.
It will tell you exactly which descriptors are ready. It even gives you a way to store a small amount of data
with each descriptor, like an array index or a pointer, making it easier to access your data associated with that
descriptor.
First, you must create a special file descriptor with epoll_create(). You won’t read or write to this file descriptor.
You’ll pass it to the other epoll_xxx functions and call close() on it at the end.
For each file descriptor that you want to monitor with epoll, you’ll need to add it to the epoll data structures
using epoll_ctl() with the EPOLL_CTL_ADD option. You can add any number of file descriptors to it.
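A sketch of that registration; epfd comes from epoll_create, and fd and mypointer are placeholders:

struct epoll_event event;
event.events = EPOLLIN;     // notify us when fd is ready for reading
event.data.ptr = mypointer; // user data handed back with each event
epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &event);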
Say you were waiting to write data to a file descriptor, but now you want to wait to read data from it. Just use
epoll_ctl() with the EPOLL_CTL_MOD option to change the type of operation you’re monitoring.
event.events = EPOLLOUT;
event.data.ptr = mypointer;
epoll_ctl(epfd, EPOLL_CTL_MOD, mypointer->fd, &event);
To unsubscribe one file descriptor from epoll while leaving others active, use epoll_ctl() with the
EPOLL_CTL_DEL option.
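For example (since Linux 2.6.9 the event argument to EPOLL_CTL_DEL may be NULL):

epoll_ctl(epfd, EPOLL_CTL_DEL, fd, NULL);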
When you are done with the epoll instance entirely, close its file descriptor:

close(epfd);
In addition to non-blocking read() and write(), any calls to connect() on a non-blocking socket will also be non-blocking. To wait for the connection to complete, use select() or epoll to wait for the socket to be writable. There are reasons to use epoll over select, but due to its interface there are fundamental problems with both; see this blog post about select being broken.
Epoll Example
Let’s break down the epoll code in the man page. We’ll assume that we have a prepared TCP server socket
int listen_sock. The first thing we have to do is create the epoll device.
epollfd = epoll_create1(0);
if (epollfd == -1) {
perror("epoll_create1");
exit(EXIT_FAILURE);
}
The next step is to add the listening socket in level-triggered mode.

// Add the socket in with all the other fds. Everything is a file descriptor
struct epoll_event ev;
ev.events = EPOLLIN; // level-triggered is the default
ev.data.fd = listen_sock;
if (epoll_ctl(epollfd, EPOLL_CTL_ADD, listen_sock, &ev) == -1) {
    perror("epoll_ctl: listen_sock");
    exit(EXIT_FAILURE);
}
If we get an event on the listening socket, a new client is connecting, and we need to add that client to our epoll structure. Otherwise, an event on a client socket means that the client has data ready to be read, and we perform that operation.
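The events array and index n in the snippet below come from the surrounding epoll_wait loop, which the man page writes roughly like this (abbreviated):

#define MAX_EVENTS 10
struct epoll_event events[MAX_EVENTS];
for (;;) {
    int nfds = epoll_wait(epollfd, events, MAX_EVENTS, -1);
    if (nfds == -1) {
        perror("epoll_wait");
        exit(EXIT_FAILURE);
    }
    for (int n = 0; n < nfds; ++n) {
        // handle events[n] as shown below
    }
}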
if (events[n].data.fd == listen_sock) {
int conn_sock = accept(listen_sock, (struct sockaddr *) &addr,
&addrlen);
// Must set to non-blocking
setnonblocking(conn_sock);
// We will read from this file, and we only want to return once
// we have something to read from. We don’t want to keep getting
// reminded if there is still data left (edge triggered)
ev.events = EPOLLIN | EPOLLET;
ev.data.fd = conn_sock;
epoll_ctl(epollfd, EPOLL_CTL_ADD, conn_sock, &ev);
}
The function above is missing some error checking for brevity as well. Note that this code is performant
because we added the server socket in level-triggered mode and we add each of the client file descriptors in
edge-triggered. Edge triggered mode leaves more calculations on the part of the application – the application
must keep reading or writing until the file descriptor is out of bytes – but it prevents starvation. A more efficient
implementation would also add the listening socket in edge-triggered to clear out the backlog of connections as
well.
Please read through most of man 7 epoll before starting to program. There are a lot of gotchas. Some of
the more common ones will be detailed below.
1. There are two modes: level-triggered and edge-triggered. Level-triggered means that while the file descriptor has events on it, it will keep being returned by epoll_wait. In edge-triggered mode, the caller will only get the file descriptor once it goes from zero events to an event. This means that if you forget to read, write, accept, etc. on the file descriptor until you get an EWOULDBLOCK, further events on that file descriptor will be dropped.
2. If at any point you duplicate a file descriptor and add it to epoll, you will get an event from that file descriptor
and the duplicated one.
3. You can add an epoll object to another epoll. In that case, edge-triggered and level-triggered modes behave the same, because a ctl call will reset the state to zero events.
4. Depending on the conditions, you may get a file descriptor that was closed from Epoll. This isn’t a bug. The
reason that this happens is epoll works on the kernel object level, not the file descriptor level. If the kernel
object lives longer and the right flags are set, a process could get a closed file descriptor. This also means
that if you close the file descriptor, there is no way to remove the kernel object.
5. Epoll has the EPOLLONESHOT flag which will remove a file descriptor after it has been returned in
epoll_wait
6. Epoll using level-triggered mode could starve certain file descriptors because it is unknown how much data
the application will read from each descriptor.
Extra: kqueue
When it comes to Event-Driven IO, the name of the game is to be fast. One extra system call is considered slow.
OpenBSD and FreeBSD have an arguably better model of asynchronous IO from the kqueue model. Kqueue is
a system call that is exclusive the BSDs and MacOs. It allows you to modify file descriptor events and read file
descriptors all in a single call under a unified interface. So what are the benefits?
1. No more differentiation between file descriptors and kernel objects. In the epoll section, we had to discuss
this distinction otherwise you may wonder why closed file descriptors are getting returned on epoll. No
problem here.
2. How often do you call epoll to read file descriptors, get a server socket, and need to add another file descriptor? In a high-performance server, this can easily happen thousands of times a second. As such, having one system call to both register and grab events saves the overhead of an extra system call.
3. A unified system call for all types. kqueue is, in the truest sense, agnostic to the underlying descriptor. One can add files, sockets, and pipes to it and get full or near-full performance. You can add the same to epoll, but Linux's whole ecosystem for asynchronous file input-output has been messed up by aio, meaning that since there is no unified interface, you run into weird edge cases.
RPC, or Remote Procedure Call, is the idea that we can execute a procedure on a different machine. In practice, the procedure may execute on the same machine, but in a different context. For example, it may operate under a different user with different permissions and a different lifecycle.
An example: you may send a remote procedure call to a Docker daemon to change the state of a container. Not every application needs access to the entire machine, but applications should have access to the containers they’ve created.
Privilege Separation
The remote code will execute under a different user and with different privileges from the caller. In practice, the
remote call may execute with more or fewer privileges than the caller. This in principle can be used to improve
the security of a system by ensuring components operate with the least privilege. Unfortunately, security concerns
need to be carefully assessed to ensure that RPC mechanisms cannot be subverted to perform unwanted actions.
For example, an RPC implementation may implicitly trust any connected client to perform any action, rather than
a subset of actions on a subset of the data.
// Send down the wire (we do not send the zero byte; the '!' signifies the end of the message)
write(fd, buffer, strlen(buffer));
Using a string format may be a little inefficient. A good example of more efficient marshaling is gRPC, Google’s RPC framework. There is a C version as well if you want to check that out.
The server stub code will receive the request, unmarshal the request into valid in-memory data, call the underlying implementation, and send the result back to the caller. Often the underlying library will do this for you.
To implement RPC you need to decide and document which conventions you will use to serialize the data into
a byte sequence. Even a simple integer has several common choices.
1. Signed or unsigned?
2. Big-endian or little-endian byte order?
3. A fixed number of bytes, or a variable-length encoding?
To marshal a struct, decide which fields need to be serialized. It may be unnecessary to send all data items.
For example, some items may be irrelevant to the specific RPC or can be re-computed by the server from the other
data items present.
To marshal a linked list, it is unnecessary to send the link pointers – just stream the values. As part of unmarshaling, the server can recreate the linked list structure from the byte sequence.
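For instance, a linked list of integers might be marshaled as a count followed by the raw values. A sketch, with the node type and the buffer management as assumptions:

#include <stdint.h>
#include <string.h>

typedef struct node { int32_t value; struct node *next; } node;

// Writes a count followed by the values into out; returns bytes written.
// Byte order is the host's here -- a real protocol must document its choice.
size_t marshal_list(const node *head, char *out) {
    char *p = out + sizeof(int32_t); // leave room for the count
    int32_t count = 0;
    for (const node *n = head; n; n = n->next) {
        memcpy(p, &n->value, sizeof(int32_t));
        p += sizeof(int32_t);
        count++;
    }
    memcpy(out, &count, sizeof(int32_t)); // fill in the count last
    return (size_t)(p - out);
}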
By starting at the head node/vertex, a simple tree can be recursively visited to create a serialized version of
the data. A cyclic graph will usually require additional memory to ensure that each edge and vertex is processed
exactly once.
<ticket><price currency='dollar'>10</price><vendor>travelocity</vendor></ticket>
Google Protocol Buffers is an open-source efficient binary protocol that places a strong emphasis on high
throughput with low CPU overhead and minimal memory copying. This means client and server stub code in
multiple languages can be generated from the .proto specification file to marshal data to and from a binary stream.
Google Protocol Buffers reduces the versioning problem by ignoring unknown fields that are present in a
message. See the introduction to Protocol Buffers for more information.
The general approach is to abstract away the actual business logic from the marshaling code. If your application ever becomes CPU-bound parsing XML, JSON, or YAML, switch to protocol buffers!
Topics
• IPv4 vs IPv6
• TCP vs UDP
• DNS
• shutdown
• recvfrom
• epoll vs select
• RPC
Questions
• What is TCP? UDP? Give the advantages and disadvantages of each. What is a scenario where you would use one over the other?
• When can you use read and write? How about recvfrom and sendto?
• What are some advantages to epoll over select? How about select over epoll?
• What is a remote procedure call? When should one use it versus HTTP or running code locally?
Bibliography
[1] User Datagram Protocol. RFC 768, August 1980. URL https://rfc-editor.org/rfc/rfc768.txt.
[3] Danny Cohen. On holy wars and a plea for peace, Apr 1980. URL https://www.ietf.org/rfc/ien/ien137.txt.
[4] Roy T. Fielding and Julian Reschke. Hypertext Transfer Protocol (HTTP/1.1): Semantics and Content. RFC 7231, June 2014. URL https://rfc-editor.org/rfc/rfc7231.txt.
[5] J. Reynolds and J. Postel. Assigned numbers. RFC 1700, RFC Editor, October 1994.
12 Filesystems
Filesystems are important because they allow you to persist data after a computer is shut down, crashes, or has
memory corruption. Back in the day, filesystems were expensive to use. Writing to the filesystem (FS) involved
writing to magnetic tape and reading from that tape [1]. It was slow, heavy, and prone to errors.
Nowadays most of our files are stored on disk – though not all of them! The disk is still slower than memory
by an order of magnitude at the least.
Some terminology before we begin this chapter. A filesystem, as we’ll define more concretely later, is anything
that satisfies the API of a filesystem. A filesystem is backed by a storage medium, such as a hard disk drive, solid
state drive, RAM, etc. A disk is either a hard disk drive (HDD) which includes a spinning metallic platter and
a head which can zap the platter to encode a 1 or a 0, or a solid-state drive (SSD) that can flip certain
NAND gates on a chip or standalone drive to store a 1 or a 0. As of 2019, SSDs are an order of magnitude
faster than the standard HDD. These are typical backings for a filesystem. A filesystem is implemented on top
of this backing, meaning that we can either implement something like EXT, MinixFS, NTFS, FAT32, etc. on a
commercially available hard disk. This filesystem tells the operating system how to organize the 1s and 0s to store
file information as well as directory information, but more on that later. To avoid being pedantic, we’ll say that a
filesystem like EXT or NTFS implements the filesystem API directly (open, close, etc). Often, operating systems will add a layer of abstraction and require that the filesystem satisfy an internal API instead (think imaginary functions linux_open, linux_close, etc). The two benefits are that one filesystem can be implemented for multiple operating system APIs, and that adding a new OS filesystem call doesn’t require all of the underlying filesystems to change their APIs. For example, if the next iteration of Linux added a new system call to create a backup of a file, the OS could implement it with the internal API rather than requiring all filesystem drivers to change their code.
The last piece of background is an important one. In this chapter, we will refer to sizes of files in the ISO-compliant KiB, or kibibyte. The *iB family is shorthand for power-of-two storage. That means the following:
1 KiB = 2^10 bytes, 1 MiB = 2^20 bytes, 1 GiB = 2^30 bytes
The standard notational prefixes mean the following:
1 KB = 10^3 bytes, 1 MB = 10^6 bytes, 1 GB = 10^9 bytes
We will do this in the book and in the Networking chapter for the sake of consistency and to not confuse anyone. Confusingly, the real world uses a different convention: when a file size is displayed by the operating system, KB is treated the same as KiB, while for computer networks, CDs, and other storage, KB is not the same as KiB and follows the ISO / metric definition above. This historical quirk was brought about by a clash between network developers and memory/hard-storage developers. Hard storage and memory developers found that since a bit can take one of two states, it was natural to let the Kilo- prefix mean 1024, because it is about 1000. Network developers had to deal with bits, real-time signal processing, and various other factors, so they went with the already accepted convention that Kilo- means 1000 of something [1]. What you need to know is that if you see KB in the wild, it may mean 1024 depending on the context. If at any time in this class you see KB or any of its family in a filesystems question, you can safely infer that 1024 is the base unit. When you are pushing production code, though, make sure to ask about the difference!
What is a filesystem?
You may have encountered the old UNIX adage, "everything is a file". In most UNIX systems, file operations
provide an interface to abstract many different operations. Network sockets, hardware devices, and data on the
disk are all represented by file-like objects. A file-like object must support common filesystem operations such as open, read, and write; at a minimum, it must be able to be opened and closed.
A filesystem is an implementation of the file interface. In this chapter, we will be exploring the various callbacks
a filesystem provides, some typical functionality and associated implementation details. In this class, we will
mostly talk about filesystems that serve to allow users to access data on disk, which are integral to modern
computers.
Here are some common features of a filesystem:
1. They deal both with storing local files and with special devices that allow for safe communication between the kernel and user space.
2. They deal with failures, scalability, indexing, encryption, compression, and performance.
3. They handle the abstraction between a file that contains data and how exactly that data is stored on disk,
partitioned, and protected.
Before we dive into the details of a filesystem, let’s take a look at some examples. To clarify, a mount point is
simply a mapping of a directory to a filesystem represented in the kernel.
1. ext4 Usually mounted at / on Linux systems, this is the filesystem that usually provides disk access as
you’re used to.
2. procfs Usually mounted at /proc, provides information and control over processes.
3. sysfs Usually mounted at /sys, a more modern version of /proc that also allows control over various other
hardware such as network sockets.
4. tmpfs Mounted at /tmp in some systems, an in-memory filesystem to hold temporary files.
A mount point tells you which filesystem directory-based system calls resolve to. For example, / is resolved by the ext4 filesystem in our case, but /proc/2 is resolved by procfs, even though the path starts at /.
As you may have noticed, some filesystems provide an interface to things that aren’t "files". Filesystems such
as procfs are usually referred to as virtual filesystems, since they don’t provide data access in the same sense as
a traditional filesystem would. Technically, all filesystems in the kernel are represented by virtual filesystems, but
we will differentiate virtual filesystems as filesystems that actually don’t store anything on a hard disk.
Not every filesystem supports all the possible callback functions. For example, many filesystems omit ioctl or link. Many filesystems aren’t seekable, meaning that they exclusively provide sequential access: a program cannot move to an arbitrary point in the file. This is analogous to non-seekable streams such as pipes. In this chapter, we will not be examining each filesystem callback. If you would like to learn more about this interface, try looking at the documentation for FUSE (Filesystem in Userspace).
To understand how a filesystem interacts with data on disk, there are three key terms we will be using.
1. disk block A disk block is a portion of the disk that is reserved for storing the contents of a file or a
directory.
2. inode An inode is a file or directory. This means that an inode contains metadata about the file as well as
pointers to disk blocks so that the file can actually be written to or read from.
3. superblock A superblock contains metadata about the inodes and disk blocks. An example superblock
can store how full each disk block is, which inodes are being used etc. Modern filesystems may actually
contain multiple superblocks and a sort-of super-super block that keeps track of which sectors are governed
by which superblocks. This tends to help with fragmentation.
It may seem overwhelming, but by the end of this chapter, we will be able to make sense of every part of the
filesystem.
To reason about data on some form of storage – spinning disks, solid state drives, magnetic tape – it is common
practice to first consider the medium of storage as a collection of blocks. A block can be thought of as a contiguous
region on disk. While its size is sometimes determined by some property of the underlying hardware, it is more
frequently determined based on the size of a page of memory for a given system, so that data from the disk can be
cached in memory for faster access – an important feature of many filesystems.
A filesystem has a special block denoted as a superblock that stores metadata about the filesystem such as a
journal (which logs changes to the filesystem), a table of inodes, the location of the first inode on disk, etc. The
important thing about a superblock is that it is in a known location on disk. If not, your computer may fail to
boot! Consider a simple ROM programmed into your motherboard. If your processor can’t tell the motherboard
to start reading and decipher a disk block to start the boot sequence, you are out of luck.
The inode is the most important structure for our filesystem as it represents a file. Before we explore it in-depth,
let’s list out the key information we need to have a usable file.
• Name
• File size
• Permissions
• Filepath
• Checksum
• File data
File Contents
From Wikipedia:
In a Unix-style file system, an index node, informally referred to as an inode, is a data structure used to
represent a filesystem object, which can be various things including a file or a directory. Each inode stores
the attributes and disk block location(s) of the filesystem object’s data. Filesystem object attributes may
include manipulation metadata (e.g. change, access, modify time), as well as owner and permission data
(e.g. group-id, user-id, permissions).
The superblock may store an array of inodes, each of which stores direct, and potentially several kinds of
indirect pointers to disk blocks. Since inodes are stored in the superblock, most filesystems have a limit on
how many inodes can exist. Since each inode corresponds to a file, this is also a limit on how many files that
filesystem can have. Trying to overcome this problem by storing inodes in some other location greatly increases
the complexity of the filesystem. Trying to reallocate space for the inode table is also infeasible since every byte
following the end of the inode array would have to be shifted, a highly expensive operation. This isn’t to say there
aren’t any solutions at all, although typically there is no need to increase the number of inodes since the number
of inodes is usually sufficiently high.
Big idea: Forget names of files. The ‘inode’ is the file.
It is common to think of the file name as the ‘actual’ file. It’s not! Instead, consider the inode as the file. The
inode holds the meta-information (last accessed, ownership, size) and points to the disk blocks used to hold
the file contents. However, the inode does not usually store a filename. Filenames are usually only stored in
directories (see below).
For example, to read the first few bytes of the file, follow the first direct block pointer to the first direct block and read the first few bytes. Writing follows the same process. If a program wants to read the entire file, it keeps reading direct blocks until it has read a number of bytes equal to the size of the file. If the total size of the file is less than the number of direct blocks multiplied by the size of a block, the unused block pointers hold undefined values. Similarly, if the size of a file is not a multiple of the size of a block, data past the end of the last byte in the last block will be garbage.
What if a file is bigger than the maximum space addressable by its direct blocks? To that, we present a motto
programmers take too seriously.
“All problems in computer science can be solved by another level of indirection.” - David Wheeler
Directory Implementation
A directory is a mapping of names to inode numbers. It is typically a normal file, but with some special bits set in
its inode and a specific structure for its contents. POSIX provides a small set of functions to read the filename and
inode number for each entry, which we will talk about in depth later in this chapter.
Let’s think about what directories look like in the actual file system. Theoretically, they are files. The disk blocks contain directory entries, or dirents. That means a disk block’s contents resemble the name-to-inode mapping shown below.
# ls -i
12983989 dirlist.c 12984068 sandwich.c
You can see later that this is a powerful abstraction. One can have a file be multiple different names in a
directory, or exist in multiple directories.
Counterintuitively, ... could be the name of a file, not the grandparent directory. Only the current directory and the parent directory have special aliases involving . (namely . and ..). However, ... could be the name of a file or directory on disk (you can try this with mkdir ...). Confusingly, the shell zsh does interpret ... as a handy shortcut to the grandparent directory (should it exist) while expanding shell commands.
Additional facts about name-related conventions:
1. Files that start with ’.’ (a period) on disk are conventionally considered ’hidden’ and will be omitted by
programs like ls without additional flags (-a). This is not a feature of the filesystem, and programs may
choose to ignore this.
2. Some files may also start with a NUL byte. These are usually abstract UNIX sockets and are used to prevent
cluttering up the filesystem since they will be effectively hidden by any unexpecting program. They will,
however, be listed by tools that detail information about sockets, so this is not a feature providing security.
3. If you want to annoy your neighbor, create a file with the terminal bell character. Every single time the file
is listed (by calling ‘ls’, for example), an audible bell will be heard.
Directory API
While interacting with a file in C is typically done by using open to open the file, then read or write to interact with it before calling close to release resources, directories have special calls: opendir, closedir, and readdir. There is no writedir, since writing a directory entry typically implies creating a file or link; a program would use something like open or mkdir instead.
To explore these functions, let’s write a program to search the contents of a directory for a particular file. The code below has a bug; try to spot it!
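A sketch of such a program might look like the following (exists_in_directory is an illustrative name):

#include <dirent.h>
#include <string.h>

// Returns 1 if the directory at path contains an entry with the given name.
int exists_in_directory(const char *path, const char *name) {
    struct dirent *dp;
    DIR *dirp = opendir(path);
    while ((dp = readdir(dirp)) != NULL) {
        if (!strcmp(dp->d_name, name)) {
            return 1; // found it
        }
    }
    closedir(dirp);
    return 0;
}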
Did you find the bug? It leaks resources! If a matching filename is found then ‘closedir’ is never called as part
of the early return. Any file descriptors opened and any memory allocated by opendir are never released. This
means eventually the process will run out of resources and an open or opendir call will fail.
The fix is to ensure we free up resources in every possible code path.
In the above code, this means calling closedir before return 1. Forgetting to release resources is a common
C programming bug because there is no support in the C language to ensure resources are always released with
all code paths.
Given an open directory, after a call to fork(), either the parent or the child – but not both – can use readdir(), rewinddir(), or seekdir(). If both the parent and the child use the above, the behavior is undefined.
There are two main gotchas and one consideration. The first is that the readdir function returns “.” (current directory) and “..” (parent directory), which must be handled. The second is that programs need to explicitly exclude subdirectories from a search, otherwise the search may take a long time.
For many applications, it’s reasonable to check the current directory first before recursively searching sub-
directories. This can be achieved by storing the results in a linked list, or resetting the directory struct to restart
from the beginning.
The following code attempts to list all files in a directory recursively. As an exercise, try to identify the bugs; the key fixes appear below.
// Check opendir result (perhaps the user gave us a path that cannot be opened as a directory)
if (!dirp) { perror("Could not open directory"); return; }

// +2 as we need space for the / and the terminating 0
char newpath[strlen(path) + strlen(dp->d_name) + 2];

// Correct parameter
sprintf(newpath, "%s/%s", path, dp->d_name);
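Putting those fixes together, a corrected sketch might look like this (the function name and the decision to print each path are illustrative):

#include <dirent.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>

void list_files(const char *path) {
    DIR *dirp = opendir(path);
    // Check opendir result (perhaps the path cannot be opened as a directory)
    if (!dirp) { perror("Could not open directory"); return; }
    struct dirent *dp;
    while ((dp = readdir(dirp)) != NULL) {
        // Skip . and .. or the recursion never terminates
        if (!strcmp(dp->d_name, ".") || !strcmp(dp->d_name, "..")) continue;
        // +2 as we need space for the / and the terminating 0
        char newpath[strlen(path) + strlen(dp->d_name) + 2];
        sprintf(newpath, "%s/%s", path, dp->d_name);
        puts(newpath);
        struct stat s;
        if (stat(newpath, &s) == 0 && S_ISDIR(s.st_mode)) {
            list_files(newpath); // recurse only into directories
        }
    }
    closedir(dirp); // release resources on every code path
}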
One final note of caution: readdir is not thread-safe! Rather than relying on the deprecated re-entrant version of the function, synchronize within your process by using locks around readdir.
See the man page of readdir for more details.
Linking
Links are what force us to model a filesystem as a graph rather than a tree.
While modeling the filesystem as a tree would imply that every inode has a unique parent directory, links
allow inodes to present themselves as files in multiple places, potentially with different names, thus leading to an
inode having multiple parent directories. There are two kinds of links:
1. Hard Links A hard link is simply an entry in a directory assigning some name to an inode number that
already has a different name and mapping in either the same directory or a different one. If we already
have a file on a file system we can create another link to the same inode using the ‘ln’ command:
$ ln file1.txt blip.txt
However, blip.txt is the same file. If we edit blip.txt, we are editing the same file as file1.txt! We can prove this by showing that both file names refer to the same inode.
$ ls -i file1.txt blip.txt
134235 file1.txt
134235 blip.txt
// Function Prototype
int link(const char *path1, const char *path2);
link("file1.txt", "blip.txt");
For simplicity, the above examples made hard links inside the same directory. Hard links can be created
anywhere inside the same filesystem.
2. Soft Links The second kind of link is called a soft link, symbolic link, or symlink. A symbolic link is
different because it is a file with a special bit set and stores a path to another file. Quite simply, without the
special bit, it is nothing more than a text file with a file path inside. Note when people generally talk about
a link without specifying hard or soft, they are referring to a hard link.
To create a symbolic link in the shell, use ln -s. To read the contents of the link as a file, use readlink.
These are both demonstrated below.
$ ln -s file1.txt file2.txt
$ ls -i file1.txt blip.txt
134235 file1.txt
134236 file2.txt
134235 blip.txt
$ cat file1.txt
file1!
$ cat file2.txt
file1!
$ cat blip.txt
file1!
$ echo edited file2 >> file2.txt # >> is bash syntax for append to file
$ cat file1.txt
file1!
edited file2
$ cat file2.txt
file1!
edited file2
$ cat blip.txt
file1!
edited file2
$ readlink file2.txt
file1.txt
Note that file2.txt and file1.txt have different inode numbers, unlike the hard link, blip.txt.
There is a C library call, symlink(const char *target, const char *linkpath), to create symbolic links; it is similar to link.
The integrity of the file system assumes the directory structure is an acyclic tree that is reachable from the root
directory. It becomes expensive to enforce or verify this constraint if directory linking is allowed. Breaking these
assumptions can leave file integrity tools unable to repair the file system. Recursive searches potentially never
terminate and directories can have more than one parent but “..” can only refer to a single parent. All in all, a bad
idea. Soft links are merely ignored, which is why we can use them to reference directories.
When you remove a file using rm or unlink, you are removing an inode reference from a directory. However, the inode may still be referenced from other directories. To determine whether the contents of the file are still required, each inode keeps a reference count that is updated whenever a link is created or destroyed. This count only tracks hard links; symlinks are allowed to refer to a non-existent file and thus do not count.
An example use of hard links is to efficiently create multiple archives of a file system at different points in time.
Once the archive area has a copy of a particular file, then future archives can re-use these archive files rather than
creating a duplicate file. This is called an incremental backup. Apple’s “Time Machine” software does this.
Pathing
Now that we have definitions, and have talked about directories, we come across the concept of a path. A path
is a sequence of directories that provide one with a "path" in the graph that is a filesystem. However, there are
some nuances. It is possible to have a path called a/b/../c/./. Since .. and . are special entries in directories,
this is a valid path that actually refers to a/c. Most filesystem functions will allow uncompressed paths to be
passed in. The C library provides a function realpath to compress the path or get the absolute path. To simplify
by hand, remember that .. means ‘parent folder’ and that . means ‘current folder’. Below is an example that
illustrates the simplification of the a/b/../c/. by using cd in a shell to navigate a filesystem.
1. cd a (in a)
2. cd b (in a/b)
3. cd .. (in a)
4. cd c (in a/c)
5. cd . (in a/c)
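The same simplification can be done programmatically with realpath. A short sketch (the path is illustrative, and realpath requires the intermediate directories to actually exist on disk):

#include <limits.h>
#include <stdio.h>
#include <stdlib.h>

int main(void) {
    char resolved[PATH_MAX];
    // Collapses the .. and . components into an absolute path.
    if (realpath("a/b/../c/.", resolved)) {
        printf("%s\n", resolved); // e.g. /home/user/a/c
    } else {
        perror("realpath");
    }
    return 0;
}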
Metadata
How can we distinguish between a regular file and a directory? For that matter, there are many other attributes a file might have. We distinguish the file type – different from the file extension, e.g. png, svg, pdf – using fields inside the inode. How does the system know what type the file is?
This information is stored within an inode. To access it, use the stat calls. For example, to find out when my
‘notes.txt’ file was last accessed.
struct stat s;
stat("notes.txt", &s);
printf("Last accessed %s", ctime(&s.st_atime));
For example, a program can use fstat to learn about file metadata if it already has a file descriptor associated
with that file.
lstat is almost the same as stat but handles symbolic links differently. From the stat man page.
lstat() is identical to stat(), except that if pathname is a symbolic link, then it returns information
about the link itself, not the file that it refers to.
The stat functions make use of struct stat. From the stat man page:
struct stat {
dev_t st_dev; /* ID of device containing file */
ino_t st_ino; /* Inode number */
mode_t st_mode; /* File type and mode */
nlink_t st_nlink; /* Number of hard links */
uid_t st_uid; /* User ID of owner */
gid_t st_gid; /* Group ID of owner */
dev_t st_rdev; /* Device ID (if special file) */
off_t st_size; /* Total size, in bytes */
blksize_t st_blksize; /* Block size for filesystem I/O */
blkcnt_t st_blocks; /* Number of 512B blocks allocated */
struct timespec st_atim; /* Time of last access */
struct timespec st_mtim; /* Time of last modification */
struct timespec st_ctim; /* Time of last status change */
};
The st_mode field can be used to distinguish between regular files and directories. To accomplish this, use
the macros, S_ISDIR and S_ISREG.
struct stat s;
if (0 == stat(name, &s)) {
    printf("%s ", name);
    if (S_ISDIR(s.st_mode)) puts("is a directory");
    if (S_ISREG(s.st_mode)) puts("is a regular file");
} else {
    perror("stat failed - are you sure we can read this file's metadata?");
}
Permissions are a key part of the way UNIX systems provide security in a filesystem. You may have noticed that
the st_mode field in struct stat contains more than the file type. It also contains the mode, a description
detailing what a user can and can’t do with a given file. There are usually three sets of permissions for any file: permissions for the user, the group, and other (every user falling outside the first two categories). For each of the three categories, we need to keep track of whether the user is allowed to read the file, write to the file, and execute the file. Since there are three categories and three permissions, permissions are usually represented as a 3-digit octal number. Within each digit, the most significant bit corresponds to read permission, the middle bit to write permission, and the least significant bit to execute permission. The digits are always presented in the order User, Group, Other (UGO). Below are some common examples. Here are the bit conventions:
read (r) = 4, write (w) = 2, execute (x) = 1
It is worth noting that the rwx bits have a slightly different meaning for directories. Write access to a directory allows a program to create or delete files or directories inside it; you can think of this as having write access to the directory-entry (dirent) mappings. Read access to a directory allows a program to list the directory’s contents – read access to the dirent mappings. Execute allows a program to enter the directory using cd. Without the execute bit, any attempt to access files or directories inside will fail, since you cannot reach them; you can, however, still list the contents of the directory.
There are several command line utilities for interacting with a file’s mode. mknod creates special files of a given type (such as device nodes). chmod takes a number and a file and changes the permission bits. However, before we can discuss chmod in detail, we must understand the user ID (uid) and group ID (gid) as well.
User ID / Group ID
Every user in a UNIX system has a user ID, a unique number that identifies the user. Similarly, users can be added to collections called groups, and every group also has a unique identifying number. Groups have a variety of uses on UNIX systems. They can be assigned capabilities – a way of describing the level of control a user has over a system. For example, a group you may have run into is the sudoers group, a set of trusted users who are allowed to use the command sudo to temporarily gain higher privileges. We’ll talk more about how sudo works in this chapter. Every file, upon creation, is assigned an owner: the creator of the file. The owner’s user ID (uid) can be found in the st_uid field of a struct stat after a call to stat. Similarly, the group ID (gid) is set as well.
Every process can determine its uid and gid with getuid and getgid. When a process tries to open a file with a specific mode, its uid and gid are compared with the uid and gid of the file. If the uids match, the process’s request to open the file is checked against the user bits of the file’s permissions. If the gids match, the request is checked against the group bits. If neither matches, the other bits apply.
The file type is also encoded in st_mode. The types, with the character ls uses for each, are:
1. (-) regular file
2. (d) directory
3. (c) character device
4. (b) block device
5. (l) symbolic link
6. (p) named pipe (FIFO)
7. (s) socket
Alternatively, use the program stat which presents all the information that one could retrieve from the stat
library call.
To change the permission bits, there is a system call: int chmod(const char *path, mode_t mode);. To simplify our examples, we will use the command line utility of the same name, chmod, short for “change mode”. There are two common ways to use chmod: with an octal value or with a symbolic string.
$ chmod 644 file1
$ chmod 755 file2
$ chmod 700 file3
$ chmod ugo-w file4
$ chmod o-rx file4
The base-8 (‘octal’) digits describe the permissions for each role: the user who owns the file, the group, and everyone else. Each octal digit is the sum of the values given to the three types of permission: read (4), write (2), execute (1).
Example: chmod 755 myfile gives the owner rwx (4 + 2 + 1 = 7) and the group and everyone else r-x (4 + 1 = 5).
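The equivalent call from C, as a sketch (the filename is illustrative):

#include <sys/stat.h>

chmod("myfile", 0755); // note the leading 0: permissions are written in octal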
$ umask 077
$ mkdir secretdir
As a code example, suppose a new file is created with open() and mode bits 666 (write and read bits for
user, group and other):
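A minimal sketch of such a call (the filename is illustrative):

#include <fcntl.h>

// Request mode bits 666; the process umask is applied before creation.
int fd = open("myfile", O_CREAT | O_WRONLY | O_TRUNC, 0666);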
If umask is octal 022, then the permissions of the created file will be 0666 & ~0022 = 0644, i.e. rw-r--r--.
When executing a process with the setuid bit, it is still possible to determine a user’s original uid with getuid.
The real action of the setuid bit is to set the effective user ID (euid) which can be determined with geteuid.
The actions of getuid and geteuid are described below.
• getuid returns the real user id (the id of the user who started the process)
• geteuid returns the effective user id (zero if acting as root, e.g. due to the setuid flag set on a program)
These functions can allow one to write a program that can only be run by a privileged user by checking
geteuid or go a step further and ensure that the only user who can run the code is root by using getuid.
Note that in the example above, the username is prepended to the prompt, and the command su is used to
switch users.
POSIX systems, such as Linux and macOS (which is based on BSD), include several virtual filesystems that are mounted (made available) as part of the file system. Files inside these virtual filesystems may be generated dynamically or stored in RAM. Linux provides three main virtual filesystems. For example, /dev/null is a virtual device that discards anything written to it:
$ ls . >/dev/null
There is a window of opportunity between when the directory is created and when its permissions are changed.
This leads to several vulnerabilities that are based on a race condition.
If another user replaces mystuff with a link to an existing file or directory owned by the second user, they would be able to read and control the contents of the mystuff directory. Oh no – our secrets are no longer secret!
However in this specific example, the /tmp directory has the sticky bit set, so only the owner may delete
the mystuff directory, and the simple attack scenario described above is impossible. This does not mean that
creating the directory and then later making the directory private is secure! A better version is to atomically create
the directory with the correct permissions from its inception.
2. /dev/random may block at an inconvenient time. If one is programming a service for high scalability and
relies on /dev/random, an attacker can reliably exhaust the entropy pool and cause the service to block.
3. Manual page authors pose a hypothetical attack where an attacker exhausts the entropy pool and guesses
the seeding bits, but that attack has yet to be implemented.
5. Security experts distinguish computational security from information-theoretic security; more on this in the article Urandom Myths. Most encryption is computationally secure, which means /dev/urandom is as well.
Copying Files
Use the versatile dd command. For example, the following command copies 1 MiB of data from the file
/dev/urandom to the file /dev/null. The data is copied as 1024 blocks of block size 1024 bytes.
$ dd if=/dev/urandom of=/dev/null bs=1k count=1024
Both the input and output files in the example above are virtual – they don’t exist on a disk. This means the speed of the transfer is not limited by disk hardware.
dd is also commonly used to make a copy of a disk or an entire filesystem to create images that can either be
burned on to other disks or to distribute data to other users.
$ umask 077      # all future new files will mask out all r,w,x bits for group and other access
$ touch file123  # create the file if it does not exist, and update its modified time
$ stat file123
  File: `file123'
  Size: 0          Blocks: 0        IO Block: 65536  regular empty file
Device: 21h/33d    Inode: 226148    Links: 1
Access: (0600/-rw-------)  Uid: (395606/angrave)  Gid: (61019/ews)
Access: 2014-11-12 13:42:06.000000000 -0600
Modify: 2014-11-12 13:42:06.001787000 -0600
Change: 2014-11-12 13:42:06.001787000 -0600
An example use of touch is to force make to recompile a file that is unchanged after modifying the compiler
options inside the makefile. Remember that make is ‘lazy’ - it will compare the modified time of the source file
with the corresponding output file to see if the file needs to be recompiled.
Managing Filesystems
To manage filesystems on your machine, use mount. Using mount without any options generates a list (one
filesystem per line) of mounted filesystems including networked, virtual and local (spinning disk / SSD-based)
filesystems. Here is a typical output of mount
$ mount
/dev/mapper/cs241--server_sys-root on / type ext4 (rw)
proc on /proc type proc (rw)
sysfs on /sys type sysfs (rw)
devpts on /dev/pts type devpts (rw,gid=5,mode=620)
tmpfs on /dev/shm type tmpfs
(rw,rootcontext="system_u:object_r:tmpfs_t:s0")
/dev/sda1 on /boot type ext3 (rw)
/dev/mapper/cs241--server_sys-srv on /srv type ext4 (rw)
/dev/mapper/cs241--server_sys-tmp on /tmp type ext4 (rw)
/dev/mapper/cs241--server_sys-var on /var type ext4 (rw)
/srv/software/Mathematica-8.0 on /software/Mathematica-8.0 type none (rw,bind)
engr-ews-homes.engr.illinois.edu:/fs1-homes/angrave/linux on /home/angrave type nfs (rw,soft,intr,tcp,noacl,acregmin=30,vers=3,sec=sys,sloppy,addr=128.174.252.10)
Notice that each line includes the source of the filesystem, its mount point, and the filesystem type. To reduce this output, we can pipe it into grep and only see lines that match a regular expression.
Filesystem Mounting
Suppose you had downloaded a bootable Linux disk image from the Arch Linux download page:
$ wget $URL
Before putting the filesystem on a CD, we can mount the file as a filesystem and explore its contents. Note:
mount requires root access, so let’s run it using sudo
$ mkdir arch
$ sudo mount -o loop archlinux-2015.04.01-dual.iso ./arch
$ cd arch
Before the mount command, the arch directory is new and obviously empty. After mounting, the contents of arch/ will be drawn from the files and directories stored in the filesystem inside the archlinux-2015.04.01-dual.iso file. The loop option is required because we want to mount a regular file, not a block device such as a physical disk.
The loop option wraps the original file as a block device. In this example, we will find out below that the
file system is provided under /dev/loop0. We can check the filesystem type and mount options by running the
mount command without any parameters. We will pipe the output into grep so that we only see the relevant
output line(s) that contain ‘arch’.
The iso9660 filesystem is a read-only filesystem originally designed for optical storage media (e.g., CD-ROMs). Attempting to change the contents of the filesystem will fail:
$ touch arch/nocando
touch: cannot touch '/home/demo/arch/nocando': Read-only file system
Memory Mapped IO
While we traditionally think of reading and writing from a file as an operation that happens by using the read
and write calls, there is an alternative, mapping a file into memory using mmap. mmap can also be used for IPC,
and you can see more about mmap as a system call that enables shared memory in the IPC chapter. In this chapter,
we’ll briefly explore mmap as a filesystem operation.
mmap takes a file and maps its contents into memory. This allows a user to treat the entire file as a buffer in
memory for easier semantics while programming, and to avoid having to read a file as discrete chunks explicitly.
Not all filesystems support using mmap for IO. Those that do have varying behavior. Some will simply implement
mmap as a wrapper around read and write. Others will add additional optimizations by taking advantage of the
kernel’s page cache. Of course, such optimization can be used in the implementation of read and write as well,
so often using mmap has identical performance.
mmap is used to perform some operations such as loading libraries and processes into memory. If many
programs only need read-access to the same file, then the same physical memory can be shared between multiple
processes. This is used for common libraries like the C standard library.
The process to map a file into memory is as follows:
1. We open a file with open to obtain a file descriptor.
2. We seek to our desired size and write one byte to ensure that the file is of sufficient length.
3. We call mmap with the file descriptor to map the file into our address space.
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <sys/mman.h>
#include <fcntl.h>
#include <unistd.h>
#include <errno.h>
#include <string.h>
int main() {
    // We want a file big enough to hold 10 integers
    int size = sizeof(int) * 10;
    // (The middle of this listing is reconstructed; the filename and the
    //  integer values written below are illustrative.)
    int fd = open("data", O_RDWR | O_CREAT | O_TRUNC, 0600);
    if (fd == -1) { perror("open"); return 1; }
    // Seek to the desired size and write one byte so the file is long enough
    lseek(fd, size, SEEK_SET);
    write(fd, "A", 1);
    void *addr = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (addr == MAP_FAILED) { perror("mmap"); return 1; }
    // Treat the mapped region as an array of integers
    int *array = addr;
    array[0] = 0x12345678;
    array[1] = 0x0000424B;
    munmap(addr, size);
    return 0;
}
The careful reader may notice that our integers were written in least-significant-byte format because that is
the endianness of the CPU that we ran this example on. We also allocated a file that is one byte too many! The
PROT_READ | PROT_WRITE options specify the virtual memory protection. The option PROT_EXEC (not used
here) can be set to allow CPU execution of instructions in memory.
Most filesystems cache significant amounts of disk data in physical memory. Linux, in this respect, is extreme. All
unused memory is used as a giant disk cache. The disk cache can have a significant impact on overall system
performance because disk I/O is slow. This is especially true for random access requests on spinning disks where
the disk read-write latency is dominated by the seek time required to move the read-write disk head to the correct
position.
For efficiency, the kernel caches recently used disk blocks. For writing, we have to choose a trade-off between
performance and reliability. Disk writes can also be cached (“Write-back cache”) where modified disk blocks are
stored in memory until evicted. Alternatively, a ‘write-through cache’ policy can be employed where disk writes
are sent immediately to the disk. The latter is safer as filesystem modifications are quickly stored to persistent
media but slower than a write-back cache. If writes are cached then they can be delayed and efficiently scheduled
based on the physical position of each disk block. Note, this is a simplified description because solid state drives
(SSDs) can be used as a secondary write-back cache.
Both solid state disks (SSD) and spinning disks have improved performance when reading or writing sequential
data. Thus, operating systems can often use a read-ahead strategy to amortize the read-request costs and request
several contiguous disk blocks per request. By issuing an I/O request for the next disk block before the user
application requires the next disk block, the apparent disk I/O latency can be reduced.
If your data is important and needs to be force written to disk, call sync to request that a filesystem’s changes
be written (flushed) to disk. However, operating systems may ignore this request. Even if the data is evicted from
the kernel buffers, the disk firmware may use an internal on-disk cache or may not yet have finished changing the
physical media. Note, you can also request that all changes associated with a particular file descriptor are flushed
to disk using fsync(int fd). There is a fiery debate about this call being useless, initiated by PostgreSQL’s team: https://lwn.net/Articles/752063/
If your operating system fails in the middle of an operation, most modern file systems do something called journaling to work around this. Before the file system completes a potentially expensive operation, it writes what it is going to do in a journal. In the case of a crash or failure, one can replay the journal, see which files are corrupt, and fix them. This is a way to salvage hard disks in cases where there is critical data and no backup.
Even though it is unlikely for your computer, programming for data centers means that disks fail every few
seconds. Disk failures are measured using “Mean-Time-To-Failure (MTTF)”. For large arrays, the mean failure time
can be surprisingly short. If the MTTF(single disk) = 30,000 hours, then the MTTF(1000 disks)= 30000/1000=30
hours or about a day and a half! That’s also assuming that the failures between the disks are independent, which
they often aren’t.
Software developers need to implement filesystems all the time. If that is surprising to you, take a look at Hadoop, GlusterFS, Qumulo, etc. Filesystems are a hot area of research as of 2018, because people have realized that the software models we have devised don’t take full advantage of our current hardware, and the hardware we use for storing information keeps getting better. As such, you may end up designing a filesystem yourself someday. In this section, we will go over a simple, fake filesystem and walk through some examples of how things work.
So, what does our hypothetical filesystem look like? We will base it off of the minixfs, a simple filesystem
that happens to be the first filesystem that Linux ran on. It is laid out sequentially on disk, and the first section is
the superblock. The superblock stores important metadata about the entire filesystem. Since we want to be able to
read this block before we know anything else about the data on disk, this needs to be in a well-known location so
the start of the disk is a good choice. After the superblock, we’ll keep a map of which inodes are being used: the nth bit is set if the nth inode – inode 0 being the root – is in use. Similarly, we store a map recording which
data blocks are used. Finally, we have an array of inodes followed by the rest of the disk - implicitly partitioned
into data blocks. One data block may be identical to the next from the perspective of the hardware components of
the disk. Thinking about the disk as an array of data blocks is simply something we do so that we have a way to
describe where files live on disk.
Below, we have an example of how an inode that describes a file may look. Note that for the sake of simplicity,
we have drawn arrows mapping data block numbers in the inode to their locations on disk. These aren’t pointers
so much as indices into an array.
We will assume that a data block is 4 KiB.
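A C sketch of such an inode, under the assumptions of this example (two direct block numbers and one indirect block; the field names are illustrative):

#include <stdint.h>

typedef struct {
    uint64_t size;        // file size in bytes
    uint32_t mode;        // type and permission bits
    uint32_t hard_links;  // reference count
    int32_t  direct[2];   // data block numbers, -1 if unused
    int32_t  indirect;    // number of a block holding more block numbers
} inode;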
Note that a file will fill up each of its data blocks completely before requesting an additional data block. We will refer to this property as the file being compact. The file presented above is interesting since it uses all of its direct blocks, fills one of the entries in its indirect block, and partially uses another.
The following subsections will all refer to the file presented above.
Note that our calculations so far have been to determine how much data the user is storing on disk. What about the overhead this filesystem incurs to store that data? You’ll notice that we use an indirect block to store the disk block numbers of blocks used beyond the two direct blocks. While doing our calculations above, we omitted this block. It counts instead as overhead of the file, and thus the total overhead of storing this file on disk is sizeof(indirect_block) = 4 KiB.
Thinking about overhead, a related calculation could be to determine the max/min disk usage per file in this
filesystem.
Trivially, a file of size 0 has no associated data blocks and takes up no space on disk (ignoring the space required for its inode, since inodes are located in a fixed-size array somewhere on disk). How about the disk usage of the smallest non-empty file, a file of size 1 B? When a user writes the first byte, a data block is allocated. Since each data block is 4 KiB, we find that 4 KiB is the minimum disk usage for a non-empty file. Here, the file size is only 1 B even though 4 KiB of disk is used – there is a distinction between file size and disk usage because of overhead!
Finding the maximum is slightly more involved. As we saw earlier in this chapter, a filesystem with this structure can have 1024 data block numbers in one indirect block. This implies that the maximum file size is 2 × 4 KiB + 1024 × 4 KiB = 4 MiB + 8 KiB (after accounting for the direct blocks as well). However, on disk we also store the indirect block itself. This means that an additional 4 KiB of overhead is used for the indirect block, so the total disk usage will be 4 MiB + 12 KiB.
Note that when only using direct blocks, completely filling up a direct block implies that our filesize and our
disk usage are the same thing! While it would seem like we always want this ideal scenario, it puts a restrictive
limit on the maximum filesize. Attempting to remedy this by increasing the number of direct blocks seems
promising, but note that this requires increasing the size of an inode and reducing the amount of space available
to store user data – a tradeoff you will have to evaluate for yourself. Alternatively, always splitting your data into chunks that never use indirect blocks may exhaust the limited pool of available inodes.
Performing Reads
Performing reads tends to be pretty easy in our filesystem because our files are compact. Let’s say that we want to read the entirety of this particular file. We’d start by going to the inode’s array of direct block numbers and finding the first one. In our case, it is #7. Then we find the 7th data block from the start of all data blocks and read all of its bytes. We do the same thing for all of the direct blocks. What do we do after? We go to the indirect block and read its contents. We know that every 4 bytes of the indirect block is either a sentinel value (-1) or the number of another data block. In our particular example, the first four bytes evaluate to the integer 5, meaning that our data continues in the 5th data block from the beginning. We do the same for data block #4, and we stop after that because we have read a number of bytes equal to the size stored in the inode.
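As a sketch under the same assumptions (4 KiB blocks, the inode layout sketched earlier, and a hypothetical read_block helper that copies one block into a buffer):

#include <stdint.h>
#include <string.h>

#define BLOCK_SIZE 4096
#define N_DIRECT 2

void read_block(int32_t number, char *buf); // hypothetical helper

void read_whole_file(const inode *in, char *out) {
    char block[BLOCK_SIZE];
    int32_t numbers[BLOCK_SIZE / sizeof(int32_t)]; // 1024 entries
    uint64_t remaining = in->size;
    for (int i = 0; remaining > 0; i++) {
        int32_t number;
        if (i < N_DIRECT) {
            number = in->direct[i];
        } else {
            // A real implementation would read the indirect block once and cache it.
            read_block(in->indirect, (char *)numbers);
            number = numbers[i - N_DIRECT];
        }
        uint64_t chunk = remaining < BLOCK_SIZE ? remaining : BLOCK_SIZE;
        read_block(number, block);
        memcpy(out, block, chunk); // only copy the bytes that belong to the file
        out += chunk;
        remaining -= chunk;
    }
}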
Now, let’s think about the edge cases. How would a program start a read at an arbitrary offset of n bytes, given that the block size is 4 KiB? How many indirect blocks should there be if the filesystem is consistent? (Hint: think about the size stored in the inode.)
Performing Writes
Writing to files
Performing writes falls into two categories: writes to files and writes to directories. First we’ll focus on files and assume that we are writing a byte at the 6th KiB of our file. To perform a write on a file at a particular offset, the filesystem must first find the data block containing that offset. For this particular example we would have to go to the 2nd data block, index number 1, to perform our write. We once again fetch this block’s number from the inode, go to the start of the data blocks, go to the 5th data block, and perform our write at a 2 KiB offset into this block, because we skipped the first four kibibytes of the file stored in block 7. We perform our write and go on our merry way.
Some questions to consider.
• How would a program perform a write where the offset plus the length of the write would extend the length of the file?
• How would a program perform a write where the offset is greater than the length of the original file?
Writing to directories
Performing a write to a directory implies that an entry mapping a name to an inode needs to be added to the directory. Pretend that the example above is a directory. We know that we will be adding at most one directory entry at a time, meaning we need enough space for one directory entry in our data blocks. Luckily, the last data block we have has enough free space. This means we need to find the number of the last data block as we did above, go to where the data ends, and write one directory entry. Don’t forget to update the size of the directory so that the next creation doesn’t overwrite your file!
Some more questions:
• How would a program perform a write when the last data block is already full?
• How about when all the direct blocks have been filled up and the inode doesn’t have an indirect block?
While the POSIX filesystem API has stayed much the same over the years, the filesystems themselves have continued to evolve and provide many important features.
• Data Integrity. File systems use journaling and sometimes checksums to ensure that the data written is valid. Journaling is a simple invention whereby the file system records an operation in a journal; if the filesystem crashes before the operation is complete, it can resume the operation using the partial journal when booted up again.
• Caching. Linux does a good job of caching file system operations like finding inodes, which makes disk operations seem nearly instant. If you want to see a slow system, look at Windows with FAT/NTFS: disk operations need to be cached by the application, or they burn through the CPU.
• Speed. On spinning-disk machines, data toward the outer edge of the metallic platter moves faster (linear velocity is higher farther from the center). Programs have used this to reduce load times for large files, such as movies in video-editing software. SSDs don’t have this property because there is no spinning disk, but they will portion off a section of their space to be used as “swap space” for files.
• Parallelism. Filesystems with multiple heads (for physical hard disks) or multiple controllers (for SSDs)
can utilize parallelism by multiplexing the PCIe slot with data, always serving some data to the application
whenever possible.
• Encryption. Data can be encrypted with one or more keys. A good example of this is Apple’s APFS file
systems.
• Redundancy. Sometimes data can be replicated across blocks to ensure that it is always available.
• Efficient Backups. Many of us have data that we can’t store in the cloud for one reason or another. When a filesystem is either the backup medium or the source of a backup, it is useful for it to be able to calculate what has changed efficiently, compress files, and sync with the external drive.
• Integrity and Bootability. File systems need to be resilient to bit flips. Most readers have their operating system installed on the same partition as the file system they use for day-to-day operations. The file system needs to make sure a stray read or write doesn’t destroy the boot sector – which would mean your computer can’t start up again.
• Fragmentation. Just like a memory allocator, allocating space for a file leads to both internal and external fragmentation. Caching works best when the disk blocks of a single file are located next to each other. File systems need to perform well under low and high levels of fragmentation alike.
• Distribution. Sometimes the filesystem should tolerate single-machine faults. Hadoop and other distributed file systems allow you to do that.
Topics
• Superblock
• Data Block
• Inode
• Relative Path
• File Metadata
• Permission Bits
• Mode bits
• RAID
Questions
• How big can files be on a filesystem with 15 direct blocks, 2 doubly and 3 triply indirect blocks, 4 KiB blocks, and 4-byte entries? (Assume an unlimited supply of blocks.)
• What is the difference between a hard link and a symbolic link? Does the file need to exist?
• “ls -l” shows the size of each file in a directory. Is the size stored in the directory or in the file’s inode?
Bibliography
13 Signals
That’s a signal, Jerry, that’s a signal! [snaps his fingers again] Signal!
George Costanza (Seinfeld)
Signals are a convenient way to deliver low-priority information and for users to interact with their programs when other ways don’t work (for example, when standard input is frozen). They allow a program to clean up or perform an action in the case of an event. A program can also choose to ignore most signals, which is supported. Crafting a program that uses signals well is tricky due to how signals are handled, so signals are usually reserved for termination and cleanup; rarely should they be part of the core programming logic.
For those of you with an architecture background, the interrupts used here aren’t the interrupts generated by
the hardware. Those interrupts are almost always handled by the kernel because they require higher levels of
privileges. Instead, we are talking about software interrupts that are generated by the kernel – though they can
be in response to a hardware event like SIGSEGV.
This chapter will go over how to read information from a process that has either exited or been signaled. Then, it will dive into what signals are, how the kernel deals with a signal, and the various ways processes can handle signals, both with and without threads.
A signal allows one process to send an event or message to another process. If that process wants to accept the signal, it can, and then, for most signals, decide what to do with it.
First, a bit of terminology. A signal disposition is a per-process attribute that determines how a signal is handled after it is delivered. Think of it as a table of signal-action pairs. The full discussion is in the man page. The actions are:
1. TERM, terminate the process
2. IGN, ignore the signal
3. CORE, terminate and dump core
4. STOP, stop (pause) the process
5. CONT, continue the process if stopped
A signal mask determines whether a particular signal is delivered or not. The overall process for how the kernel sends a signal is below.
1. If no signals have arrived yet, the process can install its own signal handlers. This tells the kernel that when the process gets signal X, it should jump to function Y.
2. A signal is generated – by the user, by the kernel in response to a hardware event, or by another process.
3. The time between when a signal is generated and when the kernel can apply the mask rules is called the pending state.
4. The kernel then checks the process’ signal mask. If every thread in the process is blocking the signal, the signal stays blocked and nothing happens until some thread unblocks it.
5. If a single thread can accept the signal, then the kernel executes the action in the disposition table. If the
action is a default action, then no threads need to be paused.
6. Otherwise, the kernel delivers the signal by stopping whatever a particular thread is doing currently, and
jumps that thread to the signal handler. The signal is now in the delivered phase. More signals can be
generated now, but they can’t be delivered until the signal handler is complete which is when the delivered
phase is over.
7. Finally, we consider a signal caught if the process remains intact after the signal was delivered.
(The original presents these steps as a flowchart, with branches for the signal being blocked and for the process being killed.)
Here are some common signals that you will see thrown around (the descriptions follow standard POSIX meanings):
• SIGINT – terminal interrupt (CTRL-C)
• SIGQUIT – terminal quit
• SIGSTOP – stop (pause) the process; cannot be caught or ignored
• SIGCONT – continue the process if stopped
• SIGKILL – kill the process; cannot be caught or ignored
• SIGSEGV – invalid memory reference (segmentation violation)
• SIGPIPE – write to a pipe with no readers
One of our favorite anecdotes is to never use kill -9, for a host of reasons. The following is an excerpt from Useless Use of Kill -9 (link to archive). We still keep kill -9 around for extreme scenarios where the process truly needs to be gone.
Sending Signals
1. The user can send a signal. For example, you are at the terminal, and you press CTRL-C. One can also use
the built-in kill to send any signal.
2. The system can send an event. For example, if a process accesses a page that it isn’t supposed to, the
hardware generates an interrupt which gets intercepted by the kernel. The kernel finds the process that
caused this and sends a signal SIGSEGV. There are other kernel events, like a child terminating or a process needing to be resumed.
3. Finally, another process can send a message. This could be used in low-stakes communication of events
between processes. If you are relying on signals to be the driver in your program, you should rethink your
application design. There are many drawbacks to using POSIX/Real-Time signals for asynchronous commu-
nication. The best way to handle interprocess communication is to use, well, interprocess communication
methods specifically designed for your task at hand.
You or another process can temporarily pause a running process by sending it a SIGSTOP signal. If it succeeds,
it will freeze a process. The process will not be allocated any more CPU time. To allow a process to resume
execution, send it the SIGCONT signal. For example, the following is a program that slowly prints a dot every
second, up to 59 dots.
#include <unistd.h>
#include <stdio.h>

int main() {
    printf("My pid is %d\n", getpid());
    int i = 60;
    while (--i) {
        write(1, ".", 1);
        sleep(1);
    }
    write(1, "Done!", 5);
    return 0;
}
We will first start the process in the background (notice the & at the end). Then, send it a signal from the shell
process by using the kill command.
$ ./program &
My pid is 403
...
$ kill -SIGSTOP 403
$ kill -SIGCONT 403
...
In C, a program can send a signal to a child process using the kill POSIX call, as sketched below.
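A minimal sketch (the child's idle loop and the sleeps are placeholders to make the ordering visible):

#include <signal.h>
#include <sys/types.h>
#include <unistd.h>

int main() {
    pid_t child = fork();
    if (child == 0) {
        while (1) sleep(1); // child: placeholder work
    }
    sleep(1);               // parent: give the child a moment to start
    kill(child, SIGSTOP);   // suspend the child
    sleep(1);
    kill(child, SIGCONT);   // resume the child
    kill(child, SIGKILL);   // finally, terminate it
    return 0;
}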
As we saw above, there is also a kill command available in the shell. Another command, killall, works the exact same way, except that instead of looking a process up by pid, it tries to match the name of the process. ps is an important utility that can help you find the pid of a process.
# First, let's use ps and grep to find the process we want to send a signal to
$ ps au | grep myprogram
angrave 4409 0.0 0.0 2434892 512 s004 R+ 2:42PM 0:00.00 myprogram 1 2 3
To send a signal to the currently running process, use raise or kill with getpid(), as sketched below.
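For example, a sketch of a process pausing itself (in a single-threaded program, raise(sig) behaves like kill(getpid(), sig)):

#include <signal.h>
#include <unistd.h>

int main() {
    raise(SIGSTOP);            // stop ourselves until someone sends SIGCONT
    // kill(getpid(), SIGSTOP); // equivalent in a single-threaded program
    return 0;
}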
For non-root processes, signals can only be sent to processes of the same user. You can’t SIGKILL any process!
man -s2 kill for more details.
Handling Signals
There are strict limitations on the executable code inside a signal handler. Most library and system calls are async-signal-unsafe, meaning they may not be used inside a signal handler because they are not re-entrant. Re-entrant safety means that your function can be frozen at any point, executed again from the start, and still behave correctly. Let's walk through the following scenario, using a function func that copies its argument into a shared static buffer and prints it.
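A minimal sketch of such a function (the name func and the buffer size are assumptions):

#include <stdio.h>
#include <string.h>

static char buffer[100]; // shared by every call -- the source of the problem

void func(const char *str) {
    strncpy(buffer, str, sizeof(buffer) - 1); // copy the argument into the shared buffer
    buffer[sizeof(buffer) - 1] = '\0';
    printf("%s\n", buffer);                   // print whatever the buffer holds *now*
}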
1. We execute func("Hello").
2. The string gets copied over to the buffer completely (strcmp(buffer, "Hello") == 0).
3. A signal is delivered and the function state freezes. We also stop accepting any new signals until after the handler returns (we do this for convenience).
4. The signal handler executes func("World").
5. The string "World" gets copied over to the shared buffer, and the handler returns.
6. We resume the interrupted function and now print out the buffer once again: "World", instead of what the function call originally intended, "Hello".
Making your functions signal-handler safe can't be done just by removing shared buffers. You must also think about multithreading and synchronization – what happens when I double lock a mutex? You also have to make sure that each function call is re-entrant safe. Suppose your original program was interrupted while executing the library code of malloc; the memory structures used by malloc would then be inconsistent. Calling printf, which uses malloc, as part of the signal handler is therefore unsafe and will result in undefined behavior. A safe pattern is to have the handler merely set a variable and let the program resume operating. This design pattern also helps us design programs that can receive signals twice and operate correctly.
#include <signal.h>

int pleaseStop; // see below -- this type is wrong!

void handle_sigint(int signal) {
    pleaseStop = 1;
}

int main() {
    signal(SIGINT, handle_sigint);
    pleaseStop = 0;
    while (!pleaseStop) {
        /* application logic here */
    }
    /* clean up code here */
}
The above code might appear correct on paper. However, we need to provide a hint to the compiler and to the CPU core that will execute the main() loop. We need to prevent a compiler optimization: the expression pleaseStop doesn't get changed in the body of the loop, so some compilers will optimize it to true TODO: citation needed. Secondly, we need to ensure that the value of pleaseStop is not cached in a CPU register but is instead always read from and written to main memory. The sig_atomic_t type implies that all the bits of the variable can be read or modified as an atomic operation – a single uninterruptible operation. It is impossible to read a value that is composed of some new bit values and some old bit values.
By specifying pleaseStop with the correct type volatile sig_atomic_t, we can write portable code
where the main loop will be exited after the signal handler returns. The sig_atomic_t type can be as large as
an int on most modern platforms but on embedded systems can be as small as a char and only able to represent
(-127 to 127) values.
Two examples of this pattern can be found in COMP, a terminal-based 1 Hz 4-bit computer [3]. Two boolean flags are used: one marks the delivery of SIGINT (CTRL-C) and gracefully shuts down the program, and the other marks the SIGWINCH signal, used to detect terminal resizes and redraw the entire display.
You can also choose to handle pending signals asynchronously or synchronously. To install a signal handler that asynchronously handles signals, use sigaction. To synchronously catch a pending signal, use sigwait, which blocks until a signal is delivered, or signalfd, which also blocks and provides a file descriptor that can be read() to retrieve pending signals.
Sigaction
You should use sigaction instead of signal because it has better-defined semantics: signal does different things on different operating systems, which is bad. sigaction is more portable and is better defined for threads.
You can use the sigaction system call to set the current handler and disposition for a signal, or to read the current signal handler for a particular signal.
The sigaction struct includes two callback functions (we will only look at the ‘handler’ version), a signal mask
and a flags field -
struct sigaction {
    void (*sa_handler)(int);
    void (*sa_sigaction)(int, siginfo_t *, void *);
    sigset_t sa_mask;
    int sa_flags;
};
Suppose you stumble upon legacy code that uses signal. The following snippet installs myhandler as the
SIGALRM handler.
signal(SIGALRM, myhandler);
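A sketch of the equivalent sigaction-based installation (myhandler as above; error checking omitted):

struct sigaction sa = {0};
sa.sa_handler = myhandler;  // same handler as the legacy code
sigemptyset(&sa.sa_mask);   // block no extra signals while the handler runs
sa.sa_flags = 0;
sigaction(SIGALRM, &sa, NULL);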
However, we typically also set the mask and the flags field. The mask is a temporary signal mask used during the signal handler's execution. If the thread serving the signal is interrupted in the middle of a system call, the SA_RESTART flag will automatically restart some system calls that would otherwise have returned early with an EINTR error. The latter means we can simplify the rest of the code somewhat, because a restart loop may no longer be required.
sigfillset(&sa.sa_mask);
sa.sa_flags = SA_RESTART; /* Restart functions if interrupted by handler */
Because of the selective nature of the flag, it is often better to have your code check for the EINTR error and restart itself.
Blocking Signals
To block signals use sigprocmask! With sigprocmask you can set the new mask, add new signals to be blocked
to the process mask, and unblock currently blocked signals. You can also determine the existing mask (and use it
for later) by passing in a non-null value for oldset.
From the Linux man page of sigprocmask, here are the possible values for the how argument TODO: cite.
• SIG_BLOCK: The set of blocked signals is the union of the current set and the set argument.
• SIG_UNBLOCK: The signals in set are removed from the current set of blocked signals. It is permissible to attempt to unblock a signal which is not blocked.
• SIG_SETMASK: The set of blocked signals is set to the argument set.
The sigset_t type behaves as a set. It is a common error to forget to initialize the signal set before adding to it.
Correct code initializes the set to be all on or all off. For example,
sigemptyset(&set); // no signals
sigprocmask(SIG_SETMASK, &set, NULL); // set the mask to be empty again
If you block a signal with either sigprocmask or pthread_sigmask, then the handler registered with
sigaction is not delivered unless explicitly sigwait’ed on TODO: cite.
Sigwait
Sigwait can be used to read one pending signal at a time. sigwait is used to synchronously wait for signals,
rather than handle them in a callback. A typical use of sigwait in a multi-threaded program is shown below. Notice
that the thread signal mask is set first (and will be inherited by new threads). The mask prevents signals from
being delivered so they will remain in a pending state until sigwait is called. Also notice the same set sigset_t
variable is used by sigwait - except rather than setting the set of blocked signals it is used as the set of signals that
sigwait can catch and return.
One advantage of writing a custom signal handling thread (such as the example below) rather than a callback
function is that you can now use many more C library and system functions safely.
Based on sigmask code [2]
static sigset_t signal_mask; /* The signals we will handle in our signal thread */

void *signal_thread(void *arg); /* Defined below */

int main() {
    pthread_t sig_thr_id;
    /* Block the signals before any other threads exist; new threads inherit this mask */
    sigemptyset(&signal_mask);
    sigaddset(&signal_mask, SIGINT);
    sigaddset(&signal_mask, SIGTERM);
    pthread_sigmask(SIG_BLOCK, &signal_mask, NULL);
    pthread_create(&sig_thr_id, NULL, signal_thread, NULL);
    /* APPLICATION CODE */
    ...
}

void *signal_thread(void *arg) {
    int sig_caught;
    /* Use the same mask as the set of signals that we'd like to know about! */
    sigwait(&signal_mask, &sig_caught);
    switch (sig_caught) {
    case SIGINT:
        ...
        break;
    case SIGTERM:
        ...
        break;
    default:
        fprintf(stderr, "\nUnexpected signal %d\n", sig_caught);
        break;
    }
    return NULL;
}
This is a recap of the processes chapter. After forking, the child process inherits a copy of the parent’s signal
dispositions and a copy of the parent’s signal mask. If you have installed a SIGINT handler before forking, then
the child process will also call the handler if a SIGINT is delivered to the child. If SIGINT is blocked in the parent,
it will be blocked in the child as well. Note that pending signals for the child are not inherited during forking.
After exec though, only the signal mask and pending signals are carried over [1]. Signal handlers are reset to
their original action, because the original handler code may have disappeared along with the old process.
Each thread has its own mask. A new thread inherits a copy of the calling thread's mask. At process start, the first thread's mask is exactly the same as the process's mask. Once new threads are created, though, "the process's signal mask" becomes a gray area: the kernel instead treats the process as a collection of threads, each of which can institute its own signal mask and receive signals. To start setting your mask, use pthread_sigmask. Blocking signals in multi-threaded programs is similar to single-threaded programs, with the translation that pthread_sigmask replaces sigprocmask.
The easiest method to ensure a signal is blocked in all threads is to set the signal mask in the main thread
before new threads are created.
sigemptyset(&set);
sigaddset(&set, SIGQUIT);
sigaddset(&set, SIGINT);
pthread_sigmask(SIG_BLOCK, &set, NULL);

// this thread and the new thread will block SIGQUIT and SIGINT
pthread_create(&thread_id, NULL, myfunc, funcparam);
Just as we saw with sigprocmask, pthread_sigmask includes a 'how' parameter that defines how the signal set is to be used – the same SIG_BLOCK, SIG_UNBLOCK, and SIG_SETMASK values described above.
A signal can then be delivered to any thread that is willing to accept it. If two or more threads can receive the signal, then which thread will be interrupted is arbitrary! A common practice is to have one thread that can receive all signals, or, if a certain signal requires special logic, multiple threads for multiple signals. Even though programs from the outside can't send signals to specific threads, you can do that internally with pthread_kill(pthread_t thread, int sig). In the sketch below, the newly created thread executing func will be interrupted by SIGINT.
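A minimal sketch (the names tid, func, and args are placeholders):

pthread_t tid;
pthread_create(&tid, NULL, func, args); // start the worker thread
pthread_kill(tid, SIGINT);              // deliver SIGINT to that specific thread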
As a word of warning, pthread_kill(threadid, SIGKILL) will kill the entire process. Though individual threads can set a signal mask, the signal disposition is per-process, not per-thread. This means sigaction can be called from any thread, because it sets a signal handler for all threads in the process.
The Linux man pages discuss signal system calls in section 2. There is also a longer article, signal(7), in section 7 (though not in OSX/BSD).
Topics
• Signals
• Signal Disposition
• Signal States
• Raising Signals in C
Questions
• What is a signal?
• How are signals served under UNIX? (Bonus: How about Windows?)
• What does it mean that a function is signal handler safe? How about reentrant?
• What function changes the signal disposition in a single threaded program? How about a multithreaded
program?
• What happens to pending signals after a fork? exec? How about my signal mask? How about signal
disposition?
• What is the process the kernel goes through from creation to delivery/block?
Bibliography
Hackers Are Like Artists, Who Wake Up In A Good Mood & Start
Painting
Vladimir Putin
Computer security is the protection of hardware and software from unauthorized access or modification. Even
if you don’t work directly in the computer security field, the concepts are important to learn because all systems
will have attackers given enough time. Even though this is introduced as a different chapter, it is important to note
that most of these concepts and code examples have already been introduced at different points in the course. We
won’t go in depth about all of the common ways of attack and defense nor will we go into how to perform all of
these attacks in an arbitrary system. Our goal is to introduce you to the field of making programs do what you
want to do.
There is some terminology that needs to be explained to get someone who has little to no experience in computer security up to speed:
1. An Attacker is typically the user who is trying to break into the system. Breaking into the system means
performing an action that the developer of the system didn’t intend. It could also mean accessing a system
you shouldn’t have access to.
2. A Defender is typically the user who is preventing the attacker from breaking into the system. This may be
the developer of the system.
3. There are different types of attackers. White hat hackers attempt to hack a defender with their consent. This is commonly a form of pre-emptive testing – in case a not-so-friendly attack comes along. Black hat hackers hack without permission, with the intent to use the information obtained for any purpose. Gray hat hacking differs because the hacker's intent is to inform the defender of the vulnerability – though this can be hard to judge at times.
Danger, Will Robinson. Before we let you go much further, it is important that we talk about ethics. Before you skip over this section, know that your career quite literally can be terminated over an unethical decision that you might make. The Computer Fraud and Abuse Act is a broad, and arguably terrible, law that casts any non-authorized use of a 'protected computer' as a felony. Since most computers are involved in some interstate/international commerce (the internet), most computers fall under this category. It is important to think about your actions and have some ladder of accountability before executing any attack or defense. To be more concrete, make sure supervisors in your organization have given you their blessing before trying to execute an attack.
First, if at all possible, get written permission from one of your superiors. We do realize that this is a cop-out and puts the blame up a level but, at the risk of sounding cynical, organizations will often put blame on an individual employee to avoid damages TODO: Citation Needed. If that is not possible, try to go through the following engineering steps:
1. Figure out what the problem is that you are trying to solve. You can’t solve a problem that you don’t fully
understand.
2. Determine whether you need to "hack" the system. A hack is generally defined as using a system in an unintended way. First, determine whether your use is intended, unintended, or somewhere in the middle, and get a decision from the system's owners. If you can't get that, make a reasonable judgement as to what the intended use is.
3. Figure out a reasonable estimate of the cost of "hacking" the system. Get that estimate checked out by a few engineers so they can highlight things that you may have missed. Try to get someone to sign off on the plan.
4. Execute the plan with caution. If at any point something seems wrong, stop and weigh the risks before continuing. If there isn't an established ethical guideline for the current application, then create some. This is often called a policy vacuum. This may seem like busywork and more on the "business side" than computer scientists are used to, but your career is at stake here. It is up to you as a computing professional to assess the risk and to decide whether to execute. Courts generally like sitting on precedent, but you can easily say that you aren't a legal scholar. In lieu of precedent, you must be able to say that you reacted as a "reasonable" engineer would react.
TODO: Link to some case studies of real engineers having to decide
CIA Triad
There are three commonly accepted goals to help understand if a system is secure.
1. Information Confidentiality means that only authorized parties are allowed to see a piece of information.
2. Information Integrity means that only authorized parties are allowed to modify a piece of information, regardless of whether they are allowed to see it. It ensures that information remains intact during transit.
3. Information Availability means that information and services remain accessible to authorized parties when needed.
4. Together these form the Confidentiality, Integrity, and Availability (CIA) triad; often authenticity is added as well.
If any of these are broken, the security of a system (either a service or piece of information) has been
compromised.
Security in C Programs
Stack Smashing
Consider the following code snippet
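A minimal sketch consistent with the transcript below (the function and variable names are assumptions):

#include <stdio.h>
#include <string.h>

void greet(const char *name) {
    char buf[32];        // fixed-size buffer on the stack
    strcpy(buf, name);   // no bounds checking!
    printf("Hello, %s!\n", buf);
}

int main(int argc, char **argv) {
    greet(argv[1]);
    return 0;
}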
There is no checking on the bounds of strcpy! This means that we could potentially pass in a large string and get the program to do something unintended, usually by replacing the return address of the function with the address of malicious code. Most overly long strings will simply cause the program to exit with a segmentation fault.
$ ./a.out john
Hello, john!
$ ./a.out JohnAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA...
Program received signal SIGSEGV, Segmentation fault.
...
If we manipulate the bytes in certain ways, and the program was compiled with the right flags, we can actually get access to a shell! Consider: if that file is owned by root, we put in some valid machine code (binary instructions) as the string. What will happen is that we'll try to execute something like execve("/bin/sh", ...), compiled to the machine code of the target architecture and passed as part of our string. With some luck, we will get access to a root shell.
$ ./a.out <payload>
root#
The question arises: which parts of the triad does this break? Try to answer that question yourself. So how would we go about fixing this? We could train programmers at the C level to use strncpy, or strlcpy on OpenBSD systems. Turning on stack canaries, as explained later, will fix this issue as well.
Buffer Overflow
Most of you are already familiar with buffer overflows! A lot of the time they are fairly tame, leading to simple program crashes or funny mistakes. Here is a complete example.
#include <stdio.h>

int main() {
    char out[10];
    char in[10];
    fscanf(stdin, "%s", in);
    out[0] = 'a';
    out[9] = '\0';
    printf("%s\n", out);
    return 0;
}
$ gcc main.c -fno-stack-protector # need the special flag, otherwise this won't work
# Stack protectors are explained later.
$ ./a.out
hello
a
$ ./a.out
hellloooooooo
aoo
What happens here should be clear if you recall the C memory model: out and in are next to each other in memory. If you read a string from standard input that overflows in, you end up printing aoo. It gets a little more serious if the snippet starts out as
int main() {
    char pass_hash[10];
    char in[10];
    read_user_password(pass_hash, 10);
    // ...
}
Out of order instructions & Spectre
Out of order execution is an amazing development that has been recently adopted by many hardware vendors
(think 1990s) TODO: citation needed. Processors now instead of executing a sequence of instructions (let’s say
assigning a variable and then another variable) execute instructions before the current one is done [1, P. 45]. This
is because modern processors spend a lot of time waiting for memory accesses and other I/O driven applications.
This means that a processor, while it is waiting for an operation to complete, will execute the next few operations.
If any of the operations would possibly alter the final result, there is a barrier, or if the re-ordering violates the
data dependencies of the instructions, the processor keep the instructions in the stated order [1, P. 296].
Naturally, this allowed CPUs to become more energy-efficient while executing more instructions in real time, but it also increased the security risks that come from complex architectures. What system programmers worry about is that operations on mutex locks among threads can be reordered – meaning that a pure software implementation of a mutex will fail without copious memory barriers. The programmer therefore has to acknowledge that, absent a barrier, updates may be missed among a series of threads on modern processors.
One of the most prominent bugs concerning this is Spectre [2]. Spectre is a bug where instructions that
otherwise wouldn’t be executed are speculatively executed due to out-of-order instruction execution. The following
snippet is a high-level proof of concept.
char *a[10];
// Allocate 9 valid one-byte buffers: a[1] .. a[9]
for (int i = 9; i != 0; --i) {
    a[i] = calloc(1, 1);
}
// a[0] is an invalid address; dereferencing it should segfault
a[0] = (char *) 0xCAFE;

int val;
int j = 9;          // This will be in a register
volatile int i = 9; // This will be in main memory
for (; i != -1; --i, --j) {
    if (i) {        // taken for the first 9 iterations, not the last
        val = *a[j];
    }
}
Let's analyze this code. The first loop allocates 9 elements through valid calloc calls. The remaining entry, a[0], is set to 0xCAFE, an invalid address, so dereferencing it should result in a SEGFAULT. For the first 9 iterations, the branch is taken and val is assigned a valid value. The interesting part happens in the last iteration: the correct behavior of the program is to skip the body, so val never gets assigned the last value.
But on real hardware, the instructions can be speculatively executed. The processor predicts that the branch will be taken, since it has been taken in the previous 9 iterations, and fetches those instructions ahead of time. Because i has to be fetched from main memory (we deliberately force it out of a register), the branch takes a while to resolve, and in the meantime the processor will try to dereference that invalid address. This should result in a SEGFAULT, but since the address was never logically reached by the program, the result is simply discarded.
Now here is the trick. Even though the speculative dereference would have resulted in a SEGFAULT, the bug is that the processor doesn't clear the cache lines referring to the physical memory where 0xCAFE is located. This is an inexact explanation, but it is essentially how the attack works. Since the data is still in the cache, if you again trick the processor into reading from the cache using val, you can recover a memory value that you wouldn't be able to read normally. This could include important information such as passwords, payment information, etc.
Operating systems provide a number of defenses against attacks like these. An incomplete list:
1. Permissions. As covered in the filesystems chapter, each file carries read/write/execute permission bits for its owner, its group, and everyone else, restricting who can view, modify, or run it.
2. Capabilities. In addition to permissions on files, each process carries a set of capabilities enumerating privileged actions it may perform. For a full list, you can check capabilities(7). In short, granting a capability allows a user to perform a certain set of actions. Some examples include controlling networking devices, creating special files, and peering into IPC, or interprocess communication.
3. Address Space Layout Randomization (ASLR). ASLR causes important sections of a process's address space, including the base address of the executable and the positions of the stack, heap, and libraries, to start at randomized values on every run. This means an attacker with a running executable has to guess where sensitive information is hidden. Without it, for example, an attacker could easily perform a return-to-libc attack by jumping to a known library address.
4. Stack Protectors. Let's say you've programmed a buffer overflow, as above. In most cases, what happens? Unless specifically turned off, the compiler will put in stack protectors, or stack canaries: a value placed on the stack that must remain constant for the duration of the function call. If that protector has been overwritten by the end of the function call, the runtime will abort and report to the user that stack smashing was detected.
5. Write xor Execute, also known as Data Execution Prevention (DEP). This is a protection that was
covered in the IPC section that distinguishes code from data. A page can either be written to or executed
but not both. This is to prevent buffer overflows where attackers write arbitrary code, often stored on the
stack or heap, and execute with the user’s permissions.
6. Firewall. The Linux kernel provides the netfilter module as a way of deciding whether an incoming connection should be allowed, along with various other restrictions on connections. This can help with a DDOS attack (explained later).
7. AppArmor. AppArmor is a suite of operating system tools at the userspace level to restrict applications to
certain operations.
OpenBSD is arguably a better system for security. It has many security-oriented features, some of which have been touched upon earlier. An exhaustive list of features is at https://www.openbsd.org/innovations.html
1. pledge. Pledge is a powerful call that restricts the system calls a process may make from then on. This means that for a simple program like cat, which only reads from files and writes what it reads, one can reasonably restrict all network access, pipe access, and write access to files. This is part of the process of "hardening" an executable or system: giving the smallest amount of permissions to the least number of executables needed to run a system. Pledge is also useful in case someone tries to perform an injection attack (see the sketch after this list).
2. unveil. Unveil is a system call that restricts the access of the current program to a few directories; those restrictions apply to all forked programs as well (also sketched after this list). This means that if you have a suspicious executable whose description is "creates a new file and outputs random words", you could use this call to restrict access to a safe subdirectory and watch the program receive the SIGKILL signal if it tries to access system files in the root directory, for example. This could be useful for your own programs too. If you want to ensure that no user data is lost during an update (which is what happened with a Steam system update), the system could unveil only the program's installation directory. If an attacker manages to find an exploit in the executable, they can only compromise the installation directory.
3. sudo. Sudo is an OpenBSD project that runs everywhere! Before sudo, to run commands as root one would have to drop to a root shell. Sometimes that also meant giving users scary system capabilities. Sudo gives users the ability to perform commands as root for one-offs, without giving a long list of capabilities to all of your users.
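A sketch of pledge and unveil together (the path and the promise strings are illustrative; err(3) exits with a message):

#include <unistd.h>
#include <err.h>

int main() {
    // Expose only one directory tree (read/write/create), then lock unveil
    if (unveil("/tmp/sandbox", "rwc") == -1)
        err(1, "unveil");
    if (unveil(NULL, NULL) == -1) // forbid further unveil calls
        err(1, "unveil");
    // Restrict all future system calls to stdio and read-only path access
    if (pledge("stdio rpath", NULL) == -1)
        err(1, "pledge");
    /* ... the rest of the (now hardened) program ... */
    return 0;
}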
Virtualization Security
Virtualization is the act of creating a virtual version of an environment for a program to run on. Though that
definition might be bent a little with the advent of new-age bare metal Virtual Machines, the abstraction is still there.
One can imagine a single operating system per motherboard. Virtualization in the software sense means providing "virtual" motherboard features, like USB ports or monitors, where another program (the bridge) talks to the actual hardware to perform the task. A simple example is running a virtual machine on your host desktop!
One can spin up an entirely different operating system whose instructions are fed through another program and
executed on the host system. There are many forms of virtualization that we use today. We will discuss two
popular forms below. One form is virtual machines: programs that emulate all of a motherboard's peripherals to create a full machine. Another form is containers. Virtual machines are good, but they are often bulky, and programs usually need only a certain level of protection. Containers don't emulate all the motherboard peripherals; instead, they share them with the host operating system while adding additional layers of isolation and security.
Now, you can’t have proper virtualization without security. One of the reasons to have virtualization is to ensure
that the virtualized environment doesn’t maliciously leak back into the host environment. We say maliciously
because there are intended ways of communication that we want to keep in check. Here are some simple examples
of security provided through virtualization
1. chroot is a contrived way of creating a virtualized environment. chroot is short for "change root": it changes where a program believes the root directory (/) is mounted. For example, with chroot one can make a hello world program believe /home/bhuvan/ is actually the root directory. This is useful because no other files are exposed. It is contrived because Linux programs still need supporting pieces (think the C standard library) that come from directories such as /usr/lib, which means those could still be vulnerable. (A sketch of the call appears after this list.)
2. namespaces are Linux’s better way to create a virtualization environment. We won’t go into this too much,
just know that they exist.
3. Hardware virtualization technology. Hardware vendors have become increasingly aware that physical protections are needed when emulating instructions. As such, there are switches, enabled by the user, that allow the operating system to flip into a virtualization mode where instructions run as normal but are monitored for malicious activity. This improves the performance and the security of virtualized environments.
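A sketch of the chroot call mentioned above (requires root; the path is illustrative):

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main() {
    if (chroot("/home/bhuvan") == -1) { // everything under this path becomes "/"
        perror("chroot");
        exit(1);
    }
    chdir("/"); // ensure the working directory lies inside the new root
    /* ... run the confined program ... */
    return 0;
}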
Extra: Security through scrubbing
Security isn't just about making sure that a program cannot be manipulated to a malicious user's ends. Sometimes, security means making sure that the program can't be crashed, or usefully timed, based on its input. An example of the former is any attack where the attacker threatens to shut a system down by triggering a flaw: the user can input a value that causes the system to fail. One can imagine that in mission-critical systems – power grids, medical devices, etc. – this is not a threat taken lightly.
Also, many novice programmers neglect making a program behave identically for all inputs. A common example is comparing two strings. Say you are guessing the Cross-Site Request Forgery (CSRF) token on a website. If the server returns the response immediately after the first mismatching character, that is a security bug: if none of the token's characters match, an attacker learns from the quick response that the first few characters were wrong; if some of the characters match, the request takes slightly longer, and the attacker learns that too. It is important to balance speed and security. If correct authentication parameters are supplied, the action should succeed immediately. However, if there is a mismatch, there should be no difference in the time or content of the response for different reasons of failure, nor should any information be given to the user about the source of the error. This is often accomplished by using a standard delay for responses to unsuccessful authentication attempts. We would like to use a fast memcmp, but it may be insecure.
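A minimal sketch of a constant-time comparison (an illustration, not a vetted cryptographic primitive; both buffers are assumed to be n bytes long):

#include <stddef.h>

int constant_time_equals(const unsigned char *a, const unsigned char *b, size_t n) {
    unsigned char diff = 0;
    for (size_t i = 0; i < n; i++) {
        diff |= a[i] ^ b[i]; // accumulate mismatches; no branch, no early exit
    }
    return diff == 0; // running time depends only on n, not on where a mismatch occurs
}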
Cyber Security
Cyber security is arguably the most popular area of security. As more and more of our systems are attacked over the web, it is important to understand how we can protect against these attacks.
2. Identity Verification. In TCP, there is no way to verify the identity of whoever the program is connecting to. There are no checks or federated databases in place. One simply has to trust that the DNS server gave a reasonable response, and blind trust is almost always the incorrect answer. Apart from systems that have an approved whitelist or a "secret" connection protocol, there is little at the TCP level that one can do to stop impersonation.
3. Syn-Ack Sequence Numbers. This is a security improvement. TCP features what we call sequence numbers: during the SYN, SYN/ACK, ACK dance, a connection starts at a random sequence number. This is important because an attacker trying to spoof packets (pretend that their packets are coming from your program) must either correctly guess that number – which is hard – or be on the route that your packets take to the destination – much more likely. ISPs help with the latter problem because they may send a connection through varying routers, which makes it hard for an attacker to sit anywhere and be sure to receive your packets. This is also why security experts usually advise against using coffee shop wifi for sensitive tasks.
4. Syn-Flood. Before the first synchronization packet is acknowledged, there is no connection. That means a malicious attacker can write a bad TCP implementation that sends out a flood of SYN packets to a hapless server. A SYN flood is easily mitigated by using iptables or another netfilter module to drop all incoming connections from an IP address after a certain volume of traffic is reached in a certain period.
5. Denial of Service, and Distributed Denial of Service, are the hardest forms of attack to stop. Companies today are still trying to find good ways to mitigate these attacks. They involve sending all sorts of network traffic at servers in the hope that the traffic will clog them up and slow them down. In big systems, this can lead to cascading failures: if a system is engineered poorly, one server's failure causes all the other servers to pick up more work, which increases the probability that they fail, and so on and so forth.
Topics
1. Security Terminology
2. Security in C Programs
3. Security in CyberSpace
Review
2. What is a chmod statement to break only the confidentiality and availability of your data?
3. An attacker gains root access on a Linux system that you use to store private information. Does this affect
confidentiality, integrity, or availability of your information, or all three?
4. Hackers brute force your git username and password. Who is affected?
9. Is creating and implementing client-server protocols that are secure and invulnerable to malicious attackers
easy?
14. HeartBleed is an example of what kind of security issue? Which one(s) of the triad does it break?
15. Meltdown and Spectre is an example of what kind of security issue? Which one(s) of the triad does it break?
Bibliography
[1] Intel Corporation. Intel® 64 and IA-32 Architectures Software Developer's Manual, Volume 3B: System Programming Guide.
[2] Paul Kocher, Daniel Genkin, Daniel Gruss, Werner Haas, Mike Hamburg, Moritz Lipp, Stefan Mangard, Thomas
Prescher, Michael Schwarz, and Yuval Yarom. Spectre attacks: Exploiting speculative execution. arXiv preprint
arXiv:1801.01203, 2018.
15
Review
1. In the example below, which variables are guaranteed to print the value of zero?

int a;
static int b;

void func() {
    static int c;
    int d;
    printf("%d %d %d %d\n", a, b, c, d);
}
2. In the example below, which variables are guaranteed to print the value of zero?
void func() {
    int *ptr1 = malloc(sizeof(int));
    int *ptr2 = realloc(NULL, sizeof(int));
    int *ptr3 = calloc(1, sizeof(int));
    int *ptr4 = calloc(sizeof(int), 1);
    printf("%d %d %d %d\n", *ptr1, *ptr2, *ptr3, *ptr4);
}
3. Explain the error in the following attempt to copy a string.
char *copy(char *src) {
    char *result = malloc(strlen(src));
    strcpy(result, src);
    return result;
}
4. Why does the following attempt to copy a string sometimes work and sometimes fail?
char *copy(char *src) {
    char *result = malloc(strlen(src) + 1);
    strcat(result, src);
    return result;
}
5. Explain the two errors in the following code that attempts to copy a string.
char *copy(char *src) {
    char result[sizeof(src)];
    strcpy(result, src);
    return result;
}
7. Complete the function pointer typedef to declare a pointer to a function that takes a void* argument and
returns a void*. Name your type ‘pthread_callback’
typedef ______________________;
10. Implement a version of size_t strlen(const char *) using a loop and no function calls.
Printing
1. Spot the two errors!
2. Complete the following code to print to a file. Print the name, a comma and the score to the file ‘result.txt’
3. How would you print the values of the variables a, mesg, val and ptr to a string? Print a as an integer, mesg as a C string, val as a double and ptr as a hexadecimal pointer. You may assume mesg points to a short C string (<50 characters). Bonus: How would you make this code more robust, i.e. able to cope with longer strings?
Input parsing
1. Why should you check the return value of sscanf and scanf? Why is gets dangerous?
2. Write a complete program that uses getline. Ensure your program has no memory leaks.
3. When would you use calloc instead of malloc? When would realloc be useful?
4. What mistake did the programmer make in the following code? Is it possible to fix it
i) using heap memory? ii) using global (static) memory?
char *next_ticket() {
    id++;
    char result[20];
    sprintf(result, "%d", id);
    return result;
}
Processes
1. What is a process?
2. What attributes are carried over from a process on fork? How about on a successful exec call?
3. What is a fork bomb? How can we avoid one?
Memory
7. What are the benefits and drawbacks to first fit, worst fit, best fit?
Synchronization
7. What is Peterson's Solution to the critical section problem? How about Dekker's?
8. Is the following code thread-safe? Redesign the following code to be thread-safe. Hint: A mutex is
unnecessary if the message memory is unique to each call.
void *format(int v) {
    pthread_mutex_lock(&mutex);
    sprintf(message, ":%d:", v);
    pthread_mutex_unlock(&mutex);
    return message;
}
9. Which of the following events will cause the process to exit?
(a) Returning from the pthread's starting function in the last running thread.
(b) The original thread returning from main.
(c) Any thread causing a segmentation fault.
(d) Any thread calling exit.
(e) Calling pthread_exit in the main thread with other threads still running.
10. Write a mathematical expression for the number of “W” characters that will be printed by the following
program. Assume a,b,c,d are small positive integers. Your answer may use a ‘min’ function that returns its
lowest valued argument.
11. Complete the following code, which is supposed to print alternating A and B. It represents two threads that take turns executing. Add condition variable calls to func so that the waiting thread need not continually check the turn variable. Q: Is pthread_cond_broadcast necessary, or is pthread_cond_signal sufficient?
pthread_cond_t cv = PTHREAD_COND_INITIALIZER;
pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;

void *turn;

void *func(void *mesg) {
    while (turn == mesg) {
        /* poll again ... Change me - This busy loop burns CPU time! */
    }
    /* ... print the message and hand the turn to the other thread ... */
    return 0;
}
12. Identify the critical sections in the given code. Add mutex locking to make the code thread safe. Add
condition variable calls so that total never becomes negative or above 1000. Instead the call should block
until it is safe to proceed. Explain why pthread_cond_broadcast is necessary.
int total;

void add(int value) {
    if (value < 1) return;
    total += value;
}

void sub(int value) {
    if (value < 1) return;
    total -= value;
}
13. A thread-unsafe data structure has size(), enq, and deq methods. Use a condition variable and mutex lock to complete thread-safe, blocking versions.
14. Your startup offers path planning using the latest traffic information. Your overpaid intern has created a
thread unsafe data structure with two functions: shortest (which uses but does not modify the graph)
and set_edge (which modifies the graph).
For performance, multiple threads must be able to call shortest at the same time, but the graph can only be modified by one thread when no other threads are executing inside shortest or set_edge.
15. Use mutex lock and condition variables to implement a reader-writer solution. An incomplete attempt is
shown below. Though this attempt is thread safe (thus sufficient for demo day!), it does not allow multiple
threads to calculate shortest path at the same time and will not have sufficient throughput.
path_t *shortest_safe(graph_t *graph, int i, int j) {
    pthread_mutex_lock(&m);
    path_t *path = shortest(graph, i, j);
    pthread_mutex_unlock(&m);
    return path;
}

void set_edge_safe(graph_t *graph, int i, int j, double dist) {
    pthread_mutex_lock(&m);
    set_edge(graph, i, j, dist);
    pthread_mutex_unlock(&m);
}
16. How many of the following statements are true for the reader-writer problem?
Deadlock
1. What are each of the Coffman conditions, and what do they mean? Can you provide a definition of each one and an example of breaking them using mutexes?
2. Give a real life example of breaking each Coffman condition in turn. A situation to consider: Painters, paint
and paint brushes.
3. Identify when Dining Philosophers code causes a deadlock (or not). For example, if you saw the following
code snippet which Coffman condition is not satisfied?
• P1 acquires R1
• P2 acquires R2
• P1 acquires R3
• P2 waits for R3
• P3 acquires R5
• P1 acquires R4
• P3 waits for R1
• P4 waits for R5
• P5 waits for R1
5. What are the pros and cons for the following solutions to dining philosophers
(a) Arbitrator
(b) Dijkstra
(c) Stallings'
(d) Trylock
IPC
2. How do you determine how many bits are used in the page offset?
3. 20 ms after a context switch the TLB contains all logical addresses used by your numerical code which
performs main memory access 100% of the time. What is the overhead (slowdown) of a two-level page
table compared to a single-level page table?
4. Explain why the TLB must be flushed when a context switch occurs (i.e. the CPU is assigned to work on a
different process).
5. Fill in the blanks to make the following program print 123456789. If cat is given no arguments it simply
prints its input until EOF. Bonus: Explain why the close call below is necessary.
int main() {
    int i = 0;
    while (++i < 10) {
        pid_t pid = fork();
        if (pid == 0) { /* child */
            char buffer[16];
            sprintf(buffer, ______, i);
            int fds[______];
            pipe(fds);
            write(fds[1], ______, ______); // Write the buffer into the pipe
            close(______);
            dup2(fds[0], ______);
            execlp("cat", "cat", ______);
            perror("exec");
            exit(1);
        }
        waitpid(pid, NULL, 0);
    }
    return 0;
}
6. Use POSIX calls fork pipe dup2 and close to implement an autograding program. Capture the standard
output of a child process into a pipe. The child process should exec the program ./test with no additional
arguments (other than the process name). In the parent process read from the pipe: Exit the parent process
as soon as the captured output contains the ! character. Before exiting the parent process send SIGKILL to
the child process. Exit 0 if the output contained a !. Otherwise if the child process exits causing the pipe
write end to be closed, then exit with a value of 1. Be sure to close the unused ends of the pipe in the parent
and child process
7. This advanced challenge uses pipes to get an “AI player” to play itself until the game is complete. The
program tic tac toe accepts a line of input - the sequence of turns made so far, prints the same sequence
followed by another turn, and then exits. A turn is specified using two characters. For example “A1” and
“C3” are two opposite corner positions. The string B2A1A3 is a game of 3 turns/plys. A valid response is
B2A1A3C1 (the C1 response blocks the diagonal B2 A3 threat). The output line may also include a suffix
-I win -You win -invalid or -draw Use pipes to control the input and output of each child process
created. When the output contains a -, print the final output line (the entire game sequence and the result)
and exit.
8. Write a function that uses fseek and ftell to replace the middle character of a file with an ‘X’
9. What is an MMU? What are the drawbacks to using it versus a direct memory system?
Filesystems
7. What is a UID? A GID? What is the difference between a UID and an Effective UID?
8. What is umask?
12. In an ext2 filesystem how many inodes are read from disk to access the first byte of the file /dir1/subdirA/notes.txt
? Assume the directory names and inode numbers in the root directory (but not the inodes themselves) are
already in memory.
13. In an ext2 filesystem what is the minimum number of disk blocks that must be read from disk to access the
first byte of the file /dir1/subdirA/notes.txt ? Assume the directory names and inode numbers in
the root directory and all inodes are already in memory.
14. In an ext2 filesystem with 32-bit disk block addresses and 4 KiB disk blocks, an inode can store 10 direct disk block numbers. What is the minimum file size required to require i) a single indirection table? ii) a double indirection table?
15. Fix the shell command chmod below to set the permissions of the file secret.txt so that the owner can read, write, and execute it, the group can only read it, and everyone else has no access.
Networking
1. What is a socket?
2. What are the different layers of the internet?
6. Create a simple TCP echo server. This is a server that reads bytes from a client until it closes and echoes the
bytes back to the client.
7. Create a UDP client that would send a flood of packets to a hostname at argv[1].
8. What is HTTP?
9. What is DNS?
15. If a host address is 32 bits which IP scheme am I most likely using? 128 bits?
16. Which common network protocol is packet based and may not successfully deliver the data?
17. Which common protocol is stream-based and will resend data if packets are lost?
20. What protocol uses sequence numbers? What is their initial value? And why?
21. What is the minimum set of network calls required to build a TCP server? What is their correct order?
22. What is the minimum set of network calls required to build a TCP client? What is their correct order?
25. Which of the above calls can block, waiting for a new client to connect?
26. What is DNS? What does it do for you? Which of the CS241 network calls will use it for you?
29. Which network call specifies the size of the allowed backlog?
32. When is epoll a better choice than select? When is select a better choice than epoll?
33. Will write(fd, data, 5000) always send 5000 bytes of data? When can it fail?
35. Assuming a network has a 20ms One Way Transit Time between Client and Server, how much time would it
take to establish a TCP Connection?
(a) 20ms
(b) 40ms
(c) 100ms
(d) 60ms
36. What are some of the differences between HTTP 1.0 and HTTP 1.1? How many ms will it take to transmit 3
files from server to client if the network has a 20ms transmit time? How does the time taken differ between
HTTP 1.0 and HTTP 1.1?
37. Writing to a network socket may not send all of the bytes and may be interrupted due to a signal. Check the
return value of write to implement write_all that will repeatedly call write with any remaining data.
If write returns -1 then immediately return -1 unless the errno is EINTR - in which case repeat the last
write attempt. You will need to use pointer arithmetic.
38. Implement a multithreaded TCP server that listens on port 2000. Each thread should read 128 bytes from
the client file descriptor and echo it back to the client, before closing the connection and ending the thread.
39. Implement a UDP server that listens on port 2000. Reserve a buffer of 200 bytes. Listen for an arriving
packet. Valid packets are 200 bytes or less and start with four bytes 0x65 0x66 0x67 0x68. Ignore invalid
packets. For valid packets add the value of the fifth byte as an unsigned value to a running total and print
the total so far. If the running total is greater than 255 then exit.
Security
4. How does an operating system provide security? What are some examples from Networking and Filesystems?
6. Is DNS secure?
Signals
1. Give the names of two signals that are normally generated by the kernel
3. Why is it unsafe to call a function that is not signal-handler safe inside a signal handler?
4. Write brief code that uses sigaction and a signal set (sigset_t) to create a SIGALRM handler.
5. What is the difference between a disposition, a mask, and the pending signal set?
6. What attributes are passed over to process children? How about executed processes?
16
Honors topics
If I have seen further it is by standing on the sholders [sic] of Giants
Sir Isaac Newton
This chapter contains the contents of some of the honors lectures (CS 296-41). These topics are aimed at
students who want to dive deeper into the topics of CS 241.
Throughout the course of CS 241, you become familiar with system calls - the userspace interface to interacting
with the kernel. How does this kernel actually work? What is a kernel? In this section, we will explore these
questions in more detail and shed some light on various black boxes that you have encountered in this course. We
will mostly be focusing on the Linux kernel in this chapter, so please assume that all examples pertain to the Linux
kernel unless otherwise specified.
System Calls Demystified
System Calls use an instruction that can be run by a program operating in userspace that traps to the kernel (by
use of a signal) to complete the call. This includes actions such as writing data to disk, interacting directly with
hardware in general or operations related to gaining or relinquishing privileges (e.g. becoming the root user and
gaining all capabilities).
In order to fulfill a user’s request, the kernel will rely on kernel calls. Kernel calls are essentially the
"public" functions of the kernel - functions implemented by other developers for use in other parts of the kernel.
Here is a snippet for a kernel call man page:
(Man page snippet omitted: it showed the call's Name, its Arguments – for example a size_t size parameter and a set of flags – and a Description of the call's behavior.)
You'll note that some flags are marked as potentially causing sleeps. This tells us whether we can use those flags in special scenarios, like interrupt contexts, where speed is of the essence and where an operation that blocks or waits on another process may never complete.
Containerization
As we enter an era of unprecedented scale, with around 20 billion devices connected to the internet in 2018, we need technologies that help us develop and maintain software capable of scaling upwards. Additionally, as software increases in complexity and designing secure software becomes harder, we find new constraints imposed on us as we develop applications. As if that wasn't enough, efforts to simplify software distribution and development, like package manager systems, can often lead to headaches of their own: broken packages, dependencies that are impossible to resolve, and other environmental nightmares that have become all too common today. While these seem like disjoint problems at first, all of them and more can be addressed by throwing containerization at the problem.
What is a container?
A container is almost like a virtual machine. In some senses, containers are to virtual machines as threads are
to processes. A container is a lightweight environment that shares resources and a kernel with a host machine,
while isolating itself from other containers or processes on the host. You may have encountered containers while
working with technologies such as Docker, perhaps the most well-known implementation of containers out there.
Linux Namespaces
17
Appendix
Shell
A shell is how you will actually be interacting with the system. Before user-friendly operating systems, when a computer started up, all you had access to was a shell. All of your commands and editing had to be done this way. Nowadays, our computers boot up in desktop mode, but one can still access a shell using a terminal.
(Stuff) $
It is ready for your next command! You can type in a lot of Unix utilities like ls or echo Hello, and the shell will execute them and give you the result. Some of these are known as shell-builtins, meaning the code is in the shell program itself. Some of these are compiled programs that you run. The shell looks through a special variable called PATH, which contains a colon-separated list of directories to search for an executable matching the command name; here is an example PATH.
$ echo $PATH
/usr/local/sbin:/usr/local/bin:/usr/sbin:
/usr/bin:/sbin:/bin:/usr/games:/usr/local/games
So when the shell executes ls, it looks through all of those directories, finds /bin/ls and executes that.
$ ls
...
$ /bin/ls
You can always call a program through its full path. That is why, in past classes, if you wanted to run something in the terminal you had to type ./exe: typically the directory that you are working in is not in the PATH variable. The . expands to your current directory, so your shell executes <current_dir>/exe, which is a valid command.
• ^pat^sub takes the last command and substitutes the pattern pat with the substitution sub
What’s a terminal?
A terminal is an application that displays the output from the shell. You can have your default terminal, a quake
based terminal, terminator, the options are endless!
Common Utilities
1. cat concatenates multiple files. It is regularly used to print the contents of a file to the terminal, but the original use was concatenation.
$ cat file.txt
...
$ cat shakespeare.txt shakespeare.txt > two_shakes.txt
2. diff tells you the difference between two files. If nothing is printed, then zero is returned, meaning the files are the same byte for byte. Otherwise, the longest common subsequence difference is printed.
$ cat prog.txt
hello
world
$ cat adele.txt
hello
it’s me
$ diff prog.txt prog.txt
$ diff prog.txt adele.txt
2c2
< world
---
> it’s me
3. grep tells you which lines in a file or standard input match a POSIX pattern.
$ grep it adele.txt
it’s me
4. cd changes the current working directory of the shell; cd - switches back to the previous working directory.
$ cd /usr
$ cd lib/
$ cd -
$ pwd
/usr/
6. man, every system programmer's favorite command, tells you more about all your favorite functions!
Syntactic
Shells have many useful utilities like saving some output to a file using redirection >. This overwrites the file from
the beginning. If you only meant to append to the file, you can use ». Unix also allows file descriptor swapping.
This means that you can take the output going to one file descriptor and make it seem like it’s coming out of
another. The most common one is 2>1 which means take the stderr and make it seem like it is coming out of
standard out. This is important because when you use > and » they only write the standard output of the file.
There are some examples below.
$ ./program > output.txt # To overwrite
$ ./program >> output.txt # To append
$ ./program > output_all.txt 2>&1 # stderr & stdout both to the file
$ ./program > /dev/null 2>&1 # don't care about any output
The pipe operator has a fascinating history. The UNIX philosophy is writing small programs and chaining them together to do new and interesting things. Back in the early days, hard disk space was limited and write times were slow. Doug McIlroy wanted to maintain the philosophy while omitting intermediate files that take up hard drive space, and so the UNIX pipe was born. A pipe takes the stdout of the program on its left and feeds it to the stdin of the program on its right. Consider the command tee. It can be used as a replacement for the redirection operators, because tee will both write to a file and output to standard out. It also has the added benefit that it doesn't need to be the last command in the list, meaning you can write out an intermediate result and continue piping.
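For example, a hypothetical pipeline that records an intermediate result mid-pipe:

$ ls | tee files.txt | wc -l

Here tee writes the directory listing to files.txt while also passing it along to wc.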
The && and || operators execute commands conditionally in sequence: && only executes the next command if the previous command succeeded, and || only executes the next command if the previous command failed.
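For example, with hypothetical commands:

$ gcc program.c && ./a.out # run a.out only if compilation succeeded
$ gcc program.c || echo "build failed" # print the message only if compilation failed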
Stack Smashing
Each thread uses its own stack memory. The stack 'grows downwards': if a function calls another function, the stack is extended to smaller memory addresses. Stack memory includes non-static automatic (temporary) variables, parameter values, and the return address. If a buffer is too small for the data written into it (e.g. input values from the user), then there is a real possibility that other stack variables, and even the return address, will be overwritten. The precise layout of the stack's contents and the order of the automatic variables are architecture- and compiler-dependent.
With a little investigative work, we can learn how to deliberately smash the stack for a particular architecture.
The example below demonstrates how the return address is stored on the stack. For one particular 32-bit architecture (a live Linux machine), we determine that the return address is stored at an address two pointers (8 bytes) above the address of the automatic variable. The code deliberately changes the stack value so that when the input function returns, rather than continuing on inside the main method, it jumps to the breakout function instead.
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

void breakout() {
    puts("Welcome. Have a shell...");
    system("/bin/sh");
}

void input() {
    void *p;
    printf("Address of stack variable: %p\n", (void *) &p);
    // On this particular 32-bit machine, the return address happens to
    // sit two pointers (8 bytes) above the automatic variable p.
    printf("Something that looks like a return address on stack: %p\n", *((&p) + 2));
    // Let's change it to point to the start of our sneaky function.
    *((&p) + 2) = (void *) breakout;
}

int main() {
    printf("main() code starts at %p\n", (void *) main);
    input();
    while (1) {
        puts("Hello");
        sleep(1);
    }
    return 0;
}
Modern systems deploy many defenses against this kind of attack, including stack canaries, non-executable stacks, and address space layout randomization (ASLR). To reproduce the demonstration above on a modern machine, you would typically have to compile with protections disabled (e.g. gcc -fno-stack-protector) and turn off ASLR.
Assorted Man Pages
Malloc
Copyright (c) 1993 by Thomas Koenig ([email protected])
Permission is granted to make and distribute verbatim copies of this
manual provided the copyright notice and this permission notice are
preserved on all copies.
Since the Linux kernel and libraries are constantly changing, this
manual page may be incorrect or out-of-date. The author(s) assume no
responsibility for errors or omissions, or for damages resulting from
the use of the information contained herein. The author(s) may not
have taken the same level of care in the production of this manual,
which is licensed free of charge, as they might when working
professionally.
NAME
malloc, free, calloc, realloc - allocate and free dynamic memory
SYNOPSIS
#include <stdlib.h>
void *malloc(size_t size);
void free(void *ptr);
void *calloc(size_t nmemb, size_t size);
void *realloc(void *ptr, size_t size);
void *reallocarray(void *ptr, size_t nmemb, size_t size);
Feature Test Macro Requirements for glibc (see feature_test_macros(7)):
reallocarray():
_GNU_SOURCE
DESCRIPTION
The malloc() function allocates size bytes and returns a
pointer to the allocated memory. The memory is not initialized.
If size is 0, then malloc() returns either NULL, or
a unique pointer value that can later be successfully passed
to free().
RETURN VALUE
The malloc() and calloc() functions return a pointer to the
allocated memory, which is suitably aligned for any built-in
type. On error, these functions return NULL. NULL may also be
returned by a successful call to malloc() with a size of zero,
or by a successful call to calloc() with nmemb or size equal
to zero.
ERRORS
calloc(), malloc(), realloc(), and reallocarray() can fail with
the following error:
ENOMEM Out of memory.
ATTRIBUTES
For an explanation of the terms used in this section, see
attributes(7).
+---------------------+---------------+---------+
| Interface           | Attribute     | Value   |
+---------------------+---------------+---------+
| malloc(), free(),   | Thread safety | MT-Safe |
| calloc(), realloc() |               |         |
+---------------------+---------------+---------+
CONFORMING TO
malloc(), free(), calloc(), realloc(): POSIX.1-2001,
POSIX.1-2008, C89, C99.
NOTES
By default, Linux follows an optimistic memory allocation
strategy. This means that when malloc() returns non-NULL there
is no guarantee that the memory is available. In case it
turns out that the system is out of memory, one or more
processes will be killed by the OOM killer. For more
information, see the description of /proc/sys/vm/overcommit_memory
and /proc/sys/vm/oom_adj in proc(5), and the Linux kernel source
file Documentation/vm/overcommit-accounting.
SEE ALSO
valgrind(1), brk(2), mmap(2), alloca(3), malloc_get_state(3),
malloc_info(3), malloc_trim(3), malloc_usable_size(3),
mallopt(3), mcheck(3), mtrace(3), posix_memalign(3)
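As a minimal usage sketch of the interface described above (with the error check the man page's RETURN VALUE section calls for):
#include <stdio.h>
#include <stdlib.h>

int main(void) {
    // calloc allocates an array of 10 ints and zeroes the memory.
    int *arr = calloc(10, sizeof *arr);
    if (arr == NULL) { // Always check for allocation failure.
        perror("calloc");
        return 1;
    }
    arr[0] = 42;
    printf("%d\n", arr[0]);
    free(arr); // Every allocation is matched by exactly one free.
    return 0;
}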
System Programming Jokes
0x43 0x61 0x74 0xe0 0xf9 0xbf 0x5f 0xff 0x7f 0x00
Warning: Authors are not responsible for any neuro-apoptosis caused by these “jokes.” - Groaners are allowed.
Groaners
Why did the baby system programmer like their new colorful blankie? It was multithreaded.
Why are your programs so fine and soft? I only use 400-thread-count or higher programs.
Where do bad student shell processes go when they die? Forking Hell.
Why are C programmers so messy? They store everything in one big heap.
This chapter is meant to serve as a big "why are we learning all of this?". In all of your previous classes, you
were learning what to do: how to program a data structure, how to code a for loop, how to prove something.
This is the first class that is largely focused on what not to do. As a result, we draw on the experience of past
programmers in real ways. Sit back and scroll through this chapter as we tell you about their problems. Even
if you are dealing with something much higher level like web development, everything relates back to the system.
Shell Shock
Required: Appendix/Shell
This was a back door into most systems running bash. The bug allowed an attacker to craft an environment
variable that, when imported by bash, would execute arbitrary code.
This meant that on any system that passes unsanitized environment variables to the shell (hint: almost no one
sanitized environment variable input, because it was seen as safe), an attacker could execute whatever code they
wanted on other people's machines, including setting up a web server.
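The widely circulated one-liner for checking whether a particular bash binary was vulnerable looked like the following; a patched bash prints only the test message, while a vulnerable one also executes the smuggled command.
$ env x='() { :;}; echo vulnerable' bash -c "echo this is a test"
vulnerable
this is a test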
Lessons Learned: On production machines, make sure that there is a minimal operating system (something
like BusyBox with dietlibc) so that you can understand most of the code running on the system.
Put in multiple layers of abstraction and checks to make sure that data isn't leaked. For example, the exploit above is only
a problem insofar as the compromised machine can get information back to the attackers. This
means that you can harden your machine by disallowing connections on all but a few ports. Also, you can
harden your system to never perform exec calls to carry out tasks (i.e. an exec call to update a value) and
instead do it in C or your favorite programming language of choice. Although you lose flexibility, you gain
peace of mind about what you allow users to do.
Heartbleed
Required: Intro to C
To put it simply, there was no bounds checking on a buffer. The SSL heartbeat is super simple: one side sends
a string along with its length, and the other side is supposed to send the same string back. The problem
is that someone can maliciously claim a length larger than what they actually sent (e.g. send "cat" but request
500 bytes) and get crucial information like passwords back out of the server's memory. There is a Relevant XKCD on it.
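Here is a highly simplified sketch of the bug pattern; this is not OpenSSL's actual code, and the names are illustrative.
#include <stdlib.h>
#include <string.h>

// The peer controls both payload and claimed_len. A correct server would
// verify claimed_len against the number of bytes actually received.
char *build_heartbeat_reply(const char *payload, size_t claimed_len) {
    char *reply = malloc(claimed_len);
    if (reply == NULL)
        return NULL;
    // BUG: if the peer sent 3 bytes but claimed 500, this copies 497
    // bytes of whatever happens to live next to the payload in memory.
    memcpy(reply, payload, claimed_len);
    return reply;
}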
Lessons Learned: Check your buffers! Know the difference between a buffer and a string.
Dirty Cow
Meltdown
Spectre
Mars Pathfinder
Mars Again
Year 2038
What happens if $0 or the first parameter passed into a script doesn't exist? A path built from it can silently
resolve to the root directory, and the script deletes your entire computer.
Lessons Learned: Do parameter checks, always always always set -e on a script, and if you expect a command
to fail, explicitly allow it to fail. You can also alias rm to mv and then empty the trash later. A sketch of the
failure mode and a hardened variant follows.
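The sketch below is hypothetical, not the actual script that shipped; the variable names are illustrative.
#!/bin/sh
# Buggy: if the caller passes no argument, INSTALL_DIR is empty and the
# rm below expands to rm -rf "/"* -- goodbye, entire computer.
INSTALL_DIR="$1"
rm -rf "$INSTALL_DIR/"*

#!/bin/sh
# Hardened: abort on errors and unset variables, and refuse an empty path.
set -eu
INSTALL_DIR="${1:?usage: cleanup.sh <install-dir>}"
rm -rf "${INSTALL_DIR:?}/"*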