
Chapter 1

Secondary storage
Dr. Ibaa

Index

1) Introduction
2) File system features
3) Implementing File Systems
4) Physical storage
1. Introduction
The role of the file system is to provide a high-level abstraction of the hardware disk,
making it appear to be a large number of disk-like objects called files. Like a disk, a file
is capable of storing a large amount of data cheaply, reliably, and persistently.
The main goal of files is to store persistent data, which is defined as data whose lifetime
is longer than that of the process that created it; when the process terminates,
persistent data remains, but volatile data (e.g. variables inside the process) disappears.

One important question may arise:

Given that virtual memory and file systems use the same device (the hard disk drive)
to do almost the same thing (store data objects), why are they so different?
Couldn't there be a system where we have only virtual memory, and inside this
virtual memory we can designate objects as "volatile" or "persistent"?

The operating system would then just free the memory dedicated to volatile objects when
the process using them terminates; persistent objects would simply be left in place.
The advantage of such a system would be tremendous: a programmer would never have
to encode/decode data objects from/into files anymore. Think about the time it takes to
load an image, a map, or even a text document; most of this time is spent converting
between the file data and internal data structures.
Such a system, sometimes called “single-level store”, is a great idea. So why is it not
common in current operating systems? In other words, why are virtual memory and files
presented as very different kinds of objects?
The main reason is the limitation of the address space. The 32-bit address space that was
common among CPUs covers at most 4 GB (and often only 2 GB is usable by a process),
which is too small to hold an entire file system.
The newer 64-bit CPUs could be used to resolve this problem, but it is difficult now to
change the habits of users and programmers acquired over the last 30 years. Hence, file
systems are here to stay for a long time, and we had better understand their
implementation.
2. File system features
2.1. Naming
Every file system provides some way to give a name to each file. We will consider only
names for individual files here, and discuss directories later. File names should be usable
by human end users, and there are several restrictions on these names:

Size. Some systems put severe restrictions on the length of names. For example, DOS
restricts names to 11 characters, while most versions of UNIX restrict names to 255
characters.

Case. Are upper and lower case letters considered different? The Unix tradition is to
consider the names Foo and foo to be completely different and unrelated names. In DOS
and its descendants, however, they are considered the same.

Character Set. While file names used to be written using the Latin alphabet, there is a
move to support the Unicode character set. Windows NT and its derivatives (Windows
2000, Windows XP, etc.) are an excellent example of this.

Format. It is common to divide a file name into a base name and an extension that
indicates the type of the file. DOS requires that each name be composed of a base name
of eight or fewer characters and an extension of three or fewer characters. When the
name is displayed, it is represented as base.extension. Unix internally makes no such
distinction, but it is a common convention to include exactly one period in a file name
(e.g. foo.c for a C source file).

2.2. File system Access Modes


Systems support various access modes for operations on a file.
• Sequential: Read or write the next record or the next n bytes of the file. Usually,
sequential access also allows a rewind operation.
• Random: Read or write the nth record, or bytes i through j. Unix provides an
equivalent facility by adding a seek operation to the sequential operations listed
above.
• Indexed: Read or write the record with a given key. In some cases, the "key"
need not be unique: there can be more than one record with the same key. In
this case, programs use a combination of indexed and sequential operations: get
the first record with a given key, then get other records with the same key by
doing sequential reads.
File attributes are the area where there is the most variation among file systems.
Attributes can be grouped by general category:

Name. The file's name (see Naming above).

Ownership and Protection. Owner, owner's "group," creator, access-control list.

Time stamps. Time created, time last modified, time last accessed, time the attributes
were last changed, etc.

Sizes. Current size, size limit.

Type Information. File is ASCII, is executable, is a "system" file, is an Excel
spreadsheet, etc.

Misc. Some systems have attributes describing how the file should be displayed when a
directory is listed. For example, MacOS records an icon to represent the file and the
screen coordinates where it was last displayed. DOS has a "hidden" attribute meaning
that the file is not normally shown.

2.3. File system Operations


POSIX, a standard API (Application Programming Interface) based on Unix, provides
the following operations (among others) for manipulating files:

fd = open(name, operation)
fd = creat(name, mode)
status = close(fd)
byte_count = read(fd, buffer, byte_count)
byte_count = write(fd, buffer, byte_count)
offset = lseek(fd, offset, whence)
status = link(oldname, newname)
status = unlink(name)
status = stat(name, buffer)
status = fstat(fd, buffer)
status = utimes(name, times)
status = chown(name, owner, group) or fchown(fd, owner, group)
status = chmod(name, mode) or fchmod(fd, mode)
status = truncate(name, size) or ftruncate(fd, size)

Some types of arguments and results need explanation.

status. Many functions return a "status" which is either 0 for success or -1 for errors
(there is another mechanism to get more information about what went wrong). Other
functions also use -1 as a return value to indicate an error.

name. A character-string name for a file.

fd. A "file descriptor", which is a small non-negative integer used as a short, temporary
name for a file during the lifetime of a process.

buffer. The memory address of the start of a buffer for supplying or receiving data.

whence. One of three codes, signifying from start, from end, or from current location.

mode. A bit-mask specifying protection information.

operation. An integer code: one of read, write, read and write, and perhaps a few other
possibilities such as append only.

fd = open(name, operation)

The open call finds a file and assigns a descriptor to it. It also indicates how the file will
be used by this process (read only, read/write, etc).

fd = creat(name, mode)

The creat call is similar, but creates a new (empty) file. The mode argument specifies
protection attributes (such as “writable by owner but read-only by others”) for the new
file. (Most modern versions of Unix have merged creat into open by adding an optional
mode argument and allowing the operation argument to specify that the file is
automatically created if it doesn't already exist.)
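As an illustration of that merged interface, here is a minimal C sketch (the file name,
flags, and helper name are just example choices, not prescribed by the text):

    #include <fcntl.h>

    int open_or_create(const char *name) {
        /* O_CREAT asks open to create the file if it does not already exist;
           0644 means "writable by owner, read-only for group and others". */
        int fd = open(name, O_WRONLY | O_CREAT, 0644);
        return fd;   /* -1 on error, otherwise a small non-negative integer */
    }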

status = close(fd)

The close call simply announces that fd is no longer in use and can be reused for
another open or creat.

read(fd, buffer, byte_count)


write(fd, buffer, byte_count)

The read and write operations transfer data between a file and memory. The starting
location in memory is indicated by the buffer parameter; the starting location in the file
(called the seek pointer) is wherever the last read or write left off. The result is the
number of bytes transferred.

byte_count = read(fd, buffer, byte_count)


byte_count = write(fd, buffer, byte_count)

For write, the result is normally the same as the byte_count parameter unless there is an
error. For read, it may be smaller if the seek pointer starts out near the end of the file.
The lseek operation adjusts the seek pointer (it is also automatically updated by read and
write).

offset = lseek(fd, offset, whence)

The specified offset is added to zero, the current seek pointer, or the current size of
the file, depending on the value of whence.
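For instance, a minimal sketch using the "from end" whence code to read the tail of an
open file (the helper read_tail is our own illustration, not a POSIX call):

    #include <sys/types.h>
    #include <unistd.h>

    /* Read the last n bytes of an open file: position the seek pointer
       relative to the end of the file, then read sequentially from there. */
    ssize_t read_tail(int fd, char *buffer, off_t n) {
        if (lseek(fd, -n, SEEK_END) == (off_t)-1)   /* whence = "from end" */
            return -1;
        return read(fd, buffer, (size_t)n);
    }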

status = link(oldname, newname)


status = unlink(name)

The function link adds a new name (alias) to a file, while unlink removes a name.
There is no function to delete a file; the system automatically deletes it when there are
no remaining names for it.
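A small sketch combining the two calls to rename a file by hand (the helper name and
file names are illustrative only):

    #include <unistd.h>

    /* Give the file a second name, then drop the original one. The data
       survives under the new name; the system only deletes the file when
       its last remaining name is removed. */
    int rename_by_linking(const char *oldname, const char *newname) {
        if (link(oldname, newname) == -1)   /* add an alias */
            return -1;
        return unlink(oldname);             /* remove the original name */
    }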

status = stat(name, buffer)


status = fstat(fd, buffer)
status = utimes(name, times)
status = chown(name, owner, group) or fchown(fd, owner, group)
status = chmod(name, mode) or fchmod(fd, mode)
status = truncate(name, size) or ftruncate(fd, size)

The stat function retrieves meta-data about the file and puts it into a buffer (in a fixed,
documented format), while the remaining functions can be used to update the meta-data:
• utimes: updates time stamps
• chown: updates ownership
• chmod: updates protection information
• truncate: changes the size (files can be made bigger by write, but only truncate
can make them smaller)
Most of these calls come in two flavors:
1) one that takes a file name
2) one that takes a descriptor for an open file.
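As a brief sketch of the name-based flavor of stat (the printed fields are just a small
subset of the meta-data, and the helper name is ours):

    #include <stdio.h>
    #include <sys/stat.h>

    /* Fill a struct stat from the file's meta-data and print two fields. */
    int show_metadata(const char *name) {
        struct stat sb;
        if (stat(name, &sb) == -1)
            return -1;
        printf("size: %lld bytes, mode: %o\n",
               (long long)sb.st_size, (unsigned)(sb.st_mode & 07777));
        return 0;
    }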

3. Implementing File Systems


From the OS's standpoint, a file consists of a bunch of blocks stored on the device. The
programmer may actually see a different interface (bytes or records), but this does not
matter to the file system: it just packs bytes into blocks and unpacks them again on
reading.

3.1. File allocation


We will assume that all the blocks of the disk are given block numbers starting at zero
and running through consecutive integers up to some maximum.
We will further assume that blocks with numbers that are near each other are located
physically near each other on the disk (e.g. the same cylinder), so that the arithmetic
difference between the numbers of two blocks gives a good estimate of how long it takes
to get from one to the other. First let's consider how to represent an individual file.
There are (at least!) four possibilities:
1) Contiguous
2) Linked List
3) Disk Index
4) File Index

 Contiguous:
The blocks of a file are the blocks numbered n, n+1, n+2, ..., m. We can represent any
file with a pair of numbers: the block number of the first block and the length of the file
(in blocks). The advantages of this approach are:
• It is simple.
• The blocks of the file are all physically near each other on the disk and in order,
so that a sequential scan through the file will be fast.
The drawback is excessive fragmentation, which will preclude large files.

 Linked List:
In the file descriptor, we just keep a pointer to the first block. In each block of the file
we keep a pointer to the next block. We can also keep a linked list of free blocks for the
free list.

• Advantages: files can be extended, and there are no fragmentation problems.
Sequential access is easy: just chase links.

• Drawbacks: random access is virtually impossible, and there is lots of seeking,
even in sequential access.

 Disk Index
The idea here is to keep the linked-list representation, but take the link fields out of the
blocks and gather them all together in one place. At some fixed place on disk, allocate
an array I with one element for each block on the disk, and move the link field from
block n to I[n].
(Recall that a single disk access takes as long as tens or even hundreds of thousands
of instructions.)

The whole array of links, called a file allocation table (FAT), is now small enough that it
can be read into main memory when the system starts up. Accessing the 100th block of a
file still requires walking through 99 links of a linked list, but now the entire list is in
memory, so the time to traverse it is negligible.
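A rough sketch of how such an in-memory FAT might be walked (the array name, the
end-of-chain marker, and the helper are assumptions for illustration):

    /* Follow the chain of links for one file, starting at its first block.
       fat[] is the in-memory copy of the link array; FAT_EOF marks the last
       block of a file. Following 99 links reaches the 100th block, with no
       disk I/O along the way. */
    #define FAT_EOF (-1)

    int nth_block(const int fat[], int first_block, int n) {
        int block = first_block;
        while (n-- > 0 && block != FAT_EOF)
            block = fat[block];     /* one link per block */
        return block;               /* FAT_EOF if the file is shorter than n+1 blocks */
    }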

The main problem with this approach is that the index array I can get quite large with
modern disks.

The inode structure introduced by Unix groups together index information about each
file individually.
The basic idea is to represent each file as a tree of blocks, with the data blocks as
leaves. Each internal block (called an indirect block in Unix jargon) is an array of block
numbers, listing its children in order. If a disk block is 2K bytes and a block number is
four bytes, 512 block numbers fit in a block, so a one-level tree (a single root node
pointing directly to the leaves) can accommodate files up to 512 blocks, or one
megabyte in size.

If the root node is cached in memory, the “address” (block number) of any block of the
file can be found without any disk accesses. A two-level tree, with 513 total indirect
blocks, can handle files 512 times as large (up to one-half gigabyte).
The only problem with this idea is that it wastes space for small files. Any file with more
than one block needs at least one indirect block to store its block numbers. A 4K file
would require three 2K blocks, wasting up to one third of its space. Since many files are
quite small, this is a serious problem. The Unix solution is to use a different kind of
"block" for the root of the tree.
 File Index
An index node (or inode for short) contains almost all the meta-data about a file listed
above: ownership, permissions, time stamps, etc. (but not the file name). Inodes are
small enough that several of them can be packed into one disk block. In addition to the
meta-data, an inode contains the block numbers of the first few blocks of the file. What
if the file is too big to fit all its block numbers into the inode?

The earliest version of Unix had a bit in the meta-data to indicate whether the file
was “small” or “big.” For a big file, the inode contained the block numbers of
indirect blocks rather than data blocks.
More recent versions of Unix contain pointers to indirect blocks in addition to the
pointers to the first few data blocks.

The inode contains pointers to (i.e., block numbers of) the first few blocks of the file, a
pointer to an indirect block containing pointers to the next several blocks of the file, a
pointer to a doubly indirect block, which is the root of a two-level tree whose leaves are
the next blocks of the file, and a pointer to a triply indirect block. A large file is thus a
lop-sided tree.
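A sketch of how a logical block number might be classified under such an inode. The
constants and field names below are assumptions chosen to match the 2K blocks and
four-byte block numbers used earlier (real systems use different values), and the reads
of the indirect blocks themselves are omitted:

    /* Hypothetical inode layout: a few direct pointers, then single, double
       and triple indirect blocks. With 2K blocks and 4-byte block numbers,
       each indirect block holds 512 pointers. */
    #define NDIRECT   10
    #define NINDIRECT 512            /* 2048 / 4 */

    struct inode {
        int direct[NDIRECT];         /* first few blocks of the file */
        int single_indirect;         /* block holding 512 pointers */
        int double_indirect;         /* root of a two-level tree */
        int triple_indirect;         /* root of a three-level tree */
    };

    /* How many levels of indirect blocks must be read for logical block b? */
    int levels_needed(int b) {
        if (b < NDIRECT) return 0;                                     /* direct */
        if (b < NDIRECT + NINDIRECT) return 1;                         /* single */
        if (b < NDIRECT + NINDIRECT + NINDIRECT * NINDIRECT) return 2; /* double */
        return 3;                                                      /* triple */
    }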

The arrangement of disk blocks in Unix is as shown in the figure below.

BB SB IL DB
BB: Boot Block IL: Inode List
SB: Super Block DB: Data Blocks

The superblock (the first block after the boot block) gives information regarding the
tuneable parameters of the filesystem: the number of inodes, the number of data blocks,
the size of the data blocks, and so on. It may also include information such as a volume
name to identify the partition.
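A hedged sketch of what such a superblock might contain (the field names are purely
illustrative and do not correspond to any particular filesystem's layout):

    /* Illustrative only: the tuneable parameters mentioned above. */
    struct superblock {
        unsigned int inode_count;     /* number of inodes */
        unsigned int block_count;     /* number of data blocks */
        unsigned int block_size;      /* size of a data block, in bytes */
        char         volume_name[16]; /* optional label identifying the partition */
    };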

3.2. Space Management


Block Size and Extents
All of the file organizations mentioned store the contents of a file in a set of disk blocks.

How big should a block be?

The problem with small blocks is I/O overhead. There is a certain overhead to read or
write a block beyond the time to actually transfer the bytes. If we double the block size,
a typical file will have half as many blocks. Reading or writing the whole file will transfer
the same amount of data, but it will involve half as many disk I/O operations. The
overhead for an I/O operation includes a variable amount of latency (seek time and
rotational delay) that depends on how close the blocks are to each other, as well as a
fixed overhead to start each operation and respond to the interrupt when it completes.
Increasing the block size means less I/O overhead and better performance; decreasing
the block size means more I/O overhead but less internal fragmentation.

Increasing the block size would definitely provide better performance because there will
be fewer disk I/O operations. On the other hand, if blocks are too big, this will result in
internal fragmentation.
A file can only grow in increments of whole blocks. If the sizes of files are random, we
would expect on the average that half of the last block of a file is wasted. If most files
are many blocks long, the relative amount of waste is small, but if the block size is large
compared to the size of a typical file, half a block per file is significant. In fact, if files
are very small (compared to the block size), the problem is even worse. If, for example,
we choose a block size of 8k and the average file is only 1K bytes long, we would be
wasting about 7/8 of the disk.
Most files in a typical Unix system are very small; it was found that simply rounding the
size of each file up to a multiple of 512 bytes resulted in wasting 4.2% of the space.
Including overhead for inodes and indirect blocks, the original 512-byte file system had
a total space overhead of 6.9%.
Changing to 1K blocks raised the overhead to 11.8%. With 2K blocks, the overhead
would be 22.4%, and with 4K blocks it would be 45.6%.
Would 4k blocks be worthwhile?

The answer depends on economics.

In those days disks were very expensive, and wasting half the disk seemed extreme.
These days, disks are cheap, and for many applications people would be happy to pay
twice as much per byte of disk space to get a disk that was twice as fast. As disks get
cheaper and CPUs get faster, wasted space is less of a problem and the speed mismatch
between the CPU and the disk gets worse. Thus the trend is towards larger and larger
disk blocks; 8K blocks are common today.

3.3. Reliability
Disks fail, disk sectors get corrupted, and systems crash, losing the contents of volatile
memory. There are several techniques that can be used to mitigate the effects of these
failures. We only have room for a brief survey.

 Bad-block Forwarding
When the disk drive writes a block of data, it also writes a checksum: a small number of
additional bits whose value is some function of the "user data" in the block (each block
on the disk thus holds the user data followed by its checksum).
When the block is read back in, the checksum is also read and compared with the data.
If either the data or the checksum were corrupted, it is extremely unlikely that the
checksum comparison will succeed. Thus the disk drive itself has a way of discovering
bad blocks with extremely high probability.
The hardware is also responsible for recovering from bad blocks. Modern disk drives do
automatic bad-block forwarding.
The disk drive or controller is responsible for mapping block numbers to absolute
locations on the disk (cylinder, track, and sector). It holds a little bit of space in reserve,
not mapping any block numbers to this space. When a bad block is discovered, the disk
allocates one of these reserved blocks and maps the block number of the bad block to
the replacement block. All references to this block number access the replacement block
instead of the bad block.
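A rough sketch of the remapping the drive or controller might perform (the table, its
size, and the function name are assumptions; real drives do this in firmware):

    /* Illustrative forwarding table kept by the drive/controller. */
    #define MAX_FORWARDED 64

    struct forward_entry { int bad_block; int replacement; };
    static struct forward_entry forward_table[MAX_FORWARDED];
    static int forwarded_count;

    /* Translate a block number, substituting the spare if it was forwarded. */
    int resolve_block(int block) {
        for (int i = 0; i < forwarded_count; i++)
            if (forward_table[i].bad_block == block)
                return forward_table[i].replacement;
        return block;   /* not forwarded: use the original location */
    }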
There are two problems with this scheme:
 First:
When a block goes bad, the data in it is lost. In practice, blocks tend to be bad from the
beginning, because of small defects in the surface coating of the disk platters. There is
usually a stand-alone formatting program that tests all the blocks on the disk and sets up
forwarding entries for those that fail. Thus the bad blocks never get used in the first
place.
The main reason for the forwarding is that it is just too hard (expensive) to create a disk
with no defects. It is much more economical to manufacture a "pretty good" disk and
then use bad-block forwarding to work around the few bad blocks.

 Second:
Forwarding interferes with the OS's attempts to lay out files optimally. The OS may think
it is doing a good job by assigning consecutive blocks of a file to consecutive block
numbers, but if one of those blocks is forwarded, it may be very far away from the
others.
In practice, this is not much of a problem, since a disk typically has only a handful of
forwarded sectors out of millions.
The software can also help avoid bad blocks by simply leaving them out of the free list
(or marking them as allocated in the allocation bitmap).

 Consistency Checking
Some of the information in a file system is redundant. For example, the free list could
be reconstructed by checking which blocks are not in any file. Redundancy arises
because the same information is represented in different forms to make different
operations faster.

If you want to know which blocks are in a given file, look at the inode. If you want
to know which blocks are not in any inode, use the free list.

Unfortunately, various hardware and software errors can cause the data to become
inconsistent. File systems often include a utility that checks for consistency and
optionally attempts to repair inconsistencies. These programs are particularly handy for
cleaning up the disks after a crash.
Unix has a utility called fsck. It has two principal tasks:
 First, it checks that blocks are properly allocated. Each inode is supposed to be the
root of a tree of blocks, the free list is supposed to be a tree of blocks, and each block is
supposed to appear in exactly one of these trees. Fsck runs through all the inodes,
checking each allocated inode for reasonable values and walking through the tree of
blocks rooted at the inode. It maintains a bit vector to record which blocks have been
encountered.
If a block is encountered that has already been seen, there is a problem:
 Either it occurred twice in the same file (in which case it isn't a tree),
 Or it occurred in two different files.

A reasonable recovery would be to allocate a new block, copy the contents of the
problem block into it, and substitute the copy for the problem block in one of the two
places where it occurs.

It would also be a good idea to log an error message so that a human being can
check up later to see what's wrong.
After all the files are scanned, any block that hasn't been found should be on the free
list.

It would be possible to scan the free list in a similar manner, but it's probably
easier just to rebuild the free list from the set of blocks that were not found in
any file. If a bitmap instead of a free list is used, this step is even easier: Simply
overwrite the file system's bitmap with the bitmap constructed during the scan.
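A simplified sketch of this first pass (the bit vector and the helper for reporting
duplicates are hypothetical names, and the inode/tree walk itself is omitted):

    #include <stdbool.h>

    /* One bit per disk block: has the block already been seen in some tree? */
    extern bool seen[];                      /* initially all false */
    extern void report_duplicate(int block); /* hypothetical: log the problem for a human */

    /* Called for every block number reached while walking a tree of blocks. */
    void mark_block(int block) {
        if (seen[block])
            report_duplicate(block);  /* twice in one file, or shared by two files */
        seen[block] = true;
    }
    /* After every inode has been walked, any block with seen[] still false
       belongs on the rebuilt free list (or stays clear in the bitmap). */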

 Second, the other main consistency requirement concerns the directory structure.
The set of directories is supposed to be a tree, and each inode is supposed to have a
link count that indicates how many times it appears in directories.
The tree structure could be checked by a recursive walk through the directories, but it is
more efficient to combine this check with the walk through the inodes that checks for
disk blocks, recording, for each directory inode encountered, the inumber of its parent.
The set of directories is a tree if and only if every directory other than the root has a
unique parent.
This pass can also rebuild the link count for each inode by maintaining in memory an
array with one slot for each inumber. Each time the inumber is found in a directory,
increment the corresponding element of the array. The resulting counts should match
the link counts in the inodes. If not, correct the counts in the inodes.

[Figure: a directory is a table of (name, inumber) pairs; each inumber indexes an entry
in the inode table.]
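A condensed sketch of this link-count pass (all helper names below are hypothetical):

    /* Count how many directory entries reference each inumber, then compare
       against the link counts stored in the inodes. */
    extern int  counted[];                      /* one slot per inumber, initially zero */
    extern int  inode_link_count(int inumber);  /* read the count stored in the inode */
    extern void fix_link_count(int inumber, int correct);

    /* Called once for each (name, inumber) pair found while scanning directories. */
    void saw_directory_entry(int inumber) {
        counted[inumber]++;
    }

    /* The computed counts are the absolute; the counts in the inodes are hints. */
    void verify_link_counts(int max_inumber) {
        for (int i = 1; i <= max_inumber; i++)
            if (counted[i] != inode_link_count(i))
                fix_link_count(i, counted[i]);
    }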

This illustrates a very important principle that pops up throughout operating system
implementation (indeed, throughout any large software system):
the doctrine of hints and absolutes.

Whenever the same fact is recorded in two different ways:
• one of them should be considered the absolute truth,
• the other should be considered a hint.

Hints are handy because they allow some operations to be done much more quickly than
they could be if only the absolute information were available. But if the hint and the
absolute do not agree, the hint can be rebuilt from the absolutes.

In a well-engineered system, there should be some way to verify a hint whenever it is
used. Unix is a bit lax about this. The link count is a hint (the absolute information is a
count of the number of times the inumber appears in directories), but Unix treats it like
an absolute during normal operation. As a result, a small error can snowball into
completely trashing the file system.
For another example of hints, each allocated block could have a header containing the
inumber of the file containing it and its offset in the file. There are systems that do this
(Unix isn't one of them).
The tree of blocks rooted at an inode then becomes a hint, providing an efficient way of
finding a block, but when the block is found, its header could be checked. Any
inconsistency would then be caught immediately, and the inode structures could be
rebuilt from the information in the block headers.
By the way, if the link count calculated by the scan is zero (i.e., the inode, although
marked as allocated, does not appear in any directory), it would not be prudent to
delete the file. A better recovery is to add an entry to a special lost+found directory
pointing to the orphan inode, in case it contains something really valuable.
 Performance
The main trick to improve file system performance (like anything else in computer
science) is caching. The system keeps a disk cache (sometimes also called a buffer pool)
of recently used disk blocks.
In contrast with the page frames of virtual memory¹, where there were all sorts of
algorithms proposed for managing the cache, management of the disk cache is pretty
simple: on the whole, it is simply managed LRU (least recently used).

Why is it that for paging we went to great lengths trying to come up with an
algorithm that is “almost as good as LRU” while here we can simply use true LRU?

The problem with implementing LRU is that some information has to be updated on
every single reference. In the case of paging, references can be as frequent as every
instruction, so we have to make do with whatever information the hardware is willing to
give us. The best we can hope for is that the paging hardware will set a bit in a
page-table entry. In the case of file system disk blocks, however, each reference is the
result of a system call, and adding a few extra instructions to a system call for cache
maintenance is not unreasonable.

Adding caching to the file system implementation is actually quite simple.
Somewhere in the implementation, there is probably a procedure that gets called when
the system wants to access a disk block. Let's suppose the procedure simply allocates
some memory space to hold the block and reads it into memory.

Block readBlock(int blockNumber) {
    Block result = new Block();
    Disk.read(blockNumber, result);
    return result;
}

¹ http://pages.cs.wisc.edu/~solomon/cs537-old/last/paging.html#page_replacement
To add caching, all we have to do is modify this code to search the disk cache first:

class CacheEntry {
    int blockNumber;
    Block buffer;
    CacheEntry next, previous;
}

class DiskCache {
    CacheEntry head, tail;

    CacheEntry find(int blockNumber) {
        // Search the list for an entry with a matching block number.
        // If not found, return null.
    }

    void moveToFront(CacheEntry entry) {
        // Move entry to the head of the list.
    }

    CacheEntry oldest() {
        return tail;
    }

    Block readBlock(int blockNumber) {
        CacheEntry entry = find(blockNumber);
        if (entry == null) {
            entry = oldest();
            Disk.read(blockNumber, entry.buffer);
            entry.blockNumber = blockNumber;
        }
        moveToFront(entry);
        return entry.buffer;
    }
}

This code is not quite right, because it ignores writes. If the oldest buffer is dirty (it has
been modified since it was read from disk), it first has to be written back to the disk
before it can be used to hold the new block. Most systems actually write dirty buffers
back to the disk sooner than necessary, to minimize the damage caused by a crash. The
original version of Unix had a background process that would write all dirty buffers to
disk every 30 seconds.
Some information is more critical than others. Some versions of Unix, for example, write
back directory blocks (the data blocks of files of type directory) each time they are
modified. This technique, keeping the block in the cache but writing its contents back to
disk after any modification, is called write-through caching.

LRU management automatically does the "right thing" for most disk blocks.
• If someone is actively manipulating the files in a directory, all of the directory's
blocks will probably be in the cache.
• If a process is scanning a large file, all of its indirect blocks will probably be in
memory most of the time.
But there is one important case where LRU is not the right policy. Consider a process
that is traversing (reading or writing) a file sequentially from beginning to end. Once
that process has read or written the last byte of a block, it will not touch that block
again. The system might as well immediately move the block to the tail of the list as
soon as the read or write request completes.
Tanenbaum calls this technique free behind. It is also sometimes called Most Recently
Used (MRU), to contrast it with LRU.
How does the system know to handle certain blocks MRU?
There are several possibilities.
• If the operating system interface distinguishes between random-access files and
sequential files, it is easy: data blocks of sequential files should be managed MRU.
• In some systems, all files are alike, but there is a different kind of open call, or a
flag passed to open, that indicates whether the file will be accessed randomly or
sequentially.
• Even if the OS gets no explicit information from the application program, it can
watch the pattern of reads and writes. If recent history indicates that all (or most)
reads or writes of the file have been sequential, the data blocks should be
managed MRU.

A similar trick is called read-ahead. If a file is being read sequentially, it is a good idea
to read a few blocks at a time. This cuts down on the latency for the application (most
of the time the data the application wants is in memory before it even asks for it).
If the disk hardware allows multiple blocks to be read at a time, it can cut the number of
disk read requests, cutting down on overhead such as the time to service an I/O
completion interrupt. If the system has done a good job of clustering together the blocks
of the file, read-ahead also takes better advantage of the clustering. If the system reads
one block at a time, another process, accessing a different file, could make the disk
head move away from the area containing the blocks of this file between accesses.
The Berkeley² file system introduced another trick to improve file system performance.
They divided the disk into chunks, which they called cylinder groups (CGs), because each
one is comprised of some number of adjacent cylinders. Each CG is like a miniature
disk: it has its own super block and array of inodes. The system attempts to put all the
blocks of a file in the same CG as its inode. It also tries to keep all the inodes in one
directory together in the same CG, so that operations like:
ls -l *
will be fast. It uses a variety of techniques to assign inodes and blocks to CGs in such a
way as to distribute the free space fairly evenly between them, so there will be enough
room to do this clustering.

² http://pages.cs.wisc.edu/~solomon/cs537-old/last/filesys1.html#block-size
In particular:

• When a new file is created: its inode is placed in the same CG as its parent
directory (if possible). But when a new directory is created, its inode is placed in
the CG with the largest amount of free space (so that the files in the directory will
be able to be near each other).

• When blocks are added to a file: they are allocated (if possible) from the same CG
that contains its inode. But when the size of the file crosses certain thresholds
(say, every megabyte or so), the system switches to a different CG, one that is
relatively empty. The idea is to prevent a big file from hogging all the space in one
CG and preventing other files in the CG from being well clustered.

4. Physical storage
4.1. Disk Hardware
A (hard) disk drive records data on the surfaces of metal plates called platters, which are
coated with a substance containing ground-up iron or other substances that allow zeros
and ones to be recorded as tiny spots of magnetization. Floppy disks (also called
"diskettes" by those who think the term "floppy" is undignified) are similar, but use a
sheet of plastic rather than metal and permanently enclose it in a paper or plastic
envelope.

I won't say anything more about floppy disks; most of the facts about hard disks are
also true for floppies, only slower. It is customary to use the simple term "disk" to mean
"hard disk drive" and to say "platter" when you mean the disk itself.
When in use, the disk spins rapidly and a read/write head slides along the surface.
Usually, both sides of a platter are used for recording, so there is a head for each
surface. In some more expensive disk drives, there are several platters, all on a
common axle spinning together. The heads are fixed to an arm that can move radially in
towards the axle or out towards the edges of the platters. All of the heads are attached
to the same arm, so they are all at the same distance from the centers of their platters
at any given time.
To read or write a bit of data on the disk, a head has to be right over the spot where
the data is stored. This may require three operations, giving rise to four kinds of delay.
• The correct head (i.e., the correct surface) must be selected. This is done
electronically, so it is very fast (at most a few microseconds).
• The head has to be moved to the correct distance from the center of the disk.
This movement is called seeking and involves physically moving the arm in or out.
Because the arm has mass (inertia), it must be accelerated and decelerated. When
it finally gets where it's going, the disk has to wait a bit for the vibrations caused
by the jerky movement to die out. All in all, seeking can take several milliseconds,
depending on how far the head has to move.
• The disk has to rotate until the correct spot is under the selected head. Since the
disk is constantly spinning, all the drive has to do is wait for the correct spot to
come around.
• Finally, the actual data has to be transferred. On a read operation, the data is
usually transferred to a RAM buffer in the device and then copied, by DMA, to the
computer's main memory. Similarly, on a write, the data is transferred by DMA to
a buffer in the disk, and then copied onto the surface of a platter.

The total time spent getting to the right place on the disk is called latency and is divided
into rotational latency and seek time (although sometimes people use the term "seek
time" to cover both kinds of latency).
The data on a disk is divided up into fixed-sized disk blocks. The hardware only supports
reading or writing a whole block at a time. If a program wants to change one bit (or one
byte) on the disk, it has to read in an entire disk block, change the part of it it wants to
change, and then write it back out.
Each block has a location, sometimes called a disk address, that consists of three
numbers:
• surface
• track
• sector.
The part of the disk swept out by a head while it is not moving is a ring-shaped region
on the surface called a track. The track number indicates how far the data is from the
center of the disk (the axle).

Each track is divided up into some number of sectors. On some disks, the outer tracks
have more sectors than the inner ones because the outer tracks are longer, but all
sectors are the same size. The set of tracks swept out by all the heads while the arm is
not moving is called a cylinder. Thus a seek operation moves to a new cylinder,
positioning each head on one track of the cylinder.
This basic picture of disk organization hasn't changed much in forty years. What has
changed is that disks keep getting smaller and cheaper and the data on the surfaces
gets denser (the spots used to record bits are getting smaller and closer together).
The first disks were several feet in diameter, cost tens of thousands of dollars, and held
tens of thousands of bytes. Currently (2006) a typical disk is 3-1/2 inches or 1 inch in
diameter, costs a few hundred dollars and holds several hundred gigabytes (billions of
bytes) of data.
What hasn't changed much is physical limitations. Early disks spun at 3600 revolutions
per minute (RPM); currently they spin at about 7200 RPM, or 15,000 RPM for high-
performance disks. At 7200 RPM, the rotational latency is at worst 1/7200 minute (8.33
milliseconds) and on the average it is half that (4.17 ms).
The heads and the arm that moves them have gotten much smaller and lighter, allowing
them to be moved more quickly, but the improvement has been modest. Current disks
take anywhere from a millisecond to tens of milliseconds to seek to a particular cylinder.

Just for reference, here are the specs for a popular hard-disk model used in personal
computers:

Capacity                     400 Gbyte
Heads                        16(*)
Cylinders                    16,383(*)
Sector size                  512 bytes
Sectors per track            63(*)
Sectors                      781,422,768
Density                      763,000 BPI; 120,000 TPI; 91,560 Mb/in2
Min seek (1 track)           0.5 ms
Max seek                     10.5 ms
Average seek                 8 ms
Rotational speed             7200 RPM
Average rotational latency   4.16 ms
Max media transfer rate      95 Mbits/sec
Cache                        8 MB
Sustained transfer rate      65 MB/sec
Price                        About $300
4.2. Disk Scheduling
When a process wants to do disk I/O, it makes a call to the operating system. Since the
operation may take some time, the process is put into a blocked state, and the I/O
request is sent to a part of the OS called a device driver. If the disk is idle, the operation
can be started right away, but if the disk is busy servicing another request, the new
request must be added to a queue of requests and wait its turn. Thus the total delay
seen by the process has several components:
• The overhead of getting into and out of the OS, and the time the OS spends
fiddling with queues, etc.
• The queuing time spent waiting for the disk to become available.
• The latency spent waiting for the disk to get to the right track and sector.
• The transfer time spent actually reading or writing the data.

Although I mentioned a "queue" of requests, there is no reason why the requests have to
be satisfied first-come first-served. In fact, that is a very bad way to schedule disk
requests. Since requests from different processes may be scattered all over the disk,
satisfying them in the order they arrive would entail an awful lot of jumping around on
the disk, resulting in excessive rotational latency and seek time, both for individual
requests and for the system as a whole. Fortunately, better algorithms are not hard to
devise.

Shortest Seek Time First (SSTF):

When a disk operation finishes, choose the request that is closest to the current head
position (the one that minimizes rotational latency and seek time). This algorithm
minimizes latency and thus gives the best overall performance, but it suffers from poor
fairness. Requests will get widely varying response depending on how lucky they are in
being close to the current location of the heads. In the worst case, requests can be
starved (delayed arbitrarily long).

The Elevator Algorithm:

The disk head progresses in a single direction (from the center of the disk to the edge,
or vice versa), serving the closest request in that direction. When it runs out of requests
in the direction it is currently moving, it switches to the opposite direction.
This algorithm usually gives more equitable service to all requests, but in the worst case
it can still lead to starvation. While it is satisfying requests on one cylinder, other
requests for the same cylinder could arrive. If enough requests for the same cylinder
keep coming, the heads would stay at that cylinder forever, starving all other requests.
This problem is easily avoided by limiting how long the heads will stay at any one
cylinder. One simple scheme is to serve only the requests for the cylinder that are
already there when the heads get there. New requests for that cylinder that arrive while
existing requests are being served will have to wait for the next pass.
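A compact sketch of how an elevator-style scheduler might pick the next request. The
request structure, field names, and the two-pass loop are our own illustrative choices,
not a prescribed implementation:

    #include <stdbool.h>
    #include <stdlib.h>

    /* Illustrative elevator (SCAN) scheduler: pick the closest pending request
       in the current direction; reverse direction when none remain. */
    struct request { int cylinder; bool pending; };

    /* Returns the index of the next request to serve, or -1 if none is pending.
       *moving_out is the current direction (true = toward the edge). */
    int next_request(struct request reqs[], int n, int head_pos, bool *moving_out) {
        for (int pass = 0; pass < 2; pass++) {     /* current direction, then reverse */
            int best = -1;
            for (int i = 0; i < n; i++) {
                if (!reqs[i].pending) continue;
                int c = reqs[i].cylinder;
                bool ahead = *moving_out ? (c >= head_pos) : (c <= head_pos);
                if (!ahead) continue;
                if (best == -1 ||
                    abs(c - head_pos) < abs(reqs[best].cylinder - head_pos))
                    best = i;
            }
            if (best != -1) return best;
            *moving_out = !*moving_out;            /* out of requests: switch direction */
        }
        return -1;
    }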

One-way Elevator Algorithm:

The simple (two-way) elevator algorithm gives poorer service to requests near the
center and edges of the disk than to requests in between. Suppose it takes time T for a
pass (from the center to the edge or vice versa). A request at either end of a pass (near
the hub or the edge of the disk) may have to wait up to time 2T for the heads to travel
to the other end and back, and on average the delay will be T. A request near the
"middle" (half way between the hub and the edge) will get twice as good service: the
worst-case delay is T and the average is T/2. If this bias is a problem, it can be solved
by making the elevator run in one direction only (say, from hub to edge). When it
finishes the request closest to the edge, it seeks all the way back to the first request
(the one closest to the hub) and starts another pass from hub to edge. In general, this
approach will increase the total amount of seek time because of the long seek from the
edge back to the hub, but on a heavily loaded disk that seek will be so infrequent as not
to make much difference.

End of Chapter
