Chapter 1
Secondary storage
Dr. Ibaa
Index
1) Introduction
2) File system features
3) Implementing File Systems
4) Physical storage
1. Introduction
The role of the file system is to provide a high level abstraction of the hardware disk,
making it appear to be a large number of disk-like objects called files. Like a disk, a file
is capable of storing a large amount of data cheaply, reliably, and persistently.
The main goal of files is to store persistent data, which is defined as data whose lifetime is
longer than that of the process that created it; when the process terminates,
persistent data remains, but volatile data is lost.
Given that virtual memory and file systems use the same device (hard disk drive),
to do almost the same thing (store data objects), why are they so different?
Couldn't there be a system where we have only virtual memory, and inside this
virtual memory we can designate objects as "volatile" or "persistent"?
Name
Case: Are upper and lower case letters considered different? The Unix tradition is to
consider the names Foo and foo to be completely different and unrelated names. In
DOS and its descendants, however, they are considered the same.
Time stamps: Time created, time last modified, time last accessed, time the attributes
were last changed, etc.
fd = open(name, operation)
fd = creat(name, mode)
status = close(fd)
byte_count = read(fd, buffer, byte_count)
byte_count = write(fd, buffer, byte_count)
offset = lseek(fd, offset, whence)
status = link(oldname, newname)
status = unlink(name)
status = stat(name, buffer)
status = fstat(fd, buffer)
status = utimes(name, times)
status = chown(name, owner, group) or fchown(fd, owner, group)
status = chmod(name, mode) or fchmod(fd, mode)
status = truncate(name, size) or ftruncate(fd, size)
whence: One of three codes, signifying from start, from end, or from current location.
operation: An integer code, one of read, write, read and write, and perhaps a few
other possibilities such as append only.
fd = open(name, operation)
The open call finds a file and assigns a descriptor to it. It also indicates how the file will
be used by this process (read only, read/write, etc).
fd = creat(name, mode)
The creat call is similar, but creates a new (empty) file. The mode argument specifies
protection attributes (such as “writable by owner but read-only by others”) for the new
file. (Most modern versions of Unix have merged creat into open by adding an optional
mode argument and allowing the operation argument to specify that the file is
automatically created if it doesn't already exist.)
status = close(fd)
The close call simply announces that fd is no longer in use and can be reused for
another open or creat.
The read and write operations transfer data between a file and memory. The starting
location in memory is indicated by the buffer parameter; the starting location in the file
(called the seek pointer) is wherever the last read or write left off. The result is the
number of bytes transferred. For write it is normally the same as the byte_count
parameter unless there is an error. For read it may be smaller if the seek pointer starts
out near the end of the file. The lseek operation adjusts the seek pointer (it is also
automatically updated by read and write).
The specified offset is added to zero, the current seek pointer, or the current size of
the file, depending on the value of whence.
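A minimal sketch of how these calls combine (the file path, contents, and helper name below are our own illustration, not from the text): open creates the file, write advances the seek pointer, and lseek with the from-end whence code repositions it before a read.

```c
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical example: create a small file, then use lseek with
 * SEEK_END (the "from end" whence code) to reposition the seek
 * pointer before reading a single byte back. */
char byte_from_end(const char *path, off_t back) {
    int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return '\0';
    write(fd, "hello", 5);       /* seek pointer advances to offset 5 */
    lseek(fd, -back, SEEK_END);  /* new offset = file size - back */
    char c = '\0';
    read(fd, &c, 1);             /* reads one byte, advances the pointer */
    close(fd);
    return c;
}
```

Note that read and write always work at the current seek pointer; lseek is only needed for non-sequential access.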
The function link adds a new name (alias) to a file, while unlink removes a name.
There is no function to delete a file; the system automatically deletes it when there are
no remaining names for it.
The stat function retrieves meta-data about the file and puts it into a buffer (in a fixed,
documented format), while the remaining functions can be used to update the meta-
data:
• utimes: updates time stamps
• chown: updates ownership
• chmod: updates protection information
• truncate: changes the size (files can be made bigger by write, but only truncate
can make them smaller)
Most come in two flavors:
1) one that takes a file name
2) one that takes a descriptor for an open file.
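For example, the same piece of meta-data can be fetched either way. This sketch (the file path and helper name are our own, not from the text) compares the size reported by stat, which takes a name, with the size reported by fstat, which takes a descriptor.

```c
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical example: write 3 bytes, then read the file size back
 * through both flavors of the call. Returns 1 if both agree. */
int sizes_agree(const char *path) {
    int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return 0;
    write(fd, "abc", 3);          /* the file is now 3 bytes long */
    struct stat by_name, by_fd;
    stat(path, &by_name);         /* flavor 1: takes a file name */
    fstat(fd, &by_fd);            /* flavor 2: takes a descriptor */
    close(fd);
    return by_name.st_size == 3 && by_fd.st_size == 3;
}
```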
Contiguous:
The blocks of a file are the blocks numbered n, n+1, n+2,
..., m. We can represent any file with a pair of numbers:
the block number of the first block and the length of the
file (in blocks). The main advantage of this approach is
that it is simple.
Linked List:
In the file descriptor, we just keep a pointer to the first block. In each block of the file
we keep a pointer to the next block. We can also keep a linked list of free blocks for the
free list.
Disk Index
The idea here is to keep the linked-list representation, but take the link fields out of the
blocks and gather them all together in one place. At some fixed place on disk, allocate
an array I with one element for each block on the disk, and move the link field from
block n to I[n].
Recall that a single disk access takes as long as tens or even hundreds of thousands
of instructions.
The whole array of links, called a file allocation table (FAT), is now small enough that it
can be read into main memory when the system starts up. Accessing the 100th block of
a file still requires walking through 99 links of a linked list, but now the entire list is in
memory, so the time to traverse it is negligible.
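The walk through the in-memory link array can be sketched like this (a toy model; the array layout and names are ours, not a real FAT implementation):

```c
#define FAT_EOF (-1)

/* Toy in-memory model of the link array: fat[n] holds the number of
 * the block that follows block n in its file, or FAT_EOF at the end.
 * Returns the disk block number of the n-th block of the file. */
int nth_block(const int *fat, int first_block, int n) {
    int block = first_block;
    while (n-- > 0 && block != FAT_EOF)
        block = fat[block];  /* follow one link; no disk access is
                                needed once the table is in memory */
    return block;
}
```

Each step is a single array lookup, which is why walking 99 links costs almost nothing compared to 99 disk reads.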
The inode structure introduced by Unix groups together index information about each
file individually.
The basic idea is to represent each file as a tree of blocks, with the data blocks as
leaves. Each internal block (called an indirect block in Unix jargon) is an array of block
numbers, listing its children in order. If a disk block is 2K bytes and a block number is
four bytes, 512 block numbers fit in a block, so a one-level tree (a single root node
pointing directly to the leaves) can accommodate files up to 512 blocks, or one
megabyte in size.
If the root node is cached in memory, the “address” (block number) of any block of the
file can be found without any disk accesses. A two-level tree, with 513 total indirect
blocks, can handle files 512 times as large (up to one-half gigabyte).
The only problem with this idea is that it wastes space for small files. Any file with more
than one block needs at least one indirect block to store its block numbers. A 4K file
would require three 2K blocks, wasting up to one third of its space. Since many files are
quite small, this is a serious problem. The Unix solution is to use a different kind of
“block” for the root of the tree.
File Index
An index node (or inode for short) contains almost all the meta-data about a file listed
above: ownership, permissions, time stamps, etc. (but not the file name). Inodes are
small enough that several of them can be packed into one disk block. In addition to the
meta-data, an inode contains the block numbers of the first few blocks of the file. What
if the file is too big to fit all its block numbers into the inode?
The earliest version of Unix had a bit in the meta-data to indicate whether the file
was “small” or “big.” For a big file, the inode contained the block numbers of
indirect blocks rather than data blocks.
More recent versions of Unix contain pointers to indirect blocks in addition to the
pointers to the first few data blocks.
The inode contains pointers to (i.e., block numbers of) the first few blocks of the file, a
pointer to an indirect block containing pointers to the next several blocks of the file, a
pointer to a doubly indirect block, which is the root of a two-level tree whose leaves are
the next blocks of the file, and a pointer to a triply indirect block. A large file is thus a
lop-sided tree.
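Using the figures from above (512 block numbers per 2K indirect block) and a hypothetical count of ten direct pointers in the inode, the capacity of this lop-sided tree works out as:

```c
/* Capacity, in blocks, of an inode with `direct` direct pointers plus
 * one single, one double, and one triple indirect pointer. The count
 * of 10 direct pointers used in the test is our own example; real
 * systems vary. */
long long max_file_blocks(long long ptrs_per_block, long long direct) {
    long long p = ptrs_per_block;
    return direct      /* blocks reachable directly from the inode */
         + p           /* via the single indirect block */
         + p * p       /* via the double indirect block */
         + p * p * p;  /* via the triple indirect block */
}
```

With 2K blocks, the triple indirect term alone covers 512^3 blocks, so the tree accommodates files far larger than any disk of that era.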
BB | SB | IL | DB
BB: Boot Block, SB: Super Block, IL: Inode List, DB: Data Blocks
The block following the boot block is called the superblock. The superblock gives
information regarding the tuneable parameters of the filesystem: the number of inodes,
the number of data blocks, the size of the data blocks... It may also include information
such as a volume name to identify the partition.
The problem with small blocks is I/O overhead. There is a certain overhead to read or
write a block beyond the time to actually transfer the bytes. If we double the block size,
a typical file will have half as many blocks. Reading or writing the whole file will transfer
the same amount of data, but it will involve half as many disk I/O operations. The
overhead for an I/O operation includes a variable amount of latency (seek time and
rotational delay) that depends on how close the blocks are to each other, as well as a
fixed overhead to start each operation and respond to the interrupt when it completes.
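This trade-off can be put in a back-of-the-envelope model (the specific numbers in the usage below are illustrative, not measurements from the text):

```c
/* Toy cost model: each I/O operation pays a fixed per-operation
 * overhead, plus a per-kilobyte transfer cost. Doubling the block
 * size halves the number of operations but not the bytes moved. */
double file_io_time_ms(double file_kb, double block_kb,
                       double overhead_ms_per_op, double ms_per_kb) {
    double ops = file_kb / block_kb;  /* number of block operations */
    return ops * overhead_ms_per_op + file_kb * ms_per_kb;
}
```

With a 64K file, 5 ms of overhead per operation, and the transfer cost held fixed, moving from 4K to 8K blocks halves the overhead component of the total time.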
3. Implementing File Systems
3.2. Space Management
Block Size and Extents
3.3. Reliability
Disks fail, disks sectors get corrupted, and systems crash, losing the contents of volatile
memory. There are several techniques that can be used to mitigate the effects of these
failures. We only have room for a brief survey.
Bad-block Forwarding
When the disk drive writes a block of data, it also writes a checksum, a small number of
additional bits whose value is some function of the “user data” in the block.
When the block is read back in, the checksum is also read and compared with the data.
If either the data or checksum were corrupted, it is extremely unlikely that the
checksum comparison will succeed. Thus the disk drive itself has a way of discovering
bad blocks with extremely high probability.
Consistency Checking
Some of the information in a file system is redundant. For example, the free list could
be reconstructed by checking which blocks are not in any file. Redundancy arises
because the same information is represented in different forms to make different
operations faster.
If you want to know which blocks are in a given file, look at the inode. If you
want to know which blocks are not in any inode, use the free list.
Unfortunately, various hardware and software errors can cause the data to become
inconsistent. File systems often include a utility that checks for consistency and
optionally attempts to repair inconsistencies. These programs are particularly handy for
cleaning up the disks after a crash.
Unix has a utility called fscheck. It has two principal tasks:
First, it checks that blocks are properly allocated. Each inode is supposed to be the
root of a tree of blocks, the free list is supposed to be a tree of blocks, and each block is
supposed to appear in exactly one of these trees. Fscheck runs through all the inodes,
checking each allocated inode for reasonable values, and walking through the tree of
blocks rooted at the inode. It maintains a bit vector to record which blocks have been
encountered.
If a block is encountered that has already been seen, there is a problem:
Either it occurred twice in the same file (in which case it isn't a tree),
Or it occurred in two different files.
A reasonable recovery would be to allocate a new block, copy the contents of the
problem block into it, and substitute the copy for the problem block in one of the two
places where it occurs.
It would also be a good idea to log an error message so that a human being can
check up later to see what's wrong.
After all the files are scanned, any block that hasn't been found should be on the free
list.
It would be possible to scan the free list in a similar manner, but it's probably
easier just to rebuild the free list from the set of blocks that were not found in
any file. If a bitmap instead of a free list is used, this step is even easier: Simply
overwrite the file system's bitmap with the bitmap constructed during the scan.
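That last step can be sketched as follows (a toy model: for clarity it uses one byte per block rather than one bit):

```c
/* Sketch of rebuilding the free map from the scan results: any block
 * that was not found in some file during the inode walk is free.
 * found[i] == 1 means block i was seen in a file; free_map[i] == 1
 * means block i belongs on the free list. */
void rebuild_free_map(const unsigned char *found,
                      unsigned char *free_map, int n_blocks) {
    for (int i = 0; i < n_blocks; i++)
        free_map[i] = !found[i];  /* free = not found in any file */
}
```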
Second, the other main consistency requirement concerns the directory structure.
The set of directories is supposed to be a tree, and each inode is supposed to have a
link count that indicates how many times it appears in directories.
The tree structure could be checked by a recursive walk through the directories, but it is
more efficient to combine this check with the walk through the inodes that checks for
disk blocks, recording, for each directory inode encountered, the inumber of its parent.
The set of directories is a tree if and only if every directory other than the root has a
unique parent.
This pass can also rebuild the link count for each inode by maintaining in memory an
array with one slot for each inumber. Each time the inumber is found in a directory,
increment the corresponding element of the array. The resulting counts should match
the link counts in the inodes. If not, correct the counts in the inodes.
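A sketch of that counting pass (the array sizes, names, and scratch-buffer convention are our own):

```c
#include <string.h>

/* Rebuild link counts from directory entries and compare them with
 * the counts stored in the inodes. dir_entries[i] is the inumber
 * named by the i-th directory entry; counted[] is caller-provided
 * scratch with n_inodes slots. Returns how many stored counts are
 * wrong (and would need correcting). */
int link_count_mismatches(const int *dir_entries, int n_entries,
                          const int *stored_counts, int n_inodes,
                          int *counted) {
    memset(counted, 0, n_inodes * sizeof *counted);
    for (int i = 0; i < n_entries; i++)
        counted[dir_entries[i]]++;    /* one increment per appearance */
    int bad = 0;
    for (int i = 0; i < n_inodes; i++)
        if (counted[i] != stored_counts[i])
            bad++;                    /* this inode's count is stale */
    return bad;
}
```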
This illustrates a very important principle that pops up throughout operating system
implementation (indeed, throughout any large software system):
The doctrine of hints and absolutes.
Hints are handy because they allow some operations to be done much more quickly than
they could be if only the absolute information were available. But if the hint and the
absolute do not agree, the hint can be rebuilt from the absolutes.
Why is it that for paging we went to great lengths trying to come up with an
algorithm that is “almost as good as LRU” while here we can simply use true LRU?
The problem with implementing LRU is that some information has to be updated on
every single reference. In the case of paging, references can be as frequent as every
instruction, so we have to make do with whatever information hardware is willing to
give us.
The best we can hope for is that the paging hardware will set a bit in a page-table entry.
In the case of file system disk blocks, however, each reference is the result of a system
call, and adding a few extra instructions to a system call for cache maintenance is not
unreasonable.
Adding page caching to the file system implementation is actually quite simple.
Somewhere in the implementation, there is probably a procedure that gets called when
the system wants to access a disk block. Let's suppose the procedure simply allocates
some memory space to hold the block and reads it into memory.
1 http://pages.cs.wisc.edu/~solomon/cs537-old/last/paging.html#page_replacement
To add caching, all we have to do is modify this code to search the disk cache first.

    class CacheEntry {
        int blockNumber;
        Block buffer;
        CacheEntry next, previous;
    }
    class DiskCache {
        CacheEntry head, tail;
        CacheEntry find(int blockNumber) {
            // Search the list for an entry with a matching block number.
            // If not found, return null.
        }
        void moveToFront(CacheEntry entry) {
            // Move entry to the head of the list.
        }
        CacheEntry oldest() {
            return tail;
        }
        Block readBlock(int blockNumber) {
            CacheEntry entry = find(blockNumber);
            if (entry == null) {
                entry = oldest();
                Disk.read(blockNumber, entry.buffer);
                entry.blockNumber = blockNumber;
            }
            moveToFront(entry);
            return entry.buffer;
        }
    }

This code is not quite right, because it ignores writes. If the oldest buffer is dirty (it has
been modified since it was read from disk), it first has to be written back to the disk
before it can be used to hold the new block. Most systems actually write dirty buffers
back to the disk sooner than necessary, to minimize the damage caused by a crash. The
original version of Unix had a background process that would write all dirty buffers to
disk every 30 seconds.
Some information is more critical than others. Some versions of Unix, for example, write
back directory blocks (the data blocks of files of type directory) each time they are
modified. This technique--keeping the block in the cache but writing its contents back to
disk after any modification--is called write-through caching.
LRU management automatically does the “right thing” for most disk blocks.
• If someone is actively manipulating the files in a directory, all of the directory's
blocks will probably be in the cache.
• If a process is scanning a large file, all of its indirect blocks will probably be in
memory most of the time.
But there is one important case where LRU is not the right policy. Consider a process
that is traversing (reading or writing) a file sequentially from beginning to end. Once
that process has read or written the last byte of a block, it will not touch that block
again. The system might as well immediately move the block to the tail of the list as
soon as the read or write request completes.
Tanenbaum calls this technique free behind.
A similar trick is called read-ahead. If a file is being read sequentially, it is a good idea
to read a few blocks at a time. This cuts down on the latency for the application (most
of the time the data the application wants is in memory before it even asks for it).
If the disk hardware allows multiple blocks to be read at a time, it can cut the number of
disk read requests, cutting down on overhead such as the time to service an I/O
completion interrupt. If the system has done a good job of clustering together the blocks
of the file, read-ahead also takes better advantage of the clustering. If the system reads
one block at a time, another process, accessing a different file, could make the disk
head move away from the area containing the blocks of this file between accesses.
Performance
The Berkeley2 file system introduced another trick to improve file system performance.
They divided the disk into chunks, which they called cylinder groups (CGs) because each
one consists of some number of adjacent cylinders. Each CG is like a miniature disk. It
has its own super block and array of inodes. The system attempts to put all the blocks
of a file in the same CG as its inode. It also tries to keep all the inodes in one directory
together in the same CG, so that operations like:
ls -l *
will be fast. It uses a variety of techniques to assign inodes and blocks to CGs in such a
way as to distribute the free space fairly evenly between them, so there will be enough
room to do this clustering.
2 http://pages.cs.wisc.edu/~solomon/cs537-old/last/filesys1.html#block-size
In particular:
• When a new file is created, its inode is placed in the same CG as its parent
directory (if possible). But when a new directory is created, its inode is placed in
the CG with the largest amount of free space (so that the files in the directory will
be able to be near each other).
• When blocks are added to a file, they are allocated (if possible) from the same CG
that contains its inode. But when the size of the file crosses certain thresholds
(say every megabyte or so), the system switches to a different CG, one that is
relatively empty. The idea is to prevent a big file from hogging all the space in
one CG and preventing other files in the CG from being well clustered.
4. Physical storage
4.1. Disk Hardware
A (hard) disk drive records data on the surfaces of metal plates called platters that are
coated with a substance containing ground-up iron, or other substances that allow zeros
and ones to be recorded as tiny spots of magnetization. Floppy disks (also called
“diskettes” by those who think the term “floppy” is undignified) are similar, but use a
sheet of plastic rather than metal, and permanently enclose it in a paper or plastic
envelope. I won't say anything more about floppy disks, but most of the facts about
hard disks are also true for floppies, only slower. It is customary to use the simple term
“disk” to mean “hard disk drive” and say “platter” when you mean the disk itself.
When in use, the disk spins rapidly and a read/write head slides along the surface.
Usually, both sides of a platter are used for recording, so there is a head for each
surface. In some more expensive disk drives, there are several platters, all on a
common axle spinning together. The heads are fixed to an arm that can move radially in
towards the axle or out towards the edges of the platters. All of the heads are attached
to the same arm, so they are all at the same distance from the centers of their platters
at any given time.
To read or write a bit of data on the disk, a head has to be right over the spot where the
data is stored. This may require three operations, giving rise to four kinds of delay.
• The correct head (i.e., the correct surface) must be selected. This is done
electronically, so it is very fast (at most a few microseconds).
• The head has to be moved to the correct distance from the center of the disk.
This movement is called seeking and involves physically moving the arm in or
out. Because the arm has mass (inertia), it must be accelerated and decelerated.
When it finally gets where it's going, the disk has to wait a bit for the vibrations
caused by the jerky movement to die out. All in all, seeking can take several
milliseconds, depending on how far the head has to move.
• The disk has to rotate until the correct spot is under the selected head. Since the
disk is constantly spinning, all the drive has to do is wait for the correct spot to
come around.
• Finally, the actual data has to be transferred. On a read operation, the data is
usually transferred to a RAM buffer in the device and then copied, by DMA, to the
computer's main memory. Similarly, on a write, the data is transferred by DMA to
a buffer in the disk, and then copied onto the surface of a platter.
The total time spent getting to the right place on the disk is called latency and is divided
into rotational latency and seek time (although sometimes people use the term “seek
time” to cover both kinds of latency).
The data on a disk is divided up into fixed-sized disk blocks. The hardware only
supports reading or writing a whole block at a time. If a program wants to change one
bit (or one byte) on the disk, it has to read in an entire disk block, change the part of it
it wants to change, and then write it back out.
Each block has a location, sometimes called a disk address, that consists of three
numbers:
• surface
• track
• sector
The part of the disk swept out by a head while it is not moving is a ring-shaped region
on the surface called a track. The track number indicates how far the data is from the
center of the disk (the axle).
Each track is divided up into some number of sectors. On some disks, the outer tracks
have more sectors than the inner ones because the outer tracks are longer, but all
sectors are the same size. The set of tracks swept out by all the heads while the arm is
not moving is called a cylinder. Thus a seek operation moves to a new cylinder,
positioning each head on one track of the cylinder.
This basic picture of disk organization hasn't changed much in forty years. What has
changed is that disks keep getting smaller and cheaper and the data on the surfaces
gets denser (the spots used to record bits are getting smaller and closer together).
The first disks were several feet in diameter, cost tens of thousands of dollars, and held
tens of thousands of bytes. Currently (2006) a typical disk is 3-1/2 inches or 1 inch in
diameter, costs a few hundred dollars and holds several hundred gigabytes (billions of
bytes) of data.
What hasn't changed much is physical limitations. Early disks spun at 3600 revolutions
per minute (RPM); currently they spin at about 7200 RPM, or 15,000 RPM for high-
performance disks. At 7200 RPM, the rotational latency is at worst 1/7200 minute (8.33
milliseconds) and on the average it is half that (4.17 ms).
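That arithmetic generalizes to any spindle speed:

```c
/* Rotational latency from spindle speed: one revolution takes
 * 60000/RPM milliseconds, and the average wait is half a
 * revolution. */
double avg_rotational_latency_ms(double rpm) {
    return 0.5 * 60000.0 / rpm;
}
```

At 15,000 RPM a revolution takes 4 ms, so the average rotational latency drops to 2 ms, which is part of why high-performance disks spin faster.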
The heads and the arm that moves them have gotten much smaller and lighter, allowing
them to be moved more quickly, but the improvement has been modest.
Current disks take anywhere from a millisecond to tens of milliseconds to seek to a
particular cylinder.
End of Chapter