Distributed File System
A distributed file system enables programs to store and access remote files exactly as they do local
ones, allowing users to access files from any computer on a network. The performance and reliability
experienced for access to files stored at a server should be comparable to that for files stored on local
disks.
In this chapter we define a simple architecture for file systems and describe two basic distributed file service implementations with contrasting designs that have been in widespread use for over two decades: Sun's Network File System (NFS) and the Andrew File System (AFS).
Each emulates the UNIX file system interface, with differing degrees of scalability, fault tolerance and deviation from the strict UNIX one-copy file update semantics.
Introduction:
The requirements for sharing within local networks and intranets lead to a need for a different type of
service – one that supports the persistent storage of data and programs of all types on behalf of clients
and the consistent distribution of up-to-date data. The purpose of this chapter is to describe the
architecture and implementation of these basic distributed file systems. We use the word ‘basic’ here
to denote distributed file systems whose primary purpose is to emulate the functionality of a non-
distributed file system for client programs running on multiple remote computers. They do not
maintain multiple persistent replicas of files, nor do they support the bandwidth and timing
guarantees required for multimedia data streaming – those requirements are addressed in later
chapters. Basic distributed file systems provide an essential underpinning for organizational
computing based on intranets.
File systems were originally developed for centralized computer systems and desktop computers as
an operating system facility providing a convenient programming interface to disk storage. They
subsequently acquired features such as access-control and file-locking mechanisms that made them
useful for the sharing of data and programs. Distributed file systems support the sharing of
information in the form of files and hardware resources in the form of persistent storage throughout
an intranet. A well-designed file service provides access to files stored at a server with performance and reliability similar to, and in some cases better than, that of files stored on local disks. Such services are adapted to the performance and reliability characteristics of local networks, and hence they are most effective in providing shared persistent storage for use in intranets. The first file servers were developed by researchers in the 1970s, and Sun's Network File System became available in the early 1980s.
A file service enables programs to store and access remote files exactly as they do local ones,
allowing users to access their files from any computer in an intranet. The concentration of persistent
storage at a few servers reduces the need for local disk storage and (more importantly) enables
economies to be made in the management and archiving of the persistent data owned by an
organization. Other services, such as the name service, the user authentication service and the print
service, can be more easily implemented when they can call upon the file service to meet their needs
for persistent storage. Web servers are reliant on filing systems for the storage of the web pages that
they serve. In organizations that operate web servers for external and internal access via an intranet,
the web servers often store and access the material from a local distributed file system.
Figure 12.1 provides an overview of types of storage system. In addition to those already mentioned, the table includes distributed shared memory (DSM) systems and persistent object stores. DSM provides an emulation of a shared memory by the replication of memory pages or segments at each host, but it does not necessarily provide automatic persistence. Persistent object stores aim to provide persistence for distributed shared objects. Examples include the CORBA Persistent State Service and persistent extensions to Java. Some research projects have developed platforms that support the automatic replication and persistent storage of objects.
Files contain both data and attributes. The data consist of a sequence of data items (typically 8-bit
bytes), accessible by operations to read and write any portion of the sequence. The attributes are held
as a single record containing information such as the length of the file, timestamps, file type, owner’s
identity and access control lists.
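To make the attribute record concrete, the sketch below models it as a simple Java class. The field names and types are illustrative assumptions, not the layout of any particular file system:

    import java.util.List;

    // Sketch of a file attribute record as described above. Field names
    // are illustrative, not those of any real file system.
    public class FileAttributes {
        long length;           // file length in bytes
        long creationTime;     // timestamps, e.g. milliseconds since the epoch
        long readTime;
        long writeTime;
        int fileType;          // e.g. regular file or directory
        String owner;          // owner's identity
        List<String> accessControlList;  // who may read, write, execute, ...
    }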
File systems are designed to store and manage large numbers of files, with facilities for creating,
naming and deleting files. The naming of files is supported by the use of directories. A directory is a
file, often of a special type, that provides a mapping from text names to internal file identifiers.
Directories may include the names of other directories, leading to the familiar hierarchic file-naming
scheme and the multi-part pathnames for files used in UNIX and other operating systems. File
systems also take responsibility for the control of access to files, restricting access to files according
to users’ authorizations and the type of access requested (reading, updating, executing and so on).
The term metadata is often used to refer to all of the extra information stored by a file system that is
needed for the management of files. It includes file attributes, directories and all the other persistent
information used by the file system.
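The mapping performed by directories and the resolution of multi-part pathnames can be sketched as follows. This is a minimal single-machine illustration; the Directory representation and the use of Object for internal file identifiers are assumptions made for brevity:

    import java.util.HashMap;
    import java.util.Map;

    // Minimal sketch of hierarchic name resolution: each directory maps one
    // text-name component either to a file identifier or to a sub-directory,
    // and a multi-part pathname is resolved one component at a time.
    class Directory {
        final Map<String, Object> entries = new HashMap<>(); // name -> file id or Directory

        Object lookup(String name) { return entries.get(name); }
    }

    class PathResolver {
        // Resolve a pathname such as "/a/b/c" relative to root; returns the
        // file identifier, or null if any component is missing.
        static Object resolve(Directory root, String pathname) {
            Object current = root;
            for (String component : pathname.split("/")) {
                if (component.isEmpty()) continue;            // skip leading '/'
                if (!(current instanceof Directory)) return null;
                current = ((Directory) current).lookup(component);
                if (current == null) return null;
            }
            return current;
        }
    }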
Design issues for distributed file systems:
• The effective use of client caching to achieve performance equal to or better than that of local file systems.
• The maintenance of consistency between multiple cached client copies of files when they are updated (one common validation scheme is sketched just after this list).
• Recovery after client or server failure.
• High throughput for reading and writing files of all sizes.
• Scalability.
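One widely used way of addressing the first two items is for clients to cache file blocks and to validate cached entries against the server's last-modification time, trusting an entry for a fixed freshness interval between checks; this is the approach taken by NFS clients. A minimal sketch, with illustrative names and an assumed three-second interval:

    // Sketch of timestamp-based validation for a client cache entry, in the
    // style used by NFS clients. Names and the interval are assumptions.
    class CacheEntry {
        byte[] data;
        long timeWhenValidated;  // Tc: when this entry was last checked with the server
        long serverModifyTime;   // server's Tm as recorded at last validation
    }

    class ClientCache {
        static final long FRESHNESS_INTERVAL_MS = 3000;  // t: how long to trust an entry

        // An entry may be used without contacting the server if it was
        // validated recently enough.
        boolean isFresh(CacheEntry e, long now) {
            return (now - e.timeWhenValidated) < FRESHNESS_INTERVAL_MS;
        }

        // Otherwise the server's current modification time must be fetched
        // and compared with the one recorded at last validation.
        boolean validate(CacheEntry e, long serverCurrentModifyTime, long now) {
            if (e.serverModifyTime != serverCurrentModifyTime) return false; // stale: re-fetch
            e.timeWhenValidated = now;  // revalidated; trust again for t ms
            return true;
        }
    }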
Concurrent file updates • Changes to a file by one client should not interfere with the operation of other clients simultaneously accessing or changing the same file. This is the well-known issue of concurrency control; a single-machine illustration of file locking appears after these design issues.
File replication • In a file service that supports replication, a file may be represented by several
copies of its contents at different locations.
Hardware and operating system heterogeneity • The service interfaces should be defined so that
client and server software can be implemented for different operating systems and computers. This
requirement is an important aspect of openness.
Fault tolerance • The central role of the file service in distributed systems makes it essential that the
service continue to operate in the face of client and server failures.
Consistency • Conventional file systems such as that provided in UNIX offer one-copy update
semantics. This refers to a model for concurrent access to files in which the file contents seen by all
of the processes accessing or updating a given file are those that they would see if only a single copy
of the file contents existed.
Security • Virtually all file systems provide access-control mechanisms based on the use of access
control lists. In distributed file systems, there is a need to authenticate client requests so that access
control at the server is based on correct user identities and to protect the contents of request and reply
messages with digital signatures and (optionally) encryption of secret data.
Efficiency • A distributed file service should offer facilities that are of at least the same power and
generality as those found in conventional file systems and should achieve a comparable level of
performance.
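As a single-machine illustration of the concurrency-control issue raised under 'Concurrent file updates' above, the sketch below uses Java's standard byte-range file locking to serialize updates to a shared region of a file; a distributed file service must provide an equivalent guarantee when the competing clients are on different computers. The file name is illustrative:

    import java.nio.ByteBuffer;
    import java.nio.channels.FileChannel;
    import java.nio.channels.FileLock;
    import java.nio.file.Path;
    import java.nio.file.StandardOpenOption;

    // Illustration only: single-machine byte-range locking with java.nio.
    public class LockedUpdate {
        public static void main(String[] args) throws Exception {
            Path file = Path.of("shared.dat");  // illustrative file name
            try (FileChannel channel = FileChannel.open(file,
                    StandardOpenOption.CREATE, StandardOpenOption.WRITE)) {
                // Lock bytes 0..99 exclusively; a second process attempting
                // the same lock blocks until this one is released.
                try (FileLock lock = channel.lock(0, 100, /*shared=*/false)) {
                    channel.write(ByteBuffer.wrap("update".getBytes()), 0);
                }
            }
        }
    }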
Directory service • The directory service provides a mapping between text names for files and their UFIDs (unique file identifiers). Clients may obtain the UFID of a file by quoting its text name to the directory service. The
directory service provides the functions needed to generate directories, to add new file names to
directories and to obtain UFIDs from directories. It is a client of the flat file service; its directory files
are stored in files of the flat file service. When a hierarchic file-naming scheme is adopted, as in
UNIX, directories hold references to other directories.
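The directory service operations just described might be written as an RPC-style interface along the following lines. The operation names and the use of long values for UFIDs are illustrative assumptions; the text specifies the functionality, not the signatures:

    import java.util.List;

    // Sketch of the directory service operations described above.
    interface DirectoryService {
        Long lookup(long dirUfid, String name);                 // text name -> UFID, or null if absent
        void addName(long dirUfid, String name, long fileUfid); // add a new file name to a directory
        void unName(long dirUfid, String name);                 // remove an entry from a directory
        List<String> getNames(long dirUfid, String pattern);    // list matching names, e.g. for 'ls'
    }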
Client module • A client module runs in each client computer, integrating and extending the
operations of the flat file service and the directory service under a single application programming
interface that is available to user-level programs in client computers.
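Building on the DirectoryService sketch above, a client module can present a UNIX-like interface by translating pathnames into directory-service lookups and data accesses into flat file service requests. The FlatFileService interface and all names here are assumed for illustration:

    // Sketch of a client module combining the two services under one API.
    class ClientModule {
        private final DirectoryService directories;
        private final FlatFileService files;

        ClientModule(DirectoryService d, FlatFileService f) { directories = d; files = f; }

        // open: resolve the pathname one component at a time, as a UNIX
        // kernel would, but via directory service requests.
        long open(long rootUfid, String pathname) {
            Long ufid = rootUfid;
            for (String component : pathname.split("/")) {
                if (component.isEmpty()) continue;
                ufid = directories.lookup(ufid, component);
                if (ufid == null) throw new IllegalArgumentException("no such file: " + pathname);
            }
            return ufid;
        }

        byte[] read(long ufid, long position, int length) {
            return files.read(ufid, position, length);  // flat file service call
        }
    }

    interface FlatFileService {  // assumed minimal flat file operations
        byte[] read(long ufid, long position, int length);
        void write(long ufid, long position, byte[] data);
    }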
NFS enhancements • Several research projects have addressed the need for one-copy update
semantics by extending the NFS protocol to include open and close operations and adding a callback
mechanism to enable the server to notify clients of the need to invalidate cache entries. We describe
two such efforts here; their results seem to indicate that these enhancements can be accommodated
without undue complexity or extra communication costs.
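The sketch below illustrates the server side of such a callback mechanism in the spirit of Spritely NFS: the server remembers which clients have cached a file (learned from their open requests) and notifies them to invalidate their copies when another client writes. All names are illustrative, and the real protocols differ in detail:

    import java.util.HashMap;
    import java.util.HashSet;
    import java.util.Map;
    import java.util.Set;

    // Sketch: server-side bookkeeping for cache-invalidation callbacks.
    class CallbackServer {
        interface ClientCallback { void invalidate(long fileUfid); }  // an RPC in practice

        private final Map<Long, Set<ClientCallback>> cachers = new HashMap<>();

        void open(long fileUfid, ClientCallback client) {
            cachers.computeIfAbsent(fileUfid, k -> new HashSet<>()).add(client);
        }

        void write(long fileUfid, ClientCallback writer) {
            // Tell every other caching client that its copy is now stale.
            for (ClientCallback c : cachers.getOrDefault(fileUfid, Set.of())) {
                if (c != writer) c.invalidate(fileUfid);
            }
        }

        void close(long fileUfid, ClientCallback client) {
            Set<ClientCallback> s = cachers.get(fileUfid);
            if (s != null) s.remove(client);
        }
    }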
Some recent efforts by Sun and other NFS developers have been directed at making NFS servers
more accessible and useful in wide-area networks. While the HTTP protocol supported by web
servers offers an effective and highly scalable method for making whole files available to clients
throughout the Internet, it is less useful to application programs that require access to portions of
large files or those that update portions of files. The WebNFS development makes
it possible for application programs to become clients of NFS servers anywhere in the Internet (using
the NFS protocol directly instead of indirectly through a kernel module). This, together with
appropriate libraries for Java and other network programming languages, should offer the possibility
of implementing Internet applications that share data directly, such as multi-user games or clients of
large dynamic databases.
AFS enhancements • We have mentioned that DCE/DFS, the distributed file system included in the
Open Software Foundation’s Distributed Computing Environment, was based on the Andrew File
System. The design of DCE/DFS goes beyond AFS, particularly in its approach to cache consistency.
In AFS, callbacks are generated only when the server receives a close operation for a file that has
been updated. DFS adopted a strategy similar to that of Spritely NFS and NQNFS, generating callbacks as
soon as a file is updated. In order to update a file, a client must obtain a write token from the server,
specifying a range of bytes in the file that the client is permitted to update. When a write token is
requested, clients holding copies of the same file for reading receive revocation callbacks. Tokens of
other types are used to achieve consistency for cached file attributes and other metadata. All tokens
have an associated lifetime, and clients must renew them after their lifetime has expired.
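The token scheme described above might look like the following sketch: a write token covers a byte range and carries a lifetime, and granting one triggers revocation callbacks to clients that cache the file for reading. The names, the lifetime value and the simplification of revoking all readers are illustrative assumptions:

    import java.util.ArrayList;
    import java.util.List;

    // Sketch of DCE/DFS-style write tokens with byte ranges and lifetimes.
    class TokenServer {
        static final long TOKEN_LIFETIME_MS = 60_000;  // assumed lifetime

        static class WriteToken {
            final long fileUfid;
            final long firstByte, lastByte;  // range the holder may update
            long expiresAt;                  // holder must renew before this time
            WriteToken(long ufid, long first, long last, long now) {
                fileUfid = ufid; firstByte = first; lastByte = last;
                expiresAt = now + TOKEN_LIFETIME_MS;
            }
        }

        interface ReaderCallback { void revoke(long fileUfid); }

        private final List<ReaderCallback> readers = new ArrayList<>(); // clients caching for reading

        // Granting a write token revokes readers' cached copies first.
        // (Simplification: a real server would revoke only readers of this file.)
        WriteToken requestWriteToken(long fileUfid, long first, long last, long now) {
            for (ReaderCallback r : readers) r.revoke(fileUfid);
            return new WriteToken(fileUfid, first, last, now);
        }

        void renew(WriteToken t, long now) { t.expiresAt = now + TOKEN_LIFETIME_MS; }
    }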