Dir STR
UNIX stores data in regions of the disk, labeled as files. Files can be nearly
any length, although modern OSs impose a limit of between 2 GB and 64 TB (kB=1024
bytes, MB=1024 kB, GB=1024 MB, TB=1024 GB) on file length (depending on whether
the file offset counters are 32- or 64-bit integers). Files are often small chunks of data,
and UNIX filesystems are built to ensure efficient storage of small and large files, and are
very good at ensuring files are not fragmented across a disk. This is in distinct contrast to
the older Windows filesystems, which would fragment files quite heavily, slowing file
I/O considerably. Users used to the old (pre-NTFS) Windows world should note that
UNIX does not ever require a disk to be defragmented.
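The 2 GB figure can be checked with a little arithmetic. The sketch below assumes the file offset is a signed integer (the helper name max_file_size is only illustrative). The value it gives for 64 bits is a theoretical ceiling; the limit actually met in practice is set by the individual filesystem and is usually much lower.

```python
# Largest file length that a signed offset counter of a given width can address.
# Assumption: the offset is a signed integer, so one bit is spent on the sign.

GB = 1024 ** 3  # 1 GB = 1024 MB, as defined in the text


def max_file_size(offset_bits):
    """Return the largest byte offset a signed integer of this width can hold."""
    return 2 ** (offset_bits - 1) - 1


print(max_file_size(32) / GB)   # just under 2 GB
print(max_file_size(64) // GB)  # theoretical ceiling for 64-bit offsets, in GB
```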
Directories have the same naming rules as files, and are effectively just a small, special
type of file. They hold the entries for files and directories that are contained below them
in the directory tree. All UNIX machines have a directory tree that starts at the root
directory ``/''. Users generally have a home directory located under /home.
UNIX machines do not have a concept of drive letters or names. Instead, all disks on a
system (including network-accessible disks) are given a unique mount point, or directory,
where they are accessible. A disk is mounted to a directory, and the contents appear as
files and directories under the mount point. Thus, a typical UNIX server will have one
disk (or partition) mounted for /, another for /usr, another for /home, and perhaps also
one for /var. To a user, all these disks appear as a single, coherent filesystem. This
means that an administrator needs to keep track of what is mounted where (for space
concerns), but a user need not. If a disk fills, an administrator can move all or part of the
data on that disk to a new disk (under a new or the same mount point), and the user will not
notice except that the space available increases.
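An administrator who does need to know what is mounted where can ask the system directly. The following is a minimal sketch using only the Python standard library; the function names find_mount_point and free_space_gb are illustrative, not part of any standard tool.

```python
import os
import shutil


def find_mount_point(path):
    """Walk up the directory tree from path until a mount point is reached."""
    path = os.path.realpath(path)
    while not os.path.ismount(path):
        path = os.path.dirname(path)
    return path


def free_space_gb(path):
    """Free space, in GB (1024**3 bytes), on the disk that holds path."""
    return shutil.disk_usage(path).free / 1024 ** 3


if __name__ == "__main__":
    home = os.path.expanduser("~")
    mount = find_mount_point(home)
    print(home, "is stored on the disk mounted at", mount)
    print("free space there: %.1f GB" % free_space_gb(mount))
```

If the home directories are later moved to a larger disk under the same mount point, the paths printed here stay the same and only the free-space figure changes.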
This ability to make disks and the filesystem independent, to maintain constant absolute
paths to a given file regardless of the disk it is actually stored on, is invaluable to a
scientist. Upgrades of the disk space on a UNIX machine are not accompanied by a great
reorganization of file locations; the disks simply get bigger, or entire directory trees are
shifted to new disks, while retaining their old filesystem locations! Hence, once an
absolute path is chosen, it can be maintained for all time. This makes maintaining
software and data hierarchies vastly easier.
These directories are present on virtually all UNIX systems, although a specific file may
change its location depending on the UNIX distribution and system administrator. Often,
administrators who move files from the ``standard'' locations will use a symbolic link to
help users navigate the directory tree.
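To find out whether a familiar path is really a symbolic link left behind by an administrator, and where the file now lives, the link can be resolved. This is a small sketch; describe_path is a hypothetical helper name.

```python
import os


def describe_path(path):
    """Report whether path is a symbolic link and, if so, where it finally points."""
    if os.path.islink(path):
        return "%s -> %s" % (path, os.path.realpath(path))
    return "%s (not a symbolic link)" % path
```

Calling describe_path on a relocated directory would show the old, familiar location on the left and the real location on the right.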
Therefore, it is best to give each scientific project its own directory, under the home
directory. All data, results, and interpretations can be kept in the project directory.
Programs which are used by many projects can be kept in a ``bin'' directory in the home
directory, and referenced easily as ``~/bin/program name'' in scripts and documentation.
When each project has its own directory, namespace collision is minimized, so results
and data don't get overwritten by accident. Each project directory can also be the root of
its own standardized directory tree, with (for example) a subdirectory doc/ for reports
and documents related to the project, data/ for raw data files, processed/ for processed
data, etc. If each project directory has roughly the same structure under it, it is easier to
find results, documents, and data years after finishing a project.
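Creating the same skeleton for every project is easy to automate. The sketch below assumes the three subdirectories named in the example above; the function name create_project and the constant STANDARD_SUBDIRS are illustrative and can be adapted to any layout.

```python
from pathlib import Path

# Assumption: this list mirrors the example subdirectories given in the text.
STANDARD_SUBDIRS = ("doc", "data", "processed")


def create_project(home, name, subdirs=STANDARD_SUBDIRS):
    """Create home/name with the standard subdirectories and return its path."""
    project = Path(home) / name
    for sub in subdirs:
        # exist_ok=True makes it safe to run again on an existing project
        (project / sub).mkdir(parents=True, exist_ok=True)
    return project
```

For example, create_project(Path.home(), "myproject") would create doc/, data/, and processed/ under a myproject directory in the home directory (the project name here is made up).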
Finally, if each project has its own directory, it is easy to back up or move an entire
project using UNIX file tools: package the entire directory into an archive (see tar or
cpio) and transfer the one archive to another location. This makes collaboration
significantly easier.
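The same packaging step can be scripted. This is a minimal sketch using Python's standard tarfile module to write a gzip-compressed tar archive; the function names are illustrative, and the archive should be written outside the project directory so that it does not end up inside itself.

```python
import os
import tarfile


def archive_project(project_dir, archive_path):
    """Pack project_dir (and everything below it) into a gzip-compressed tar archive."""
    project_dir = os.path.abspath(project_dir)
    with tarfile.open(archive_path, "w:gz") as tar:
        # arcname keeps only the project's own name, not the full absolute path
        tar.add(project_dir, arcname=os.path.basename(project_dir))
    return archive_path


def list_archive(archive_path):
    """Return the names of all entries stored in the archive."""
    with tarfile.open(archive_path, "r:gz") as tar:
        return tar.getnames()
```

Because the entries are stored relative to the project name, the archive unpacks into a single project directory wherever a collaborator chooses to put it.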
If the raw data and results are in the same directory tree (under the project directory),
then scripts which operate on the data and results need short relative paths, which are less
prone to breaking (less fragile) than long, absolute paths. Scripts can use a shallow, local
directory structure in the project directory to track processing steps. This structure can be
specialized to the project (or subproblem), without worrying about breaking another
project's scripts.
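As an illustration of short relative paths, the sketch below reads every raw file from data/ and writes a processed version to processed/, both relative to the project directory. The processing step (stripping trailing whitespace), the *.txt pattern, and the function name are placeholders chosen for this example.

```python
from pathlib import Path


def process_raw_files(project_root="."):
    """Read each data/*.txt file and write a cleaned copy to processed/."""
    root = Path(project_root)
    raw_dir = root / "data"        # short path relative to the project directory
    out_dir = root / "processed"
    out_dir.mkdir(exist_ok=True)
    written = []
    for raw in sorted(raw_dir.glob("*.txt")):
        # Placeholder processing step: strip trailing whitespace from each line
        lines = [line.rstrip() for line in raw.read_text().splitlines()]
        out = out_dir / raw.name
        out.write_text("\n".join(lines) + "\n")
        written.append(out)
    return written
```

Run from the project directory with the default argument, the script keeps working after the whole project is moved or unpacked elsewhere, since no absolute path appears in it.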
If files from one project are needed in another, use symbolic links rather than copying the
files. Copied files must be updated by hand; symbolic links need no updating.
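A link of this kind can be made with a single call. The following is a minimal sketch; link_into_project is a hypothetical helper, and the absolute target is a design choice that keeps the link valid if the linking project's own directory layout changes.

```python
import os


def link_into_project(source, project_dir, name=None):
    """Create a symbolic link inside project_dir that points at source."""
    source = os.path.abspath(source)
    link = os.path.join(project_dir, name or os.path.basename(source))
    os.symlink(source, link)
    return link
```

Any later change to the source file is seen immediately through the link, because the link stores only the location of the file and not a copy of its contents.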
Raw data files should generally be kept in a primary location for use, and also in a
backup location in case of corruption or accidental deletion. In this case, do not use links,
symbolic or otherwise, since a link does not create a second copy of the data.
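A backup made by copying can be checked against the original at the time it is made. The sketch below copies a raw data file and compares SHA-256 checksums; the function names are illustrative, and the choice of SHA-256 is an assumption of this example rather than a requirement.

```python
import hashlib
import shutil


def sha256_of(path):
    """Return the SHA-256 checksum of a file, read in 1 MB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def backup_raw_file(source, backup):
    """Copy source to backup (a real, independent copy) and verify the result."""
    shutil.copy2(source, backup)  # copies the contents and the timestamps
    if sha256_of(source) != sha256_of(backup):
        raise IOError("backup of %s does not match the original" % source)
    return backup
```

Because the backup is a separate file, later corruption or deletion of the primary copy leaves it untouched.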