Extxfs Short
Linux supports many file systems, but those of the ext* family are native to it.
General features of ext2
Directories in ext2
Directory – a file whose blocks consist of entries of type ext2_dir_entry_2.
#define EXT2_NAME_LEN 255
struct ext2_dir_entry_2 {
__le32 inode; /* Inode number */
__le16 rec_len; /* Directory entry length */
__u8 name_len; /* Name length */
__u8 file_type;
char name[]; /* File name, up to EXT2_NAME_LEN */
};
Source: https://computing.ece.vt.edu/~changwoo/ECE-LKP-2019F/l/lec21-fs.pdf
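The role of rec_len can be illustrated with a minimal sketch of a linear lookup inside one directory block. This is not kernel code: the helper names (get_le16, get_le32, put_entry, dir_block_lookup) are hypothetical, the block is an in-memory byte buffer, and an entry with inode 0 is assumed to be unused.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define DIRENT_HEADER 8  /* inode (4) + rec_len (2) + name_len (1) + file_type (1) */

/* Decode little-endian on-disk fields byte by byte (works on any host). */
static uint32_t get_le32(const uint8_t *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

static uint16_t get_le16(const uint8_t *p)
{
    return (uint16_t)(p[0] | (p[1] << 8));
}

/* Hypothetical helper for demos: write one entry at byte offset off. */
static void put_entry(uint8_t *blk, size_t off, uint32_t ino,
                      uint16_t rec_len, const char *name)
{
    size_t len = strlen(name);

    blk[off + 0] = (uint8_t)ino;
    blk[off + 1] = (uint8_t)(ino >> 8);
    blk[off + 2] = (uint8_t)(ino >> 16);
    blk[off + 3] = (uint8_t)(ino >> 24);
    blk[off + 4] = (uint8_t)rec_len;
    blk[off + 5] = (uint8_t)(rec_len >> 8);
    blk[off + 6] = (uint8_t)len;
    blk[off + 7] = 0;  /* file_type left as "unknown" in this demo */
    memcpy(blk + off + DIRENT_HEADER, name, len);
}

/* Scan the entries of one block; return the inode number, or 0 if absent. */
static uint32_t dir_block_lookup(const uint8_t *blk, size_t size,
                                 const char *name)
{
    size_t want = strlen(name);
    size_t off = 0;

    while (off + DIRENT_HEADER <= size) {
        uint32_t ino = get_le32(blk + off);
        uint16_t rec_len = get_le16(blk + off + 4);
        uint8_t name_len = blk[off + 6];

        if (rec_len < DIRENT_HEADER || off + rec_len > size)
            break;  /* malformed entry: stop instead of running off the block */
        if (ino != 0 && name_len == want &&
            DIRENT_HEADER + (size_t)name_len <= rec_len &&
            memcmp(blk + off + DIRENT_HEADER, name, want) == 0)
            return ino;
        off += rec_len;  /* rec_len, not name_len, leads to the next entry */
    }
    return 0;
}
```

Because the scan advances by rec_len, an entry can be longer than its name requires; the last entry in a block typically stretches to the end of the block.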
Ext2 file system data structures on disk
Superblock — Superblocks in all groups have the same content*.
Group descriptors — As with superblocks, their content is copied to all groups*.
* Originally, the superblock and group descriptors were replicated in every block group
with those located in block group 0 designated as the primary copies. This is no longer
common practice due to the Sparse SuperBlock Option, which replicates the file system
superblock and group descriptors in only a fraction of the block groups.
The kernel uses only the superblock and group descriptors from group 0. When
e2fsck checks the consistency of the file system, it reads the superblock and
descriptors from group 0 and copies them to the other groups. If a failure leaves
the structures stored in group 0 unusable, the administrator can instruct e2fsck
to use the older copies held in the other groups.
During system initialization, the blocks with group descriptors from group 0 are read into
memory. Barring exceptional situations, the system does not use the superblock or the
descriptor blocks from other groups.
Some block numbers may be zero. This means that nothing has been written to the
corresponding region of the file (a hole, which is possible thanks to the lseek() function).
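A hole of this kind can be produced from user space. The sketch below uses only standard POSIX calls (mkstemp, lseek, write, pread); the function names make_file_with_hole and byte_at are hypothetical. Whether the skipped region really occupies no disk blocks depends on the underlying file system, but reads from it return zeros either way.

```c
#define _XOPEN_SOURCE 700  /* for mkstemp() and pread() */
#include <fcntl.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Create a temporary file from a mkstemp() template, seek hole_bytes past
 * the start without writing anything, then write a single byte.
 * Returns an open descriptor, or -1 on error.
 */
static int make_file_with_hole(char *path_template, off_t hole_bytes)
{
    int fd = mkstemp(path_template);

    if (fd < 0)
        return -1;
    if (lseek(fd, hole_bytes, SEEK_SET) == (off_t)-1 ||
        write(fd, "x", 1) != 1) {
        close(fd);
        unlink(path_template);
        return -1;
    }
    return fd;
}

/* Read one byte at the given offset; returns its value, or -1 on error. */
static int byte_at(int fd, off_t offset)
{
    unsigned char c;

    return pread(fd, &c, 1, offset) == 1 ? (int)c : -1;
}
```

The file size reported afterwards is hole_bytes + 1, even though only one byte was ever written.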
Allocating an ext2 disk data block
The allocation of disk blocks is performed by the function ext2_new_blocks().
The inode parameter indicates the inode for which we allocate the block, count indicates
the desired number of blocks, goal gives the number of the block we would like to
allocate (this is related to pre-allocation).
If it is not possible to allocate a block with this number, the function will try to allocate any
other free block.
ext2_fsblk_t ext2_new_blocks(struct inode *inode, ext2_fsblk_t goal,
                             unsigned long *count, int *errp);
void ext2_free_blocks(struct inode *inode, unsigned long block,
                      unsigned long count);
If the goal block is free, it is allocated. If the request cannot be satisfied within the
current group, the function tries the other groups. If it still fails, preallocation is turned off.
Searching for a new block first in the immediate vicinity of the goal block
benefits the speed of the file system.
The ext2 file system owes its very low file fragmentation rate to this strategy.
The blocks of a given file are almost always close together, so loading the file is fast.
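The "goal first, then nearby" idea can be sketched on a single block bitmap. This is a deliberately simplified illustration, not the logic of ext2_new_blocks(): it handles one group only, allocates one block per call, and the names block_in_use, mark_in_use and alloc_near_goal are hypothetical. A caller would move on to another group when it gets -1.

```c
#include <stddef.h>
#include <stdint.h>

/* One bit per block: 1 = in use, 0 = free. */
static int block_in_use(const uint8_t *bitmap, size_t blk)
{
    return (bitmap[blk / 8] >> (blk % 8)) & 1;
}

static void mark_in_use(uint8_t *bitmap, size_t blk)
{
    bitmap[blk / 8] |= (uint8_t)(1u << (blk % 8));
}

/*
 * Allocate one block from a group of nblocks blocks, preferring goal and
 * then scanning forward from it (wrapping around at the end of the group).
 * Returns the allocated block number, or -1 if the group is full.
 */
static long alloc_near_goal(uint8_t *bitmap, size_t nblocks, size_t goal)
{
    if (nblocks == 0)
        return -1;
    if (goal >= nblocks)
        goal = 0;
    for (size_t step = 0; step < nblocks; step++) {
        size_t blk = (goal + step) % nblocks;

        if (!block_in_use(bitmap, blk)) {
            mark_in_use(bitmap, blk);
            return (long)blk;
        }
    }
    return -1;
}
```

Passing the block that follows the file's last allocated block as the goal is what keeps consecutive blocks of a file next to each other.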
File systems of the ext* family
Ext3 – journaling
The ext3 file system was first described in Journaling the Linux ext2fs
Filesystem (Stephen Tweedie, 1998).
If you shut down your computer without unmounting an ext2 file system, the
integrity of the partition must be checked before it is mounted again. The larger the
partition, the longer this takes; for large file systems it can take hours.
In ext3 with the has_journal option enabled, the consistency check is
replaced by replaying the journal, which is much faster, on the order of seconds.
After an unclean unmount, replaying the journal restores a consistent state of the
metadata and, depending on the journaling mode, of the data.
Information about pending file system updates is written to the journal.
Regardless of the mode of operation, the journal ensures consistency only at the
level of a single system call.
There are six types of metadata in ext2 and ext3: superblocks, block group
descriptors, inodes, indirect (index) blocks, data block bitmaps and inode
bitmaps.
Source: https://computing.ece.vt.edu/~changwoo/ECE-LKP-2019F/l/lec21-fs.pdf
Transactions
Instead of considering each file system update as a separate transaction, ext3 groups
many updates into a single compound transaction that is periodically committed to disk.
Compound transactions may have better performance than more fine-grained
transactions when the same structure is frequently updated in a short period of time
(e.g., a free space bitmap or an inode of a file that is constantly being extended).
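Why batching helps can be shown with a toy model, assuming a transaction is just the set of blocks it has touched. The type compound_txn and the function txn_log_update are hypothetical and greatly simplified compared with the real journaling layer: repeated updates to the same block are absorbed into a single journal entry.

```c
#include <stddef.h>

#define TXN_MAX_BLOCKS 32  /* arbitrary capacity for this sketch */

struct compound_txn {
    unsigned long blocks[TXN_MAX_BLOCKS];  /* distinct blocks to journal */
    size_t nblocks;
    unsigned long updates;                 /* logical updates absorbed */
};

/*
 * Record an update to block blk in the running transaction.
 * Returns 1 if the block was newly added, 0 if it was already part of the
 * transaction (no extra journal space needed), -1 if the transaction is full.
 */
static int txn_log_update(struct compound_txn *t, unsigned long blk)
{
    for (size_t i = 0; i < t->nblocks; i++) {
        if (t->blocks[i] == blk) {
            t->updates++;
            return 0;
        }
    }
    if (t->nblocks == TXN_MAX_BLOCKS)
        return -1;  /* a real journal would commit and start a new one */
    t->blocks[t->nblocks++] = blk;
    t->updates++;
    return 1;
}
```

With one transaction per update, a bitmap block modified three times would be written to the journal three times; here it is written once at commit.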
Checkpointing
Checkpointing is the process of writing journaled metadata and data to their fixed locations.
Checkpointing is triggered when various thresholds are crossed, e.g., when file system
buffer space is low, when there is little free space left in the journal, or when a timer
expires.
Source: https://www.usenix.org/legacy/publications/library/proceedings/usenix05/tech/general/full_papers/prabhakaran/prabhakaran_html/main.html
At some point we will wish to commit our outstanding filesystem updates to the journal
as a new compound transaction.
When we commit a transaction, the new updated filesystem blocks are sitting in the
journal but have not yet been synced back to their permanent home blocks on disk
(we need to keep the old blocks unsynced in case we crash before committing the
journal).
Once the journal has been committed, the old version on the disk is no longer
important and we can write back the buffers to their home locations at our leisure.
Until we have finished syncing those buffers, we cannot delete the copy of the data
in the journal.
Ext3 uses checkpoints, at which it checks whether the changes recorded in
the journal have been written to the file system. If they have, the data in the journal
are no longer needed and can be removed.
During recovery, the file system scans the log for committed complete transactions;
incomplete transactions are discarded. Each update in a completed transaction is
simply replayed into the fixed-place ext3 structures.
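The recovery rule (replay committed transactions, discard incomplete ones) can be sketched as follows. The log format here is invented for illustration: each record is either an update carrying a new value for a "home" block or a commit marker, and log_rec, is_committed and journal_replay are hypothetical names, not ext3 or JBD structures.

```c
#include <stddef.h>

enum rec_type { REC_UPDATE, REC_COMMIT };

struct log_rec {
    enum rec_type type;
    unsigned tid;       /* transaction the record belongs to */
    size_t home_block;  /* fixed location of the block (REC_UPDATE only) */
    int new_value;      /* stands in for the block contents */
};

/* A transaction counts as complete only if its commit record is in the log. */
static int is_committed(const struct log_rec *log, size_t n, unsigned tid)
{
    for (size_t i = 0; i < n; i++)
        if (log[i].type == REC_COMMIT && log[i].tid == tid)
            return 1;
    return 0;
}

/*
 * Replay every update of a committed transaction into disk[], in log order.
 * Updates of transactions without a commit record are skipped.
 * Returns the number of updates applied.
 */
static size_t journal_replay(const struct log_rec *log, size_t n,
                             int *disk, size_t ndisk)
{
    size_t replayed = 0;

    for (size_t i = 0; i < n; i++) {
        if (log[i].type != REC_UPDATE)
            continue;
        if (!is_committed(log, n, log[i].tid))
            continue;  /* crash happened before commit: discard */
        if (log[i].home_block >= ndisk)
            continue;  /* out of range in this toy model */
        disk[log[i].home_block] = log[i].new_value;
        replayed++;
    }
    return replayed;
}
```

Replaying is idempotent in this model: running it twice leaves the same contents, which is why a crash during recovery is harmless.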
Ext3 – journaling modes
Ext3 – directories in H-trees
In ext2, a directory is a list of variable-size directory entries. Looking up the inode number for a
name takes O(n) time, where n is the number of entries in the directory. An ext3 partition with the
dir_index option enabled can shorten this lookup several times over.
H-trees (htree, hash tree) used in ext3 directory indexes are trees of height 2 or 3 in which
all leaves are at the same depth. The root of an h-tree index is the first block of a directory file. The
leaves are normal ext2 directory blocks, referenced by the root or indirectly through
intermediate h-tree index blocks. References within the directory file are by means of logical
block offsets within the file.
The nodes other than leaves, as in B-trees, contain key values that separate the keys in the
subtrees attached to subsequent child nodes.
The keys are the hash function values for the file names.
Ext3 supports several hash functions.
The search for a file name in the H-tree begins with a binary search over the index entries for the
leaf in which the directory entry for the name should be found.
Directory entries within a leaf are not ordered, so the leaf has to be searched linearly.
Hash values may collide.
An important case arises when, of two directory entries whose names collide, one has
the largest hash value in its leaf and the other has the smallest in the successor of
this leaf.
If we are looking for the second name, we arrive at the first block and find
there that the name is missing.
At this point we need to search the successor node (and perhaps more nodes).
The least significant bit of the hash stored in the parent node indicates whether such a collision
across the boundary occurred; when it did not, we can skip searching the successor.
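The index step of the lookup can be sketched as a binary search over one index node. The layout is an assumption made for illustration: each entry holds the lowest hash of its leaf, name hashes always have the least significant bit clear, and that bit is set in an entry when its leaf continues a run of colliding hashes from the previous leaf. The names idx_entry, find_leaf and must_search_successor are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

struct idx_entry {
    uint32_t hash;  /* lowest hash in the leaf; bit 0 = continued collision */
    unsigned leaf;  /* logical block of the leaf within the directory file */
};

/*
 * Return the position of the last entry whose key is <= hash.
 * Entry 0 is treated as covering everything below idx[1]; n must be >= 1.
 * Comparing the raw key (flag bit included) sends a colliding hash to the
 * first leaf of its run, because hash | 1 > hash.
 */
static size_t find_leaf(const struct idx_entry *idx, size_t n, uint32_t hash)
{
    size_t lo = 0, hi = n;

    while (hi - lo > 1) {
        size_t mid = lo + (hi - lo) / 2;

        if (idx[mid].hash <= hash)
            lo = mid;
        else
            hi = mid;
    }
    return lo;
}

/* After a miss in leaf i: does the next leaf continue the same hash? */
static int must_search_successor(const struct idx_entry *idx, size_t n,
                                 size_t i, uint32_t hash)
{
    return i + 1 < n &&
           (idx[i + 1].hash & 1u) != 0 &&
           (idx[i + 1].hash & ~1u) == hash;
}
```

After find_leaf, the chosen leaf is scanned linearly; only when must_search_successor reports a continued collision does the search move on to the next leaf.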
Ext3 – directories in H-trees
Adding an entry consists of adding a new directory entry to the appropriate leaf.
If the leaf is full and there is only one level of index nodes, we perform the split
operation as in a B-tree.
If the leaf is full and the index, already two levels deep, cannot be split further, then either
the directory holds several tens of millions of entries or, because of fragmentation, too many
other nodes are not full; in this case the failure to create the file is reported.
A directory entry is deleted only in the leaf.
If the leaf becomes empty, we do nothing about it, which simplifies the
implementation but potentially prevents the operation of splitting another node.
Ext4 (fourth extended file system)
• Introduced in 2008 (not an entirely new file system, rather a fork of ext3).
• Main maintainers: Theodore Ts’o, Andreas Dilger.
• Available since kernel version 2.6.19 (initially as the development version, ext4dev).
• Supports huge individual file sizes and overall file system size.
• The maximum individual file size ranges from 16 GB to 16 TB.
• The overall file system size can be up to 1 EB (exabyte).
1 EB = 1024 PB (petabyte), 1 PB = 1024 TB (terabyte).
• A directory can contain 64,000 subdirectories.
• Several other new features are introduced in ext4: multiple block allocation,
delayed allocation, journal checksum, fast fsck, etc.
• There is an option to turn the journaling feature off.
• An existing ext3 file system can be mounted as ext4 (without having to upgrade it).
Extent
The most important feature that distinguishes ext4 from ext2 and ext3 is the
extents mechanism, which replaces indirect block addressing.
Instead of addressing individual blocks, ext4 tries to map as much data as possible to a
contiguous block area on the disk. To describe such a mapping, ext4 needs three values:
– the first logical block of the mapped range in the file,
– the size of the mapped area (in blocks) and
– the first physical block of the data on the disk.
The structure that stores these values is called an extent.
struct ext4_extent {
__le32 ee_block; /* first logical block extent covers */
__le16 ee_len; /* number of blocks covered by extent */
__le16 ee_start_hi; /* high 16 bits of physical block */
__le32 ee_start_lo; /* low 32 bits of physical block */
};
Source: https://computing.ece.vt.edu/~changwoo/ECE-LKP-2019F/l/lec21-fs.pdf
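How the three values of an extent are used can be shown with a small sketch that translates a logical block number into a physical one. The struct mirrors the fields of ext4_extent but uses host-endian fixed-width types; extent_demo, extent_start and extent_lookup are hypothetical names, and ee_len is treated as a plain length (an initialized extent).

```c
#include <stdint.h>

/* Host-endian stand-in for struct ext4_extent (same fields, same order). */
struct extent_demo {
    uint32_t ee_block;     /* first logical block the extent covers */
    uint16_t ee_len;       /* number of blocks covered */
    uint16_t ee_start_hi;  /* high 16 bits of the physical block */
    uint32_t ee_start_lo;  /* low 32 bits of the physical block */
};

/* Combine the two halves into the 48-bit physical start block. */
static uint64_t extent_start(const struct extent_demo *e)
{
    return ((uint64_t)e->ee_start_hi << 32) | e->ee_start_lo;
}

/*
 * Map logical block lblk through the extent.
 * Returns 1 and stores the physical block in *pblk if the extent covers
 * lblk, otherwise returns 0 and leaves *pblk untouched.
 */
static int extent_lookup(const struct extent_demo *e, uint32_t lblk,
                         uint64_t *pblk)
{
    if (lblk < e->ee_block || lblk - e->ee_block >= e->ee_len)
        return 0;
    *pblk = extent_start(e) + (lblk - e->ee_block);
    return 1;
}
```

A single 12-byte record therefore replaces one block pointer per block for the whole contiguous range.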
File, volume, extent size
File blocks in the ext4 system are numbered using 32 bits, which limits their number to
$2^{32}$ 4 KB blocks. This gives a maximum file size of
$2^{32} \cdot 2^{12} = 2^{4} \cdot 2^{40}$ bytes = 16 TiB. In standard
ext3, the file can have a maximum of 2 TiB.
The volume size, in turn, is limited by the 48-bit block identifier on the disk, which for a
4 KB block size gives $2^{48} \cdot 2^{12} = 2^{60}$ bytes = 1 EiB. For comparison, ext3 with a
32-bit number and a 4 KB block size offered a maximum partition size of 16 TiB.
The size of an extent is limited to $2^{15}$ blocks, which for a 4 KB block gives
$2^{15} \cdot 2^{12} = 2^{27}$ bytes = 128 MiB. This limitation results from the division into
block groups: a single block group can have a maximum size of 128 MiB. Due to this limitation, the
most significant bit of the 16-bit extent size can be used in the preallocation mechanism.
The extents mechanism reduces the size of metadata, which means that operations on
large files are much faster. A 500 MB file in ext4 uses four 12-byte extents, while
the block addresses of the same file need more than 0.5 MB of metadata in ext2. The
advantage of the new solution can be seen especially in tasks requiring many
operations on metadata (e.g. file deletion).
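The arithmetic above can be checked with a few helper functions. This is only a back-of-the-envelope sketch: the function names are hypothetical, ext2 block pointers are assumed to be 4 bytes each, the indirect blocks' own overhead is ignored, and the file is assumed to be perfectly contiguous.

```c
#include <stdint.h>

#define MAX_EXTENT_BLOCKS 32768u  /* 2^15 blocks per extent */

/* Bytes addressable with index_bits-bit block numbers. */
static uint64_t addressable_bytes(unsigned index_bits, uint32_t block_size)
{
    return ((uint64_t)1 << index_bits) * block_size;
}

static uint64_t blocks_needed(uint64_t file_bytes, uint32_t block_size)
{
    return (file_bytes + block_size - 1) / block_size;
}

/* Bytes of 4-byte block pointers ext2 needs to map the file. */
static uint64_t ext2_pointer_bytes(uint64_t file_bytes, uint32_t block_size)
{
    return blocks_needed(file_bytes, block_size) * 4;
}

/* Fewest extents that can map the file if it is fully contiguous. */
static uint64_t min_extent_count(uint64_t file_bytes, uint32_t block_size)
{
    uint64_t blocks = blocks_needed(file_bytes, block_size);

    return (blocks + MAX_EXTENT_BLOCKS - 1) / MAX_EXTENT_BLOCKS;
}
```

For a 500 MB file with 4 KB blocks this gives 128,000 block pointers (512,000 bytes) in ext2 against four extents (48 bytes) in ext4.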
Extent map (source: Ext4: The Next Generation of Ext2/3 Filesystem, Mingming Cao, Suparna
Bhattacharya, Ted Tso)
Storing files over 512 MB
A tree is built for larger files. For this purpose, an additional structure is used:
an index entry containing the first logical block it covers in the file and the number of the
disk block that holds the next level of the tree. Such a block always starts with a header
describing its contents and may contain further index entries or extents pointing at the data.
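Where the 512 MB threshold comes from can be sketched with the sizes involved. The structures below follow the field layout of the kernel's ext4_extent_header and ext4_extent_idx as far as I recall it, so treat the field names as an assumption; the 60-byte figure is the size of the inode's i_block area, and the helper names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

#define I_BLOCK_BYTES 60          /* block-map area inside the inode */
#define MAX_EXTENT_BLOCKS 32768u  /* 2^15 blocks per extent */

struct demo_extent_header {       /* starts every node of the extent tree */
    uint16_t eh_magic;
    uint16_t eh_entries;          /* valid entries following the header */
    uint16_t eh_max;              /* capacity of the node in entries */
    uint16_t eh_depth;            /* 0 = entries are extents (leaf) */
    uint32_t eh_generation;
};

struct demo_extent_idx {          /* entry of an interior node */
    uint32_t ei_block;            /* first logical block covered */
    uint32_t ei_leaf_lo;          /* low 32 bits of the child block number */
    uint16_t ei_leaf_hi;          /* high 16 bits of the child block number */
    uint16_t ei_unused;
};

struct demo_extent {              /* entry of a leaf node */
    uint32_t ee_block;
    uint16_t ee_len;
    uint16_t ee_start_hi;
    uint32_t ee_start_lo;
};

/* Extents that fit in the inode itself, next to the header. */
static size_t inline_extent_slots(void)
{
    return (I_BLOCK_BYTES - sizeof(struct demo_extent_header)) /
           sizeof(struct demo_extent);
}

/* Largest file mappable without any extra tree block. */
static uint64_t max_inline_mapped_bytes(uint32_t block_size)
{
    return (uint64_t)inline_extent_slots() * MAX_EXTENT_BLOCKS * block_size;
}

/* Entries that fit in one on-disk tree block after its header. */
static size_t entries_per_tree_block(uint32_t block_size)
{
    return (block_size - sizeof(struct demo_extent_header)) /
           sizeof(struct demo_extent);
}
```

With 4 KB blocks the inode holds four extents of up to 128 MiB each, which gives 512 MiB; anything larger (or more fragmented) needs index entries that point to separate tree blocks.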
Multiple block allocation
In ext2, as well as ext3, each block of a file had to be allocated separately, which in
the case of large files resulted in a large number of calls to the allocation function. In
addition to performance issues, this made the file system more susceptible to
fragmentation.
Ext4 has a multiple block allocation mechanism (mballoc), which is needed to provide
contiguous block areas for extents.
Depending on the file size, the allocator uses different strategies:
– small files (<16 blocks) are kept close together, which speeds
up reading them;
– large files are allocated so that they occupy the most contiguous disk area
possible.
This solves the performance and fragmentation issues that occur in ext2.
Regardless of which strategy the ext4 allocator uses, it first checks whether there are free
preallocated blocks and only in the next step uses the buddy cache.
Description of the allocator:
Mballoc.c @ LXR – 300 line comment on the operation of the multiblock allocator.
Using this mechanism does not affect the format of the data stored on the disk.
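The difference between one-block-at-a-time and multiple block allocation can be sketched with a first-fit search for a free run in a bitmap. This is far simpler than mballoc (no buddy cache, no preallocation, no per-size strategy); find_free_run and alloc_run are hypothetical names used only for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* One bit per block: 1 = in use, 0 = free. */
static int run_block_in_use(const uint8_t *bitmap, size_t blk)
{
    return (bitmap[blk / 8] >> (blk % 8)) & 1;
}

/*
 * Find the first run of want consecutive free blocks.
 * Returns the first block of the run, or -1 if there is none.
 */
static long find_free_run(const uint8_t *bitmap, size_t nblocks, size_t want)
{
    size_t run = 0;

    if (want == 0)
        return -1;
    for (size_t blk = 0; blk < nblocks; blk++) {
        if (run_block_in_use(bitmap, blk))
            run = 0;                        /* run broken, start over */
        else if (++run == want)
            return (long)(blk + 1 - want);  /* run complete */
    }
    return -1;
}

/* Allocate want contiguous blocks with a single call. */
static long alloc_run(uint8_t *bitmap, size_t nblocks, size_t want)
{
    long start = find_free_run(bitmap, nblocks, want);

    if (start < 0)
        return -1;
    for (size_t i = 0; i < want; i++) {
        size_t blk = (size_t)start + i;

        bitmap[blk / 8] |= (uint8_t)(1u << (blk % 8));
    }
    return start;
}
```

Because the whole request is known up front, the allocator can skip fragments that are too small; a per-block allocator would have filled them and split the file.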
Delayed allocation
Delayed allocation (allocate-on-flush) is a technique used in many modern file systems
that postpones block allocation for as long as possible (in contrast to traditional file
systems, in which blocks are allocated as soon as possible).
In a traditional file system, when a process writes to a file, the file system immediately
allocates the blocks where the data will be written, even if the write itself does not happen
immediately and the data stays cached for some time.
With delayed allocation, blocks are not allocated immediately upon writing,
but only when the data is actually about to be written to disk. This allows the block allocator to
optimize the allocation.
Delayed allocation works very well with two other techniques, extents and multiple block
allocation, because in many situations when the file is finally written to disk, it will be
placed in extents allocated using the mballoc allocator. This improves
performance and reduces fragmentation.
In the case of temporary files, there is a chance that they will never need to be written to
disk at all.
Disadvantages: it increases the risk of data loss during a failure. Many assumptions about
writing to a file that hold for ext2 become false for ext4.
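The effect of postponing allocation can be modelled with a few counters. The sketch below is a toy model, not ext4 code: buffered writes only reserve space, and the (hypothetical) allocator is invoked once per flush with the total size known. All names are invented for illustration.

```c
#include <stddef.h>

struct delalloc_file {
    size_t reserved_blocks;   /* dirty, cached, not yet allocated */
    size_t allocated_blocks;  /* blocks actually placed on disk */
    unsigned alloc_calls;     /* how often the allocator was invoked */
};

/* A buffered write only reserves space; nothing is allocated yet. */
static void delalloc_write(struct delalloc_file *f, size_t nblocks)
{
    f->reserved_blocks += nblocks;
}

/* Writeback: allocate everything pending in one request. */
static size_t delalloc_flush(struct delalloc_file *f)
{
    size_t n = f->reserved_blocks;

    if (n == 0)
        return 0;
    f->alloc_calls++;         /* one call, full size known */
    f->allocated_blocks += n;
    f->reserved_blocks = 0;
    return n;
}

/* File deleted before writeback: pending blocks are never allocated. */
static void delalloc_discard(struct delalloc_file *f)
{
    f->reserved_blocks = 0;
}
```

The same model also shows the downside: whatever is still in reserved_blocks at the time of a crash was never on disk.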
Persistent preallocation
Persistent preallocation allows blocks to be assigned to files without initializing them first:
• Most useful for databases and video files.
• Also useful for files that grow gradually via small append operations (e.g. Unix mail files
and log files).
• Protects against running out of disk space when the file is extended and helps reduce data
fragmentation.
• The fallocate() system call reserves a specific area for a file that does not initially
use all of that space.
• The information that the file is preallocated and that an extent contains uninitialized data is
stored in the most significant bit of the 16-bit field describing the size of the extent (ee_len).
• During reads, an uninitialized extent is treated just like a hole, so that the VFS returns
zero-filled blocks.
• Upon writes, the extent must be split into initialized and uninitialized extents, merging
the initialized portion with an adjacent initialized extent if contiguous.
LWN: fallocate()
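How one 16-bit field can carry both the length and the "uninitialized" flag can be sketched as below. The convention shown (values above 2^15 mean an uninitialized extent whose real length is the value minus 2^15, so 2^15 itself is still an initialized extent of maximum length) is how I understand the kernel to encode it; treat it and the helper names as assumptions.

```c
#include <assert.h>
#include <stdint.h>

#define INIT_MAX_LEN 32768u  /* 2^15: longest initialized extent */

/* Is the extent preallocated but not yet written? */
static int extent_is_uninitialized(uint16_t ee_len)
{
    return ee_len > INIT_MAX_LEN;
}

/* Number of blocks covered, whatever the initialization state. */
static unsigned extent_length(uint16_t ee_len)
{
    return ee_len <= INIT_MAX_LEN ? ee_len : ee_len - INIT_MAX_LEN;
}

/* Encode an uninitialized extent of len blocks (1 .. 2^15 - 1). */
static uint16_t mark_uninitialized(uint16_t len)
{
    assert(len > 0 && len < INIT_MAX_LEN);
    return (uint16_t)(len | INIT_MAX_LEN);
}
```

A read that hits an extent for which extent_is_uninitialized() is true returns zeros, exactly as for a hole, while the blocks stay reserved for the file.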
Layout of the large inode
Ext3 supports different inode sizes. The inode size can be set to any power of two from
128 bytes up to the file system block size. This is done with the mke2fs -I
[inode size] command at format time. The default inode size is 128 bytes, which is
already crowded with data and has little space for new fields.
In ext4, the default inode structure size is 256 bytes.
Test 1
Test 2
FFSB is a powerful file system benchmarking tool that can be tuned to simulate very specific
workloads.
Multithreaded creation of large files. The test runs 4 threads, which combined create 24 1-GB
files, and stresses the sequential write operation.
35% improvement in throughput and 40% decrease in CPU utilization in ext4 as compared to ext3.
Test 2 (source: The new ext4 filesystem: current status and future plans – 2007)
Tests
Test 3
Postmark is a well-known benchmark simulating a mail server performing many single-threaded
transactions on small to medium files.
About 30% throughput gain with ext4. Similar percent improvements in CPU utilization, because
metadata is much more compact with extents. The write throughput is higher than read throughput
because everything is being written to memory.
Aside from the obvious performance gain on large contiguous files, ext4 is also a good choice
on smaller file workloads.
Test 3 (source: The new ext4 filesystem: current status and future plans – 2007)
Test 4
For the IOzone benchmark testing, the system was booted with only 64 MB of memory to really
stress disk I/O.
The tests were performed with 8 MB record sizes on various file sizes.
Write, rewrite, read, reread, random write, and random read operations were tested.
The figure shows throughput results for 512 MB files.
There is great improvement between ext3 and ext4, especially on rewrite, random-write and
reread operations.
In this test, XFS still has better read performance, while ext4 has shown higher throughput on
write operations.
Test 4 (source: The new ext4 filesystem: current status and future plans – 2007)
Additional reading
• Documentation/filesystems/ext2.txt.
• State of the Art: Where we are with the Ext3 filesystem, M. Cao, T. Y. Ts'o, B.
Pulavarty, S. Bhattacharya, IBM.
• A Directory Index for Ext2, Daniel Phillips, 2001.
• Journaling the Linux ext2fs Filesystem, LinuxExpo, Stephen C. Tweedie, 1998.
• Anatomy of Linux journaling file systems, M. Tim Jones, IBM.
• Ext3, Wikipedia, the free encyclopedia, 7 May 2010.
• Ext3 removal, quota & udf fixes (Linus Torvalds, September 2015)
So the thing I'm happy to see is that the ext4 developers seem to unanimously
agree that maintaining ext3 compatibility is part of their job, and nobody seems
to be arguing for keeping ext3 around.
Assuming no major objections come up, the EXT3 file-system driver will be
dropped for the Linux 4.3 kernel.
• File system design case studies (Paul Krzyzanowski, March 2012)
• Documentation/filesystems/ext4.txt.
• Ext4 wiki.
• Ext4 Howto.
• Ext4 Disk Layout.
• Ext4, FOSDEM, Theodore Ts’o, 2009.
• Ted Ts'o on the ext4 filesystem, NYLUG, Theodore Ts’o, 2013.
• Ext4 block and inode allocator improvements, A. Kumar, M. Cao, J. Santos, A.
Dilger, 2008 Linux Symposium.
• Case-insensitive ext4, Jake Edge, March 2019.
• The new ext4 filesystem: current status and future plans, A. Mathur, M. Cao, S.
Bhattacharya, A. Dilger, A. Tomas, L. Vivier, 2007 Linux Symposium.
• Ext4: The Next Generation of Ext2/3 Filesystem, M. Cao, S. Bhattacharya, T. Tso,
IBM, 2007.
• A Minimum Complete Tutorial of Linux ext4 File System, Mete Balci, 2017.
• Understanding Linux filesystems: ext4 and beyond, Jim Salter, April 2018
• How do SSDs work?, Joel Hruska, ExtremeTech, 2021.