
SEEK_HOLE or FIEMAP?

By Jonathan Corbet
December 3, 2007
Sparse files have an apparent size which is larger than the amount of storage actually allocated to them. The usual way to create such a file is to seek past its end and write some new data; Unix-derived systems will traditionally not allocate disk blocks for the portion of the file past the previous end which was skipped over. The result is a "hole," a piece of the file which logically exists, but which is not represented on disk. A read operation on a hole succeeds, with the returned data being all zeroes. Relatively smart file archival and backup utilities will recognize holes in files; these holes are not stored in the resulting archive and will not be filled if the file is restored from that archive.
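
The seek-and-write recipe described above can be sketched in a few lines of C. This is only an illustration: the helper names make_sparse_file() and file_sizes() are invented for the example, and the assumption that st_blocks counts 512-byte units holds on Linux but is not guaranteed elsewhere.

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: create (or truncate) the file at `path`, seek `gap`
 * bytes past the start and write a single byte there.  On filesystems that
 * support holes, the skipped range should get no disk blocks.
 * Returns 0 on success, -1 on error.
 */
int make_sparse_file(const char *path, off_t gap)
{
    assert(gap >= 0);
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0600);
    if (fd < 0)
        return -1;

    int rc = -1;
    if (lseek(fd, gap, SEEK_SET) == gap && write(fd, "x", 1) == 1)
        rc = 0;
    if (close(fd) != 0)
        rc = -1;
    return rc;
}

/*
 * Hypothetical helper: report the apparent size and the allocated size of
 * a file.  Assumes st_blocks is counted in 512-byte units, as on Linux.
 * Returns 0 on success, -1 on error.
 */
int file_sizes(const char *path, off_t *apparent, off_t *allocated)
{
    struct stat st;
    if (stat(path, &st) != 0)
        return -1;
    *apparent = st.st_size;
    *allocated = (off_t)st.st_blocks * 512;
    return 0;
}
```

On a filesystem with hole support, a call like make_sparse_file("demo", 1 << 20) should yield a file whose apparent size is just over 1MB while the allocated size stays at a block or so; on a filesystem without such support the two numbers will be close together.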

The process of recognizing holes is relatively primitive, though: about the only way to do it in a portable way is to simply look for blocks filled with zeroes. This technique works, but it requires making a pass over the data to obtain information which the lower levels of the system already know. It seems like there should be a better way.
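
A minimal sketch of that portable technique might look like the following. The function name count_zero_blocks() and the choice of block size are assumptions made for the example; real archivers use their own heuristics.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

/* Return 1 if the first `len` bytes of `buf` are all zero, else 0. */
static int all_zero(const unsigned char *buf, size_t len)
{
    for (size_t i = 0; i < len; i++)
        if (buf[i] != 0)
            return 0;
    return 1;
}

/*
 * Hypothetical helper: read the whole file in chunks of `block_size` bytes
 * and count the chunks that contain nothing but zeroes.  Every byte of the
 * file has to be read, which is exactly the cost described above.
 * Returns the count, or -1 on error.
 */
long count_zero_blocks(const char *path, size_t block_size)
{
    assert(block_size > 0);
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    unsigned char *buf = malloc(block_size);
    if (buf == NULL) {
        close(fd);
        return -1;
    }

    long zero_blocks = 0;
    ssize_t n;
    while ((n = read(fd, buf, block_size)) > 0)
        if (all_zero(buf, (size_t)n))
            zero_blocks++;

    free(buf);
    close(fd);
    return n < 0 ? -1 : zero_blocks;
}
```

Note that this approach cannot tell a true hole from a block of zeroes which was actually written to disk; it only sees the data that read() returns.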

About two years ago, the Solaris ZFS developers proposed an extension to lseek() which would allow an application to find the holes in sparse files more efficiently. This extension works by adding two new "whence" options:

  • SEEK_HOLE moves the file offset to the beginning of the first hole which occurs after the given offset. For the purposes of this operation, "hole" is defined as a region of all zeroes of any length, but the system is not required to actually detect all holes. So, in practice, small ranges of zeroes will be skipped over, as will, in all likelihood, large (multi-block) ranges which have actually been written to disk.

  • SEEK_DATA moves to the start of the next region (after the given offset) which is not a hole.
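
As a rough sketch of how an application might use the two options together, the following function alternates between them to add up the bytes reported as data. It assumes a kernel and C library which expose SEEK_DATA and SEEK_HOLE (glibc does so when _GNU_SOURCE is defined); count_data_bytes() is a name made up for the example.

```c
#define _GNU_SOURCE   /* exposes SEEK_DATA and SEEK_HOLE in glibc */
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: walk an open file with SEEK_DATA and SEEK_HOLE and
 * sum the bytes which lie in data regions.  The file offset of `fd` is
 * changed as a side effect.  Returns the total, or -1 on error.
 */
off_t count_data_bytes(int fd)
{
    assert(fd >= 0);
    off_t total = 0;
    off_t pos = 0;

    for (;;) {
        /* Find the next data region at or after pos. */
        off_t data = lseek(fd, pos, SEEK_DATA);
        if (data < 0)
            return errno == ENXIO ? total : -1;  /* ENXIO: no more data */

        /* Find where that region ends; end of file counts as a hole. */
        off_t hole = lseek(fd, data, SEEK_HOLE);
        if (hole < 0)
            return -1;
        if (hole <= data)
            return -1;  /* file changed underneath us; avoid looping */

        total += hole - data;
        pos = hole;
    }
}
```

Since the system is not required to detect every hole, the result is an upper bound on the amount of non-zero data; a filesystem with no hole knowledge at all may simply report the whole file as one data region.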

This functionality has been part of Solaris for a while; the Solaris developers would like to see it spread elsewhere and become something more than a Solaris-only extension. To that end, Josef Bacik has recently posted an implementation of this extension for Linux. Internally, it adds a new member to the file_operations structure (seek_hole_data()) intended to allow filesystems to efficiently implement the new operations.

One might argue that anybody who wants to separate holes and data in a file can already do so with the FIBMAP ioctl() command. While that is true, FIBMAP is an inefficient way of getting this sort of information, especially on filesystems which support extents. A FIBMAP call returns the mapping information for exactly one block; mapping out a large file may require millions of calls when, once again, the filesystem should already know how to provide that information in a much more straightforward manner.
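
To make the cost concrete, here is a hedged sketch of a FIBMAP-based scan. The function names are invented for the example; FIBMAP itself comes from <linux/fs.h>, takes a logical block number in an int, and overwrites it with the physical block number (zero meaning that no block is allocated). It normally requires CAP_SYS_RAWIO, and some filesystems do not implement it.

```c
#include <assert.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/fs.h>   /* FIBMAP */

/* Number of FIBMAP calls needed to map `file_size` bytes: one per block. */
uint64_t fibmap_calls_needed(uint64_t file_size, uint64_t block_size)
{
    assert(block_size > 0);
    return file_size / block_size + (file_size % block_size != 0);
}

/*
 * Hypothetical helper: count the unallocated blocks among the first
 * `nblocks` blocks of an open file, at the price of one ioctl() per block.
 * Returns the count, or -1 if FIBMAP fails.
 */
long count_unmapped_blocks(int fd, int nblocks)
{
    long holes = 0;
    for (int i = 0; i < nblocks; i++) {
        int block = i;              /* in: logical block, out: physical */
        if (ioctl(fd, FIBMAP, &block) < 0)
            return -1;
        if (block == 0)
            holes++;
    }
    return holes;
}
```

For a 4GB file on a filesystem with 4KB blocks, fibmap_calls_needed(4ULL << 30, 4096) comes to 1,048,576 calls. The block size to use can be queried with the FIGETBSZ ioctl().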

Even so, this patch looks relatively unlikely to make it into the mainline. The API is unpopular, being seen as ugly and as a change in the semantics of the lseek() call. But, more to the point, it may be interesting to learn much more about the representation of a file than just where the holes are. And, as it turns out, there is already a proposed ioctl() command which can provide all of that information. That interface is the FIEMAP ioctl() specified by Andreas Dilger back in October.

A FIEMAP call takes the following structure as an argument:

    struct fiemap {
        __u64 fm_start;         /* logical starting byte offset (in/out) */
        __u64 fm_length;        /* logical length of map (in/out) */
        __u32 fm_flags;         /* FIEMAP_FLAG_* flags for request (in/out) */
        __u32 fm_extent_count;  /* number of extents in fm_extents (in/out) */
        __u64 fm_end_offset;    /* end of mapping in last ioctl */
        struct fiemap_extent fm_extents[0];
    };

An application wanting to learn something about how a file is stored will put the starting offset into fm_start and the length of the region of interest in fm_length. If fm_flags contains FIEMAP_FLAG_NUM_EXTENTS, the system call will simply set fm_extent_count to the number of extents used to store the specified range of bytes and return. In this form, FIEMAP can be used to determine how fragmented the file is on disk.
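
For readers who want to experiment, here is a sketch of the extent-counting case written against the <linux/fiemap.h> header found in current kernels rather than the proposal described here. That header differs from the structure above: it has no FIEMAP_FLAG_NUM_EXTENTS flag, and a count-only request is instead made by passing an fm_extent_count of zero and reading the result from fm_mapped_extents. The function name count_extents() is an assumption of the example.

```c
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/fiemap.h>   /* struct fiemap, FIEMAP_* constants */
#include <linux/fs.h>       /* FS_IOC_FIEMAP */

/*
 * Hypothetical helper: ask how many extents back the file at `path`.
 * Returns the count, or -1 on error (for example if the filesystem does
 * not implement FIEMAP).
 */
long count_extents(const char *path)
{
    assert(path != NULL);
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    struct fiemap fm;
    memset(&fm, 0, sizeof fm);
    fm.fm_start = 0;
    fm.fm_length = FIEMAP_MAX_OFFSET;  /* map through the end of the file */
    fm.fm_flags = FIEMAP_FLAG_SYNC;    /* sync the file before mapping */
    fm.fm_extent_count = 0;            /* zero: count only, no extents */

    long count = -1;
    if (ioctl(fd, FS_IOC_FIEMAP, &fm) == 0)
        count = (long)fm.fm_mapped_extents;

    close(fd);
    return count;
}
```

A large count for a modestly sized file is a hint that the file is fragmented.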

If the application is looking for more information than that, it will allocate enough space for one or more fm_extents structures:

    struct fiemap_extent {
        __u64 fe_offset;  /* offset in bytes for the start of the extent */
        __u64 fe_length;  /* length in bytes for the extent */
        __u32 fe_flags;   /* returned FIEMAP_EXTENT_* flags for the extent */
        __u32 fe_lun;     /* logical device number for extent (starting at 0) */
    };

In this case, fm_extent_count should be set to the number of these structures before making the FIEMAP call. On return, these structures (as many as are indicated by the returned value of fm_extent_count) will be filled in with information on the actual file extents; fe_offset says where (on disk) the extent starts, and fe_length is the size of the extent. There are quite a few values which can appear in the fe_flags field:

  • FIEMAP_EXTENT_HOLE says that there is no data for this range of the file - it's a hole.

  • FIEMAP_EXTENT_UNWRITTEN says that the space has been allocated on disk, but that nothing has been written to that space. Space which has been preallocated with fallocate() would be marked this way.

  • FIEMAP_EXTENT_UNMAPPED, instead, marks an extent where some application has written data, but for which no disk blocks have been allocated.

  • FIEMAP_EXTENT_DELALLOC indicates that delayed allocation is being done; this flag implies FIEMAP_EXTENT_UNMAPPED as well.

  • FIEMAP_EXTENT_SECONDARY is an indication that the data for this segment is in some sort of secondary storage; one would see this flag on filesystems managed by some sort of hierarchical storage manager. This flag, too, is likely to imply FIEMAP_EXTENT_UNMAPPED.

  • FIEMAP_EXTENT_NO_DIRECT says that the data cannot be accessed directly - it requires processing (decompression or decryption, for example) first.

  • FIEMAP_EXTENT_LAST marks the final extent in a file.

  • FIEMAP_EXTENT_EOF indicates that the requested range goes beyond the end of the file.

  • FIEMAP_EXTENT_ERROR marks an extent which has experienced some sort of error; the fe_offset field will contain an error number in this case.

  • FIEMAP_EXTENT_UNKNOWN says that the data exists, but its location is unknown. This flag would describe much of your editor's personal file space, though it is unclear how the kernel would know that.
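
What follows is a sketch of an extent-listing loop, again written against the <linux/fiemap.h> header shipped with current kernels rather than the proposal described here. In that header the extent structure carries fe_logical and fe_physical in place of fe_offset, the number of extents returned is found in fm_mapped_extents, and there is no FIEMAP_EXTENT_HOLE flag: holes show up as gaps between the logical ranges of successive extents. The function name print_extents() and the batch size are choices made for the example.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/fiemap.h>   /* struct fiemap, struct fiemap_extent */
#include <linux/fs.h>       /* FS_IOC_FIEMAP */

#define BATCH 32   /* extents fetched per ioctl() call; arbitrary */

/*
 * Hypothetical helper: print every extent of the file at `path`.
 * Returns the number of extents printed, or -1 on error.
 */
long print_extents(const char *path)
{
    assert(path != NULL);
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    size_t size = sizeof(struct fiemap) + BATCH * sizeof(struct fiemap_extent);
    struct fiemap *fm = malloc(size);
    if (fm == NULL) {
        close(fd);
        return -1;
    }

    long total = 0;
    uint64_t start = 0;
    int done = 0;
    while (!done) {
        memset(fm, 0, size);
        fm->fm_start = start;
        fm->fm_length = FIEMAP_MAX_OFFSET - start;
        fm->fm_flags = FIEMAP_FLAG_SYNC;   /* sync the file before mapping */
        fm->fm_extent_count = BATCH;       /* room available in fm_extents */

        if (ioctl(fd, FS_IOC_FIEMAP, fm) < 0) {
            total = -1;
            break;
        }
        if (fm->fm_mapped_extents == 0)
            break;                         /* nothing (more) is mapped */

        for (uint32_t i = 0; i < fm->fm_mapped_extents; i++) {
            const struct fiemap_extent *fe = &fm->fm_extents[i];
            printf("logical=%llu physical=%llu length=%llu%s\n",
                   (unsigned long long)fe->fe_logical,
                   (unsigned long long)fe->fe_physical,
                   (unsigned long long)fe->fe_length,
                   (fe->fe_flags & FIEMAP_EXTENT_UNWRITTEN) ? " unwritten" : "");
            total++;
            start = fe->fe_logical + fe->fe_length;  /* resume after it */
            if (fe->fe_flags & FIEMAP_EXTENT_LAST)
                done = 1;
        }
    }

    free(fm);
    close(fd);
    return total;
}
```

Fetching the extents in fixed-size batches keeps the buffer small even for badly fragmented files, at the cost of a few extra ioctl() calls.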

As can be seen, there is a wealth of information available from this new call, including details on how the file has been split up on disk, allocation strategies, and even the decisions made by a hierarchical storage engine. An implementation exists for the ext4 filesystem. None of this code has been pushed toward the mainline yet, but it would be surprising if that did not happen sometime in the relatively near future. Once that is done, the C library will be able to implement SEEK_HOLE and SEEK_DATA in user space, should that be desirable.


SEEK_HOLE or FIEMAP?

Posted Dec 6, 2007 15:10 UTC (Thu) by [email protected] (guest, #38022) [Link] (3 responses)

> One might argue that anybody who wants to separate holes and data in
> a file can already do so with the FIBMAP ioctl() command.

 Such an implementation at:
http://www.mirrorservice.org/sites/download.sourceforge.n...
executable at:
http://www.mirrorservice.org/sites/download.sourceforge.n...
 That can be used to count the level of fragmentation of a filesystem, with some interesting
results.
 The main problem is that some filesystems do not implement it correctly or at all (so LILO or
Gujin cannot be installed on them).
 The other problem, for the case of a bootloader, is that it does not give the position of the
data in the disk but in the device, and there is a big difference when the device is a RAID or
LVM.

 The thing the bootloader has to do is to register where its own code/data are on disk to be
able to load them without the kernel support, and to have only one file, to write the position
of the end of the file at the beginning of itself, so to have block allocated to disk before
the write into the file is finished - possible but tricky.

SEEK_HOLE or FIEMAP?

Posted Dec 7, 2007 1:55 UTC (Fri) by giraffedata (guest, #1954) [Link] (2 responses)

It's much cleaner to have the boot loader use the proper directories, block maps, etc. to access the filesystem. GRUB does this.

In its usual deployment, GRUB still has the problem because that code that knows how to access the filesystem is in the filesystem, and the only way GRUB knows to find it is with built-in block numbers.

But it's possible to put that code outside the filesystem, in an area of disk reserved for that purpose, and then the world is as it should be. You don't need any special kernel interfaces at boot loader installation time, and you don't have to take care to keep the blocks from moving after you've installed the boot loader.

SEEK_HOLE or FIEMAP?

Posted Dec 7, 2007 10:25 UTC (Fri) by [email protected] (guest, #38022) [Link] (1 responses)

<rant>
> It's much cleaner to have the boot loader use the proper directories, block maps, etc. to
access the filesystem. GRUB does this.

 So does Gujin - smaller number of filesystem supported, I have to say.

> In its usual deployment, GRUB still has the problem because that code that knows how to
access the filesystem is in the filesystem, and the only way GRUB knows to find it is with
built-in block numbers.

 So does Gujin.

> But it's possible to put that code outside the filesystem, in an area of disk reserved for
that purpose, and then the world is as it should be.

 By default Gujin puts that code at the end of the disk, outside of any filesystem, but it is not
always available depending on the tool used to create the partitions (Linux tools are used to
fill the whole disk - not leaving a single unallocated sector for the bootloader).
</rant>

 Doesn't change that it would be nice to have a kernel interface which maps the device block
into a hard disk block, for that part of the bootloader which shall not move when it is on a
filesystem (RAID and LVM problem).
 It would also be nice to have an interface to tell the filesystem that this file is the boot
code - there is an inode reserved for that in EXT2/3FS but no way to use it.

SEEK_HOLE or FIEMAP?

Posted Dec 13, 2007 13:28 UTC (Thu) by RobLucid (guest, #49530) [Link]

Wonder why good ole' partitions are out of fashion?

Rather than having the ability for applications to do nasty things and 
become dependent on physical block numbers, which prevent copying of files 
around.   You could use a raw partition, and then copy the blocks into 
known offsets from the beginning of the partition.  This seems much 
simpler.

Presumably a BootFS, with a boot loader friendly structure might also be 
a robust alternative and avoid duplication, of files in the raw partition 
approach.

SEEK_HOLE or FIEMAP?

Posted Apr 18, 2011 14:27 UTC (Mon) by ernest (guest, #2355) [Link]

> FIEMAP_EXTENT_UNKNOWN: This flag would describe much of your editor's
> personal file space, though it is unclear how the kernel would know that.

At least now there is a flag describing this poor condition!

SEEK_HOLE or FIEMAP?

Posted Jul 23, 2014 22:42 UTC (Wed) by kolyshkin (guest, #34342) [Link]

FIEMAP appeared in Linux kernel 2.6.28, released on 25 December, 2008.

SEEK_HOLE and SEEK_DATA appeared in Linux kernel 3.1, although ext4 support for these was only added in Linux 3.8.

SEEK_HOLE or FIEMAP?

Posted May 16, 2018 13:11 UTC (Wed) by salewski (subscriber, #121521) [Link] (1 responses)

"... the Solaris ZFS developers proposed an extension to lseek() which ..."

Within the above article fragment, the "proposed an extension to lseek()" link points to:

which currently just redirects to the 404 page: However, it looks like the original (2012-05-05) blog entry is now available here:

(I'm assuming it is the same blog entry -- same author, same timeframe, same subject -- corrections or confirmations welcome.)

SEEK_HOLE or FIEMAP?

Posted May 22, 2018 23:29 UTC (Tue) by lsl (subscriber, #86508) [Link]

I could curse at Oracle all day long for the huge list of links (to high-quality content) they've killed overnight. There are countless links to useful Sun-era documents in mailing list archives and newsgroups and Oracle just broke them all without thinking twice. If that's not evil, what is? But then, lawnmower etc.


Copyright © 2007, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds