RFC 3284
D. Korn
AT&T Labs
J. MacDonald
UC Berkeley
J. Mogul
Hewlett-Packard Company
K. Vo
AT&T Labs
June 2002
Abstract
This memo describes VCDIFF, a general, efficient and portable data
format suitable for encoding compressed and/or differencing data so
that they can be easily transported among computers.
Standards Track
1. Executive Summary
Compression and differencing techniques can greatly improve storage
and transmission of files and file versions. Since files are often
transported across machines with distinct architectures and
performance characteristics, such data should be encoded in a form
that is portable and can be decoded with little or no knowledge of
the encoders. This document describes Vcdiff, a compact portable
encoding format designed for these purposes.
Data differencing is the process of computing a compact and
invertible encoding of a "target file" given a "source file". Data
compression is similar, but without the use of source data. The UNIX
utilities diff, compress, and gzip are well-known examples of data
differencing and compression tools. For data differencing, the
computed encoding is called a "delta file", and for data compression,
it is called a "compressed file". Delta and compressed files are
good for storage and transmission as they are often smaller than the
originals.
Data differencing and data compression are traditionally treated as
distinct types of data processing. However, as shown in the Vdelta
technique by Korn and Vo [1], compression can be thought of as a
special case of differencing in which the source data is empty. The
basic idea is to unify the string parsing scheme used in the
Lempel-Ziv'77 (LZ77) style compressors [2] and the block-move
technique of Tichy [3]. Loosely speaking, this works as follows:
Decoding efficiency:
Except for secondary encoder issues, the decoding algorithm
runs in time proportional to the size of the target file and
uses space proportional to the maximal window size. Vcdiff
differs from more conventional compressors in that it uses only
byte-aligned data, thus avoiding bit-level operations. This
improves decoding speed at a slight cost in compression
efficiency.
The combined differencing and compression method is called "delta
compression" [14]. As this way of data processing treats compression
as a special case of differencing, we shall use the term "delta file"
to indicate the compressed output for both cases.
2. Conventions
The basic data unit is a byte. For portability, Vcdiff shall limit a
byte to its lower eight bits even on machines with larger bytes. The
bits in a byte are ordered from right to left so that the least
significant bit (LSB) has value 1, and the most significant bit
(MSB) has value 128.
For purposes of exposition in this document, we adopt the convention
that the LSB is numbered 0, and the MSB is numbered 7. Bit numbers
never appear in the encoded format itself.
Vcdiff encodes unsigned integer values using a portable,
variable-sized format (originally introduced in the Sfio library [7]). This
encoding treats an integer as a number in base 128. Then, each digit
in this representation is encoded in the lower seven bits of a byte.
Except for the least significant byte, other bytes have their most
significant bit turned on to indicate that there are still more
digits in the encoding. The two key properties of this integer
encoding that are beneficial to a data compression format are:
a. The encoding is portable among systems using 8-bit bytes, and
b. Small values are encoded compactly.
For example, consider the value 123456789, which can be represented
with four 7-bit digits whose values are 58, 111, 26, 21 in order from
most to least significant. Below is the 8-bit byte encoding of these
digits. Note that the MSBs of 58, 111 and 26 are on.
    +-------------------------------------------+
    | 10111010 | 11101111 | 10011010 | 00010101 |
    +-------------------------------------------+
      MSB+58     MSB+111    MSB+26     0+21
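
The integer encoding above can be sketched in C as follows. This is
an illustration only and not part of the format; the function names
vcd_put_uint() and vcd_get_uint() and the 64-bit value type are
assumptions made for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Longest possible encoding of a 64-bit value: ceil(64 / 7) = 10 bytes. */
#define VCD_VARINT_MAX 10

/* Encode v in base 128, most significant digit first. Every byte except
 * the last has its MSB set. Returns the number of bytes written to out,
 * which must have room for VCD_VARINT_MAX bytes. */
size_t vcd_put_uint(uint64_t v, uint8_t *out)
{
    uint8_t digits[VCD_VARINT_MAX];
    size_t n = 0;

    assert(out != NULL);

    /* Collect the 7-bit digits, least significant first. */
    do {
        digits[n++] = (uint8_t)(v & 0x7F);
        v >>= 7;
    } while (v != 0);

    /* Emit them in reverse order, flagging all but the final byte. */
    for (size_t i = 0; i < n; i++) {
        uint8_t d = digits[n - 1 - i];
        out[i] = (i + 1 < n) ? (uint8_t)(d | 0x80) : d;
    }
    return n;
}

/* Decode one integer from in[0..len). Returns the number of bytes
 * consumed, or 0 if the input is truncated or the value would not fit
 * in 64 bits. */
size_t vcd_get_uint(const uint8_t *in, size_t len, uint64_t *v)
{
    uint64_t acc = 0;

    for (size_t i = 0; i < len; i++) {
        if (acc >> 57)              /* a further shift by 7 would overflow */
            return 0;
        acc = (acc << 7) | (uint64_t)(in[i] & 0x7F);
        if ((in[i] & 0x80) == 0) {  /* MSB clear: this was the last digit */
            *v = acc;
            return i + 1;
        }
    }
    return 0;
}
```

Encoding 123456789 with this sketch produces the four bytes shown in
the figure (0xBA, 0xEF, 0x9A, 0x15), and values below 128 occupy a
single byte.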
Henceforth, the terms "byte" and "integer" will refer to a byte and
an unsigned integer as described.
Algorithms in the C language are occasionally exhibited to clarify
the descriptions. Such C code is meant for clarification only, and
is not part of the actual specification of the Vcdiff format.
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
"SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
document are to be interpreted as described in BCP 14, RFC 2119 [12].
3. Delta Instructions
A large target file is partitioned into non-overlapping sections
called "target windows". These target windows are processed
separately and sequentially based on their order in the target file.
A target window T, of length t, may be compared against some source
data segment S, of length s. By construction, this source data
segment S comes either from the source file, if one is used, or from
a part of the target file earlier than T. In this way, during
decoding, S is completely known when T is being decoded.
The choices of T, t, S and s are made by some window selection
algorithm, which can greatly affect the size of the encoding.
However, as seen later, these choices are encoded so that no
knowledge of the window selection algorithm is needed during
decoding.
Assume that S[j] represents the jth byte in S, and T[k] represents
the kth byte in T. Then, for the delta instructions, we treat the
data windows S and T as substrings of a superstring U, formed by
concatenating them like this:
S[0]S[1]...S[s-1]T[0]T[1]...T[t-1]
The "address" of a byte in S or T is referred to by its location in
U. For example, the address of T[k] is s+k.
The instructions to encode and direct the reconstruction of a target
window are called delta instructions. There are three types:
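ADD:  This instruction has two arguments, a size x and a sequence
      of x bytes to be copied.
COPY: This instruction has two arguments, a size x and an address
      p in the string U. The arguments specify the substring of U
      that must be copied. We shall assert that such a substring
      must be entirely contained in either S or T.
RUN:  This instruction has two arguments, a size x and a byte b,
      that will be repeated x times.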
Below are example source and target windows and the delta
instructions that encode the target window in terms of the source
window.
    a b c d e f g h i j k l m n o p
    a b c d w x y z e f g h e f g h e f g h e f g h z z z z

    COPY  4, 0
    ADD   4, w x y z
    COPY  4, 4
    COPY 12, 24
    RUN   4, z
Thus, the first letter a in the target window is at location 16 in
the superstring. Note that the fourth instruction, "COPY 12, 24",
copies data from T itself since address 24 is position 8 in T. This
instruction also shows that it is fine to overlap the data to be
copied with the data being copied from, as long as the latter starts
earlier. This enables efficient encoding of periodic sequences,
i.e., sequences with regularly repeated subsequences. The RUN
instruction is a compact way to encode a sequence repeating the same
byte even though such a sequence can be thought of as a periodic
sequence with period 1.
To reconstruct the target window, one simply processes one delta
instruction at a time and copies the data, either from the source
window or the target window being reconstructed, based on the type of
the instruction and the associated address, if any.
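
The reconstruction procedure can be sketched in C as below. This is
a minimal illustration that operates on already-parsed instructions.
The struct layout and the names vcd_inst and vcd_apply() are
assumptions of this example and do not reflect how instructions are
stored in a delta file. The check that a COPY starting in S stays
within S follows the requirement that a copied substring lie
entirely in S or in T.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum vcd_op { VCD_OP_ADD, VCD_OP_COPY, VCD_OP_RUN };

/* One parsed delta instruction (hypothetical in-memory form).
 *   ADD : size, data (size literal bytes)
 *   COPY: size, addr (address in the superstring U = S followed by T)
 *   RUN : size, data (data[0] is the byte to repeat)                 */
struct vcd_inst {
    enum vcd_op op;
    size_t size;
    size_t addr;
    const unsigned char *data;
};

/* Rebuild the target window T (capacity tcap) from the source window S
 * of length s. Returns the target length, or (size_t)-1 on a bad
 * instruction. */
size_t vcd_apply(const unsigned char *S, size_t s,
                 const struct vcd_inst *inst, size_t ninst,
                 unsigned char *T, size_t tcap)
{
    size_t t = 0;                       /* bytes of T produced so far */

    assert(T != NULL);
    for (size_t i = 0; i < ninst; i++) {
        const struct vcd_inst *in = &inst[i];

        if (in->size > tcap - t)
            return (size_t)-1;          /* would overflow the window */

        switch (in->op) {
        case VCD_OP_ADD:
            memcpy(T + t, in->data, in->size);
            t += in->size;
            break;
        case VCD_OP_RUN:
            memset(T + t, in->data[0], in->size);
            t += in->size;
            break;
        case VCD_OP_COPY:
            /* The copy must start at a byte that already exists in U. */
            if (in->addr >= s + t)
                return (size_t)-1;
            /* A copy that starts in S must not run past the end of S. */
            if (in->addr < s && in->size > s - in->addr)
                return (size_t)-1;
            /* Byte by byte, so a copy from T may overlap its own output. */
            for (size_t k = 0; k < in->size; k++) {
                size_t a = in->addr + k;
                T[t++] = (a < s) ? S[a] : T[a - s];
            }
            break;
        }
    }
    return t;
}
```

Applying the five instructions of the example to the 16-byte source
window reproduces the 28-byte target window. The byte-wise loop in
the COPY case is what allows "COPY 12, 24" to read bytes that the
same instruction has just written.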
4. Delta File Organization
    Header
        Header1                                  - byte
        Header2                                  - byte
        Header3                                  - byte
        Header4                                  - byte
        Hdr_Indicator                            - byte
        [Secondary compressor ID]                - byte
        [Length of code table data]              - integer
        [Code table data]
            Size of near cache                   - byte
            Size of same cache                   - byte
            Compressed code table data
    Window1
        Win_Indicator                            - byte
        [Source segment size]                    - integer
        [Source segment position]                - integer
        The delta encoding of the target window
            Length of the delta encoding         - integer
            The delta encoding
                Size of the target window        - integer
                Delta_Indicator                  - byte
                Length of data for ADDs and RUNs - integer
                Length of instructions and sizes - integer
                Length of addresses for COPYs    - integer
                Data section for ADDs and RUNs   - array of bytes
                Instructions and sizes section   - array of bytes
                Addresses section for COPYs      - array of bytes
    Window2
    ...

4.1 The Header Section
The Header section is organized as follows. Square brackets enclose
optional items.

    Header1                                  - byte = 0xD6
    Header2                                  - byte = 0xC3
    Header3                                  - byte = 0xC4
    Header4                                  - byte
    Hdr_Indicator                            - byte
    [Secondary compressor ID]                - byte
    [Length of code table data]              - integer
    [Code table data]

The first three Header bytes are the ASCII characters V, C and
D with their most significant bits turned on (in hexadecimal, the
values are 0xD6, 0xC3, and 0xC4). The fourth Header byte is
currently set to zero. In the future, it might be used to indicate
the version of Vcdiff.
The Hdr_Indicator byte shows if there is any initialization data
required to aid in the reconstruction of data in the Window sections.
This byte MAY have non-zero values for either, both, or neither of
the two bits VCD_DECOMPRESS and VCD_CODETABLE below:
     7 6 5 4 3 2 1 0
    +-+-+-+-+-+-+-+-+
    | | | | | | | | |
    +-+-+-+-+-+-+-+-+
                 ^ ^
                 | |
                 | +-- VCD_DECOMPRESS
                 +---- VCD_CODETABLE
If bit 0 (VCD_DECOMPRESS) is non-zero, this indicates that a
secondary compressor may have been used to further compress certain
parts of the delta encoding data as described in Sections 4.3 and 6.
In that case, the ID of the secondary compressor is given next. If
this bit is zero, the compressor ID byte is not included.
If bit 1 (VCD_CODETABLE) is non-zero, this indicates that an
application-defined code table is to be used for decoding the delta
instructions. This table itself is compressed. The length of the
data comprising this compressed code table and the data follow next.
Section 7 discusses application-defined code tables. If this bit is
zero, the code table data length and the code table data are not
included.
If both bits are set, then the compressor ID byte is included before
the code table data length and the code table data.
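
The header rules above can be sketched in C as follows. This is an
illustration only; the struct vcd_header, the function
vcd_parse_header() and the helper get_uint() are names assumed for
this example. The code table data is treated as an opaque byte range
and is not interpreted.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define VCD_DECOMPRESS 0x01     /* Hdr_Indicator bit 0 */
#define VCD_CODETABLE  0x02     /* Hdr_Indicator bit 1 */

struct vcd_header {
    uint8_t version;            /* Header4, currently zero */
    uint8_t indicator;          /* Hdr_Indicator */
    uint8_t compressor_id;      /* meaningful only if VCD_DECOMPRESS is set */
    uint64_t code_table_len;    /* meaningful only if VCD_CODETABLE is set */
    const uint8_t *code_table;  /* points into the input, or NULL */
};

/* Read one base-128 integer; returns bytes consumed, or 0 on error. */
static size_t get_uint(const uint8_t *p, size_t len, uint64_t *v)
{
    uint64_t acc = 0;

    for (size_t i = 0; i < len; i++) {
        if (acc >> 57)
            return 0;                   /* would overflow 64 bits */
        acc = (acc << 7) | (uint64_t)(p[i] & 0x7F);
        if ((p[i] & 0x80) == 0) {
            *v = acc;
            return i + 1;
        }
    }
    return 0;                           /* truncated */
}

/* Parse the Header section at buf[0..len). Returns the offset of the
 * first Window section, or 0 if the header is malformed or truncated. */
size_t vcd_parse_header(const uint8_t *buf, size_t len, struct vcd_header *h)
{
    size_t pos = 5;

    assert(buf != NULL && h != NULL);

    /* 'V', 'C', 'D' with the MSB set, then Header4 and Hdr_Indicator. */
    if (len < 5 || buf[0] != 0xD6 || buf[1] != 0xC3 || buf[2] != 0xC4)
        return 0;
    h->version = buf[3];
    h->indicator = buf[4];
    h->compressor_id = 0;
    h->code_table_len = 0;
    h->code_table = NULL;

    /* The compressor ID, if present, precedes the code table fields. */
    if (h->indicator & VCD_DECOMPRESS) {
        if (pos >= len)
            return 0;
        h->compressor_id = buf[pos++];
    }
    if (h->indicator & VCD_CODETABLE) {
        size_t n = get_uint(buf + pos, len - pos, &h->code_table_len);
        if (n == 0)
            return 0;
        pos += n;
        if (h->code_table_len > len - pos)
            return 0;
        h->code_table = buf + pos;
        pos += (size_t)h->code_table_len;
    }
    return pos;
}
```

For a delta file with neither bit set, the header is exactly five
bytes and the first Window section starts at offset 5.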
4.2 The Format of a Window Section
Each Window section is organized as follows:
    Win_Indicator                            - byte
    [Source segment length]                  - integer
    [Source segment position]                - integer
    The delta encoding of the target window
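
    Length of the delta encoding         - integer
    The delta encoding
        Size of the target window        - integer
        Delta_Indicator                  - byte
        Length of data for ADDs and RUNs - integer
        Length of instructions and sizes - integer
        Length of addresses for COPYs    - integer
        Data section for ADDs and RUNs   - array of bytes
        Instructions and sizes section   - array of bytes
        Addresses section for COPYs      - array of bytes
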
bit value 1.
bit value 2.
bit value 4.
Near modes: The "near modes" are in the range [2,s_near+1]. Let m
be the mode of the address encoding. The address was encoded
as the integer value "addr - near[m-2]".
Same modes: The "same modes" are in the range
[s_near+2,s_near+s_same+1]. Let m be the mode of the encoding.
The address was encoded as a single byte b such that "addr ==
same[(m - (s_near+2))*256 + b]".
5.4 Example code for encoding and decoding of COPY instruction addresses
We show example algorithms below to demonstrate the use of address
modes more clearly. The encoder has the freedom to choose address
modes; the sample addr_encode() algorithm merely shows one way of
picking the address mode. The decoding algorithm addr_decode() will
uniquely decode addresses, regardless of the encoder's choice of
algorithm.
Note that the address caches are updated immediately after an address
is encoded or decoded. In this way, the decoder is always
synchronized with the encoder.
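
The decoding side of the address modes can be sketched in C as
follows. This is a minimal illustration and not the sample code of
this document. The names vcd_cache, vcd_cache_init() and
vcd_addr_decode() are assumptions of this example, the cache sizes
are fixed to those of the default code table, and the caller is
assumed to have already read the encoded value (an integer, or a
single byte for the same modes). The sketch also assumes that
VCD_SELF encodes the address itself, that VCD_HERE encodes
"here - addr", and that the cache update fills the near cache
round-robin and indexes the same cache by "addr % (s_same*256)".

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define VCD_SELF 0
#define VCD_HERE 1
#define S_NEAR   4      /* near cache size in the default code table */
#define S_SAME   3      /* same cache size in the default code table */

struct vcd_cache {
    uint64_t near_tab[S_NEAR];          /* most recent addresses */
    uint64_t same_tab[S_SAME * 256];    /* addresses indexed by their hash */
    size_t next_slot;                   /* next near slot to overwrite */
};

void vcd_cache_init(struct vcd_cache *c)
{
    memset(c, 0, sizeof *c);
}

/* Record addr in both caches; called after every decoded address. */
static void cache_update(struct vcd_cache *c, uint64_t addr)
{
    c->near_tab[c->next_slot] = addr;
    c->next_slot = (c->next_slot + 1) % S_NEAR;
    c->same_tab[addr % (S_SAME * 256)] = addr;
}

/* Decode one COPY address.
 *   here  : current position in U (source length + target bytes so far)
 *   mode  : address mode taken from the instruction code
 *   value : integer read from the addresses section, or for a same
 *           mode the single byte b
 * Returns the decoded address and updates the caches. */
uint64_t vcd_addr_decode(struct vcd_cache *c, uint64_t here,
                         unsigned mode, uint64_t value)
{
    uint64_t addr;

    if (mode == VCD_SELF) {
        addr = value;
    } else if (mode == VCD_HERE) {
        addr = here - value;
    } else if (mode < 2 + S_NEAR) {
        /* Near mode m: the encoder stored addr - near[m-2]. */
        addr = c->near_tab[mode - 2] + value;
    } else {
        /* Same mode m: the encoder stored the byte b. */
        assert(mode < 2 + S_NEAR + S_SAME && value < 256);
        addr = c->same_tab[(mode - (2 + S_NEAR)) * 256 + value];
    }
    cache_update(c, addr);
    return addr;
}
```

Because the update runs after every decoded address, a decoder built
this way sees the same cache contents as an encoder that performs
the same update after each address it encodes.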
The fields of an instruction code entry have the following meanings:
inst: An "inst" field can have one of the four values: NOOP (0),
ADD (1), RUN (2) or COPY (3) to indicate the instruction
types. NOOP means that no instruction is specified. In
this case, both the corresponding size and mode fields will
be zero.
size: A "size" field is zero or positive. A value of zero means that
the size associated with the instruction is encoded
separately as an integer in the "Instructions and sizes
section" (Section 6). A positive value for "size" defines
the actual data size. Note that since the size is
restricted to a byte, the maximum value for any instruction
with size implicitly defined in the code table is 255.
mode: A "mode" field is significant only when the associated delta
instruction is a COPY. It defines the mode used to encode
the associated addresses. For other instructions, this is
always zero.
5.6 The Code Table
Following the discussions on address modes and instruction code
tables, we define a "Code Table" to have the data below:
s_near: the size of the near cache,
s_same: the size of the same cache,
i_code: the 256-entry instruction code table.
Vcdiff itself defines a "default code table" in which s_near is 4 and
s_same is 3. Thus, there are 9 address modes for a COPY instruction.
The first two are VCD_SELF (0) and VCD_HERE (1). Modes 2, 3, 4 and 5
are for addresses coded against the near cache. Modes 6, 7 and 8
are for addresses coded against the same cache.

         TYPE      SIZE     MODE    TYPE     SIZE     MODE     INDEX
        ---------------------------------------------------------------
     1.  RUN         0        0     NOOP       0        0        0
     2.  ADD    0, [1,17]     0     NOOP       0        0      [1,18]
     3.  COPY   0, [4,18]     0     NOOP       0        0     [19,34]
     4.  COPY   0, [4,18]     1     NOOP       0        0     [35,50]
     5.  COPY   0, [4,18]     2     NOOP       0        0     [51,66]
     6.  COPY   0, [4,18]     3     NOOP       0        0     [67,82]
     7.  COPY   0, [4,18]     4     NOOP       0        0     [83,98]
     8.  COPY   0, [4,18]     5     NOOP       0        0     [99,114]
     9.  COPY   0, [4,18]     6     NOOP       0        0    [115,130]
    10.  COPY   0, [4,18]     7     NOOP       0        0    [131,146]
    11.  COPY   0, [4,18]     8     NOOP       0        0    [147,162]
    12.  ADD       [1,4]      0     COPY     [4,6]      0    [163,174]
    13.  ADD       [1,4]      0     COPY     [4,6]      1    [175,186]
    14.  ADD       [1,4]      0     COPY     [4,6]      2    [187,198]
    15.  ADD       [1,4]      0     COPY     [4,6]      3    [199,210]
    16.  ADD       [1,4]      0     COPY     [4,6]      4    [211,222]
    17.  ADD       [1,4]      0     COPY     [4,6]      5    [223,234]
    18.  ADD       [1,4]      0     COPY       4        6    [235,238]
    19.  ADD       [1,4]      0     COPY       4        7    [239,242]
    20.  ADD       [1,4]      0     COPY       4        8    [243,246]
    21.  COPY        4      [0,8]   ADD        1        0    [247,255]
        ---------------------------------------------------------------

If a line in the depiction includes more than one entry using the
[i,j] notation, implying a "nested loop" to convert the line to a
range of table entries, the first such [i,j] range specifies the
outer loop, and the second specifies the inner loop.
The examples below should clarify the description above:
Line 1 shows the single RUN instruction with index 0. As the size
field is 0, this RUN instruction always has its actual size encoded
separately.
Line 2 shows the 18 single ADD instructions. The ADD instruction
with size field 0 (i.e., the actual size is coded separately) has
index 1. ADD instructions with sizes from 1 to 17 use code indices 2
to 18 and their sizes are as given (so they will not be separately
encoded).
Following the single ADD instructions are the single COPY
instructions ordered by their address encoding modes. For example,
line 11 shows the COPY instructions with mode 8, i.e., the last of
the same cache. In this case, the COPY instruction with size field 0
has index 147. Again, the actual size of this instruction will be
coded separately.
Lines 12 to 21 show the pairs of instructions that are combined
together. For example, line 12 depicts the 12 entries in which an
ADD instruction is combined with an immediately following COPY
instruction. The entries with indices 163, 164, 165 represent the
pairs in which the ADD instructions all have size 1, while the COPY
instructions have mode 0 (VCD_SELF) and sizes 4, 5 and 6
respectively.
The last line, line 21, shows the nine instruction pairs in which the
first instruction is a COPY and the second is an ADD. In this case,
all COPY instructions have size 4 with mode ranging from 0 to 8 and
all the ADD instructions have size 1. Thus, the entry with the
largest index 255 combines a COPY instruction of size 4 and mode 8
with an ADD instruction of size 1.
The choice of the minimum size 4 for COPY instructions in the default
code table was made from experiments that showed that excluding small
matches (less than 4 bytes long) improved the compression rates.
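
The construction of the default code table from its compact
depiction can be sketched in C as follows. This is an illustration
only; the struct vcd_code, its field names and the function
vcd_default_table() are assumptions of this example. Where a line
has two [i,j] ranges, the first range is the outer loop.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

enum { VCD_NOOP = 0, VCD_ADD = 1, VCD_RUN = 2, VCD_COPY = 3 };

/* One entry of the 256-entry instruction code table: up to two
 * instructions, each described by (inst, size, mode). */
struct vcd_code {
    uint8_t inst1, size1, mode1;
    uint8_t inst2, size2, mode2;
};

static void put(struct vcd_code *e, int i1, int s1, int m1,
                int i2, int s2, int m2)
{
    e->inst1 = (uint8_t)i1; e->size1 = (uint8_t)s1; e->mode1 = (uint8_t)m1;
    e->inst2 = (uint8_t)i2; e->size2 = (uint8_t)s2; e->mode2 = (uint8_t)m2;
}

/* Fill table[0..255] with the default code table. Returns the number
 * of entries written, which is always 256. */
int vcd_default_table(struct vcd_code table[256])
{
    int i = 0;

    memset(table, 0, 256 * sizeof table[0]);

    /* Line 1: a single RUN whose size is coded separately. */
    put(&table[i++], VCD_RUN, 0, 0, VCD_NOOP, 0, 0);

    /* Line 2: ADD with size 0 (coded separately), then sizes 1..17. */
    for (int size = 0; size <= 17; size++)
        put(&table[i++], VCD_ADD, size, 0, VCD_NOOP, 0, 0);

    /* Lines 3-11: for each mode, COPY with size 0, then sizes 4..18. */
    for (int mode = 0; mode <= 8; mode++) {
        put(&table[i++], VCD_COPY, 0, mode, VCD_NOOP, 0, 0);
        for (int size = 4; size <= 18; size++)
            put(&table[i++], VCD_COPY, size, mode, VCD_NOOP, 0, 0);
    }

    /* Lines 12-20: ADD of size 1..4 followed by COPY. The COPY size is
     * 4..6 for modes 0..5 and only 4 for modes 6..8. */
    for (int mode = 0; mode <= 8; mode++) {
        int max_copy = (mode <= 5) ? 6 : 4;
        for (int add = 1; add <= 4; add++)
            for (int copy = 4; copy <= max_copy; copy++)
                put(&table[i++], VCD_ADD, add, 0, VCD_COPY, copy, mode);
    }

    /* Line 21: COPY of size 4 in each mode, followed by ADD of size 1. */
    for (int mode = 0; mode <= 8; mode++)
        put(&table[i++], VCD_COPY, 4, mode, VCD_ADD, 1, 0);

    assert(i == 256);
    return i;
}
```

With this construction, entry 147 is the COPY of mode 8 with
separately coded size, entries 163 to 165 pair an ADD of size 1 with
COPYs of sizes 4, 5 and 6 in mode 0, and entry 255 pairs a COPY of
size 4 and mode 8 with an ADD of size 1.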

    Size of near cache            - byte
    Size of same cache            - byte
    Compressed code table data

The "compressed code table data" encodes the delta between the
default code table (source) and the new code table (target) in the
same manner as described in Section 4.3 for encoding a target window
in terms of a source window. This delta is computed using the
following steps:
The last three rows in the column gcc-2.95.2 show that when two file
versions are very similar, differencing can give dramatically good
compression rates. Vcdiff-d and Vcdiff-dc use the same simple window
selection method of aligning by file offsets, but Vcdiff-dc also does
compression so its output is slightly smaller. Vcdiff-dcw uses a
content-based algorithm to search for source data that likely will
match a given target window. Although it does a good job, the
algorithm does not always find the best matches, which in this case,
are given by the simple algorithm of Vcdiff-d. As a result, the
output size for Vcdiff-dcw is slightly larger.
The situation is reversed in the gcc-2.95.3 column. Here, the files
and their contents were sufficiently rearranged or changed between
the making of the gcc-2.95.3.tar archive and the gcc-2.95.2 archive
so that the simple method of aligning windows by file offsets no
longer works. As a result, Vcdiff-d and Vcdiff-dc do not perform
well. By allowing compression, along with differencing, Vcdiff-dc
manages to beat Vcdiff-c, which does compression only. The
content-based window matching algorithm in Vcdiff-dcw is effective in
matching the right source and target windows so that Vcdiff-dcw is
the overall winner.
9. Further Issues
This document does not address a few issues:
Secondary compressors:
As discussed in Section 4.3, certain sections in the delta
encoding of a window may be further compressed by a secondary
compressor. In our experience, the basic Vcdiff format is
adequate for most purposes. In particular, for normal use of data
differencing, where the files to be compared have long stretches
of matches, much of the gain in compression rate is already
achieved by normal string matching, so secondary compressors are
seldom needed. However, for applications beyond differencing of
such nearly identical files, secondary compressors may be needed
to achieve maximal compression.
Therefore, we recommend leaving the Vcdiff data format defined as
in this document so that the use of secondary compressors can be
implemented when they become needed in the future. The formats of
the compressed data via such compressors or any compressors that
may be defined in the future are left open to their
implementations. These could include Huffman encoding, arithmetic
encoding, and splay tree encoding [8,9].
Summary
[2]
[3]
[4]
[5]
[6]
[7]  D.G. Korn, K.P. Vo, Sfio: A buffered I/O Library, Proc. of the
     Summer 91 Usenix Conference, 1991.
[8]
[9]  M. Nelson, J. Gailly, The Data Compression Book, ISBN
     1-55851-434-1, M&T Books, New York, NY, 1995.
David G. Korn
AT&T Labs, Room D237
180 Park Avenue
Florham Park, NJ 07932
Phone: 1 973 360 8602
EMail: [email protected]
Jeffrey C. Mogul
Western Research Laboratory
Hewlett-Packard Company
1501 Page Mill Road, MS 1251
Palo Alto, California, 94304, U.S.A.
Phone: 1 650 857 2206 (email preferred)
EMail: [email protected]
Joshua P. MacDonald
Computer Science Division
University of California, Berkeley
345 Soda Hall
Berkeley, CA 94720
EMail: [email protected]