XML Diff Survey
Daniel Ehrenberg
Introduction
XHTML documents cannot be accurately compared and merged by simple
line-by-line tools like diff3 because the tree structure may be compromised in both identifying
the differences and performing a semi-automatic merge, and programs like Tidy are
insufficient to clean this up. So we need a different method to compare these documents.
Several algorithms exist which may be suitable to diff HTML documents. Some
algorithms, such as the Zhang-Shasha algorithm, analyze two trees to find their minimal
edit distance. Other algorithms, such as XyDiff, find a non-optimal edit distance between
trees using some kind of heuristic, but incorporate a broader definition of edit operations,
including moves. This tree view may, however, be inappropriate for HTML, as many
elements do not induce a tree-like structure semantically, such as inline formatting
elements (like <em>). DaisyDiff presents a solution to this problem.
Several types of merging are possible. One strategy is to simply run the edit scripts
of both versions one after the other, but this can cause inconsistencies in the output. Instead, either
operational transformation or a diff3-like algorithm should be used to merge edit scripts.
Alternatively, Lindholm’s merge algorithm can be used directly, without using an explicit
edit script.
A warning: All of the algorithms are fairly difficult to understand. I don’t
understand all of them; it took me months to figure out the Zhang-Shasha algorithm. You
don’t need to understand the details of each algorithm to roughly evaluate its advantages
and disadvantages.
The Problem
The obvious solution to the problem is to just use diff3 to merge the differences.
To allow diff3 to see the differences better, the tags are all split onto their own lines in a
normalization pass. But this ignores the (superficial) tree structure of XML. Consider
the following example:
Original      Part bolded      Part italicized      Line-by-line merge
<p>           <p>              <p>                  <p>
This is       <b>              This is              <b>
some          This is          <i>                  This is
text          some             some                 <i>
</p>          </b>             text                 some
              text             </i>                 </b>
              </p>             </p>                 text
                                                    </i>
                                                    </p>
In this case, the Tidy program could fix the generated HTML. But in other cases,
things get more complicated and difficult to fix up. Additionally, in a system like a
WYSIWYG editor, where the user is not exposed to the HTML itself, the diff would
indicate that, in the “part bolded” variant, the lines “<b>” and “</b>” are added. But it
should give the information that “This is some” was previously not bold and it became
bold.
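To make the breakage in the line-by-line merge concrete, here is a minimal sketch of a nesting check run over the variants from the example. The function name first_nesting_error and the regular expression are my own illustration, not part of diff3 or Tidy, and the check assumes simple tags with no self-closing elements or comments.

```python
import re

# Matches an opening or closing tag and captures its name (illustrative only;
# self-closing tags, comments and CDATA are deliberately not handled).
TAG = re.compile(r"</?([A-Za-z][A-Za-z0-9]*)[^>]*>")

def first_nesting_error(lines):
    """Return a message for the first mis-nested tag, or None if tags nest properly."""
    stack = []
    for number, line in enumerate(lines, start=1):
        for match in TAG.finditer(line):
            name = match.group(1)
            if match.group(0).startswith("</"):
                # A closing tag must match the most recently opened element.
                if not stack or stack[-1] != name:
                    expected = stack[-1] if stack else "nothing"
                    return f"line {number}: </{name}> closes while <{expected}> is open"
                stack.pop()
            else:
                stack.append(name)
    if stack:
        return f"unclosed <{stack[-1]}>"
    return None

part_bolded = ["<p>", "<b>", "This is", "some", "</b>", "text", "</p>"]
part_italicized = ["<p>", "This is", "<i>", "some", "text", "</i>", "</p>"]
line_merge = ["<p>", "<b>", "This is", "<i>", "some", "</b>", "text", "</i>", "</p>"]

print(first_nesting_error(part_bolded))      # None
print(first_nesting_error(part_italicized))  # None
print(first_nesting_error(line_merge))       # line 6: </b> closes while <i> is open
```

Both modified versions are well nested on their own; only the line-by-line combination of them is not.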
Additionally, it’d be nice if we could track moves. If someone, say, swaps two
paragraphs, it’d be nice if that could be tracked and reported rather than explained as a
deletion and insertion. But this isn’t absolutely necessary.
Some definitions
An XML document can be viewed as an ordered tree, where each node has an
unbounded number of children and each internal node has a label. So we can solve these
problems of diffing and merging XML by solving a more general problem on ordered
trees, as most authors have. There are some specific aspects of XML which deserve
mention (attributes, which are guaranteed to have unique names within a node; IDs,
which are guaranteed to be unique within a document), but these are minor aspects which
we can ignore most of the time.
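The ordered-tree view can be sketched in a few lines using Python's standard xml.etree.ElementTree parser. The (label, children) tuple representation and the "#text" label for text leaves are assumptions made for illustration; attributes and IDs are ignored here, as suggested above.

```python
import xml.etree.ElementTree as ET

def to_ordered_tree(element):
    """Convert an Element into a (label, children) tuple, keeping child order.

    Text runs become leaves of the form ("#text", string); this leaf shape is
    an illustrative choice, not a standard representation.
    """
    children = []
    if element.text and element.text.strip():
        children.append(("#text", element.text.strip()))
    for child in element:
        children.append(to_ordered_tree(child))
        # Text following a child element belongs to the parent, after that child.
        if child.tail and child.tail.strip():
            children.append(("#text", child.tail.strip()))
    return (element.tag, children)

root = ET.fromstring("<p><b>This is some</b> text</p>")
tree = to_ordered_tree(root)
print(tree)
# ('p', [('b', [('#text', 'This is some')]), ('#text', 'text')])
```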
To avoid confusion, I'll define some terms I've been using or will soon start using.
When I talk about "diffing", what I mean, formally, is generating an "edit script", or list
of changes between two documents that can be used to get the modified document from
the original. Sometimes, these edit scripts are invertible, but not always. When I talk
about a "merge", I mean a way to reconcile the changes between documents to
incorporate both of these changes. A merge can be an operation on edit scripts or it can be
done directly on a tree matching. A "matching" is a set of correspondences between nodes
in different trees; it is the basis for doing either a diff or a merge, and it's difficult to do
efficiently.
The idea of diffing trees is closely related to that of comparing strings. It could be
thought of as a generalization. For strings, there is an O(n^2) solution using dynamic
programming [1] and there is also an O(nd) algorithm, where d is the edit distance,
discovered by Eugene Myers, using more advanced techniques [18]. The O(nd)
algorithm is sometimes referred to as the GNU diff algorithm, since GNU diff is based on it. For
merging, the GNU diff3 program presents a useful model which has been formally
analyzed [19].
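As a reminder of what the quadratic dynamic programming solution for strings looks like, here is a minimal sketch. It assumes unit costs for insertion, deletion and substitution, which is one common definition of edit distance and not necessarily the one used in [1]; the function name edit_distance is mine.

```python
def edit_distance(a, b):
    """Edit distance between sequences a and b in O(len(a) * len(b)) time.

    Assumes unit cost for insert, delete and substitute. Only two rows of
    the dynamic programming table are kept in memory at a time.
    """
    previous = list(range(len(b) + 1))  # distance from empty prefix of a
    for i, x in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to the empty prefix of b
        for j, y in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,             # delete x
                current[j - 1] + 1,          # insert y
                previous[j - 1] + (x != y),  # keep (cost 0) or substitute (cost 1)
            ))
        previous = current
    return previous[-1]

print(edit_distance("kitten", "sitting"))  # 3
```

The same function works on lists of lines, which is closer to how a line-by-line diff treats a file.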
Three-document merge
Creating an edit script is all well and good, but it's only half of the problem; the other
half is the merge. Remember that a three-document merge is one where we have the original
document and two modified versions, and we want to create a fourth version with both
modifications together. Here was my idea: create an edit script for both modified versions
with respect to the original, then do one followed by another, with repeated modifications
done only once. We know there's a conflict if the order matters, in terms of which comes
first in applying to the original document.
But this will come up with more conflicts than actually exist. For example, say
some node A has four children, B C D and E. In one change, we insert a new node X after
B as a child of A, and in another change, we insert a node Y after D as a child of A. So a
sensible merge would have A's new children be B X C D Y E, in that order. But with the
model described above, there would be an edit conflict!
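The spurious conflict can be reproduced with a small sketch. It assumes that each edit is an insertion addressed by its position in A's child list, written as a hypothetical (index, label) pair; a real edit script format would be richer than this.

```python
def apply_insert(children, op):
    """Return a copy of the child list with op = (index, label) applied."""
    index, label = op
    result = list(children)
    result.insert(index, label)
    return result

def naive_merge(children, op_a, op_b):
    """Apply both edits in both orders; report a conflict (None) if order matters."""
    a_then_b = apply_insert(apply_insert(children, op_a), op_b)
    b_then_a = apply_insert(apply_insert(children, op_b), op_a)
    if a_then_b != b_then_a:
        return None  # the order matters, so this model calls it a conflict
    return a_then_b

children_of_a = ["B", "C", "D", "E"]
insert_x = (1, "X")  # X after B
insert_y = (3, "Y")  # Y after D

print(apply_insert(apply_insert(children_of_a, insert_x), insert_y))
# ['B', 'X', 'C', 'Y', 'D', 'E']
print(apply_insert(apply_insert(children_of_a, insert_y), insert_x))
# ['B', 'X', 'C', 'D', 'Y', 'E']
print(naive_merge(children_of_a, insert_x, insert_y))  # None
```

Inserting X first shifts D one position to the right, so the stale index for Y lands before D instead of after it.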
One solution to this is the more general strategy of operational transformation [8].
The basic idea for this technique as applied here is that, if we insert Y after inserting X,
we have to add 1 to the index at which Y is inserted. If, on the other hand, Y is inserted
first, we don't have to add one to the index at which X is inserted. This technique leads to
fewer conflicting merges, or in OT lingo, it converges in more cases. An operational
transformation must satisfy a few formal properties, and the best-known algorithms have
only recently been proven to satisfy them. Pascal Molli used operational transformation,
together with Cobéna's diff algorithm and format, in his So6 synchronization framework
[9].
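The index adjustment described above can be sketched for the insert-only case. This is a toy illustration under my own assumptions, not the transformation used in So6 or in any published OT algorithm: edits are hypothetical (index, label) pairs on one child list, and ties at the same index are broken by comparing labels, which is an arbitrary rule chosen here so that both orders agree.

```python
def apply_insert(children, op):
    """Return a copy of the child list with op = (index, label) applied."""
    index, label = op
    result = list(children)
    result.insert(index, label)
    return result

def transform(op, applied):
    """Adjust op so that it can run after `applied` has already been executed.

    If `applied` inserted at an earlier position (or at the same position with
    a smaller label, an assumed tie-break), op's target has moved right by one.
    """
    index, label = op
    if applied < op:  # tuple comparison: index first, then label
        return (index + 1, label)
    return op

children_of_a = ["B", "C", "D", "E"]
insert_x = (1, "X")  # X after B
insert_y = (3, "Y")  # Y after D

# X first, then Y transformed against X (its index becomes 4).
x_then_y = apply_insert(apply_insert(children_of_a, insert_x),
                        transform(insert_y, insert_x))
# Y first, then X transformed against Y (its index is unchanged).
y_then_x = apply_insert(apply_insert(children_of_a, insert_y),
                        transform(insert_x, insert_y))

print(x_then_y)  # ['B', 'X', 'C', 'D', 'Y', 'E']
print(y_then_x)  # ['B', 'X', 'C', 'D', 'Y', 'E']
```

Both orders now give B X C D Y E, so the merge converges. Deletions, moves and edits in different parts of the tree would each need their own transformation rules.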
Tancred Lindholm went a different route altogether in creating a three-way merge,
throwing out the edit script and basing it on a tree matching [10]. He based the merge
definition on several large, by-hand merges of realistic XHTML and other XML
documents, figuring out how to automate the results. But the algorithm isn’t perfect; it,
like all of the other ones mentioned so far, cannot properly handle the motivating
example at the beginning. Unfortunately, I don’t understand the algorithm.