Algorithmica (1986) 1:251-266
© 1986 Springer-Verlag New York Inc.

An O(ND) Difference Algorithm and Its Variations¹


Eugene W. Myers 2

Abstract. The problems of finding a longest common subsequence of two sequences A and B and a shortest edit script for transforming A into B have long been known to be dual problems. In this paper, they are shown to be equivalent to finding a shortest/longest path in an edit graph. Using this perspective, a simple O(ND) time and space algorithm is developed where N is the sum of the lengths of A and B and D is the size of the minimum edit script for A and B. The algorithm performs well when differences are small (sequences are similar) and is consequently fast in typical applications. The algorithm is shown to have O(N + D²) expected-time performance under a basic stochastic model. A refinement of the algorithm requires only O(N) space, and the use of suffix trees leads to an O(N log N + D²) time variation.

Key Words. Longest common subsequence, Shortest edit script, Edit graph, File comparison.

1. Introduction. The problem of determining the differences between two sequences of symbols has been studied extensively [1, 8, 11, 13, 16, 19, 20]. Algorithms for the problem have numerous applications, including spelling correction systems, file comparison tools, and the study of genetic evolution [4, 5, 17, 18]. Formally, the problem statement is to find a longest common subsequence or, equivalently, to find the minimum "script" of symbol deletions and insertions that transforms one sequence into the other. One of the earliest algorithms is by Wagner and Fischer [20] and takes O(N²) time and space to solve a generalization they call the string-to-string correction problem. A later refinement by Hirschberg [7] delivers a longest common subsequence using only linear space. When algorithms are over arbitrary alphabets, use "equal-unequal" comparisons, and are characterized in terms of the size of their input, it has been shown that Ω(N²) time is necessary [1]. A "Four Russians" approach leads to slightly better O(N² log log N/log N) and O(N²/log N) time algorithms for arbitrary and finite alphabets, respectively [13]. The existence of faster algorithms using other comparison formats is still open. Indeed, for algorithms that use "less-than-equal-greater-than" comparisons, Ω(N log N) time is the best lower bound known [9].
Recent work improves upon the basic O(N²) time arbitrary alphabet algorithm by being sensitive to other problem size parameters. Let the output parameter L be the length of a longest common subsequence and let the dual parameter D = 2(N − L) be the length of the shortest edit script. (It is assumed throughout this introduction that both strings have the same length N.) The two best output-sensitive algorithms are by Hirschberg [8] and take O(NL + N lg N) and O(DL log N) time. An algorithm by Hunt and Szymanski [11] takes O((R + N) log N) time where the parameter R is the total number of ordered pairs of positions at which the two input strings match. Note that all these algorithms are Ω(N²) or worse in terms of N alone.

¹ This work was supported in part by the National Science Foundation under Grant MCS82-10096.
² Department of Computer Science, University of Arizona, Tucson, AZ 85721, USA.
Received June 11, 1985; revised January 17, 1986. Communicated by David Dobkin.
In practical situations, it is usually the parameter D that is small. Programmers wish to know how they have altered a text file. Biologists wish to know how one DNA strand has mutated into another. For these situations, an O(ND) time algorithm is superior to Hirschberg's algorithms because L is O(N) when D is small. Furthermore, the approach of Hunt and Szymanski [11] is predicated on the hypothesis that R is small in practice. While this is frequently true, it must be noted that R has no correlation with either the size of the input or the size of the output and can be O(N²) in many situations. For example, if 10% of all lines in a file are blank and the file is compared against itself, R is greater than 0.01N². For DNA molecules, the alphabet size is four, implying that R is at least 0.25N² when an arbitrary molecule is compared against itself or a very similar molecule.
In this paper an O(ND) time algorithm is presented. Our algorithm is simple and based on an intuitive edit graph formalism. Unlike others it employs the "greedy" design paradigm and exposes the relationship of the longest common subsequence problem to the single-source shortest path problem. Another O(ND) algorithm has been presented elsewhere [16]. However, it uses a different design paradigm and does not share the following features. The algorithm can be refined to use only linear space, and its expected-case time behavior is shown to be O(N + D²). Moreover, the method admits an O(N log N + D²) time worst-case variation. This is asymptotically superior to previous algorithms [8, 16, 20] when D is o(N).
With the exception of the O(N log N + D²) worst-case variation, the algorithms presented in this paper are practical. The basic O(ND) algorithm served as the basis for a new implementation of the UNIX diff program [15]. The version usually runs two to four times faster than the System 5 implementation based on the Hunt and Szymanski algorithm [10]. However, there are cases when D is large where their algorithm is superior (e.g., for files that are completely different, R = 0 and D = 2N). The linear space refinement is roughly twice as slow as the basic O(ND) algorithm but still competitive because it can perform extremely large compares that are out of the range of other algorithms. For instance, two 1.5 million byte sequences were compared in less than two minutes (on a VAX 785 running 4.2 BSD UNIX) even though the difference was greater than 500.

2. Edit Graphs. Let A = a_1 a_2 … a_N and B = b_1 b_2 … b_M be sequences of length N and M, respectively. The edit graph for A and B has a vertex at each point in the grid (x, y), x ∈ [0, N] and y ∈ [0, M]. The vertices of the edit graph are connected by horizontal, vertical, and diagonal directed edges to form a directed acyclic graph. Horizontal edges connect each vertex to its right neighbor, i.e., (x−1, y) → (x, y) for x ∈ [1, N] and y ∈ [0, M]. Vertical edges connect each vertex to the neighbor below it, i.e., (x, y−1) → (x, y) for x ∈ [0, N] and y ∈ [1, M]. If a_x = b_y then there is a diagonal edge connecting vertex (x−1, y−1) to vertex (x, y). The points (x, y) for which a_x = b_y are called match points. The total number of match points between A and B is the parameter R characterizing the Hunt and Szymanski algorithm [11]. It is also the number of diagonal edges in the edit graph, as diagonal edges are in one-to-one correspondence with match points.
Figure 1 depicts the edit graph for the sequences A = abcabba and B = cbabac.
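As a small illustration (not from the paper; the helper name is hypothetical), the match points, and hence R, can be enumerated directly:

```python
def match_points(A, B):
    """All 1-based points (x, y) with a_x = b_y: the diagonal edges of the edit graph."""
    return [(x, y)
            for x in range(1, len(A) + 1)
            for y in range(1, len(B) + 1)
            if A[x - 1] == B[y - 1]]
```

For the sequences of Figure 1, A = abcabba and B = cbabac, this yields R = 14 match points.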
A trace of length L is a sequence of L match points, (x_1, y_1)(x_2, y_2) … (x_L, y_L), such that x_i < x_{i+1} and y_i < y_{i+1} for successive points (x_i, y_i) and (x_{i+1}, y_{i+1}), i ∈ [1, L−1]. Every trace is in exact correspondence with the diagonal edges of a path in the edit graph from (0, 0) to (N, M). The sequence of match points visited in traversing a path from start to finish is easily verified to be a trace. Note that L is the number of diagonal edges in the corresponding path. To construct a path from a trace, take the sequence of diagonal edges corresponding to the match points of the trace and connect successive diagonals with a series of horizontal and vertical edges. This can always be done as x_i < x_{i+1} and y_i < y_{i+1}
for successive match points. Note that several paths differing only in their nondiagonal edges can correspond to a given trace. Figure 1 illustrates this relation between paths and traces.

[Figure 1: the edit graph for A = abcabba and B = cbabac. A highlighted PATH is shown with TRACE = (3,1) (4,3) (5,4) (7,5), COMMON SUBSEQUENCE = CABA = a_3 a_4 a_5 a_7 = b_1 b_3 b_4 b_5, and EDIT SCRIPT = 1D, 2D, 3I b, 6D, 7I c.]
Fig. 1. An edit graph.
A subsequence of a string is any string obtained by deleting zero or more symbols from the given string. A common subsequence of two strings, A and B, is a subsequence of both. Each trace gives rise to a common subsequence of A and B and vice versa. Specifically, a_{x_1} a_{x_2} … a_{x_L} = b_{y_1} b_{y_2} … b_{y_L} is a common subsequence of A and B if and only if (x_1, y_1)(x_2, y_2) … (x_L, y_L) is a trace of A and B.
An edit script for A and B is a set of insertion and deletion commands that transform A into B. The delete command "x D" deletes the symbol a_x from A. The insert command "x I b_1, b_2, …, b_t" inserts the sequence of symbols b_1 … b_t immediately after a_x. Script commands refer to symbol positions within A before any commands have been performed. One must think of the set of commands in a script as being executed simultaneously. The length of a script is the number of symbols inserted and deleted.
Every trace corresponds uniquely to an edit script. Let (x_1, y_1)(x_2, y_2) … (x_L, y_L) be a trace. Let y_0 = 0 and y_{L+1} = M + 1. The associated script consists of the commands: "x D" for x ∉ {x_1, x_2, …, x_L}, and "x_k I b_{y_k+1}, …, b_{y_{k+1}−1}" for k such that y_k + 1 < y_{k+1}. The script deletes N − L symbols and inserts M − L symbols. So for every trace of length L there is a corresponding script of length D = N + M − 2L. To map an edit script to a trace, simply perform all delete commands on A, observe that the result is a common subsequence of A and B, and then map the subsequence to its unique trace. Note that inverting the action of the insert commands gives a set of delete commands that map B to the same common subsequence.
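This mapping from a trace to its script is mechanical; the following sketch (a hypothetical helper of ours, taking 1-based match points) makes it concrete:

```python
def trace_to_script(trace, A, B):
    """Map a trace of 1-based match points to an edit script, per Section 2."""
    N, M = len(A), len(B)
    xs = {x for x, _ in trace}
    # "x D" for every position of A not covered by the trace.
    deletes = ["%dD" % x for x in range(1, N + 1) if x not in xs]
    inserts = []
    # Sentinels y_0 = 0 and y_{L+1} = M + 1 bracket the gaps in B.
    pts = [(0, 0)] + list(trace) + [(N, M + 1)]
    for (xk, yk), (_, yk1) in zip(pts, pts[1:]):
        if yk + 1 < yk1:
            # Insert b_{y_k+1} ... b_{y_{k+1}-1} after a_{x_k}.
            inserts.append("%dI %s" % (xk, ",".join(B[yk:yk1 - 1])))
    return deletes + inserts
```

On the trace of Figure 1 this yields the script 1D, 2D, 6D, 3I b, 7I c, of length D = N + M − 2L = 5.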
Common subsequences, edit scripts, traces, and paths from (0, 0) to (N, M) in the edit graph are all isomorphic formalisms. The edges of every path have the following direct interpretations in terms of the corresponding common subsequence and edit script. Each diagonal edge ending at (x, y) gives a symbol, a_x (= b_y), in the common subsequence; each horizontal edge to point (x, y) corresponds to the delete command "x D"; and a sequence of vertical edges from (x, y) to (x, z) corresponds to the insert command "x I b_{y+1}, …, b_z". Thus the number of vertical and horizontal edges in the path is the length of its corresponding script, the number of diagonal edges is the length of its corresponding subsequence, and the total number of edges is N + M − L. Figure 1 illustrates these observations.
The problem of finding a longest common subsequence (LCS) is equivalent to finding a path from (0, 0) to (N, M) with the maximum number of diagonal edges. The problem of finding a shortest edit script (SES) is equivalent to finding a path from (0, 0) to (N, M) with the minimum number of nondiagonal edges. These are dual problems, as a path with the maximum number of diagonal edges has the minimal number of nondiagonal edges (D + 2L = M + N). Consider adding a weight or cost to every edge. Give diagonal edges weight 0 and nondiagonal edges weight 1. The LCS/SES problem is equivalent to finding a minimum-cost path from (0, 0) to (N, M) in the weighted edit graph and is thus a special instance of the single-source shortest path problem.

3. An O((M+N)D) Greedy Algorithm. The problem of finding a shortest edit script reduces to finding a path from (0, 0) to (N, M) with the fewest horizontal and vertical edges. Let a D-path be a path starting at (0, 0) that has exactly D nondiagonal edges. A 0-path must consist solely of diagonal edges. By a simple induction, it follows that a D-path must consist of a (D−1)-path followed by a nondiagonal edge and then a possibly empty sequence of diagonal edges called a snake.
Number the diagonals in the grid of edit graph vertices so that diagonal k consists of the points (x, y) for which x − y = k. With this definition the diagonals are numbered from −M to N. Note that a vertical (horizontal) edge with start point on diagonal k has end point on diagonal k−1 (k+1) and a snake remains on the diagonal in which it starts.

LEMMA 1. A D-path must end on diagonal k ∈ {−D, −D+2, …, D−2, D}.

PROOF. A 0-path consists solely of diagonal edges and starts on diagonal 0. Hence it must end on diagonal 0. Assume inductively that a D-path must end on diagonal k in {−D, −D+2, …, D−2, D}. Every (D+1)-path consists of a prefix D-path, ending on say diagonal k, a nondiagonal edge ending on diagonal k+1 or k−1, and a snake that must also end on diagonal k+1 or k−1. It then follows that every (D+1)-path must end on a diagonal in {(−D) ± 1, (−D+2) ± 1, …, (D−2) ± 1, D ± 1} = {−D−1, −D+1, …, D−1, D+1}. Thus the result holds by induction. □

The lemma implies that D-paths end solely on odd diagonals when D is odd
and even diagonals when D is even.
A D-path is furthest reaching in diagonal k if and only if it is one of the
D-paths ending on diagonal k whose endpoint has the greatest possible row
(column) number of all such paths. Informally, of all D-paths ending in diagonal
k, it ends furthest from the origin, (0, 0). The following lemma gives an inductive
characterization of furthest reaching D-paths and embodies a greedy principle:
furthest reaching D-paths are obtained by greedily extending furthest reaching
(D - 1)-paths.

LEMMA 2. A furthest reaching 0-path ends at (x, x), where x is min(z − 1 | a_z ≠ b_z or z > M or z > N). A furthest reaching D-path on diagonal k can without loss of generality be decomposed into a furthest reaching (D−1)-path on diagonal k−1, followed by a horizontal edge, followed by the longest possible snake, or it may be decomposed into a furthest reaching (D−1)-path on diagonal k+1, followed by a vertical edge, followed by the longest possible snake.

PROOF. The basis for 0-paths is straightforward. As noted before, a D-path consists of a (D−1)-path, a nondiagonal edge, and a snake. If the D-path ends on diagonal k, it follows that the (D−1)-path must end on diagonal k+1 or k−1 depending on whether a vertical or horizontal edge precedes the final snake. The final snake must be maximal, as the D-path would not be furthest reaching if the snake could be extended. Suppose that the (D−1)-path is not furthest reaching in its diagonal. But then a further reaching (D−1)-path can be connected to the final snake of the D-path with an appropriate nondiagonal move. Thus the D-path can always be decomposed as desired. □

Given the end points of the furthest reaching (D−1)-paths in diagonals k+1 and k−1, say (x′, y′) and (x″, y″), respectively, Lemma 2 gives a procedure for computing the endpoint of the furthest reaching D-path in diagonal k. Namely, take the further reaching of (x′, y′+1) and (x″+1, y″) in diagonal k and then follow diagonal edges until it is no longer possible to do so or until the boundary of the edit graph is reached. Furthermore, by Lemma 1 there are only D+1 diagonals in which a D-path can end. This suggests computing the endpoints of D-paths in the relevant D+1 diagonals for successively increasing values of D until the furthest reaching path in diagonal N − M reaches (N, M).

For D ← 0 to M+N Do
    For k ← −D to D in steps of 2 Do
        Find the endpoint of the furthest reaching D-path in diagonal k.
        If (N, M) is the endpoint Then
            The D-path is an optimal solution.
            Stop

The outline above stops when the smallest D is encountered for which there is a furthest reaching D-path to (N, M). This must happen before the outer loop terminates because D must be less than or equal to M + N. By construction this path must be minimal with respect to the number of nondiagonal edges within it. Hence it is a solution to the LCS/SES problem.
In presenting the detailed algorithm in Figure 2, a number of simple optimizations are employed. An array, V, contains the endpoints of the furthest reaching D-paths in elements V[−D], V[−D+2], …, V[D−2], V[D]. By Lemma 1 this set of elements is disjoint from those where the endpoints of the (D+1)-paths will be stored in the next iteration of the outer loop. Thus the array V can simultaneously hold the endpoints of the D-paths while the (D+1)-path endpoints are being computed from them. Furthermore, to record an endpoint (x, y) in diagonal k it suffices to retain just x because y is known to be x − k. Consequently, V is an array of integers where V[k] contains the row index of the endpoint of a furthest reaching path in diagonal k.
As a practical matter the algorithm searches D-paths where D ≤ MAX and if no such path reaches (N, M) then it reports that any edit script for A and B must be longer than MAX in Line 14. By setting the constant MAX to M + N as in the outline above, the algorithm is guaranteed to find the length of the LCS/SES. Figure 3 illustrates the D-paths searched when the algorithm is applied to the example of Figure 1. Note that a fictitious endpoint, (0, −1), set up in Line 1 of the algorithm is used to find the endpoint of the furthest reaching 0-path.
Also note that D-paths extend off the left and lower boundaries of the edit graph proper as the algorithm progresses. This boundary situation is correctly handled by assuming that there are no diagonal edges in this region.

Constant MAX ∈ [0, M+N]
Var V: Array [−MAX .. MAX] of Integer
1.  V[1] ← 0
2.  For D ← 0 to MAX Do
3.      For k ← −D to D in steps of 2 Do
4.          If k = −D or k ≠ D and V[k−1] < V[k+1] Then
5.              x ← V[k+1]
6.          Else
7.              x ← V[k−1] + 1
8.          y ← x − k
9.          While x < N and y < M and a_{x+1} = b_{y+1} Do (x, y) ← (x+1, y+1)
10.         V[k] ← x
11.         If x ≥ N and y ≥ M Then
12.             Length of an SES is D
13.             Stop
14. Length of an SES is greater than MAX
Fig. 2. The greedy LCS/SES algorithm.
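The pseudocode of Figure 2 can be transliterated almost line for line; the sketch below (our function name, 0-based string indexing, a dictionary for V) is one such transliteration:

```python
def ses_length(A, B, MAX=None):
    """Length of a shortest edit script, following Figure 2 (0-based strings)."""
    N, M = len(A), len(B)
    if MAX is None:
        MAX = N + M
    V = {1: 0}  # V[k] = row index x of the furthest reaching path on diagonal k
    for D in range(MAX + 1):
        for k in range(-D, D + 1, 2):
            if k == -D or (k != D and V[k - 1] < V[k + 1]):
                x = V[k + 1]        # extend downward (vertical edge)
            else:
                x = V[k - 1] + 1    # extend rightward (horizontal edge)
            y = x - k
            while x < N and y < M and A[x] == B[y]:
                x, y = x + 1, y + 1  # follow the snake
            V[k] = x
            if x >= N and y >= M:
                return D
    return None  # every edit script for A and B is longer than MAX
```

For the example of Figure 1, `ses_length("abcabba", "cbabac")` returns 5.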
The greedy algorithm takes at most O((M + N)D) time. Lines 1 and 14 consume O(1) time. The inner For loop (Line 3) is repeated at most (D+1)(D+2)/2 times because the outer For loop (Line 2) is repeated D+1 times and during its kth iteration the inner loop is repeated at most k times. All the lines within this inner loop take constant time except for the While loop (Line 9). Thus O(D²) time is spent executing Lines 2-8 and 10-13. The While loop is iterated once for each diagonal traversed in the extension of furthest reaching paths. But at most O((M + N)D) diagonals are traversed since all D-paths lie between diagonals −D and D and there are at most (2D+1) · min(N, M) points within this band. Thus the algorithm requires a total of O((M + N)D) time. Note that just Line 9, the traversal of snakes, is the limiting step. The rest of the algorithm is O(D²). Furthermore, the algorithm never takes more than O((M + N) · MAX) time in the practical case where the threshold MAX is set to a value much less than M + N.
The search of the greedy algorithm traces the optimal D-paths among others. But only the current set of furthest reaching endpoints is retained in V. Consequently, only the length of an SES/LCS can be reported in Line 12. To explicitly generate a solution path, O(D²) space³ is used to store a copy of V after each iteration of the outer loop. Let V_d be the copy of V kept after the dth iteration. To list an optimal path from (0, 0) to the point V_d[k], first determine whether it is at the end of a maximal snake following a vertical edge from V_{d−1}[k+1] or a horizontal edge from V_{d−1}[k−1]. To be concrete, suppose it is V_{d−1}[k+1]. Recursively list an optimal path from (0, 0) to this point and then list the vertical edge and maximal snake to V_d[k]. The recursion stops when d = 0, in which case the snake from (0, 0) to (V_0[0], V_0[0]) is listed. So with O(M + N) additional time and O(D²) space an optimal path can be listed by replacing Line 12 with a call to this recursive procedure with V_D[N − M] as the initial point. A refinement requiring only O(M + N) space is shown in the next section.

³ If only O(D²) space is to be allocated, the algorithm is first run to determine D in O(N) space, then the space is allocated, and finally, the algorithm is run again to determine a solution path.

[Figure 3 shows the furthest reaching paths searched on the example of Figure 1, out to D = 3. Its legend distinguishes even extensions, odd extensions, diagonals, the edit graph boundary, and the envelope of D-path endpoints.]
Fig. 3. Furthest reaching paths.
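The recursive listing procedure can be sketched as follows: save a copy of V after each outer iteration (O(D²) space in total) and then walk backwards from (N, M), making the same vertical-versus-horizontal decision as the forward pass. The names and 0-based indexing are ours, not the paper's:

```python
def ses_backtrack(A, B):
    """Greedy search that saves V after each outer iteration, then recovers
    the furthest reaching endpoints along one optimal path."""
    N, M = len(A), len(B)
    V = {1: 0}
    Vs = []  # Vs[d] is the copy of V after the d-th outer iteration
    for D in range(N + M + 1):
        for k in range(-D, D + 1, 2):
            if k == -D or (k != D and V[k - 1] < V[k + 1]):
                x = V[k + 1]
            else:
                x = V[k - 1] + 1
            y = x - k
            while x < N and y < M and A[x] == B[y]:
                x, y = x + 1, y + 1
            V[k] = x
            if x >= N and y >= M:
                Vs.append(dict(V))
                return backtrack(Vs, N, M)
        Vs.append(dict(V))

def backtrack(Vs, N, M):
    """Walk from (N, M) back to (0, 0), one furthest reaching endpoint per d."""
    x, y = N, M
    points = [(x, y)]
    for d in range(len(Vs) - 1, 0, -1):
        k = x - y
        prev = Vs[d - 1]
        # Same decision as the forward pass: did the d-th nondiagonal edge
        # enter diagonal k from k+1 (vertical) or from k-1 (horizontal)?
        if k == -d or (k != d and prev.get(k - 1, -1) < prev.get(k + 1, -1)):
            pk = k + 1
        else:
            pk = k - 1
        x, y = prev[pk], prev[pk] - pk
        points.append((x, y))
    return points[::-1]
```

Successive points are joined by one nondiagonal edge plus a maximal snake, so the full path is easily reconstructed from them.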
As noted in Section 2, the LCS/SES problem can be viewed as an instance of the single-source shortest paths problem on a weighted edit graph. This suggests that an efficient algorithm can be obtained by specializing Dijkstra's algorithm [3]. A basic exercise [2: 207-208] shows that the algorithm takes O(E log V) time where E is the number of edges and V is the number of vertices in the subject graph. For an edit graph E ≤ 3V since each point has outdegree at most three. Moreover, the log V term comes from the cost of managing a priority queue. In the case at hand the priorities will be integers in [0, M+N] as edge costs are 0 or 1 and the longest possible path to any point is M + N. Under these conditions, the priority queue operations can be implemented in constant time using "bucketing" and linked-list techniques. Thus Dijkstra's algorithm can be specialized to perform in time linear in the number of vertices in the edit graph, i.e., O(MN). The final refinement stems from noting that all that is needed is the shortest path from the source (0, 0) to the point (N, M). Dijkstra's algorithm determines the minimum distances of vertices from the source in increasing order, one vertex per iteration. By Lemma 1 there are at most O((M+N)D) points less distant from (0, 0) than (N, M) and the previous refinements reduce the cost of each iteration to O(1). Thus the algorithm can stop as soon as the minimum distance to (N, M) is ascertained and it only spends O((M+N)D) time in so doing.
It has been shown that a specialization of Dijkstra's algorithm also gives an
O(ND) time algorithm for the LCS/SES problem. However, the resulting
algorithm involves a relatively complex discrete priority queue and this queue
may contain as many as O(ND) entries even in the case where just the length
of the LCS/SES is being computed. While one could argue that further refinement
leads to the simple algorithm of this paper, the connection becomes so tenuous
that the direct and easily motivated derivation used in this section is preferable.
The aim of the discussion is to expose the close relationship between the shortest
paths and LCS/SES problems and their algorithms.

4. Refinements. The basic algorithm can be embellished in a number of ways. First, the algorithm's expected performance is O(M + N + D²), which is much superior to the worst-case prediction of O((M + N)D). While not shown here, experiments reveal that the variance about the mean is small, especially as the alphabet size becomes large. Thus while there are pathological cases that require O((M + N)D) time, they are extremely rare (e.g., like O(N²) problems for quicksort). Second, the algorithm can be refined to use only linear space when reporting an edit script. The only other algorithm that has been shown to admit such a refinement is the basic O(MN) dynamic programming algorithm [7]. A linear space algorithm is of practical import since many large problems can reasonably be solved in O(D²) time but not in O(D²) space. Finally, an O((M + N) log (M + N) + D²) worst-case time variation is obtained by speeding up the traversal of snakes with some previously developed techniques [6, 14]. The variation is impractical due to the sophistication of these underlying methods but its superior asymptotic worst-case complexity is of theoretical interest.

4.1. A Probabilistic Analysis. Consider the following stochastic model for the sequences A and B in a shortest edit script problem. A and B are sequences over an alphabet Σ where each symbol occurs with probability p_σ for σ ∈ Σ. The N symbols of A are randomly and independently chosen according to the probability densities p_σ. The M = N − δ + ι symbol sequence B is obtained by randomly deleting δ symbols from A and randomly inserting ι randomly chosen symbols. The deletion and insertion positions are chosen with uniform probability. An equivalent model is to generate a random sequence of length L = N − δ and then randomly insert δ and ι randomly generated symbols into this sequence to produce A and B, respectively. Note that the LCS of A and B must consist of at least L symbols but may be longer.
An alternate model is to consider A and B as randomly generated sequences of length N and M which are constrained to have an LCS of length L. This model is not equivalent to the one above except in the limit when the size of Σ becomes arbitrarily large and every probability p_σ goes to zero. Nonetheless, the ensuing treatment can also be applied to this model with the same asymptotic results. The first model is chosen as it reflects the edit scripts for mapping A into B that are assumed by the SES problem. While other edit script commands such as "transfers", "moves", and "exchanges" are more reflective of actual editing sessions, their inclusion results in optimization problems distinct from the SES problem discussed here. Hence stochastic models based on such edit processes are not considered.
In the edit graph of A and B there are L diagonal edges corresponding to the randomly generated LCS of A and B. Any other diagonal edge, ending at say (x, y), occurs with the same probability that a_x = b_y, as these symbols were obtained by independent random trials. Thus the probability of an off-LCS diagonal is p = Σ_σ p_σ². The SES algorithm searches by extending furthest reaching paths until the point (N, M) is reached. Each extension consists of a horizontal or vertical edge followed by the longest possible snake. The maximal snakes consist of a number of LCS and off-LCS diagonals. The probability that there are exactly t off-LCS diagonals in a given extension's snake is pᵗ(1 − p). Thus the expected number of off-LCS diagonals in an extension is Σ_{t=0}^∞ t pᵗ(1 − p) = p/(1 − p). At most d + 1 extensions are made in the dth iteration of the outer For loop of the SES algorithm. Therefore at most (D+1)(D+2)p/2(1 − p) off-LCS diagonals are traversed in the expected case. Moreover, at most L LCS diagonals are ever traversed. Consequently, the critical While loop of the algorithm is executed an average of O(L + D²) times when p is bounded away from 1. The remainder of the algorithm has already been observed to take at worst O(D²) time. When p = 1, there is only one letter of nonzero probability in the alphabet Σ, so A and B consist of repetitions of that letter, with probability one. In this case, the algorithm runs in O(M + N) time. Thus the SES algorithm takes O(M + N + D²) time in the expected case.

4.2. A Linear Space Refinement. The LCS/SES problem is symmetric with respect to the orientation of edit graph edges. Consider reversing the direction of every edge in the edit graph for sequences A and B. Subsequences and edit scripts for A and B are still modeled as paths in this reverse edit graph but now the paths start at (N, M) and end at (0, 0). Also, the interpretation of paths alters just slightly to reflect the reversal of direction. Each diagonal edge beginning at (x, y) gives a symbol, a_x (= b_y), in the common subsequence; each horizontal edge from point (x, y) corresponds to the delete command "x D"; etc. So the LCS/SES problem can be solved by starting at (N, M) and progressively extending furthest reaching paths in the reverse edit graph until one reaches (0, 0). Hereafter, forward paths will refer to those in the edit graph and reverse paths will refer to those in the reverse edit graph. Since paths in opposing directions are in exact correspondence, the direction of a path is distinguished only when it is of operational importance.
As in the linear space algorithm of Hirschberg [7], a divide-and-conquer strategy is employed. A D-path has D + 1 snakes, some of which may be empty. The divide step requires finding the ⌈D/2⌉ + 1st or middle snake of an optimal D-path. The idea for doing so is to simultaneously run the basic algorithm in both the forward and reverse directions until furthest reaching forward and reverse paths starting at opposing corners "overlap". Lemma 3 provides the formal observation underlying this approach.

LEMMA 3. There is a D-path from (0, 0) to (N, M) if and only if there is a ⌈D/2⌉-path from (0, 0) to some point (x, y) and a ⌊D/2⌋-path from some point (u, v) to (N, M) such that:

(feasibility) u + v ≥ ⌈D/2⌉ and x + y ≤ N + M − ⌊D/2⌋, and

(overlap) x − y = u − v and x ≥ u.

Moreover, both D/2-paths are contained within D-paths from (0, 0) to (N, M).

PROOF. Suppose there is a D-path from (0, 0) to (N, M). It can be partitioned at the start, (x, y), of its middle snake into a ⌈D/2⌉-path from (0, 0) to (x, y) and a ⌊D/2⌋-path from (u, v) to (N, M) where (u, v) = (x, y). A path from (0, 0) to (u, v) can have at most u + v nondiagonal edges and there is a ⌈D/2⌉-path to (u, v), implying that u + v ≥ ⌈D/2⌉. A path from (x, y) to (N, M) can have at most (N + M) − (x + y) nondiagonal edges and there is a ⌊D/2⌋-path from (x, y), implying that x + y ≤ N + M − ⌊D/2⌋. Finally, u − v = x − y and u ≤ x as (x, y) = (u, v).
Conversely, suppose the ⌈D/2⌉- and ⌊D/2⌋-paths exist. But u ≤ x implies there is a k-path from (0, 0) to (u, v) where k ≤ ⌈D/2⌉. By Lemma 1, λ = ⌈D/2⌉ − k is a multiple of 2 as both the k-path and ⌈D/2⌉-path end in the same diagonal. Moreover, the k-path has (u + v − k)/2 ≥ λ/2 diagonals as u + v ≥ ⌈D/2⌉. By replacing each of λ/2 of the diagonals in the k-path with a pair of horizontal and vertical edges, a ⌈D/2⌉-path from (0, 0) to (u, v) is obtained. But then there is a D-path from (0, 0) to (N, M) consisting of this ⌈D/2⌉-path to (u, v) and the given ⌊D/2⌋-path from (u, v) to (N, M). Note that the ⌊D/2⌋-path is part of this D-path. By a symmetric argument the ⌈D/2⌉-path is also part of a D-path from (0, 0) to (N, M). □
The outline below gives the procedure for finding the middle snake of an optimal path. For successive values of D, compute the endpoints of the furthest reaching forward D-paths from (0, 0) and then compute the furthest reaching reverse D-paths from (N, M). Do so in V vectors, one for each direction, as in the basic algorithm. As each endpoint is computed, check to see if it overlaps with the path in the same diagonal but opposite direction. A check is needed to ensure that there is an opposing path in the given diagonal because forward paths are in diagonals centered about 0 and reverse paths are in diagonals centered around Δ = N − M. Moreover, by Lemma 1, the optimal edit script length is odd or even as Δ is odd or even. Thus when Δ is odd, check for overlap only while extending forward paths and when Δ is even, check for overlap only while extending reverse paths. As soon as a pair of opposing and furthest reaching paths overlap, stop and report the overlapping snake as the middle snake of an optimal path. Note that the endpoints of this snake can be readily delivered as the snake was just computed in the previous step.

Δ ← N − M
For D ← 0 to ⌈(M+N)/2⌉ Do
    For k ← −D to D in steps of 2 Do
        Find the end of the furthest reaching forward D-path in diagonal k.
        If Δ is odd and k ∈ [Δ − (D − 1), Δ + (D − 1)] Then
            If the path overlaps the furthest reaching reverse (D − 1)-path
            in diagonal k Then
                Length of an SES is 2D − 1.
                The last snake of the forward path is the middle snake.
    For k ← −D to D in steps of 2 Do
        Find the end of the furthest reaching reverse D-path in diagonal
        k + Δ.
        If Δ is even and k + Δ ∈ [−D, D] Then
            If the path overlaps the furthest reaching forward D-path in
            diagonal k + Δ Then
                Length of an SES is 2D.
                The last snake of the reverse path is the middle snake.
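For concreteness, the outline above can be transcribed into Python. This is an illustrative sketch rather than code from the paper: the function name `middle_snake`, the dictionaries standing in for the two V vectors, and the 0-indexed coordinates are all choices of this presentation.

```python
def middle_snake(a, b):
    """Bidirectional search for the middle snake of an optimal path.

    Returns (D, x, y, u, v): the length D of a shortest edit script for
    a and b, plus a snake from (x, y) to (u, v) splitting an optimal
    D-path, as in the outline above (0-indexed coordinates).
    """
    n, m = len(a), len(b)
    delta = n - m
    odd = delta % 2 != 0
    vf = {1: 0}   # furthest-reaching forward x, keyed by diagonal k
    vr = {}       # furthest-reaching reverse x, keyed by diagonal k
    for d in range((n + m + 1) // 2 + 1):
        for k in range(-d, d + 1, 2):              # extend forward d-paths
            if k == -d or (k != d and vf[k - 1] < vf[k + 1]):
                x = vf[k + 1]                      # step down from diagonal k+1
            else:
                x = vf[k - 1] + 1                  # step right from diagonal k-1
            y = x - k
            x0, y0 = x, y                          # start of the trailing snake
            while x < n and y < m and a[x] == b[y]:
                x, y = x + 1, y + 1
            vf[k] = x
            if odd and delta - (d - 1) <= k <= delta + (d - 1) and vf[k] >= vr[k]:
                return 2 * d - 1, x0, y0, x, y     # overlap: SES length 2d - 1
        for k in range(-d, d + 1, 2):              # extend reverse d-paths
            kk = k + delta                         # reverse diagonals center on delta
            if d == 0:
                x = n
            elif k == -d or (k != d and vr[kk + 1] - 1 < vr[kk - 1]):
                x = vr[kk + 1] - 1                 # step left from diagonal kk+1
            else:
                x = vr[kk - 1]                     # step up from diagonal kk-1
            y = x - kk
            u0, v0 = x, y                          # end of the trailing snake
            while x > 0 and y > 0 and a[x - 1] == b[y - 1]:
                x, y = x - 1, y - 1
            vr[kk] = x
            if not odd and -d <= kk <= d and vf[kk] >= vr[kk]:
                return 2 * d, x, y, u0, v0         # overlap: SES length 2d
```

On the paper's running example pair "ABCABBA" and "CBABAC" this returns an SES length of D = 5, together with a snake whose matched substrings agree in both sequences.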

The correctness of this procedure relies heavily on Lemma 3. Without loss of
generality suppose Δ is even. The algorithm stops as soon as the smallest D is
encountered for which furthest reaching D-paths in opposite directions overlap.
First, the overlapping paths must be shown to satisfy the feasibility condition of
Lemma 3. Suppose the reverse furthest reaching path ends at (u, v) where u + v = k.
There is always a k-path of nondiagonal edges from (0, 0) to (u, v) which when
combined with the reverse D-path forms a (k + D)-path from (0, 0) to (N, M).
This path and Lemma 3 imply there are overlapping h-paths where h =
(k + D)/2 (k + D is divisible by 2 as Δ is even). So certainly there are overlapping
furthest reaching h-or-less paths. If k < D, then h < D, contradicting the fact that
furthest reaching D-paths are the first to overlap. So u + v ≥ D as desired. A
similar argument shows that the furthest reaching forward D-path also satisfies
the feasibility constraint of Lemma 3. Now the feasible, overlapping D-paths
and Lemma 3 imply that there is a solution path of length 2D. This must be
optimal for if there is a 2k-path, k < D, then the lemma implies there are k-paths
that overlap. But this implies that there are overlapping furthest reaching k-or-less
paths, contradicting the fact that there are no overlapping paths for smaller D.
Thus by Lemma 3, both of the overlapping paths are parts of optimal 2D-paths
from (0, 0) to (N, M). The same conclusion can be drawn when Δ is odd. It
remains to show that the chosen snake is the middle snake of the optimal path.
If Δ is odd (even) then this snake is the (D+1)st snake of a forward (reverse)
path which is part of an optimal (2D−1)-path ((2D)-path). In either case the
snake is the (⌈k/2⌉ + 1)st snake of an optimal k-path containing it.
The procedure for finding the middle snake of an optimal D-path requires a
total of O(D) working storage for the two V vectors. The procedure only requires
O((M+ N)D) time because the forward and reverse path extension portions
both consume O((M+N)D) time by the same argument used for the basic
algorithm. In fact, the number of snakes traversed in extending forward and
reverse paths can be half that of the basic algorithm because only D/2 diagonals
are searched. This feature is of practical import--experiments reveal that this
procedure for determining the length of an SES is as efficient as the basic algorithm
when D = 0 and rapidly becomes twice as fast as D increases.
Given the middle snake procedure, a linear space algorithm for finding an
optimal path through the edit graph of A and B of sizes N and M can be devised.
For simplicity, the divide-and-conquer algorithm below just lists a longest common
subsequence of A and B. (Producing a shortest edit script is left as an
exercise.) First, divide the problem by finding a middle snake from, say, (x, y)
to (u, v) of an optimal D-path. Begin conquering the problem by recursively
finding a ⌈D/2⌉-path from (0, 0) to (x, y) and listing its LCS. Then list the middle
snake ("Output A[x + 1..u]" lists nothing if u = x). Finally, recursively find the
⌊D/2⌋-path from (u, v) to (N, M) and list its LCS. The recursion ends in two
ways. If N = 0 or M = 0 then the LCS is empty and there is nothing to list. In
the other case, N > 0 and M > 0 and D ≤ 1. If D ≤ 1 then B is obtained from A
by either deleting or inserting at most one symbol. But then it follows that the
shorter of A and B is the LCS and should be listed.

LCS(A, N, B, M)
    If N > 0 and M > 0 Then
        Find the middle snake and the length D of an optimal path for A
        and B. Suppose the snake is from (x, y) to (u, v).
        If D > 1 Then
            LCS(A[1..x], x, B[1..y], y)
            Output A[x + 1..u].
            LCS(A[u + 1..N], N − u, B[v + 1..M], M − v)
        Else If M > N Then
            Output A[1..N].
        Else
            Output B[1..M].
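The divide-and-conquer outline can likewise be rendered in Python. This sketch is self-contained, so it restates the middle-snake search given earlier; slicing keeps the code short at the cost of extra copying, which an index-based version would avoid. All names are choices of this presentation, not the paper's.

```python
def middle_snake(a, b):
    """Return (D, x, y, u, v): the SES length D and a middle snake from
    (x, y) to (u, v) of an optimal D-path (0-indexed coordinates)."""
    n, m = len(a), len(b)
    delta = n - m
    odd = delta % 2 != 0
    vf, vr = {1: 0}, {}   # furthest-reaching forward/reverse x per diagonal
    for d in range((n + m + 1) // 2 + 1):
        for k in range(-d, d + 1, 2):                  # forward search
            if k == -d or (k != d and vf[k - 1] < vf[k + 1]):
                x = vf[k + 1]
            else:
                x = vf[k - 1] + 1
            y = x - k
            x0, y0 = x, y
            while x < n and y < m and a[x] == b[y]:    # follow the snake
                x, y = x + 1, y + 1
            vf[k] = x
            if odd and delta - (d - 1) <= k <= delta + (d - 1) and vf[k] >= vr[k]:
                return 2 * d - 1, x0, y0, x, y
        for k in range(-d, d + 1, 2):                  # reverse search
            kk = k + delta
            if d == 0:
                x = n
            elif k == -d or (k != d and vr[kk + 1] - 1 < vr[kk - 1]):
                x = vr[kk + 1] - 1
            else:
                x = vr[kk - 1]
            y = x - kk
            u0, v0 = x, y
            while x > 0 and y > 0 and a[x - 1] == b[y - 1]:
                x, y = x - 1, y - 1
            vr[kk] = x
            if not odd and -d <= kk <= d and vf[kk] >= vr[kk]:
                return 2 * d, x, y, u0, v0

def lcs(a, b):
    """List an LCS of a and b in linear working space, as outlined above."""
    n, m = len(a), len(b)
    if n == 0 or m == 0:
        return a[:0]                 # one sequence empty: nothing to list
    d, x, y, u, v = middle_snake(a, b)
    if d > 1:                        # divide at the middle snake
        return lcs(a[:x], b[:y]) + a[x:u] + lcs(a[u:], b[v:])
    if m > n:                        # D <= 1: the shorter sequence is the LCS
        return a
    return b
```

Because different middle snakes can be chosen, the particular LCS listed may vary, but its length is always (N + M − D)/2.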

Let T(P, D) be the time taken by the algorithm where P is N + M. It follows
that T satisfies the recurrence inequality:

    T(P, D) ≤ αPD + T(P1, ⌈D/2⌉) + T(P2, ⌊D/2⌋)   if D > 1,
    T(P, D) ≤ βP                                  if D ≤ 1,

where P1 + P2 ≤ P and α and β are suitably large constants. Noting that ⌈D/2⌉ ≤
2D/3 for D ≥ 2, a straightforward induction argument shows that T(P, D) ≤
3αPD + βP. Thus the divide-and-conquer algorithm still takes just O((M + N)D)
time despite the O(log D) levels of recursion through which it descends. Furthermore,
the algorithm only requires O(D) working storage. The middle snake
procedure requires two O(D) space V vectors. But this step is completed before
engaging in the recursion. Thus only one pair of global V vectors is shared by
all invocations of the procedure. Moreover, only O(log D) levels of recursion
are traversed implying that only O(log D) storage is needed on the recursion
stack. Unfortunately, the input sequences A and B must be kept in memory,
implying that a total of O(M+ N) space is needed.
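The "straightforward induction" invoked above can be spelled out. Assume inductively that T(P1, ⌈D/2⌉) ≤ 3αP1⌈D/2⌉ + βP1 and similarly for P2; then, using P1 + P2 ≤ P and ⌈D/2⌉ ≤ 2D/3 for D ≥ 2:

```latex
\begin{aligned}
T(P, D) &\le \alpha P D + T(P_1, \lceil D/2 \rceil) + T(P_2, \lfloor D/2 \rfloor) \\
        &\le \alpha P D + 3\alpha P_1 \lceil D/2 \rceil + \beta P_1
             + 3\alpha P_2 \lfloor D/2 \rfloor + \beta P_2 \\
        &\le \alpha P D + 3\alpha (P_1 + P_2) \cdot \tfrac{2D}{3} + \beta (P_1 + P_2) \\
        &\le \alpha P D + 2\alpha P D + \beta P \;=\; 3\alpha P D + \beta P .
\end{aligned}
```

The base case D ≤ 1 gives T(P, D) ≤ βP directly, establishing the O((M + N)D) bound.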

4.3. An O((M + N) log (M + N) + D²) Worst-Case Variation. The final topic
involves two previous results, each of which is just sketched here. First, suffix
trees [12, 14] are used to efficiently record the common sublists of the sequences
being compared. The term sublist is used as opposed to subsequence to emphasize
that the symbols must be contiguous. Second, a recent RAM-based algorithm for
answering Q on-line queries for the lowest common ancestors of vertices in a
fixed V-vertex tree takes O(V+ Q) time [6]. The efficient variation centers on
quickly finding the length or endpoint of a maximal snake starting at point (x, y).
This is shown to reduce to finding the lowest common ancestor of two leaves in
a suffix tree. This can be done in O((M+ N)log ( M + N)) pre-processing time
and O(1) time per query using the two techniques above. The ensuing paragraphs
embellish these ideas.
A suffix or Patricia tree [12, 14] for a sequence S of length L has edges labelled
with sublists of S, has L leaves labelled with the positions of S, and satisfies the
following three properties.
1. Concatenating the edge labels traversed on the path from the root to the leaf
for position j gives the suffix, S[j..L], of S starting at j. Thus every path within
the tree denotes a suffix of S.
2. Every interior vertex has out-degree greater than one.
3. The labels of the out-edges of every vertex begin with distinct symbols.
These properties can only be satisfied if the last symbol of S is distinct from
every other symbol in S. This condition is usually met by appending a special
symbol to the target sequence and once satisfied, the suffix tree is unique. Property
2 guarantees that there are fewer than L interior vertices. Moreover, the substrings
labelling edges can be represented by just storing indices to their first and last
characters in S. Thus suffix trees can be stored in O(L) space. The efficient
construction of suffix trees is beyond the scope of this paper. The reader is referred
to a paper by McCreight [14] giving an algorithm that constructs a suffix tree in
O(L) steps. Most of the steps are easily done in O(1) time but some require
selecting an out-edge based on its first symbol. When the alphabet is finite, the
out-degree of vertices is finite and the selection takes O(1) time. When the
alphabet is unrestricted, height-balanced trees or some other worst-case efficient
search structure permits selection in O(log L) time. Thus suffix tree construction
takes O(L) time for finite alphabets and O(L log L) time otherwise.
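To make the definition concrete, here is a deliberately naive O(L²) construction in Python that exhibits Properties 1-3. The class and function names are mine, and this is illustration only; McCreight's O(L) algorithm [14] is the one to use in practice.

```python
class Node:
    """Suffix tree vertex; out-edges are keyed by their first symbol
    (Property 3), so edge selection is a dictionary lookup."""
    def __init__(self):
        self.edges = {}    # first symbol -> (edge label, child Node)
        self.leaf = None   # start position of the suffix, for leaves

def suffix_tree(s):
    """Naive O(L^2) suffix tree construction.  Requires the last symbol
    of s to be unique (e.g. append '$'), so no suffix is a prefix of
    another and every suffix ends at its own leaf."""
    root = Node()
    for j in range(len(s)):
        node, suf = root, s[j:]
        while suf:
            c = suf[0]
            if c not in node.edges:        # new leaf edge for this suffix
                leaf = Node()
                leaf.leaf = j
                node.edges[c] = (suf, leaf)
                break
            label, child = node.edges[c]
            k = 1                          # match suf against the edge label
            while k < len(label) and k < len(suf) and label[k] == suf[k]:
                k += 1
            if k == len(label):            # label exhausted: descend
                node, suf = child, suf[k:]
            else:                          # split the edge at offset k
                mid = Node()
                mid.edges[label[k]] = (label[k:], child)
                node.edges[c] = (label[:k], mid)
                node, suf = mid, suf[k:]
    return root

def leaf_positions(node, acc=None):
    """Collect the suffix positions stored at the leaves."""
    if acc is None:
        acc = []
    if node.leaf is not None:
        acc.append(node.leaf)
    for _label, child in node.edges.values():
        leaf_positions(child, acc)
    return acc
```

Splitting an edge always leaves the new interior vertex with out-degree two, which is how Property 2 (and hence the O(L) size bound) is maintained.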

Consider the two paths from the root of S's suffix tree to leaves i and j. Each
path from the root to a common ancestor of i and j, denotes a common prefix
of the suffixes S[i..L] and S[j..L]. From Property 3 it follows that the path to
the lowest common ancestor of i and j, denotes the longest prefix of their respective
suffixes. This observation motivates the following suffix tree characterization of
the maximal snake starting at point (x, y) in the edit graph of A and B of lengths
N and M, respectively. Form the suffix tree for the sequence S = A.$1.B.$2
where the symbols $1 and $2 are not equal to each other or any symbol in A or
B. The maximal snake starting at (x, y) is denoted by the path from the root of
S's suffix tree to the lowest common ancestor of positions x and y + N + 1. This
follows because neither $1 nor $2 can be a part of this longest common prefix for
the suffixes A[x..N].$1.B.$2 and B[y..M].$2. So to find the endpoint of a snake
starting at (x, y), find the lowest common ancestor of leaves x and y + N + 1 in
the suffix tree and return (x+ m, y + m) where m is the length of the sublist
denoted by the path to this ancestor. In a linear preprocessing pass the sublist
lengths to every vertex are computed and the auxiliary structures needed for the
O(V+ Q) lowest common ancestor algorithm of Harel and Tarjan [6] are con-
structed. This RAM-based algorithm requires O(V) preprocessing time but can
then answer each on-line query in O(1) time. Thus with O((M + N) log (M + N))
preprocessing time (building the suffix tree is the dominant cost), a collection of
on-line queries for the endpoints of maximal snakes can be answered in O(1)
time per query.
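The reduction can be illustrated with a small Python sketch. Here '#' and '$' play the roles of $1 and $2, coordinates are 0-indexed (so the leaf positions are shifted by one relative to the 1-indexed positions in the text), and a direct character scan stands in for the O(1) lowest-common-ancestor query; only the suffix-tree and LCA machinery makes each query constant-time.

```python
def snake_end(a, b, x, y):
    """Endpoint of the maximal snake starting at point (x, y), 0-indexed:
    the snake matches a[x] with b[y], a[x+1] with b[y+1], and so on.

    The snake length is the longest common prefix of the two suffixes of
    S = a + '#' + b + '$' that begin at a[x] and b[y]; in the paper that
    prefix is read off the path to the lowest common ancestor of the two
    corresponding suffix-tree leaves.  Here it is found by direct scan."""
    s = a + '#' + b + '$'       # sentinels assumed absent from a and b
    i, j = x, len(a) + 1 + y    # start positions of the two suffixes in s
    m = 0
    while s[i + m] == s[j + m]:  # the distinct sentinels force a mismatch
        m += 1
    return x + m, y + m
```

For example, with a = "ABCABBA" and b = "CBABAC", the snake starting at (2, 0) ends at (3, 1): the single matched symbol is the 'C'.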
Modify the basic algorithm of Section 3 by (a) prefacing it with the preprocess-
ing needed for the maximal snake queries and (b) replacing Line 9 with the O(1)
query primitives. Recall that every line in the innermost loop other than Line 9
is O(1) and that the loop is repeated O(D²) times. Now that Line 9 takes O(1)
time it follows that this modification results in an algorithm that runs in O((M +
N) log (M + N) + D 2) time. Note that this variation is primarily of theoretical
interest. The coefficients of proportionality are much larger for the algorithm
fragments employed implying that problems will have to be very large before the
variation becomes faster. But suffix trees are particularly space inefficient and
two auxiliary trees of equal size are needed for the fast lowest common ancestor
algorithm. Thus for problems large enough to make the time savings worthwhile
it is likely that there will not be enough memory to accommodate these additional
structures.

Acknowledgments. Webb Miller originally proposed the problem of finding an
O(ND) algorithm. The author would like to thank him for nurturing this work
and his many helpful suggestions. The referees' comments and corrections
improved the paper greatly.

References
[1] A. V. Aho, D. S. Hirschberg, and J. D. Ullman. Bounds on the complexity of the longest
common subsequence problem. J. ACM, 23, 1 (1976), 1-12.
[2] A. V. Aho, J. E. Hopcroft, and J. D. Ullman. Data Structures and Algorithms. Addison-Wesley:
Reading, MA, 1983, pp. 203-208.
[3] E. W. Dijkstra. A note on two problems in connexion with graphs. Numer. Math., 1 (1959),
269-271.
[4] J. Gosling. A redisplay algorithm. Proceedings ACM SIGPLAN/SIGOA Symposium on Text
Manipulation, 1981.
[5] P. A. V. Hall and G. R. Dowling. Approximate string matching. Comput. Surv., 12, 4 (1980),
381-402.
[6] D. Harel and R. E. Tarjan. Fast algorithms for finding nearest common ancestors. SIAM J.
Comput., 13, 2 (1984), 338-355.
[7] D. S. Hirschberg. A linear space algorithm for computing maximal common subsequences.
Commun. ACM, 18, 6 (1975), 341-343.
[8] D. S. Hirschberg. Algorithms for the longest common subsequence problem. J. ACM, 24, 4
(1977), 664-675.
[9] D. S. Hirschberg. An information-theoretic lower bound for the longest common subsequence
problem. Inform. Process. Lett., 7, 1 (1978), 40-41.
[10] J. W. Hunt and M. D. McIlroy. An algorithm for differential file comparison. Computing
Science Technical Report 41, Bell Laboratories (1975).
[11] J. W. Hunt and T. G. Szymanski. A fast algorithm for computing longest common subsequences.
Commun. ACM, 20, 5 (1977), 350-353.
[12] D. E. Knuth. The Art of Computer Programming, Vol. 3: Sorting and Searching. Addison-Wesley:
Reading, MA, 1973, pp. 490-493.
[13] W. J. Masek and M. S. Paterson. A faster algorithm for computing string edit distances.
J. Comput. System Sci., 20, 1 (1980), 18-31.
[14] E. M. McCreight. A space-economical suffix tree construction algorithm. J. ACM, 23, 2 (1976),
262-272.
[15] W. Miller and E. W. Myers. A file comparison program. Software-Practice and Experience, 15
(1985), 1025-1040.
[16] N. Nakatsu, Y. Kambayashi, and S. Yajima. A longest common subsequence algorithm suitable
for similar text strings. Acta Inform., 18 (1982), 171-179.
[17] M. J. Rochkind. The source code control system. IEEE Trans. Software Engrg., 1, 4 (1975),
364-370.
[18] D. Sankoff and J. B. Kruskal. Time Warps, String Edits and Macromolecules: The Theory and
Practice of Sequence Comparison. Addison-Wesley: Reading, MA, 1983.
[19] W. Tichy. The string-to-string correction problem with block moves. ACM Trans. Comput.
Systems, 2 (1984), 309-321.
[20] R. A. Wagner and M. J. Fischer. The string-to-string correction problem. J. ACM, 21, 1 (1974),
168-173.
