An O(ND) Difference Algorithm and Its Variations
E. W. Myers
Algorithmica
© 1986 Springer-Verlag New York Inc.
Abstract. The problems of finding a longest common subsequence of two sequences A and B and
a shortest edit script for transforming A into B have long been known to be dual problems. In this
paper, they are shown to be equivalent to finding a shortest/longest path in an edit graph. Using this
perspective, a simple O(ND) time and space algorithm is developed where N is the sum of the
lengths of A and B and D is the size of the minimum edit script for A and B. The algorithm performs
well when differences are small (sequences are similar) and is consequently fast in typical applications.
The algorithm is shown to have O(N + D²) expected-time performance under a basic stochastic
model. A refinement of the algorithm requires only O(N) space, and the use of suffix trees leads to
an O(N log N + D²) time variation.
Key Words. Longest common subsequence, Shortest edit script, Edit graph, File comparison.
¹ This work was supported in part by the National Science Foundation under Grant MCS82-10096.
² Department of Computer Science, University of Arizona, Tucson, AZ 85721, USA.
Received June 11, 1985; revised January 17, 1986. Communicated by David Dobkin.
252 E.W. Myers
to the neighbor below it, i.e., (x, y − 1) → (x, y) for x ∈ [0, N] and y ∈ [1, M]. If
a_x = b_y then there is a diagonal edge connecting vertex (x − 1, y − 1) to vertex
(x, y). The points (x, y) for which a_x = b_y are called match points. The total number
of match points between A and B is the parameter R characterizing the Hunt
and Szymanski algorithm [11]. It is also the number of diagonal edges in the edit
graph, as diagonal edges are in one-to-one correspondence with match points.
Figure 1 depicts the edit graph for the sequences A = abcabba and B = cbabac.
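For concreteness, the match points (and hence the parameter R) can be enumerated directly. The following Python sketch is an illustration added here, not code from the paper:

```python
def match_points(a, b):
    """All points (x, y), 1-indexed as in the paper, with a_x = b_y.

    The number of such points is the parameter R of the
    Hunt-Szymanski algorithm, and also the number of diagonal
    edges in the edit graph of a and b.
    """
    return [(x, y)
            for x in range(1, len(a) + 1)
            for y in range(1, len(b) + 1)
            if a[x - 1] == b[y - 1]]
```

For the sequences of Figure 1, `match_points("abcabba", "cbabac")` yields 14 points, so the edit graph has R = 14 diagonal edges.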
A trace of length L is a sequence of L match points, (x_1, y_1)(x_2, y_2) … (x_L, y_L),
such that x_i < x_{i+1} and y_i < y_{i+1} for successive points (x_i, y_i) and (x_{i+1}, y_{i+1}),
i ∈ [1, L − 1]. Every trace is in exact correspondence with the diagonal edges of
a path in the edit graph from (0, 0) to (N, M). The sequence of match points
visited in traversing a path from start to finish is easily verified to be a trace.
Note that L is the number of diagonal edges in the corresponding path. To
construct a path from a trace, take the sequence of diagonal edges corresponding
to the match points of the trace and connect successive diagonals with a series
of horizontal and vertical edges. This can always be done as x_i < x_{i+1} and y_i < y_{i+1}
for successive match points. Note that several paths differing only in their
[Figure: the edit graph for A = abcabba and B = cbabac, with a highlighted path whose
TRACE = (3,1) (4,3) (5,4) (7,5),
COMMON SUBSEQUENCE = CABA = A3A4A5A7 = B1B3B4B5, and
EDIT SCRIPT = 1D, 2D, 3IB, 6D, 7IC.]
Fig. 1. An edit graph.
The lemma implies that D-paths end solely on odd diagonals when D is odd
and even diagonals when D is even.
A D-path is furthest reaching in diagonal k if and only if it is one of the
D-paths ending on diagonal k whose endpoint has the greatest possible row
(column) number of all such paths. Informally, of all D-paths ending in diagonal
k, it ends furthest from the origin, (0, 0). The following lemma gives an inductive
characterization of furthest reaching D-paths and embodies a greedy principle:
furthest reaching D-paths are obtained by greedily extending furthest reaching
(D - 1)-paths.
LEMMA 2. A furthest reaching 0-path ends at (x, x), where x is min(z − 1 | a_z ≠ b_z
or z > M or z > N). A furthest reaching D-path on diagonal k can without loss of
generality be decomposed into a furthest reaching (D − 1)-path on diagonal k − 1,
followed by a horizontal edge, followed by the longest possible snake, or it may be
decomposed into a furthest reaching (D − 1)-path on diagonal k + 1, followed by a
vertical edge, followed by the longest possible snake.
For D ← 0 to M + N Do
    For k ← −D to D in steps of 2 Do
        Find the endpoint of the furthest reaching D-path in diagonal k.
        If (N, M) is the endpoint Then
            The D-path is an optimal solution.
            Stop
The outline above stops when the smallest D is encountered for which there is
a furthest reaching D-path to (N, M). This must happen before the outer loop
terminates because D must be less than or equal to M + N. By construction this
path must be minimal with respect to the number of nondiagonal edges within
it. Hence it is a solution to the LCS/SES problem.
In presenting the detailed algorithm in Figure 2, a number of simple optimizations
are employed. An array, V, contains the endpoints of the furthest reaching
D-paths in elements V[−D], V[−D + 2], …, V[D − 2], V[D]. By Lemma 1 this
set of elements is disjoint from those where the endpoints of the (D + 1)-paths
will be stored in the next iteration of the outer loop. Thus the array V can
simultaneously hold the endpoints of the D-paths while the (D + 1)-path endpoints
are being computed from them. Furthermore, to record an endpoint (x, y)
in diagonal k it suffices to retain just x because y is known to be x − k. Consequently,
V is an array of integers where V[k] contains the row index of the
endpoint of a furthest reaching path in diagonal k.
As a practical matter, the algorithm searches D-paths where D ≤ MAX, and if
no such path reaches (N, M) then it reports in Line 14 that any edit script for
A and B must be longer than MAX. By setting the constant MAX to M + N
as in the outline above, the algorithm is guaranteed to find the length of the
LCS/SES. Figure 3 illustrates the D-paths searched when the algorithm is applied
to the example of Figure 1. Note that a fictitious endpoint, (0, −1), set up in Line
1 of the algorithm is used to find the endpoint of the furthest reaching 0-path.
Also note that D-paths extend off the left and lower boundaries of the edit graph.
Constant MAX ∈ [0, M + N]
Var V: Array [−MAX .. MAX] of Integer
1.   V[1] ← 0
2.   For D ← 0 to MAX Do
3.       For k ← −D to D in steps of 2 Do
4.           If k = −D or (k ≠ D and V[k − 1] < V[k + 1]) Then
5.               x ← V[k + 1]
6.           Else
7.               x ← V[k − 1] + 1
8.           y ← x − k
9.           While x < N and y < M and a_{x+1} = b_{y+1} Do (x, y) ← (x + 1, y + 1)
10.          V[k] ← x
11.          If x ≥ N and y ≥ M Then
12.              Length of an SES is D
13.              Stop
14.  Length of an SES is greater than MAX
Fig. 2. The greedy LCS/SES algorithm.
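Figure 2 translates almost line for line into executable form. The following Python sketch is my transcription, not the paper's code; MAX defaults to M + N so a length is always found, and the array V is offset so that negative diagonals index a plain list:

```python
def ses_length(a, b, max_d=None):
    """Length of a shortest edit script for a -> b (Figure 2 sketch)."""
    n, m = len(a), len(b)
    if max_d is None:
        max_d = n + m                      # MAX = M + N guarantees success
    # V[k] holds the row x of the furthest reaching D-path on diagonal k;
    # the index is offset by max_d so k may be negative.
    v = [0] * (2 * max_d + 2)
    off = max_d
    v[off + 1] = 0                         # fictitious endpoint (0, -1), Line 1
    for d in range(max_d + 1):
        for k in range(-d, d + 1, 2):
            # Lines 4-7: step down from the further of the two neighbors.
            if k == -d or (k != d and v[off + k - 1] < v[off + k + 1]):
                x = v[off + k + 1]         # vertical edge (insertion)
            else:
                x = v[off + k - 1] + 1     # horizontal edge (deletion)
            y = x - k
            # Line 9: follow the longest possible snake.
            while x < n and y < m and a[x] == b[y]:
                x, y = x + 1, y + 1
            v[off + k] = x
            if x >= n and y >= m:          # Line 11: reached (N, M)
                return d
    return None                            # SES longer than max_d (Line 14)
```

On the example of Figure 1, `ses_length("abcabba", "cbabac")` returns 5.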
[Fig. 3: the D-paths searched, up to D = 3, when the algorithm is applied to the example of Figure 1; the graphic is not recoverable from this copy.]
time and O(D²) space an optimal path can be listed by replacing Line 12 with
a call to this recursive procedure with V_D[N − M] as the initial point. A refinement
requiring only O(M + N) space is shown in the next section.
As noted in Section 2, the LCS/SES problem can be viewed as an instance of
the single-source shortest paths problem on a weighted edit graph. This suggests
that an efficient algorithm can be obtained by specializing Dijkstra's algorithm
[3]. A basic exercise [2: 207-208] shows that the algorithm takes O(E log V)
time where E is the number of edges and V is the number of vertices in the
subject graph. For an edit graph E < 3 V since each point has outdegree at most
three. Moreover, the log V term comes from the cost of managing a priority
queue. In the case at hand the priorities will be integers in [0, M + N ] as edge
costs are 0 or 1 and the longest possible path to any point is M + N. Under these
conditions, the priority queue operations can be implemented in constant time
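The constant-time priority queue alluded to here can be realized with an array of buckets indexed by priority, as in Dial's variant of Dijkstra's algorithm. The sketch below is illustrative only, not from the paper; it assumes, as in the text, that priorities are integers in [0, M + N] and never decrease during the run:

```python
class BucketQueue:
    """Monotone integer priority queue over [0, max_priority].

    Since edge costs are 0 or 1, the minimum extracted priority never
    decreases, so pop() scans forward through the buckets; across the
    whole run the cursor advances at most max_priority + 1 times,
    giving O(1) amortized time per operation.
    """
    def __init__(self, max_priority):
        self.buckets = [[] for _ in range(max_priority + 1)]
        self.cursor = 0  # smallest possibly non-empty bucket

    def push(self, priority, item):
        self.buckets[priority].append(item)

    def pop(self):
        """Return (priority, item) with minimal priority."""
        while not self.buckets[self.cursor]:
            self.cursor += 1
        return self.cursor, self.buckets[self.cursor].pop()
```

With one bucket per possible path length, each vertex of the edit graph is pushed and popped in constant amortized time, removing the log V factor.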
4.1. A Probabilistic Analysis. Consider the following stochastic model for the
sequences A and B in a shortest edit script problem. A and B are sequences
over an alphabet Σ where each symbol σ ∈ Σ occurs with probability p_σ. The
N symbols of A are randomly and independently chosen according to the
probability densities, p_σ. The M = N − δ + ι symbol sequence B is obtained by
randomly deleting δ symbols from A and randomly inserting ι randomly chosen
symbols. The deletion and insertion positions are chosen with uniform probability.
An equivalent model is to generate a random sequence of length L = N − δ and
then randomly insert δ and ι randomly generated symbols to this sequence to
produce A and B, respectively. Note that the LCS of A and B must consist of
at least L symbols but may be longer.
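The model can be simulated directly. In this sketch (the function name and parameters are my own, not the paper's), `delta` and `iota` play the roles of δ and ι:

```python
import random

def generate_pair(n, delta, iota, alphabet="ab", seed=0):
    """Sample (A, B) per the stochastic model: A has n independent
    random symbols; B is A with delta random deletions followed by
    iota random insertions, positions chosen uniformly."""
    rng = random.Random(seed)
    a = [rng.choice(alphabet) for _ in range(n)]
    b = list(a)
    for _ in range(delta):                        # random deletions
        del b[rng.randrange(len(b))]
    for _ in range(iota):                         # random insertions
        b.insert(rng.randrange(len(b) + 1), rng.choice(alphabet))
    return "".join(a), "".join(b)
```

By construction len(B) = N − δ + ι, and A and B share a common subsequence of at least L = N − δ symbols.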
An alternate model is to consider A and B as randomly generated sequences
of length N and M which are constrained to have an LCS of length L. This
model is not equivalent to the one above except in the limit when the size of Σ
becomes arbitrarily large and every probability p_σ goes to zero. Nonetheless, the
ensuing treatment can also be applied to this model with the same asymptotic
results. The first model is chosen as it reflects the edit scripts for mapping A into
B that are assumed by the SES problem. While other edit script commands such
as "transfers", "moves", and "exchanges" are more reflective of actual editing
sessions, their inclusion results in optimization problems distinct from the SES
problem discussed here. Hence stochastic models based on such edit processes are
not considered.
In the edit graph of A and B there are L diagonal edges corresponding to the
randomly generated LCS of A and B. Any other diagonal edge, ending at say
(x, y), occurs with the probability that a_x = b_y, as these symbols were obtained
by independent random trials. Thus the probability of an off-LCS diagonal is
p = Σ_σ p_σ². The SES algorithm searches by extending furthest reaching paths
until the point (N, M) is reached. Each extension consists of a horizontal or
vertical edge followed by the longest possible snake. The maximal snakes consist
of a number of LCS and off-LCS diagonals. The probability that there are exactly
t off-LCS diagonals in a given extension's snake is p^t(1 − p). Thus the expected
number of off-LCS diagonals in an extension is Σ_{t=0}^∞ t p^t(1 − p) = p/(1 − p). At
most d + 1 extensions are made in the dth iteration of the outer For loop of the
SES algorithm. Therefore at most (D + 1)(D + 2)p/(2(1 − p)) off-LCS diagonals
are traversed in the expected case. Moreover, at most L LCS diagonals are ever
traversed. Consequently, the critical While loop of the algorithm is executed an
average of O(L + D²) times when p is bounded away from 1. The remainder of
the algorithm has already been observed to take at worst O(D²) time. When
p = 1, there is only one letter of nonzero probability in the alphabet Σ, so A and
B consist of repetitions of that letter, with probability one. In this case, the
algorithm runs in O(M + N) time. Thus the SES algorithm takes O(M + N + D²)
time in the expected case.
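The expectation used above is the mean of a geometric distribution, and the quadratic factor is a triangular sum over the iterations; the algebra (standard, not spelled out in the paper) is:

```latex
\begin{aligned}
\sum_{t=0}^{\infty} t\,p^{t}(1-p)
  &= p(1-p)\sum_{t=1}^{\infty} t\,p^{t-1}
   = p(1-p)\cdot\frac{1}{(1-p)^{2}}
   = \frac{p}{1-p},\\[4pt]
\sum_{d=0}^{D} (d+1) &= \frac{(D+1)(D+2)}{2},
\end{aligned}
```

which together give the (D + 1)(D + 2)p/(2(1 − p)) bound on expected off-LCS diagonals.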
Moreover, both D/2-paths are contained within D-paths from (0, 0) to (N, M).

PROOF. Suppose there is a D-path from (0, 0) to (N, M). It can be partitioned
at the start, (x, y), of its middle snake into a ⌈D/2⌉-path from (0, 0) to (x, y)
and a ⌊D/2⌋-path from (u, v) to (N, M) where (u, v) = (x, y). A path from (0, 0)
to (u, v) can have at most u + v nondiagonal edges and there is a ⌈D/2⌉-path
to (u, v), implying that u + v ≥ ⌈D/2⌉. A path from (x, y) to (N, M) can have at
most (N + M) − (x + y) nondiagonal edges and there is a ⌊D/2⌋-path from (x, y),
implying that x + y ≤ N + M − ⌊D/2⌋. Finally, u − v = x − y and u ≤ x as (x, y) =
(u, v).

Conversely, suppose the ⌈D/2⌉- and ⌊D/2⌋-paths exist. But u ≤ x implies
there is a k-path from (0, 0) to (u, v) where k ≤ ⌈D/2⌉. By Lemma 1, λ = ⌈D/2⌉ − k
is a multiple of 2 as both the k-path and the ⌈D/2⌉-path end in the same diagonal.
Moreover, the k-path has (u + v − k)/2 ≥ λ/2 diagonals as u + v ≥ ⌈D/2⌉. By
replacing each of λ/2 of the diagonals in the k-path with a pair of horizontal
and vertical edges, a ⌈D/2⌉-path from (0, 0) to (u, v) is obtained. But then there
is a D-path from (0, 0) to (N, M) consisting of this ⌈D/2⌉-path to (u, v) and
the given ⌊D/2⌋-path from (u, v) to (N, M). Note that the ⌊D/2⌋-path is part
of this D-path. By a symmetric argument the ⌈D/2⌉-path is also part of a D-path
from (0, 0) to (N, M). □
The outline below gives the procedure for finding the middle snake of an
optimal path. For successive values of D, compute the endpoints of the furthest
reaching forward D-paths from (0, 0) and then compute the furthest reaching
reverse D-paths from (N, M). Do so in V vectors, one for each direction, as in
the basic algorithm. As each endpoint is computed, check to see if it overlaps
with the path in the same diagonal but opposite direction. A check is needed to
ensure that there is an opposing path in the given diagonal because forward
paths are in diagonals centered about 0 and reverse paths are in diagonals centered
around Δ = N − M. Moreover, by Lemma 1, the optimal edit script length is odd
or even as Δ is odd or even. Thus when Δ is odd, check for overlap only while
extending forward paths, and when Δ is even, check for overlap only while
extending reverse paths. As soon as a pair of opposing and furthest reaching
paths overlap, stop and report the overlapping snake as the middle snake of an
optimal path. Note that the endpoints of this snake can be readily delivered as
the snake was just computed in the previous step.
Δ ← N − M
For D ← 0 to ⌈(M + N)/2⌉ Do
    For k ← −D to D in steps of 2 Do
        Find the end of the furthest reaching forward D-path in diagonal k.
        If Δ is odd and k ∈ [Δ − (D − 1), Δ + (D − 1)] Then
            If the path overlaps the furthest reaching reverse (D − 1)-path
            in diagonal k Then
                Length of an SES is 2D − 1.
                The last snake of the forward path is the middle snake.
    For k ← −D to D in steps of 2 Do
        Find the end of the furthest reaching reverse D-path in diagonal
        k + Δ.
        If Δ is even and k + Δ ∈ [−D, D] Then
            If the path overlaps the furthest reaching forward D-path in
            diagonal k + Δ Then
                Length of an SES is 2D.
                The last snake of the reverse path is the middle snake.
LCS(A, N, B, M)
    If N > 0 and M > 0 Then
        Find the middle snake and length D of an optimal path for A
        and B. Suppose the snake is from (x, y) to (u, v).
        If D > 1 Then
            LCS(A[1..x], x, B[1..y], y)
            Output A[x + 1..u].
            LCS(A[u + 1..N], N − u, B[v + 1..M], M − v)
        Else If M > N Then
            Output A[1..N].
        Else
            Output B[1..M].
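The middle-snake search and the LCS procedure above can be transcribed as follows. This Python sketch is my own rendering, not the paper's code: the reverse search simply runs the forward greedy step on the reversed sequences, so a reverse diagonal k corresponds to forward diagonal Δ − k, and (unlike the paper's shared global V vectors) it reallocates its vectors per call for clarity:

```python
def middle_snake(a, b):
    """Return (D, x, y, u, v): the SES length and a middle snake
    (x, y) -> (u, v) of an optimal path, per the outline above."""
    n, m = len(a), len(b)
    delta = n - m
    odd = delta % 2 != 0
    max_d = (n + m + 1) // 2
    # Python's negative indexing lets v[k] work for k in [-max_d-1, max_d+1].
    vf = [0] * (2 * max_d + 4)   # furthest forward x on forward diagonal k
    vb = [0] * (2 * max_d + 4)   # furthest reverse x, in reversed coordinates
    for d in range(max_d + 1):
        for k in range(-d, d + 1, 2):          # forward D-paths
            if k == -d or (k != d and vf[k - 1] < vf[k + 1]):
                x = vf[k + 1]
            else:
                x = vf[k - 1] + 1
            y = x - k
            x0, y0 = x, y                      # start of this snake
            while x < n and y < m and a[x] == b[y]:
                x, y = x + 1, y + 1
            vf[k] = x
            # Odd delta: test overlap with the reverse (d-1)-paths.
            if odd and delta - (d - 1) <= k <= delta + (d - 1):
                if vf[k] + vb[delta - k] >= n:
                    return 2 * d - 1, x0, y0, x, y
        for k in range(-d, d + 1, 2):          # reverse D-paths
            if k == -d or (k != d and vb[k - 1] < vb[k + 1]):
                x = vb[k + 1]
            else:
                x = vb[k - 1] + 1
            y = x - k
            x0, y0 = x, y
            while x < n and y < m and a[n - 1 - x] == b[m - 1 - y]:
                x, y = x + 1, y + 1
            vb[k] = x
            # Even delta: test overlap with the forward d-paths.
            if not odd and -d <= delta - k <= d:
                if vb[k] + vf[delta - k] >= n:
                    return 2 * d, n - x, m - y, n - x0, m - y0

def lcs(a, b):
    """An LCS of a and b via the divide-and-conquer LCS procedure."""
    if len(a) == 0 or len(b) == 0:
        return ""
    d, x, y, u, v = middle_snake(a, b)
    if d > 1:
        return lcs(a[:x], b[:y]) + a[x:u] + lcs(a[u:], b[v:])
    if len(b) > len(a):   # D <= 1: the shorter sequence is an LCS
        return a
    return b
```

On the running example, `middle_snake("abcabba", "cbabac")` reports an SES of length 5 and `lcs` returns a common subsequence of length 4 (there are several optimal answers; "caba" from Figure 1 is one of them).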
Let T(P, D) be the time taken by the algorithm where P is N + M. It follows
that T satisfies the recurrence inequality:

    T(P, D) ≤ αPD + T(P1, ⌈D/2⌉) + T(P2, ⌊D/2⌋)   if D > 1,
    T(P, D) ≤ βP                                   if D ≤ 1,

where P1 + P2 ≤ P and α and β are suitably large constants. Noting that
⌈D/2⌉ ≤ 2D/3 for D ≥ 2, a straightforward induction argument shows that
T(P, D) ≤ 3αPD + βP. Thus the divide-and-conquer algorithm still takes just
O((M + N)D) time despite the ⌈log D⌉ levels of recursion through which it
descends. Furthermore, the algorithm only requires O(D) working storage.
The middle snake
procedure requires two O(D) space V vectors. But this step is completed before
engaging in the recursion. Thus only one pair of global V vectors is shared by
all invocations of the procedure. Moreover, only O(log D) levels of recursion
are traversed, implying that only O(log D) storage is needed on the recursion
stack. Unfortunately, the input sequences A and B must be kept in memory,
implying that a total of O(M + N) space is needed.
Consider the two paths from the root of S's suffix tree to leaves i and j. Each
path from the root to a common ancestor of i and j denotes a common prefix
of the suffixes S[i..L] and S[j..L]. From Property 3 it follows that the path to
the lowest common ancestor of i and j denotes the longest prefix of their respective
suffixes. This observation motivates the following suffix tree characterization of
the maximal snake starting at point (x, y) in the edit graph of A and B of lengths
N and M, respectively. Form the suffix tree for the sequence S = A.$1.B.$2
where the symbols $1 and $2 are not equal to each other or any symbol in A or
B. The maximal snake starting at (x, y) is denoted by the path from the root of
S's suffix tree to the lowest common ancestor of positions x and y + N + 1. This
follows because neither $1 nor $2 can be a part of this longest common prefix for
the suffixes A[x..N].$1.B.$2 and B[y..M].$2. So to find the endpoint of a snake
starting at (x, y), find the lowest common ancestor of leaves x and y + N + 1 in
the suffix tree and return (x + m, y + m) where m is the length of the subword
denoted by the path to this ancestor. In a linear preprocessing pass the subword
lengths to every vertex are computed and the auxiliary structures needed for the
O(V + Q) lowest common ancestor algorithm of Harel and Tarjan [6] are
constructed. This RAM-based algorithm requires O(V) preprocessing time but can
then answer each on-line query in O(1) time. Thus with O((M + N) log (M + N))
preprocessing time (building the suffix tree is the dominant cost), a collection of
on-line queries for the endpoints of maximal snakes can be answered in O(1)
time per query.
Modify the basic algorithm of Section 3 by (a) prefacing it with the preprocess-
ing needed for the maximal snake queries and (b) replacing Line 9 with the O(1)
query primitives. Recall that every line in the innermost loop other than Line 9
is O(1) and that the loop is repeated O(D²) times. Now that Line 9 takes O(1)
time, it follows that this modification results in an algorithm that runs in
O((M + N) log (M + N) + D²) time. Note that this variation is primarily of theoretical
interest. The coefficients of proportionality are much larger for the algorithm
fragments employed implying that problems will have to be very large before the
variation becomes faster. Moreover, suffix trees are particularly space inefficient, and
two auxiliary trees of equal size are needed for the fast lowest common ancestor
algorithm. Thus for problems large enough to make the time savings worthwhile
it is likely that there will not be enough memory to accommodate these additional
structures.
References

[1] A. V. Aho, D. S. Hirschberg, and J. D. Ullman. Bounds on the complexity of the longest common subsequence problem. J. ACM, 23, 1 (1976), 1-12.
[2] A. V. Aho, J. E. Hopcroft, and J. D. Ullman. Data Structures and Algorithms. Addison-Wesley: Reading, MA, 1983, pp. 203-208.
[3] E. W. Dijkstra. A note on two problems in connexion with graphs. Numer. Math., 1 (1959), 269-271.
[4] J. Gosling. A redisplay algorithm. Proceedings ACM SIGPLAN/SIGOA Symposium on Text Manipulation, 1981.
[5] P. A. V. Hall and G. R. Dowling. Approximate string matching. Comput. Surv., 12, 4 (1980), 381-402.
[6] D. Harel and R. E. Tarjan. Fast algorithms for finding nearest common ancestors. SIAM J. Comput., 13, 2 (1984), 338-355.
[7] D. S. Hirschberg. A linear space algorithm for computing maximal common subsequences. Commun. ACM, 18, 6 (1975), 341-343.
[8] D. S. Hirschberg. Algorithms for the longest common subsequence problem. J. ACM, 24, 4 (1977), 664-675.
[9] D. S. Hirschberg. An information-theoretic lower bound for the longest common subsequence problem. Inform. Process. Lett., 7, 1 (1978), 40-41.
[10] J. W. Hunt and M. D. McIlroy. An algorithm for differential file comparison. Computing Science Technical Report 41, Bell Laboratories (1975).
[11] J. W. Hunt and T. G. Szymanski. A fast algorithm for computing longest common subsequences. Commun. ACM, 20, 5 (1977), 350-353.
[12] D. E. Knuth. The Art of Computer Programming, Vol. 3: Sorting and Searching. Addison-Wesley: Reading, MA, 1983, pp. 490-493.
[13] W. J. Masek and M. S. Paterson. A faster algorithm for computing string edit distances. J. Comput. System Sci., 20, 1 (1980), 18-31.
[14] E. M. McCreight. A space-economical suffix tree construction algorithm. J. ACM, 23, 2 (1976), 262-272.
[15] W. Miller and E. W. Myers. A file comparison program. Software - Practice and Experience, 15 (1985), 1025-1040.
[16] N. Nakatsu, Y. Kambayashi, and S. Yajima. A longest common subsequence algorithm suitable for similar text strings. Acta Inform., 18 (1982), 171-179.
[17] M. J. Rochkind. The source code control system. IEEE Trans. Software Engrg., 1, 4 (1975), 364-370.
[18] D. Sankoff and J. B. Kruskal. Time Warps, String Edits, and Macromolecules: The Theory and Practice of Sequence Comparison. Addison-Wesley: Reading, MA, 1983.
[19] W. Tichy. The string-to-string correction problem with block moves. ACM Trans. Comput. Systems, 2 (1984), 309-321.
[20] R. A. Wagner and M. J. Fischer. The string-to-string correction problem. J. ACM, 21, 1 (1974), 168-173.