mirror of https://github.com/golang/go.git
157 lines
12 KiB
Go
157 lines
12 KiB
Go
// Copyright 2022 The Go Authors. All rights reserved.
|
|
// Use of this source code is governed by a BSD-style
|
|
// license that can be found in the LICENSE file.
|
|
|
|
// package lcs contains code to find longest-common-subsequences
|
|
// (and diffs)
|
|
package lcs
|
|
|
|
/*
|
|
Compute longest-common-subsequences of two slices A, B using
|
|
algorithms from Myers' paper. A longest-common-subsequence
|
|
(LCS from now on) of A and B is a maximal set of lexically increasing
|
|
pairs of subscripts (x,y) with A[x]==B[y]. There may be many LCS, but
|
|
they all have the same length. An LCS determines a sequence of edits
|
|
that changes A into B.
|
|
|
|
The key concept is the edit graph of A and B.
|
|
If A has length N and B has length M, then the edit graph has
|
|
vertices v[i][j] for 0 <= i <= N, 0 <= j <= M. There is a
|
|
horizontal edge from v[i][j] to v[i+1][j] whenever both are in
|
|
the graph, and a vertical edge from v[i][j] to f[i][j+1] similarly.
|
|
When A[i] == B[j] there is a diagonal edge from v[i][j] to v[i+1][j+1].
|
|
|
|
A path between in the graph between (0,0) and (N,M) determines a sequence
|
|
of edits converting A into B: each horizontal edge corresponds to removing
|
|
an element of A, and each vertical edge corresponds to inserting an
|
|
element of B.
|
|
|
|
A vertex (x,y) is on (forward) diagonal k if x-y=k. A path in the graph
|
|
is of length D if it has D non-diagonal edges. The algorithms generate
|
|
forward paths (in which at least one of x,y increases at each edge),
|
|
or backward paths (in which at least one of x,y decreases at each edge),
|
|
or a combination. (Note that the orientation is the traditional mathematical one,
|
|
with the origin in the lower-left corner.)
|
|
|
|
Here is the edit graph for A:"aabbaa", B:"aacaba". (I know the diagonals look weird.)
|
|
⊙ ------- ⊙ ------- ⊙ ------- ⊙ ------- ⊙ ------- ⊙ ------- ⊙
|
|
a | ___/‾‾‾ | ___/‾‾‾ | | | ___/‾‾‾ | ___/‾‾‾ |
|
|
⊙ ------- ⊙ ------- ⊙ ------- ⊙ ------- ⊙ ------- ⊙ ------- ⊙
|
|
b | | | ___/‾‾‾ | ___/‾‾‾ | | |
|
|
⊙ ------- ⊙ ------- ⊙ ------- ⊙ ------- ⊙ ------- ⊙ ------- ⊙
|
|
a | ___/‾‾‾ | ___/‾‾‾ | | | ___/‾‾‾ | ___/‾‾‾ |
|
|
⊙ ------- ⊙ ------- ⊙ ------- ⊙ ------- ⊙ ------- ⊙ ------- ⊙
|
|
c | | | | | | |
|
|
⊙ ------- ⊙ ------- ⊙ ------- ⊙ ------- ⊙ ------- ⊙ ------- ⊙
|
|
a | ___/‾‾‾ | ___/‾‾‾ | | | ___/‾‾‾ | ___/‾‾‾ |
|
|
⊙ ------- ⊙ ------- ⊙ ------- ⊙ ------- ⊙ ------- ⊙ ------- ⊙
|
|
a | ___/‾‾‾ | ___/‾‾‾ | | | ___/‾‾‾ | ___/‾‾‾ |
|
|
⊙ ------- ⊙ ------- ⊙ ------- ⊙ ------- ⊙ ------- ⊙ ------- ⊙
|
|
a a b b a a
|
|
|
|
|
|
The algorithm labels a vertex (x,y) with D,k if it is on diagonal k and at
|
|
the end of a maximal path of length D. (Because x-y=k it suffices to remember
|
|
only the x coordinate of the vertex.)
|
|
|
|
The forward algorithm: Find the longest diagonal starting at (0,0) and
|
|
label its end with D=0,k=0. From that vertex take a vertical step and
|
|
then follow the longest diagonal (up and to the right), and label that vertex
|
|
with D=1,k=-1. From the D=0,k=0 point take a horizontal step and the follow
|
|
the longest diagonal (up and to the right) and label that vertex
|
|
D=1,k=1. In the same way, having labelled all the D vertices,
|
|
from a vertex labelled D,k find two vertices
|
|
tentatively labelled D+1,k-1 and D+1,k+1. There may be two on the same
|
|
diagonal, in which case take the one with the larger x.
|
|
|
|
Eventually the path gets to (N,M), and the diagonals on it are the LCS.
|
|
|
|
Here is the edit graph with the ends of D-paths labelled. (So, for instance,
|
|
0/2,2 indicates that x=2,y=2 is labelled with 0, as it should be, since the first
|
|
step is to go up the longest diagonal from (0,0).)
|
|
A:"aabbaa", B:"aacaba"
|
|
⊙ ------- ⊙ ------- ⊙ -------(3/3,6)------- ⊙ -------(3/5,6)-------(4/6,6)
|
|
a | ___/‾‾‾ | ___/‾‾‾ | | | ___/‾‾‾ | ___/‾‾‾ |
|
|
⊙ ------- ⊙ ------- ⊙ -------(2/3,5)------- ⊙ ------- ⊙ ------- ⊙
|
|
b | | | ___/‾‾‾ | ___/‾‾‾ | | |
|
|
⊙ ------- ⊙ ------- ⊙ ------- ⊙ ------- ⊙ -------(3/5,4)------- ⊙
|
|
a | ___/‾‾‾ | ___/‾‾‾ | | | ___/‾‾‾ | ___/‾‾‾ |
|
|
⊙ ------- ⊙ -------(1/2,3)-------(2/3,3)------- ⊙ ------- ⊙ ------- ⊙
|
|
c | | | | | | |
|
|
⊙ ------- ⊙ -------(0/2,2)-------(1/3,2)-------(2/4,2)-------(3/5,2)-------(4/6,2)
|
|
a | ___/‾‾‾ | ___/‾‾‾ | | | ___/‾‾‾ | ___/‾‾‾ |
|
|
⊙ ------- ⊙ ------- ⊙ ------- ⊙ ------- ⊙ ------- ⊙ ------- ⊙
|
|
a | ___/‾‾‾ | ___/‾‾‾ | | | ___/‾‾‾ | ___/‾‾‾ |
|
|
⊙ ------- ⊙ ------- ⊙ ------- ⊙ ------- ⊙ ------- ⊙ ------- ⊙
|
|
a a b b a a
|
|
|
|
The 4-path is reconstructed starting at (4/6,6), horizontal to (3/5,6), diagonal to (3,4), vertical
|
|
to (2/3,3), horizontal to (1/2,3), vertical to (0/2,2), and diagonal to (0,0). As expected,
|
|
there are 4 non-diagonal steps, and the diagonals form an LCS.
|
|
|
|
There is a symmetric backward algorithm, which gives (backwards labels are prefixed with a colon):
|
|
A:"aabbaa", B:"aacaba"
|
|
⊙ -------- ⊙ -------- ⊙ -------- ⊙ -------- ⊙ -------- ⊙ -------- ⊙
|
|
a | ____/‾‾‾ | ____/‾‾‾ | | | ____/‾‾‾ | ____/‾‾‾ |
|
|
⊙ -------- ⊙ -------- ⊙ -------- ⊙ -------- ⊙ --------(:0/5,5)-------- ⊙
|
|
b | | | ____/‾‾‾ | ____/‾‾‾ | | |
|
|
⊙ -------- ⊙ -------- ⊙ --------(:1/3,4)-------- ⊙ -------- ⊙ -------- ⊙
|
|
a | ____/‾‾‾ | ____/‾‾‾ | | | ____/‾‾‾ | ____/‾‾‾ |
|
|
(:3/0,3)--------(:2/1,3)-------- ⊙ --------(:2/3,3)--------(:1/4,3)-------- ⊙ -------- ⊙
|
|
c | | | | | | |
|
|
⊙ -------- ⊙ -------- ⊙ --------(:3/3,2)--------(:2/4,2)-------- ⊙ -------- ⊙
|
|
a | ____/‾‾‾ | ____/‾‾‾ | | | ____/‾‾‾ | ____/‾‾‾ |
|
|
(:3/0,1)-------- ⊙ -------- ⊙ -------- ⊙ --------(:3/4,1)-------- ⊙ -------- ⊙
|
|
a | ____/‾‾‾ | ____/‾‾‾ | | | ____/‾‾‾ | ____/‾‾‾ |
|
|
(:4/0,0)-------- ⊙ -------- ⊙ -------- ⊙ --------(:4/4,0)-------- ⊙ -------- ⊙
|
|
a a b b a a
|
|
|
|
Neither of these is ideal for use in an editor, where it is undesirable to send very long diffs to the
|
|
front end. It's tricky to decide exactly what 'very long diffs' means, as "replace A by B" is very short.
|
|
We want to control how big D can be, by stopping when it gets too large. The forward algorithm then
|
|
privileges common prefixes, and the backward algorithm privileges common suffixes. Either is an undesirable
|
|
asymmetry.
|
|
|
|
Fortunately there is a two-sided algorithm, implied by results in Myers' paper. Here's what the labels in
|
|
the edit graph look like.
|
|
A:"aabbaa", B:"aacaba"
|
|
⊙ --------- ⊙ --------- ⊙ --------- ⊙ --------- ⊙ --------- ⊙ --------- ⊙
|
|
a | ____/‾‾‾‾ | ____/‾‾‾‾ | | | ____/‾‾‾‾ | ____/‾‾‾‾ |
|
|
⊙ --------- ⊙ --------- ⊙ --------- (2/3,5) --------- ⊙ --------- (:0/5,5)--------- ⊙
|
|
b | | | ____/‾‾‾‾ | ____/‾‾‾‾ | | |
|
|
⊙ --------- ⊙ --------- ⊙ --------- (:1/3,4)--------- ⊙ --------- ⊙ --------- ⊙
|
|
a | ____/‾‾‾‾ | ____/‾‾‾‾ | | | ____/‾‾‾‾ | ____/‾‾‾‾ |
|
|
⊙ --------- (:2/1,3)--------- (1/2,3) ---------(2:2/3,3)--------- (:1/4,3)--------- ⊙ --------- ⊙
|
|
c | | | | | | |
|
|
⊙ --------- ⊙ --------- (0/2,2) --------- (1/3,2) ---------(2:2/4,2)--------- ⊙ --------- ⊙
|
|
a | ____/‾‾‾‾ | ____/‾‾‾‾ | | | ____/‾‾‾‾ | ____/‾‾‾‾ |
|
|
⊙ --------- ⊙ --------- ⊙ --------- ⊙ --------- ⊙ --------- ⊙ --------- ⊙
|
|
a | ____/‾‾‾‾ | ____/‾‾‾‾ | | | ____/‾‾‾‾ | ____/‾‾‾‾ |
|
|
⊙ --------- ⊙ --------- ⊙ --------- ⊙ --------- ⊙ --------- ⊙ --------- ⊙
|
|
a a b b a a
|
|
|
|
The algorithm stopped when it saw the backwards 2-path ending at (1,3) and the forwards 2-path ending at (3,5). The criterion
|
|
is a backwards path ending at (u,v) and a forward path ending at (x,y), where u <= x and the two points are on the same
|
|
diagonal. (Here the edgegraph has a diagonal, but the criterion is x-y=u-v.) Myers proves there is a forward
|
|
2-path from (0,0) to (1,3), and that together with the backwards 2-path ending at (1,3) gives the expected 4-path.
|
|
Unfortunately the forward path has to be constructed by another run of the forward algorithm; it can't be found from the
|
|
computed labels. That is the worst case. Had the code noticed (x,y)=(u,v)=(3,3) the whole path could be reconstructed
|
|
from the edgegraph. The implementation looks for a number of special cases to try to avoid computing an extra forward path.
|
|
|
|
If the two-sided algorithm has stop early (because D has become too large) it will have found a forward LCS and a
|
|
backwards LCS. Ideally these go with disjoint prefixes and suffixes of A and B, but disjointness may fail and the two
|
|
computed LCS may conflict. (An easy example is where A is a suffix of B, and shares a short prefix. The backwards LCS
|
|
is all of A, and the forward LCS is a prefix of A.) The algorithm combines the two
|
|
to form a best-effort LCS. In the worst case the forward partial LCS may have to
|
|
be recomputed.
|
|
*/
|
|
|
|
/* Eugene Myers paper is titled
|
|
"An O(ND) Difference Algorithm and Its Variations"
|
|
and can be found at
|
|
http://www.xmailserver.org/diff2.pdf
|
|
|
|
(There is a generic implementation of the algorithm the the repository with git hash
|
|
b9ad7e4ade3a686d608e44475390ad428e60e7fc)
|
|
*/
|