diff options
author | Tor Brede Vekterli <vekterli@vespa.ai> | 2024-04-03 09:39:31 +0000 |
---|---|---|
committer | Tor Brede Vekterli <vekterli@vespa.ai> | 2024-04-16 11:45:22 +0000 |
commit | 0e8aab2c447e0ed5cac035d385c95ad44d8594d6 (patch) | |
tree | d7f0c633886a80a43039d7c6684d7e618ab1c5b1 /config-class-plugin | |
parent | fca990d5ed32c408df42bbe178b174711fa54a08 (diff) |
Add prefix match support to Levenshtein algorithm implementations
Adds support for matching the _prefix_ of a source string against
a target string (the prefix query) within a bounded maximum number
of `k` max edits. Iff the number of edits required is within the
specified bound, returns the _minimum_ number of edits required to
transform the source string prefix to the full target string.
By convention, we treat the target string as the "columnar" string
in the Levenshtein matrix (i.e. its characters are column-indexed,
whereas the source string is row-indexed). This matters for prefix
matching, because unlike regular Levenshtein matching it is _not_
symmetric between source and target strings.
By definition, the Levenshtein matrix cell at row `i`, column `j`
provides the minimum number of edits required to transform a prefix
of source string S (up to and including length `i`) into a prefix of
target string T (up to and including length `j`). Since we want to
match against the entire target (prefix query) string of length `n`,
the problem is reduced to finding the minimum value of the `n`th
column that is `<= k`.
Example: matching the source string `abcdef` against the target `acd`
with `k` == 2:
a c d
0 1 2 3
a 1 0 1 2
b 2 1 1 2
c 3 2 1 2
d 4 3 2 1
e 5 4 3 2
f 6 5 4 3
In this case, the _shortest_ matching prefix is simply `a`, but the
_minimum edits_ prefix is `abcd`. The latter prefix's distance is
what we return.
For our generalized (i.e. arbitrary `k`) Levenshtein implementation,
this is implemented fairly straight forward since it operates on
matrix rows already (sparsely populated around the diagonal).
For the DFA implementation(s), transitioning between states in a
Levenshtein DFA is equivalent to explicitly computing the (sparse)
matrix columns around the diagonal for the source character being
currently matched, so the same principle applies directly to it.
Since we don't have any explicit notion of the value of matrix columns
in the abstract DFA, a source string in prefix mode will be considered
a match when _any_ DFA state is a match.
By definition, this is as-if the matrix row represented by the state
has a `n`th column that is <= `k`. We then follow the DFA until it
can no longer match, keeping track of the lowest state edit distance
encountered. This mirrors finding the row whose `n`th column
minimizes `k`.
Prefix matching support has been added to the core DFA matching
loop algorithms, with only very minor changes to the underlying DFA
implementations. Successor generation upon mismatch should work
as expected with no algorithmic changes introduced for prefix
matching. Prefix match mode has been added as a dimension to the
exhaustive successor checking unit test.
Diffstat (limited to 'config-class-plugin')
0 files changed, 0 insertions, 0 deletions