Add prefix match support to Levenshtein algorithm implementations - vespa - An engine for low-latency computation over large data sets

diff options

author	Tor Brede Vekterli <vekterli@vespa.ai>	2024-04-03 09:39:31 +0000
committer	Tor Brede Vekterli <vekterli@vespa.ai>	2024-04-16 11:45:22 +0000
commit	0e8aab2c447e0ed5cac035d385c95ad44d8594d6 (patch)
tree	d7f0c633886a80a43039d7c6684d7e618ab1c5b1 /config-class-plugin
parent	fca990d5ed32c408df42bbe178b174711fa54a08 (diff)

Add prefix match support to Levenshtein algorithm implementations

Adds support for matching the _prefix_ of a source string against a target string (the prefix query) within a bounded maximum number of `k` max edits. Iff the number of edits required is within the specified bound, returns the _minimum_ number of edits required to transform the source string prefix to the full target string. By convention, we treat the target string as the "columnar" string in the Levenshtein matrix (i.e. its characters are column-indexed, whereas the source string is row-indexed). This matters for prefix matching, because unlike regular Levenshtein matching it is _not_ symmetric between source and target strings. By definition, the Levenshtein matrix cell at row `i`, column `j` provides the minimum number of edits required to transform a prefix of source string S (up to and including length `i`) into a prefix of target string T (up to and including length `j`). Since we want to match against the entire target (prefix query) string of length `n`, the problem is reduced to finding the minimum value of the `n`th column that is `<= k`. Example: matching the source string `abcdef` against the target `acd` with `k` == 2: a c d 0 1 2 3 a 1 0 1 2 b 2 1 1 2 c 3 2 1 2 d 4 3 2 1 e 5 4 3 2 f 6 5 4 3 In this case, the _shortest_ matching prefix is simply `a`, but the _minimum edits_ prefix is `abcd`. The latter prefix's distance is what we return. For our generalized (i.e. arbitrary `k`) Levenshtein implementation, this is implemented fairly straight forward since it operates on matrix rows already (sparsely populated around the diagonal). For the DFA implementation(s), transitioning between states in a Levenshtein DFA is equivalent to explicitly computing the (sparse) matrix columns around the diagonal for the source character being currently matched, so the same principle applies directly to it. Since we don't have any explicit notion of the value of matrix columns in the abstract DFA, a source string in prefix mode will be considered a match when _any_ DFA state is a match. By definition, this is as-if the matrix row represented by the state has a `n`th column that is <= `k`. We then follow the DFA until it can no longer match, keeping track of the lowest state edit distance encountered. This mirrors finding the row whose `n`th column minimizes `k`. Prefix matching support has been added to the core DFA matching loop algorithms, with only very minor changes to the underlying DFA implementations. Successor generation upon mismatch should work as expected with no algorithmic changes introduced for prefix matching. Prefix match mode has been added as a dimension to the exhaustive successor checking unit test.

Diffstat (limited to 'config-class-plugin')

0 files changed, 0 insertions, 0 deletions


context:
space:
mode: