vespalib/src/vespa/vespalib/fuzzy/levenshtein_dfa.h


1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305

// Copyright Vespa.ai. Licensed under the terms of the Apache 2.0 license. See LICENSE in the project root.
#pragma once

#include <cstdint>
#include <iosfwd>
#include <memory>
#include <string>
#include <string_view>
#include <vector>

namespace vespalib::fuzzy {

/**
 * Levenshtein Deterministic Finite Automata (DFA)
 *
 * The Levenshtein distance (or edit distance) is the minimum number of edits (additions,
 * deletions or substitutions) needed to transform a particular source string s to a
 * particular target string t.
 *
 * Let m be the length of the source string and n be the length of the target string.
 *
 * The classic dynamic programming algorithm uses a n x m cost matrix and is therefore
 * O(nm) in space and time. By observing that only 2 rows of the matrix are actually
 * needed, this is commonly reduced to O(n) space complexity (still O(nm) time complexity).
 * When the maximum number of allowed edits is constrained to k, some clever observations
 * about the nature of the cost matrix allows for reducing the time complexity down to
 * O(kn) (more specifically, O((2k+1) * n)). When k is fixed (e.g. k in {1, 2}), the
 * time complexity simplifies down to O(n).
 *
 * This implements code for building and evaluating Levenshtein Deterministic Finite
 * Automata, where the resulting DFA efficiently matches all possible source strings that
 * can be transformed to the target string within k max edits. This allows for easy linear
 * matching of strings.
 *
 * Inspiration:
 *   - http://blog.notdot.net/2010/07/Damn-Cool-Algorithms-Levenshtein-Automata
 *   - https://julesjacobs.com/2015/06/17/disqus-levenshtein-simple-and-fast.html
 *
 * The latter in particular was a close inspiration for the sparse DFA state management.
 *
 * ====== Dictionary skipping via successor string generation ======
 *
 * Scanning for edit distance matches frequently takes place against a sorted dictionary.
 * When matching using a DFA, in the case where the source string does _not_ match, we can
 * generate the _successor_ string; the next matching string that is lexicographically
 * _greater_ than the source string. This string has the invariant that there are no
 * possibly matching strings within k edits ordered after the source string but before
 * the successor.
 *
 * This lets us do possibly massive leaps forward in the dictionary, turning a dictionary
 * scan into a sublinear operation.
 *
 * Note that the implemented successor algorithm is slightly different from that described
 * in the above blog post. The implemented algorithm requires zero extra data structures
 * than the DFA itself and the target string and tries to be extra clever with reducing
 * the number of code point conversions required
 *
 * ====== Unicode support ======
 *
 * Matching and successor generation is fully Unicode-aware. All input strings are expected
 * to be in UTF-8, and the generated successor is encoded as UTF-8 (with some caveats; see
 * the documentation for match()) or UTF-32, depending on the chosen `match()` overload.
 *
 * Internally, matching is done on UTF-32 code points and the DFA itself is built around
 * UTF-32. This is unlike Lucene, which converts a UTF-32 DFA to an equivalent UTF-8 DFA.
 *
 * ====== Memory usage ======
 *
 * There is always a baseline DFA memory usage O(n) in the target string, as the
 * underlying DFA needs to convert the input UTF-8 string to explicit UTF-32 chars.
 *
 * Aside from the baseline, memory usage depends on whether an explicit or implicit DFA
 * is used.
 *
 * ------ Explicit DFA ------
 *
 * The explicit DFA graph takes up quite a bit more memory than the original string
 * representation (one reason is the use of UTF-32 characters under the hood).
 *
 * Expected upper bound memory usage for a string of length n with max edits k is
 *
 *   (2k+1) * N(k) * n * W(k)
 *
 * where N(1) is expected to be 32 and N(2) is 48, W(1) is 1.34 and W(2) is 3.2 (empirically
 * derived).
 *
 * Memory usage during building is higher due to keeping track of the set of generated
 * states in a hash table, but still linear in input size. This extra memory is freed
 * once building is complete.
 *
 * ------ Implicit DFA ------
 *
 * Implicit DFAs have a O(1) memory usage during evaluation, which all lives on the stack
 * or in registers (this does not include the successor string, which is provided by the
 * caller).
 *
 * Since the sparse state stepping is currently not as fast as explicit DFA node traversal,
 * string matching is slower than with the explicit DFA.
 *
 * ====== In short ======
 *
 *  - Immutable; build once, run many times.
 *  - Explicit DFA build time is amortized linear in target string size.
 *  - Implicit DFA build time is O(1) (aside from initial UTF-32 conversion)
 *  - Zero-allocation matching.
 *  - Matching takes in raw UTF-8 input, no need to pre-convert.
 *    - Streaming UTF-8 to UTF-32 conversion; fully unicode-aware (DFA uses UTF-32 code
 *      points internally).
 *    - If required, it's possible (but not currently implemented) to bake case
 *      insensitive matching semantics into the generated DFA itself.
 *  - Allows for dictionary forward-skipping via successor algorithm.
 *  - Amortized zero allocations for successor string building when reusing string
 *    between matches.
 *  - Successor string is generated in-place as UTF-8 and can be directly used as input
 *    to a byte-wise dictionary seek.
 */
class LevenshteinDfa {
public:
    class MatchResult {
        uint8_t _max_edits;
        uint8_t _edits;
    public:
        constexpr MatchResult(uint8_t max_edits, uint8_t edits) noexcept
            : _max_edits(max_edits),
              _edits(edits)
        {}

        static constexpr MatchResult make_match(uint8_t max_edits, uint8_t edits) noexcept {
            return {max_edits, edits};
        }

        static constexpr MatchResult make_mismatch(uint8_t max_edits) noexcept {
            return {max_edits, static_cast<uint8_t>(max_edits + 1)};
        }

        [[nodiscard]] constexpr bool matches() const noexcept { return _edits <= _max_edits; }
        [[nodiscard]] constexpr uint8_t edits() const noexcept { return _edits; }
        [[nodiscard]] constexpr uint8_t max_edits() const noexcept { return _max_edits; }
    };

    struct Impl {
        virtual ~Impl() = default;
        [[nodiscard]] virtual MatchResult match(std::string_view u8str) const = 0;
        [[nodiscard]] virtual MatchResult match(std::string_view u8str, std::string& successor_out) const = 0;
        [[nodiscard]] virtual MatchResult match(std::string_view u8str, std::vector<uint32_t>& successor_out) const = 0;
        [[nodiscard]] virtual size_t memory_usage() const noexcept = 0;
        virtual void dump_as_graphviz(std::ostream& out) const = 0;
    };

private:
    std::unique_ptr<Impl> _impl;
public:
    explicit LevenshteinDfa(std::unique_ptr<Impl> impl) noexcept;
    LevenshteinDfa(LevenshteinDfa&&) noexcept;
    LevenshteinDfa& operator=(LevenshteinDfa&&) noexcept;
    LevenshteinDfa(const LevenshteinDfa&) = delete;
    LevenshteinDfa& operator=(const LevenshteinDfa&) = delete;
    ~LevenshteinDfa();

    /**
     * Attempts to match the source string `source` with the target string this DFA was
     * built with.
     *
     * `source` must not contain any null UTF-8 chars.
     *
     * Match case:
     * Iff `source` is _within_ the maximum edit distance, returns a MatchResult with
     * matches() == true and edits() == the actual edit distance. If `successor_out`
     * is not nullptr, the string pointed to is _not_ modified.
     *
     * Mismatch case:
     * Iff `source` is _beyond_ the maximum edit distance, returns a MatchResult with
     * matches() == false.
     *
     */
    [[nodiscard]] MatchResult match(std::string_view source) const;

    /**
     * Attempts to match the source string `source` with the target string this DFA was
     * built with, emitting a successor string into `successor_out` on mismatch.
     *
     * In the case of a _match_, the following holds:
     *
     *   - The returned MatchResult has the same semantics as `match(source)`.
     *   - `successor_out` has a _prefix_ equal to its value that was originally passed
     *     in at the time of match() being called. The _suffix_ of the string is unspecified,
     *     i.e. it may or may not have been modified.
     *
     * In the case of a _mismatch_, the following holds:
     *
     *   - `successor_out` has a _prefix_ equal to its value that was originally passed
     *      in at the time of match() being called.
     *   - `successor_out` has a _suffix_ that contains the next (in byte-wise ordering)
     *     possible _matching_ string S so that there exists no other matching string S'
     *     that is greater than `source` but smaller than S.
     *     The caller must explicitly be aware of any prefixes it sends in, as it is
     *     entirely ignored for the purposes of ordering the successor string vis-a-vis
     *     the input source string.
     *   - `successor_out` contains UTF-8 bytes that are within what UTF-8 can legally
     *     encode in bitwise form, but the _code points_ they encode may not be valid.
     *     In particular, surrogate pair ranges and U+10FFFF+1 may be encoded, neither of
     *     which are valid UTF-8.
     *
     * It is expected that the consumer of `successor_out` is only interested in the
     * memcmp()-ordering of strings and not whether they are technically valid Unicode.
     * This should be the case for low-level dictionary data structures etc.
     *
     * Ordering notes: when the DFA is created as Uncased, the target and source strings
     * are treated as lowercase when matching. The successor string is in this case generated
     * _as if_ the input source string was originally lowercase. This is a special case
     * intended for case-folded dictionaries that implicitly order strings by their lowercased
     * form, but where they are stored in their original cased form. The (raw) cased form
     * is what is passed to the DFA match() function.
     *
     * Memory allocation:
     * This function does not directly or indirectly allocate any heap memory if the
     * `successor_out` string provided is large enough to fit any generated successor.
     * By reusing the successor string across many calls, this therefore amortizes memory
     * allocations down to near zero per invocation.
     */
    [[nodiscard]] MatchResult match(std::string_view source, std::string& successor_out) const;

    /**
     * Same as `match(source, successor_out)`, but where the successor string is defined in
     * terms of UTF-32, not UTF-8. This avoids the need for encoding characters to UTF-8
     * internally, and is therefore expected to be more efficient.
     *
     * The code point ordering of the UTF-32 successor string is identical to that its UTF-8
     * equivalent. This includes the special cases where the successor may contain code points
     * outside the legal Unicode range.
     */
    [[nodiscard]] MatchResult match(std::string_view source, std::vector<uint32_t>& successor_out) const;

    /**
     * Returns how much memory is used by the underlying DFA representation, in bytes.
     */
    [[nodiscard]] size_t memory_usage() const noexcept;

    enum class DfaType {
        Implicit,
        Explicit,
        Table
    };

    /**
     * Specifies the character case matching semantics of the DFA.
     */
    enum class Casing {
        /**
         * Characters are case-normalized (i.e. lowercased) prior to matching.
         * Example: In uncased mode, 'A' and 'a' are considered exact matches and do not
         * consume edits.
         *
         * See the ordering notes for `match()` on what Uncased implies when generating
         * successor strings.
         */
        Uncased,
        /**
         * Characters are not case-normalized prior to matching.
         * Example: 'A' and 'a' are considered separate characters and will consume edits.
         *
         * Cased mode preserves strict memcmp() UTF-8 ordering between source and successor
         * strings.
         */
        Cased
    };

    /**
     * Builds and returns a Levenshtein DFA that matches all strings within `max_edits`
     * edits of `target_string`. The type of DFA returned is specified by dfa_type.
     *
     * `max_edits` must be in {1, 2}. Throws std::invalid_argument if outside range.
     *
     * `target_string` must not contain any null UTF-8 chars.
     */
    [[nodiscard]] static LevenshteinDfa build(std::string_view target_string, uint8_t max_edits,
                                              Casing casing, DfaType dfa_type);

    /**
     * Same as build() but currently always returns an implicit DFA.
     */
    [[nodiscard]] static LevenshteinDfa build(std::string_view target_string, uint8_t max_edits, Casing casing);

    /**
     * Dumps the DFA as a Graphviz graph in text format to the provided output stream.
     *
     * Note: Only supported for _explicit_ DFAs. Trying to call this function on an implicit
     * DFA will throw a std::runtime_error, as there is no concrete underlying graph
     * structure to dump.
     *
     * Note that only _matching_ state transitions are present in the DFA, and therefore only
     * such transitions are present in the generated graph. Overall this makes the graph for
     * longer strings much more manageable, as the number of out-edges from a particular depth
     * in the graph depends on the max number of edits and not on the length of the string
     * itself. Otherwise, you'd have a whole bunch of nodes with out-edges to the same terminal
     * non-matching state node.
     */
    void dump_as_graphviz(std::ostream& out) const;
};

std::ostream& operator<<(std::ostream& os, const LevenshteinDfa::MatchResult& mos);
std::ostream& operator<<(std::ostream& os, LevenshteinDfa::DfaType dt);
std::ostream& operator<<(std::ostream& os, LevenshteinDfa::Casing c);

}