| Commit message (Collapse) | Author | Age | Files | Lines |
|\
| |
| |
| |
| | |
vespa-engine/havardpe/better-graphviz-for-table-dfa
dump table_dfa as actual dfa in graphviz
|
| |
| |
| |
| |
| | |
enumerate states based on best-edge-first
'*' means any character without its own edge
|
|/
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
If a non-empty string is passed as a successor to the DFA,
the contents of the string will be preserved, i.e. the successor
will always be _appended_ to any existing data. This allows
for less manual fiddling when implementing prefix locking by the
caller (no need to concatenate a prefix with the generated successor
string).
Note: this has some added cognitive cost where the caller now has
the entire responsibility of resetting the successor between calls.
The existing fuzzy matcher has been updated to no longer require
a separation between successor prefix and suffix; it can now
safely reuse the successor prefix between calls.
|
|\
| |
| | |
use inline pre-generated tables
|
| | |
|
|\ \
| |/
|/| |
Minor code health
|
| | |
|
|\ \
| |/
|/| |
table dfa
|
| | |
|
|/ |
|
| |
|
|
|
|
|
|
|
|
|
|
| |
Avoids need to encode UTF-32 characters to UTF-8 internally, as
this is likely to be subsequently reversed by a caller that itself
operates on UTF-32 code points.
Change the `match()` API by introducing a separate overload that
does not produce a successor, and add two explicit successor string
type overloads that take the string by ref, not pointer.
|
|
|
|
| |
to make a new comparator which is used for lookup.
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Adds matching modes `Cased` and `Uncased`.
`Cased` requires UTF-32 code points to match exactly, and successor
strings are guaranteed to be strictly higher than the source (candidate)
string in `memcmp` order. This mirrors the behavior of the current
DFA implementation.
`Uncased` treats all characters as if they were lowercased, both
for the target and source strings. The target (query) string is
explicitly lowercased at DFA build-time to avoid duplicate work.
Source strings are implicitly lowercased character by character
on-demand during matching.
Important ordering note: Successor strings for `Uncased` are generated
_as if_ the source string was originally all in lowercase form.
This requires some extra added handling when emitting successor
prefixes, as we can't just blindly copy UTF-8 bytes from the source
string as we do when matching in `Cased` mode.
A new casing-dimension has been added to most parameterized unit
tests.
|
| |
|
|
|
|
|
| |
This should make it easier to observe bottlenecks in one of the
underlying executor threads used in the "field writer" SequencedTaskExecutor.
|
|
|
|
| |
follow std::vector by making it undefined for empty vectors
|
| |
|
| |
|
| |
|
| |
|
| |
|
| |
|
| |
|
| |
|
| |
|
| |
|
| |
|
| |
|
| |
|
| |
|
| |
|
| |
|
| |
|
| |
|
| |
|
| |
|
| |
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This implements code for building and evaluating Levenshtein
Deterministic Finite Automata, where the resulting DFA efficiently
matches all possible source strings that can be transformed to the
target string within k max edits. This allows for O(n) matching
of strings. We currently support k in {1, 2}.
Additionally, when matching using a DFA, in the case where the
source string does _not_ match, we can generate the _successor_ string;
the next matching string that is lexicographically _greater_ than
the source string. This string has the invariant that there are no
possibly matching strings within k edits ordered after the source
string but before the successor.
This lets us do possibly massive leaps forward in an ordered
dictionary, turning a scan for matches into a sublinear operation.
Matching and successor generation is fully Unicode-aware. All input
strings are expected to be in UTF-8 (without nulls), and the generated
successor is also encoded as UTF-8. Internally, matching is done on
UTF-32 code points and the DFA itself is built around UTF-32.
UTF-8 decoding of source strings is done in a streaming fashion and
does not require any allocations.
This commit includes a templated core Levenshtein DFA matching (and
successor generation) algorithm and two separate DFA implementations
that can be used; one explicit and one implicit.
The explicit DFA is an immutable DAG built up-front that represents
all DFA states and transitions as explicit nodes and edges in a
graph. This is currently the fastest to evaluate, but the build time
and memory usage means its usage should be preferred for shorter
strings (up to a few hundred chars).
The implicit DFA does not build any graph up-front, but rather
evaluates state transitions on-demand for any given source string.
This is currently slower than the explicit DFA, but its O(1)
memory usage (aside from the memory used by the target string itself)
means that it can be used for arbitrary string lengths.
This code currently exists as a freestanding vespalib utility, and
is not yet wired to any production code (fuzzy matching or similar).
Future optimizations:
* Redesign sparse state representation and stepping logic to be
much less branching, in turn making the code much less likely to
stall the CPU pipeline.
* Emit as much as possible of the successor string suffix by copying
directly from the target string UTF-8 representation instead of
following the DFA and encoding UTF-32 to UTF-8 chars.
|
| |
|
| |
|
| |
|
| |
|
| |
|
|\
| |
| | |
use estimated 80 percentile as benchmark result
|
| | |
|
| |
| |
| |
| | |
also simplify somewhat
|
| | |
|