| field | value | date |
|---|---|---|
| author | Tor Brede Vekterli <vekterli@yahooinc.com> | 2023-09-14 14:48:13 +0000 |
| committer | Tor Brede Vekterli <vekterli@yahooinc.com> | 2023-09-15 09:28:44 +0000 |
| commit | 6d696cd634a6bd4ea1b4c046ade9bfc5b5d246a8 | |
| tree | 69f72bd68348313c7ddbeb96b5811fc03d9ed48d /vespabase | |
| parent | a4360988d590db4a568b71bd6d2bf7f5c81a5a54 | |

Add support for case-insensitive matching to Levenshtein DFAs
Adds matching modes `Cased` and `Uncased`.
`Cased` requires UTF-32 code points to match exactly, and successor
strings are guaranteed to be strictly higher than the source (candidate)
string in `memcmp` order. This mirrors the behavior of the current
DFA implementation.
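
To make "strictly higher in `memcmp` order" concrete, here is a minimal sketch of a byte-wise comparison over UTF-8 strings. The helper name `memcmp_less` is hypothetical and not part of the Vespa code. The rule that a string sorts before any longer string it is a prefix of is an assumption added here to extend `memcmp` to strings of unequal length.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstring>
#include <string_view>

// Hypothetical helper (not from the commit): returns true iff `a` sorts
// strictly before `b` when both are compared as raw unsigned bytes.
// Assumption: on a shared prefix, the shorter string sorts first.
bool memcmp_less(std::string_view a, std::string_view b) {
    const std::size_t n = std::min(a.size(), b.size());
    // Skip the call for empty input so memcmp never sees a null pointer.
    const int cmp = (n == 0) ? 0 : std::memcmp(a.data(), b.data(), n);
    return (cmp < 0) || (cmp == 0 && a.size() < b.size());
}
```

Under this ordering an uppercase ASCII letter sorts before every lowercase one, and every multi-byte UTF-8 sequence sorts after all ASCII characters, so a valid successor for a source string `s` must satisfy `memcmp_less(s, successor)`.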
`Uncased` treats all characters as if they were lowercased, both
for the target and source strings. The target (query) string is
explicitly lowercased at DFA build-time to avoid duplicate work.
Source strings are implicitly lowercased character by character
on-demand during matching.
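
The difference between the two modes can be sketched with a plain dynamic-programming edit distance standing in for the DFA. This is not the commit's implementation: the names `Casing`, `lower_ascii`, `edit_distance` and `matches` are hypothetical, and lowercasing is restricted to ASCII as a simplifying assumption where a real implementation would need Unicode-aware lowercasing.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

enum class Casing { Cased, Uncased };

// Assumption: ASCII-only lowercasing, to keep the sketch self-contained.
char32_t lower_ascii(char32_t c) {
    return (c >= U'A' && c <= U'Z') ? static_cast<char32_t>(c + (U'a' - U'A')) : c;
}

// Plain Levenshtein distance over UTF-32 code points (stand-in for the DFA).
std::size_t edit_distance(const std::u32string& target, const std::u32string& source,
                          Casing casing) {
    // The target is lowercased once, up front ("build time").
    std::u32string tgt = target;
    if (casing == Casing::Uncased) {
        for (char32_t& c : tgt) { c = lower_ascii(c); }
    }
    std::vector<std::size_t> prev(tgt.size() + 1), cur(tgt.size() + 1);
    for (std::size_t j = 0; j <= tgt.size(); ++j) { prev[j] = j; }
    for (std::size_t i = 1; i <= source.size(); ++i) {
        // The source is lowercased one character at a time, on demand.
        const char32_t s = (casing == Casing::Uncased) ? lower_ascii(source[i - 1])
                                                       : source[i - 1];
        cur[0] = i;
        for (std::size_t j = 1; j <= tgt.size(); ++j) {
            const std::size_t subst = prev[j - 1] + ((s == tgt[j - 1]) ? 0 : 1);
            cur[j] = std::min({subst, prev[j] + 1, cur[j - 1] + 1});
        }
        std::swap(prev, cur);
    }
    return prev[tgt.size()];
}

bool matches(const std::u32string& target, const std::u32string& source,
             std::size_t max_edits, Casing casing) {
    return edit_distance(target, source, casing) <= max_edits;
}
```

With this sketch, `matches(U"Food", U"FOOD", 0, Casing::Uncased)` holds while the same call with `Casing::Cased` does not, since three code points differ in case.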
Important ordering note: successor strings for `Uncased` are generated
_as if_ the source string had originally been entirely lowercase.
This requires some extra handling when emitting successor
prefixes, as we can't just blindly copy UTF-8 bytes from the source
string as we do when matching in `Cased` mode.
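
The extra handling can be illustrated as follows: rather than copying the source's UTF-8 bytes, each decoded code point of the prefix is lowercased and then re-encoded. This is only a sketch of the idea, not the code in this commit. `to_lower_ascii`, `append_utf8` and `uncased_successor_prefix` are hypothetical names, lowercasing is ASCII-only by assumption, and the encoder does not validate surrogates or out-of-range code points.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>

// Assumption: ASCII-only lowercasing, to keep the sketch self-contained.
char32_t to_lower_ascii(char32_t c) {
    return (c >= U'A' && c <= U'Z') ? static_cast<char32_t>(c + (U'a' - U'A')) : c;
}

// Appends the UTF-8 encoding of `c` to `out` (no validation of `c`).
void append_utf8(std::string& out, char32_t c) {
    if (c < 0x80) {
        out.push_back(static_cast<char>(c));
    } else if (c < 0x800) {
        out.push_back(static_cast<char>(0xC0 | (c >> 6)));
        out.push_back(static_cast<char>(0x80 | (c & 0x3F)));
    } else if (c < 0x10000) {
        out.push_back(static_cast<char>(0xE0 | (c >> 12)));
        out.push_back(static_cast<char>(0x80 | ((c >> 6) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | (c & 0x3F)));
    } else {
        out.push_back(static_cast<char>(0xF0 | (c >> 18)));
        out.push_back(static_cast<char>(0x80 | ((c >> 12) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | ((c >> 6) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | (c & 0x3F)));
    }
}

// Emits the first `n_chars` code points of `source` as lowercased UTF-8.
// In a cased mode the same prefix could be a plain byte copy of the source.
std::string uncased_successor_prefix(const std::u32string& source, std::size_t n_chars) {
    std::string out;
    const std::size_t n = std::min(n_chars, source.size());
    for (std::size_t i = 0; i < n; ++i) {
        append_utf8(out, to_lower_ascii(source[i]));
    }
    return out;
}
```

For a source of `FOOd`, the emitted three-character prefix is `foo`, whose bytes differ from the source's own `FOO`. A byte copy would therefore place the successor at the wrong position in `memcmp` order relative to the lowercased form.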
A new casing dimension has been added to most parameterized unit
tests.