author: Tor Brede Vekterli <vekterli@yahooinc.com> 2023-09-14 14:48:13 +0000
committer: Tor Brede Vekterli <vekterli@yahooinc.com> 2023-09-15 09:28:44 +0000
commit: 6d696cd634a6bd4ea1b4c046ade9bfc5b5d246a8
tree: 69f72bd68348313c7ddbeb96b5811fc03d9ed48d /vespabase
parent: a4360988d590db4a568b71bd6d2bf7f5c81a5a54
Add support for case-insensitive matching to Levenshtein DFAs
Adds matching modes `Cased` and `Uncased`.

`Cased` requires UTF-32 code points to match exactly, and successor strings are guaranteed to be strictly higher than the source (candidate) string in `memcmp` order. This mirrors the behavior of the current DFA implementation.

`Uncased` treats all characters as if they were lowercased, for both the target and source strings. The target (query) string is explicitly lowercased at DFA build time to avoid duplicate work. Source strings are implicitly lowercased character by character, on demand, during matching.

Important ordering note: successor strings for `Uncased` are generated _as if_ the source string were originally all lowercase. This requires some extra handling when emitting successor prefixes, as we can't just blindly copy UTF-8 bytes from the source string as we do when matching in `Cased` mode.

A new casing dimension has been added to most parameterized unit tests.
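
The difference between the two modes can be illustrated with a small standalone sketch. It is not the DFA implementation from this commit: it uses a plain dynamic-programming edit distance instead of a DFA, lowercases ASCII only, and every name in it (`Casing`, `lower_ascii`, `matches`, `successor_prefix`) is hypothetical. It only mirrors the described behavior: the target is lowercased once up front, source characters are lowercased on demand, and an `Uncased` prefix is emitted in lowercase form rather than copied verbatim.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for the two matching modes named in the commit message.
enum class Casing { Cased, Uncased };

// Simplifying assumption: ASCII-only lowercasing. Real code would need
// Unicode-aware lowercasing of UTF-32 code points.
inline char32_t lower_ascii(char32_t c) {
    return (c >= U'A' && c <= U'Z') ? static_cast<char32_t>(c + 32) : c;
}

inline std::u32string lowercased(std::u32string s) {
    for (auto& c : s) c = lower_ascii(c);
    return s;
}

// Returns true if the edit distance between target and source is <= max_edits.
// Plain two-row Levenshtein DP, used here in place of a DFA.
bool matches(std::u32string target, const std::u32string& source,
             uint32_t max_edits, Casing casing) {
    const bool uncased = (casing == Casing::Uncased);
    // The target (query) is lowercased once, up front.
    if (uncased) target = lowercased(std::move(target));

    std::vector<uint32_t> prev(target.size() + 1), cur(target.size() + 1);
    for (std::size_t j = 0; j <= target.size(); ++j) {
        prev[j] = static_cast<uint32_t>(j);
    }
    for (std::size_t i = 1; i <= source.size(); ++i) {
        // Source characters are lowercased one by one, only when visited.
        const char32_t s = uncased ? lower_ascii(source[i - 1]) : source[i - 1];
        cur[0] = static_cast<uint32_t>(i);
        for (std::size_t j = 1; j <= target.size(); ++j) {
            const uint32_t substitute = prev[j - 1] + (target[j - 1] == s ? 0u : 1u);
            const uint32_t remove     = prev[j] + 1u;
            const uint32_t insert     = cur[j - 1] + 1u;
            cur[j] = std::min({substitute, remove, insert});
        }
        std::swap(prev, cur);
    }
    return prev[target.size()] <= max_edits;
}

// Emits the first `len` code points of the source as they would appear in a
// successor string. In Uncased mode the prefix is lowercased, not copied as-is.
std::u32string successor_prefix(const std::u32string& source, std::size_t len,
                                Casing casing) {
    assert(len <= source.size());
    std::u32string prefix = source.substr(0, len);
    return (casing == Casing::Uncased) ? lowercased(std::move(prefix)) : prefix;
}
```

With this sketch, `matches(U"Food", U"food", 0, Casing::Uncased)` holds while the same call with `Casing::Cased` does not, and `successor_prefix(U"FoObar", 3, Casing::Uncased)` yields `foo` where `Cased` mode would yield `FoO`.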
Diffstat (limited to 'vespabase')
0 files changed, 0 insertions, 0 deletions