Add support for case-insensitive matching to Levenshtein DFAs - vespa - An engine for low-latency computation over large data sets

author	Tor Brede Vekterli <vekterli@yahooinc.com>	2023-09-14 14:48:13 +0000
committer	Tor Brede Vekterli <vekterli@yahooinc.com>	2023-09-15 09:28:44 +0000
commit	6d696cd634a6bd4ea1b4c046ade9bfc5b5d246a8 (patch)
tree	69f72bd68348313c7ddbeb96b5811fc03d9ed48d /vespabase
parent	a4360988d590db4a568b71bd6d2bf7f5c81a5a54 (diff)

Add support for case-insensitive matching to Levenshtein DFAs

Adds matching modes `Cased` and `Uncased`.

`Cased` requires UTF-32 code points to match exactly, and successor strings are guaranteed to be strictly greater than the source (candidate) string in `memcmp` order. This mirrors the behavior of the current DFA implementation.

`Uncased` treats all characters as if they were lowercased, in both the target and source strings. The target (query) string is explicitly lowercased at DFA build time to avoid duplicate work. Source strings are implicitly lowercased character by character, on demand, during matching.

Important ordering note: successor strings for `Uncased` are generated _as if_ the source string had originally been entirely lowercase. This requires extra handling when emitting successor prefixes, since we can't blindly copy UTF-8 bytes from the source string as we do when matching in `Cased` mode.

A casing dimension has been added to most parameterized unit tests.
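For illustration only, here is a minimal Python sketch of the two modes. It is not the code in this commit: it uses a plain dynamic-programming edit distance instead of a DFA, and the names `Casing`, `lower_cp`, `matches` and `successor_prefix` are hypothetical. Per-code-point lowercasing is approximated with Python's `str.lower()`, which is an assumption and may differ from the lowercasing tables used in the actual implementation.

```python
from enum import Enum


class Casing(Enum):
    CASED = "cased"
    UNCASED = "uncased"


def lower_cp(ch: str) -> str:
    """Lowercase one code point; keep it unchanged if lowercasing would expand it."""
    low = ch.lower()
    return low if len(low) == 1 else ch


def fold(ch: str, casing: Casing) -> str:
    return lower_cp(ch) if casing is Casing.UNCASED else ch


def matches(target: str, source: str, max_edits: int, casing: Casing) -> bool:
    # The target is normalized once up front, mirroring build-time lowercasing.
    tgt = [fold(c, casing) for c in target]
    prev = list(range(len(tgt) + 1))
    for s_raw in source:
        s = fold(s_raw, casing)  # source characters are folded on demand
        cur = [prev[0] + 1]
        for j, t in enumerate(tgt, start=1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (s != t)))
        prev = cur
    return prev[-1] <= max_edits


def successor_prefix(source: str, n_chars: int, casing: Casing) -> bytes:
    """UTF-8 prefix to emit when a successor shares the first n_chars of source."""
    prefix = source[:n_chars]
    if casing is Casing.CASED:
        return prefix.encode("utf-8")  # raw source bytes can be reused unchanged
    # Uncased: re-encode the lowercased code points instead of copying bytes.
    return "".join(lower_cp(c) for c in prefix).encode("utf-8")
```

With these definitions, `matches("Food", "FOOD", 1, Casing.CASED)` is false while the same call with `Casing.UNCASED` is true. The prefix helper shows why raw bytes cannot be copied in `Uncased` mode: the Kelvin sign U+212A occupies three UTF-8 bytes, but its lowercase form `k` occupies one, so the emitted prefix differs from the source in both length and `memcmp` order.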

Diffstat (limited to 'vespabase')

0 files changed, 0 insertions, 0 deletions
