vespa - An engine for low-latency computation over large data sets

	Commit message	Author	Age	Files	Lines
*	Merge pull request #28714 from vespa-engine/havardpe/better-graphviz-for-table-dfa	Håvard Pettersen	2023-09-29	1	-8/+8
\|\	dump table_dfa as actual dfa in graphviz
\| *	dump table_dfa as actual dfa in graphviz	Håvard Pettersen	2023-09-28	1	-8/+8
\| \|	enumerate states based on best-edge-first
\| \|	'*' means any character without its own edge
* \|	Preserve prefix of input DFA successor string	Tor Brede Vekterli	2023-09-27	1	-4/+37
\|/	If a non-empty string is passed as a successor to the DFA, the contents of the string will be preserved, i.e. the successor will always be _appended_ to any existing data. This allows for less manual fiddling when implementing prefix locking by the caller (no need to concatenate a prefix with the generated successor string).
\|	Note: this has some added cognitive cost where the caller now has the entire responsibility of resetting the successor between calls.
\|	The existing fuzzy matcher has been updated to no longer require a separation between successor prefix and suffix; it can now safely reuse the successor prefix between calls.
*	Merge pull request #28677 from vespa-engine/havardpe/inline-table-dfa	Håvard Pettersen	2023-09-27	1	-6/+103
\|\	use inline pre-generated tables
\| *	use inline pre-generated tables	Håvard Pettersen	2023-09-26	1	-6/+103
\| \|
* \|	Merge pull request #28674 from vespa-engine/balder/minor-code-health	Geir Storli	2023-09-26	1	-1/+1
\|\ \ \| \|/ \|/\|	Minor code health
\| *	Minor code health	Henning Baldersheim	2023-09-26	1	-1/+1
\| \|
* \|	Merge pull request #28652 from vespa-engine/havardpe/table-dfa	Tor Brede Vekterli	2023-09-26	3	-3/+313
\|\ \ \| \|/ \|/\|	table dfa
\| *	table dfa	Håvard Pettersen	2023-09-25	3	-3/+313
\| \|
* \|	- Use stash instead of the single use of VariableSizeVector.	Henning Baldersheim	2023-09-25	2	-50/+12
\|/
*	Use the Guard when testing bundle pool	Henning Baldersheim	2023-09-20	1	-20/+15
\|
*	Support raw UTF-32 successor string output	Tor Brede Vekterli	2023-09-18	1	-8/+19
\|	Avoids need to encode UTF-32 characters to UTF-8 internally, as this is likely to be subsequently reversed by a caller that itself operates on UTF-32 code points.
\|	Change the `match()` API by introducing a separate overload that does not produce a successor, and add two explicit successor string type overloads that take the string by ref, not pointer.
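For context on the conversion this commit lets callers skip, here is a minimal sketch of standard UTF-32 to UTF-8 encoding. The names `append_utf8` and `to_utf8` are hypothetical and are not taken from vespalib; the sketch assumes every input is a valid Unicode scalar value.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper (not the vespalib API): appends the UTF-8 encoding
// of one code point to `out`. Assumes `cp` is a valid Unicode scalar value.
void append_utf8(std::string& out, char32_t cp) {
    assert(cp <= 0x10FFFF);
    if (cp < 0x80) {
        // 1 byte: 0xxxxxxx
        out.push_back(static_cast<char>(cp));
    } else if (cp < 0x800) {
        // 2 bytes: 110xxxxx 10xxxxxx
        out.push_back(static_cast<char>(0xC0 | (cp >> 6)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    } else if (cp < 0x10000) {
        // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out.push_back(static_cast<char>(0xE0 | (cp >> 12)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    } else {
        // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out.push_back(static_cast<char>(0xF0 | (cp >> 18)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 12) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    }
}

// Encodes a whole UTF-32 string. A caller that works on code points would
// have to decode this result again, which is the round trip that a raw
// UTF-32 successor output avoids.
std::string to_utf8(const std::u32string& in) {
    std::string out;
    out.reserve(in.size());
    for (char32_t cp : in) {
        append_utf8(out, cp);
    }
    return out;
}
```

Every successor character pays this per-code-point branching when the output is UTF-8, so a caller that already holds UTF-32 data gains nothing from it.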
*	Use make_for_lookup() member function on existing comparator	Tor Egge	2023-09-18	2	-15/+17
\|	to make a new comparator which is used for lookup.
*	Add comparator to unique store.	Tor Egge	2023-09-18	1	-4/+4
\|
*	Add support for case-insensitive matching to Levenshtein DFAs	Tor Brede Vekterli	2023-09-15	1	-57/+165
\|	Adds matching modes `Cased` and `Uncased`.
\|	`Cased` requires UTF-32 code points to match exactly, and successor strings are guaranteed to be strictly higher than the source (candidate) string in `memcmp` order. This mirrors the behavior of the current DFA implementation.
\|	`Uncased` treats all characters as if they were lowercased, both for the target and source strings. The target (query) string is explicitly lowercased at DFA build-time to avoid duplicate work. Source strings are implicitly lowercased character by character on-demand during matching.
\|	Important ordering note: Successor strings for `Uncased` are generated _as if_ the source string was originally all in lowercase form. This requires some extra added handling when emitting successor prefixes, as we can't just blindly copy UTF-8 bytes from the source string as we do when matching in `Cased` mode.
\|	A new casing-dimension has been added to most parameterized unit tests.
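The split between build-time and on-demand lowercasing described above can be sketched as follows. This is only an illustration: `ExactMatcher` and `lower_ascii` are hypothetical names, the lowercasing is an ASCII-only stand-in for real Unicode case conversion, and exact equality stands in for Levenshtein matching to keep the example short.

```cpp
#include <cstddef>
#include <string>
#include <utility>

enum class Casing { Cased, Uncased };

// ASCII-only stand-in for a real Unicode lowercasing function (assumption).
constexpr char32_t lower_ascii(char32_t c) {
    return (c >= U'A' && c <= U'Z') ? static_cast<char32_t>(c + (U'a' - U'A')) : c;
}

// Hypothetical matcher, not the vespalib class. It uses exact equality
// instead of edit distance so that only the casing logic is on display.
class ExactMatcher {
    std::u32string _target;
    Casing         _casing;
public:
    ExactMatcher(std::u32string target, Casing casing)
        : _target(std::move(target)), _casing(casing)
    {
        if (_casing == Casing::Uncased) {
            // Lowercase the target once, at build time.
            for (char32_t& c : _target) {
                c = lower_ascii(c);
            }
        }
    }

    bool matches(const std::u32string& source) const {
        if (source.size() != _target.size()) {
            return false;
        }
        for (std::size_t i = 0; i < source.size(); ++i) {
            // Lowercase each source character on demand; the source string
            // itself is never copied or modified.
            const char32_t c = (_casing == Casing::Uncased) ? lower_ascii(source[i]) : source[i];
            if (c != _target[i]) {
                return false;
            }
        }
        return true;
    }
};
```

With this shape, `ExactMatcher(U"Foo", Casing::Uncased).matches(U"fOO")` holds, while the `Cased` variant only accepts `U"Foo"`.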
*	Remove FastOS_DirectoryScan	Tor Egge	2023-09-01	1	-10/+0
\|
*	Add saturation metric for executors.	Geir Storli	2023-08-31	1	-0/+21
\|	This should make it easier to observe bottlenecks in one of the underlying executor threads used in the "field writer" SequencedTaskExecutor.
*	added pop_back function to SmallVector	Håvard Pettersen	2023-08-28	1	-0/+26
\|	follow std::vector by making it undefined for empty vectors
*	Use 128 bytes alignment for small allocations in MmapFileAllocator.	Tor Egge	2023-08-25	1	-8/+8
\|
*	Extend test for reusing file offset.	Tor Egge	2023-08-24	1	-2/+15
\|
*	Use premmapped areas for smaller allocations than _small_limit.	Tor Egge	2023-08-24	1	-14/+39
\|
*	Add premmapped areas to file area freelist.	Tor Egge	2023-08-24	1	-0/+29
\|
*	Revert "Sample datastore stash memory usage in write thread."	Tor Egge	2023-08-22	2	-4/+4
\|
*	Sample datastore stash memory usage in write thread.	Tor Egge	2023-08-22	2	-4/+4
\|
*	Disable two alloc unit tests when using any sanitizer.	Tor Egge	2023-08-21	1	-3/+3
\|
*	GC unused direct IO support	Henning Baldersheim	2023-08-15	1	-9/+0
\|
*	GC and clean up more unused code	Henning Baldersheim	2023-08-15	1	-33/+0
\|
*	GC unused File code and other fallout.	Henning Baldersheim	2023-08-15	2	-189/+4
\|
*	Add support for creating ConstArrayRef from std::array	Henning Baldersheim	2023-08-08	1	-0/+20
\|
*	Prefer std::filesystem::exists over FastOS_StatInfo	Henning Baldersheim	2023-07-25	1	-31/+11
\|
*	Use std::filesystem::current_path	Tor Egge	2023-07-21	1	-10/+0
\|
*	Remove vespalib::stat and vespalib::getFileSize.	Tor Egge	2023-07-20	1	-10/+4
\|
*	Use std::filesystem::is_directory and std::filesystem::exists	Tor Egge	2023-07-20	3	-7/+13
\|
*	Remove vespalib::pathExists, vespalib::isPlainFile and vespalib::isSymLink.	Tor Egge	2023-07-20	1	-2/+0
\|
*	Remove vespalib::symlink and vespalib::readLink	Tor Egge	2023-07-20	1	-62/+0
\|
*	Remove vespalib::unlink.	Tor Egge	2023-07-20	1	-33/+4
\|
*	Remove vespalib::copy and vespalib::rename.	Tor Egge	2023-07-20	1	-127/+0
\|
*	Use std::filesystem::rename instead of vespalib::rename.	Tor Egge	2023-07-19	1	-2/+3
\|
*	Drop non ancient non const GetSize/GetPosition	Henning Baldersheim	2023-07-18	1	-13/+13
\|
*	GC unused OpenExisting	Henning Baldersheim	2023-07-18	1	-1/+1
\|
*	Implement Levenshtein DFAs with successor string generation	Tor Brede Vekterli	2023-07-18	2	-0/+516
\|	This implements code for building and evaluating Levenshtein Deterministic Finite Automata, where the resulting DFA efficiently matches all possible source strings that can be transformed to the target string within k max edits. This allows for O(n) matching of strings. We currently support k in {1, 2}.
\|	Additionally, when matching using a DFA, in the case where the source string does _not_ match, we can generate the _successor_ string; the next matching string that is lexicographically _greater_ than the source string. This string has the invariant that there are no possibly matching strings within k edits ordered after the source string but before the successor. This lets us do possibly massive leaps forward in an ordered dictionary, turning a scan for matches into a sublinear operation.
\|	Matching and successor generation is fully Unicode-aware. All input strings are expected to be in UTF-8 (without nulls), and the generated successor is also encoded as UTF-8. Internally, matching is done on UTF-32 code points and the DFA itself is built around UTF-32. UTF-8 decoding of source strings is done in a streaming fashion and does not require any allocations.
\|	This commit includes a templated core Levenshtein DFA matching (and successor generation) algorithm and two separate DFA implementations that can be used; one explicit and one implicit.
\|	The explicit DFA is an immutable DAG built up-front that represents all DFA states and transitions as explicit nodes and edges in a graph. This is currently the fastest to evaluate, but the build time and memory usage means its usage should be preferred for shorter strings (up to a few hundred chars).
\|	The implicit DFA does not build any graph up-front, but rather evaluates state transitions on-demand for any given source string. This is currently slower than the explicit DFA, but its O(1) memory usage (aside from the memory used by the target string itself) means that it can be used for arbitrary string lengths.
\|	This code currently exists as a freestanding vespalib utility, and is not yet wired to any production code (fuzzy matching or similar).
\|	Future optimizations:
\|	* Redesign sparse state representation and stepping logic to be much less branching, in turn making the code much less likely to stall the CPU pipeline.
\|	* Emit as much as possible of the successor string suffix by copying directly from the target string UTF-8 representation instead of following the DFA and encoding UTF-32 to UTF-8 chars.
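To make the idea of evaluating state transitions on demand more concrete, here is a minimal sketch that treats one row of the classic edit distance matrix as the automaton state and steps it once per source character. This is the textbook dynamic programming formulation, not the sparse state representation from the commit: each state here costs memory proportional to the target length, and no successor string is generated. The class name `RowStepper` and its members are hypothetical.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

// Hypothetical sketch, not the vespalib implementation.
class RowStepper {
    std::u32string _target;
    uint32_t       _max_edits;
public:
    // state[i] = edit distance between the source consumed so far
    // and the first i code points of the target.
    using State = std::vector<uint32_t>;

    RowStepper(std::u32string target, uint32_t max_edits)
        : _target(std::move(target)), _max_edits(max_edits) {}

    // Start state: the empty source needs i insertions to reach target[0, i).
    State start() const {
        State s(_target.size() + 1);
        for (std::size_t i = 0; i < s.size(); ++i) {
            s[i] = static_cast<uint32_t>(i);
        }
        return s;
    }

    // Transition on one source code point; computed on demand, no graph is built.
    State step(const State& prev, char32_t c) const {
        State next(prev.size());
        next[0] = prev[0] + 1u;
        for (std::size_t i = 1; i < prev.size(); ++i) {
            const uint32_t subst  = prev[i - 1] + ((_target[i - 1] == c) ? 0u : 1u);
            const uint32_t del    = prev[i] + 1u;
            const uint32_t insert = next[i - 1] + 1u;
            next[i] = std::min({subst, del, insert});
        }
        return next;
    }

    // A state is dead when every entry exceeds the edit budget; the row
    // minimum never decreases, so no suffix can bring it back to a match.
    bool can_match(const State& s) const {
        return *std::min_element(s.begin(), s.end()) <= _max_edits;
    }

    bool is_match(const State& s) const {
        return s.back() <= _max_edits;
    }

    bool matches(const std::u32string& source) const {
        State s = start();
        for (char32_t c : source) {
            s = step(s, c);
            if (!can_match(s)) {
                return false; // stop early on a dead state
            }
        }
        return is_match(s);
    }
};
```

For example, `RowStepper(U"food", 1).matches(U"fod")` holds because a single insertion suffices, while `U"fo"` needs two edits and is only accepted with a budget of 2. The point where `can_match` first fails is the mismatch position that a successor generator would build on; that part is left out of this sketch.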
*	Remove FastOS_File::Delete().	Tor Egge	2023-07-17	1	-2/+2
\|
*	Use std::filesystem::remove in unit tests.	Tor Egge	2023-07-14	1	-12/+19
\|
*	Use std::filesystem in buffered file unit test.	Tor Egge	2023-07-14	1	-10/+15
\|
*	Avoid livelock when running rcu vector unit test with valgrind.	Tor Egge	2023-07-10	1	-0/+17
\|
*	Use provided memory allocator for large arrays.	Tor Egge	2023-07-05	1	-0/+10
\|
*	Merge pull request #27547 from vespa-engine/havardpe/est-80-percentile-as-result	Henning Baldersheim	2023-06-26	1	-39/+84
\|\	use estimated 80 percentile as benchmark result
\| *	add comment	Håvard Pettersen	2023-06-26	1	-0/+17
\| \|
\| *	use estimated 80 percentile as benchmark result	Håvard Pettersen	2023-06-26	1	-39/+67
\| \|	also simplify somewhat
* \|	Add max buffer size parameter to array store dynamic type mapper.	Tor Egge	2023-06-26	2	-18/+30
\| \|