path: root/linguistics
Commit message  Author  Age  Files  Lines
* Fix CR comments  MariusArhaug  2024-04-30  7  -46/+34
|
* Update significance model field and logic from architect meeting  MariusArhaug  2024-04-24  11  -117/+246
|
* Merge pull request #30871 from vespa-engine/marius/add-significance-searcher  Marius Arhaug  2024-04-24  4  -11/+18
|\
| |   Add significance searcher
| * update abi-spec  MariusArhaug  2024-04-16  1  -1/+1
| |
| * fix cr failures  MariusArhaug  2024-04-16  3  -10/+17
| |
* | Replace all usages of Arrays.asList with List.of where possible.  Henning Baldersheim  2024-04-12  6  -30/+25
| |
* | Merge pull request #30809 from vespa-engine/jobergum/add-context-caching  Jo Kristian Bergum  2024-04-10  2  -9/+17
|\ \
| |/
|/|   Add onnx output caching to embedder (allow different post-processing of model outputs)
| * Key by embedder id and don't recompute inputs  Jon Bratseth  2024-04-07  2  -10/+11
| |
| * Add equivalent to `Map.computeIfAbsent()` to simplify typical usage of the cache  Bjørn Christian Seime  2024-04-04  2  -2/+9
| |   Current interface requires a lot of boilerplate code.
* | Merge pull request #30816 from vespa-engine/marius/add-significance-model-registry  Marius Arhaug  2024-04-09  12  -1/+385
|\ \
| | |   Add significance model registry to linguistics
| * | add missing beta annotation  MariusArhaug  2024-04-09  1  -0/+4
| | |
| * | add illegal arg exception for languages not registered  MariusArhaug  2024-04-09  2  -1/+8
| | |
| * | fix cr failures  MariusArhaug  2024-04-09  12  -52/+104
| | |
| * | add significance model registry to linguistics  MariusArhaug  2024-04-04  10  -1/+322
| |/
* | add comment for intention in determineScript function  MariusArhaug  2024-04-04  1  -0/+1
| |
* | Add SimpleTokenScript to SimpleTokenizer  MariusArhaug  2024-04-03  4  -1/+124
|/
|   When parsing datasets such as WikiDumps to a significance model, we want to only keep characters of that language script within our model. So when adding the script value to our tokenizer we are able to use this to filter out non-latin words when creating an english significance model for example.
* Expose cache to embedders  Jon Bratseth  2024-04-01  2  -1/+27
|
* Update ABI spec  Jon Bratseth  2024-02-16  1  -3/+1
|
* Pass context when resolving properties  Jon Bratseth  2024-02-15  1  -9/+0
|
* ChainedMap can't be copied  Jon Bratseth  2024-01-20  1  -1/+1
|
* Revert "Merge pull request #29905 from vespa-engine/revert-29884-bratseth/param-refs-in-embed"  Jon Bratseth  2024-01-20  2  -1/+13
|   This reverts commit c6b547c0c2898a324983356aa677ea3082533f7d, reversing changes made to 8c7f8c17ad5e1de5adcc71ee34f2a3c1cd36d6bd.
* Revert "Support parameter references in embed"  Henning Baldersheim  2024-01-15  2  -13/+1
|
* Support parameter references in embed  Jon Bratseth  2024-01-12  2  -1/+13
|   Support embed(@myParameter) in addition to embed('text to embed')
* Revert "Merge pull request #29328 from vespa-engine/revert-29314-bratseth/casing-take-2"  Jon Bratseth  2023-11-14  4  -13/+30
|   This reverts commit a72e949533a46d665440a9c72ca2b8fb58f3a9c3, reversing changes made to 944d635d00e165166508ef23399e9ed65a87a9c8.
* Revert "Bratseth/casing take 2"  Harald Musum  2023-11-13  4  -30/+13
|
* Prefer first stem to original if non equal  Jon Bratseth  2023-11-10  2  -11/+28
|
* Revert "Revert "Don't lowercase linguistics annotations""  Jon Bratseth  2023-11-09  2  -2/+2
|   This reverts commit 0dfd4fe4c6ddbded490da36e71f27c4b70aa4226.
* Revert "Don't lowercase linguistics annotations"  Jon Bratseth  2023-11-09  2  -2/+2
|
* Don't lowercase linguistics annotations  Jon Bratseth  2023-11-09  2  -2/+2
|   Tokens are already lowercased by our bundled linguistics components. Lowercasing again when annotating precludes plugging in a linguistics component which preserves casing.
* Avoid cutting surrogate pairs when tokenising  jonmv  2023-10-20  1  -1/+1
|
* Update copyright  Jon Bratseth  2023-10-09  73  -73/+73
|
* Use Guice 6.0  Bjørn Christian Seime  2023-09-04  1  -1/+1
|   https://github.com/google/guice/wiki/Guice600
|   We cannot upgrade to 7.x as we export javax.inject from container. 6.x supports both the old javax.inject and the new jakarta.inject replacement.
* Allow sampling of fractional millis  Bjørn Christian Seime  2023-08-25  2  -4/+3
|
* Add generic metrics for embedders  Bjørn Christian Seime  2023-08-04  2  -1/+56
|
* Add necessary options to use failOnWarnings  gjoranv  2023-06-05  1  -0/+1
|
* Don't remove indexable symbols when stemming  Jon Bratseth  2023-06-02  5  -8/+17
|
* Add bundle type to all CORE bundles.  gjoranv  2023-05-25  1  -0/+3
|
* Update ABI spec  Jon Bratseth  2023-05-22  1  -0/+1
|
* Always treat each symbol as a separate token  Jon Bratseth  2023-05-22  4  -20/+56
|
* Threat 'other symbols' as letters  Jon Bratseth  2023-05-22  2  -2/+10
|   The unicode class 'other symbols' contains emojis, math symbols, etc. Treat these as letter characters to support searching for them.
* Use dollar and hour base units  Jon Bratseth  2023-05-19  1  -2/+2
|
* Use metric enums everywhere  Jon Bratseth  2023-03-06  1  -1/+1
|
* Add abi spec  Lester Solbakken  2023-02-10  1  -0/+1
|
* Add decoding of sentencepiece token sequence to text  Lester Solbakken  2023-02-10  1  -0/+11
|
* Compute code points in whole string only when needed  jonmv  2022-12-06  2  -6/+17
|
* Split out opennlp-linguistics  Henning Baldersheim  2022-11-26  14  -783/+0
|
* Update ABI spec format, and update all specs  jonmv  2022-10-25  1  -198/+198
|
* much simpler CharSequenceNormalizer  Arne Juul  2022-10-06  3  -9/+100
|
* Merge pull request #24007 from vespa-engine/bratseth/cleanup-082  Jon Bratseth  2022-09-25  2  -13/+11
|\
| |   No functional changes
| * No functional changes  Jon Bratseth  2022-09-11  2  -13/+11
| |