author Dainius Jocas <> 2023-07-31 13:27:43 +0300
committer Dainius Jocas <> 2023-07-31 13:27:43 +0300
commit 5a60f6f3ae8e99f1f3de10e22a1f055d03fb37db
tree 0f7cc48efba4b6661036a509269868d7354d6af2 /lucene-linguistics/
parent d488a7482e93ae233be571d61946caa796aba588
integrate Lucene Linguistics into the vespa project
Diffstat (limited to 'lucene-linguistics/')
1 files changed, 93 insertions, 0 deletions
diff --git a/lucene-linguistics/ b/lucene-linguistics/
new file mode 100644
index 00000000000..6329811e458
--- /dev/null
+++ b/lucene-linguistics/
@@ -0,0 +1,93 @@
+# Vespa Lucene Linguistics
+A linguistics implementation based on Apache Lucene. Features:
+- a list of default analyzers per language;
+- building custom analyzers through the configuration of the linguistics component;
+- building custom analyzers in Java code and declaring them as `components`.
+## Development
+mvn clean test -U package
+To compile the configuration classes so that IntelliJ IDEA doesn't complain:
+- right-click on `pom.xml`
+- then `Maven`
+- then `Generate Sources and Update Folders`
+## Usage
+Add a `<component>` to the `services.xml` of your application package, e.g.:
+<component id="" bundle="lucene-linguistics">
+  <config name="">
+    <configDir>linguistics</configDir>
+    <analysis>
+      <item key="en">
+        <tokenizer>
+          <name>standard</name>
+        </tokenizer>
+        <tokenFilters>
+          <item>
+            <name>reverseString</name>
+          </item>
+        </tokenFilters>
+      </item>
+    </analysis>
+  </config>
+</component>
+Put the component into every `container` cluster that has `<document-processing/>` and/or `<search>` specified.
+Then package and deploy, e.g.:
+(mvn clean -DskipTests=true -U package && vespa deploy -w 100)
+### Configuration of Lucene Analyzers
+Read the Lucene docs for the subclasses of:
+- `TokenizerFactory` (`org.apache.lucene.analysis.TokenizerFactory`), e.g. `StandardTokenizerFactory`
+- `CharFilterFactory`, e.g. `PatternReplaceCharFilterFactory`
+- `TokenFilterFactory`, e.g. `ReverseStringFilterFactory`
+E.g. the tokenizer `StandardTokenizerFactory` has this config snippet:
+<fieldType name="text_stndrd" class="solr.TextField" positionIncrementGap="100">
+  <analyzer>
+    <tokenizer class="solr.StandardTokenizerFactory" maxTokenLength="255"/>
+  </analyzer>
+</fieldType>
+Then go to the source code of the class on GitHub.
+Copy the value of `public static final String NAME` into `<name>` and note the parameter names used to configure the tokenizer (in this case only `maxTokenLength`).
+<name>standard</name>
+<config>
+  <item key="maxTokenLength">255</item>
+</config>
+The `AnalyzerFactory` constructor logs the available analysis components.
+The analysis components are discovered through Java Service Provider Interface (SPI).
+To add more analysis components, it should be enough to add a Lucene analyzer dependency to the `pom.xml` of your application package,
+or to register the services and create the classes directly in the application package.
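+A minimal sketch of the dependency option, assuming the extra components come from Lucene's Stempel (Polish) analysis module. The `${lucene.version}` property is a placeholder, not something defined by this project, and would have to match the Lucene version bundled with Vespa:
+```xml
+<dependency>
+  <groupId>org.apache.lucene</groupId>
+  <artifactId>lucene-analysis-stempel</artifactId>
+  <!-- assumption: a property you define elsewhere in your own pom.xml -->
+  <version>${lucene.version}</version>
+</dependency>
+```
+After redeploying, the components logged by the `AnalyzerFactory` constructor should include the ones provided by the added module.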
+### Resource files
+The resource files are relative to the component config `configDir`.
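+As an illustration, a stop-word filter could load its word list from a resource file. This is a sketch, not a tested configuration: the `stop` name and the `words` and `ignoreCase` parameters come from Lucene's `StopFilterFactory`, the `<config>` element follows the tokenizer example above, and the file name `en/stopwords.txt` is hypothetical. With `<configDir>linguistics</configDir>`, the path would be resolved relative to that directory:
+```xml
+<tokenFilters>
+  <item>
+    <name>stop</name>
+    <config>
+      <!-- hypothetical file, resolved relative to configDir -->
+      <item key="words">en/stopwords.txt</item>
+      <item key="ignoreCase">true</item>
+    </config>
+  </item>
+</tokenFilters>
+```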
+## Inspiration
+These projects:
+- vespa-chinese-linguistics
+- OpenNlp Linguistics
+- vespa-kuromoji-linguistics
+- a Clojure library for working with Lucene analyzers