Diffstat (limited to 'lucene-linguistics/README.md')
-rw-r--r-- lucene-linguistics/README.md 93
1 files changed, 93 insertions, 0 deletions
diff --git a/lucene-linguistics/README.md b/lucene-linguistics/README.md
new file mode 100644
index 00000000000..6329811e458
--- /dev/null
+++ b/lucene-linguistics/README.md
@@ -0,0 +1,93 @@
+# Vespa Lucene Linguistics
+
+Linguistics implementation based on Apache Lucene.
+Features:
+- a list of default analyzers per language;
+- building custom analyzers through the configuration of the linguistics component;
+- building custom analyzers in Java code and declaring them as `components`.
+
+## Development
+
+Build:
+```shell
+mvn clean test -U package
+```
+
+To compile the configuration classes so that IntelliJ IDEA doesn't complain:
+- right-click on `pom.xml`
+- then `Maven`
+- then `Generate Sources and Update Folders`
+
+## Usage
+
+Add a `<component>` to the `services.xml` of your application package, e.g.:
+```xml
+<component id="com.yahoo.language.lucene.LuceneLinguistics" bundle="lucene-linguistics">
+  <config name="com.yahoo.language.lucene.lucene-analysis">
+    <configDir>linguistics</configDir>
+    <analysis>
+      <item key="en">
+        <tokenizer>
+          <name>standard</name>
+        </tokenizer>
+        <tokenFilters>
+          <item>
+            <name>reverseString</name>
+          </item>
+        </tokenFilters>
+      </item>
+    </analysis>
+  </config>
+</component>
+```
+Place the component in every `container` cluster that has `<document-processing/>` and/or `<search>` specified.
+
+Then package and deploy, e.g.:
+```shell
+(mvn clean -DskipTests=true -U package && vespa deploy -w 100)
+```
+
+### Configuration of Lucene Analyzers
+
+Read the Lucene docs of subclasses of:
+- [TokenizerFactory](https://lucene.apache.org/core/9_0_0/core/org/apache/lucene/analysis/TokenizerFactory.html), e.g. [StandardTokenizerFactory](https://lucene.apache.org/core/9_0_0/core/org/apache/lucene/analysis/standard/StandardTokenizerFactory.html)
+- [CharFilterFactory](https://lucene.apache.org/core/9_0_0/core/org/apache/lucene/analysis/CharFilterFactory.html), e.g. [PatternReplaceCharFilterFactory](https://lucene.apache.org/core/8_1_1/analyzers-common/org/apache/lucene/analysis/pattern/PatternReplaceCharFilterFactory.html)
+- [TokenFilterFactory](https://lucene.apache.org/core/8_1_1/analyzers-common/org/apache/lucene/analysis/util/TokenFilterFactory.html), e.g. [ReverseStringFilterFactory](https://lucene.apache.org/core/8_1_1/analyzers-common/org/apache/lucene/analysis/reverse/ReverseStringFilterFactory.html)
+
+For example, the Javadoc of the `StandardTokenizerFactory` tokenizer includes this Solr-style config [snippet](https://lucene.apache.org/core/9_0_0/core/org/apache/lucene/analysis/standard/StandardTokenizerFactory.html):
+```xml
+ <fieldType name="text_stndrd" class="solr.TextField" positionIncrementGap="100">
+   <analyzer>
+     <tokenizer class="solr.StandardTokenizerFactory" maxTokenLength="255"/>
+   </analyzer>
+ </fieldType>
+```
+
+Then go to the [source code](https://github.com/apache/lucene/blob/17c13a76c87c6246f32dd7a78a26db04401ddb6e/lucene/core/src/java/org/apache/lucene/analysis/standard/StandardTokenizerFactory.java#L36) of the class on GitHub.
+Copy the value of the `public static final String NAME` constant into `<name>` and note the parameter names the factory reads for its configuration (in this case only `maxTokenLength`):
+```xml
+<tokenizer>
+  <name>standard</name>
+  <config>
+    <item key="maxTokenLength">255</item>
+  </config>
+</tokenizer>
+```
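+
+The same lookup should work for char filters. The fragment below is only a sketch: it assumes the config accepts a `charFilters` list shaped like `tokenFilters` (that element name is an assumption and is not shown elsewhere in this README) and it reuses the `<config>` element from the tokenizer example. `patternReplace` is the `NAME` of `PatternReplaceCharFilterFactory`, and `pattern` and `replacement` are its parameters; the regex is purely illustrative.
+```xml
+<item key="en">
+  <!-- assumed element name, by analogy with tokenFilters -->
+  <charFilters>
+    <item>
+      <name>patternReplace</name>
+      <config>
+        <!-- illustrative rule: rewrite underscores to hyphens before tokenizing -->
+        <item key="pattern">_</item>
+        <item key="replacement">-</item>
+      </config>
+    </item>
+  </charFilters>
+  <tokenizer>
+    <name>standard</name>
+  </tokenizer>
+</item>
+```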
+
+The `AnalyzerFactory` constructor logs the available analysis components.
+
+The analysis components are discovered through the Java Service Provider Interface (SPI).
+To add more analysis components, it should be enough to add a Lucene analyzer dependency to the `pom.xml` of your application package,
+or to register the services and create the classes directly in the application package.
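+
+As a sketch of the first option, a dependency on one of the Lucene 9 analysis modules could look like the fragment below. The choice of `lucene-analysis-kuromoji` is only an example, and the `lucene.version` property is a placeholder assumed to be defined in your `pom.xml` to match the Lucene version used by this module.
+```xml
+<dependency>
+  <groupId>org.apache.lucene</groupId>
+  <!-- example module; any Lucene analysis module is added the same way -->
+  <artifactId>lucene-analysis-kuromoji</artifactId>
+  <!-- placeholder property, assumed to be defined elsewhere in the pom -->
+  <version>${lucene.version}</version>
+</dependency>
+```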
+
+### Resource files
+
+Resource file paths are resolved relative to the `configDir` set in the component config.
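+
+For instance, a stop word filter might reference a word list as sketched below. `stop` is the `NAME` of Lucene's `StopFilterFactory`, and `words` and `ignoreCase` are its parameters; the file name `en/stopwords.txt` is hypothetical, and the `<config>` element mirrors the tokenizer example in this README. With `<configDir>linguistics</configDir>`, the file would be expected at `linguistics/en/stopwords.txt`.
+```xml
+<tokenFilters>
+  <item>
+    <name>stop</name>
+    <config>
+      <!-- hypothetical file, resolved relative to configDir -->
+      <item key="words">en/stopwords.txt</item>
+      <item key="ignoreCase">true</item>
+    </config>
+  </item>
+</tokenFilters>
+```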
+
+## Inspiration
+
+These projects:
+- [vespa-chinese-linguistics](https://github.com/vespa-engine/sample-apps/blob/master/examples/vespa-chinese-linguistics/src/main/java/com/qihoo/language/JiebaLinguistics.java)
+- [OpenNlp Linguistics](https://github.com/vespa-engine/vespa/blob/50d7555bfe7bdaec86f8b31c4d316c9ba66bb976/opennlp-linguistics/src/main/java/com/yahoo/language/opennlp/OpenNlpLinguistics.java)
+- [vespa-kuromoji-linguistics](https://github.com/yahoojapan/vespa-kuromoji-linguistics/tree/main)
+- [Clojure library](https://github.com/dainiusjocas/lucene-text-analysis) to work with Lucene analyzers