lucene-linguistics/README.md


1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93

# Vespa Lucene Linguistics

Linguistics implementation based on Apache Lucene.
Features:
- a list of default analyzers per language;
- building custom analyzers through the configuration of the linguistics component;
- building custom analyzers in Java code and declaring them as `components`.

## Development

Build:
```shell
mvn clean test -U package
```

To compile configuration classes so that Intellij doesn't complain:
- right click on `pom.xml` 
- then `Maven` 
- then `Generate Sources and Update Folders`

## Usage

Add `<component>` to `services.xml` of your application package, e.g.:
```xml
<component id="com.yahoo.language.lucene.LuceneLinguistics" bundle="lucene-linguistics">
  <config name="com.yahoo.language.lucene.lucene-analysis">
    <configDir>linguistics</configDir>
    <analysis>
      <item key="en">
        <tokenizer>
          <name>standard</name>
        </tokenizer>
        <tokenFilters>
          <item>
            <name>reverseString</name>
          </item>
        </tokenFilters>
      </item>
    </analysis>
  </config>
</component>
```
into `container` clusters that has `<document-processing/>` and/or `<search>` specified.

And then package and deploy, e.g.:
```shell
(mvn clean -DskipTests=true -U package && vespa deploy -w 100)
```

### Configuration of Lucene Analyzers

Read the Lucene docs of subclasses of:
- [TokenizerFactory](org.apache.lucene.analysis.TokenizerFactory), e.g. [StandardTokenizerFactory](https://lucene.apache.org/core/9_0_0/core/org/apache/lucene/analysis/standard/StandardTokenizerFactory.html)
- [CharFilterFactory](https://lucene.apache.org/core/9_0_0/core/org/apache/lucene/analysis/CharFilterFactory.html), e.g.  [PatternReplaceCharFilterFactory](https://lucene.apache.org/core/8_1_1/analyzers-common/org/apache/lucene/analysis/pattern/PatternReplaceCharFilterFactory.html)
- [TokenFilterFactory](https://lucene.apache.org/core/8_1_1/analyzers-common/org/apache/lucene/analysis/util/TokenFilterFactory.html), e.g. [ReverseStringFilterFactory](https://lucene.apache.org/core/8_1_1/analyzers-common/org/apache/lucene/analysis/reverse/ReverseStringFilterFactory.html)

E.g. tokenizer `StandardTokenizerFactory` has this config [snippet](https://lucene.apache.org/core/9_0_0/core/org/apache/lucene/analysis/standard/StandardTokenizerFactory.html):
```xml
 <fieldType name="text_stndrd" class="solr.TextField" positionIncrementGap="100">
   <analyzer>
     <tokenizer class="solr.StandardTokenizerFactory" maxTokenLength="255"/>
   </analyzer>
 </fieldType>
```

Then go to the [source code](https://github.com/apache/lucene/blob/17c13a76c87c6246f32dd7a78a26db04401ddb6e/lucene/core/src/java/org/apache/lucene/analysis/standard/StandardTokenizerFactory.java#L36) of the class on Github.
Copy value of the `public static final String NAME` into the `<name>` and observe the names used for configuring the tokenizer (in this case only `maxTokenLength`).
```xml
<tokenizer>
  <name>standard</name>
  <config>
    <item key="maxTokenLength">255</item>
  </config>
</tokenizer>
```

The `AnalyzerFactory` constructor logs the available analysis components.

The analysis components are discovered through Java Service Provider Interface (SPI).
To add more analysis components it should be enough to put a Lucene analyzer dependency into your application package `pom.xml`
or register services and create classes directly in the application package.

### Resource files

The resource files are relative to the component config `configDir`.

## Inspiration

These projects:
- [vespa-chinese-linguistics](https://github.com/vespa-engine/sample-apps/blob/master/examples/vespa-chinese-linguistics/src/main/java/com/qihoo/language/JiebaLinguistics.java).
- [OpenNlp Linguistics](https://github.com/vespa-engine/vespa/blob/50d7555bfe7bdaec86f8b31c4d316c9ba66bb976/opennlp-linguistics/src/main/java/com/yahoo/language/opennlp/OpenNlpLinguistics.java)
- [vespa-kuromoji-linguistics](https://github.com/yahoojapan/vespa-kuromoji-linguistics/tree/main)
- [Clojure library](https://github.com/dainiusjocas/lucene-text-analysis) to work with Lucene analyzers