# Vespa Lucene Linguistics
Linguistics implementation based on Apache Lucene.
Features:
- a list of default analyzers per language;
- building custom analyzers through the configuration of the linguistics component;
- building custom analyzers in Java code and declaring them as `components`.
## Development
Build:
```shell
mvn clean test -U package
```
To compile the configuration classes so that IntelliJ IDEA doesn't complain:
- right click on `pom.xml`
- then `Maven`
- then `Generate Sources and Update Folders`
## Usage
Add a `<component>` to the `services.xml` of your application package, e.g.:
```xml
<component id="com.yahoo.language.lucene.LuceneLinguistics" bundle="lucene-linguistics">
  <config name="com.yahoo.language.lucene.lucene-analysis">
    <configDir>linguistics</configDir>
    <analysis>
      <item key="en">
        <tokenizer>
          <name>standard</name>
        </tokenizer>
        <tokenFilters>
          <item>
            <name>reverseString</name>
          </item>
        </tokenFilters>
      </item>
    </analysis>
  </config>
</component>
```
Place the component in each `container` cluster that has `<document-processing/>` and/or `<search>` specified.
And then package and deploy, e.g.:
```shell
(mvn clean -DskipTests=true -U package && vespa deploy -w 100)
```
### Configuration of Lucene Analyzers
Read the Lucene docs of subclasses of:
- [TokenizerFactory](https://lucene.apache.org/core/9_0_0/core/org/apache/lucene/analysis/TokenizerFactory.html), e.g. [StandardTokenizerFactory](https://lucene.apache.org/core/9_0_0/core/org/apache/lucene/analysis/standard/StandardTokenizerFactory.html)
- [CharFilterFactory](https://lucene.apache.org/core/9_0_0/core/org/apache/lucene/analysis/CharFilterFactory.html), e.g. [PatternReplaceCharFilterFactory](https://lucene.apache.org/core/8_1_1/analyzers-common/org/apache/lucene/analysis/pattern/PatternReplaceCharFilterFactory.html)
- [TokenFilterFactory](https://lucene.apache.org/core/8_1_1/analyzers-common/org/apache/lucene/analysis/util/TokenFilterFactory.html), e.g. [ReverseStringFilterFactory](https://lucene.apache.org/core/8_1_1/analyzers-common/org/apache/lucene/analysis/reverse/ReverseStringFilterFactory.html)
For example, the Javadoc of the `StandardTokenizerFactory` tokenizer shows this Solr config [snippet](https://lucene.apache.org/core/9_0_0/core/org/apache/lucene/analysis/standard/StandardTokenizerFactory.html):
```xml
<fieldType name="text_stndrd" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory" maxTokenLength="255"/>
  </analyzer>
</fieldType>
```
Then go to the [source code](https://github.com/apache/lucene/blob/17c13a76c87c6246f32dd7a78a26db04401ddb6e/lucene/core/src/java/org/apache/lucene/analysis/standard/StandardTokenizerFactory.java#L36) of the class on GitHub.
Copy the value of `public static final String NAME` into `<name>` and note the parameter names used to configure the tokenizer (in this case only `maxTokenLength`):
```xml
<tokenizer>
  <name>standard</name>
  <config>
    <item key="maxTokenLength">255</item>
  </config>
</tokenizer>
```
The `AnalyzerFactory` constructor logs the available analysis components.
The analysis components are discovered through the Java Service Provider Interface (SPI).
To add more analysis components, it should be enough to either add a Lucene analyzer dependency to the `pom.xml` of your application package,
or register services and create the classes directly in the application package.
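
As a rough illustration of how SPI discovery works in general, the sketch below uses only the JDK's `java.util.ServiceLoader`. The `SpiDiscoverySketch` class and the `AnalysisComponent` interface are hypothetical stand-ins for a Lucene factory type, not part of Vespa or Lucene. A provider becomes visible when a `META-INF/services/<binary name of the interface>` file on the classpath lists its implementation class.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.ServiceLoader;

// Hypothetical sketch: not the Vespa or Lucene implementation.
class SpiDiscoverySketch {

    /** Hypothetical service interface standing in for a Lucene factory type. */
    public interface AnalysisComponent {
        String name();
    }

    /** Collects the names of all AnalysisComponent providers found on the classpath. */
    static List<String> discover() {
        List<String> names = new ArrayList<>();
        // ServiceLoader scans META-INF/services entries for the given interface
        // and lazily instantiates each registered provider.
        for (AnalysisComponent component : ServiceLoader.load(AnalysisComponent.class)) {
            names.add(component.name());
        }
        return names;
    }

    public static void main(String[] args) {
        // With no provider registered, the list is empty.
        System.out.println("Discovered components: " + discover());
    }
}
```

Because lookup is driven by what is on the classpath, adding a dependency that ships its own `META-INF/services` registrations is enough to make its components discoverable, without changing the lookup code.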
### Resource files
Resource file paths are resolved relative to the `configDir` set in the component config.
## Inspiration
These projects:
- [vespa-chinese-linguistics](https://github.com/vespa-engine/sample-apps/blob/master/examples/vespa-chinese-linguistics/src/main/java/com/qihoo/language/JiebaLinguistics.java)
- [OpenNlp Linguistics](https://github.com/vespa-engine/vespa/blob/50d7555bfe7bdaec86f8b31c4d316c9ba66bb976/opennlp-linguistics/src/main/java/com/yahoo/language/opennlp/OpenNlpLinguistics.java)
- [vespa-kuromoji-linguistics](https://github.com/yahoojapan/vespa-kuromoji-linguistics/tree/main)
- [Clojure library](https://github.com/dainiusjocas/lucene-text-analysis) to work with Lucene analyzers