author Kristian Aune <kraune@verizonmedia.com> 2023-09-21 08:35:01 +0200
committer Kristian Aune <kraune@verizonmedia.com> 2023-09-21 08:35:01 +0200
commit d2428cbfb723fb3268499eb9b59fd21a8cf9e62e
tree d9f0c643c49706cc3d9554ce015dd6afc2f64310 /lucene-linguistics
parent c21a9060f0779566bfcf9f776484d64975dee7f4
Ignore from htmlproofer linkcheck
Diffstat (limited to 'lucene-linguistics')
-rw-r--r-- lucene-linguistics/README.md 21
1 files changed, 16 insertions, 5 deletions
diff --git a/lucene-linguistics/README.md b/lucene-linguistics/README.md
index feece2b2366..a3b20b94bf9 100644
--- a/lucene-linguistics/README.md
+++ b/lucene-linguistics/README.md
@@ -3,6 +3,7 @@
Linguistics implementation based on the [Apache Lucene](https://lucene.apache.org).
Features:
+
- a list of default analyzers per language;
- building custom analyzers through the configuration of the linguistics component;
- building custom analyzers in Java code and declaring them as `components`.
@@ -10,18 +11,21 @@ Features:
## Development
Build:
+
```shell
mvn clean test -U package
```
To compile configuration classes so that Intellij doesn't complain:
-- right click on `pom.xml`
-- then `Maven`
+
+- right click on `pom.xml`
+- then `Maven`
- then `Generate Sources and Update Folders`
## Usage
Add `<component>` to `services.xml` of your application package, e.g.:
+
```xml
<component id="com.yahoo.language.lucene.LuceneLinguistics" bundle="lucene-linguistics">
<config name="com.yahoo.language.lucene.lucene-analysis">
@@ -41,9 +45,11 @@ Add `<component>` to `services.xml` of your application package, e.g.:
</config>
</component>
```
+
into `container` clusters that have `<document-processing/>` and/or `<search>` specified.
And then package and deploy, e.g.:
+
```shell
(mvn clean -DskipTests=true -U package && vespa deploy -w 100)
```
@@ -51,11 +57,13 @@ And then package and deploy, e.g.:
### Configuration of Lucene Analyzers
Read the Lucene docs of subclasses of:
+
- [TokenizerFactory](org.apache.lucene.analysis.TokenizerFactory), e.g. [StandardTokenizerFactory](https://lucene.apache.org/core/9_0_0/core/org/apache/lucene/analysis/standard/StandardTokenizerFactory.html)
-- [CharFilterFactory](https://lucene.apache.org/core/9_0_0/core/org/apache/lucene/analysis/CharFilterFactory.html), e.g. [PatternReplaceCharFilterFactory](https://lucene.apache.org/core/8_1_1/analyzers-common/org/apache/lucene/analysis/pattern/PatternReplaceCharFilterFactory.html)
+- [CharFilterFactory](https://lucene.apache.org/core/9_0_0/core/org/apache/lucene/analysis/CharFilterFactory.html), e.g. [PatternReplaceCharFilterFactory](https://lucene.apache.org/core/8_1_1/analyzers-common/org/apache/lucene/analysis/pattern/PatternReplaceCharFilterFactory.html)
- [TokenFilterFactory](https://lucene.apache.org/core/8_1_1/analyzers-common/org/apache/lucene/analysis/util/TokenFilterFactory.html), e.g. [ReverseStringFilterFactory](https://lucene.apache.org/core/8_1_1/analyzers-common/org/apache/lucene/analysis/reverse/ReverseStringFilterFactory.html)
E.g. tokenizer `StandardTokenizerFactory` has this config [snippet](https://lucene.apache.org/core/9_0_0/core/org/apache/lucene/analysis/standard/StandardTokenizerFactory.html):
+
```xml
<fieldType name="text_stndrd" class="solr.TextField" positionIncrementGap="100">
<analyzer>
@@ -64,8 +72,10 @@ E.g. tokenizer `StandardTokenizerFactory` has this config [snippet](https://luce
</fieldType>
```
-Then go to the [source code](https://github.com/apache/lucene/blob/17c13a76c87c6246f32dd7a78a26db04401ddb6e/lucene/core/src/java/org/apache/lucene/analysis/standard/StandardTokenizerFactory.java#L36) of the class on Github.
+Then go to the <a href="https://github.com/apache/lucene/blob/17c13a76c87c6246f32dd7a78a26db04401ddb6e/lucene/core/src/java/org/apache/lucene/analysis/standard/StandardTokenizerFactory.java#L36" data-proofer-ignore>
+source code</a> of the class on GitHub.
Copy value of the `public static final String NAME` into the `<name>` and observe the names used for configuring the tokenizer (in this case only `maxTokenLength`).
+
```xml
<tokenizer>
<name>standard</name>
@@ -92,7 +102,8 @@ If the `configDir` is not specified then files are loaded from the classpath.
## Inspiration
These projects:
+
- [vespa-chinese-linguistics](https://github.com/vespa-engine/sample-apps/blob/master/examples/vespa-chinese-linguistics/src/main/java/com/qihoo/language/JiebaLinguistics.java).
- [OpenNlp Linguistics](https://github.com/vespa-engine/vespa/blob/50d7555bfe7bdaec86f8b31c4d316c9ba66bb976/opennlp-linguistics/src/main/java/com/yahoo/language/opennlp/OpenNlpLinguistics.java)
- [vespa-kuromoji-linguistics](https://github.com/yahoojapan/vespa-kuromoji-linguistics/tree/main)
-- [Clojure library](https://github.com/dainiusjocas/lucene-text-analysis) to work with Lucene analyzers
+- [Clojure library](https://github.com/dainiusjocas/lucene-text-analysis) to work with Lucene analyzers