author | Kristian Aune <kraune@verizonmedia.com> | 2023-09-21 08:35:01 +0200 |
---|---|---|
committer | Kristian Aune <kraune@verizonmedia.com> | 2023-09-21 08:35:01 +0200 |
commit | d2428cbfb723fb3268499eb9b59fd21a8cf9e62e (patch) | |
tree | d9f0c643c49706cc3d9554ce015dd6afc2f64310 /lucene-linguistics | |
parent | c21a9060f0779566bfcf9f776484d64975dee7f4 (diff) |
Ignore from htmlproofer linkcheck
Diffstat (limited to 'lucene-linguistics')
-rw-r--r-- | lucene-linguistics/README.md | 21 |
1 files changed, 16 insertions, 5 deletions
diff --git a/lucene-linguistics/README.md b/lucene-linguistics/README.md
index feece2b2366..a3b20b94bf9 100644
--- a/lucene-linguistics/README.md
+++ b/lucene-linguistics/README.md
@@ -3,6 +3,7 @@
 Linguistics implementation based on the [Apache Lucene](https://lucene.apache.org).
 
 Features:
+
 - a list of default analyzers per language;
 - building custom analyzers through the configuration of the linguistics component;
 - building custom analyzers in Java code and declaring them as `components`.
@@ -10,18 +11,21 @@ Features:
 ## Development
 
 Build:
+
 ```shell
 mvn clean test -U package
 ```
 
 To compile configuration classes so that Intellij doesn't complain:
-- right click on `pom.xml`
-- then `Maven`
+
+- right click on `pom.xml`
+- then `Maven`
 - then `Generate Sources and Update Folders`
 
 ## Usage
 
 Add `<component>` to `services.xml` of your application package, e.g.:
+
 ```xml
 <component id="com.yahoo.language.lucene.LuceneLinguistics" bundle="lucene-linguistics">
 <config name="com.yahoo.language.lucene.lucene-analysis">
@@ -41,9 +45,11 @@ Add `<component>` to `services.xml` of your application package, e.g.:
 </config>
 </component>
 ```
+
 into `container` clusters that have `<document-processing/>` and/or `<search>` specified.
 
 And then package and deploy, e.g.:
+
 ```shell
 (mvn clean -DskipTests=true -U package && vespa deploy -w 100)
 ```
@@ -51,11 +57,13 @@ And then package and deploy, e.g.:
 ### Configuration of Lucene Analyzers
 
 Read the Lucene docs of subclasses of:
+
 - [TokenizerFactory](org.apache.lucene.analysis.TokenizerFactory), e.g. [StandardTokenizerFactory](https://lucene.apache.org/core/9_0_0/core/org/apache/lucene/analysis/standard/StandardTokenizerFactory.html)
-- [CharFilterFactory](https://lucene.apache.org/core/9_0_0/core/org/apache/lucene/analysis/CharFilterFactory.html), e.g. [PatternReplaceCharFilterFactory](https://lucene.apache.org/core/8_1_1/analyzers-common/org/apache/lucene/analysis/pattern/PatternReplaceCharFilterFactory.html)
+- [CharFilterFactory](https://lucene.apache.org/core/9_0_0/core/org/apache/lucene/analysis/CharFilterFactory.html), e.g. [PatternReplaceCharFilterFactory](https://lucene.apache.org/core/8_1_1/analyzers-common/org/apache/lucene/analysis/pattern/PatternReplaceCharFilterFactory.html)
 - [TokenFilterFactory](https://lucene.apache.org/core/8_1_1/analyzers-common/org/apache/lucene/analysis/util/TokenFilterFactory.html), e.g. [ReverseStringFilterFactory](https://lucene.apache.org/core/8_1_1/analyzers-common/org/apache/lucene/analysis/reverse/ReverseStringFilterFactory.html)
 
 E.g. tokenizer `StandardTokenizerFactory` has this config [snippet](https://lucene.apache.org/core/9_0_0/core/org/apache/lucene/analysis/standard/StandardTokenizerFactory.html):
+
 ```xml
 <fieldType name="text_stndrd" class="solr.TextField" positionIncrementGap="100">
 <analyzer>
@@ -64,8 +72,10 @@ E.g. tokenizer `StandardTokenizerFactory` has this config [snippet](https://luce
 </fieldType>
 ```
 
-Then go to the [source code](https://github.com/apache/lucene/blob/17c13a76c87c6246f32dd7a78a26db04401ddb6e/lucene/core/src/java/org/apache/lucene/analysis/standard/StandardTokenizerFactory.java#L36) of the class on Github.
+Then go to the <a href="https://github.com/apache/lucene/blob/17c13a76c87c6246f32dd7a78a26db04401ddb6e/lucene/core/src/java/org/apache/lucene/analysis/standard/StandardTokenizerFactory.java#L36" data-proofer-ignore>
+source code</a> of the class on GitHub.
 Copy value of the `public static final String NAME` into the `<name>` and observe the names used for configuring the tokenizer (in this case only `maxTokenLength`).
+
 ```xml
 <tokenizer>
 <name>standard</name>
@@ -92,7 +102,8 @@ If the `configDir` is not specified then files are loaded from the classpath.
 ## Inspiration
 
 These projects:
+
 - [vespa-chinese-linguistics](https://github.com/vespa-engine/sample-apps/blob/master/examples/vespa-chinese-linguistics/src/main/java/com/qihoo/language/JiebaLinguistics.java).
 - [OpenNlp Linguistics](https://github.com/vespa-engine/vespa/blob/50d7555bfe7bdaec86f8b31c4d316c9ba66bb976/opennlp-linguistics/src/main/java/com/yahoo/language/opennlp/OpenNlpLinguistics.java)
 - [vespa-kuromoji-linguistics](https://github.com/yahoojapan/vespa-kuromoji-linguistics/tree/main)
-- [Clojure library](https://github.com/dainiusjocas/lucene-text-analysis) to work with Lucene analyzers
+- [Clojure library](https://github.com/dainiusjocas/lucene-text-analysis) to work with Lucene analyzers
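
The one functional change in this commit replaces the Markdown link to the `StandardTokenizerFactory` source with an HTML anchor that carries `data-proofer-ignore`, the attribute html-proofer uses to skip an element during link checking. The Python sketch below only illustrates that effect and is not how html-proofer (a Ruby tool) is implemented: it splits the anchors in an HTML fragment into links that would be checked and links that are flagged to be skipped. The names `LinkCollector` and `split_links` and the `example.com` URLs are hypothetical.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect <a href> targets, separating those flagged with data-proofer-ignore."""

    def __init__(self):
        super().__init__()
        self.to_check = []  # links a checker would visit
        self.ignored = []   # links flagged to be skipped

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr_map = dict(attrs)
        href = attr_map.get("href")
        if href is None:
            return
        # A bare attribute such as `data-proofer-ignore` is reported with the
        # value None, so test for the key rather than for a truthy value.
        if "data-proofer-ignore" in attr_map:
            self.ignored.append(href)
        else:
            self.to_check.append(href)


def split_links(html):
    """Return (links_to_check, ignored_links) for an HTML fragment."""
    collector = LinkCollector()
    collector.feed(html)
    collector.close()
    return collector.to_check, collector.ignored


if __name__ == "__main__":
    # Hypothetical fragment: one ordinary link, one flagged link.
    sample = (
        '<p>Read the <a href="https://example.com/docs">docs</a>, then go to the '
        '<a href="https://example.com/Source.java#L36" data-proofer-ignore>'
        "source code</a>.</p>"
    )
    to_check, ignored = split_links(sample)
    print("check:", to_check)
    print("skip:", ignored)
```

The attribute sits on the element itself, so the exemption stays next to the link it covers instead of in a separate ignore list.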