aboutsummaryrefslogtreecommitdiffstats
path: root/juniper/doc
diff options
context:
space:
mode:
authorJon Bratseth <bratseth@yahoo-inc.com>2016-06-15 23:09:44 +0200
committerJon Bratseth <bratseth@yahoo-inc.com>2016-06-15 23:09:44 +0200
commit72231250ed81e10d66bfe70701e64fa5fe50f712 (patch)
tree2728bba1131a6f6e5bdf95afec7d7ff9358dac50 /juniper/doc
Publish
Diffstat (limited to 'juniper/doc')
-rw-r--r--juniper/doc/written/fsearchparams.html499
1 files changed, 499 insertions, 0 deletions
diff --git a/juniper/doc/written/fsearchparams.html b/juniper/doc/written/fsearchparams.html
new file mode 100644
index 00000000000..b95ba895503
--- /dev/null
+++ b/juniper/doc/written/fsearchparams.html
@@ -0,0 +1,499 @@
+<!-- Copyright 2016 Yahoo Inc. Licensed under the terms of the Apache 2.0 license. See LICENSE in the project root. -->
+<title>Juniper Configuration Documentation</title>
+<h1>Juniper Configuration Documentation</h1>
+
+<b>Note:</b> This document describes in details the functionality of Juniper v.2.1.0.
+The document has gradually become more and more for internal use for
+instance for detailed tuning
+by Professional Service. A more high level and less detailed user level
+configuration documentation is also available.
+<p>
+Juniper implements a combined proximity ranking and dynamic teaser result
+processing module.This module is intended to be interfaced to by different
+Fast software modules on demand. Currently, the only available module that
+makes use of Juniper is the Fast Server module, in which Juniper currently
+is an integrated part of <dfn>fsearch</dfn> (the search engine executable
+that runs on each search node in the system).
+
+<h2>Juniper simple description of functionality/implementation</h2>
+
+The document body is stripped for markup during document processing and
+stored as an extended document summary field.
+A max limit of how much of the document that gets stored is configurable as
+of Fast Server v.4.17 (see the Fast Server�configuration documentation for
+details). For each document
+on the result page, this document extract is retrieved and fed through
+Juniper which will perform the following steps:
+
+<ol>
+ <li> Scan the stripped document text (docsum) for matches of the query,
+ create a data structure containing information about those matches,
+ and provide a quality measure (rank boost value) that can be used as a
+ metric to determine the quality of the document wrt. proximity and
+ position of the search string in the document. The data structure
+ contains ao. a list of matches of the query ordered by quality (see below
+ for the quality measure). The document quality measure is computed from the
+ quality measure of the best of the individual matches and the total
+ number of hits within the document.
+
+ <li> Generate a dynamic teaser based on the data structure previously
+ generated. The dynamic teaser is composed of a number of text segments
+ that include the "best" matches of the query in that
+ document. The teaser is presented with the query words highlighted.
+ The definition of highlight is configurable. If the document is short
+ enough to fit completely into the configured teaser length, it will be
+ provided as is, but with highlighting of the relevant keywords.
+</ol>
+
+Step 2 is only necessary if the teaser is going to be displayed, which
+might be a decision taken on basis of the quality measure provided in step 1.
+
+<h3>Quality measure</h3>
+The text segments matching the query are ranked by (in decreasing order of
+significance):
+
+<ol>
+ <li> Completeness * keyword weight -
+ higher ranking if more search words are present in
+ the same context, and relatively higher weight on matches that
+ contains "important" terms compared to matches with stop words if
+ equal number of words.
+ <li> Proximity - query terms occurring near each other is better
+ <li> Position - earlier in document is better
+</ol>
+
+The number of matches selected is based on text segment lengths including
+a configurable amound of surrounding text, the number of matching segments
+to use (configurable) and the required total summary
+size (configurable). The final set of matches is
+returned with markup for the hits and the abbreviated sections
+(continuation symbol).
+
+The query used for teaser generation has undergone proper name recognition
+and English spell checking. Highlighting is done on individual terms of the
+query. In particular, phrases are broken down into individual terms, but
+the preference to proximal terms will maintain the phrasing in the
+generated teaser.
+
+Lemmatization by expanding documents with word inflections cannot be used
+by Juniper. In the future, Juniper would expand the query based on the
+original query and language information. This functionality is not
+available yet, thus lemmatized terms will in general not be highlighted by
+Juniper.
+
+Currently Juniper uses an alternative, simple brute force stemming
+algorithm that basically allows prefixes of the document words to match if
+the document word in question is no longer than P (configurable) bytes
+<i>longer</i> than the query keyword.
+This algorithm works well for keywords of a certain size, but not for very
+short keywords. Thus an additional configuration variable defines a lower
+bound for what lengths of keywords that will be subject to this algorithm.
+With this simple algorithm in place, typical weak form singular to plural
+mapping will get highlighted while the opposite, going from a long form to
+a shorter one will not work as might be expected. Eg. this algorithm does
+not change the keywords themselves. Consequently, the shorter forms of the
+keywords are more likely to give non-exact hits in the dynamic summary.
+
+
+<h2>Fast Search configuration of Juniper</h2>
+Enabling Juniper functionality within Fast Search is done on a per field
+basis by means of override specifications in summary.map.
+Currently the following override specs are supported by Juniper:
+<p>
+<ul>
+<li><pre>override &lt;outputfield&gt; dynamicteaser &lt;inputfield&gt;</pre>
+<li><pre>override &lt;outputfield&gt; dynamicteasermetric
+&lt;inputfield&gt;</pre>
+<li><pre>override &lt;outputfield&gt; juniperlog &lt;inputfield&gt;</pre>
+</ul>
+<p>
+Details of the override directive can be found in <i>Fast Search 4.13 -
+Dynamic Docsum Generation Framework</i>.
+The <tt>dynamicteasermetric</tt> field provides a ranking of the document
+based on a corresponding metric as that used to select between individual
+matches for dynamic teaser generation inside a document. See the section on
+using Juniper for proximity boosting <a href="#proximity">below</a>
+The <tt>juniperlog</tt> field is new as of Juniper 2.0, and is used to
+retrieve the information generated by Juniper by means of the log query
+option, see the runtime option table <a href="#dynpar">below</a>.
+
+Note that when integrated into Data Search 3.1 and later,
+this part of the configuration will be generated via
+<ol>
+<li>The index profile
+<li>The index configuration (indexConfig.xml) by the config server.
+</ol>
+
+<h3>Configuration levels</h3>
+When integrated into Fast Search, Juniper receives its default parameters from
+global settings in the <dfn>fsearchrc</dfn> config file.
+These configuration parameter settings must be preconfigured at
+<dfn>fsearch</dfn> process startup time. Two levels of system configuration
+is currently supported,
+<ol>
+<li><b>System default configuration:</b> This is the configuration settings
+exemplified by the parameter descriptions below.
+<li><b>Per field configuration:</b>By using the field names instead of the
+string "juniper" as prefix, the default setting can be overridden on a
+<dfn>result field</dfn> basis. Eg. setting for instance
+<pre>myfield.dynsum.length 512</pre> (see below) would allow the
+<tt>myfield</tt> result field to receive a different teaser length.
+</ol>
+In addition Juniper 2.x supports changing certain subset of the parameters
+on a per query basis. See <a href="#dynpar">separate section</a> on this below.
+<p>
+<b>Performance note:</b> The per field configuration possibility should be used with
+care since overriding some parameters may cause significant computation
+overhead in that Juniper would have to scan the whole text multiple times.
+Changing the <tt>dynsum</tt> group of fields is generally quite performance
+conservative (only the teaser generation phase would have to be repeated),
+while changing any of the <tt>stem</tt> or <tt>matcher</tt>
+fields would require a different text scan for each combination of
+parameters.
+
+<h3>Arbitrary byte sequences in markup parameters</h3>
+To allow arbitrary byte seqences (such as low ascii values) to be used to
+denote highlight on/off and continuation symbol(s), Juniper now accepts
+strings on the form \xNN where the N's are hex values [0-9a-fA-F].
+This will be converted into a byte value of NN. Note that Juniper exports
+UTF-8 text so this sequence should be a valid UTF-8 byte sequence. No
+checks are performed from Juniper on the validity of such strings in the
+<dfn>fsearch</dfn> domain.
+As a consequence of this, occurrences of backslash must be escaped
+accordingly (<dfn>\\</dfn>).
+
+
+<h3>Blanks in text parameters must be escaped</h3>
+Note that <dfn>fsearchrc</dfn> does not accept blanks in the
+parameters. To allow more complicated highlight markup, the sequence
+<dfn>\x20</dfn> must be used as space in text fields.
+
+<p>
+<h3><a name="em"></a>Escaping markup in the summary text</h3>
+Since Juniper may supply markup through the use of the
+highlight and continuation parameters, problems may occur if the
+analysed text itself contains markup. To avoid this, Juniper may be
+configured to escape the 5 XML/HTML markup symbols
+(<dfn>&quot;&amp;&lt;&apos;&gt;</dfn>) before adding the mentioned
+parametrized symbols. See the description of the
+juniper.dynsum.escape_markup parameter <a href="#empar">below</a>.
+<p>
+The following variables are available for static, global configuration for
+a particular search node:
+<p>
+
+<table><a name="conftable"></a>
+ <tr bgcolor="#f0f0f0"><td><b>Parameter name</b></td><td><b>Default
+ value</b></td><td><b>Description</b></td>
+ </tr>
+
+ <tr bgcolor="#f0f0f0">
+ <td>juniper.dynsum.highlight_on</td><td>&lt;b&gt;</td>
+ <td>A string to be included <i>before</i> each hit in the generated summary</td>
+ <tr bgcolor="#f0f0f0">
+ <td>juniper.dynsum.highlight_off</td>
+ <td>&lt;/b&gt;</td><td>A string to be included <i>after</i> each hit in
+ the generated summary</td>
+ <tr bgcolor="#f0f0f0">
+ <td>juniper.dynsum.continuation</td>
+ <td>...</td><td>A string to be included to denote
+ abbreviated/left out pieces of the original text in the generated summary</td>
+ <tr bgcolor="#f0f0f0">
+ <td>juniper.dynsum.separators</td>
+ <td>\x1D\x1F</td><td>A string containing characters that are added for
+ word separation purposes (eg.CJK languages and German/Norwegian
+ etc. word separation). This list should contain non-word characters
+ only for this to be meaningful. Also, currently only single byte
+ characters are supported. These characters wil be removed from the
+ generated teaser by Juniper.</td>
+ <tr bgcolor="#f0f0f0">
+ <td>juniper.dynsum.connectors</td>
+ <td>-'</td><td>A string containing characters that may connect two word
+ tokens to form a single word. Words connected by a single such
+ character will not be splitted by Juniper when generating the teaser.</td>
+ <tr bgcolor="#f0f0f0">
+ <td><a name="empar"></a>juniper.dynsum.escape_markup</td>
+ <td>auto</td><td>See <a href="#em">description</a> above. Accepted values:
+ <dfn>on</dfn>,<dfn>off</dfn> or <dfn>auto</dfn>. If <dfn>auto</dfn> is
+ used, Juniper will escape markup in the generated summary if any of the symbols
+ <dfn>highlight_on</dfn>, <dfn>highlight_off</dfn> or
+ <dfn>continuation</dfn> contains a <dfn>&lt;</dfn> as the first
+ character.
+ </td>
+ <tr bgcolor="#f0f0f0">
+ <td>juniper.dynsum.length</td>
+ <td>256</td><td>Length of the generated summary in bytes. This is a
+ hint to Juniper. The result may be slightly longer or shorter depending
+ on the structure of the available document text and the submitted
+ query.</td>
+ <tr bgcolor="#f0f0f0">
+ <td>juniper.dynsum.max_matches</td>
+ <td>4</td><td>The number of (possibly partial) set of keywords
+ matching the query, to attempt to include in the summary. The larger this
+ value compared is set relative to the <i>length</i> parameter, the more
+ dense the keywords may appear in the summary.</td>
+ <tr bgcolor="#f0f0f0">
+ <td>juniper.dynsum.min_length</td>
+ <td>128</td><td>Minimal desired length of the generated summary in
+ bytes. This is the shortest summary length for which the number of
+ matches will be respected. Eg. if
+ a summary appear to become shorter than <i>min_length</i> bytes with
+ <i>max_matches</i> matches, then additional matches will be used if available.</td>
+ <tr bgcolor="#f0f0f0">
+ <td>juniper.dynsum.surround_max</td>
+ <td>80</td><td>The maximal number of bytes of context to prepend and append to
+ each of the selected query keyword hits. This parameter defines the
+ max size a summary would become if there are few keyword hits
+ (max_matches set low or document contained few matches of the
+ keywords.</td>
+ <tr bgcolor="#f0f0f0">
+ <td>juniper.stem.min_length</td>
+ <td>5</td><td>The minimal number of bytes in a query keyword for
+ it to be subject to the simple Juniper stemming algorithm. Keywords
+ that are shorter than or equal to this limit will only yield exact
+ matches in the dynamic summaries.
+ </td>
+ <tr bgcolor="#f0f0f0">
+ <td>juniper.stem.max_extend</td>
+ <td>3</td><td>The maximal number of bytes that a word in the document
+ can be <i>longer</i> than the keyword itself to yield a match. Eg. for
+ the default values, if the keyword is 7 bytes long, it will match any
+ word with length less than or equal to 10 for which the keyword is a prefix.
+ </td>
+ </tr>
+ <tr bgcolor="#f0f0f0">
+ <td>juniper.matcher.winsize</td>
+ <td>400</td><td>The size of the sliding window used to determine if
+ multiple query terms occur together. The larger the value, the more
+ likely the system will find (and present in dynamic summary) complete
+ matches containing all the search terms. The downside is a potential
+ performance overhead of keeping candidates for matches longer during
+ matching, and consequently updating more candidates that eventually
+ gets thrown.
+ </td>
+ </tr>
+ <tr bgcolor="#f0f0f0">
+ <td>juniper.proximity.factor</td><td>0.25</td><td>
+A factor to multiply the internal Juniper metric with when producing
+proximity metric for a given field. A real/floating point value accepted
+Note that the QRserver (see <a href="#qrserver">below</a>)
+also supports a factor that is global to all proximity
+metric fields, and that is applied in addition. </td>
+ </tr>
+
+</table>
+
+<h2><a name="dynpar"></a>Alternate behaviour of Juniper on a per query
+basis</h2>
+As of Juniper v.2.x and Fastserver v.4.17, and QRserver for Data Search
+3.2, Juniper supports a number of Juniper specific options that can be
+provided as part of the URL. The format of the option string is
+<pre>
+ juniper=&lt;param_name&gt;.&lt;value&gt;[_&lt;param_name&gt;.&lt;value&gt;]*
+</pre>
+As an example, consider the following URL addition:
+<pre>
+ juniper=near.2_dynlength.512_dynmatches.8
+</pre>
+If this string is present in the URL, Juniper would generate teasers that
+are up to twice as long and contains up to twice as many matches of the
+query compared to the default values.
+In addition, teasers (and <a href="#proximity">proximity metric</a>) will
+only be
+generated for those documents that fulfills the extra constraint that there
+exist at least one complete match of the query where the distance in words
+between the first and the last word of the query match is no more than 2 (+
+the number of words in the query).
+
+<h3>Supported per query options in Juniper 2.1.0</h3>
+
+<table>
+ <tr bgcolor="#f0f0f0"><td><b>Parameter
+ name</b></td><td><b>Corresponding config name, see <a href="#conftable">above</a></b></td><td><b>Description</b></td></tr>
+ <tr bgcolor="#f0f0f0"><td>dynlength</td><td>dynsum.length</td>
+ <td>The desired max length of the generated teaser</td></tr>
+ <tr bgcolor="#f0f0f0"><td>dynmatches</td><td>dynsum.max_matches</td>
+ <td>The number of matches to try to fit in the teaser</td>
+ </tr>
+ <tr bgcolor="#f0f0f0"><td>dynsurmax</td><td>dynsum.surround_max</td>
+ <td>The maximal amount of surrounding context per keyword hit</td>
+ </tr>
+ <tr bgcolor="#f0f0f0"><td>near</td><td><i>N/A</i></td><td>Specifies a
+ proximity search where keywords should occur closer than the specified
+ value in number of words not counting the query terms themselves.</td>
+ </tr>
+ <tr bgcolor="#f0f0f0"><td>stemext</td><td>juniper.stem.max_extend</td><td>
+ The maximal number of bytes that a word in the document can be longer
+ than the keyword itself to yield a match.</td>
+ </tr>
+ <tr bgcolor="#f0f0f0"><td>stemmin</td><td>juniper.stem.min_length</td><td>
+ The minimal number of bytes in a query keyword for it to be subject to
+ the simple Juniper stemming algorithm.</td>
+ </tr>
+ <tr bgcolor="#f0f0f0"><td>within</td><td><i>N/A</i></td><td>Same as
+ <i>near</i> with the additional constraint that matches of the query must
+ have the same order of the query words as the original query.</td>
+ </tr>
+ <tr bgcolor="#f0f0f0"><td>winsize</td><td>juniper.matcher.winsize</td><td>
+ The size of the sliding window used to determine if multiple query terms
+ occur together.</td>
+ </tr>
+ <tr bgcolor="#f0f0f0"><td>log</td><td><i>N/A</i></td><td>Internal debug
+ option (privileged port only). Value is a bitmap that allows selectively
+ enabled log output to be generated by Juniper for output into a
+ juniperlog override configured summary field. Useful only with a special
+ template that makes use of this information. Currently the only
+ supported bit is 0x8000 which will provide a html table with up to 20 of the
+ topmost matches of each document, and their identified proximity
+ (distance) and rank.
+</td>
+</table>
+
+<h3>Juniper debug template in Data Search</h3>
+Template support for the log parameter as well as extracting the whole
+juniper input document text is provided by Data Search 3.2 by means of the
+<tt>jsearch</tt> page from the qrserver port. Replace <tt>asearch</tt>
+with <tt>jsearch</tt>
+in the URL of a qrserver privileged port search.
+
+<h2>Using Juniper for proximity boosting with the QRserver</h2>
+In order to use Juniper to boost hits that have good proximity of the query
+(or to filter the hits based on NEAR or WITHIN constraints) the QRserver
+would need to be provided with the following URL addition:
+<pre>
+rpf_proximitybooster:enabled=1
+</pre>
+Note that Juniper will return 0 as proximity metric (dynamicteasermetric)
+if the query with juniper option constraints cannot be satisfied by
+the information in the configured input field. Thus if the selection of a
+hit is done solely on the basis of information not present in the Juniper input
+(such as the title in the default configuration)
+proximity boosting may demote such hits. A solution for this problem has
+been proposed for future versions of Juniper.
+
+<h3><a name="qrserver"></a>Supported QRserver options to use with proximity
+boosting via Juniper</h3>
+QRserver behaviour wrt. proximity boosting can be set both in configuration
+(at QRserver startup) or on a per query basis. In the below table, some of the
+default configuration settings are listed together with their corresponding
+runtime setting, if any. Consult QRserver documentation for the complete
+list of options.
+<p>
+<table>
+ <tr bgcolor="#f0f0f0"><td><b>Config parameter
+ name</b></td><td><b>Corresponding runtime (URL)
+ syntax</a></b></td><td><b>Description</b></td></tr>
+ <tr bgcolor="#f0f0f0"><td>rp.proximityboost.enabled=1</td><td>N/A</td><td>Configure for
+ proximity boosting in the QRserver (not necessarily enable it)</td></tr>
+ <tr
+ bgcolor="#f0f0f0"><td>#rp.proximityboost.default</td>
+ <td>rpf_proximityboost:enabled=1</td><td>Enable proximity
+ boosting</td></tr>
+ <tr
+ bgcolor="#f0f0f0"><td>rp.proximityboost.factor=0.5</td>
+ <td>N/A</td><td>A value that the combined proximity boost value
+ calculated possibly from multiple fields, scaled by their individual
+ factors are multipled with before adding it to the Fastserver rank value
+ to be used to reorder hits.</td></tr>
+ <tr bgcolor="#f0f0f0"><td>rp.proximityboost.hits=100</td>
+ <td>rpf_proximityboost:hits=100</td><td>The number of Fastserver hits to
+ retrieve as basis for the reordering.</td></tr>
+ <tr
+ bgcolor="#f0f0f0"><td>rp.proximityboost.maxoffset=100</td>
+ <td>N/A</td><td>The maximal offset within the list of hits that will be
+ subject to any proximity boost reordering/filtering. Hits above this
+ range in the original result set will not be subject to proximity boosting.</td></tr>
+</table>
+
+<h2>Configuring Juniper within Data Search</h2>
+Except where explicitly noted, configuring Juniper for Data Search is
+similar to configuring for Real-Time Search. As of Data Search 3.0 Juniper
+is by default configured and enabled in Data Search.
+
+<h2>Configuring Juniper within Fast Real-Time Search</h2>
+Juniper is provided as part of Real-Time Search (through Fast Search)
+starting with version 2.4. To enable the Fast Search integrated Juniper in
+a Real-Time Search environment, see the documentation extensions to
+Real-Time Search 2.4. Note that to configure Juniper within Real-Time
+Search, the configuration variables should be put in the
+<tt>etc/fsearch.addon*</tt> file(s) which will be used as input when Real-Time
+Search generates <tt>fsearchrc</tt> files for all configured search
+engines. Also a proper <tt>summary.map</tt> file is needed to enable the
+dynamic summaries on particular fields.
+
+<h2>Configuring Juniper for Fast Search v.4.15 and higher</h2>
+Newer versions of Fast Search provide template support to allow different
+Juniper markup depending on the type of display desired (plain,html or
+xml).
+
+All the http frontends that needs an interpretation of the highlight
+information provided by Juniper should have the following setup for Juniper:
+<p>
+<table>
+<tr bgcolor="#f0f0f0"><td>juniper.dynsum.highlight_on</td><td>\02</td></tr>
+<tr bgcolor="#f0f0f0"><td>juniper.dynsum.highlight_off</td><td>\03</td></tr>
+<tr bgcolor="#f0f0f0"><td>juniper.dynsum.continuation</td><td>\1E</td></tr>
+</table>
+<p>
+The actual frontend markup configuration then takes place by setting
+variables such as
+<p>
+<table>
+<tr bgcolor="#f0f0f0"><td>tvm.dynsum.html.highlight_on</td></tr>
+<tr bgcolor="#f0f0f0"><td>tvm.dynsum.xml.highlight_off</td></tr>
+</table>
+<p>
+in the relevant rc file.
+
+Note that for a Data Search installation, the QR server has
+the responsibility of providing dynamic teasers, consequently the
+<tt>etc/qrserver/qrserverrc</tt> file should provide the above
+configuration.
+<p>
+For a similar Web Search configuration, using a "bare-bone" top level
+fdispatch instance, the <tt>fdispatchrc</tt> file is the appropriate place.
+<p>
+For a "standalone" Real-Time Search setup, the appropriate configuration
+file(s) is the fdispatch.addon file(s).
+
+<h2>How to report bugs/errors related to Juniper</h2>
+Errors/problems related to Juniper can be divided into two categories:
+<ol>
+ <li> Problems with specific documents/teasers
+ <li> System errors/instability etc.
+</ol>
+Due to the complexity of a full, running system it is much easier for all
+parts if the particular query/document pair triggering the problem can be
+identified and analysed off-line.
+
+<h3>Problems with specific teasers</h3>
+Problems of this category is likely to occur because there are so many
+combinations of queries and documents that it is not possible to
+test for all cases. To be able to analyse such problems, it is vital that
+the exact (byte-by-byte) teaser generation source (document summary input
+to Juniper) can be made available together with the exact query as
+presented to Juniper. To determine this requires the following information:
+<ol>
+ <li> The teaser source docsum. The name of the docsum field is dependent
+ on the configuration in summary.map. The data should be provided without
+ any post processing performed, if possible, to avoid missing problems
+ related to bad input data such as malformet UTF8 characters.
+ <li> The original query as submitted by the user
+ <li> The expanded query (available under <tt>var/log/querylogs/</tt>)
+ <li> A corresponding <tt>fsearchrc</tt> (pure fastserver4) or
+ <tt>fsearch.addon</tt> (Real Time Search/DS 3.x) file used by the
+ fsearch process that performed the task.
+</ol>
+
+
+<h3>System errors/instability problems</h3>
+Problems of category 2 should, if occurring at all only be associated with
+development/beta releases. If such an unfortunate event should happen, the
+following information <dfn>in addition to the information associated with
+category 1</dfn> would be useful to pin down the problem:
+<ol>
+ <li> Core file of <dfn>fsearch</dfn> accompanied with the associated
+ <dfn>fsearch</dfn> binary.
+ <li> Log files from the crashed process (in Data Search these will be
+ present as <tt>var/fsearch-*.log</tt> and <tt>var/log/stdout.log</tt>.
+</ol>