Fold juniper into searchsummary library.

author: Henning Baldersheim <balder@yahoo-inc.com> 2022-05-19 15:15:55 +0000
committer: Henning Baldersheim <balder@yahoo-inc.com> 2022-05-19 15:15:55 +0000
commit: 5ecf186af848c4a518d7268fcb452b264fba939a (patch)
tree: 9d9e01e78f321d24a221aa2f31cfa9a70b29a818 /juniper/doc/written
parent: 3988acc5dd01647c479412d5815c37d6fa4c6a2b (diff)
1 files changed, 0 insertions, 488 deletions
diff --git a/juniper/doc/written/fsearchparams.html b/juniper/doc/written/fsearchparams.html
deleted file mode 100644
index 1a968e55c41..00000000000
--- a/juniper/doc/written/fsearchparams.html
+++ /dev/null
@@ -1,488 +0,0 @@
-<!-- Copyright Yahoo. Licensed under the terms of the Apache 2.0 license. See LICENSE in the project root. -->
-<title>Juniper Configuration Documentation</title>
-<h1>Juniper Configuration Documentation</h1>
-
-<b>Note:</b> This document describes in details the functionality of Juniper v.2.1.0.
-The document has gradually become more and more for internal use for
-instance for detailed tuning
-by Professional Service. A more high level and less detailed user level
-configuration documentation is also available.
-<p>
-Juniper implements a combined proximity ranking and dynamic teaser result
-processing module.This module is intended to be interfaced to by different
-Fast software modules on demand. Currently, the only available module that
-makes use of Juniper is the Fast Server module, in which Juniper currently
-is an integrated part of <dfn>fsearch</dfn> (the search engine executable
-that runs on each search node in the system).
-
-<h2>Juniper simple description of functionality/implementation</h2>
-
-The document body is stripped for markup during document processing and
-stored as an extended document summary field. 
-A max limit of how much of the document that gets stored is configurable as
-of Fast Server v.4.17 (see the Fast Server�configuration documentation for
-details).  For each document
-on the result page, this document extract is retrieved and fed through
-Juniper which will perform the following steps:
-
-<ol>
-  <li> Scan the stripped document text (docsum) for matches of the query,
-  create a data structure containing information about those matches,
-  and provide a quality measure (rank boost value) that can be used as a
-  metric to determine the quality of the document wrt. proximity and
-  position of the search string in the document. The data structure
-  contains ao. a list of matches of the query ordered by quality (see below
-  for the quality measure). The document quality measure is computed from the
-  quality measure of the best of the individual matches and the total
-  number of hits within the document.
-
-  <li> Generate a dynamic teaser based on the data structure previously
-  generated. The dynamic teaser is composed of a number of text segments
-  that include the "best" matches of the query in that
-  document. The teaser is presented with the query words highlighted.
-  The definition of highlight is configurable. If the document is short
-  enough to fit completely into the configured teaser length, it will be
-  provided as is, but with highlighting of the relevant keywords.
-</ol>
-
-Step 2 is only necessary if the teaser is going to be displayed, which
-might be a decision taken on basis of the quality measure provided in step 1.
-
-<h3>Quality measure</h3>
-The text segments matching the query are ranked by (in decreasing order of
-significance):
-
-<ol>
-  <li> Completeness * keyword weight - 
-       higher ranking if more search words are present in
-       the same context, and relatively higher weight on matches that
-       contains "important" terms compared to matches with stop words if
-       equal number of words.
-  <li> Proximity - query terms occurring near each other is better
-  <li> Position - earlier in document is better
-</ol>
-
-The number of matches selected is based on text segment lengths including
-a configurable amound of surrounding text, the number of matching segments
-to use (configurable) and the required total summary
-size (configurable). The final set of matches is 
-returned with markup for the hits and the abbreviated sections
-(continuation symbol).
-
-The query used for teaser generation has undergone proper name recognition
-and English spell checking. Highlighting is done on individual terms of the
-query. In particular, phrases are broken down into individual terms, but
-the preference to proximal terms will maintain the phrasing in the
-generated teaser.  
-
-Lemmatization by expanding documents with word inflections cannot be used
-by Juniper. In the future, Juniper would expand the query based on the
-original query and language information. This functionality is not
-available yet, thus lemmatized terms will in general not be highlighted by
-Juniper.
-
-Currently Juniper uses an alternative, simple brute force stemming
-algorithm that basically allows prefixes of the document words to match if
-the document word in question is no longer than P (configurable) bytes
-<i>longer</i> than the query keyword. 
-This algorithm works well for keywords of a certain size, but not for very
-short keywords. Thus an additional configuration variable defines a lower
-bound for what lengths of keywords that will be subject to this algorithm.
-With this simple algorithm in place, typical weak form singular to plural
-mapping will get highlighted while the opposite, going from a long form to
-a shorter one will not work as might be expected. Eg. this algorithm does
-not change the keywords themselves. Consequently, the shorter forms of the
-keywords are more likely to give non-exact hits in the dynamic summary.
-
-
-<h2>Fast Search configuration of Juniper</h2>
-Enabling Juniper functionality within Fast Search is done on a per field
-basis by means of override specifications in summary.map. 
-Currently the following override specs are supported by Juniper:
-<p>
-<ul>
-<li><pre>override &lt;outputfield&gt; dynamicteaser &lt;inputfield&gt;</pre>
-<li><pre>override &lt;outputfield&gt; dynamicteasermetric
-&lt;inputfield&gt;</pre>
-<li><pre>override &lt;outputfield&gt; juniperlog &lt;inputfield&gt;</pre>
-</ul>
-<p>
-Details of the override directive can be found in <i>Fast Search 4.13 -
-Dynamic Docsum Generation Framework</i>.
-The <tt>dynamicteasermetric</tt> field provides a ranking of the document
-based on a corresponding metric as that used to select between individual
-matches for dynamic teaser generation inside a document. See the section on
-using Juniper for proximity boosting <a href="#proximity">below</a>
-The <tt>juniperlog</tt> field is new as of Juniper 2.0, and is used to
-retrieve the information generated by Juniper by means of the log query
-option, see the runtime option table <a href="#dynpar">below</a>.
-
-Note that when integrated into Data Search 3.1 and later, 
-this part of the configuration will be generated via
-<ol>
-<li>The index profile
-<li>The index configuration (indexConfig.xml) by the config server.
-</ol>
-
-<h3>Configuration levels</h3>
-When integrated into Fast Search, Juniper receives its default parameters from
-global settings in the <dfn>fsearchrc</dfn> config file. 
-These configuration parameter settings must be preconfigured at
-<dfn>fsearch</dfn> process startup time. Two levels of system configuration
-is currently supported, 
-<ol>
-<li><b>System default configuration:</b> This is the configuration settings
-exemplified by the parameter descriptions below.
-<li><b>Per field configuration:</b>By using the field names instead of the
-string "juniper" as prefix, the default setting can be overridden on a
-<dfn>result field</dfn> basis. Eg. setting for instance
-<pre>myfield.dynsum.length 512</pre> (see below) would allow the
-<tt>myfield</tt> result field to receive a different teaser length.
-</ol>
-In addition Juniper 2.x supports changing certain subset of the parameters
-on a per query basis. See <a href="#dynpar">separate section</a> on this below.
-<p>
-<b>Performance note:</b> The per field configuration possibility should be used with
-care since overriding some parameters may cause significant computation
-overhead in that Juniper would have to scan the whole text multiple times.
-Changing the <tt>dynsum</tt> group of fields is generally quite performance
-conservative (only the teaser generation phase would have to be repeated), 
-while changing any of the <tt>stem</tt> or <tt>matcher</tt>
-fields would require a different text scan for each combination of
-parameters.
-
-<h3>Arbitrary byte sequences in markup parameters</h3>
-To allow arbitrary byte seqences (such as low ascii values) to be used to
-denote highlight on/off and continuation symbol(s), Juniper now accepts
-strings on the form \xNN where the N's are hex values [0-9a-fA-F].
-This will be converted into a byte value of NN. Note that Juniper exports
-UTF-8 text so this sequence should be a valid UTF-8 byte sequence. No
-checks are performed from Juniper on the validity of such strings in the
-<dfn>fsearch</dfn> domain.
-As a consequence of this, occurrences of backslash must be escaped
-accordingly (<dfn>\\</dfn>).
-
-
-<h3>Blanks in text parameters must be escaped</h3>
-Note that <dfn>fsearchrc</dfn> does not accept blanks in the
-parameters. To allow more complicated highlight markup, the sequence
-<dfn>\x20</dfn> must be used as space in text fields. 
-
-<p>
-<h3><a name="em"></a>Escaping markup in the summary text</h3>
-Since Juniper may supply markup through the use of the
-highlight and continuation parameters, problems may occur if the
-analysed text itself contains markup. To avoid this, Juniper may be
-configured to escape the 5 XML/HTML markup symbols
-(<dfn>&quot;&amp;&lt;&apos;&gt;</dfn>) before adding the mentioned
-parametrized symbols. See the description of the
-juniper.dynsum.escape_markup parameter <a href="#empar">below</a>.
-<p>
-The following variables are available for static, global configuration for
-a particular search node:
-<p>
-
-<table><a name="conftable"></a>
-  <tr bgcolor="#f0f0f0"><td><b>Parameter name</b></td><td><b>Default
-  value</b></td><td><b>Description</b></td>
-  </tr>
-
-  <tr bgcolor="#f0f0f0">
-    <td>juniper.dynsum.highlight_on</td><td>&lt;b&gt;</td>
-    <td>A string to be included <i>before</i> each hit in the generated summary</td>
-  <tr bgcolor="#f0f0f0">
-    <td>juniper.dynsum.highlight_off</td>
-    <td>&lt;/b&gt;</td><td>A string to be included <i>after</i> each hit in
-        the generated summary</td>
-  <tr bgcolor="#f0f0f0">
-    <td>juniper.dynsum.continuation</td>
-    <td>...</td><td>A string to be included to denote
-        abbreviated/left out pieces of the original text in the generated summary</td>
-  <tr bgcolor="#f0f0f0">
-    <td>juniper.dynsum.separators</td>
-    <td>\x1D\x1F</td><td>A string containing characters that are added for
-    word separation purposes (eg.CJK languages and German/Norwegian
-    etc. word separation). This list should contain non-word characters
-    only for this to be meaningful. Also, currently only single byte
-    characters are supported. These characters wil be removed from the
-    generated teaser by Juniper.</td>
-  <tr bgcolor="#f0f0f0">
-    <td>juniper.dynsum.connectors</td>
-    <td>-'</td><td>A string containing characters that may connect two word
-    tokens to form a single word. Words connected by a single such
-    character will not be splitted by Juniper when generating the teaser.</td>
-  <tr bgcolor="#f0f0f0">
-    <td><a name="empar"></a>juniper.dynsum.escape_markup</td>
-    <td>auto</td><td>See <a href="#em">description</a> above. Accepted values:
-      <dfn>on</dfn>,<dfn>off</dfn> or <dfn>auto</dfn>. If <dfn>auto</dfn> is
-      used, Juniper will escape markup in the generated summary if any of the symbols
-      <dfn>highlight_on</dfn>, <dfn>highlight_off</dfn> or
-      <dfn>continuation</dfn> contains a <dfn>&lt;</dfn> as the first
-      character.
-    </td>
-  <tr bgcolor="#f0f0f0">
-    <td>juniper.dynsum.length</td>
-    <td>256</td><td>Length of the generated summary in bytes. This is a
-    hint to Juniper. The result may be slightly longer or shorter depending
-    on the structure of the available document text and the submitted
-    query.</td>
-  <tr bgcolor="#f0f0f0">
-    <td>juniper.dynsum.max_matches</td>
-    <td>4</td><td>The number of (possibly partial) set of keywords
-    matching the query, to attempt to include in the summary. The larger this
-    value compared is set relative to the <i>length</i> parameter, the more
-    dense the keywords may appear in the summary.</td>
-  <tr bgcolor="#f0f0f0">
-    <td>juniper.dynsum.min_length</td>
-    <td>128</td><td>Minimal desired length of the generated summary in
-    bytes. This is the shortest summary length for which the number of
-    matches will be respected. Eg. if
-    a summary appear to become shorter than <i>min_length</i> bytes with
-    <i>max_matches</i> matches, then additional matches will be used if available.</td>
-  <tr bgcolor="#f0f0f0">
-    <td>juniper.dynsum.surround_max</td>
-    <td>80</td><td>The maximal number of bytes of context to prepend and append to
-    each of the selected query keyword hits. This parameter defines the
-    max size a summary would become if there are few keyword hits
-    (max_matches set low or document contained few matches of the
-    keywords.</td>
-  <tr bgcolor="#f0f0f0">
-    <td>juniper.stem.min_length</td>
-    <td>5</td><td>The minimal number of bytes in a query keyword for
-    it to be subject to the simple Juniper stemming algorithm. Keywords
-    that are shorter than or equal to this limit will only yield exact
-    matches in the dynamic summaries.
-    </td>
-  <tr bgcolor="#f0f0f0">
-    <td>juniper.stem.max_extend</td>
-    <td>3</td><td>The maximal number of bytes that a word in the document
-    can be <i>longer</i> than the keyword itself to yield a match. Eg. for
-    the default values, if the keyword is 7 bytes long, it will match any
-    word with length less than or equal to 10 for which the keyword is a prefix.
-    </td>
-  </tr>
-  <tr bgcolor="#f0f0f0">
-    <td>juniper.matcher.winsize</td>
-    <td>400</td><td>The size of the sliding window used to determine if
-    multiple query terms occur together. The larger the value, the more
-    likely the system will find (and present in dynamic summary) complete
-    matches containing all the search terms. The downside is a potential
-    performance overhead of keeping candidates for matches longer during
-    matching, and consequently updating more candidates that eventually
-    gets thrown.
-    </td>
-  </tr>
-  <tr bgcolor="#f0f0f0">
-    <td>juniper.proximity.factor</td><td>0.25</td><td>
-A factor to multiply the internal Juniper metric with when producing
-proximity metric for a given field. A real/floating point value accepted
-Note that the QRserver (see <a href="#qrserver">below</a>) 
-also supports a factor that is global to all proximity
-metric fields, and that is applied in addition.	</td>
-  </tr>
-
-</table>
-
-<h2><a name="dynpar"></a>Alternate behaviour of Juniper on a per query
-basis</h2>
-As of Juniper v.2.x and Fastserver v.4.17, and QRserver for Data Search
-3.2, Juniper supports a number of Juniper specific options that can be
-provided as part of the URL. The format of the option string is 
-<pre>
-  juniper=&lt;param_name&gt;.&lt;value&gt;[_&lt;param_name&gt;.&lt;value&gt;]*
-</pre>
-As an example, consider the following URL addition:
-<pre>
-  juniper=near.2_dynlength.512_dynmatches.8
-</pre>
-If this string is present in the URL, Juniper would generate teasers that
-are up to twice as long and contains up to twice as many matches of the
-query compared to the default values. 
-In addition, teasers (and <a href="#proximity">proximity metric</a>) will
-only be
-generated for those documents that fulfills the extra constraint that there
-exist at least one complete match of the query where the distance in words
-between the first and the last word of the query match is no more than 2 (+
-the number of words in the query).
-
-<h3>Supported per query options in Juniper 2.1.0</h3>
-
-<table>
-  <tr bgcolor="#f0f0f0"><td><b>Parameter
-  name</b></td><td><b>Corresponding config name, see <a href="#conftable">above</a></b></td><td><b>Description</b></td></tr>
-  <tr bgcolor="#f0f0f0"><td>dynlength</td><td>dynsum.length</td>
-  <td>The desired max length of the generated teaser</td></tr>
-  <tr bgcolor="#f0f0f0"><td>dynmatches</td><td>dynsum.max_matches</td>
-  <td>The number of matches to try to fit in the teaser</td>
-  </tr>
-  <tr bgcolor="#f0f0f0"><td>dynsurmax</td><td>dynsum.surround_max</td>
-  <td>The maximal amount of surrounding context per keyword hit</td>
-  </tr>
-  <tr bgcolor="#f0f0f0"><td>near</td><td><i>N/A</i></td><td>Specifies a
-  proximity search where keywords should occur closer than the specified
-  value in number of words not counting the query terms themselves.</td>
-  </tr>
-  <tr bgcolor="#f0f0f0"><td>stemext</td><td>juniper.stem.max_extend</td><td>
-  The maximal number of bytes that a word in the document can be longer
-  than the keyword itself to yield a match.</td>
-  </tr>
-  <tr bgcolor="#f0f0f0"><td>stemmin</td><td>juniper.stem.min_length</td><td>
-  The minimal number of bytes in a query keyword for it to be subject to
-  the simple Juniper stemming algorithm.</td> 
-  </tr>
-  <tr bgcolor="#f0f0f0"><td>within</td><td><i>N/A</i></td><td>Same as
-  <i>near</i> with the additional constraint that matches of the query must
-  have the same order of the query words as the original query.</td>
-  </tr>
-  <tr bgcolor="#f0f0f0"><td>winsize</td><td>juniper.matcher.winsize</td><td>
-  The size of the sliding window used to determine if multiple query terms
-  occur together.</td>
-  </tr>
-  <tr bgcolor="#f0f0f0"><td>log</td><td><i>N/A</i></td><td>Internal debug
-  option (privileged port only). Value is a bitmap that allows selectively
-  enabled log output to be generated by Juniper for output into a
-  juniperlog override configured summary field. Useful only with a special
-  template that makes use of this information. Currently the only
-  supported bit is 0x8000 which will provide a html table with up to 20 of the
-  topmost matches of each document, and their identified proximity
-  (distance) and rank.
-</td>
-</table>
-
-<h3>Juniper debug template in Data Search</h3>
-Template support for the log parameter as well as extracting the whole
-juniper input document text is provided by Data Search 3.2 by means of the
-<tt>jsearch</tt> page from the qrserver port. Replace <tt>asearch</tt>
-with <tt>jsearch</tt> 
-in the URL of a qrserver privileged port search. 
-
-<h2>Using Juniper for proximity boosting with the QRserver</h2>
-In order to use Juniper to boost hits that have good proximity of the query
-(or to filter the hits based on NEAR or WITHIN constraints) the QRserver
-would need to be provided with the following URL addition:
-<pre>
-rpf_proximitybooster:enabled=1
-</pre>
-Note that Juniper will return 0 as proximity metric (dynamicteasermetric)
-if the query with juniper option constraints cannot be satisfied by
-the information in the configured input field. Thus if the selection of a
-hit is done solely on the basis of information not present in the Juniper input
-(such as the title in the default configuration)
-proximity boosting may demote such hits. A solution for this problem has
-been proposed for future versions of Juniper.
-
-<h3><a name="qrserver"></a>Supported QRserver options to use with proximity
-boosting via Juniper</h3>
-QRserver behaviour wrt. proximity boosting can be set both in configuration
-(at QRserver startup) or on a per query basis. In the below table, some of the
-default configuration settings are listed together with their corresponding
-runtime setting, if any. Consult QRserver documentation for the complete
-list of options.
-<p>
-<table>
-  <tr bgcolor="#f0f0f0"><td><b>Config parameter
-  name</b></td><td><b>Corresponding runtime (URL)
-  syntax</a></b></td><td><b>Description</b></td></tr>
-  <tr bgcolor="#f0f0f0"><td>rp.proximityboost.enabled=1</td><td>N/A</td><td>Configure for
-  proximity boosting in the QRserver (not necessarily enable it)</td></tr>
-  <tr
-  bgcolor="#f0f0f0"><td>#rp.proximityboost.default</td>
-  <td>rpf_proximityboost:enabled=1</td><td>Enable proximity
-  boosting</td></tr>
-  <tr
-  bgcolor="#f0f0f0"><td>rp.proximityboost.factor=0.5</td>
-  <td>N/A</td><td>A value that the combined proximity boost value
-  calculated possibly from multiple fields, scaled by their individual
-  factors are multipled with before adding it to the Fastserver rank value
-  to be used to reorder hits.</td></tr>
-  <tr bgcolor="#f0f0f0"><td>rp.proximityboost.hits=100</td>
-  <td>rpf_proximityboost:hits=100</td><td>The number of Fastserver hits to
-  retrieve as basis for the reordering.</td></tr>
-  <tr
-  bgcolor="#f0f0f0"><td>rp.proximityboost.maxoffset=100</td>
-  <td>N/A</td><td>The maximal offset within the list of hits that will be
-  subject to any proximity boost reordering/filtering. Hits above this
-  range in the original result set will not be subject to proximity boosting.</td></tr>
-</table>
-
-<h2>Configuring Juniper within Data Search</h2>
-Except where explicitly noted, configuring Juniper for Data Search is
-similar to configuring for Real-Time Search. As of Data Search 3.0 Juniper
-is by default configured and enabled in Data Search.
-
-<h2>Configuring Juniper within Fast Real-Time Search</h2>
-Juniper is provided as part of Real-Time Search (through Fast Search) 
-starting with version 2.4. To enable the Fast Search integrated Juniper in
-a Real-Time Search environment, see the documentation extensions to
-Real-Time Search 2.4. Note that to configure Juniper within Real-Time
-Search, the configuration variables should be put in the
-<tt>etc/fsearch.addon*</tt> file(s) which will be used as input when Real-Time
-Search generates <tt>fsearchrc</tt> files for all configured search
-engines. Also a proper <tt>summary.map</tt> file is needed to enable the
-dynamic summaries on particular fields.
-
-<h2>Configuring Juniper for Fast Search v.4.15 and higher</h2>
-Newer versions of Fast Search provide template support to allow different
-Juniper markup depending on the type of display desired (plain,html or
-xml). 
-
-All the http frontends that needs an interpretation of the highlight
-information provided by Juniper should have the following setup for Juniper:
-<p>
-<table>
-<tr bgcolor="#f0f0f0"><td>juniper.dynsum.highlight_on</td><td>\02</td></tr>
-<tr bgcolor="#f0f0f0"><td>juniper.dynsum.highlight_off</td><td>\03</td></tr>
-<tr bgcolor="#f0f0f0"><td>juniper.dynsum.continuation</td><td>\1E</td></tr>
-</table>
-<p>
-The actual frontend markup configuration then takes place by setting
-variables such as
-<p>
-<table>
-<tr bgcolor="#f0f0f0"><td>tvm.dynsum.html.highlight_on</td></tr>
-<tr bgcolor="#f0f0f0"><td>tvm.dynsum.xml.highlight_off</td></tr>
-</table>
-<p>
-in the relevant rc file. 
-
-<h2>How to report bugs/errors related to Juniper</h2>
-Errors/problems related to Juniper can be divided into two categories:
-<ol>
-  <li> Problems with specific documents/teasers
-  <li> System errors/instability etc.
-</ol>
-Due to the complexity of a full, running system it is much easier for all
-parts if the particular query/document pair triggering the problem can be
-identified and analysed off-line.
-
-<h3>Problems with specific teasers</h3>
-Problems of this category is likely to occur because there are so many
-combinations of queries and documents that it is not possible to
-test for all cases. To be able to analyse such problems, it is vital that
-the exact (byte-by-byte) teaser generation source (document summary input
-to Juniper) can be made available together with the exact query as
-presented to Juniper. To determine this requires the following information:
-<ol>
-   <li> The teaser source docsum. The name of the docsum field is dependent
-   on the configuration in summary.map. The data should be provided without
-   any post processing performed, if possible, to avoid missing problems
-   related to bad input data such as malformet UTF8 characters.
-   <li> The original query as submitted by the user
-   <li> The expanded query (available under <tt>var/log/querylogs/</tt>)
-   <li> A corresponding <tt>fsearchrc</tt> (pure fastserver4)  or
-   <tt>fsearch.addon</tt> (Real Time Search/DS 3.x) file used by the
-   fsearch process that performed the task.
-</ol>
-
-
-<h3>System errors/instability problems</h3>
-Problems of category 2 should, if occurring at all only be associated with
-development/beta releases. If such an unfortunate event should happen, the
-following information <dfn>in addition to the information associated with
-category 1</dfn> would be useful to pin down the problem:
-<ol>
-  <li> Core file of <dfn>fsearch</dfn> accompanied with the associated
-  <dfn>fsearch</dfn> binary.
-  <li> Log files from the crashed process (in Data Search these will be
-  present as <tt>var/fsearch-*.log</tt> and <tt>var/log/stdout.log</tt>.
-</ol>
author	Henning Baldersheim <balder@yahoo-inc.com>	2022-05-19 15:15:55 +0000
committer	Henning Baldersheim <balder@yahoo-inc.com>	2022-05-19 15:15:55 +0000
commit	5ecf186af848c4a518d7268fcb452b264fba939a (patch)
tree	9d9e01e78f321d24a221aa2f31cfa9a70b29a818 /juniper/doc/written
parent	3988acc5dd01647c479412d5815c37d6fa4c6a2b (diff)