author | Jon Bratseth <bratseth@yahoo-inc.com> | 2016-06-15 23:09:44 +0200 |
---|---|---|
committer | Jon Bratseth <bratseth@yahoo-inc.com> | 2016-06-15 23:09:44 +0200 |
commit | 72231250ed81e10d66bfe70701e64fa5fe50f712 (patch) | |
tree | 2728bba1131a6f6e5bdf95afec7d7ff9358dac50 /juniper/doc |
Publish
Diffstat (limited to 'juniper/doc')
-rw-r--r-- | juniper/doc/written/fsearchparams.html | 499 |
1 files changed, 499 insertions, 0 deletions
diff --git a/juniper/doc/written/fsearchparams.html b/juniper/doc/written/fsearchparams.html
new file mode 100644
index 00000000000..b95ba895503
--- /dev/null
+++ b/juniper/doc/written/fsearchparams.html
@@ -0,0 +1,499 @@
+<!-- Copyright 2016 Yahoo Inc. Licensed under the terms of the Apache 2.0 license. See LICENSE in the project root. -->
+<title>Juniper Configuration Documentation</title>
+<h1>Juniper Configuration Documentation</h1>
+
+<b>Note:</b> This document describes in detail the functionality of Juniper v.2.1.0.
+The document has gradually become more oriented towards internal use, for
+instance for detailed tuning
+by Professional Service. A higher level and less detailed user level
+configuration documentation is also available.
+<p>
+Juniper implements a combined proximity ranking and dynamic teaser result
+processing module. This module is intended to be interfaced to by different
+Fast software modules on demand. Currently, the only available module that
+makes use of Juniper is the Fast Server module, in which Juniper currently
+is an integrated part of <dfn>fsearch</dfn> (the search engine executable
+that runs on each search node in the system).
+
+<h2>Juniper simple description of functionality/implementation</h2>
+
+The document body is stripped of markup during document processing and
+stored as an extended document summary field.
+The maximum amount of the document that gets stored is configurable as
+of Fast Server v.4.17 (see the Fast Server configuration documentation for
+details). For each document
+on the result page, this document extract is retrieved and fed through
+Juniper which will perform the following steps:
+
+<ol>
+ <li> Scan the stripped document text (docsum) for matches of the query,
+ create a data structure containing information about those matches,
+ and provide a quality measure (rank boost value) that can be used as a
+ metric to determine the quality of the document wrt.
proximity and
+ position of the search string in the document. The data structure
+ contains, among other things, a list of matches of the query ordered by quality (see below
+ for the quality measure). The document quality measure is computed from the
+ quality measure of the best of the individual matches and the total
+ number of hits within the document.
+
+ <li> Generate a dynamic teaser based on the data structure previously
+ generated. The dynamic teaser is composed of a number of text segments
+ that include the "best" matches of the query in that
+ document. The teaser is presented with the query words highlighted.
+ The definition of highlight is configurable. If the document is short
+ enough to fit completely into the configured teaser length, it will be
+ provided as is, but with highlighting of the relevant keywords.
+</ol>
+
+Step 2 is only necessary if the teaser is going to be displayed, which
+might be a decision taken on the basis of the quality measure provided in step 1.
+
+<h3>Quality measure</h3>
+The text segments matching the query are ranked by (in decreasing order of
+significance):
+
+<ol>
+ <li> Completeness * keyword weight -
+ higher ranking if more search words are present in
+ the same context, and relatively higher weight on matches that
+ contain "important" terms compared to matches with stop words if
+ equal number of words.
+ <li> Proximity - query terms occurring near each other are better
+ <li> Position - earlier in document is better
+</ol>
+
+The number of matches selected is based on text segment lengths including
+a configurable amount of surrounding text, the number of matching segments
+to use (configurable) and the required total summary
+size (configurable). The final set of matches is
+returned with markup for the hits and the abbreviated sections
+(continuation symbol).
+
+The query used for teaser generation has undergone proper name recognition
+and English spell checking. Highlighting is done on individual terms of the
+query.
In particular, phrases are broken down into individual terms, but
+the preference for proximal terms will maintain the phrasing in the
+generated teaser.
+
+Lemmatization by expanding documents with word inflections cannot be used
+by Juniper. In the future, Juniper would expand the query based on the
+original query and language information. This functionality is not
+available yet, thus lemmatized terms will in general not be highlighted by
+Juniper.
+
+Currently Juniper uses an alternative, simple brute force stemming
+algorithm that basically allows prefixes of the document words to match if
+the document word in question is no more than P (configurable) bytes
+<i>longer</i> than the query keyword.
+This algorithm works well for keywords of a certain size, but not for very
+short keywords. Thus an additional configuration variable defines a lower
+bound for the length of keywords that will be subject to this algorithm.
+With this simple algorithm in place, typical weak form singular to plural
+mapping will get highlighted while the opposite, going from a long form to
+a shorter one, will not work as might be expected. That is, this algorithm does
+not change the keywords themselves. Consequently, the shorter forms of the
+keywords are more likely to give non-exact hits in the dynamic summary.
+
+
+<h2>Fast Search configuration of Juniper</h2>
+Enabling Juniper functionality within Fast Search is done on a per field
+basis by means of override specifications in summary.map.
+Currently the following override specs are supported by Juniper:
+<p>
+<ul>
+<li><pre>override &lt;outputfield&gt; dynamicteaser &lt;inputfield&gt;</pre>
+<li><pre>override &lt;outputfield&gt; dynamicteasermetric
+&lt;inputfield&gt;</pre>
+<li><pre>override &lt;outputfield&gt; juniperlog &lt;inputfield&gt;</pre>
+</ul>
+<p>
+Details of the override directive can be found in <i>Fast Search 4.13 -
+Dynamic Docsum Generation Framework</i>.
+The <tt>dynamicteasermetric</tt> field provides a ranking of the document
+based on a metric corresponding to the one used to select between individual
+matches for dynamic teaser generation inside a document. See the section on
+using Juniper for proximity boosting <a href="#proximity">below</a>.
+The <tt>juniperlog</tt> field is new as of Juniper 2.0, and is used to
+retrieve the information generated by Juniper by means of the log query
+option, see the runtime option table <a href="#dynpar">below</a>.
+
+Note that when integrated into Data Search 3.1 and later,
+this part of the configuration will be generated via
+<ol>
+<li>The index profile
+<li>The index configuration (indexConfig.xml) by the config server.
+</ol>
+
+<h3>Configuration levels</h3>
+When integrated into Fast Search, Juniper receives its default parameters from
+global settings in the <dfn>fsearchrc</dfn> config file.
+These configuration parameter settings must be preconfigured at
+<dfn>fsearch</dfn> process startup time. Two levels of system configuration
+are currently supported:
+<ol>
+<li><b>System default configuration:</b> These are the configuration settings
+exemplified by the parameter descriptions below.
+<li><b>Per field configuration:</b> By using the field names instead of the
+string "juniper" as prefix, the default setting can be overridden on a
+<dfn>result field</dfn> basis. For instance, setting
+<pre>myfield.dynsum.length 512</pre> (see below) would allow the
+<tt>myfield</tt> result field to receive a different teaser length.
+</ol>
+In addition Juniper 2.x supports changing a certain subset of the parameters
+on a per query basis. See the <a href="#dynpar">separate section</a> on this below.
+<p>
+<b>Performance note:</b> The per field configuration possibility should be used with
+care since overriding some parameters may cause significant computation
+overhead in that Juniper would have to scan the whole text multiple times.
+Changing the <tt>dynsum</tt> group of fields is generally quite performance
+conservative (only the teaser generation phase would have to be repeated),
+while changing any of the <tt>stem</tt> or <tt>matcher</tt>
+fields would require a different text scan for each combination of
+parameters.
+
+<h3>Arbitrary byte sequences in markup parameters</h3>
+To allow arbitrary byte sequences (such as low ascii values) to be used to
+denote highlight on/off and continuation symbol(s), Juniper now accepts
+strings of the form \xNN where the N's are hex digits [0-9a-fA-F].
+This will be converted into a byte value of NN. Note that Juniper exports
+UTF-8 text so this sequence should be a valid UTF-8 byte sequence. No
+checks are performed by Juniper on the validity of such strings in the
+<dfn>fsearch</dfn> domain.
+As a consequence of this, occurrences of backslash must be escaped
+accordingly (<dfn>\\</dfn>).
+
+
+<h3>Blanks in text parameters must be escaped</h3>
+Note that <dfn>fsearchrc</dfn> does not accept blanks in the
+parameters. To allow more complicated highlight markup, the sequence
+<dfn>\x20</dfn> must be used as space in text fields.
+
+<p>
+<h3><a name="em"></a>Escaping markup in the summary text</h3>
+Since Juniper may supply markup through the use of the
+highlight and continuation parameters, problems may occur if the
+analysed text itself contains markup. To avoid this, Juniper may be
+configured to escape the 5 XML/HTML markup symbols
+(<dfn>&quot;&amp;&lt;'&gt;</dfn>) before adding the mentioned
+parametrized symbols. See the description of the
+juniper.dynsum.escape_markup parameter <a href="#empar">below</a>.
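
The escaping rules in the three sections above can be illustrated with a small sketch. This is not Juniper's code: the function names are hypothetical, and the entity chosen for each of the five markup symbols (in particular `&#39;` for the apostrophe) is an assumption, since the text only says that the symbols are escaped.

```python
import re

# Matches an escaped backslash (\\) or a \xNN hex escape in a parameter value.
_ESCAPE = re.compile(rb"\\(\\|x[0-9a-fA-F]{2})")


def decode_param(raw: str) -> bytes:
    """Turn \\xNN into the byte NN and \\\\ into a single backslash (hypothetical helper)."""
    def repl(match):
        body = match.group(1)
        if body == b"\\":
            return b"\\"
        return bytes([int(body[1:], 16)])
    return _ESCAPE.sub(repl, raw.encode("utf-8"))


# Assumed replacements for the five symbols " & < ' > named in the text.
_MARKUP = {'"': "&quot;", "&": "&amp;", "<": "&lt;", "'": "&#39;", ">": "&gt;"}


def escape_markup(text: str) -> str:
    """Escape markup symbols in document text before highlight strings are added."""
    return "".join(_MARKUP.get(ch, ch) for ch in text)
```

For example, a parameter value written as `<b\x20class=hit>` in the config file would decode to `<b class=hit>`, with `\x20` standing in for the blank that the config format does not accept.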
+<p>
+The following variables are available for static, global configuration for
+a particular search node:
+<p>
+
+<table><a name="conftable"></a>
+ <tr bgcolor="#f0f0f0"><td><b>Parameter name</b></td><td><b>Default
+ value</b></td><td><b>Description</b></td>
+ </tr>
+
+ <tr bgcolor="#f0f0f0">
+ <td>juniper.dynsum.highlight_on</td><td>&lt;b&gt;</td>
+ <td>A string to be included <i>before</i> each hit in the generated summary</td>
+ <tr bgcolor="#f0f0f0">
+ <td>juniper.dynsum.highlight_off</td>
+ <td>&lt;/b&gt;</td><td>A string to be included <i>after</i> each hit in
+ the generated summary</td>
+ <tr bgcolor="#f0f0f0">
+ <td>juniper.dynsum.continuation</td>
+ <td>...</td><td>A string to be included to denote
+ abbreviated/left out pieces of the original text in the generated summary</td>
+ <tr bgcolor="#f0f0f0">
+ <td>juniper.dynsum.separators</td>
+ <td>\x1D\x1F</td><td>A string containing characters that are added for
+ word separation purposes (e.g. CJK languages and German/Norwegian
+ etc. word separation). This list should contain non-word characters
+ only for this to be meaningful. Also, currently only single byte
+ characters are supported. These characters will be removed from the
+ generated teaser by Juniper.</td>
+ <tr bgcolor="#f0f0f0">
+ <td>juniper.dynsum.connectors</td>
+ <td>-'</td><td>A string containing characters that may connect two word
+ tokens to form a single word. Words connected by a single such
+ character will not be split by Juniper when generating the teaser.</td>
+ <tr bgcolor="#f0f0f0">
+ <td><a name="empar"></a>juniper.dynsum.escape_markup</td>
+ <td>auto</td><td>See <a href="#em">description</a> above. Accepted values:
+ <dfn>on</dfn>, <dfn>off</dfn> or <dfn>auto</dfn>. If <dfn>auto</dfn> is
+ used, Juniper will escape markup in the generated summary if any of the symbols
+ <dfn>highlight_on</dfn>, <dfn>highlight_off</dfn> or
+ <dfn>continuation</dfn> contains a <dfn>&lt;</dfn> as the first
+ character.
+ </td>
+ <tr bgcolor="#f0f0f0">
+ <td>juniper.dynsum.length</td>
+ <td>256</td><td>Length of the generated summary in bytes. This is a
+ hint to Juniper. The result may be slightly longer or shorter depending
+ on the structure of the available document text and the submitted
+ query.</td>
+ <tr bgcolor="#f0f0f0">
+ <td>juniper.dynsum.max_matches</td>
+ <td>4</td><td>The number of (possibly partial) sets of keywords
+ matching the query, to attempt to include in the summary. The larger this
+ value is set relative to the <i>length</i> parameter, the more
+ densely the keywords may appear in the summary.</td>
+ <tr bgcolor="#f0f0f0">
+ <td>juniper.dynsum.min_length</td>
+ <td>128</td><td>Minimal desired length of the generated summary in
+ bytes. This is the shortest summary length for which the number of
+ matches will be respected. E.g. if
+ a summary appears to become shorter than <i>min_length</i> bytes with
+ <i>max_matches</i> matches, then additional matches will be used if available.</td>
+ <tr bgcolor="#f0f0f0">
+ <td>juniper.dynsum.surround_max</td>
+ <td>80</td><td>The maximal number of bytes of context to prepend and append to
+ each of the selected query keyword hits. This parameter defines the
+ max size a summary would become if there are few keyword hits
+ (max_matches set low or document contained few matches of the
+ keywords).</td>
+ <tr bgcolor="#f0f0f0">
+ <td>juniper.stem.min_length</td>
+ <td>5</td><td>The minimal number of bytes in a query keyword for
+ it to be subject to the simple Juniper stemming algorithm. Keywords
+ that are shorter than or equal to this limit will only yield exact
+ matches in the dynamic summaries.
+ </td>
+ <tr bgcolor="#f0f0f0">
+ <td>juniper.stem.max_extend</td>
+ <td>3</td><td>The maximal number of bytes that a word in the document
+ can be <i>longer</i> than the keyword itself to yield a match. E.g.
for
+ the default values, if the keyword is 7 bytes long, it will match any
+ word with length less than or equal to 10 for which the keyword is a prefix.
+ </td>
+ </tr>
+ <tr bgcolor="#f0f0f0">
+ <td>juniper.matcher.winsize</td>
+ <td>400</td><td>The size of the sliding window used to determine if
+ multiple query terms occur together. The larger the value, the more
+ likely the system will find (and present in dynamic summary) complete
+ matches containing all the search terms. The downside is a potential
+ performance overhead of keeping candidates for matches longer during
+ matching, and consequently updating more candidates that eventually
+ get thrown away.
+ </td>
+ </tr>
+ <tr bgcolor="#f0f0f0">
+ <td>juniper.proximity.factor</td><td>0.25</td><td>
+A factor to multiply the internal Juniper metric with when producing
+proximity metric for a given field. A real/floating point value is accepted.
+Note that the QRserver (see <a href="#qrserver">below</a>)
+also supports a factor that is global to all proximity
+metric fields, and that is applied in addition. </td>
+ </tr>
+
+</table>
+
+<h2><a name="dynpar"></a>Alternate behaviour of Juniper on a per query
+basis</h2>
+As of Juniper v.2.x and Fastserver v.4.17, and QRserver for Data Search
+3.2, Juniper supports a number of Juniper specific options that can be
+provided as part of the URL. The format of the option string is
+<pre>
+ juniper=&lt;param_name&gt;.&lt;value&gt;[_&lt;param_name&gt;.&lt;value&gt;]*
+</pre>
+As an example, consider the following URL addition:
+<pre>
+ juniper=near.2_dynlength.512_dynmatches.8
+</pre>
+If this string is present in the URL, Juniper would generate teasers that
+are up to twice as long and contain up to twice as many matches of the
+query compared to the default values.
+In addition, teasers (and <a href="#proximity">proximity metric</a>) will
+only be
+generated for those documents that fulfill the extra constraint that there
+exists at least one complete match of the query where the distance in words
+between the first and the last word of the query match is no more than 2 (+
+the number of words in the query).
+
+<h3>Supported per query options in Juniper 2.1.0</h3>
+
+<table>
+ <tr bgcolor="#f0f0f0"><td><b>Parameter
+ name</b></td><td><b>Corresponding config name, see <a href="#conftable">above</a></b></td><td><b>Description</b></td></tr>
+ <tr bgcolor="#f0f0f0"><td>dynlength</td><td>juniper.dynsum.length</td>
+ <td>The desired max length of the generated teaser</td></tr>
+ <tr bgcolor="#f0f0f0"><td>dynmatches</td><td>juniper.dynsum.max_matches</td>
+ <td>The number of matches to try to fit in the teaser</td>
+ </tr>
+ <tr bgcolor="#f0f0f0"><td>dynsurmax</td><td>juniper.dynsum.surround_max</td>
+ <td>The maximal amount of surrounding context per keyword hit</td>
+ </tr>
+ <tr bgcolor="#f0f0f0"><td>near</td><td><i>N/A</i></td><td>Specifies a
+ proximity search where keywords should occur closer than the specified
+ value in number of words not counting the query terms themselves.</td>
+ </tr>
+ <tr bgcolor="#f0f0f0"><td>stemext</td><td>juniper.stem.max_extend</td><td>
+ The maximal number of bytes that a word in the document can be longer
+ than the keyword itself to yield a match.</td>
+ </tr>
+ <tr bgcolor="#f0f0f0"><td>stemmin</td><td>juniper.stem.min_length</td><td>
+ The minimal number of bytes in a query keyword for it to be subject to
+ the simple Juniper stemming algorithm.</td>
+ </tr>
+ <tr bgcolor="#f0f0f0"><td>within</td><td><i>N/A</i></td><td>Same as
+ <i>near</i> with the additional constraint that matches of the query must
+ have the same order of the query words as the original query.</td>
+ </tr>
+ <tr bgcolor="#f0f0f0"><td>winsize</td><td>juniper.matcher.winsize</td><td>
+ The size of the sliding window used to determine if multiple
query terms
+ occur together.</td>
+ </tr>
+ <tr bgcolor="#f0f0f0"><td>log</td><td><i>N/A</i></td><td>Internal debug
+ option (privileged port only). Value is a bitmap that allows selectively
+ enabled log output to be generated by Juniper for output into a
+ juniperlog override configured summary field. Useful only with a special
+ template that makes use of this information. Currently the only
+ supported bit is 0x8000 which will provide an HTML table with up to 20 of the
+ topmost matches of each document, and their identified proximity
+ (distance) and rank.
+</td>
+</table>
+
+<h3>Juniper debug template in Data Search</h3>
+Template support for the log parameter as well as extracting the whole
+juniper input document text is provided by Data Search 3.2 by means of the
+<tt>jsearch</tt> page from the qrserver port. Replace <tt>asearch</tt>
+with <tt>jsearch</tt>
+in the URL of a qrserver privileged port search.
+
+<h2><a name="proximity"></a>Using Juniper for proximity boosting with the QRserver</h2>
+In order to use Juniper to boost hits that have good proximity of the query
+(or to filter the hits based on NEAR or WITHIN constraints) the QRserver
+would need to be provided with the following URL addition:
+<pre>
+rpf_proximitybooster:enabled=1
+</pre>
+Note that Juniper will return 0 as proximity metric (dynamicteasermetric)
+if the query with juniper option constraints cannot be satisfied by
+the information in the configured input field. Thus if the selection of a
+hit is done solely on the basis of information not present in the Juniper input
+(such as the title in the default configuration)
+proximity boosting may demote such hits. A solution for this problem has
+been proposed for future versions of Juniper.
+
+<h3><a name="qrserver"></a>Supported QRserver options to use with proximity
+boosting via Juniper</h3>
+QRserver behaviour wrt. proximity boosting can be set either in configuration
+(at QRserver startup) or on a per query basis.
In the table below, some of the
+default configuration settings are listed together with their corresponding
+runtime setting, if any. Consult QRserver documentation for the complete
+list of options.
+<p>
+<table>
+ <tr bgcolor="#f0f0f0"><td><b>Config parameter
+ name</b></td><td><b>Corresponding runtime (URL)
+ syntax</b></td><td><b>Description</b></td></tr>
+ <tr bgcolor="#f0f0f0"><td>rp.proximityboost.enabled=1</td><td>N/A</td><td>Configure for
+ proximity boosting in the QRserver (not necessarily enable it)</td></tr>
+ <tr
+ bgcolor="#f0f0f0"><td>#rp.proximityboost.default</td>
+ <td>rpf_proximityboost:enabled=1</td><td>Enable proximity
+ boosting</td></tr>
+ <tr
+ bgcolor="#f0f0f0"><td>rp.proximityboost.factor=0.5</td>
+ <td>N/A</td><td>A value that the combined proximity boost value,
+ calculated possibly from multiple fields scaled by their individual
+ factors, is multiplied with before adding it to the Fastserver rank value
+ to be used to reorder hits.</td></tr>
+ <tr bgcolor="#f0f0f0"><td>rp.proximityboost.hits=100</td>
+ <td>rpf_proximityboost:hits=100</td><td>The number of Fastserver hits to
+ retrieve as basis for the reordering.</td></tr>
+ <tr
+ bgcolor="#f0f0f0"><td>rp.proximityboost.maxoffset=100</td>
+ <td>N/A</td><td>The maximal offset within the list of hits that will be
+ subject to any proximity boost reordering/filtering. Hits above this
+ range in the original result set will not be subject to proximity boosting.</td></tr>
+</table>
+
+<h2>Configuring Juniper within Data Search</h2>
+Except where explicitly noted, configuring Juniper for Data Search is
+similar to configuring for Real-Time Search. As of Data Search 3.0 Juniper
+is by default configured and enabled in Data Search.
+
+<h2>Configuring Juniper within Fast Real-Time Search</h2>
+Juniper is provided as part of Real-Time Search (through Fast Search)
+starting with version 2.4.
To enable the Fast Search integrated Juniper in
+a Real-Time Search environment, see the documentation extensions to
+Real-Time Search 2.4. Note that to configure Juniper within Real-Time
+Search, the configuration variables should be put in the
+<tt>etc/fsearch.addon*</tt> file(s) which will be used as input when Real-Time
+Search generates <tt>fsearchrc</tt> files for all configured search
+engines. Also a proper <tt>summary.map</tt> file is needed to enable the
+dynamic summaries on particular fields.
+
+<h2>Configuring Juniper for Fast Search v.4.15 and higher</h2>
+Newer versions of Fast Search provide template support to allow different
+Juniper markup depending on the type of display desired (plain, html or
+xml).
+
+All the http frontends that need an interpretation of the highlight
+information provided by Juniper should have the following setup for Juniper:
+<p>
+<table>
+<tr bgcolor="#f0f0f0"><td>juniper.dynsum.highlight_on</td><td>\02</td></tr>
+<tr bgcolor="#f0f0f0"><td>juniper.dynsum.highlight_off</td><td>\03</td></tr>
+<tr bgcolor="#f0f0f0"><td>juniper.dynsum.continuation</td><td>\1E</td></tr>
+</table>
+<p>
+The actual frontend markup configuration then takes place by setting
+variables such as
+<p>
+<table>
+<tr bgcolor="#f0f0f0"><td>tvm.dynsum.html.highlight_on</td></tr>
+<tr bgcolor="#f0f0f0"><td>tvm.dynsum.xml.highlight_off</td></tr>
+</table>
+<p>
+in the relevant rc file.
+
+Note that for a Data Search installation, the QR server has
+the responsibility of providing dynamic teasers; consequently the
+<tt>etc/qrserver/qrserverrc</tt> file should provide the above
+configuration.
+<p>
+For a similar Web Search configuration, using a "bare-bone" top level
+fdispatch instance, the <tt>fdispatchrc</tt> file is the appropriate place.
+<p>
+For a "standalone" Real-Time Search setup, the appropriate configuration
+file(s) is the fdispatch.addon file(s).
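
To make the two-stage markup setup above more concrete, here is a rough sketch of what a frontend might do with a teaser that carries the neutral marker bytes. It assumes that the table values \02, \03 and \1E denote the byte values 0x02, 0x03 and 0x1E, and the per-display markup strings are invented for illustration; a real installation would take them from variables such as tvm.dynsum.html.highlight_on.

```python
# Marker bytes emitted by Juniper, assuming \02, \03 and \1E are hex byte values.
HIGHLIGHT_ON = "\x02"
HIGHLIGHT_OFF = "\x03"
CONTINUATION = "\x1e"

# Hypothetical display-specific markup (not taken from any shipped rc file).
MARKUP = {
    "html": {HIGHLIGHT_ON: "<b>", HIGHLIGHT_OFF: "</b>", CONTINUATION: "..."},
    "xml": {HIGHLIGHT_ON: "<hi>", HIGHLIGHT_OFF: "</hi>", CONTINUATION: "<sep/>"},
    "plain": {HIGHLIGHT_ON: "", HIGHLIGHT_OFF: "", CONTINUATION: "..."},
}


def render_teaser(teaser: str, display: str) -> str:
    """Replace the neutral marker bytes with the markup for one display type."""
    return teaser.translate(str.maketrans(MARKUP[display]))
```

Keeping the markers as single control bytes in the search node configuration means the same teaser can be rendered as plain text, HTML or XML without regenerating it.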
+
+<h2>How to report bugs/errors related to Juniper</h2>
+Errors/problems related to Juniper can be divided into two categories:
+<ol>
+ <li> Problems with specific documents/teasers
+ <li> System errors/instability etc.
+</ol>
+Due to the complexity of a full, running system it is much easier for all
+parties if the particular query/document pair triggering the problem can be
+identified and analysed off-line.
+
+<h3>Problems with specific teasers</h3>
+Problems of this category are likely to occur because there are so many
+combinations of queries and documents that it is not possible to
+test for all cases. To be able to analyse such problems, it is vital that
+the exact (byte-by-byte) teaser generation source (document summary input
+to Juniper) can be made available together with the exact query as
+presented to Juniper. To determine this requires the following information:
+<ol>
+ <li> The teaser source docsum. The name of the docsum field is dependent
+ on the configuration in summary.map. The data should be provided without
+ any post processing performed, if possible, to avoid missing problems
+ related to bad input data such as malformed UTF-8 characters.
+ <li> The original query as submitted by the user
+ <li> The expanded query (available under <tt>var/log/querylogs/</tt>)
+ <li> A corresponding <tt>fsearchrc</tt> (pure fastserver4) or
+ <tt>fsearch.addon</tt> (Real Time Search/DS 3.x) file used by the
+ fsearch process that performed the task.
+</ol>
+
+
+<h3>System errors/instability problems</h3>
+Problems of category 2 should, if occurring at all, only be associated with
+development/beta releases. If such an unfortunate event should happen, the
+following information <dfn>in addition to the information associated with
+category 1</dfn> would be useful to pin down the problem:
+<ol>
+ <li> Core file of <dfn>fsearch</dfn> accompanied by the associated
+ <dfn>fsearch</dfn> binary.
+ <li> Log files from the crashed process (in Data Search these will be
+ present as <tt>var/fsearch-*.log</tt> and <tt>var/log/stdout.log</tt>).
+</ol>
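
When analysing a teaser problem off-line, it can help to check by hand whether a given document word should have been highlighted at all. The sketch below restates the prefix stemming rule from the juniper.stem.min_length and juniper.stem.max_extend descriptions in the file. It is an illustration only: the function name is hypothetical, and comparing raw UTF-8 bytes without case folding is an assumption.

```python
def stem_match(keyword: str, word: str, min_length: int = 5, max_extend: int = 3) -> bool:
    """Return True if `word` would count as a hit for `keyword` under the prefix rule."""
    kw = keyword.encode("utf-8")
    w = word.encode("utf-8")
    if kw == w:
        return True  # exact matches always count
    if len(kw) <= min_length:
        return False  # keywords at or below the limit only match exactly
    # The keyword must be a prefix, and the word at most max_extend bytes longer.
    return w.startswith(kw) and len(w) - len(kw) <= max_extend
```

With the default values, a 7 byte keyword such as `network` matches `networking` (10 bytes) but not an 11 byte word, and `engines` as a keyword does not match the shorter `engine`.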