diff options
Diffstat (limited to 'juniper/doc/written')
-rw-r--r-- | juniper/doc/written/fsearchparams.html | 488 |
1 files changed, 0 insertions, 488 deletions
diff --git a/juniper/doc/written/fsearchparams.html b/juniper/doc/written/fsearchparams.html deleted file mode 100644 index 1a968e55c41..00000000000 --- a/juniper/doc/written/fsearchparams.html +++ /dev/null @@ -1,488 +0,0 @@ -<!-- Copyright Yahoo. Licensed under the terms of the Apache 2.0 license. See LICENSE in the project root. --> -<title>Juniper Configuration Documentation</title> -<h1>Juniper Configuration Documentation</h1> - -<b>Note:</b> This document describes in details the functionality of Juniper v.2.1.0. -The document has gradually become more and more for internal use for -instance for detailed tuning -by Professional Service. A more high level and less detailed user level -configuration documentation is also available. -<p> -Juniper implements a combined proximity ranking and dynamic teaser result -processing module.This module is intended to be interfaced to by different -Fast software modules on demand. Currently, the only available module that -makes use of Juniper is the Fast Server module, in which Juniper currently -is an integrated part of <dfn>fsearch</dfn> (the search engine executable -that runs on each search node in the system). - -<h2>Juniper simple description of functionality/implementation</h2> - -The document body is stripped for markup during document processing and -stored as an extended document summary field. -A max limit of how much of the document that gets stored is configurable as -of Fast Server v.4.17 (see the Fast Server�configuration documentation for -details). For each document -on the result page, this document extract is retrieved and fed through -Juniper which will perform the following steps: - -<ol> - <li> Scan the stripped document text (docsum) for matches of the query, - create a data structure containing information about those matches, - and provide a quality measure (rank boost value) that can be used as a - metric to determine the quality of the document wrt. proximity and - position of the search string in the document. The data structure - contains ao. a list of matches of the query ordered by quality (see below - for the quality measure). The document quality measure is computed from the - quality measure of the best of the individual matches and the total - number of hits within the document. - - <li> Generate a dynamic teaser based on the data structure previously - generated. The dynamic teaser is composed of a number of text segments - that include the "best" matches of the query in that - document. The teaser is presented with the query words highlighted. - The definition of highlight is configurable. If the document is short - enough to fit completely into the configured teaser length, it will be - provided as is, but with highlighting of the relevant keywords. -</ol> - -Step 2 is only necessary if the teaser is going to be displayed, which -might be a decision taken on basis of the quality measure provided in step 1. - -<h3>Quality measure</h3> -The text segments matching the query are ranked by (in decreasing order of -significance): - -<ol> - <li> Completeness * keyword weight - - higher ranking if more search words are present in - the same context, and relatively higher weight on matches that - contains "important" terms compared to matches with stop words if - equal number of words. - <li> Proximity - query terms occurring near each other is better - <li> Position - earlier in document is better -</ol> - -The number of matches selected is based on text segment lengths including -a configurable amound of surrounding text, the number of matching segments -to use (configurable) and the required total summary -size (configurable). The final set of matches is -returned with markup for the hits and the abbreviated sections -(continuation symbol). - -The query used for teaser generation has undergone proper name recognition -and English spell checking. Highlighting is done on individual terms of the -query. In particular, phrases are broken down into individual terms, but -the preference to proximal terms will maintain the phrasing in the -generated teaser. - -Lemmatization by expanding documents with word inflections cannot be used -by Juniper. In the future, Juniper would expand the query based on the -original query and language information. This functionality is not -available yet, thus lemmatized terms will in general not be highlighted by -Juniper. - -Currently Juniper uses an alternative, simple brute force stemming -algorithm that basically allows prefixes of the document words to match if -the document word in question is no longer than P (configurable) bytes -<i>longer</i> than the query keyword. -This algorithm works well for keywords of a certain size, but not for very -short keywords. Thus an additional configuration variable defines a lower -bound for what lengths of keywords that will be subject to this algorithm. -With this simple algorithm in place, typical weak form singular to plural -mapping will get highlighted while the opposite, going from a long form to -a shorter one will not work as might be expected. Eg. this algorithm does -not change the keywords themselves. Consequently, the shorter forms of the -keywords are more likely to give non-exact hits in the dynamic summary. - - -<h2>Fast Search configuration of Juniper</h2> -Enabling Juniper functionality within Fast Search is done on a per field -basis by means of override specifications in summary.map. -Currently the following override specs are supported by Juniper: -<p> -<ul> -<li><pre>override <outputfield> dynamicteaser <inputfield></pre> -<li><pre>override <outputfield> dynamicteasermetric -<inputfield></pre> -<li><pre>override <outputfield> juniperlog <inputfield></pre> -</ul> -<p> -Details of the override directive can be found in <i>Fast Search 4.13 - -Dynamic Docsum Generation Framework</i>. -The <tt>dynamicteasermetric</tt> field provides a ranking of the document -based on a corresponding metric as that used to select between individual -matches for dynamic teaser generation inside a document. See the section on -using Juniper for proximity boosting <a href="#proximity">below</a> -The <tt>juniperlog</tt> field is new as of Juniper 2.0, and is used to -retrieve the information generated by Juniper by means of the log query -option, see the runtime option table <a href="#dynpar">below</a>. - -Note that when integrated into Data Search 3.1 and later, -this part of the configuration will be generated via -<ol> -<li>The index profile -<li>The index configuration (indexConfig.xml) by the config server. -</ol> - -<h3>Configuration levels</h3> -When integrated into Fast Search, Juniper receives its default parameters from -global settings in the <dfn>fsearchrc</dfn> config file. -These configuration parameter settings must be preconfigured at -<dfn>fsearch</dfn> process startup time. Two levels of system configuration -is currently supported, -<ol> -<li><b>System default configuration:</b> This is the configuration settings -exemplified by the parameter descriptions below. -<li><b>Per field configuration:</b>By using the field names instead of the -string "juniper" as prefix, the default setting can be overridden on a -<dfn>result field</dfn> basis. Eg. setting for instance -<pre>myfield.dynsum.length 512</pre> (see below) would allow the -<tt>myfield</tt> result field to receive a different teaser length. -</ol> -In addition Juniper 2.x supports changing certain subset of the parameters -on a per query basis. See <a href="#dynpar">separate section</a> on this below. -<p> -<b>Performance note:</b> The per field configuration possibility should be used with -care since overriding some parameters may cause significant computation -overhead in that Juniper would have to scan the whole text multiple times. -Changing the <tt>dynsum</tt> group of fields is generally quite performance -conservative (only the teaser generation phase would have to be repeated), -while changing any of the <tt>stem</tt> or <tt>matcher</tt> -fields would require a different text scan for each combination of -parameters. - -<h3>Arbitrary byte sequences in markup parameters</h3> -To allow arbitrary byte seqences (such as low ascii values) to be used to -denote highlight on/off and continuation symbol(s), Juniper now accepts -strings on the form \xNN where the N's are hex values [0-9a-fA-F]. -This will be converted into a byte value of NN. Note that Juniper exports -UTF-8 text so this sequence should be a valid UTF-8 byte sequence. No -checks are performed from Juniper on the validity of such strings in the -<dfn>fsearch</dfn> domain. -As a consequence of this, occurrences of backslash must be escaped -accordingly (<dfn>\\</dfn>). - - -<h3>Blanks in text parameters must be escaped</h3> -Note that <dfn>fsearchrc</dfn> does not accept blanks in the -parameters. To allow more complicated highlight markup, the sequence -<dfn>\x20</dfn> must be used as space in text fields. - -<p> -<h3><a name="em"></a>Escaping markup in the summary text</h3> -Since Juniper may supply markup through the use of the -highlight and continuation parameters, problems may occur if the -analysed text itself contains markup. To avoid this, Juniper may be -configured to escape the 5 XML/HTML markup symbols -(<dfn>"&<'></dfn>) before adding the mentioned -parametrized symbols. See the description of the -juniper.dynsum.escape_markup parameter <a href="#empar">below</a>. -<p> -The following variables are available for static, global configuration for -a particular search node: -<p> - -<table><a name="conftable"></a> - <tr bgcolor="#f0f0f0"><td><b>Parameter name</b></td><td><b>Default - value</b></td><td><b>Description</b></td> - </tr> - - <tr bgcolor="#f0f0f0"> - <td>juniper.dynsum.highlight_on</td><td><b></td> - <td>A string to be included <i>before</i> each hit in the generated summary</td> - <tr bgcolor="#f0f0f0"> - <td>juniper.dynsum.highlight_off</td> - <td></b></td><td>A string to be included <i>after</i> each hit in - the generated summary</td> - <tr bgcolor="#f0f0f0"> - <td>juniper.dynsum.continuation</td> - <td>...</td><td>A string to be included to denote - abbreviated/left out pieces of the original text in the generated summary</td> - <tr bgcolor="#f0f0f0"> - <td>juniper.dynsum.separators</td> - <td>\x1D\x1F</td><td>A string containing characters that are added for - word separation purposes (eg.CJK languages and German/Norwegian - etc. word separation). This list should contain non-word characters - only for this to be meaningful. Also, currently only single byte - characters are supported. These characters wil be removed from the - generated teaser by Juniper.</td> - <tr bgcolor="#f0f0f0"> - <td>juniper.dynsum.connectors</td> - <td>-'</td><td>A string containing characters that may connect two word - tokens to form a single word. Words connected by a single such - character will not be splitted by Juniper when generating the teaser.</td> - <tr bgcolor="#f0f0f0"> - <td><a name="empar"></a>juniper.dynsum.escape_markup</td> - <td>auto</td><td>See <a href="#em">description</a> above. Accepted values: - <dfn>on</dfn>,<dfn>off</dfn> or <dfn>auto</dfn>. If <dfn>auto</dfn> is - used, Juniper will escape markup in the generated summary if any of the symbols - <dfn>highlight_on</dfn>, <dfn>highlight_off</dfn> or - <dfn>continuation</dfn> contains a <dfn><</dfn> as the first - character. - </td> - <tr bgcolor="#f0f0f0"> - <td>juniper.dynsum.length</td> - <td>256</td><td>Length of the generated summary in bytes. This is a - hint to Juniper. The result may be slightly longer or shorter depending - on the structure of the available document text and the submitted - query.</td> - <tr bgcolor="#f0f0f0"> - <td>juniper.dynsum.max_matches</td> - <td>4</td><td>The number of (possibly partial) set of keywords - matching the query, to attempt to include in the summary. The larger this - value compared is set relative to the <i>length</i> parameter, the more - dense the keywords may appear in the summary.</td> - <tr bgcolor="#f0f0f0"> - <td>juniper.dynsum.min_length</td> - <td>128</td><td>Minimal desired length of the generated summary in - bytes. This is the shortest summary length for which the number of - matches will be respected. Eg. if - a summary appear to become shorter than <i>min_length</i> bytes with - <i>max_matches</i> matches, then additional matches will be used if available.</td> - <tr bgcolor="#f0f0f0"> - <td>juniper.dynsum.surround_max</td> - <td>80</td><td>The maximal number of bytes of context to prepend and append to - each of the selected query keyword hits. This parameter defines the - max size a summary would become if there are few keyword hits - (max_matches set low or document contained few matches of the - keywords.</td> - <tr bgcolor="#f0f0f0"> - <td>juniper.stem.min_length</td> - <td>5</td><td>The minimal number of bytes in a query keyword for - it to be subject to the simple Juniper stemming algorithm. Keywords - that are shorter than or equal to this limit will only yield exact - matches in the dynamic summaries. - </td> - <tr bgcolor="#f0f0f0"> - <td>juniper.stem.max_extend</td> - <td>3</td><td>The maximal number of bytes that a word in the document - can be <i>longer</i> than the keyword itself to yield a match. Eg. for - the default values, if the keyword is 7 bytes long, it will match any - word with length less than or equal to 10 for which the keyword is a prefix. - </td> - </tr> - <tr bgcolor="#f0f0f0"> - <td>juniper.matcher.winsize</td> - <td>400</td><td>The size of the sliding window used to determine if - multiple query terms occur together. The larger the value, the more - likely the system will find (and present in dynamic summary) complete - matches containing all the search terms. The downside is a potential - performance overhead of keeping candidates for matches longer during - matching, and consequently updating more candidates that eventually - gets thrown. - </td> - </tr> - <tr bgcolor="#f0f0f0"> - <td>juniper.proximity.factor</td><td>0.25</td><td> -A factor to multiply the internal Juniper metric with when producing -proximity metric for a given field. A real/floating point value accepted -Note that the QRserver (see <a href="#qrserver">below</a>) -also supports a factor that is global to all proximity -metric fields, and that is applied in addition. </td> - </tr> - -</table> - -<h2><a name="dynpar"></a>Alternate behaviour of Juniper on a per query -basis</h2> -As of Juniper v.2.x and Fastserver v.4.17, and QRserver for Data Search -3.2, Juniper supports a number of Juniper specific options that can be -provided as part of the URL. The format of the option string is -<pre> - juniper=<param_name>.<value>[_<param_name>.<value>]* -</pre> -As an example, consider the following URL addition: -<pre> - juniper=near.2_dynlength.512_dynmatches.8 -</pre> -If this string is present in the URL, Juniper would generate teasers that -are up to twice as long and contains up to twice as many matches of the -query compared to the default values. -In addition, teasers (and <a href="#proximity">proximity metric</a>) will -only be -generated for those documents that fulfills the extra constraint that there -exist at least one complete match of the query where the distance in words -between the first and the last word of the query match is no more than 2 (+ -the number of words in the query). - -<h3>Supported per query options in Juniper 2.1.0</h3> - -<table> - <tr bgcolor="#f0f0f0"><td><b>Parameter - name</b></td><td><b>Corresponding config name, see <a href="#conftable">above</a></b></td><td><b>Description</b></td></tr> - <tr bgcolor="#f0f0f0"><td>dynlength</td><td>dynsum.length</td> - <td>The desired max length of the generated teaser</td></tr> - <tr bgcolor="#f0f0f0"><td>dynmatches</td><td>dynsum.max_matches</td> - <td>The number of matches to try to fit in the teaser</td> - </tr> - <tr bgcolor="#f0f0f0"><td>dynsurmax</td><td>dynsum.surround_max</td> - <td>The maximal amount of surrounding context per keyword hit</td> - </tr> - <tr bgcolor="#f0f0f0"><td>near</td><td><i>N/A</i></td><td>Specifies a - proximity search where keywords should occur closer than the specified - value in number of words not counting the query terms themselves.</td> - </tr> - <tr bgcolor="#f0f0f0"><td>stemext</td><td>juniper.stem.max_extend</td><td> - The maximal number of bytes that a word in the document can be longer - than the keyword itself to yield a match.</td> - </tr> - <tr bgcolor="#f0f0f0"><td>stemmin</td><td>juniper.stem.min_length</td><td> - The minimal number of bytes in a query keyword for it to be subject to - the simple Juniper stemming algorithm.</td> - </tr> - <tr bgcolor="#f0f0f0"><td>within</td><td><i>N/A</i></td><td>Same as - <i>near</i> with the additional constraint that matches of the query must - have the same order of the query words as the original query.</td> - </tr> - <tr bgcolor="#f0f0f0"><td>winsize</td><td>juniper.matcher.winsize</td><td> - The size of the sliding window used to determine if multiple query terms - occur together.</td> - </tr> - <tr bgcolor="#f0f0f0"><td>log</td><td><i>N/A</i></td><td>Internal debug - option (privileged port only). Value is a bitmap that allows selectively - enabled log output to be generated by Juniper for output into a - juniperlog override configured summary field. Useful only with a special - template that makes use of this information. Currently the only - supported bit is 0x8000 which will provide a html table with up to 20 of the - topmost matches of each document, and their identified proximity - (distance) and rank. -</td> -</table> - -<h3>Juniper debug template in Data Search</h3> -Template support for the log parameter as well as extracting the whole -juniper input document text is provided by Data Search 3.2 by means of the -<tt>jsearch</tt> page from the qrserver port. Replace <tt>asearch</tt> -with <tt>jsearch</tt> -in the URL of a qrserver privileged port search. - -<h2>Using Juniper for proximity boosting with the QRserver</h2> -In order to use Juniper to boost hits that have good proximity of the query -(or to filter the hits based on NEAR or WITHIN constraints) the QRserver -would need to be provided with the following URL addition: -<pre> -rpf_proximitybooster:enabled=1 -</pre> -Note that Juniper will return 0 as proximity metric (dynamicteasermetric) -if the query with juniper option constraints cannot be satisfied by -the information in the configured input field. Thus if the selection of a -hit is done solely on the basis of information not present in the Juniper input -(such as the title in the default configuration) -proximity boosting may demote such hits. A solution for this problem has -been proposed for future versions of Juniper. - -<h3><a name="qrserver"></a>Supported QRserver options to use with proximity -boosting via Juniper</h3> -QRserver behaviour wrt. proximity boosting can be set both in configuration -(at QRserver startup) or on a per query basis. In the below table, some of the -default configuration settings are listed together with their corresponding -runtime setting, if any. Consult QRserver documentation for the complete -list of options. -<p> -<table> - <tr bgcolor="#f0f0f0"><td><b>Config parameter - name</b></td><td><b>Corresponding runtime (URL) - syntax</a></b></td><td><b>Description</b></td></tr> - <tr bgcolor="#f0f0f0"><td>rp.proximityboost.enabled=1</td><td>N/A</td><td>Configure for - proximity boosting in the QRserver (not necessarily enable it)</td></tr> - <tr - bgcolor="#f0f0f0"><td>#rp.proximityboost.default</td> - <td>rpf_proximityboost:enabled=1</td><td>Enable proximity - boosting</td></tr> - <tr - bgcolor="#f0f0f0"><td>rp.proximityboost.factor=0.5</td> - <td>N/A</td><td>A value that the combined proximity boost value - calculated possibly from multiple fields, scaled by their individual - factors are multipled with before adding it to the Fastserver rank value - to be used to reorder hits.</td></tr> - <tr bgcolor="#f0f0f0"><td>rp.proximityboost.hits=100</td> - <td>rpf_proximityboost:hits=100</td><td>The number of Fastserver hits to - retrieve as basis for the reordering.</td></tr> - <tr - bgcolor="#f0f0f0"><td>rp.proximityboost.maxoffset=100</td> - <td>N/A</td><td>The maximal offset within the list of hits that will be - subject to any proximity boost reordering/filtering. Hits above this - range in the original result set will not be subject to proximity boosting.</td></tr> -</table> - -<h2>Configuring Juniper within Data Search</h2> -Except where explicitly noted, configuring Juniper for Data Search is -similar to configuring for Real-Time Search. As of Data Search 3.0 Juniper -is by default configured and enabled in Data Search. - -<h2>Configuring Juniper within Fast Real-Time Search</h2> -Juniper is provided as part of Real-Time Search (through Fast Search) -starting with version 2.4. To enable the Fast Search integrated Juniper in -a Real-Time Search environment, see the documentation extensions to -Real-Time Search 2.4. Note that to configure Juniper within Real-Time -Search, the configuration variables should be put in the -<tt>etc/fsearch.addon*</tt> file(s) which will be used as input when Real-Time -Search generates <tt>fsearchrc</tt> files for all configured search -engines. Also a proper <tt>summary.map</tt> file is needed to enable the -dynamic summaries on particular fields. - -<h2>Configuring Juniper for Fast Search v.4.15 and higher</h2> -Newer versions of Fast Search provide template support to allow different -Juniper markup depending on the type of display desired (plain,html or -xml). - -All the http frontends that needs an interpretation of the highlight -information provided by Juniper should have the following setup for Juniper: -<p> -<table> -<tr bgcolor="#f0f0f0"><td>juniper.dynsum.highlight_on</td><td>\02</td></tr> -<tr bgcolor="#f0f0f0"><td>juniper.dynsum.highlight_off</td><td>\03</td></tr> -<tr bgcolor="#f0f0f0"><td>juniper.dynsum.continuation</td><td>\1E</td></tr> -</table> -<p> -The actual frontend markup configuration then takes place by setting -variables such as -<p> -<table> -<tr bgcolor="#f0f0f0"><td>tvm.dynsum.html.highlight_on</td></tr> -<tr bgcolor="#f0f0f0"><td>tvm.dynsum.xml.highlight_off</td></tr> -</table> -<p> -in the relevant rc file. - -<h2>How to report bugs/errors related to Juniper</h2> -Errors/problems related to Juniper can be divided into two categories: -<ol> - <li> Problems with specific documents/teasers - <li> System errors/instability etc. -</ol> -Due to the complexity of a full, running system it is much easier for all -parts if the particular query/document pair triggering the problem can be -identified and analysed off-line. - -<h3>Problems with specific teasers</h3> -Problems of this category is likely to occur because there are so many -combinations of queries and documents that it is not possible to -test for all cases. To be able to analyse such problems, it is vital that -the exact (byte-by-byte) teaser generation source (document summary input -to Juniper) can be made available together with the exact query as -presented to Juniper. To determine this requires the following information: -<ol> - <li> The teaser source docsum. The name of the docsum field is dependent - on the configuration in summary.map. The data should be provided without - any post processing performed, if possible, to avoid missing problems - related to bad input data such as malformet UTF8 characters. - <li> The original query as submitted by the user - <li> The expanded query (available under <tt>var/log/querylogs/</tt>) - <li> A corresponding <tt>fsearchrc</tt> (pure fastserver4) or - <tt>fsearch.addon</tt> (Real Time Search/DS 3.x) file used by the - fsearch process that performed the task. -</ol> - - -<h3>System errors/instability problems</h3> -Problems of category 2 should, if occurring at all only be associated with -development/beta releases. If such an unfortunate event should happen, the -following information <dfn>in addition to the information associated with -category 1</dfn> would be useful to pin down the problem: -<ol> - <li> Core file of <dfn>fsearch</dfn> accompanied with the associated - <dfn>fsearch</dfn> binary. - <li> Log files from the crashed process (in Data Search these will be - present as <tt>var/fsearch-*.log</tt> and <tt>var/log/stdout.log</tt>. -</ol> |