Juniper implements a combined proximity ranking and dynamic teaser result processing module. This module is intended to be used on demand by different Fast software modules. Currently, the only module that makes use of Juniper is the Fast Server module, in which Juniper is an integrated part of fsearch (the search engine executable that runs on each search node in the system).
override <outputfield> dynamicteaser <inputfield>
override <outputfield> dynamicteasermetric <inputfield>
override <outputfield> juniperlog <inputfield>
Details of the override directive can be found in Fast Search 4.13 - Dynamic Docsum Generation Framework. The dynamicteasermetric field provides a ranking of the document based on a metric corresponding to the one used to select between individual matches for dynamic teaser generation inside a document. See the section on using Juniper for proximity boosting below. The juniperlog field is new as of Juniper 2.0, and is used to retrieve the information generated by Juniper by means of the log query option; see the runtime option table below. Note that when integrated into Data Search 3.1 and later, this part of the configuration will be generated via
myfield.dynsum.length 512 (see below) would allow the myfield result field to receive a different teaser length.
Performance note: Per-field configuration should be used with care, since overriding some parameters may cause significant computation overhead: Juniper would have to scan the whole text multiple times. Changing the dynsum group of fields is generally inexpensive (only the teaser generation phase has to be repeated), while changing any of the stem or matcher fields requires a separate text scan for each combination of parameters.
The following variables are available for static, global configuration for a particular search node:
Parameter name | Default value | Description |
juniper.dynsum.highlight_on | <b> | A string to be included before each hit in the generated summary |
juniper.dynsum.highlight_off | </b> | A string to be included after each hit in the generated summary |
juniper.dynsum.continuation | ... | A string to be included to denote abbreviated/left out pieces of the original text in the generated summary |
juniper.dynsum.separators | \x1D\x1F | A string containing characters that are added for word separation purposes (e.g. word separation in CJK languages, German, Norwegian etc.). This list should contain non-word characters only for this to be meaningful. Also, currently only single byte characters are supported. These characters will be removed from the generated teaser by Juniper. |
juniper.dynsum.connectors | -' | A string containing characters that may connect two word tokens to form a single word. Words connected by a single such character will not be split by Juniper when generating the teaser. |
juniper.dynsum.escape_markup | auto | See description above. Accepted values: on, off or auto. If auto is used, Juniper will escape markup in the generated summary if any of the symbols highlight_on, highlight_off or continuation contains a < as the first character. |
juniper.dynsum.length | 256 | Length of the generated summary in bytes. This is a hint to Juniper. The result may be slightly longer or shorter depending on the structure of the available document text and the submitted query. |
juniper.dynsum.max_matches | 4 | The number of (possibly partial) sets of keywords matching the query to attempt to include in the summary. The larger this value is set relative to the length parameter, the denser the keywords may appear in the summary. |
juniper.dynsum.min_length | 128 | Minimal desired length of the generated summary in bytes. This is the shortest summary length for which the number of matches will be respected. E.g. if a summary appears to become shorter than min_length bytes with max_matches matches, then additional matches will be used if available. |
juniper.dynsum.surround_max | 80 | The maximal number of bytes of context to prepend and append to each of the selected query keyword hits. This parameter defines the maximum size a summary would reach if there are few keyword hits (max_matches set low, or the document contained few matches of the keywords). |
juniper.stem.min_length | 5 | The minimal number of bytes in a query keyword for it to be subject to the simple Juniper stemming algorithm. Keywords that are shorter than or equal to this limit will only yield exact matches in the dynamic summaries. |
juniper.stem.max_extend | 3 | The maximal number of bytes that a word in the document can be longer than the keyword itself to yield a match. E.g. for the default values, if the keyword is 7 bytes long, it will match any word with length less than or equal to 10 for which the keyword is a prefix. |
juniper.matcher.winsize | 400 | The size of the sliding window used to determine if multiple query terms occur together. The larger the value, the more likely the system will find (and present in the dynamic summary) complete matches containing all the search terms. The downside is a potential performance overhead from keeping candidates for matches longer during matching, and consequently updating more candidates that eventually get discarded. |
juniper.proximity.factor | 0.25 | A factor to multiply the internal Juniper metric with when producing the proximity metric for a given field. A real (floating point) value is accepted. Note that the QRserver (see below) also supports a factor that is global to all proximity metric fields, and that is applied in addition. |
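The interaction between juniper.stem.min_length and juniper.stem.max_extend described in the table can be illustrated with a small sketch. This is not Juniper's implementation; the function name stem_match is hypothetical, byte lengths are assumed to be measured on UTF-8 encoded text, and case folding and tokenization are ignored.

```python
def stem_match(keyword: str, word: str,
               min_length: int = 5, max_extend: int = 3) -> bool:
    """Sketch of the prefix rule described for stem.min_length / stem.max_extend.

    Assumption: lengths are counted in bytes of the UTF-8 encoding.
    """
    kw = keyword.encode("utf-8")
    w = word.encode("utf-8")
    if w == kw:
        # An exact match always counts.
        return True
    if len(kw) <= min_length:
        # Keywords at or below the limit only yield exact matches.
        return False
    # The keyword must be a prefix, and the word may be at most
    # max_extend bytes longer than the keyword.
    return w.startswith(kw) and len(w) - len(kw) <= max_extend
```

With the default values, a 6-byte keyword such as "search" would match "searching" (3 bytes longer) but not "searchable" (4 bytes longer), while a 5-byte keyword such as "index" would only match itself.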
juniper=<param_name>.<value>[_<param_name>.<value>]*
As an example, consider the following URL addition:
juniper=near.2_dynlength.512_dynmatches.8
If this string is present in the URL, Juniper would generate teasers that are up to twice as long and contain up to twice as many matches of the query compared to the default values. In addition, teasers (and the proximity metric) will only be generated for those documents that fulfill the extra constraint that there exists at least one complete match of the query where the distance in words between the first and the last word of the query match is no more than 2 (+ the number of words in the query).
Parameter name | Corresponding config name, see above | Description |
dynlength | dynsum.length | The desired max length of the generated teaser |
dynmatches | dynsum.max_matches | The number of matches to try to fit in the teaser |
dynsurmax | dynsum.surround_max | The maximal amount of surrounding context per keyword hit |
near | N/A | Specifies a proximity search where keywords should occur closer than the specified value in number of words not counting the query terms themselves. |
stemext | juniper.stem.max_extend | The maximal number of bytes that a word in the document can be longer than the keyword itself to yield a match. |
stemmin | juniper.stem.min_length | The minimal number of bytes in a query keyword for it to be subject to the simple Juniper stemming algorithm. |
within | N/A | Same as near with the additional constraint that matches of the query must have the same order of the query words as the original query. |
winsize | juniper.matcher.winsize | The size of the sliding window used to determine if multiple query terms occur together. |
log | N/A | Internal debug option (privileged port only). Value is a bitmap that allows log output from Juniper to be selectively enabled for output into a summary field configured with the juniperlog override. Useful only with a special template that makes use of this information. Currently the only supported bit is 0x8000, which will provide an HTML table with up to 20 of the topmost matches of each document, and their identified proximity (distance) and rank. |
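The runtime option syntax juniper=<param_name>.<value>[_<param_name>.<value>]* can be illustrated with a small parser. This is a sketch only: the function names parse_juniper_options and format_juniper_options are hypothetical, values are kept as strings because the source does not specify how each value is interpreted, and rejecting unknown names is an assumption about desirable client-side behavior, not a description of what Juniper does.

```python
from typing import Dict

# Parameter names taken from the runtime option table above.
KNOWN_PARAMS = {
    "dynlength", "dynmatches", "dynsurmax", "near", "stemext",
    "stemmin", "within", "winsize", "log",
}


def parse_juniper_options(value: str) -> Dict[str, str]:
    """Split the value of the juniper URL parameter into name/value pairs."""
    options: Dict[str, str] = {}
    if not value:
        return options
    for item in value.split("_"):           # options are separated by "_"
        name, sep, raw = item.partition(".")  # name and value by the first "."
        if not sep or not name or not raw:
            raise ValueError(f"malformed option: {item!r}")
        if name not in KNOWN_PARAMS:
            raise ValueError(f"unknown option: {name!r}")
        options[name] = raw
    return options


def format_juniper_options(options: Dict[str, str]) -> str:
    """Build the URL addition from name/value pairs, in insertion order."""
    return "juniper=" + "_".join(f"{k}.{v}" for k, v in options.items())
```

For the example string near.2_dynlength.512_dynmatches.8, the parser returns the three pairs near=2, dynlength=512 and dynmatches=8, and formatting those pairs again reproduces the original URL addition.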
rpf_proximitybooster:enabled=1
Note that Juniper will return 0 as proximity metric (dynamicteasermetric) if the query with juniper option constraints cannot be satisfied by the information in the configured input field. Thus, if the selection of a hit is done solely on the basis of information not present in the Juniper input (such as the title in the default configuration), proximity boosting may demote such hits. A solution for this problem has been proposed for future versions of Juniper.
Config parameter name | Corresponding runtime (URL) syntax | Description |
rp.proximityboost.enabled=1 | N/A | Configure for proximity boosting in the QRserver (not necessarily enable it) |
#rp.proximityboost.default | rpf_proximityboost:enabled=1 | Enable proximity boosting |
rp.proximityboost.factor=0.5 | N/A | A value by which the combined proximity boost value (calculated from possibly multiple fields, each scaled by its individual factor) is multiplied before it is added to the Fastserver rank value used to reorder hits. |
rp.proximityboost.hits=100 | rpf_proximityboost:hits=100 | The number of Fastserver hits to retrieve as basis for the reordering. |
rp.proximityboost.maxoffset=100 | N/A | The maximal offset within the list of hits that will be subject to any proximity boost reordering/filtering. Hits above this range in the original result set will not be subject to proximity boosting. |
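The way the factor and maxoffset parameters combine can be sketched as follows. This is an illustration under stated assumptions, not the QRserver's code: the names boosted_rank and reorder_hits are hypothetical, the per-field metrics are assumed to be already scaled by juniper.proximity.factor, the fields are assumed to be combined by summation, a higher rank is assumed to be better, and maxoffset is assumed to mean "the first maxoffset hits".

```python
from typing import List, Tuple

# (document id, Fastserver rank, per-field proximity metrics)
Hit = Tuple[str, float, List[float]]


def boosted_rank(rank: float, field_metrics: List[float],
                 global_factor: float = 0.5) -> float:
    """Add the globally scaled sum of the per-field metrics to the rank.

    Assumption: the combined boost is the plain sum of the field metrics.
    """
    return rank + global_factor * sum(field_metrics)


def reorder_hits(hits: List[Hit], global_factor: float = 0.5,
                 maxoffset: int = 100) -> List[Hit]:
    """Reorder the first `maxoffset` hits by boosted rank; keep the rest as-is."""
    head = sorted(
        hits[:maxoffset],
        key=lambda h: boosted_rank(h[1], h[2], global_factor),
        reverse=True,
    )
    return head + hits[maxoffset:]
```

Under these assumptions, a hit with rank 90 and a proximity metric of 40 overtakes a hit with rank 100 and a metric of 0 when the factor is 0.5, but only if both hits lie within the maxoffset range.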
juniper.dynsum.highlight_on | \02 |
juniper.dynsum.highlight_off | \03 |
juniper.dynsum.continuation | \1E |
The actual frontend markup configuration then takes place by setting variables such as
tvm.dynsum.html.highlight_on |
tvm.dynsum.xml.highlight_off |
in the relevant rc file.