Juniper Configuration Documentation

Note: This document describes in detail the functionality of Juniper v.2.1.0. It has gradually become more of an internal document, used for instance for detailed tuning by Professional Service. A higher-level, less detailed user-level configuration document is also available.

Juniper implements a combined proximity ranking and dynamic teaser result processing module. This module is intended to be used on demand by different Fast software modules. Currently, the only module that makes use of Juniper is the Fast Server module, in which Juniper is an integrated part of fsearch (the search engine executable that runs on each search node in the system).

Juniper simple description of functionality/implementation

The document body is stripped of markup during document processing and stored as an extended document summary field. The maximum amount of the document that gets stored is configurable as of Fast Server v.4.17 (see the Fast Server configuration documentation for details). For each document on the result page, this document extract is retrieved and fed through Juniper, which performs the following steps:

1. Scan the stripped document text (docsum) for matches of the query, create a data structure containing information about those matches, and provide a quality measure (rank boost value) that can be used as a metric for the quality of the document with respect to proximity and position of the search string in the document. Among other things, the data structure contains a list of matches of the query ordered by quality (see below for the quality measure). The document quality measure is computed from the quality measure of the best of the individual matches and the total number of hits within the document.
2. Generate a dynamic teaser based on the data structure generated in step 1. The dynamic teaser is composed of a number of text segments that include the "best" matches of the query in that document. The teaser is presented with the query words highlighted. The definition of highlight is configurable. If the document is short enough to fit completely into the configured teaser length, it will be provided as is, but with highlighting of the relevant keywords.

Step 2 is only necessary if the teaser is going to be displayed, which might be decided on the basis of the quality measure provided in step 1.

Quality measure

The text segments matching the query are ranked by (in decreasing order of significance):

Completeness * keyword weight - higher ranking if more search words are present in the same context and, for matches with an equal number of words, relatively higher weight on matches that contain "important" terms than on matches with stop words.
Proximity - query terms occurring near each other are better
Position - earlier in the document is better
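
The ordering above can be illustrated with a small sketch. The Match type and its field names are hypothetical and chosen for illustration only; the sketch assumes each criterion has already been reduced to a single number and does not show how Juniper computes those numbers.

```python
from dataclasses import dataclass


@dataclass
class Match:
    # Hypothetical fields, not Juniper's internal data structure.
    completeness: float  # completeness * keyword weight (higher is better)
    proximity: int       # spread of the query terms in words (lower is better)
    position: int        # offset of the match in the document (lower is better)


def rank_matches(matches):
    """Order matches best-first: completeness, then proximity, then position."""
    return sorted(matches, key=lambda m: (-m.completeness, m.proximity, m.position))


matches = [
    Match(completeness=1.0, proximity=12, position=900),
    Match(completeness=1.0, proximity=3, position=1500),
    Match(completeness=0.5, proximity=1, position=10),
    Match(completeness=1.0, proximity=3, position=200),
]
ranked = rank_matches(matches)
```

A tuple sort key gives the "decreasing order of significance" behaviour directly: a later criterion only matters when all earlier ones are equal.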

The number of matches selected is based on the text segment lengths including a configurable amount of surrounding text, the number of matching segments to use (configurable) and the required total summary size (configurable). The final set of matches is returned with markup for the hits and the abbreviated sections (continuation symbol).

The query used for teaser generation has undergone proper name recognition and English spell checking. Highlighting is done on individual terms of the query. In particular, phrases are broken down into individual terms, but the preference for proximal terms will maintain the phrasing in the generated teaser.

Lemmatization by expanding documents with word inflections cannot be used by Juniper. In the future, Juniper would expand the query based on the original query and language information. This functionality is not available yet, so lemmatized terms will in general not be highlighted by Juniper.

Currently Juniper uses an alternative, simple brute force stemming algorithm that basically allows a query keyword to match as a prefix of a document word, provided the document word is no more than P (configurable) bytes longer than the query keyword. This algorithm works well for keywords of a certain size, but not for very short keywords. Thus an additional configuration variable defines a lower bound on the length of keywords that are subject to this algorithm. With this simple algorithm in place, a typical weak form singular to plural mapping will get highlighted, while the opposite, going from a longer form to a shorter one, will not work as might be expected. That is, the algorithm does not change the keywords themselves. Consequently, the shorter forms of the keywords are more likely to give non-exact hits in the dynamic summary.
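
The prefix rule can be sketched as follows. This is a minimal illustration, not Juniper's implementation: the function name stem_match is hypothetical, the defaults are taken from the juniper.stem.min_length and juniper.stem.max_extend parameters described later in this document, and the comparison is assumed to be a case-sensitive comparison of UTF-8 bytes.

```python
def stem_match(keyword: str, word: str, min_length: int = 5, max_extend: int = 3) -> bool:
    """Return True if the document word counts as a hit for the query keyword."""
    kw = keyword.encode("utf-8")
    w = word.encode("utf-8")
    if kw == w:
        return True  # exact matches always count
    if len(kw) <= min_length:
        return False  # short keywords only yield exact matches
    # The keyword must be a prefix of the word, and the word may be at most
    # max_extend bytes longer than the keyword.
    return w.startswith(kw) and len(w) - len(kw) <= max_extend
```

With the defaults, stem_match("teacher", "teachers") holds while stem_match("teachers", "teacher") does not, which is the singular-to-plural asymmetry described above.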

Fast Search configuration of Juniper

Enabling Juniper functionality within Fast Search is done on a per field basis by means of override specifications in summary.map. Currently the following override specs are supported by Juniper:

override <outputfield> dynamicteaser <inputfield>

override <outputfield> dynamicteasermetric <inputfield>

override <outputfield> juniperlog <inputfield>

Details of the override directive can be found in Fast Search 4.13 - Dynamic Docsum Generation Framework. The dynamicteasermetric field provides a ranking of the document based on a metric corresponding to the one used to select between individual matches for dynamic teaser generation inside a document. See the section on using Juniper for proximity boosting below. The juniperlog field is new as of Juniper 2.0 and is used to retrieve the information generated by Juniper by means of the log query option; see the runtime option table below. Note that when integrated into Data Search 3.1 and later, this part of the configuration will be generated via:

The index profile
The index configuration (indexConfig.xml) by the config server.

Configuration levels

When integrated into Fast Search, Juniper receives its default parameters from global settings in the fsearchrc config file. These configuration parameter settings must be preconfigured at fsearch process startup time. Two levels of system configuration are currently supported:

System default configuration: These are the configuration settings exemplified by the parameter descriptions below.
Per field configuration: By using the field name instead of the string "juniper" as prefix, the default setting can be overridden on a result field basis. For instance, setting
```
myfield.dynsum.length 512
```
(see below) would allow the myfield result field to receive a different teaser length.
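
The two configuration levels amount to a lookup with a fallback. The sketch below is illustrative only: the lookup function and the dictionary representation of the settings are assumptions, and "body" is a hypothetical field name without an override.

```python
def lookup(config: dict, field: str, param: str):
    """Resolve a parameter for a result field, falling back to the juniper default."""
    per_field = f"{field}.{param}"
    if per_field in config:
        return config[per_field]
    return config[f"juniper.{param}"]


# Settings as they might be read from fsearchrc (values from this document).
config = {
    "juniper.dynsum.length": 256,
    "juniper.dynsum.max_matches": 4,
    "myfield.dynsum.length": 512,
}
```

Here lookup(config, "myfield", "dynsum.length") yields 512, while any field without its own setting gets the system default of 256.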

In addition, Juniper 2.x supports changing a certain subset of the parameters on a per query basis. See the separate section on this below.

Performance note: Per field configuration should be used with care, since overriding some parameters may cause significant computational overhead: Juniper would have to scan the whole text multiple times. Changing the dynsum group of parameters is generally cheap (only the teaser generation phase has to be repeated), while changing any of the stem or matcher parameters requires a separate text scan for each combination of parameters.

Arbitrary byte sequences in markup parameters

To allow arbitrary byte sequences (such as low ASCII values) to be used to denote highlight on/off and continuation symbol(s), Juniper now accepts strings of the form \xNN where each N is a hex digit [0-9a-fA-F]. This is converted into the byte value NN. Note that Juniper exports UTF-8 text, so the resulting sequence should be a valid UTF-8 byte sequence. Juniper performs no checks on the validity of such strings in the fsearch domain. As a consequence of this escape syntax, literal occurrences of backslash must themselves be escaped (\\).

Blanks in text parameters must be escaped

Note that fsearchrc does not accept blanks in the parameters. To allow more complicated highlight markup, the sequence \x20 must be used as space in text fields.
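
The escape rules from the two sections above can be sketched as a small decoder. This is an illustration of the documented syntax, not Juniper's parser; the function name decode_param is hypothetical, and a backslash that starts neither \xNN nor \\ is assumed to be left untouched.

```python
import re

# A backslash followed by either xNN (two hex digits) or a second backslash.
_ESCAPE = re.compile(rb"\\(x[0-9a-fA-F]{2}|\\)")


def decode_param(raw: str) -> bytes:
    """Decode hex and backslash escapes in a parameter value into raw bytes."""
    def repl(match):
        token = match.group(1)
        if token == b"\\":
            return b"\\"  # escaped backslash
        return bytes([int(token[1:], 16)])  # xNN -> single byte

    return _ESCAPE.sub(repl, raw.encode("utf-8"))
```

For example, the blank-free value <span\x20class=hit> decodes to the bytes of <span class=hit>.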

Escaping markup in the summary text

Since Juniper may supply markup through the use of the highlight and continuation parameters, problems may occur if the analysed text itself contains markup. To avoid this, Juniper may be configured to escape the 5 XML/HTML markup symbols ("&<'>) before adding the mentioned parametrized symbols. See the description of the juniper.dynsum.escape_markup parameter below.
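
A sketch of the escaping decision, including the auto rule given in the parameter table below (escape if any of highlight_on, highlight_off or continuation starts with <). The function names are hypothetical. Python's html.escape is used as a stand-in because it covers the same five symbols; the exact entity strings Juniper emits are not specified here and may differ.

```python
import html


def should_escape(mode: str, highlight_on: str, highlight_off: str, continuation: str) -> bool:
    """Decide whether markup in the summary text is escaped for a given mode."""
    if mode == "on":
        return True
    if mode == "off":
        return False
    # "auto": escape when any of the markup parameters starts with "<".
    return any(s.startswith("<") for s in (highlight_on, highlight_off, continuation))


def prepare_text(text, mode="auto", highlight_on="<b>", highlight_off="</b>", continuation="..."):
    """Escape the five markup symbols in the text before highlight markup is added."""
    if should_escape(mode, highlight_on, highlight_off, continuation):
        return html.escape(text, quote=True)
    return text
```

With the default <b> and </b> markers, auto escapes the text; with control-byte markers it leaves the text as is.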

The following variables are available for static, global configuration for a particular search node:

Parameter name	Default value	Description
juniper.dynsum.highlight_on	<b>	A string to be included before each hit in the generated summary.
juniper.dynsum.highlight_off	</b>	A string to be included after each hit in the generated summary.
juniper.dynsum.continuation	...	A string to be included to denote abbreviated/left out pieces of the original text in the generated summary.
juniper.dynsum.separators	\x1D\x1F	A string containing characters that are added for word separation purposes (e.g. for CJK languages and for German/Norwegian etc. word separation). This list should contain non-word characters only for this to be meaningful. Also, currently only single byte characters are supported. These characters will be removed from the generated teaser by Juniper.
juniper.dynsum.connectors	-'	A string containing characters that may connect two word tokens to form a single word. Words connected by a single such character will not be split by Juniper when generating the teaser.
juniper.dynsum.escape_markup	auto	See description above. Accepted values: on, off or auto. If auto is used, Juniper will escape markup in the generated summary if any of the symbols highlight_on, highlight_off or continuation contains a < as the first character.
juniper.dynsum.length	256	Length of the generated summary in bytes. This is a hint to Juniper. The result may be slightly longer or shorter depending on the structure of the available document text and the submitted query.
juniper.dynsum.max_matches	4	The number of (possibly partial) sets of keywords matching the query to attempt to include in the summary. The larger this value is set relative to the length parameter, the denser the keywords may appear in the summary.
juniper.dynsum.min_length	128	Minimal desired length of the generated summary in bytes. This is the shortest summary length for which the number of matches will be respected. E.g. if a summary appears to become shorter than min_length bytes with max_matches matches, then additional matches will be used if available.
juniper.dynsum.surround_max	80	The maximal number of bytes of context to prepend and append to each of the selected query keyword hits. This parameter defines the maximal size a summary would become if there are few keyword hits (max_matches set low or the document contained few matches of the keywords).
juniper.stem.min_length	5	The minimal number of bytes in a query keyword for it to be subject to the simple Juniper stemming algorithm. Keywords that are shorter than or equal to this limit will only yield exact matches in the dynamic summaries.
juniper.stem.max_extend	3	The maximal number of bytes that a word in the document can be longer than the keyword itself to yield a match. E.g. for the default values, if the keyword is 7 bytes long, it will match any word with length less than or equal to 10 for which the keyword is a prefix.
juniper.matcher.winsize	400	The size of the sliding window used to determine if multiple query terms occur together. The larger the value, the more likely the system will find (and present in the dynamic summary) complete matches containing all the search terms. The downside is a potential performance overhead of keeping candidates for matches longer during matching, and consequently updating more candidates that eventually get thrown away.
juniper.proximity.factor	0.25	A factor to multiply the internal Juniper metric with when producing the proximity metric for a given field. A real (floating point) value is accepted. Note that the QRserver (see below) also supports a factor that is global to all proximity metric fields, and that is applied in addition.

Alternate behaviour of Juniper on a per query basis

As of Juniper v.2.x and Fastserver v.4.17, and QRserver for Data Search 3.2, Juniper supports a number of Juniper specific options that can be provided as part of the URL. The format of the option string is

  juniper=<param_name>.<value>[_<param_name>.<value>]*

As an example, consider the following URL addition:

  juniper=near.2_dynlength.512_dynmatches.8

If this string is present in the URL, Juniper generates teasers that are up to twice as long and contain up to twice as many matches of the query as with the default values. In addition, teasers (and the proximity metric) will only be generated for those documents that fulfil the extra constraint that there exists at least one complete match of the query where the distance in words between the first and the last word of the query match is no more than 2 (+ the number of words in the query).
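
The option string format can be parsed with a few lines of code. This is a sketch of the documented syntax only; the function name is hypothetical, values are kept as strings, and it assumes that parameter names contain neither "_" nor "." (true for the options listed in the table below).

```python
def parse_juniper_options(value: str) -> dict:
    """Parse the value of the juniper URL option into a name -> value mapping."""
    options = {}
    for item in value.split("_"):
        if not item:
            continue  # tolerate empty segments
        name, sep, val = item.partition(".")
        if not sep:
            raise ValueError(f"missing '.' in option {item!r}")
        options[name] = val
    return options


example = parse_juniper_options("near.2_dynlength.512_dynmatches.8")
```

For the example URL addition above, this yields near = 2, dynlength = 512 and dynmatches = 8.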

Supported per query options in Juniper 2.1.0

Parameter name	Corresponding config name, see above	Description
dynlength	juniper.dynsum.length	The desired max length of the generated teaser
dynmatches	juniper.dynsum.max_matches	The number of matches to try to fit in the teaser
dynsurmax	juniper.dynsum.surround_max	The maximal amount of surrounding context per keyword hit
near	N/A	Specifies a proximity search where keywords should occur closer than the specified value in number of words not counting the query terms themselves.
stemext	juniper.stem.max_extend	The maximal number of bytes that a word in the document can be longer than the keyword itself to yield a match.
stemmin	juniper.stem.min_length	The minimal number of bytes in a query keyword for it to be subject to the simple Juniper stemming algorithm.
within	N/A	Same as near with the additional constraint that matches of the query must have the same order of the query words as the original query.
winsize	juniper.matcher.winsize	The size of the sliding window used to determine if multiple query terms occur together.
log	N/A	Internal debug option (privileged port only). The value is a bitmap that selectively enables log output to be generated by Juniper for output into a summary field configured with the juniperlog override. Useful only with a special template that makes use of this information. Currently the only supported bit is 0x8000, which provides an HTML table with up to 20 of the topmost matches of each document, and their identified proximity (distance) and rank.
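
The near and within constraints can be illustrated as a check on the word positions of one complete match. This is a sketch under stated assumptions, not Juniper's matcher: the function name is hypothetical, and the bound follows the wording of the example above (distance between first and last word no more than the near value plus the number of query words). The exact boundary used by Juniper may differ by one.

```python
def satisfies(positions, limit, ordered=False):
    """Check one complete match against a near (or, if ordered, within) limit.

    positions holds the word index in the document of each query term,
    listed in the order the terms appear in the query.
    """
    span = max(positions) - min(positions)
    if span > limit + len(positions):
        return False
    if ordered:
        # within: the terms must occur in the same order as in the query
        return all(a < b for a, b in zip(positions, positions[1:]))
    return True
```

For a three word query matched at word positions 12, 10 and 11, satisfies([12, 10, 11], 2) holds, but the same match fails with ordered=True because the terms are not in query order.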

Juniper debug template in Data Search

Template support for the log parameter, as well as for extracting the whole Juniper input document text, is provided by Data Search 3.2 by means of the jsearch page on the qrserver port. Replace asearch with jsearch in the URL of a qrserver privileged port search.

Using Juniper for proximity boosting with the QRserver

In order to use Juniper to boost hits that have good proximity of the query (or to filter the hits based on NEAR or WITHIN constraints) the QRserver would need to be provided with the following URL addition:

rpf_proximityboost:enabled=1

Note that Juniper will return 0 as proximity metric (dynamicteasermetric) if the query with juniper option constraints cannot be satisfied by the information in the configured input field. Thus if the selection of a hit is done solely on the basis of information not present in the Juniper input (such as the title in the default configuration) proximity boosting may demote such hits. A solution for this problem has been proposed for future versions of Juniper.

Supported QRserver options to use with proximity boosting via Juniper

QRserver behaviour with respect to proximity boosting can be set either in configuration (at QRserver startup) or on a per query basis. The table below lists some of the default configuration settings together with their corresponding runtime setting, if any. Consult the QRserver documentation for the complete list of options.

Config parameter name	Corresponding runtime (URL) syntax	Description
rp.proximityboost.enabled=1	N/A	Configure for proximity boosting in the QRserver (not necessarily enable it).
#rp.proximityboost.default	rpf_proximityboost:enabled=1	Enable proximity boosting.
rp.proximityboost.factor=0.5	N/A	The value by which the combined proximity boost value (calculated from possibly multiple fields, each scaled by its individual factor) is multiplied before it is added to the Fastserver rank value used to reorder hits.
rp.proximityboost.hits=100	rpf_proximityboost:hits=100	The number of Fastserver hits to retrieve as basis for the reordering.
rp.proximityboost.maxoffset=100	N/A	The maximal offset within the list of hits that will be subject to any proximity boost reordering/filtering. Hits beyond this range in the original result set will not be subject to proximity boosting.
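
How the factors combine can be sketched as below. This is an illustration under assumptions, not the QRserver implementation: the rerank function and the hit representation are hypothetical, field metrics are assumed to be combined by summation, a higher rank is assumed to be better, and filtering of hits with a zero metric is not shown.

```python
def rerank(hits, field_factors, qr_factor=0.5, maxoffset=100):
    """Reorder the first maxoffset hits by rank plus proximity boost.

    hits is a list of (doc_id, rank, {field: internal_metric}) tuples in
    original result order. field_factors maps each field to its
    juniper.proximity.factor; qr_factor plays the role of
    rp.proximityboost.factor.
    """
    head, tail = hits[:maxoffset], hits[maxoffset:]

    def boosted(hit):
        _, rank, metrics = hit
        combined = sum(field_factors[f] * m for f, m in metrics.items())
        return rank + qr_factor * combined

    # sorted() is stable, so hits with equal boosted rank keep their order.
    return sorted(head, key=boosted, reverse=True) + tail


hits = [
    ("a", 1000.0, {"body": 0.0}),
    ("b", 990.0, {"body": 200.0}),
    ("c", 980.0, {"body": 40.0}),
]
order = [h[0] for h in rerank(hits, {"body": 0.25})]
```

With a field factor of 0.25 and a QRserver factor of 0.5, hit b gains 25 rank points and moves ahead of hit a, whose metric is 0.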

Configuring Juniper within Data Search

Except where explicitly noted, configuring Juniper for Data Search is similar to configuring for Real-Time Search. As of Data Search 3.0 Juniper is by default configured and enabled in Data Search.

Configuring Juniper within Fast Real-Time Search

Juniper is provided as part of Real-Time Search (through Fast Search) starting with version 2.4. To enable the Fast Search integrated Juniper in a Real-Time Search environment, see the documentation extensions to Real-Time Search 2.4. Note that to configure Juniper within Real-Time Search, the configuration variables should be put in the etc/fsearch.addon* file(s), which are used as input when Real-Time Search generates fsearchrc files for all configured search engines. Also, a proper summary.map file is needed to enable the dynamic summaries on particular fields.

Configuring Juniper for Fast Search v.4.15 and higher

Newer versions of Fast Search provide template support to allow different Juniper markup depending on the type of display desired (plain, html or xml). All HTTP frontends that need an interpretation of the highlight information provided by Juniper should have the following setup for Juniper:

juniper.dynsum.highlight_on \02

juniper.dynsum.highlight_off \03

juniper.dynsum.continuation \1E

The actual frontend markup configuration then takes place by setting variables such as

tvm.dynsum.html.highlight_on

tvm.dynsum.xml.highlight_off

in the relevant rc file.
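
The frontend step can be pictured as a simple substitution of the control bytes with display-specific markup. The sketch below is illustrative only and not the template engine: it assumes the three settings above denote the bytes 0x02, 0x03 and 0x1E, and the markup values are hypothetical stand-ins for what variables such as tvm.dynsum.html.highlight_on would hold.

```python
# Hypothetical markup values per display type.
HTML_MARKUP = {"\x02": "<b>", "\x03": "</b>", "\x1e": "..."}
PLAIN_MARKUP = {"\x02": "", "\x03": "", "\x1e": "..."}


def render(teaser: str, markup: dict) -> str:
    """Replace Juniper's control characters with display-specific markup."""
    return teaser.translate(str.maketrans(markup))


raw = "the \x02quick\x03 fox\x1ejumps"
as_html = render(raw, HTML_MARKUP)
as_plain = render(raw, PLAIN_MARKUP)
```

Keeping neutral control bytes in the Juniper output lets each frontend choose its own markup without reconfiguring fsearch.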

How to report bugs/errors related to Juniper

Errors/problems related to Juniper can be divided into two categories:

1. Problems with specific documents/teasers
2. System errors/instability etc.

Due to the complexity of a full, running system, it is much easier for all parties if the particular query/document pair triggering the problem can be identified and analysed off-line.

Problems with specific teasers

Problems of this category are likely to occur because there are so many combinations of queries and documents that it is not possible to test for all cases. To be able to analyse such problems, it is vital that the exact (byte-by-byte) teaser generation source (the document summary input to Juniper) can be made available together with the exact query as presented to Juniper. This requires the following information:

The teaser source docsum. The name of the docsum field depends on the configuration in summary.map. The data should be provided without any post processing performed, if possible, to avoid missing problems related to bad input data such as malformed UTF-8 characters.
The original query as submitted by the user
The expanded query (available under var/log/querylogs/)
A corresponding fsearchrc (pure fastserver4) or fsearch.addon (Real-Time Search/DS 3.x) file used by the fsearch process that performed the task.

System errors/instability problems

Problems of category 2 should, if they occur at all, only be associated with development/beta releases. If such an unfortunate event should happen, the following information, in addition to the information listed for category 1, would be useful to pin down the problem:

Core file of fsearch accompanied by the associated fsearch binary.
Log files from the crashed process (in Data Search these will be present as var/fsearch-*.log and var/log/stdout.log).
