diff options
author | Jon Bratseth <bratseth@yahoo-inc.com> | 2016-06-15 23:09:44 +0200 |
---|---|---|
committer | Jon Bratseth <bratseth@yahoo-inc.com> | 2016-06-15 23:09:44 +0200 |
commit | 72231250ed81e10d66bfe70701e64fa5fe50f712 (patch) | |
tree | 2728bba1131a6f6e5bdf95afec7d7ff9358dac50 /document/doc |
Publish
Diffstat (limited to 'document/doc')
-rw-r--r-- | document/doc/.gitignore | 4 | ||||
-rw-r--r-- | document/doc/document-format.html | 564 |
2 files changed, 568 insertions, 0 deletions
diff --git a/document/doc/.gitignore b/document/doc/.gitignore new file mode 100644 index 00000000000..72ec51b65a0 --- /dev/null +++ b/document/doc/.gitignore @@ -0,0 +1,4 @@ +document-api.html +html +javadoc +latex diff --git a/document/doc/document-format.html b/document/doc/document-format.html new file mode 100644 index 00000000000..214447d6d51 --- /dev/null +++ b/document/doc/document-format.html @@ -0,0 +1,564 @@ +<!-- Copyright 2016 Yahoo Inc. Licensed under the terms of the Apache 2.0 license. See LICENSE in the project root. --> +<html> +<head> +<title>Developers guide to the serialized document format</title> +</head> +<body> +<h1>Developers guide to the serialized document format</h1> + +<p>When a Vespa document is stored or transferred from one application to +another, it is serialized. The serialization format tries to achieve +serialization robustness and speed. The most important fields are kept in a +header that is accessible at low cost. The other fields are located by table +look-ups.</p> + +<h2>Purpose</h2> + +<p>The purpose of the serialized format is +<ul> +<li><b>Robustness</b>. The format shall detect errors gracefully.</li> +<li><b>Speed</b>. Deserialization shall be fast, especially for basic fields like <b>DocumentId</b>.</li> +<li><b>Size</b>. The serialized format shall be compact and allow for efficient storage and transfer. +That is partly achieved by allowing different kinds of compression. As of now <b>lz4</b> are supported.</li> +</ul> +</p> + +<p><strong>All fields are in network byte order.</strong></p> + +<h2>Changelog</h2> + +<h3>Current version: 8</h3> + +<ul> +<li>CRC removed from document format. There used to be a 4 byte CRC in the end +of a header or header + body serialization, calculated as a crc32 of all the +other data in the serialization. This CRC was included in the document length. +</ul> + + +<h3>Version: 7</h3> + +<ul> +<li>The document length is now a static sized 4 byte value, instead of a variable 2,4,8 byte value. +<li>(Anything else? I wrote this changelog when bopping from 7 to 8. Dunno if more was changed in 7.) +</ul> + +<h3>Version: 6</h3> + +This is the oldest version that we currently support. No known installation stores documents with a version smaller than this. + +<h2>Document Format</h2> + +<p>This is the description of the serialized document format.</p> + +<table border="1" cellspacing="0" cellpadding="1%" width="100%"> +<caption><em>Document serialization format</em></caption> +<tr> +<td width="10%"><b>Field</td> +<td width="10%"><b>Type</td> +<td width="10%"><b>Length</td> +<td><b>Description</td> +</tr> +<tr><td>Version</td> +<td>Short integer</td> +<td>2</td> +<td>Version number. Current is 6.</td> +</tr> +<tr><td>Length</td> +<td>Integer</td> +<td>4 bytes</td> +<td>Total length of object (excluding this field and version).</td> +</tr> +<tr><td>Document ID</td> +<td>Bytes</td> +<td> </td> +<td>Unique ID for document. 0-terminated string, UTF-8 encoding.</td> +</tr> +<tr><td>Field Map</td> +<td>Bytes</td> +<td>See below</td><td>Placeholder for fields. (Note: Fieldmaps may contain other fieldmaps)</td> +</tr> +</table> + +<p>Field maps are serialized like this</p></p><p> + +<table border="1" cellspacing="0" cellpadding="1%" width="100%"> +<caption><em>Fieldmap serialization format</em></caption> +<tr> +<td width="10%"><b>Field</td> +<td width="10%"><b>Type</td> +<td width="10%"><b>Length (bytes)</td> +<td><b>Description</td> +</tr> +<tr><td>Inventory bit mask</td> +<td>Byte</td> +<td>1</td> +<td> +Inventory bits describing the FieldMap element with data:<br> + Bit 0 set: FieldMap has document type <br> + Bit 1 set: FieldMap has header fields <br> + Bit 2 set: FieldMap has body fields <br> + Bit 3 set: FieldMap has external body fields<br> +</tr> +<tr><td colspan = "4"><b>Below section is present when bit 0 of inventory is set</b></td></tr> +<tr><td>Document Type</td> +<td>Bytes</td> +<td> </td> +<td>Document type. (0-terminated string, UTF-8 encoding.)</td> +</tr> +<tr><td>Version</td> +<td>Short integer</td> +<td>2</td> +<td>Document type version number.</td></tr> +<tr><td colspan = "4"><b>Below section is present when bit 1 of inventory is set</b></td></tr> +<tr><td>Header data</td> +<td>Data array</td> +<td>See below</td> +<td>Header data packed in data array</td></tr> +<tr><td colspan = "4"><b>Below section is present when bit 2 of inventory is set</b></td></tr> +<tr><td>Body data</td> +<td>Data array</td> +<td>See below</td> +<td>Body data packed in data array</td></tr> +</table> + + +<p> +<table border="1" cellspacing="0" cellpadding="1%" width="100%"> +<caption><em>Data array serialization</em></caption> +<tr> +<td width="10%"><b>Field</td> +<td width="10%"><b>Type</td> +<td width="10%"><b>Length (bytes)</td> +<td><b>Description</b></td> +</tr> +<tr><td>Data length</td> +<td>Integer_2_4_8</td> +<td>2, 4 or 8</td> +<td>Length of data block (see below). NOTE THAT THIS LENGTH INCLUDE ITSELF.</td> +</tr> +<tr><td>Compression</td> +<td>Byte</td> +<td>1</td> +<td>Compression method +<br> + 0: No compression<br> + 5: Uncompressable<br> + 6: lz4 <br> +<p>Note that the uncompressable flag is not a configurable option. Rather it +will be used in document instances who are configured for compression, but +where compression yields negative results, to avoid later serializations to +retry compression.</p> +</td> +<tr><td>Number of fields<td>Integer_1_4</td> +<td>1 or 4</td> +<td>Number of fields in data array</td> +<tr><td colspan = "4"><b>Below item is present if compression method is not uncompressed or uncompressable</b></td></tr> +<tr><td>Uncompressed data length<td>Integer_2_4_8</td> +<td>2, 4 or 8</td> +<td>Length of data block after decompression</td> +<tr><td colspan = "4"><b>Below block is repeated "Number of fields" times</b></td></tr> +<tr><td>Field ID<td>Integer_1_4</td> +<td>1 or 4</td> +<td>ID of field.</td> +<tr><td>Field Size<td>Integer_1_2_4</td> +<td>1, 2 or 4</td> +<td>Length of field.</td> +</td> +<tr><td colspan = "4"><b>End of repeated block </b></td></tr> +<tr><td>Data block<td>Bytes</td> +<td> </td> +<td>The data block.<br> + - Items are ordered the same way field array is sorted.<br> + - Use lengths from field array above to find item offset and length.<br> + - If the block is compressed, lengths refer to decompressed version</td> +</table> + + +<table border="1" cellspacing="0" cellpadding="1%" width="100%"> +<caption><em>Data type serialization</em></caption> +<tr> +<td width="15%"><b>Data type</td> +<td width="10%"><b>Length</td> +<td><b>Serialization</td> +</tr> +<tr><td>Integer (ID 0)</td> +<td>4</td> +<td>Signed integer, two's complement notation, network byte order.</td> +</tr> +<tr><td>Floating point number (ID 1)</td> +<td>4</td> +<td>IEEE 754, single precision, network byte order.</td> +</tr> +<tr><td>String (ID 2)</td> +<td>1 + (1 or 4) + length + 1</td> +<td>Strings are serialization format:<br /> + - First byte represents coding. This has traditionally denoted the maximum number of bits + per character in the UTF-8 encoded string, but has never been used in deserialization code. +<ul> + <li>Set to 32 if not used.</li> + <li>Set to <32 if you know the UTF-8 string uses less bits per character; e.g. ASCII could use 8.</li> + <li>Set bit 6 (decimal 64) if the string has an <a href="#annotations">annotation tree</a>.</li> +</ul> +<br /> + - Integer_1_4 with length of string. <br /> + - The string (UTF-8 encoding), including 0-terminating byte.<br> + - An annotation tree, if bit 6 (decimal 64) of coding byte is set: +<ul> + <li>total length of all span trees excl. itself: <strong>uint32</strong></li> + <li>number of span trees <strong>int_1_2_4</strong></li> + <li>for each root node: + <ol> + <li>tree name serialized as String</li> + <li>serialized SpanNode as given below, see <a href="#annotations">annotation serialization</a></li> + </ol> +</ul> +</td> +</tr> +<tr><td>Raw bytes (ID 3)</td> +<td>Length of buffer</td> +<td>Byte for byte copy</td> +</tr> +<tr><td>Long integer (ID 4)</td> +<td>8</td> +<td>Signed integer, two's complement notation, network byte order.</td> +</tr> +<tr><td>Double floating point number (ID 5)</td> +<td>8</td> +<td>IEEE 754, double precision, network byte order.</td> +</tr> +<tr><td>Array (ID 6)</td> +<td>At least 8 bytes</td> +<td>Arrays of any fields are serialized like this:<br> + - 4 bytes: Data type array consists of <br> + - 4 bytes: Number of elements in array<br> + Below sequence is repeated "number of element" times<br> + - 4 bytes: Length of element<br> + - Serialized element<br> +</td></tr> +<tr><td>Fieldmap (ID 7)</td> +<td> </td> +<td>Field maps (embedded or not) are defined above</td> +</tr> +<tr><td>Document (ID 8)</td> +<td> </td> +<td>Document objects (embedded or not) are defined above</td> +</tr> +<tr><td>Timestamp (ID 9)</td> +<td> </td> +<td>Same as long integer</td> +</tr> +<tr><td>Uri (ID 10)</td> +<td> </td> +<td>Same as string</td> +</tr> +<tr><td>Exact string (ID 11)</td> +<td> </td> +<td>Same as string</td> +</tr> +<tr><td>Content (ID 12)</td> +<td>At least 11 bytes</td> +<td>Content fields are serialized like this:<br> + - Content type length (1 byte)<br> + - Content type (0 terminated string, UTF-8 encoding.)<br> + - Content encoding length (1 byte)<br> + - Content encoding (0 terminated string, UTF-8 encoding.)<br> + - Content language length (1 byte)<br> + - Content language (0 terminated string, UTF-8 encoding.)<br> + - Content length (Integer, 4 bytes)<br> + - Content (including 0-terminating char)<br> +</td> +</tr> +<tr><td>Content meta (ID 13)</td> +<td>At least 12 bytes</td> +<td>Content (attachment) meta data are serialized like this:<br> + - Attachment size (Integer, 4 bytes)<br> + - Attachment name (0 terminated string, UTF-8 encoding.)<br> + - Attachment encoding (0 terminated string, UTF-8 encoding.)<br> + - Attachment content type (0 terminated string, UTF-8 encoding.)<br> + - Attachment part (0 terminated string, UTF-8 encoding.)<br> + - Attachment flag (Integer, 4 bytes)<br> +</td> +</tr> +<tr><td>Term boost (ID 15)</td> +<td> </td> +<td>Same as string</td> +</tr> +<tr><td>Byte (ID 16)</td> +<td>1</td> +<td>One single byte</td> +</tr> +<tr><td>Set (ID 17)</td> +<td>At least 8 bytes</td> +<td>Set of any fields are serialized like this:<br> + - Integer (4 bytes): Data type set is made up of<br> + - Integer (4 bytes): Number of elements in set<br> + Below sequence is repeated "number of element" times<br> + - Serialized element<br> +</td></tr> +</table> + + +<a name="annotations"><table border="1" cellspacing="0" cellpadding="1%" width="100%"> +<caption><em>Annotation tree serialization</em></caption> +<tr> +<td width="15%"><b>Data type</td> +<td width="10%"><b>Length</td> +<td><b>Serialization</td> +</tr> +<tr> +<td>SpanNode (base class)</td> +<td>1 + (1, 2 or 4) + Annotation serialization + subclass payload</td> +<td> + <ul> +<li> type <strong>byte</strong> (1: Span, 2: SpanList, 4: AlternateSpanList) +</li> <li> number of annotations <strong>int_1_2_4</strong> +</li> <li> each annotation as given below +</li> <li> (remaining payload serialized as given below by subclasses Span, SpanList and AlternateSpanList) +</li></ul> +</td> +</tr> +<tr> +<td>Annotation</td> +<td>4 + (1, 2 or 4) + (possibly 4 + FieldValue serialization)</td> +<td> +<ul> +<li>MD5 name hash (4 LSBytes) <strong>uint32</strong> (NOTE: 0-127 reserved for internal Vespa usage.) +</li> <li> length <strong>int_1_2_4</strong> +</li> <li> the following fields are <em>only</em> present if length > 0: <ul> +<li> data type id <strong>uint32</strong> +</li> <li> NOTE: no sequence id, as we will rely on annotations being serialized/deserialized in particular order, so we don't need to write this explicitly +</li> <li> FieldValue as given by its own serialization +</li></ul> +</li></ul> +</td> +</tr> +<tr> +<td>Span</td> +<td>SpanNode serialization + (1, 2 or 4) + (1, 2 or 4)</td> +<td><ul> +<li>serialization from SpanNode base class</li> +<li> from index, as given by Java String (UTF-16) <strong>int_1_2_4</strong> +</li> <li> length, as given by Java String (UTF-16) <strong>int_1_2_4</strong> +</li></ul> +</td> +</tr> +<tr> +<td>SpanList</td> +<td>SpanNode serialization + (1, 2 or 4) + n times SpanNode serialization</td> +<td> +<ul> +<li>serialization from SpanNode base class</li> +<li> number of children <strong>int_1_2_4</strong> +</li> <li> each child node serialized as SpanNode (Span, SpanList, AlternateSpanList) +</li></ul> +</td> +</tr> +<tr> +<td>AlternateSpanList</td> +<td>SpanNode serialization + (1, 2 or 4) + n times (8 + SpanList serialization)</td> +<td><ul> +<li>serialization from SpanNode base class</li> +<li> number of child trees <strong>int_1_2_4</strong> +</li> <li> for each child tree: <ul> +<li> probability <strong>double</strong> +</li> <li> serialization as given by SpanList above +</li></ul> +</li></ul> +</td> +</tr> +<tr> +<td>AnnotationRef</td> +<td>1, 2 or 4</td> +<td>AnnotationRef serialization <ul> +<li> unique sequence id of annotation being referred to <strong>int1_2_4</strong> +</li></ul> +</td> +</tr> +</table> + +<table border="1" cellspacing="0" cellpadding="1%" width="100%"> +<caption><em>Data types used in serialized format</em></caption> +<tr> +<td width="15%"><b>Data type</td> +<td><b>Serialization</b></td> +</tr> +<tr><td>Integer_1_4</td> +<td>If bit 7 of first byte is unset, coded using 1 byte.<br /> + If bit 7 of first byte is set, coded using 4 bytes (bit 7 of first byte must be masked away).<br /> + <em>Range: 0 - 2**31-1.</em></td> +</tr> +<tr><td>Integer_1_2_4</td> +<td>If bit 7 of first byte is unset, coded using 1 byte.<br /> + If bit 7 of first byte is set and bit 6 of first byte is unset, coded using 2 bytes (bit 7 and 6 of first byte must be masked away).<br /> + If bit 7 and 6 of first byte are set, coded using 4 bytes (bit 7 and 6 of first byte must be masked away).<br /> + <em>Range: 0 - 2**30-1.</em></td> +</td> +</tr> +<tr><td>Integer_2_4_8</td> +<td>If bit 7 of first byte is unset, coded using 2 byte.<br /> + If bit 7 of first byte is set and bit 6 of first byte is unset, coded using 4 bytes (bit 7 and 6 of first byte must be masked away).<br /> + If bit 7 and 6 of first byte are set, coded using 8 bytes (bit 7 and 6 of first byte must be masked away).<br /> + <em>Range: 0 - 2**62-1.</em></td> +</td> +</tr> +</table> + + +<h2>Document Update Format</h2> + +<p>This is the description of the serialized document update format.</p> + +<table border="1" cellspacing="0" cellpadding="1%" width="100%"> +<caption><em>Document update serialization format</em></caption> +<tr> +<td width="10%"><b>Field</td> +<td width="10%"><b>Type</td> +<td width="10%"><b>Length</td> +<td><b>Description</td> +</tr> +<tr><td>Document ID</td> +<td>Bytes</td> +<td> </td> +<td>Unique ID for document. 0-terminated string, UTF-8 encoding.</td> +</tr> +<tr><td>Content byte</td> +<td>Byte</td> +<td>1 byte</td> +<td>Always set to 1</td> +</tr> +<tr><td>Document Type</td> +<td>Bytes</td> +<td> </td> +<td>Document type. (0-terminated string, UTF-8 encoding.)</td> +</tr> +<tr><td>Number of fields to update</td> +<td>Integer</td> +<td>4 bytes</td> +<td>The number of fields to update</td> +</tr> +<tr><td>Serialized field updates</td> +<td>Field Update</td> +<td> </td> +<td>The serialized field updates. See below.</td> +</tr> +</table> + +<table border="1" cellspacing="0" cellpadding="1%" width="100%"> +<caption><em>Document update serialization format</em></caption> +<tr> +<td width="10%"><b>Field</td> +<td width="10%"><b>Type</td> +<td width="10%"><b>Length</td> +<td><b>Description</td> +</tr> +<tr><td>Field Id</td> +<td>Integer</td> +<td>4 bytes</td> +<td>Field id within document type.</td> +</tr> +<tr><td>Number of value updates</td> +<td>Integer</td> +<td>4 bytes</td> +<td>Numer of value updates to this field.</td> +</tr> +<tr><td>Serialized field update values</td> +<td>Bytes</td> +<td> </td> +<td>The serialized field update values. See below.</td> +</tr> +</table> + +<table border="1" cellspacing="0" cellpadding="1%" width="100%"> +<caption><em>Document update value serialization format</em></caption> +<tr> +<td width="10%"><b>Field</td> +<td width="10%"><b>Type</td> +<td width="10%"><b>Length</td> +<td><b>Description</td> +</tr> +<tr><th colspan="4">Add Value Update</th></tr> +<tr><td>Add Value Update ID</td> +<td>Integer</td> +<td>4 bytes</td> +<td>25 + 0x1000 for value updates.</td> +</tr> +<tr><td>Field serialization</td> +<td>FieldValue</td> +<td> </td> +<td>Serialization of the field to add.</td> +</tr> +<tr><td>Weight</td> +<td>Integer</td> +<td>4 bytes</td> +<td>Weight. Used if update applies to weighted set.</td> +</tr> +<tr><th colspan="4">Arithmetic Update</th></tr> +<tr><td>Arithmetic Update ID</td> +<td>Integer</td> +<td>4 bytes</td> +<td>26 + 0x1000 for arithmetic updates.</td> +</tr> +<tr><td>Operator ID</td> +<td>Integer</td> +<td>4 bytes</td> +<td>Identifies whether this does add, subtract, multiply or divide.</td> +</tr> +<tr><td>Operand</td> +<td>Double</td> +<td>8 bytes</td> +<td>The right operand to use in the arithmetic operation.</td> +</tr> +<tr><th colspan="4">Assign Update</th></tr> +<tr><td>Assign Update ID</td> +<td>Integer</td> +<td>4 bytes</td> +<td>27 + 0x1000 for assign updates.</td> +</tr> +<tr><td>Content flag</td> +<td>Byte</td> +<td>1 bytes</td> +<td>Contains 1 if we have content, 0 if not.</td> +</tr> +<tr><td>Field serialization</td> +<td>FieldValue</td> +<td> </td> +<td>Serialization of the field to assign.</td> +</tr> +<tr><th colspan="4">Clear Update</th></tr> +<tr><td>Clear Update ID</td> +<td>Integer</td> +<td>4 bytes</td> +<td>28 + 0x1000 for clear updates.</td> +</tr> +<tr><th colspan="4">Map Value Update</th></tr> +<tr><td>Map Value Update ID</td> +<td>Integer</td> +<td>4 bytes</td> +<td>29 + 0x1000 for map value updates.</td> +</tr> +<tr><td>Field serialization</td> +<td>FieldValue</td> +<td>4 bytes</td> +<td>The field indicating what entry to update.</td> +</tr> +<tr><td>Value Update</td> +<td>Document Value Update</td> +<td> </td> +<td>The update operation to apply to the field indicated above.</td> +</tr> +<tr><th colspan="4">Remove Value Update</th></tr> +<tr><td>Remove Update ID</td> +<td>Integer</td> +<td>4 bytes</td> +<td>30 + 0x1000 for remove updates.</td> +</tr> +<tr><td>Field serialization</td> +<td>FieldValue</td> +<td> </td> +<td>The field indicating what entry to update.</td> +</tr> +</table> + +</body> +</html> |