Developer's guide to the serialized document format

When a Vespa document is stored or transferred from one application to another, it is serialized. The serialization format aims for robustness and speed. The most important fields are kept in a header that is accessible at low cost. The other fields are located by table look-ups.

Purpose

The purpose of the serialized format is:

Robustness. The format shall detect errors gracefully.
Speed. Deserialization shall be fast, especially for basic fields like DocumentId.
Size. The serialized format shall be compact and allow for efficient storage and transfer.

All fields are in network byte order.

Changelog

Current version: 8

CRC removed from document format. There used to be a 4 byte CRC at the end of a header or header + body serialization, calculated as a crc32 of all the other data in the serialization. This CRC was included in the document length.

Version: 7

The document length is now a static sized 4 byte value, instead of a variable 2,4,8 byte value.
(This entry was written during the move from version 7 to 8, so other changes made in version 7 may be missing here.)

Version: 6

This is the oldest version that we currently support. No known installation stores documents with a version smaller than this.

Document Format

This is the description of the serialized document format.

*Document serialization format*
Field	Type	Length	Description
Version	Short integer	2	Version number. Current is 8.
Length	Integer	4 bytes	Total length of object (excluding this field and version).
Document ID	Bytes		Unique ID for document. 0-terminated string, UTF-8 encoding.
Field Map	Bytes	See below	Placeholder for fields. (Note: Fieldmaps may contain other fieldmaps)
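As a rough illustration of the layout in the table above, the following Python sketch reads the version, length and document ID from the start of a serialized document. The function name `read_document_header` is made up for this example, and treating both integers as unsigned is an assumption, since the table does not state signedness.

```python
import struct

FIXED_PREFIX = ">HI"  # 2-byte version + 4-byte length, network byte order


def read_document_header(buf, pos=0):
    """Return (version, length, document_id, position_of_field_map)."""
    version, length = struct.unpack_from(FIXED_PREFIX, buf, pos)
    pos += struct.calcsize(FIXED_PREFIX)
    # The document ID is a 0-terminated UTF-8 string.
    end = buf.index(b"\x00", pos)
    document_id = buf[pos:end].decode("utf-8")
    # The field map starts right after the terminating 0 byte.
    return version, length, document_id, end + 1
```

The returned position is where a caller would continue with the field map. The length value is returned unchanged; per the table it excludes the version and length fields themselves.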

Field maps are serialized as follows:

*Fieldmap serialization format*
Field	Type	Length (bytes)	Description
Inventory bit mask	Byte	1	Inventory bits describing which parts of the FieldMap element carry data. Bit 0 set: FieldMap has document type. Bit 1 set: FieldMap has header fields. Bit 2 set: FieldMap has body fields. Bit 3 set: FieldMap has external body fields.
Below section is present when bit 0 of inventory is set
Document Type	Bytes		Document type. (0-terminated string, UTF-8 encoding.)
Version	Short integer	2	Document type version number.
Below section is present when bit 1 of inventory is set
Header data	Data array	See below	Header data packed in data array
Below section is present when bit 2 of inventory is set
Body data	Data array	See below	Body data packed in data array
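The inventory bit mask can be tested bit by bit to see which sections follow. A minimal Python sketch, where the dictionary `INVENTORY_BITS` and the function `sections_present` are hypothetical names used only for this example:

```python
# Bit positions as listed in the fieldmap table above.
INVENTORY_BITS = {
    0: "document type",
    1: "header fields",
    2: "body fields",
    3: "external body fields",
}


def sections_present(inventory):
    """List the sections announced by the inventory byte, lowest bit first."""
    return [name for bit, name in INVENTORY_BITS.items() if inventory & (1 << bit)]
```

For instance, an inventory byte of 3 (bits 0 and 1 set) announces a document type followed by header data, and no body data.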

*Data array serialization*
Field	Type	Length (bytes)	Description
Data length	Integer_2_4_8	2, 4 or 8	Length of data block (see below). Note that this length includes the length field itself.
Number of fields	Integer_1_4	1 or 4	Number of fields in data array
Below block is repeated "Number of fields" times
Field ID	Integer_1_4	1 or 4	ID of field.
Field Size	Integer_1_2_4	1, 2 or 4	Length of field.
End of repeated block
Data block	Bytes		The data block. Items appear in the same order as the entries in the field array. Use the lengths from the field array above to find each item's offset and length. If the block is compressed, the lengths refer to the decompressed version.
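The table look-up described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the data block is taken to be uncompressed, the helper names `read_varint` and `read_data_array` are invented for the example, and the variable-length integers are decoded as described under "Data types used in serialized format".

```python
def read_varint(buf, pos, sizes):
    """Decode Integer_1_4, Integer_1_2_4 or Integer_2_4_8; sizes is e.g. (1, 4)."""
    first = buf[pos]
    if first & 0x80 == 0:
        size, flag_bits = sizes[0], 0
    elif len(sizes) == 2:
        size, flag_bits = sizes[1], 1          # only bit 7 is a flag
    elif first & 0x40 == 0:
        size, flag_bits = sizes[1], 2          # bits 7 and 6 are flags
    else:
        size, flag_bits = sizes[2], 2
    if pos + size > len(buf):
        raise ValueError("buffer too short")
    value = int.from_bytes(buf[pos:pos + size], "big")
    return value & ((1 << (8 * size - flag_bits)) - 1), pos + size


def read_data_array(buf, pos=0):
    """Return (data_length, {field_id: raw_bytes}, end_position)."""
    data_length, pos = read_varint(buf, pos, (2, 4, 8))
    field_count, pos = read_varint(buf, pos, (1, 4))
    entries = []
    for _ in range(field_count):
        field_id, pos = read_varint(buf, pos, (1, 4))
        field_size, pos = read_varint(buf, pos, (1, 2, 4))
        entries.append((field_id, field_size))
    # Items in the data block follow the order of the field array.
    fields = {}
    for field_id, field_size in entries:
        fields[field_id] = buf[pos:pos + field_size]
        pos += field_size
    return data_length, fields, pos
```

The sketch returns the raw bytes of each field; decoding them requires the field's data type, which comes from the document type and is not part of the data array.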

*Data type serialization*
Data type	Length	Serialization
Integer (ID 0)	4	Signed integer, two's complement notation, network byte order.
Floating point number (ID 1)	4	IEEE 754, single precision, network byte order.
String (ID 2)	1 + (1 or 4) + length + 1	Strings are serialized as follows: - Coding byte. This has traditionally denoted the maximum number of bits per character in the UTF-8 encoded string, but has never been used in deserialization code. Set to 32 if not used. Set to <32 if you know the UTF-8 string uses fewer bits per character; e.g. ASCII could use 8. Set bit 6 (decimal 64) if the string has an annotation tree. - Integer_1_4 with the length of the string. - The string (UTF-8 encoding), including the 0-terminating byte. - An annotation tree, present only if bit 6 (decimal 64) of the coding byte is set: the total length of all span trees, excluding the length field itself (uint32); the number of span trees (int_1_2_4); then, for each root node, the tree name serialized as a String followed by the serialized SpanNode (see annotation tree serialization below).
Raw bytes (ID 3)	Length of buffer	Byte for byte copy
Long integer (ID 4)	8	Signed integer, two's complement notation, network byte order.
Double floating point number (ID 5)	8	IEEE 754, double precision, network byte order.
Array (ID 6)	At least 8 bytes	Arrays of any field type are serialized like this: - 4 bytes: Data type of the array elements - 4 bytes: Number of elements in the array. The following sequence is repeated "number of elements" times: - 4 bytes: Length of element - Serialized element
Fieldmap (ID 7)		Field maps (embedded or not) are defined above
Document (ID 8)		Document objects (embedded or not) are defined above
Timestamp (ID 9)		Same as long integer
Uri (ID 10)		Same as string
Exact string (ID 11)		Same as string
Content (ID 12)	At least 11 bytes	Content fields are serialized like this: - Content type length (1 byte) - Content type (0 terminated string, UTF-8 encoding.) - Content encoding length (1 byte) - Content encoding (0 terminated string, UTF-8 encoding.) - Content language length (1 byte) - Content language (0 terminated string, UTF-8 encoding.) - Content length (Integer, 4 bytes) - Content (including 0-terminating char)
Content meta (ID 13)	At least 12 bytes	Content (attachment) meta data are serialized like this: - Attachment size (Integer, 4 bytes) - Attachment name (0 terminated string, UTF-8 encoding.) - Attachment encoding (0 terminated string, UTF-8 encoding.) - Attachment content type (0 terminated string, UTF-8 encoding.) - Attachment part (0 terminated string, UTF-8 encoding.) - Attachment flag (Integer, 4 bytes)
Term boost (ID 15)		Same as string
Byte (ID 16)	1	One single byte
Set (ID 17)	At least 8 bytes	Sets of any field type are serialized like this: - Integer (4 bytes): Data type of the set elements - Integer (4 bytes): Number of elements in the set. The following sequence is repeated "number of elements" times: - Serialized element
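To make the fixed-width rows of the table above concrete, here is a small Python sketch that writes the four numeric types and wraps them in an array. The names `write_primitive` and `write_array` are invented for this example, and it assumes that the first 4 bytes of an array hold the type ID of its elements.

```python
import struct

# struct formats for the numeric type IDs in the table (">" = network byte order).
PRIMITIVE_FORMATS = {
    0: ">i",  # Integer, 4 bytes
    1: ">f",  # Floating point number, 4 bytes
    4: ">q",  # Long integer, 8 bytes
    5: ">d",  # Double floating point number, 8 bytes
}


def write_primitive(type_id, value):
    """Serialize one numeric value of the given type ID."""
    return struct.pack(PRIMITIVE_FORMATS[type_id], value)


def write_array(element_type_id, elements):
    """Serialize an array: element type, element count, then length + element."""
    out = struct.pack(">ii", element_type_id, len(elements))
    for element in elements:
        data = write_primitive(element_type_id, element)
        out += struct.pack(">i", len(data)) + data
    return out
```

An empty array produced this way is 8 bytes long, which matches the "at least 8 bytes" stated in the table.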

*Annotation tree serialization*
Data type	Length	Serialization
SpanNode (base class)	1 + (1, 2 or 4) + Annotation serialization + subclass payload	- Type byte (1: Span, 2: SpanList, 4: AlternateSpanList) - Number of annotations (int_1_2_4) - Each annotation, serialized as given below - Remaining payload, serialized as given below by the subclasses Span, SpanList and AlternateSpanList
Annotation	4 + (1, 2 or 4) + (possibly 4 + FieldValue serialization)	- MD5 name hash, 4 least significant bytes (uint32). Note: 0-127 are reserved for internal Vespa usage. - Length (int_1_2_4). The following fields are only present if length > 0: - Data type ID (uint32) - FieldValue, as given by its own serialization. Note: no sequence ID is written, because annotations are serialized and deserialized in a fixed order.
Span	SpanNode serialization + (1, 2 or 4) + (1, 2 or 4)	- Serialization from the SpanNode base class - From index, as given by Java String (UTF-16) (int_1_2_4) - Length, as given by Java String (UTF-16) (int_1_2_4)
SpanList	SpanNode serialization + (1, 2 or 4) + n times SpanNode serialization	- Serialization from the SpanNode base class - Number of children (int_1_2_4) - Each child node, serialized as SpanNode (Span, SpanList, AlternateSpanList)
AlternateSpanList	SpanNode serialization + (1, 2 or 4) + n times (8 + SpanList serialization)	- Serialization from the SpanNode base class - Number of child trees (int_1_2_4) - For each child tree: probability (double), then serialization as given by SpanList above
AnnotationRef	1, 2 or 4	Unique sequence ID of the annotation being referred to (int_1_2_4)
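The simplest case in the table above is a leaf Span. The following Python sketch reads one, under explicit limitations: it only handles annotations without a value (length 0) and stops with an error otherwise, because decoding a FieldValue needs type information that is outside this table. The names `read_int_1_2_4` and `read_span` are hypothetical.

```python
import struct


def read_int_1_2_4(buf, pos):
    """Decode an int_1_2_4 value; returns (value, next_position)."""
    first = buf[pos]
    if first & 0x80 == 0:
        return first, pos + 1
    if first & 0x40 == 0:
        return int.from_bytes(buf[pos:pos + 2], "big") & 0x3FFF, pos + 2
    return int.from_bytes(buf[pos:pos + 4], "big") & 0x3FFFFFFF, pos + 4


def read_span(buf, pos=0):
    """Read a Span node; returns (dict, next_position)."""
    node_type = buf[pos]
    pos += 1
    if node_type != 1:  # 1: Span, 2: SpanList, 4: AlternateSpanList
        raise ValueError("not a Span node")
    annotation_count, pos = read_int_1_2_4(buf, pos)
    name_hashes = []
    for _ in range(annotation_count):
        (name_hash,) = struct.unpack_from(">I", buf, pos)
        pos += 4
        length, pos = read_int_1_2_4(buf, pos)
        if length > 0:
            # Data type ID and FieldValue follow; not handled in this sketch.
            raise NotImplementedError("annotation values are not decoded")
        name_hashes.append(name_hash)
    # From index and length are counted in UTF-16 units, as in a Java String.
    from_index, pos = read_int_1_2_4(buf, pos)
    span_length, pos = read_int_1_2_4(buf, pos)
    return {"annotations": name_hashes, "from": from_index, "length": span_length}, pos
```

SpanList and AlternateSpanList would call a reader like this recursively for each child, after their own child count.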

*Data types used in serialized format*
Data type	Serialization
Integer_1_4	If bit 7 of first byte is unset, coded using 1 byte. If bit 7 of first byte is set, coded using 4 bytes (bit 7 of first byte must be masked away). Range: 0 to 2^31 - 1.
Integer_1_2_4	If bit 7 of first byte is unset, coded using 1 byte. If bit 7 of first byte is set and bit 6 of first byte is unset, coded using 2 bytes (bit 7 and 6 of first byte must be masked away). If bit 7 and 6 of first byte are set, coded using 4 bytes (bit 7 and 6 of first byte must be masked away). Range: 0 to 2^30 - 1.
Integer_2_4_8	If bit 7 of first byte is unset, coded using 2 bytes. If bit 7 of first byte is set and bit 6 of first byte is unset, coded using 4 bytes (bit 7 and 6 of first byte must be masked away). If bit 7 and 6 of first byte are set, coded using 8 bytes (bit 7 and 6 of first byte must be masked away). Range: 0 to 2^62 - 1.
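The three variable-length integer types can be decoded with a few bit tests. The Python sketch below follows the rules in the table above; the function names are chosen for this example only, and each function returns the decoded value together with the position of the next unread byte.

```python
def read_int_1_4(buf, pos=0):
    """Integer_1_4: 1 byte, or 4 bytes with bit 7 of the first byte set."""
    if buf[pos] & 0x80 == 0:
        return buf[pos], pos + 1
    return int.from_bytes(buf[pos:pos + 4], "big") & 0x7FFFFFFF, pos + 4


def read_int_1_2_4(buf, pos=0):
    """Integer_1_2_4: 1 byte, 2 bytes (bit 7 set) or 4 bytes (bits 7 and 6 set)."""
    first = buf[pos]
    if first & 0x80 == 0:
        return first, pos + 1
    if first & 0x40 == 0:
        return int.from_bytes(buf[pos:pos + 2], "big") & 0x3FFF, pos + 2
    return int.from_bytes(buf[pos:pos + 4], "big") & 0x3FFFFFFF, pos + 4


def read_int_2_4_8(buf, pos=0):
    """Integer_2_4_8: 2 bytes, 4 bytes (bit 7 set) or 8 bytes (bits 7 and 6 set)."""
    first = buf[pos]
    if first & 0x80 == 0:
        return int.from_bytes(buf[pos:pos + 2], "big"), pos + 2
    if first & 0x40 == 0:
        return int.from_bytes(buf[pos:pos + 4], "big") & 0x3FFFFFFF, pos + 4
    return int.from_bytes(buf[pos:pos + 8], "big") & 0x3FFFFFFFFFFFFFFF, pos + 8
```

Masking away the flag bits is what limits the ranges to 31, 30 and 62 usable bits respectively. The sketch does not check for truncated input; a robust reader would verify that enough bytes remain before decoding.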

Document Update Format

This is the description of the serialized document update format.

*Document update serialization format*
Field	Type	Length	Description
Document ID	Bytes		Unique ID for document. 0-terminated string, UTF-8 encoding.
Content byte	Byte	1 byte	Always set to 1
Document Type	Bytes		Document type. (0-terminated string, UTF-8 encoding.)
Number of fields to update	Integer	4 bytes	The number of fields to update
Serialized field updates	Field Update		The serialized field updates. See below.
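A reader for the leading fields of a document update could look roughly like the Python sketch below. It follows the table above field by field; the function names and the choice to read the field count as a signed 4-byte integer are assumptions made for the example.

```python
import struct


def _read_cstring(buf, pos):
    """Read a 0-terminated UTF-8 string; returns (text, next_position)."""
    end = buf.index(b"\x00", pos)
    return buf[pos:end].decode("utf-8"), end + 1


def read_update_header(buf, pos=0):
    """Return (document_id, document_type, field_update_count, next_position)."""
    document_id, pos = _read_cstring(buf, pos)
    content_byte = buf[pos]
    pos += 1
    if content_byte != 1:
        # The table states that this byte is always 1.
        raise ValueError("unexpected content byte: %d" % content_byte)
    document_type, pos = _read_cstring(buf, pos)
    (field_update_count,) = struct.unpack_from(">i", buf, pos)
    return document_id, document_type, field_update_count, pos + 4
```

The returned position is where the first serialized field update begins.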

*Field update serialization format*
Field	Type	Length	Description
Field Id	Integer	4 bytes	Field id within document type.
Number of value updates	Integer	4 bytes	Number of value updates to this field.
Serialized field update values	Bytes		The serialized field update values. See below.

*Document update value serialization format*
Add Value Update
Field	Type	Length	Description
Add Value Update ID	Integer	4 bytes	25 + 0x1000 for add value updates.
Field serialization	FieldValue		Serialization of the field to add.
Weight	Integer	4 bytes	Weight. Used if update applies to weighted set.
Arithmetic Update
Arithmetic Update ID	Integer	4 bytes	26 + 0x1000 for arithmetic updates.
Operator ID	Integer	4 bytes	Identifies whether this does add, subtract, multiply or divide.
Operand	Double	8 bytes	The right operand to use in the arithmetic operation.
Assign Update
Assign Update ID	Integer	4 bytes	27 + 0x1000 for assign updates.
Content flag	Byte	1 byte	Contains 1 if we have content, 0 if not.
Field serialization	FieldValue		Serialization of the field to assign.
Clear Update
Clear Update ID	Integer	4 bytes	28 + 0x1000 for clear updates.
Map Value Update
Map Value Update ID	Integer	4 bytes	29 + 0x1000 for map value updates.
Field serialization	FieldValue	4 bytes	The field indicating what entry to update.
Value Update	Document Value Update		The update operation to apply to the field indicated above.
Remove Value Update
Remove Update ID	Integer	4 bytes	30 + 0x1000 for remove updates.
Field serialization	FieldValue		The field indicating what entry to update.
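The update IDs above all share the 0x1000 offset, so they can be written as constants. The Python sketch below also serializes the two updates whose layout is fully given by fixed-width fields, the arithmetic update and the clear update. The constant and function names are made up for this example, and the numeric operator IDs are not listed in this document, so the operator is passed in as a plain integer.

```python
import struct

VALUE_UPDATE_OFFSET = 0x1000

ADD_VALUE_UPDATE = 25 + VALUE_UPDATE_OFFSET
ARITHMETIC_UPDATE = 26 + VALUE_UPDATE_OFFSET
ASSIGN_UPDATE = 27 + VALUE_UPDATE_OFFSET
CLEAR_UPDATE = 28 + VALUE_UPDATE_OFFSET
MAP_VALUE_UPDATE = 29 + VALUE_UPDATE_OFFSET
REMOVE_VALUE_UPDATE = 30 + VALUE_UPDATE_OFFSET


def write_arithmetic_update(operator_id, operand):
    """4-byte update ID, 4-byte operator ID, 8-byte double operand."""
    return struct.pack(">iid", ARITHMETIC_UPDATE, operator_id, operand)


def write_clear_update():
    """A clear update consists of its 4-byte update ID only."""
    return struct.pack(">i", CLEAR_UPDATE)
```

An arithmetic update written this way is always 16 bytes long. The other update kinds embed a serialized FieldValue and therefore vary in size.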