Merge pull request #2340 from yahoo/havardpe/mixed-tensor-binary-format-spec

added description of new mixed serialization format
author: Geir Storli <geirstorli@yahoo.no> 2017-05-03 10:53:30 +0200
committer: GitHub <noreply@github.com> 2017-05-03 10:53:30 +0200
commit: 478f3db664db1a335cf90bf022297371783cdd77 (patch)
tree: 21ee513bbf3fd883341bf13746919823ec229b86 /eval
parent: 3cd9c747e62ee82edfb5fedf4415f5daa45b3ab3 (diff)
parent: b4af39d746ad0993b646fabf5114f3cdfb49030e (diff)
1 files changed, 42 insertions, 0 deletions
diff --git a/eval/src/vespa/eval/tensor/serialization/format.txt b/eval/src/vespa/eval/tensor/serialization/format.txt
new file mode 100644
index 00000000000..8c5d3b331d2
--- /dev/null
+++ b/eval/src/vespa/eval/tensor/serialization/format.txt
@@ -0,0 +1,42 @@
+This file explains how the typed binary formats of serialized tensors
+for different archetypes (sparse[1], dense[2] and mixed[3]) can be
+interpreted as a single unified binary format. The description below
+uses data types defined by document serialization (nbostream) combined
+with some comments and python-inspired flow-control. The mixed[3]
+binary format is defined in such a way that it overlays as
+effortlessly as possible with both existing formats.
+
+//-----------------------------------------------------------------------------
+
+byte: type (1:sparse, 2:dense, 3:mixed)
+  bit 0 -> 'sparse'
+  bit 1 -> 'dense'
+  (mixed tensors are tagged as both 'sparse' and 'dense')
+
+if ('sparse'):
+  1_4_int: number of mapped dimensions -> 'n_mapped'
+  'n_mapped' times: (sorted by dimension name)
+    small_string: dimension name
+
+if ('dense'):
+  1_4_int: number of indexed dimensions -> 'n_indexed'
+  'n_indexed' times: (sorted by dimension name)
+    small_string: dimension name
+    1_4_int: dimension size (must be at least 1) -> 'size_i'
+
+if ('n_mapped' > 0 || !'dense'):
+  1_4_int: number of named dense sub-spaces -> 'n_blocks'
+else:
+  'n_blocks' = 1 (a single dense space)
+
+'n_blocks' times:
+  'n_mapped' times:
+    small_string: dimension label (same order as dimension names)
+  prod('size_i') times: (product of all indexed dimension sizes)
+    double: cell value (last indexed dimension is nested innermost)
+
+//-----------------------------------------------------------------------------
+
+Note: A tensor with no dimensions should not be serialized as
+sparse[1], but when it is, it will contain an integer indicating the
+number of cells.
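
The layout in format.txt above can be read back with a short decoder. The
following is a minimal sketch, not the Vespa implementation. The spec does not
define the nbostream primitives, so the sketch assumes the following: a 1_4_int
is a single byte when the value is below 0x80 and otherwise four big-endian
bytes with the top bit set as a marker; a small_string is a 1_4_int length
followed by that many UTF-8 bytes; a double is 8 bytes in big-endian (network)
order. The function names read_1_4_int, read_small_string and decode_tensor
are hypothetical.

```python
import struct
from math import prod


def read_1_4_int(buf, pos):
    """Read a 1_4_int (assumed encoding, see the lead-in). Returns (value, new_pos)."""
    first = buf[pos]
    if first & 0x80:
        # Assumed 4-byte form: big-endian, top bit is only a marker.
        (value,) = struct.unpack_from(">I", buf, pos)
        return value & 0x7FFFFFFF, pos + 4
    return first, pos + 1


def read_small_string(buf, pos):
    """Read a small_string: assumed to be a 1_4_int length plus UTF-8 bytes."""
    length, pos = read_1_4_int(buf, pos)
    return buf[pos:pos + length].decode("utf-8"), pos + length


def decode_tensor(buf):
    """Decode one tensor following the unified layout in format.txt."""
    pos = 0
    type_byte = buf[pos]
    pos += 1
    sparse = bool(type_byte & 1)  # bit 0 -> 'sparse'
    dense = bool(type_byte & 2)   # bit 1 -> 'dense'

    mapped = []  # mapped dimension names, sorted by name in the stream
    if sparse:
        n_mapped, pos = read_1_4_int(buf, pos)
        for _ in range(n_mapped):
            name, pos = read_small_string(buf, pos)
            mapped.append(name)

    indexed = []  # (name, size) pairs for indexed dimensions
    if dense:
        n_indexed, pos = read_1_4_int(buf, pos)
        for _ in range(n_indexed):
            name, pos = read_small_string(buf, pos)
            size, pos = read_1_4_int(buf, pos)
            indexed.append((name, size))

    # The block count is only present when there are mapped dimensions
    # or the tensor is not tagged 'dense'.
    if mapped or not dense:
        n_blocks, pos = read_1_4_int(buf, pos)
    else:
        n_blocks = 1  # a single dense space

    block_size = prod(size for _, size in indexed)  # 1 when there are no indexed dims
    blocks = []
    for _ in range(n_blocks):
        labels = []
        for _ in mapped:
            label, pos = read_small_string(buf, pos)
            labels.append(label)
        # Cells are stored with the last indexed dimension nested innermost.
        cells = list(struct.unpack_from(f">{block_size}d", buf, pos))
        pos += 8 * block_size
        blocks.append((tuple(labels), cells))
    return mapped, indexed, blocks
```

decode_tensor returns the mapped dimension names, the indexed (name, size)
pairs, and one (labels, cells) pair per dense sub-space. For a purely dense
tensor the result holds a single block with an empty label tuple, which
mirrors the "'n_blocks' = 1" branch of the spec.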