diff options
-rw-r--r-- | eval/src/vespa/eval/tensor/serialization/format.txt | 37 |
1 files changed, 37 insertions, 0 deletions
diff --git a/eval/src/vespa/eval/tensor/serialization/format.txt b/eval/src/vespa/eval/tensor/serialization/format.txt new file mode 100644 index 00000000000..9d0a387c36a --- /dev/null +++ b/eval/src/vespa/eval/tensor/serialization/format.txt @@ -0,0 +1,37 @@ +This file explains how the typed binary formats of serialized tensors +for different archetypes (sparse[1], dense[2] and mixed[3]) can be +interpreted as a single unified binary format. The description below +uses data types defined by document serialization (nbostream) combined +with some comments and python-inspired flow-control. The mixed[3] +binary format is defined in such a way that it overlays as +effortlessly as possible with both existing formats. The only thing +needed to go from sparse[1] or dense[2] binary formats to the mixed[3] +format for a specific tensor is to add a single byte indicating there +are no dimensions of the other kind (mapped/indexed). + +byte: type (1:sparse, 2:dense, 3:mixed) + bit 0 -> 'sparse' + bit 1 -> 'dense' + (mixed tensors are tagged as both 'sparse' and 'dense') + +if ('sparse'): + 1_4_int: number of mapped dimensions -> ''n_mapped' + 'n_mapped' times: (sorted by dimension name) + small_string: dimension name + +if ('dense'): + 1_4_int: number of indexed dimensions -> 'n_indexed' + 'n_indexed' times: (sorted by dimension name) + small_string: dimensions name + 1_4_int: dimensions size (must be at least 1) -> 'size_i' + +if ('n_mapped > 0'): + 1_4_int: number of named dense sub-spaces -> 'n_blocks' +else: + 'n_blocks' = 1 (a single dense space) + +'n_blocks' times: + 'n_mapped' times: + small_string: dimension label (same order as dimension names) + prod('size_i') times: (product of all indexed dimension sizes) + double: cell value (last indexed dimension is nested innermost) |