author Tor Brede Vekterli <vekterli@verizonmedia.com> 2021-01-06 11:55:02 +0100
committer Tor Brede Vekterli <vekterli@verizonmedia.com> 2021-01-06 11:55:02 +0100
commit fd8201edc418228da13104d7e7888f96276e94f4
tree a4bf8224a3bf78706c65a8f719ea2be0aba85968 /ERRATA.md
parent 4dff64a9dda3bbc86d8877116d3e2c7467322c21
Add errata entry for source-only replica merge edge case bug
Diffstat (limited to 'ERRATA.md')
-rw-r--r-- ERRATA.md 24
1 files changed, 24 insertions, 0 deletions
diff --git a/ERRATA.md b/ERRATA.md
index 953eed357a7..3c739e3f790 100644
--- a/ERRATA.md
+++ b/ERRATA.md
@@ -2,6 +2,30 @@
## Errata
+### 2021-01-06: Data loss during data migration
+This bug was introduced more than 10 years ago and remained unobserved until very recently. Fixed in Vespa-7.306.16.
+
+The following needs to happen to trigger the bug:
+
+* The replica placement for a given data bucket is non-ideal on at least 2 nodes.
+ This can happen during mass node retirement or if several nodes come back up after being down.
+ In this scenario the system must move all data away from a subset of replicas, known internally as
+ "source-only" replicas. This triggers a special internal optimization that tries to handle many such
+ moves in one go to minimize the required number of message roundtrips.
+* The stored document sets on the source-only bucket replicas are disjoint.
+* The total size of document data contained in a source-only bucket replica exceeds the configured
+ data merge transfer chunk size (default 4 MiB).
+
+Tracking of successfully merged documents is done by exchanging bitmasks between the nodes, where bit positions
+correspond to document presence on the nth node involved in the merge operation.
+The bug caused bitmasks from sub-operations restricted to source-only nodes to be incorrectly mapped
+onto the bitmask that tracks documents across all nodes.
+Even when not all documents could be transferred from the source-only node due to exceeding chunk transfer limits,
+the system would believe it had transferred _all_ remaining such documents due to the erroneous bitmask transform.
+
+This is a silent data loss bug and would be observed by the global document count in the system decreasing
+during a period of data merging.
+
### 2020-11-30: Document inconsistency
This bug was introduced in Vespa-7.277.38, fixed in Vespa-7.292.82.
The following needs to happen to trigger the bug: