author | Tor Brede Vekterli <vekterli@verizonmedia.com> | 2021-01-06 11:55:02 +0100 |
---|---|---|
committer | Tor Brede Vekterli <vekterli@verizonmedia.com> | 2021-01-06 11:55:02 +0100 |
commit | fd8201edc418228da13104d7e7888f96276e94f4 (patch) | |
tree | a4bf8224a3bf78706c65a8f719ea2be0aba85968 /ERRATA.md | |
parent | 4dff64a9dda3bbc86d8877116d3e2c7467322c21 (diff) |
Add errata entry for source-only replica merge edge case bug
Diffstat (limited to 'ERRATA.md')
-rw-r--r-- | ERRATA.md | 24 |
1 files changed, 24 insertions, 0 deletions
diff --git a/ERRATA.md b/ERRATA.md
index 953eed357a7..3c739e3f790 100644
--- a/ERRATA.md
+++ b/ERRATA.md
@@ -2,6 +2,30 @@
 
 ## Errata
 
+### 2021-01-06: Data loss during data migration
+This bug was introduced more than 10 years ago and remained unobserved until very recently. Fixed in Vespa-7.306.16.
+
+The following needs to happen to trigger the bug:
+
+* The replica placement for a given data bucket is non-ideal on at least 2 nodes.
+ This can happen during mass node retirement or if several nodes come back up after being down.
+ In this scenario the system must move all data away from a subset of replicas. This triggers
+ a special internal optimization that tries to handle many such moves in one go to
+ minimize the required number of message roundtrips. These are known internally as "source-only" replicas.
+* The stored document sets on the source-only bucket replicas are disjoint.
+* The total size of document data contained in a source-only bucket replica exceeds the configured
+ data merge transfer chunk size (default 4 MiB)
+
+Tracking of successfully merged documents is done by exchanging bitmasks between the nodes, where bit positions
+correspond to document presence on the nth node involved in the merge operation.
+The bug caused bitmasks for sub-operations optimized to focus on source-only nodes to not be correctly transformed
+onto the bitmask tracking documents across all nodes.
+Even when not all documents could be transferred from the source-only node due to exceeding chunk transfer limits,
+the system would believe it had transferred _all_ remaining such documents due to the erroneous bitmask transform.
+
+This is a silent data loss bug and would be observed by the global document count in the system decreasing
+during a period of data merging.
+
 ### 2020-11-30: Document inconsistency
 This bug was introduced in Vespa-7.277.38, fixed in Vespa-7.292.82.
 The following needs to happen to trigger the bug:
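
The errata entry describes presence bitmasks whose bit positions refer to the nth node in a merge, and a transform from a sub-operation's bitmask onto the bitmask covering all nodes. The following is a minimal Python sketch of what such a transform could look like. It is an illustration only and not Vespa's implementation: the function name `remap_mask`, the `sub_to_full` index table, and the four-node layout are all hypothetical.

```python
from typing import Sequence


def remap_mask(sub_mask: int, sub_to_full: Sequence[int]) -> int:
    """Translate a presence bitmask from sub-operation node positions
    to full-merge node positions.

    sub_to_full[i] is the position in the full merge of the node that
    sits at position i in the sub-operation (hypothetical layout).
    """
    full_mask = 0
    for sub_idx, full_idx in enumerate(sub_to_full):
        if sub_mask & (1 << sub_idx):
            # The node at sub_idx has the document; set its bit at the
            # position it occupies in the full merge.
            full_mask |= 1 << full_idx
    return full_mask


# Hypothetical full merge over four nodes, positions 0..3.
# The sub-operation involves only the nodes at full positions 0 and 2.
SUB_TO_FULL = [0, 2]

# A document present only on the second sub-operation node
# (full position 2) has sub-mask 0b10.
doc_sub_mask = 0b10
doc_full_mask = remap_mask(doc_sub_mask, SUB_TO_FULL)
```

In this sketch, reusing `doc_sub_mask` unchanged as a full-merge mask would set the bit for full position 1, a node that is not part of the sub-operation. That is one way an incorrect transform can make the bookkeeping disagree with what was actually transferred. The exact defect fixed in Vespa-7.306.16 is not shown here.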