path: root/storage
* - Add support for using an unbound Q -> nonblocking. (Henning Baldersheim, 2022-01-13; 1 file, -1/+1)
  - It uses a synchronized overflow Q if the main Q is full.
  - Long term, the intention is that the blocking option will be removed.
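
The queueing scheme described here is simple enough to sketch. Below is a minimal illustration, assuming a plain std::function task model; `NonBlockingQueue` and its member names are illustrative, not the actual vespalib executor types.

```cpp
#include <cstddef>
#include <deque>
#include <functional>
#include <mutex>
#include <optional>

// Illustrative stand-in, not the actual vespalib type.
class NonBlockingQueue {
public:
    explicit NonBlockingQueue(size_t main_capacity) : _capacity(main_capacity) {}

    // Never blocks the producer: when the bounded main Q is full, the task
    // spills into the synchronized, unbounded overflow Q instead.
    void push(std::function<void()> task) {
        std::lock_guard guard(_lock);
        if (_main.size() < _capacity) {
            _main.push_back(std::move(task));
        } else {
            _overflow.push_back(std::move(task));
        }
    }

    // Consumers drain the main Q first, then the overflow Q.
    std::optional<std::function<void()>> pop() {
        std::lock_guard guard(_lock);
        auto &q = !_main.empty() ? _main : _overflow;
        if (q.empty()) {
            return std::nullopt;
        }
        auto task = std::move(q.front());
        q.pop_front();
        return task;
    }

private:
    std::mutex _lock;
    std::deque<std::function<void()>> _main;
    std::deque<std::function<void()>> _overflow;
    const size_t _capacity;
};
```
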
* Set defaults in config defs and ModelContext api to improve bucket merge performance on content nodes. (Geir Storli, 2022-01-12; 2 files, -5/+5)
  These are the same defaults set for the feature flags in https://github.com/vespa-engine/vespa/pull/20759.
* Add metrics that offer insight into how throttling affects scheduling (Tor Brede Vekterli, 2022-01-11; 3 files, -3/+21)
  Adds the following stripe-level metrics:
  * `throttled_rpc_direct_dispatches` - number of times an RPC thread could not dispatch an async operation directly to Proton because it was disallowed by the throttle policy.
  * `throttled_persistence_thread_polls` - number of times a persistence thread could not immediately dispatch a queued async operation because it was disallowed by the throttle policy.
  * `timeouts_waiting_for_throttle_token` - number of times a persistence thread timed out waiting for an available throttle policy token.
* Add missing pragma once (Tor Brede Vekterli, 2022-01-11; 1 file, -0/+2)
* Move dummy mbus messages to shared header and use in MergeThrottler as well (Tor Brede Vekterli, 2022-01-11; 3 files, -39/+50)
  Explicitly override `mbus::Message::getApproxSize()` to return 0 (instead of the default 1) to avoid mis-counting size between requests and responses. This has not mattered in practice since we haven't set any size-based limits (only message count limits), but we should ensure symmetry anyway.
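
The size-symmetry point is easiest to see in code. A minimal sketch of the pattern, with `Message` standing in for `mbus::Message` (whose default `getApproxSize()` returns 1, per the commit message) and `DummyMessage` as an illustrative name:

```cpp
#include <cstdint>

// Simplified stand-in for mbus::Message; only the relevant virtual is shown.
struct Message {
    virtual ~Message() = default;
    virtual uint32_t getApproxSize() const { return 1; } // default approx size
};

// Dummy messages carry no payload, so report zero size. This keeps request
// and response sizes symmetric if size-based accounting is ever enabled.
struct DummyMessage : Message {
    uint32_t getApproxSize() const override { return 0; }
};
```
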
* Only wake up a waiting thread if doing so would possibly result in success (Tor Brede Vekterli, 2022-01-11; 1 file, -1/+2)
* Clean up DynamicOperationThrottler to make it easier to read (Tor Brede Vekterli, 2022-01-11; 2 files, -19/+40)
* Support dynamic throttling of async persistence operations (Tor Brede Vekterli, 2022-01-10; 24 files, -81/+614)
  Adds an operation throttler that is intended to provide global throttling of async operations across all persistence stripe threads. A throttler wraps a logical max pending window size of in-flight operations. Depending on the throttler implementation, the window size may expand and shrink dynamically. Exactly how and when this happens is unspecified.
  The commit adds two throttler implementations:
  * An unlimited throttler that is a no-op and never blocks.
  * A throttler built around the mbus `DynamicThrottlePolicy` that defers all window decisions to it.
  The current config default is to use the unlimited throttler. Config changes require a process restart.
  Offers both polling and (timed, non-timed) blocking calls for acquiring a throttle token. If the returned token is valid, the caller may proceed to invoke the asynchronous operation. The window slot taken up by a valid throttle token is implicitly freed up when the token is destroyed.
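
The token mechanics lend themselves to an RAII sketch. The following is a simplified fixed-window stand-in (the real `DynamicThrottlePolicy`-backed variant resizes the window dynamically); `OperationThrottler` and `Token` here are illustrative names mirroring the commit's description, not the actual classes:

```cpp
#include <chrono>
#include <condition_variable>
#include <cstddef>
#include <mutex>

class OperationThrottler {
public:
    // Move-only token; destroying a valid token frees its window slot.
    class Token {
    public:
        Token() : _owner(nullptr) {}
        explicit Token(OperationThrottler *owner) : _owner(owner) {}
        Token(Token &&rhs) noexcept : _owner(rhs._owner) { rhs._owner = nullptr; }
        ~Token() { if (_owner) _owner->release(); }
        bool valid() const { return _owner != nullptr; }
    private:
        OperationThrottler *_owner;
    };

    explicit OperationThrottler(size_t window_size) : _window(window_size) {}

    // Polling variant: returns an invalid token if the window is full.
    Token try_acquire() {
        std::lock_guard guard(_lock);
        if (_pending < _window) { ++_pending; return Token(this); }
        return Token();
    }

    // Timed blocking variant.
    Token acquire_for(std::chrono::milliseconds timeout) {
        std::unique_lock guard(_lock);
        if (_cond.wait_for(guard, timeout, [&] { return _pending < _window; })) {
            ++_pending;
            return Token(this);
        }
        return Token();
    }

private:
    void release() {
        std::lock_guard guard(_lock);
        --_pending;
        _cond.notify_one();
    }

    std::mutex _lock;
    std::condition_variable _cond;
    size_t _pending = 0;
    const size_t _window;
};
```

Usage follows the commit text: acquire a token, check `valid()`, and keep the token alive for the lifetime of the async operation so that its destruction implicitly frees the window slot.
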
* Revert "Revert "Balder/refactor docentry""Henning Baldersheim2022-01-0725-91/+101
|
* Revert "Balder/refactor docentry"Arnstein Ressem2022-01-0725-101/+91
|
* - Flags -> Enum. (Henning Baldersheim, 2022-01-06; 5 files, -26/+22)
  - Consistently use DocEntryList as the type for std::vector<spi::DocEntry::UP>.
* Only care about size of payload. Also add payload containing only doctype and gid. (Henning Baldersheim, 2022-01-06; 4 files, -4/+4)
* Use enum class for the flags. (Henning Baldersheim, 2022-01-06; 2 files, -30/+31)
* Simplify by avoiding having both DocumentSize and PersistedDocumentSize; they are the same. (Henning Baldersheim, 2022-01-06; 3 files, -9/+5)
* Simplify DocEntry to get a clean interface with multiple implementations, instead of a mutant. (Henning Baldersheim, 2022-01-06; 24 files, -30/+47)
  Also add tests for the different variations a DocEntry can have.
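
The shape of the refactoring, roughly: one small interface, with each variant as its own implementation, instead of a single class whose meaning depends on which fields happen to be populated. The names below are illustrative stand-ins for the actual spi types:

```cpp
#include <cstdint>
#include <memory>
#include <string>

class DocEntry {
public:
    using UP = std::unique_ptr<DocEntry>;
    virtual ~DocEntry() = default;
    virtual uint64_t timestamp() const = 0;
    virtual bool is_remove() const = 0;
};

// Entry carrying only a timestamp, e.g. for a removed document.
class RemoveEntry final : public DocEntry {
public:
    explicit RemoveEntry(uint64_t ts) : _ts(ts) {}
    uint64_t timestamp() const override { return _ts; }
    bool is_remove() const override { return true; }
private:
    uint64_t _ts;
};

// Entry carrying a full document payload.
class DocumentEntry final : public DocEntry {
public:
    DocumentEntry(uint64_t ts, std::string doc) : _ts(ts), _doc(std::move(doc)) {}
    uint64_t timestamp() const override { return _ts; }
    bool is_remove() const override { return false; }
private:
    uint64_t _ts;
    std::string _doc;
};
```
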
* Make host info cluster state version reporting correct for deferred state bundles (Tor Brede Vekterli, 2022-01-03; 3 files, -75/+83)
  If an application uses deferred cluster state activations, do not report back a given cluster state version as being in use by the node until the state version has been explicitly activated by the cluster controller.
  This change is needed because the replication invalidation happens upon recovery mode entry, and for deferred state bundles this takes place when a cluster state is _activated_, not when the distributor is otherwise done gathering bucket info (for a non-deferred bundle the activation happens implicitly at this point). If the state manager reports that the new cluster state is in effect even though it has not been activated, the cluster controller could still end up using stale replication stats, since the invalidation logic has not yet run at that point in time.
  The cluster controller will ignore any host info responses for older versions, so with this change any stale replication statistics should not be taken into account.
* Invalidate bucket DB replica statistics upon recovery mode entry (Tor Brede Vekterli, 2022-01-03; 6 files, -16/+61)
  The replica stats track the minimum replication factor of any bucket the distributor maintains for a given content node. These statistics may be asynchronously queried by the cluster controller through the host info reporting API. If we do not invalidate the statistics upon a cluster state change, there is a small window of time in which the distributor may report back _stale_ statistics that were valid for the _prior_ cluster state version but not for the new one. This can happen if the cluster controller fetches host info from the node between the start of the recovery period and the completion of the recovery mode DB scan.
  Receiving stale replication statistics may cause the cluster controller to erroneously believe that replication due to node retirements etc. has completed earlier than it really has, possibly impacting orchestration decisions in a suboptimal manner.
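
The invalidation amounts to tagging the statistics with the cluster state version they were computed for and clearing them as soon as recovery mode begins. A minimal sketch under that assumption, with hypothetical names:

```cpp
#include <cstdint>
#include <optional>

struct ReplicationStats { uint32_t min_replication; };

class ReplicaStatsTracker {
public:
    // Entering recovery mode for a new state version invalidates the stats,
    // so stale numbers can never be reported for the new version.
    void on_recovery_mode_entry(uint32_t new_state_version) {
        _stats.reset();
        _state_version = new_state_version;
    }
    // The recovery-mode DB scan repopulates the stats; results from a scan
    // of an older (preempted) version are discarded.
    void on_db_scan_complete(uint32_t version, ReplicationStats stats) {
        if (version == _state_version) _stats = stats;
    }
    // Host info reporting only includes stats when they are valid for the
    // current state version.
    std::optional<ReplicationStats> current_stats() const { return _stats; }
private:
    uint32_t _state_version = 0;
    std::optional<ReplicationStats> _stats;
};
```
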
* Let CommunicationManager swallow any errant internal reply messages (Tor Brede Vekterli, 2021-12-22; 2 files, -1/+12)
  It's possible for internal commands to have replies auto-generated if they're evicted/aborted from persistence queue structures. Most of these replies are intercepted by higher-level links in the storage chain, but commands such as `RunTaskCommand` are actually initiated by the persistence _provider_ rather than a higher-level component, and are therefore not caught explicitly anywhere.
  Let CommunicationManager signal that internal replies are handled, even if the handling is simply to ignore them entirely. This prevents the replies from being spuriously warning-logged as "unhandled message on top of call chain".
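
In message-chain terms the fix is tiny: claim the reply as handled and do nothing. A sketch with a simplified message model (the real CommunicationManager API differs):

```cpp
#include <memory>

// Hypothetical simplified message model.
struct Reply {
    virtual ~Reply() = default;
    virtual bool is_internal() const = 0;
};

// Returning true tells the messaging layer the reply was handled, which
// stops it from bubbling to the top of the call chain and triggering an
// "unhandled message" warning. Ignoring it is the whole handling.
bool on_reply(const std::shared_ptr<Reply> &reply) {
    if (reply->is_internal()) {
        return true; // swallow: auto-generated internal replies need no action
    }
    return false;    // let other handlers process it
}
```
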
* Minor cleanup of CommunicationManager code, no change in semantics (Tor Brede Vekterli, 2021-12-22; 1 file, -22/+23)
* Complete wiring of OK/failure metric reporting during update write-repair (Tor Brede Vekterli, 2021-12-17; 3 files, -11/+43)
  This also changes the aggregation semantics of "not found" results. An update reply indicating "not found" is protocol-wise reported back to the client as successful, but with a not-found field set that has to be explicitly checked. On the distributor, however, the `notfound` metric lives under the `updates.failures` metric set, i.e. it is not counted as a success. To avoid double bookkeeping caused by treating it as both OK and not, we choose not to count "not found" updates under the `ok` metric. But to avoid artificially inflating the `failures.sum` metric, we remove the `notfound` metric from the implicit sum aggregation.
  This means that visibility into "not found" updates requires explicitly tracking the `notfound` metric rather than just the aggregate, but this should be an acceptable tradeoff to avoid false failure positives.
* Cover additional update failure edge cases with metrics (Tor Brede Vekterli, 2021-12-16; 3 files, -15/+49)
  This change adds previously missing metric increments for test-and-set condition failures when these happen in the context of a write-repair, as well as missing wiring for the "notfound" failure metric. The latter is incremented when an update returns from the content nodes having found no existing document, or when the read phase of a write-repair finds no document to update (and neither `create: true` nor a TaS condition is set on the update).
  Also remove some internal reply type erasure to make it more obvious what the reply type must be.
* Let bucket maintenance priority queue be FIFO ordered within priority class (Tor Brede Vekterli, 2021-12-15; 4 files, -163/+126)
  This avoids a potential starvation issue in the existing implementation, which is bucket ID ordered within a given priority class. That ordering has the unfortunate effect that frequently reinserting buckets that sort before buckets already in the queue may starve the latter from ever being popped.
  Move to a composite key that first sorts on priority, then on a strictly increasing sequence number. Add a secondary index into this structure that allows for lookups on bucket IDs as before.
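
A condensed sketch of the composite-key structure, assuming lower priority values are more urgent (as in Vespa's storage priorities) and using a plain integer as a stand-in for the bucket ID type; all names here are illustrative:

```cpp
#include <cstdint>
#include <map>
#include <set>

using BucketId = uint64_t; // stand-in for the real bucket ID type

class MaintenanceQueue {
public:
    // NOTE: re-pushing a bucket that is already queued would need extra
    // handling (remove the old entry first); elided in this sketch.
    void push(uint8_t priority, BucketId bucket) {
        Key key{priority, _next_seq++, bucket};
        _queue.insert(key);
        _by_bucket[bucket] = key; // secondary index for bucket ID lookups
    }

    bool pop(BucketId &bucket_out) {
        if (_queue.empty()) return false;
        Key key = *_queue.begin(); // most urgent priority, oldest sequence first
        _queue.erase(_queue.begin());
        _by_bucket.erase(key.bucket);
        bucket_out = key.bucket;
        return true;
    }

private:
    struct Key {
        uint8_t  priority;
        uint64_t seq_no; // strictly increasing: FIFO within a priority class
        BucketId bucket;
        bool operator<(const Key &rhs) const {
            if (priority != rhs.priority) return priority < rhs.priority;
            return seq_no < rhs.seq_no;
        }
    };
    uint64_t _next_seq = 0;
    std::set<Key> _queue;
    std::map<BucketId, Key> _by_bucket;
};
```

Because the sequence number (not the bucket ID) breaks ties, a bucket that keeps getting reinserted always lands behind existing equal-priority entries instead of in front of them.
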
* Dump pending GC state for nodes when update inconsistency is detected (Tor Brede Vekterli, 2021-12-14; 2 files, -15/+34)
  This is to help debug a very rare edge case where the theory is that an update operation may race with the implicit removal of said document by asynchronous GC. By dumping the pending GC state for the bucket+node pair, we can get a good indication of whether this theory holds.
* Revert "Treat empty replica subset as inconsistent for GetOperation ↵Tor Brede Vekterli2021-12-133-26/+0
| | | | [run-systemtest]"
* Merge pull request #20470 from vespa-engine/toregge/backport-to-gcc-9-no-ranges (Henning Baldersheim, 2021-12-11; 1 file, -3/+6)
  * gcc 9 does not support std::ranges::all_of. (Tor Egge, 2021-12-11; 1 file, -3/+6)
* gcc 9 has no default operator==. (Tor Egge, 2021-12-11; 1 file, -1/+3)
* _executor -> _thread (Henning Baldersheim, 2021-12-09; 1 file, -2/+2)
* Add init_fun to vespalib::Thread too, to figure out what the thread is used for. (Henning Baldersheim, 2021-12-09; 1 file, -1/+3)
* Merge pull request #20382 from vespa-engine/vekterli/treat-empty-replica-subset-as-inconsistent-for-get-operations (Henning Baldersheim, 2021-12-08; 3 files, -0/+26)
  Treat empty replica subset as inconsistent for GetOperation [run-systemtest]
  * Treat empty replica subset as inconsistent for GetOperation (Tor Brede Vekterli, 2021-12-06; 3 files, -0/+26)
    `GetOperation` document-level consistency checks are used by the multi-phase update logic to see if we can fall back to a fast path even though not all replicas are in sync. Empty replicas are not considered part of the send-set, so only looking at replies from replicas _sent_ to will not detect this case. If we haphazardly treat empty replicas as implicitly being in sync, we risk triggering undetectable inconsistencies at the document level. This can happen if we send create-if-missing updates to an empty replica as well as a non-empty replica, and the document exists in the latter. The document would then be implicitly created on the empty replica with the same timestamp as that of the non-empty one, even though their contents would almost certainly differ.
    With this change we initially tag all `GetOperation`s with at least one empty replica as having inconsistent replicas. This will trigger the full write-repair code path for document updates.
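
The core decision can be sketched as a consistency predicate over replica metadata; `ReplicaInfo` below is a hypothetical simplification of the real bucket-info types:

```cpp
#include <cstdint>
#include <vector>

struct ReplicaInfo {
    uint32_t doc_count;
    uint32_t checksum;
    bool empty() const { return doc_count == 0; }
};

bool replicas_consistent(const std::vector<ReplicaInfo> &replicas) {
    if (replicas.empty()) return true; // nothing to compare
    uint32_t checksum = replicas.front().checksum;
    for (const auto &r : replicas) {
        // Empty replicas are not part of the Get send-set, so their content
        // can never be verified by replies; treat them as inconsistent so
        // updates take the full write-repair path.
        if (r.empty() || r.checksum != checksum) return false;
    }
    return true;
}
```
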
* Prevent orphaned bucket replicas caused by indeterminate CreateBucket replies (Tor Brede Vekterli, 2021-12-07; 2 files, -3/+36)
  Previously we implicitly assumed that a failed CreateBucket reply meant the bucket replica was not created, but this does not hold in the general case. A failure may just as well be due to connection failures etc. between the distributor and the content node. To tell for sure, we now send an explicit RequestBucketInfo to the node when a CreateBucket fails. If the bucket _was_ created, the replica will be reintroduced into the bucket DB. We still implicitly delete the bucket replica from the DB to avoid transiently routing client write load to a bucket that likely does not exist.
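
The recovery flow, sketched with hypothetical handler and message stand-ins (the real distributor code paths are more elaborate):

```cpp
#include <cstdint>

// Hypothetical stand-ins for the real distributor types.
struct CreateBucketReply {
    uint64_t bucket;
    uint16_t node;
    bool     failed;
};

class DistributorSketch {
public:
    void on_create_bucket_reply(const CreateBucketReply &reply) {
        if (!reply.failed) {
            return; // bucket definitely exists; DB entry is already correct
        }
        // The failure is indeterminate: the node may or may not have created
        // the bucket (e.g. only the reply was lost to a connection failure).
        // Drop the replica from the DB so client writes are not routed to a
        // bucket that likely does not exist...
        remove_replica_from_db(reply.bucket, reply.node);
        // ...then ask the node what it actually has. If the bucket was in
        // fact created, the bucket info reply re-inserts the replica into
        // the DB instead of leaving it orphaned on the node.
        send_request_bucket_info(reply.bucket, reply.node);
    }

private:
    void remove_replica_from_db(uint64_t bucket, uint16_t node) { (void)bucket; (void)node; }
    void send_request_bucket_info(uint64_t bucket, uint16_t node) { (void)bucket; (void)node; }
};
```
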
* Fix active merge state count status page rendering (Tor Brede Vekterli, 2021-12-03; 1 file, -2/+2)
* Merge pull request #20359 from vespa-engine/vekterli/decrement-merge-counter-when-sync-merge-handling-complete (Henning Baldersheim, 2021-12-03; 7 files, -20/+109)
  Decrement persistence thread merge counter when synchronous processing is complete [run-systemtest]
  * Decrement persistence thread merge counter when synchronous processing is complete (Tor Brede Vekterli, 2021-12-03; 7 files, -20/+109)
    Add a generic interface for letting an operation know that the synchronous part of its processing in the persistence thread is complete. This allows a potentially longer-running async operation to free up any limits that were put in place while it was taking up synchronous thread resources. Currently only used by merge-related operations (which may dispatch many async ops).
    Since we have an upper bound on how many threads in a stripe may process merge ops at the same time (to avoid blocking client ops), we could previously stall the pipelining of merges by hitting the concurrency limit even when all persistence threads were otherwise idle (waiting for prior async merge ops to complete). We now explicitly decrease the merge concurrency counter once the synchronous processing is done, allowing us to take on further merges immediately.
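
One way to picture the mechanism: a guard that takes a slot in the per-stripe merge concurrency budget and can give it back as soon as the synchronous phase ends, rather than only at destruction. A sketch assuming a plain atomic counter; the names are illustrative:

```cpp
#include <atomic>

class MergeConcurrencyGuard {
public:
    explicit MergeConcurrencyGuard(std::atomic<int> &active) : _active(&active) {
        ++*_active; // take a slot in the per-stripe merge budget
    }
    // Called by the operation once its synchronous persistence-thread work
    // is done, even though async sub-operations may still be in flight.
    // Freeing the slot here lets further merges start immediately.
    void sync_phase_complete() {
        if (_active) {
            --*_active;
            _active = nullptr;
        }
    }
    ~MergeConcurrencyGuard() { sync_phase_complete(); } // safety net
private:
    std::atomic<int> *_active;
};
```
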
* Merge pull request #20340 from vespa-engine/vekterli/cap-merge-delete-bucket-pri-to-default-feed-pri (Geir Storli, 2021-12-02; 2 files, -8/+25)
  * Cap merge-induced DeleteBucket priority to that of default feed priority (Tor Brede Vekterli, 2021-12-02; 2 files, -8/+25)
    This lets DeleteBucket operations FIFO with client operations using the default feed priority (120). Not doing this risks preempting feed ops with deletes, elevating latencies.
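
Since storage priorities are numerically inverted (a lower value is more urgent), capping the delete priority *to* the default feed priority of 120 is a clamp upwards. A one-line sketch with an illustrative function name:

```cpp
#include <algorithm>
#include <cstdint>

constexpr uint8_t default_feed_priority = 120;

// Merge-induced deletes must never be *more* urgent than client feed, so
// clamp the numeric priority value up to at least 120.
inline uint8_t cap_to_feed_priority(uint8_t pri) {
    return std::max(pri, default_feed_priority);
}
```
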
* Merge pull request #20329 from vespa-engine/arnej/config-class-should-not-be-public (Arne H Juul, 2021-12-02; 9 files, -13/+14)
  Arnej/config class should not be public
  * more descriptive name for header file (Arne H Juul, 2021-12-02; 5 files, -5/+5)
  * track namespace move in documenttypes.def (Arne H Juul, 2021-12-02; 9 files, -13/+14)
    * For C++ code this introduces a "document::config" namespace, which will sometimes conflict with the global "config" namespace.
    * Move all forward-declarations of the types DocumenttypesConfig and DocumenttypesConfigBuilder to a common header file.
* Change update_active_operations_metrics to a private member function. (Tor Egge, 2021-12-02; 2 files, -8/+8)
* Measure latency in milliseconds. (Tor Egge, 2021-12-02; 2 files, -2/+2)
* Add metrics for active operations on service layer. (Tor Egge, 2021-12-01; 12 files, -5/+447)
* Don't let ignored bucket info reply be propagated out of distributor (Tor Brede Vekterli, 2021-11-30; 3 files, -44/+61)
  If a reply arrives for a preempted cluster state, it will be ignored. To avoid it being automatically sent further down the storage chain, we still have to treat it as handled. Otherwise a scary-looking but otherwise benign "unhandled message" warning will be emitted in the Vespa log.
  Also move an existing test to the correct test fixture.
* Reduce default bucket delete priority and fix priority inheritance for merge deletion (Tor Brede Vekterli, 2021-11-30; 4 files, -2/+14)
  Since deletes are now async in the backend, make them FIFO-order with client feed by default to avoid swamping the queues with deletes. Also, explicitly inherit the priority for bucket deletions triggered by bucket merging. This inheritance was previously missing, which meant such deletes got the default, very low priority.
* Merge pull request #20211 from vespa-engine/vekterli/reduce-update-log-level (Henning Baldersheim, 2021-11-25; 1 file, -7/+7)
  * Reduce log level from error to warning for update inconsistency message (Tor Brede Vekterli, 2021-11-25; 1 file, -7/+7)
    `error` level is for "node is on fire and falling down an elevator shaft" type messages, not for operation inconsistencies. Reduce to `warning` accordingly.
* Update operation metrics for delayed or chained merge handler replies. (Tor Egge, 2021-11-24; 4 files, -7/+123)
* Actually test maintenance -> down node state transition (Tor Brede Vekterli, 2021-11-24; 1 file, -1/+1)
* Ensure member variable is initialized (Tor Brede Vekterli, 2021-11-24; 1 file, -1/+2)