| Commit message (Collapse) | Author | Age | Files | Lines |
|
vespa-engine/vekterli/immediately-exit-cc-if-node-index-changed-live
Immediately exit cluster controller if node index config is changed live
|
We do not support live reconfigs of CC index, so swiftly exit if we
detect this, allowing the config sentinel to restart the service.
|
Bjorncs/upgrade zk client [run-systemtest]
|
Ensure extra required ZK dependencies are present on test classpath
|
Adds the following safeguards/improvements:
- Do not clear pending (non-persisted) writes over a `connect()` edge.
Avoids having the controller eternally wait for a doomed pending
write to be completed when it has no other events that can trigger
a new write.
- Trigger `lostDatabaseConnection()` whenever ZK is reconfigured to
ensure we reload the newest state before trying to compute/publish
any new states.
- Explicitly drop leadership in `lostDatabaseConnection()` to immediately
prevent controller from trying any funny leader-related business
since it no longer can depend on ZK watches triggering.
- When falling back to default state/cluster bundle, ensure that any
subsequent dependent znode write is predicated on the pre-existing
znode version being 0, i.e. the znode did not previously exist.
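The last safeguard can be illustrated with a small in-memory model. This is a minimal sketch, not the controller's code and not the ZooKeeper client API: `VersionedStore`, `writeIfVersion` and the path used are hypothetical names, and version 0 is taken to mean "did not previously exist", as stated above.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical in-memory stand-in for a versioned znode store (assumption, not ZooKeeper's API).
class VersionedStore {
    private static final class Entry {
        final String data;
        final int version;
        Entry(String data, int version) { this.data = data; this.version = version; }
    }

    private final Map<String, Entry> entries = new HashMap<>();

    // Version seen by a reader; 0 models "no previous write exists".
    int versionOf(String path) {
        Entry e = entries.get(path);
        return e == null ? 0 : e.version;
    }

    String read(String path) {
        Entry e = entries.get(path);
        return e == null ? null : e.data;
    }

    // Performs the write only if the stored version matches the version the caller expects.
    boolean writeIfVersion(String path, String data, int expectedVersion) {
        if (versionOf(path) != expectedVersion) {
            return false; // someone else wrote in the meantime; caller must reload first
        }
        entries.put(path, new Entry(data, expectedVersion + 1));
        return true;
    }
}
```

A controller that fell back to a default bundle would write with an expected version of 0. If another writer got there first, the write is rejected instead of silently overwriting the newer state.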
|
vespa-engine/geirst/cluster-controller-resouce-usage-limits-metrics
Expose resource usage metrics for disk and memory limits for feed blo…
|
Available nodes here mean nodes that are reported as Up/Initializing
and where the wanted state is Up/Retired.
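The definition above can be written as a predicate. This is a sketch only: the `State` enum and the `isAvailable` helper are illustrative names, not the cluster controller's actual types.

```java
import java.util.EnumSet;

class AvailableNodes {
    // Illustrative subset of node states (assumption).
    enum State { UP, INITIALIZING, RETIRED, MAINTENANCE, DOWN }

    private static final EnumSet<State> REPORTED_OK = EnumSet.of(State.UP, State.INITIALIZING);
    private static final EnumSet<State> WANTED_OK = EnumSet.of(State.UP, State.RETIRED);

    // A node counts as available when it reports Up/Initializing and is wanted Up/Retired.
    static boolean isAvailable(State reported, State wanted) {
        return REPORTED_OK.contains(reported) && WANTED_OK.contains(wanted);
    }
}
```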
|
indirectly
|
ZooKeeper
This avoids a subtle edge case where the underlying ZK integration code
may silently fail a write, leaving the core controller logic to think that
it had actually durably persisted a particular state version.
In case of reelections racing with broadcasts, it would be possible for
leader-edge readbacks from ZK to retrieve a _lower_ version than one
that had already been published. This would cause the cluster controller
to get very confused about which cluster states nodes had already observed.
If a newly produced state version overlapped with a previously broadcast
state, the controller would not push the updated state to the nodes, as it
would (with good reason) assume the node had already observed it, seeing
that it had already ACKed the particular version number.
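The failure mode described above can be reproduced in a few lines. This is a simplified model under stated assumptions: `StateBroadcaster`, `needsPush` and `nextVersion` are hypothetical names, and the real controller tracks considerably more than a version number per node.

```java
import java.util.HashMap;
import java.util.Map;

class StateBroadcaster {
    // Highest cluster state version each node has ACKed (hypothetical bookkeeping).
    private final Map<Integer, Integer> ackedVersionByNode = new HashMap<>();

    void ack(int nodeIndex, int version) {
        ackedVersionByNode.merge(nodeIndex, version, Integer::max);
    }

    // A state is only pushed if the node has not already ACKed that version number.
    boolean needsPush(int nodeIndex, int version) {
        return ackedVersionByNode.getOrDefault(nodeIndex, 0) < version;
    }

    // A new leader derives its next version from what it read back from the store.
    static int nextVersion(int versionReadBack) {
        return versionReadBack + 1;
    }
}
```

If version 6 was published and ACKed but only version 5 was durably stored, a readback of 5 yields a "new" version 6 that is never pushed. A readback of 6 yields 7, which is pushed as expected.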
|
vespa-engine/hakonhall/also-deny-maintenance-when-another-node-is-in-maintenance
Also deny maintenance when another node is in maintenance
|
The cluster controller today already denies setting a node X safely to
maintenance M, if there is another node Y in another group that has wanted
state M. This means that if Y is in M but its wanted state is not M, X is allowed
to be set to M. This is a rare edge case.
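A sketch of the stricter check, under the assumption that a node carries both a current and a wanted state. `MaintenanceCheck`, `Node` and `maySetMaintenance` are hypothetical names, and only nodes in other groups are considered here, following the description above.

```java
import java.util.List;

class MaintenanceCheck {
    enum State { UP, MAINTENANCE, DOWN }

    // Minimal node model for illustration (assumption).
    static final class Node {
        final int index;
        final int group;
        final State current;
        final State wanted;
        Node(int index, int group, State current, State wanted) {
            this.index = index;
            this.group = group;
            this.current = current;
            this.wanted = wanted;
        }
    }

    // Deny if any node in another group is wanted in maintenance
    // (the pre-existing rule) or is currently in maintenance (the added rule).
    static boolean maySetMaintenance(Node x, List<Node> allNodes) {
        for (Node y : allNodes) {
            if (y.index == x.index || y.group == x.group) {
                continue;
            }
            if (y.wanted == State.MAINTENANCE || y.current == State.MAINTENANCE) {
                return false;
            }
        }
        return true;
    }
}
```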
|
Adds an absolute delta (hysteresis) that is subtracted from the feed block limit
when a node already has a resource in feed blocked state. This means that
there's a lower watermark threshold that must be crossed before feeding
can be unblocked. Avoids flip-flopping between block states.
Default is currently 0.0, i.e. effectively disabled. To be modified
later for system tests and trial roll-outs.
A couple of caveats with the current implementation:
* The cluster state is not recomputed automatically when just the hysteresis
threshold is crossed, so the description will be out of date on the
content nodes. However, if any other feed block event happens (or the
hysteresis threshold is crossed), the state will be recomputed as expected.
This does not affect correctness, since the feed is still to be blocked.
* A node event remove/add pair is emitted for feed block status when the
hysteresis threshold is crossed and there's a cluster state recomputation.
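The lower watermark behaviour can be sketched as follows. This is an illustration only: `FeedBlockHysteresis` and `isBlocked` are hypothetical names, and the strict `>` comparison is an assumption.

```java
class FeedBlockHysteresis {
    // Decides whether a resource should be feed blocked, given its previous block state.
    // When already blocked, usage must drop below (limit - hysteresis) to unblock.
    static boolean isBlocked(double usage, double limit, double hysteresis, boolean wasBlocked) {
        double effectiveLimit = wasBlocked ? Math.max(0.0, limit - hysteresis) : limit;
        return usage > effectiveLimit;
    }
}
```

With a limit of 0.8 and a hysteresis of 0.05, a resource blocked at usage 0.81 stays blocked at 0.78 and is only unblocked once usage is at or below 0.75. With the default of 0.0 the check reduces to a plain limit comparison.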
|
List all resource exhaustions for node, including enum store etc.
|
vespa-engine/geirst/resource-usage-metrics-in-cluster-controller
Add and expose resource usage metrics from the cluster controller.
|
Also adds top-level cluster feed block status. Does not yet make
enum store/multivalue limit feed blocks visible per node.
|
Ensures that feed block description pushed to nodes is updated as
further resource exhaustions are recorded (or disappear).
|
Hostname is inferred from the node's RPC address
|
If present, will be reported alongside the resource type in the
feed block description string.
|
Will push out a new cluster state bundle indicating cluster feed blocked
if one or more nodes in the cluster have one or more resources exhausted.
Similarly, a new state will be pushed out once no nodes have resources
exhausted any more.
The feed block description currently contains up to 3 separate exhausted
resources, possibly across multiple nodes.
A cluster-level event is emitted for both the block and unblock edges.
No hysteresis is present yet, so if a node is oscillating around a block-limit,
so will the cluster state.
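A rough sketch of how the cluster-level block decision and its description could be derived. The class and method names are hypothetical, and the description format is invented for illustration. Only the cap of 3 listed exhaustions comes from the text above.

```java
import java.util.List;
import java.util.stream.Collectors;

class ClusterFeedBlock {
    static final int MAX_DESCRIBED = 3;

    // One exhausted resource on one node (illustrative model).
    static final class Exhaustion {
        final int nodeIndex;
        final String resource;
        Exhaustion(int nodeIndex, String resource) {
            this.nodeIndex = nodeIndex;
            this.resource = resource;
        }
    }

    // The cluster is feed blocked as soon as any node has any resource exhausted.
    static boolean isBlocked(List<Exhaustion> exhaustions) {
        return !exhaustions.isEmpty();
    }

    // Lists at most MAX_DESCRIBED exhaustions; the format is an assumption.
    static String describe(List<Exhaustion> exhaustions) {
        String listed = exhaustions.stream()
                .limit(MAX_DESCRIBED)
                .map(e -> e.resource + " on node " + e.nodeIndex)
                .collect(Collectors.joining(", "));
        int omitted = exhaustions.size() - MAX_DESCRIBED;
        return omitted > 0 ? listed + " (... and " + omitted + " more)" : listed;
    }
}
```

Because the decision is a pure function of the current exhaustion set, a node hovering around its limit flips the result on every sample, which is the oscillation noted above.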
|
1. The node has state MAINTENANCE with (user) wanted state UP.
2. There are other nodes in the same hierarchical group that are set in
MAINTENANCE with the same description.
Also made the following change.
3. Deny a request for safe MAINTENANCE or DOWN, if the wanted state is already
set but with a different description. If the descriptions are the same, it
is assumed to be the same operator (e.g. Orchestrator) having changed its
mind.
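Rule 3 can be sketched as a small predicate. This rests on one reading of the text: "already set" is taken to mean that the node's current wanted state is MAINTENANCE or DOWN. `SafeSetRule` and `deniedByDescription` are hypothetical names.

```java
class SafeSetRule {
    enum State { UP, MAINTENANCE, DOWN }

    private static boolean isMaintenanceOrDown(State s) {
        return s == State.MAINTENANCE || s == State.DOWN;
    }

    // Deny a safe MAINTENANCE/DOWN request when a wanted state is already set
    // with a different description. Same description is treated as the same operator.
    static boolean deniedByDescription(State currentWanted, String currentDescription,
                                       State requested, String requestedDescription) {
        return isMaintenanceOrDown(requested)
                && isMaintenanceOrDown(currentWanted)
                && !currentDescription.equals(requestedDescription);
    }
}
```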
|
Fully forwards and backwards compatible. Currently only supports indicating
feed blocked status for the entire cluster, with one associated descriptive
message intended to be used by distributors.
|