1. The node has state MAINTENANCE with (user) wanted state UP.
2. There are other nodes in the same hierarchical group that are set in
   MAINTENANCE with the same description.
Also made the following change:
3. Deny a request for safe MAINTENANCE or DOWN if a wanted state is already
   set with a different description. If the descriptions are the same, the
   request is assumed to come from the same operator (e.g. the Orchestrator)
   having changed its mind.
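As a rough illustration of rule 3, a minimal sketch of how such a request could be rejected; the types and names below (WantedState, SetNodeStateRequest, NodeStatePolicy) are hypothetical and not the actual cluster controller API.

```java
// Hypothetical sketch of rule 3 above; names and types are illustrative only,
// not the actual cluster controller API.
enum WantedState { UP, MAINTENANCE, DOWN }

record SetNodeStateRequest(WantedState wantedState, String description) {}

final class NodeStatePolicy {
    /**
     * Deny a safe MAINTENANCE/DOWN request if a wanted state is already set
     * with a different description; an identical description is assumed to be
     * the same operator (e.g. the Orchestrator) changing its mind.
     */
    static boolean deniedByExistingWantedState(SetNodeStateRequest request,
                                               WantedState currentWantedState,
                                               String currentDescription) {
        boolean safeSetDownOrMaintenance = request.wantedState() == WantedState.MAINTENANCE
                                        || request.wantedState() == WantedState.DOWN;
        boolean wantedStateAlreadySet = currentWantedState != WantedState.UP;
        return safeSetDownOrMaintenance
                && wantedStateAlreadySet
                && !request.description().equals(currentDescription);
    }
}
```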
Fully forwards and backwards compatible. Currently only supports indicating
feed blocked status for the entire cluster, with one associated descriptive
message intended to be used by distributors.
This appears to happen if pending data node watches are triggered with
failures after the cluster controller has already reset the internal
voting state due to connectivity issues. Internal state is rebuilt
automatically, so no need to scream errors into the logs for this.
Tests running under heavy concurrent CI build load appear
to struggle maintaining ZK sessions, causing flakiness. This
is an attempt to mitigate that by bumping the ZK server's tick
interval (and therefore session timeout) by 20x.
According to local testing, this does not increase testing time
in the common case.
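A sketch of what bumping the tick time might look like for an embedded test ZK server, assuming the ZooKeeperServer(File, File, int) constructor; the real test harness wiring may differ, and only the 20x factor comes from the message above.

```java
// Sketch of starting an embedded test ZooKeeper server with a 20x tick time.
// Only the 20x factor is from the commit message; the wiring is illustrative.
import java.io.File;
import java.io.IOException;
import org.apache.zookeeper.server.ZooKeeperServer;

final class TestZkServer {
    // ZooKeeper derives the maximum negotiable session timeout from the tick
    // time (20 * tickTime by default), so a larger tick gives clients under
    // heavy CI load more slack before their sessions expire.
    private static final int TICK_TIME_MS = ZooKeeperServer.DEFAULT_TICK_TIME * 20;

    static ZooKeeperServer create(File dataDir) throws IOException {
        return new ZooKeeperServer(dataDir, dataDir, TICK_TIME_MS);
    }
}
```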
The flag controlled a config value read by the Cluster Controller. I have
therefore left the ModelContextImpl.Properties method and its implementation in
place (now always returning true), but the model no longer uses that method
internally, and the config is no longer read by the CC.
The field in fleetcontroller.def is left unchanged and documented as
deprecated.
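A small sketch of the deprecation pattern described: the accessor stays for interface compatibility but is hard-wired to true. The method name below is hypothetical; the real ModelContextImpl.Properties has many more methods.

```java
// Illustrative only: accessor kept for compatibility, hard-wired to true.
// The method name is hypothetical.
final class Properties {
    /**
     * @deprecated The corresponding fleetcontroller.def field is deprecated;
     *             the config is no longer read by the cluster controller and
     *             this now always returns true.
     */
    @Deprecated
    boolean deprecatedClusterControllerFlag() { return true; }
}
```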
This makes the Cluster Controller use the vds.datastored.bucket_space.buckets_total
metric, with dimension bucketSpace=default, to determine whether a content node
manages zero buckets, and if so allow the node to go permanently down. This is
used when a node is retiring and is to be removed from the application.
The change is guarded by the use-bucket-space-metric flag, default true. If the
new metric doesn't work as expected, we can revert to the current/old metric by
flipping the flag. The flag can be controlled per application.
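A hedged sketch of the check this describes; the metric name, dimension, and flag come from the message above, while the surrounding types and method names are illustrative.

```java
// Illustrative sketch of the zero-buckets check described above; only the
// metric name, dimension, and flag name come from the commit message.
final class RetiredNodeRemovalCheck {
    static final String BUCKETS_TOTAL_METRIC = "vds.datastored.bucket_space.buckets_total";
    static final String DEFAULT_BUCKET_SPACE = "default";

    private final boolean useBucketSpaceMetric;  // "use-bucket-space-metric" flag, default true

    RetiredNodeRemovalCheck(boolean useBucketSpaceMetric) {
        this.useBucketSpaceMetric = useBucketSpaceMetric;
    }

    /** A retiring content node may be allowed to go permanently down once it manages zero buckets. */
    boolean mayGoPermanentlyDown(HostMetrics metrics) {
        long buckets = useBucketSpaceMetric
                ? metrics.value(BUCKETS_TOTAL_METRIC, "bucketSpace", DEFAULT_BUCKET_SPACE)
                : metrics.legacyBucketCount();   // previous/old metric, kept as fallback
        return buckets == 0;
    }

    /** Hypothetical metric accessor, standing in for the controller's metric fetcher. */
    interface HostMetrics {
        long value(String metricName, String dimensionKey, String dimensionValue);
        long legacyBucketCount();
    }
}
```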
Makes it easier for an external observer to understand which set of nodes
is preventing the cluster state from converging.
Avoids triggering pointless cluster state recomputation edges because
the timer code observes stale information. This can happen if publishing
the candidate state is suppressed due to e.g. the minimum time between
published state versions not having passed.
Ensures that the event is only emitted when we're actually publishing a
state containing the state transition. Emitting events in the timer
code is fragile when it does not modify any state, risking emitting
the same event an unbounded number of times if the condition keeps
holding for each timer cycle.
+ Use version property for zk in parent pom
Lets a controller discover that another controller also believes it is
the leader by tracking the expected znode versions for cluster state
version and bundle znodes.
If a CaS failure is triggered, the controller will drop its database
and election state, forcing a state refresh.
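The standard ZooKeeper way to obtain such a CaS failure is to pass the last observed znode version to setData(). Below is a sketch under that assumption; the path handling, recovery, and surrounding controller logic are simplified and not the actual implementation.

```java
// Sketch of detecting a competing leader via ZooKeeper compare-and-set: write
// with the znode version we last observed and treat BadVersionException as
// evidence that someone else has written since. Simplified/hypothetical
// relative to the real cluster controller.
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooKeeper;

final class CasStateWriter {
    private final ZooKeeper zk;
    private int expectedVersion = -1;  // -1 disables the version check on the first write

    CasStateWriter(ZooKeeper zk) { this.zk = zk; }

    /** Returns false if another controller appears to have written the znode concurrently. */
    boolean writeClusterStateVersion(String path, byte[] data)
            throws InterruptedException, KeeperException {
        try {
            expectedVersion = zk.setData(path, data, expectedVersion).getVersion();
            return true;
        } catch (KeeperException.BadVersionException e) {
            // Another writer bumped the znode version: likely a competing leader.
            // The controller described above reacts by dropping its database and
            // election state so everything is refreshed from scratch.
            return false;
        }
    }
}
```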
Avoids a race condition where a bundle ZK write fails but we have not yet
detected that ZK connectivity has been lost. This could lead to violating
the invariant that published state versions are strictly increasing.
When using fake timers there's a catch-22 edge case during transient
ZooKeeper disconnects. The test will not progress and increase the
fake timer value until ZK is connected, but ZK will not attempt to
reconnect until the reconnection cooldown period has passed (which
depends on the faked timer).
Have a suspicion that 10s ends up being too short when ZK's
write-ahead log flushing is taking a long time due to a heavily
loaded CI container. Should still be short enough for a network
hiccup to hopefully trigger a reconnect before the test itself
times out.