| Commit message (Collapse) | Author | Age | Files | Lines |
| |
|
| |
|
| |
|
| |
|
|
|
|
|
|
|
| |
This appears to happen if pending data node watches are triggered with
failures after the cluster controller has already reset the internal
voting state due to connectivity issues. Internal state is rebuilt
automatically, so there is no need to log errors for this.
|
| |
|
| |
|
| |
|
| |
|
| |
|
| |
|
| |
|
| |
|
| |
|
| |
|
| |
|
|\ |
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
Tests running under heavy concurrent CI build load appear
to struggle maintaining ZK sessions, causing flakiness. This
is an attempt to mitigate that by bumping the ZK server's tick
interval (and therefore session timeout) by 20x.
According to local testing, this does not increase testing time
in the common case.
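
The link between tick interval and session timeout can be sketched as below. By default a ZooKeeper server bounds the negotiated session timeout to between 2 and 20 ticks, so a larger tick raises both bounds. The function name, the tick values and the requested timeout are illustrative assumptions, not the values used by the tests.

```python
def negotiated_session_timeout_ms(requested_ms, tick_ms, min_ticks=2, max_ticks=20):
    """Clamp a client's requested session timeout to the server's allowed range.

    min_ticks and max_ticks mirror ZooKeeper's default minSessionTimeout
    (2 ticks) and maxSessionTimeout (20 ticks).
    """
    lower = min_ticks * tick_ms
    upper = max_ticks * tick_ms
    return max(lower, min(requested_ms, upper))


# Hypothetical numbers: a 100 ms tick versus one bumped by 20x.
base_tick_ms = 100
bumped_tick_ms = base_tick_ms * 20

# With the small tick the request is capped at 20 ticks.
print(negotiated_session_timeout_ms(30_000, base_tick_ms))
# With the bumped tick the same request fits within the allowed range.
print(negotiated_session_timeout_ms(30_000, bumped_tick_ms))
```

With the small tick a client asking for 30 seconds is cut down to 2 seconds, which is easy to miss under load. With the bumped tick the full request is granted.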
|
| | |
|
|/
|
|
|
|
|
|
|
|
| |
The flag controlled config that is read by the Cluster Controller. Therefore, I have
left the ModelContextImpl.Properties method and implementation (now always
returning true), but the model has stopped using that method internally, and
the config is no longer used in the CC.
The field in the fleetcontroller.def is left unchanged and documented as
deprecated.
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
| |
This makes the Cluster Controller use the
vds.datastored.bucket_space.buckets_total metric, with dimension
bucketSpace=default, to determine whether a content node manages zero buckets.
If it does, the node is allowed to go permanently down. This is used when a
node is retiring and is to be removed from the application.
The change is guarded by the use-bucket-space-metric flag, which defaults to
true. If the new metric doesn't work as expected, we can revert to the
current/old metric by flipping the flag. The flag can be controlled per
application.
|
| |
|
| |
|
| |
|
|
|
|
|
| |
Makes it easier for an external observer to understand what set of nodes
is causing the cluster state to not converge.
|
|
|
|
|
|
|
| |
Avoids triggering pointless cluster state recomputation edges because
the timer code observes stale information. This can happen if publishing
the candidate state is suppressed due to e.g. the minimum time between
published state versions not having passed.
|
|
|
|
|
|
|
|
| |
Ensures that the event is only emitted when we're actually publishing a
state containing the state transition. Emitting events in the timer
code is fragile when it does not modify any state, risking emitting
the same event an unbounded number of times if the condition keeps
holding for each timer cycle.
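
A minimal sketch of the idea, assuming a simplified publisher with a minimum interval between published states. The class and method names are hypothetical and do not come from the cluster controller code. The event is appended only on the code path that actually publishes, so repeated timer ticks that see the same pending transition cannot emit it again.

```python
class StatePublisher:
    """Toy model: events are tied to publishing, not to timer ticks."""

    def __init__(self, min_interval_ms):
        self.min_interval_ms = min_interval_ms
        self.last_publish_ms = None
        self.published = None
        self.events = []

    def on_timer_tick(self, now_ms, candidate):
        """Publish the candidate state if allowed; return True if published."""
        if candidate == self.published:
            return False  # nothing new to publish
        too_soon = (
            self.last_publish_ms is not None
            and now_ms - self.last_publish_ms < self.min_interval_ms
        )
        if too_soon:
            return False  # publishing suppressed, so no event is emitted either
        # The event is emitted only here, together with the state change.
        self.events.append(f"{self.published} -> {candidate}")
        self.published = candidate
        self.last_publish_ms = now_ms
        return True


publisher = StatePublisher(min_interval_ms=1000)
publisher.on_timer_tick(0, "up")
for now_ms in (100, 200, 300):
    publisher.on_timer_tick(now_ms, "down")  # suppressed each time
publisher.on_timer_tick(1000, "down")        # interval has passed
print(len(publisher.events))
```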
|
| |
|
| |
|
| |
|
| |
|
|
|
|
| |
+ Use version property for zk in parent pom
|
| |
|
| |
|
|
|
|
|
|
|
|
|
| |
Lets a controller discover that another controller also believes it is
the leader by tracking the expected znode versions for cluster state
version and bundle znodes.
If a CaS failure is triggered, the controller will drop its database
and election state, forcing a state refresh.
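
The mechanism can be sketched as below, using an in-memory stand-in for a versioned znode store rather than a real ZooKeeper client. All class and method names are hypothetical. Each controller remembers the znode version it expects, and a write that finds a different version means someone else has written in the meantime.

```python
class VersionMismatch(Exception):
    """Raised when a compare-and-set write sees an unexpected znode version."""


class FakeZnodeStore:
    """In-memory stand-in for a versioned znode store (not a ZooKeeper client)."""

    def __init__(self):
        self._nodes = {}  # path -> (data, version)

    def create(self, path, data):
        self._nodes[path] = (data, 0)
        return 0

    def set_if_version(self, path, data, expected_version):
        _, current = self._nodes[path]
        if current != expected_version:
            raise VersionMismatch(f"expected {expected_version}, found {current}")
        self._nodes[path] = (data, current + 1)
        return current + 1


class Controller:
    def __init__(self, store, path):
        self.store = store
        self.path = path
        self.is_leader = False
        self.expected_version = None

    def assume_leadership(self, current_version):
        self.is_leader = True
        self.expected_version = current_version

    def publish(self, data):
        """Write with compare-and-set; on failure drop local election state."""
        try:
            self.expected_version = self.store.set_if_version(
                self.path, data, self.expected_version
            )
            return True
        except VersionMismatch:
            # Another controller wrote first: forget what we believed and
            # force a refresh before taking part in the election again.
            self.is_leader = False
            self.expected_version = None
            return False


store = FakeZnodeStore()
version = store.create("/cluster_state_version", b"1")
first, second = Controller(store, "/cluster_state_version"), Controller(store, "/cluster_state_version")
first.assume_leadership(version)
second.assume_leadership(version)  # both believe they are the leader
print(first.publish(b"2"), second.publish(b"2"))
```

The second write fails because the first one already advanced the znode version, which is how the losing controller learns that it is not the only one acting as leader.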
|
|
|
|
|
|
| |
Avoids a race condition where a bundle ZK write fails but we have not yet
detected that ZK connectivity has been lost. This could lead to violating
the invariant that published state versions are strictly increasing.
|
|
|
|
|
|
|
|
| |
When using fake timers there's a catch-22 edge case during transient
ZooKeeper disconnects. The test will not progress and increase the
fake timer value until ZK is connected, but ZK will not attempt to
reconnect until the reconnection cooldown period has passed (which
depends on the faked timer).
|
|
|
|
|
|
|
|
| |
Have a suspicion that 10s ends up being too short when ZK's
write-ahead log flushing is taking a long time due to a heavily
loaded CI container. Should still be short enough for a network
hiccup to hopefully trigger a reconnect before the test itself
times out.
|
|
|
|
|
|
|
|
|
| |
Setup methods would previously override explicit timeouts set by the
tests with a default of 2 minutes. It's suspected that this long timeout
combined with spurious ZK client failures is a source of test failures.
Reduce to 10 seconds for now. Let's see how well it fares on a heavily
loaded CI container.
|
|
|
|
|
|
| |
Create the address when needed in the async connect thread.
Implement hashCode/equals/compareTo for Spec to avoid relying on toString.
Use Spec as key and avoid creating it every time.
|
| |
|
| |
|
|
|
|
|
|
| |
have a list.
Avoid an unnecessary array copy.
|
| |
|
| |
|
|
|
|
|
|
|
|
| |
Version mismatches in backend do not return explicit RPC errors, so
actual vs. desired versions must be checked in order to avoid potentially
spurious activation of other versions.
Also do some minor code cleanup.
|
|
|
|
|
| |
Only displayed if not equal to published state version and if two-phase
transitions are enabled.
|
| |
|
|
|
|
| |
Mirrors the default values in the actual underlying config definitions.
|