path: root/clustercontroller-core
Commit message  Author  Age  Files  Lines
* Merge pull request #16856 from vespa-engine/vekterli/immediately-exit-cc-if-node-index-changed-live  Tor Brede Vekterli  2021-03-09  1  -3/+20
|\
| |   Immediately exit cluster controller if node index config is changed live
| * Immediately exit cluster controller if node index config is changed live  Tor Brede Vekterli  2021-03-09  1  -3/+20
| |
| |   We do not support live reconfigs of CC index, so swiftly exit if we detect this, allowing the config sentinel to restart the service.
* | Merge pull request #16843 from vespa-engine/bjorncs/upgrade-zk-client  Bjørn Christian Seime  2021-03-09  1  -0/+6
|\ \
| |/
|/|   Bjorncs/upgrade zk client [run-systemtest]
| * Depend on zookeeper-server-common  Bjørn Christian Seime  2021-03-09  1  -0/+6
| |
| |   Ensure extra required ZK dependencies are present on test classpath
* | Fix typo  Tor Brede Vekterli  2021-03-08  1  -1/+1
| |
* | Better handling of ZK connectivity issues concurrent with elections  Tor Brede Vekterli  2021-03-08  3  -28/+59
|/
|   Adds the following safeguards/improvements:
|
|    - Do not clear pending (non-persisted) writes over a `connect()` edge. Avoids having the controller eternally wait for a doomed pending write to be completed when it has no other events that can trigger a new write.
|    - Trigger `lostDatabaseConnection()` whenever ZK is reconfigured to ensure we reload the newest state before trying to compute/publish any new states.
|    - Explicitly drop leadership in `lostDatabaseConnection()` to immediately prevent controller from trying any funny leader-related business since it no longer can depend on ZK watches triggering.
|    - When falling back to default state/cluster bundle, ensure that any subsequent dependent znode write is predicated on the pre-existing znode version being 0, i.e. did not previously exist.
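
The last safeguard above is a conditional (compare-and-set) write. As a rough illustration only, the sketch below models a znode in memory, with version 0 standing for "not previously written". The class and method names are hypothetical. They are not taken from the cluster controller or from the ZooKeeper client API.

```java
/**
 * Hypothetical in-memory stand-in for a versioned znode. Illustration only:
 * this is not the cluster controller's code and makes no ZooKeeper calls.
 */
final class VersionedZnodeSketch {
    private int version = 0;   // assumption: 0 models "did not previously exist"
    private String data = "";

    /** Writes only if the stored version equals expectedVersion. */
    synchronized boolean writeIfVersion(int expectedVersion, String newData) {
        if (version != expectedVersion) {
            return false;      // someone else has written; caller should reload state
        }
        data = newData;
        version++;
        return true;
    }

    synchronized int version() { return version; }

    synchronized String data() { return data; }
}
```

A write derived from the default state would pass 0 as the expected version, so it can only succeed if nothing has been stored before.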
* Print node index without parentheses as done elsewhere  Harald Musum  2021-03-03  2  -7/+7
|
* Merge pull request #16747 from vespa-engine/geirst/cluster-controller-resouce-usage-limits-metrics  Geir Storli  2021-03-02  4  -5/+39
|\
| |   Expose resource usage metrics for disk and memory limits for feed blo…
| * Expose resource usage metrics for disk and memory limits for feed blocked.  Geir Storli  2021-03-02  4  -5/+39
| |
* | Only force CC metric update when you are the boss  Jon Marius Venstad  2021-03-02  1  -1/+1
|/
* Use (override, really) the clusterid dimension in CCs content metrics  Jon Marius Venstad  2021-03-02  2  -2/+4
|
* Do not use vespamalloc for clustercontroller as it is pure java.  Henning Baldersheim  2021-03-01  1  -2/+10
|
* Only compute feed blocked state from available nodes  Tor Brede Vekterli  2021-02-26  3  -2/+45
|
|   Available nodes here mean nodes that are reported as Up/Initializing and where the wanted state is Up/Retired.
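
The availability rule above can be written as a small predicate. This is a minimal sketch: the `State` enum and the `isAvailable` method are hypothetical names chosen for illustration, not the cluster controller's own types.

```java
/** Hypothetical sketch of the "available node" rule described in the commit message. */
final class AvailableNodeSketch {
    // Assumed set of states for illustration; the real controller defines its own.
    enum State { UP, INITIALIZING, RETIRED, MAINTENANCE, DOWN }

    /** Available = reported Up/Initializing AND wanted Up/Retired. */
    static boolean isAvailable(State reported, State wanted) {
        boolean reportedOk = reported == State.UP || reported == State.INITIALIZING;
        boolean wantedOk = wanted == State.UP || wanted == State.RETIRED;
        return reportedOk && wantedOk;
    }
}
```

Under this reading, a node reported Up but with wanted state Maintenance would not contribute to the feed blocked computation.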
* Pass current ZK-persisted version directly to broadcast method instead of indirectly  Tor Brede Vekterli  2021-02-26  3  -39/+22
|
* Enforce that no cluster state can be published unless confirmed written to ZooKeeper  Tor Brede Vekterli  2021-02-26  5  -4/+73
|
|   This avoids a subtle edge case where the underlying ZK integration code may silently fail a write, leaving the core controller logic to think that it had actually durably persisted a particular state version.
|
|   In case of reelections racing with broadcasts, it would be possible for leader-edge readbacks from ZK to retrieve a _lower_ version than one that had already been published. This would cause the cluster controller to get very confused about which cluster states nodes had already observed. If a newly produced state version overlapped with a previously broadcast state, the controller would not push the updated state to the nodes, as it would (with good reason) assume the node had already observed it, seeing that it had already ACKed the particular version number.
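
One way to picture the invariant described above is a gate that tracks the highest version confirmed as written and refuses to publish anything newer. The sketch below is illustrative only. `PublishGateSketch` and its methods are hypothetical and do not reflect how the controller actually wires this check.

```java
/** Hypothetical sketch: a state version may be published only after its write is confirmed. */
final class PublishGateSketch {
    private int lastConfirmedWrittenVersion = 0; // assumption: 0 means nothing confirmed yet

    /** Called when the backing store acknowledges a durable write of the given version. */
    void onWriteConfirmed(int version) {
        lastConfirmedWrittenVersion = Math.max(lastConfirmedWrittenVersion, version);
    }

    /** A candidate version may be broadcast only if it is known to be persisted. */
    boolean mayPublish(int candidateVersion) {
        return candidateVersion <= lastConfirmedWrittenVersion;
    }
}
```

With such a gate, a silently failed write leaves the confirmed version unchanged, so the newer state is never broadcast and later readbacks cannot return a version lower than one already published.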
* Remove unused arguments and methods  Harald Musum  2021-02-21  12  -112/+32
|
* Minor cleanup, no functional changes  Harald Musum  2021-02-21  44  -247/+174
|
* Remove unused code, and fix doc  Jon Marius Venstad  2021-02-19  1  -2/+2
|
* Other nodes not being up should not hinder permanently down  Håkon Hallingstad  2021-02-19  1  -5/+0
|
* Fail safe maintenance if other nodes are not up  Håkon Hallingstad  2021-02-19  4  -88/+52
|
* Avoid stripping index of node that was missing  Jon Marius Venstad  2021-02-17  1  -1/+1
|
* Merge pull request #16494 from vespa-engine/hakonhall/also-deny-maintenance-when-another-node-is-in-maintenance  Håkon Hallingstad  2021-02-12  3  -14/+55
|\
| |   Also deny maintenance when another node is in maintenance
| * Add test  Håkon Hallingstad  2021-02-12  2  -11/+49
| |
| * Also deny maintenance when another node is in maintenance  Håkon Hallingstad  2021-02-12  2  -4/+7
| |
| |   The cluster controller today already denies setting a node X safely to maintenance M, if there is another node Y in another group that has wanted state M. Which means that if Y is in M but wanted state is not M, X is allowed to be set in M. This is an edge case which is rare.
* | enableSmallBuffers -> useSmallBuffers  Henning Baldersheim  2021-02-12  2  -2/+2
| |
* | Use small buffers where size matters more than speed.  Henning Baldersheim  2021-02-12  2  -2/+3
| |
* | Support configurable feed block hysteresis on the cluster controller  Tor Brede Vekterli  2021-02-10  8  -11/+171
|/
|   Adds an absolute number delta that is subtracted from the feed block limit when a node has a resource already in feed blocked state. This means that there's a lower watermark threshold that must be crossed before feeding can be unblocked. Avoids flip-flopping between block states.
|
|   Default is currently 0.0, i.e. effectively disabled. To be modified later for system tests and trial roll-outs.
|
|   A couple of caveats with the current implementation:
|
|   * The cluster state is not recomputed automatically when just the hysteresis threshold is crossed, so the description will be out of date on the content nodes. However, if any other feed block event happens (or the hysteresis threshold is crossed), the state will be recomputed as expected. This does not affect correctness, since the feed is still to be blocked.
|   * A node event remove/add pair is emitted for feed block status when the hysteresis threshold is crossed and there's a cluster state recomputation.
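
The watermark logic described above can be sketched in a few lines. The method name, parameter names, and the strict `>` comparison are assumptions made for illustration. The commit message only states that an absolute delta is subtracted from the limit for resources that are already blocked.

```java
/** Hypothetical sketch of feed block hysteresis; not the controller's actual code. */
final class FeedBlockHysteresisSketch {
    /**
     * @param usage            current resource usage, e.g. 0.78 for 78%
     * @param limit            configured feed block limit
     * @param hysteresis       absolute delta subtracted from the limit when already blocked
     * @param currentlyBlocked whether this resource is already in feed blocked state
     */
    static boolean shouldBlock(double usage, double limit, double hysteresis, boolean currentlyBlocked) {
        // Already-blocked resources must drop below the lower watermark to unblock.
        double effectiveLimit = currentlyBlocked ? limit - hysteresis : limit;
        return usage > effectiveLimit; // assumption: strict comparison
    }
}
```

With a limit of 0.8 and a hysteresis of 0.05, a blocked resource at 0.78 stays blocked, while an unblocked resource at the same usage remains unblocked. A hysteresis of 0.0 reduces this to a plain limit check, matching the "effectively disabled" default.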
* Cleanup: Remove unnecessary and unused methods, simplify  Harald Musum  2021-02-08  8  -68/+45
|
* Minor cleanup, no functional changes  Harald Musum  2021-02-08  9  -62/+44
|
* Add extra row under content node if it's blocking feed  Tor Brede Vekterli  2021-02-08  2  -0/+37
|
|   List all resource exhaustions for node, including enum store etc.
* Merge pull request #16424 from vespa-engine/geirst/resource-usage-metrics-in-cluster-controller  Geir Storli  2021-02-08  7  -30/+226
|\
| |   Add and expose resource usage metrics from the cluster controller.
| * Nodes above limit should only count each node once.  Geir Storli  2021-02-08  2  -2/+16
| |
| * Add and expose resource usage metrics from the cluster controller.  Geir Storli  2021-02-05  7  -30/+212
| |
* | Show feed block options on cluster controller status page  Tor Brede Vekterli  2021-02-05  1  -0/+7
|/
* Add resource usage per node to cluster controller status page  Tor Brede Vekterli  2021-02-04  3  -1/+56
|
|   Also adds top-level cluster feed block status. Does not yet make enum store/multivalue limit feed blocks visible per node.
* Emit node-level events when resource exhaustion set changes  Tor Brede Vekterli  2021-02-03  9  -44/+138
|
* Recompute cluster state if set of resource exhaustions changes  Tor Brede Vekterli  2021-02-02  6  -18/+127
|
|   Ensures that feed block description pushed to nodes is updated as further resource exhaustions are recorded (or disappear).
* Add hostname to resource exhaustion description  Tor Brede Vekterli  2021-01-29  5  -19/+58
|
|   Hostname is inferred from the node's RPC address
|
* Support optional resource usage name field  Tor Brede Vekterli  2021-01-29  6  -24/+84
|
|   If present, will be reported alongside the resource type in the feed block description string.
* add maybePublishOldMetrics hook  Arne Juul  2021-01-28  1  -0/+16
|
* Add cluster feed block support to cluster controller  Tor Brede Vekterli  2021-01-27  15  -4/+583
|
|   Will push out a new cluster state bundle indicating cluster feed blocked if one or more nodes in the cluster has one or more resources exhausted. Similarly, a new state will be pushed out once no nodes have resources exhausted any more.
|
|   The feed block description currently contains up to 3 separate exhausted resources, possibly across multiple nodes.
|
|   A cluster-level event is emitted for both the block and unblock edges.
|
|   No hysteresis is present yet, so if a node is oscillating around a block-limit, so will the cluster state.
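
As a rough sketch of the behaviour described above, the helper below derives an optional feed block description from a flat list of exhausted-resource strings, capped at three entries. The class name, the input representation, and the comma-joined output format are all assumptions. The controller's actual description string is not shown in this log.

```java
import java.util.List;
import java.util.Optional;
import java.util.stream.Collectors;

/** Hypothetical sketch: cluster is feed blocked if any node reports an exhausted resource. */
final class ClusterFeedBlockSketch {
    static final int MAX_RESOURCES_IN_DESCRIPTION = 3; // "up to 3" per the commit message

    /**
     * @param exhaustions one entry per exhausted resource, possibly across multiple nodes
     * @return a description if feed should be blocked, otherwise empty
     */
    static Optional<String> feedBlockDescription(List<String> exhaustions) {
        if (exhaustions.isEmpty()) {
            return Optional.empty(); // no exhausted resources: cluster is not feed blocked
        }
        String listed = exhaustions.stream()
                .limit(MAX_RESOURCES_IN_DESCRIPTION)
                .collect(Collectors.joining(", ")); // assumed format for illustration
        return Optional.of(listed);
    }
}
```

An empty result corresponds to the unblock edge and a present result to the block edge. Without hysteresis, a node oscillating around its limit would flip this result back and forth.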
* Dummy change to trigger systemtest  Håkon Hallingstad  2021-01-21  1  -1/+0
|
* Allows setting a node safely to maintenance in these two new circumstances:  Håkon Hallingstad  2021-01-21  8  -56/+302
|
|   1. The node has state MAINTENANCE with (user) wanted state UP.
|   2. There are other nodes in the same hierarchical group that are set in MAINTENANCE with the same description.
|
|   Also made the following change.
|
|   3. Deny a request for safe MAINTENANCE or DOWN, if the wanted state is already set but with a different description. If the descriptions are the same, it is assumed to be the same operator (e.g. Orchestrator) having changed its mind.
* Support group maintenance [run-systemtest]  Håkon Hallingstad  2021-01-19  6  -25/+216
|
* Add feed block propagation to ClusterStateBundle in Java  Tor Brede Vekterli  2021-01-15  4  -19/+173
|
|   Fully forwards and backwards compatible. Currently only supports indicating feed blocked status for the entire cluster, with one associated descriptive message intended to be used by distributors.
* Change isMaster to updateMasterState  Ola Aunrønning  2021-01-04  2  -2/+2
|
* Always set 'is-master' metric. Remove 'master-change'  Ola Aunrønning  2021-01-04  2  -10/+3
|
* Track explicitly when we are initializing config  Jon Bratseth  2020-12-16  1  -1/+3
|
* Remove code in StatusPageServer, keep some inner classes temporarily  Harald Musum  2020-11-19  4  -655/+14
|
* Simplify  Harald Musum  2020-11-19  3  -13/+16
|