path: root/clustercontroller-core
Commit message, author, age, files, lines
* Add extra row under content node if it's blocking feed (Tor Brede Vekterli, 2021-02-08, 2 files, -0/+37)
|     List all resource exhaustions for node, including enum store etc.
* Merge pull request #16424 from vespa-engine/geirst/resource-usage-metrics-in-cluster-controller (Geir Storli, 2021-02-08, 7 files, -30/+226)
|\
| |     Add and expose resource usage metrics from the cluster controller.
| * Nodes above limit should only count each node once. (Geir Storli, 2021-02-08, 2 files, -2/+16)
| |
| * Add and expose resource usage metrics from the cluster controller. (Geir Storli, 2021-02-05, 7 files, -30/+212)
| |
* | Show feed block options on cluster controller status page (Tor Brede Vekterli, 2021-02-05, 1 file, -0/+7)
|/
* Add resource usage per node to cluster controller status page (Tor Brede Vekterli, 2021-02-04, 3 files, -1/+56)
|     Also adds top-level cluster feed block status. Does not yet make enum store/multivalue limit feed blocks visible per node.
* Emit node-level events when resource exhaustion set changes (Tor Brede Vekterli, 2021-02-03, 9 files, -44/+138)
|
* Recompute cluster state if set of resource exhaustions changes (Tor Brede Vekterli, 2021-02-02, 6 files, -18/+127)
|     Ensures that feed block description pushed to nodes is updated as further resource exhaustions are recorded (or disappear).
* Add hostname to resource exhaustion description (Tor Brede Vekterli, 2021-01-29, 5 files, -19/+58)
|     Hostname is inferred from the node's RPC address
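A rough illustration of how a hostname could be inferred from an RPC address. This is a sketch only: it assumes the address looks like `tcp/host:port`, and the class and method names are hypothetical, not taken from the commit.

```java
import java.util.Optional;

/** Hypothetical sketch; not the code from the commit. */
final class RpcHostnameSketch {

    /**
     * Extracts the host part of an address assumed to look like "tcp/host:port".
     * Returns empty if no host can be found.
     */
    static Optional<String> hostnameFrom(String rpcAddress) {
        if (rpcAddress == null) {
            return Optional.empty();
        }
        // Assumption: an optional "tcp/" scheme prefix precedes host:port.
        String rest = rpcAddress.startsWith("tcp/") ? rpcAddress.substring(4) : rpcAddress;
        int colon = rest.lastIndexOf(':');
        String host = colon >= 0 ? rest.substring(0, colon) : rest;
        return host.isEmpty() ? Optional.empty() : Optional.of(host);
    }

    public static void main(String[] args) {
        // The address below is made up for illustration.
        System.out.println(hostnameFrom("tcp/node1.example.com:19101").orElse("(unknown)"));
    }
}
```

Returning an empty `Optional` rather than failing keeps the description usable when an address is missing or malformed; whether the real code does this is not stated in the log.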
* Support optional resource usage name field (Tor Brede Vekterli, 2021-01-29, 6 files, -24/+84)
|     If present, will be reported alongside the resource type in the feed block description string.
* add maybePublishOldMetrics hook (Arne Juul, 2021-01-28, 1 file, -0/+16)
|
* Add cluster feed block support to cluster controller (Tor Brede Vekterli, 2021-01-27, 15 files, -4/+583)
|     Will push out a new cluster state bundle indicating cluster feed blocked if one or more nodes in the cluster has one or more resources exhausted. Similarly, a new state will be pushed out once no nodes have resources exhausted any more.
|     The feed block description currently contains up to 3 separate exhausted resources, possibly across multiple nodes.
|     A cluster-level event is emitted for both the block and unblock edges.
|     No hysteresis is present yet, so if a node is oscillating around a block-limit, so will the cluster state.
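The behavior described in this commit message can be sketched roughly as follows. All class, field, and method names are hypothetical, and so are the description wording and the "(and N more)" suffix. Only the rule "blocked if any node has any exhausted resource" and the cap of 3 listed resources come from the message.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical sketch; class and method names are not taken from the Vespa code base. */
final class FeedBlockSketch {

    /** The commit message says the description holds up to 3 exhausted resources. */
    static final int MAX_LISTED = 3;

    /** Derived cluster-level result: blocked flag plus a human-readable reason. */
    static final class FeedBlock {
        final boolean blocked;
        final String description;

        FeedBlock(boolean blocked, String description) {
            this.blocked = blocked;
            this.description = description;
        }
    }

    /** @param exhaustedByNode node index -> resources that node reports as exhausted */
    static FeedBlock derive(Map<Integer, List<String>> exhaustedByNode) {
        List<String> all = new ArrayList<>();
        // Iterate in node-index order so the description is deterministic.
        for (Map.Entry<Integer, List<String>> entry : new TreeMap<>(exhaustedByNode).entrySet()) {
            for (String resource : entry.getValue()) {
                all.add(resource + " on node " + entry.getKey());
            }
        }
        if (all.isEmpty()) {
            return new FeedBlock(false, "");
        }
        int listed = Math.min(MAX_LISTED, all.size());
        String description = String.join(", ", all.subList(0, listed));
        if (all.size() > listed) {
            // Assumed wording; the commit does not say how omitted entries are indicated.
            description += " (and " + (all.size() - listed) + " more)";
        }
        return new FeedBlock(true, description);
    }

    public static void main(String[] args) {
        FeedBlock block = derive(Map.of(1, List.of("disk"), 0, List.of("memory", "disk")));
        System.out.println(block.blocked + ": " + block.description);
    }
}
```

Because `derive` depends only on the latest report, a node hovering around its limit flips the result on every change, which is the missing hysteresis the message mentions.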
* Dummy change to trigger systemtest (Håkon Hallingstad, 2021-01-21, 1 file, -1/+0)
|
* Allows setting a node safely to maintenance in these two new circumstances: (Håkon Hallingstad, 2021-01-21, 8 files, -56/+302)
|     1. The node has state MAINTENANCE with (user) wanted state UP.
|     2. There are other nodes in the same hierarchical group that are set in MAINTENANCE with the same description.
|     Also made the following change.
|     3. Deny a request for safe MAINTENANCE or DOWN, if the wanted state is already set but with a different description. If the descriptions are the same, it is assumed to be the same operator (e.g. Orchestrator) having changed its mind.
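The three rules above could be checked roughly like this. This is a sketch under stated assumptions: the types, the `UNDECIDED` outcome (meaning "fall through to whatever other checks exist"), and the reading of "wanted state is already set" as "wanted state is not UP" are all hypothetical, not taken from the commit.

```java
import java.util.List;

/** Hypothetical sketch; types and names are illustrative, not the Vespa implementation. */
final class SafeMaintenanceSketch {

    enum State { UP, DOWN, MAINTENANCE }

    /** UNDECIDED means none of the three rules applied. */
    enum Decision { ALLOW, DENY, UNDECIDED }

    /** Minimal view of a node: current state plus wanted state and its description. */
    static final class Node {
        final State current;
        final State wanted;
        final String wantedDescription;

        Node(State current, State wanted, String wantedDescription) {
            this.current = current;
            this.wanted = wanted;
            this.wantedDescription = wantedDescription;
        }
    }

    /** Evaluates a safe-set request for MAINTENANCE or DOWN against the three rules. */
    static Decision decide(State requested, String description, Node target, List<Node> sameGroup) {
        // Rule 3 (assumed reading): a non-UP wanted state with another description is left alone.
        if (target.wanted != State.UP && !target.wantedDescription.equals(description)) {
            return Decision.DENY;
        }
        if (requested == State.MAINTENANCE) {
            // Rule 1: node is already in MAINTENANCE while its wanted state is UP.
            if (target.current == State.MAINTENANCE && target.wanted == State.UP) {
                return Decision.ALLOW;
            }
            // Rule 2: another node in the group is in MAINTENANCE with the same description.
            for (Node other : sameGroup) {
                if (other.wanted == State.MAINTENANCE && other.wantedDescription.equals(description)) {
                    return Decision.ALLOW;
                }
            }
        }
        return Decision.UNDECIDED;
    }

    public static void main(String[] args) {
        Node target = new Node(State.UP, State.UP, "");
        Node peer = new Node(State.MAINTENANCE, State.MAINTENANCE, "Orchestrator");
        System.out.println(decide(State.MAINTENANCE, "Orchestrator", target, List.of(peer)));
    }
}
```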
* Support group maintenance [run-systemtest] (Håkon Hallingstad, 2021-01-19, 6 files, -25/+216)
|
* Add feed block propagation to ClusterStateBundle in Java (Tor Brede Vekterli, 2021-01-15, 4 files, -19/+173)
|     Fully forwards and backwards compatible. Currently only supports indicating feed blocked status for the entire cluster, with one associated descriptive message intended to be used by distributors.
* Change isMaster to updateMasterState (Ola Aunrønning, 2021-01-04, 2 files, -2/+2)
|
* Always set 'is-master' metric. Remove 'master-change' (Ola Aunrønning, 2021-01-04, 2 files, -10/+3)
|
* Track explicitly when we are initializing config (Jon Bratseth, 2020-12-16, 1 file, -1/+3)
|
* Remove code in StatusPageServer, keep some inner classes temporarily (Harald Musum, 2020-11-19, 4 files, -655/+14)
|
* Simplify (Harald Musum, 2020-11-19, 3 files, -13/+16)
|
* Rename method (Harald Musum, 2020-11-18, 1 file, -18/+9)
|
* Remove unused method (Harald Musum, 2020-11-18, 1 file, -11/+3)
|
* Add back @Ignore and add timeout for tests (Harald Musum, 2020-10-02, 1 file, -0/+5)
|
* Cleanup (Harald Musum, 2020-10-02, 3 files, -44/+33)
|
* Name the transport threads to understand how things are interconnected. (Henning Baldersheim, 2020-08-04, 3 files, -3/+3)
|
* use fixed port for rebinding test (Arne Juul, 2020-07-06, 1 file, -1/+1)
|
* Downgrade log level when node watch races with internal state resets (Tor Brede Vekterli, 2020-05-05, 1 file, -1/+4)
|     This appears to happen if pending data node watches are triggered with failures after the cluster controller has already reset the internal voting state due to connectivity issues.
|     Internal state is rebuilt automatically, so no need to scream errors into the logs for this.
* Add dependency to vespalog (directly used by these modules) (gjoranv, 2020-04-26, 1 file, -1/+7)
|
* Use correct log Level class where search & replace has failed. (gjoranv, 2020-04-25, 1 file, -1/+1)
|
* Replace remaining LogLevel.<level> with corresponding Level (gjoranv, 2020-04-25, 5 files, -16/+16)
|
* Map remaining DEBUG/SPAM/ERROR/FATAL -> Level.FINE/FINEST/SEVERE (gjoranv, 2020-04-25, 5 files, -9/+9)
|
* LogLevel -> Level for isLoggable() (gjoranv, 2020-04-25, 3 files, -9/+9)
|
* LogLevel.ERROR -> Level.SEVERE (gjoranv, 2020-04-25, 3 files, -8/+8)
|
* LogLevel.WARNING -> Level.WARNING (gjoranv, 2020-04-25, 10 files, -32/+32)
|
* LogLevel.INFO -> Level.INFO (gjoranv, 2020-04-25, 13 files, -80/+80)
|
* LogLevel.SPAM -> Level.FINEST (gjoranv, 2020-04-25, 9 files, -14/+14)
|
* LogLevel.DEBUG -> Level.FINE (gjoranv, 2020-04-25, 17 files, -146/+146)
|
* Import java.util.logging.Level instead of com.yahoo.log.LogLevel (gjoranv, 2020-04-25, 24 files, -24/+24)
|
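The level mapping applied across the preceding series of commits can be summarized in one lookup. This is a sketch only: the helper name is hypothetical, and FATAL mapping to `Level.SEVERE` is inferred from the "DEBUG/SPAM/ERROR/FATAL -> Level.FINE/FINEST/SEVERE" subject rather than stated outright.

```java
import java.util.logging.Level;

/** Hypothetical helper summarizing the LogLevel -> java.util.logging.Level replacements. */
final class LogLevelMappingSketch {

    /** Maps an old LogLevel constant name to the java.util.logging.Level used instead. */
    static Level toJulLevel(String logLevelName) {
        switch (logLevelName) {
            case "FATAL": // inferred: FATAL is assumed to share SEVERE with ERROR
            case "ERROR":
                return Level.SEVERE;
            case "WARNING":
                return Level.WARNING;
            case "INFO":
                return Level.INFO;
            case "DEBUG":
                return Level.FINE;
            case "SPAM":
                return Level.FINEST;
            default:
                throw new IllegalArgumentException("No mapping listed for " + logLevelName);
        }
    }

    public static void main(String[] args) {
        for (String name : new String[] {"ERROR", "WARNING", "INFO", "DEBUG", "SPAM"}) {
            System.out.println("LogLevel." + name + " -> Level." + toJulLevel(name).getName());
        }
    }
}
```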
* Merge branch 'master' into hakonhall/remove-use-bucket-space-metric-feature-flag (Håkon Hallingstad, 2020-04-20, 1 file, -2/+3)
|\
| * Increase ZooKeeper test server tick time (Tor Brede Vekterli, 2020-03-27, 1 file, -2/+3)
| |     Tests running under heavy concurrent CI build load appear to struggle maintaining ZK sessions, causing flakiness. This is an attempt to mitigate that by bumping the ZK server's tick interval (and therefore session timeout) by 20x.
| |     According to local testing, this does not increase testing time in the common case.
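Why a larger tick interval implies a longer session timeout: by default a ZooKeeper server clamps the session timeout a client requests to between 2 and 20 ticks. The arithmetic can be sketched as below. The tick and timeout values are made up for illustration, since the commit does not state the actual numbers, and the test server may configure explicit bounds instead of the defaults.

```java
/** Hypothetical sketch of ZooKeeper's default session timeout clamping. */
final class ZkTickSketch {

    /**
     * Returns the session timeout a server would grant when minSessionTimeout and
     * maxSessionTimeout are left at their defaults of 2 and 20 ticks.
     */
    static int negotiatedSessionTimeoutMs(int tickTimeMs, int requestedTimeoutMs) {
        int min = 2 * tickTimeMs;
        int max = 20 * tickTimeMs;
        return Math.max(min, Math.min(max, requestedTimeoutMs));
    }

    public static void main(String[] args) {
        int requested = 30_000;   // illustrative client request
        int smallTick = 100;      // illustrative tick before the change
        int largeTick = 20 * smallTick;
        System.out.println(negotiatedSessionTimeoutMs(smallTick, requested));
        System.out.println(negotiatedSessionTimeoutMs(largeTick, requested));
    }
}
```

With the small tick the granted timeout is capped at 20 ticks, well below what the client asked for; with a 20x larger tick the same request fits inside the allowed range.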
* | Remove stray parameter doc (Håkon Hallingstad, 2020-01-27, 1 file, -4/+2)
| |
* | Remove use-bucket-space-metric feature flag (Håkon Hallingstad, 2020-01-26, 11 files, -42/+24)
|/
|     The flag controlled config read by the Cluster Controller. Therefore, I have left the ModelContextImpl.Properties method and implementation (now always returning true), but the model has stopped using that method internally, and the config is no longer used in the CC. The field in the fleetcontroller.def is left unchanged and documented as deprecated.
* Test metric value for different dimensions, and more (Håkon Hallingstad, 2020-01-17, 2 files, -5/+12)
|
* Use bucket_space metric in retirement (Håkon Hallingstad, 2020-01-17, 13 files, -30/+120)
|     This makes the Cluster Controller use the vds.datastored.bucket_space.buckets_total, dimension bucketSpace=default, to determine whether a content node manages zero buckets, and if so, will allow the node to go permanently down. This is used when a node is retiring, and it is to be removed from the application.
|     The change is guarded by the use-bucket-space-metric, default true. If the new metric doesn't work as expected, we can revert to using the current/old metric by flipping the flag. The flag can be controlled per application.
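The retirement check described above amounts to a zero test on one metric value. Here is a minimal sketch, assuming the metric values have already been extracted into a map keyed by the `bucketSpace` dimension. The class, the method, and the choice to treat a missing metric as "not safe" are hypothetical.

```java
import java.util.Map;

/** Hypothetical sketch; not the Cluster Controller's actual implementation. */
final class RetirementSketch {

    /** Metric name and dimension value as given in the commit message. */
    static final String METRIC_NAME = "vds.datastored.bucket_space.buckets_total";
    static final String BUCKET_SPACE = "default";

    /**
     * @param bucketsTotalBySpace value of METRIC_NAME keyed by its bucketSpace dimension
     * @return true if the node reports zero buckets in the default bucket space
     */
    static boolean mayGoPermanentlyDown(Map<String, Long> bucketsTotalBySpace) {
        Long buckets = bucketsTotalBySpace.get(BUCKET_SPACE);
        // Assumption: a missing metric means "unknown", so the node is conservatively kept.
        return buckets != null && buckets == 0L;
    }

    public static void main(String[] args) {
        System.out.println(METRIC_NAME + " = 0 allows removal: "
                + mayGoPermanentlyDown(Map.of(BUCKET_SPACE, 0L)));
    }
}
```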
* Use vespajlib's ExceptionUtils in clustercontroller-core (Bjørn Christian Seime, 2020-01-06, 2 files, -13/+4)
|
* Remove use of apache commons libraries from cluster-controller modules (Bjørn Christian Seime, 2020-01-03, 4 files, -17/+43)
|
* Need to depend on spotbugs (Harald Musum, 2019-11-28, 1 file, -0/+7)
|
* Add non-converged nodes to task deadline exceeded messages (Tor Brede Vekterli, 2019-11-04, 6 files, -26/+142)
|     Makes it easier for an external observer to understand what set of nodes is causing the cluster state to not converge.
* Pass working candidate state rather than published state to timer code (Tor Brede Vekterli, 2019-10-30, 1 file, -2/+3)
|     Avoids triggering pointless cluster state recomputation edges because the timer code observes stale information. This can happen if publishing the candidate state is suppressed due to e.g. the minimum time between published state versions not having passed.