The opaque config Slime representation is encoded as part of the
existing bundle object. A new cluster state version is created and
a broadcast is triggered if the distribution config changes.
Inclusion of the distribution config is controlled by a dedicated
config parameter, which currently defaults to `false` (i.e.
disabled).
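
A minimal sketch of how such an opaque config payload could be
attached to a Slime-encoded bundle, assuming Vespa's `com.yahoo.slime`
API; the field names and layout here are illustrative, not the actual
wire format:

```java
import com.yahoo.slime.BinaryFormat;
import com.yahoo.slime.Cursor;
import com.yahoo.slime.Slime;

class BundleEncodingSketch {
    // Field names are assumptions for illustration only.
    static byte[] encodeBundle(String baselineState, String distributionConfig,
                               boolean includeDistributionConfig) {
        Slime slime = new Slime();
        Cursor root = slime.setObject();
        root.setString("baseline-cluster-state", baselineState);
        // Inclusion is gated on a dedicated config parameter (default false).
        if (includeDistributionConfig) {
            // The config is carried opaquely; nothing here interprets its contents.
            root.setString("distribution-config", distributionConfig);
        }
        return BinaryFormat.encode(slime);
    }
}
```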
|
There is a grace period (currently hard-coded to 30s) between the
time a new state version is published and the time nodes start being
counted as not converging. This state is not sticky for a given node,
so if the state keeps changing constantly we may under-count the
number of non-converging nodes. This can be improved over time; for
now the count is primarily a source from which we can generate
alerts, as failures to converge can manifest as other failures in
the content cluster.
Also move the out-of-sync ratio computation to the periodic metric
update functionality, and reset it if the current controller is not
believed to be the cluster leader.
Start moving some cluster controller timer code to use `Instant`
instead of `long`s with implicit units.
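
A sketch of the grace-period rule under these assumptions (names are
hypothetical): a node is only counted as non-converging once the
grace period since publication has elapsed and it still has not
acknowledged the published version. Because the publish time resets
on every new version, a constantly changing state under-counts, as
noted above.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.Map;

class ConvergenceSketch {
    static final Duration GRACE_PERIOD = Duration.ofSeconds(30); // currently hard-coded

    /** Counts nodes whose acknowledged version still lags the published one. */
    static long countNonConvergingNodes(Map<Integer, Integer> ackedVersionByNode,
                                        int publishedVersion, Instant publishTime, Instant now) {
        if (Duration.between(publishTime, now).compareTo(GRACE_PERIOD) < 0) {
            return 0; // still within the grace period; no node is counted yet
        }
        return ackedVersionByNode.values().stream()
                .filter(acked -> acked < publishedVersion)
                .count();
    }
}
```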
|
With these changes the cluster controller continuously maintains
a global aggregate across all content nodes representing the number
of pending and total buckets per bucket space. This aggregate can
be sampled in O(1) time.
An explicit metric `cluster-buckets-out-of-sync-ratio` has been
added, and the value is also emitted as part of the cluster state
REST API. Note: the value is only emitted once statistics have been
received from _all_ distributors for a particular cluster state, as
it would otherwise potentially represent an arbitrary point somewhere
between two or more distinct states.
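
A hypothetical sketch of such an incrementally maintained aggregate;
applying per-node deltas as statistics arrive is what makes sampling
O(1) rather than a scan over all nodes:

```java
import java.util.HashMap;
import java.util.Map;

class BucketAggregateSketch {
    private final Map<Integer, long[]> perNode = new HashMap<>(); // node -> {pending, total}
    private long pendingSum = 0;
    private long totalSum = 0;

    /** Replaces a node's contribution: subtract the old delta, add the new one. */
    void updateNodeStats(int nodeIndex, long pending, long total) {
        long[] previous = perNode.put(nodeIndex, new long[] {pending, total});
        if (previous != null) {
            pendingSum -= previous[0];
            totalSum -= previous[1];
        }
        pendingSum += pending;
        totalSum += total;
    }

    /** O(1) sample, e.g. backing a cluster-buckets-out-of-sync-ratio metric. */
    double outOfSyncRatio() {
        return totalSum == 0 ? 0.0 : (double) pendingSum / totalSum;
    }
}
```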
|
node down edge
The stored entry count encompasses both visible documents and
tombstones. Using this count rather than the bucket count avoids
issues where a node containing only empty buckets (i.e. no actual
data) is prohibited from being marked as permanently down.
The entry count is cross-checked against the visible document count;
if the former is zero, the latter should always be zero as well.
Since entry/doc counts were only recently introduced as part of
the HostInfo payload, we have to handle the case where they are
not present. If the entry count is missing, the decision to allow
or disallow the transition falls back to the bucket count check.
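
A sketch of the decision logic as described, with illustrative names;
the `OptionalLong` models host info that predates the entry count:

```java
import java.util.OptionalLong;

class PermanentDownCheckSketch {
    static boolean mayMarkPermanentlyDown(OptionalLong entryCount,
                                          long visibleDocCount, long bucketCount) {
        if (entryCount.isPresent()) {
            // Entry count covers both visible documents and tombstones.
            if (entryCount.getAsLong() == 0) {
                assert visibleDocCount == 0; // zero entries must imply zero visible docs
                return true; // no data at all, not even tombstones; empty buckets do not block us
            }
            return false;
        }
        // Host info predates entry/doc counts: fall back to the bucket count check.
        return bucketCount == 0;
    }
}
```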
|
The fake impl acts "as if" a single-node ZK quorum is present, so it
cannot be directly used with most multi-node tests that require
multiple nodes to actually participate in leader elections.
|
of the ObjectMapper.
Unless special options are needed, use a common instance; otherwise,
create one via a factory method.
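
A minimal sketch of that pattern with Jackson, whose `ObjectMapper`
is thread-safe once configured (the class and method names here are
illustrative):

```java
import com.fasterxml.jackson.databind.ObjectMapper;
import com.fasterxml.jackson.databind.SerializationFeature;

final class Jackson {
    // One shared, pre-configured instance; construction is relatively expensive.
    private static final ObjectMapper COMMON = new ObjectMapper();

    static ObjectMapper mapper() {
        return COMMON;
    }

    /** Factory for the special-options case; never mutate the shared instance. */
    static ObjectMapper createPrettyPrintingMapper() {
        return new ObjectMapper().enable(SerializationFeature.INDENT_OUTPUT);
    }
}
```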
|
Messages are now prefixed with the content cluster name to help
disambiguate which cluster is exceeding its limits in multi-cluster
deployments.
Example message:
```
in content cluster 'my-cool-cluster': disk on node 1 [my-node-1.example.com] is 81.0% full
(the configured limit is 80.0%). See https://docs.vespa.ai/en/operations/feed-block.html
```
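
A minimal sketch of the prefixing itself; the method name is an
assumption:

```java
class FeedBlockMessageSketch {
    static String withClusterName(String clusterName, String description) {
        return String.format("in content cluster '%s': %s", clusterName, description);
    }
}
```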
|
Messages are generated centrally by the cluster controller and pushed
to content nodes as part of a cluster state bundle; the distributor
nodes merely repeat back what they have been told. This changes the
cluster controller feed block error message code to be less ambiguous
and to include a URL to our public documentation about feed blocks.
Example of _old_ message:
```
disk on node 1 [storage.1.local] (0.510 > 0.500)
```
Same feed block with _new_ message:
```
disk on node 1 [storage.1.local] is 51.0% full (the configured limit is 50.0%).
See https://docs.vespa.ai/en/operations/feed-block.html
```
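
A hypothetical sketch of rendering the new message from the raw usage
and limit ratios shown in the old one (0.510 becomes "51.0% full",
0.500 becomes the 50.0% limit):

```java
class FeedBlockFormatSketch {
    static String describe(String resource, int nodeIndex, String hostname,
                           double usage, double limit) {
        return String.format(
                "%s on node %d [%s] is %.1f%% full (the configured limit is %.1f%%). "
                        + "See https://docs.vespa.ai/en/operations/feed-block.html",
                resource, nodeIndex, hostname, usage * 100.0, limit * 100.0);
    }
}
```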
|
This means we will also check distributors that are on the same node
as a retired storage node.
|
When we allow several groups to go down for maintenance, we should
check whether the nodes in the groups that remain up have the
required redundancy. Those nodes might be up but not yet have synced
all buckets after coming back up; we want to wait until that is done
before allowing more nodes to be taken down.
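
A hypothetical sketch of that guard; how many replicas of each bucket
live on up nodes is assumed to come from the distributors' statistics:

```java
import java.util.Collection;

class GroupMaintenanceSketch {
    /** replicasOnUpNodes holds, per bucket, the replica count on currently up nodes. */
    static boolean mayTakeAnotherGroupDown(Collection<Integer> replicasOnUpNodes,
                                           int requiredRedundancy) {
        return replicasOnUpNodes.stream()
                .allMatch(replicas -> replicas >= requiredRedundancy);
    }
}
```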
|
Not used; reintroduce using the JUnit TestInfo class if needed.
|
The timers used in the fleetcontroller in these tests (except in one
disabled test) are instances of RealTimer, whereas the one used in
this call is a FakeTimer from the superclass, so this call makes no
sense IMHO.
|
GC dead code, optimize imports, remove unnecessary throws clauses.
|
Also add group indexes for disallow messages where relevant
|