Commit message | Author | Age | Files | Lines | |
---|---|---|---|---|---|
* | Remove code in StatusPageServer, keep some inner classes temporarily | Harald Musum | 2020-11-19 | 4 | -655/+14 |
| | |||||
* | Simplify | Harald Musum | 2020-11-19 | 3 | -13/+16 |
| | |||||
* | Rename method | Harald Musum | 2020-11-18 | 1 | -18/+9 |
| | |||||
* | Remove unused method | Harald Musum | 2020-11-18 | 1 | -11/+3 |
| | |||||
* | Add back @Ignore and add timeout for tests | Harald Musum | 2020-10-02 | 1 | -0/+5 |
| | |||||
* | Cleanup | Harald Musum | 2020-10-02 | 3 | -44/+33 |
| | |||||
* | Name the transport threads to understand how things are interconnected. | Henning Baldersheim | 2020-08-04 | 3 | -3/+3 |
| | |||||
* | use fixed port for rebinding test | Arne Juul | 2020-07-06 | 1 | -1/+1 |
| | |||||
* | Downgrade log level when node watch races with internal state resets | Tor Brede Vekterli | 2020-05-05 | 1 | -1/+4 |
| | This appears to happen if pending data node watches are triggered with failures after the cluster controller has already reset the internal voting state due to connectivity issues. Internal state is rebuilt automatically, so no need to scream errors into the logs for this. | ||||
* | Add dependency to vespalog (directly used by these modules) | gjoranv | 2020-04-26 | 1 | -1/+7 |
| | |||||
* | Use correct log Level class where search & replace has failed. | gjoranv | 2020-04-25 | 1 | -1/+1 |
| | |||||
* | Replace remaining LogLevel.<level> with corresponding Level | gjoranv | 2020-04-25 | 5 | -16/+16 |
| | |||||
* | Map remaining DEBUG/SPAM/ERROR/FATAL -> Level.FINE/FINEST/SEVERE | gjoranv | 2020-04-25 | 5 | -9/+9 |
| | |||||
* | LogLevel -> Level for isLoggable() | gjoranv | 2020-04-25 | 3 | -9/+9 |
| | |||||
* | LogLevel.ERROR -> Level.SEVERE | gjoranv | 2020-04-25 | 3 | -8/+8 |
| | |||||
* | LogLevel.WARNING -> Level.WARNING | gjoranv | 2020-04-25 | 10 | -32/+32 |
| | |||||
* | LogLevel.INFO -> Level.INFO | gjoranv | 2020-04-25 | 13 | -80/+80 |
| | |||||
* | LogLevel.SPAM -> Level.FINEST | gjoranv | 2020-04-25 | 9 | -14/+14 |
| | |||||
* | LogLevel.DEBUG -> Level.FINE | gjoranv | 2020-04-25 | 17 | -146/+146 |
| | |||||
* | Import java.util.logging.Level instead of com.yahoo.log.LogLevel | gjoranv | 2020-04-25 | 24 | -24/+24 |
| | |||||
* | Merge branch 'master' into hakonhall/remove-use-bucket-space-metric-feature-flag | Håkon Hallingstad | 2020-04-20 | 1 | -2/+3 |
|\ | |||||
| * | Increase ZooKeeper test server tick time | Tor Brede Vekterli | 2020-03-27 | 1 | -2/+3 |
| | | Tests running under heavy concurrent CI build load appear to struggle maintaining ZK sessions, causing flakiness. This is an attempt to mitigate that by bumping the ZK server's tick interval (and therefore session timeout) by 20x. According to local testing, this does not increase testing time in the common case. | ||||
* | | Remove stray parameter doc | Håkon Hallingstad | 2020-01-27 | 1 | -4/+2 |
| | | |||||
* | | Remove use-bucket-space-metric feature flag | Håkon Hallingstad | 2020-01-26 | 11 | -42/+24 |
|/ | The flag controlled config read by the Cluster Controller. Therefore, I have left the ModelContextImpl.Properties method and implementation (now always returning true), but the model has stopped using that method internally, and the config is no longer used in the CC. The field in the fleetcontroller.def is left unchanged and documented as deprecated. | ||||
* | Test metric value for different dimensions, and more | Håkon Hallingstad | 2020-01-17 | 2 | -5/+12 |
| | |||||
* | Use bucket_space metric in retirement | Håkon Hallingstad | 2020-01-17 | 13 | -30/+120 |
| | This makes the Cluster Controller use the vds.datastored.bucket_space.buckets_total, dimension bucketSpace=default, to determine whether a content node manages zero buckets, and if so, will allow the node to go permanently down. This is used when a node is retiring, and it is to be removed from the application. The change is guarded by the use-bucket-space-metric, default true. If the new metric doesn't work as expected, we can revert to using the current/old metric by flipping the flag. The flag can be controlled per application. | ||||
* | Use vespajlib's ExceptionUtils in clustercontroller-core | Bjørn Christian Seime | 2020-01-06 | 2 | -13/+4 |
| | |||||
* | Remove use of apache commons libraries from cluster-controller modules | Bjørn Christian Seime | 2020-01-03 | 4 | -17/+43 |
| | |||||
* | Need to depend on spotbugs | Harald Musum | 2019-11-28 | 1 | -0/+7 |
| | |||||
* | Add non-converged nodes to task deadline exceeded messages | Tor Brede Vekterli | 2019-11-04 | 6 | -26/+142 |
| | Makes it easier for an external observer to understand what set of nodes is causing the cluster state to not converge. | ||||
* | Pass working candidate state rather than published state to timer code | Tor Brede Vekterli | 2019-10-30 | 1 | -2/+3 |
| | Avoids triggering pointless cluster state recomputation edges because the timer code observes stale information. This can happen if publishing the candidate state is suppressed due to e.g. the minimum time between published state versions not having passed. | ||||
* | Move grace period event edge from timer to event diff calculator | Tor Brede Vekterli | 2019-10-30 | 9 | -25/+103 |
| | Ensures that event is only emitted when we're actually publishing a state containing the state transition. Emitting events in the timer code is fragile when it does not modify any state, risking emitting the same event an unbounded amount of times if the condition keeps holding for each timer cycle. | ||||
* | Use mockito-core 3.1.0 | Håkon Hallingstad | 2019-10-18 | 16 | -35/+47 |
| | |||||
* | Revert "Reapply "upgrade to zookeeper 3.5"" | Harald Musum | 2019-09-27 | 2 | -24/+9 |
| | |||||
* | Revert "Revert "Hmusum/upgrade to zookeeper 3.5"" | Harald Musum | 2019-09-24 | 2 | -9/+24 |
| | |||||
* | Revert "Hmusum/upgrade to zookeeper 3.5" | Harald Musum | 2019-09-17 | 2 | -24/+9 |
| | |||||
* | Add spotbugs dep to clustercontroller-core. | gjoranv | 2019-09-10 | 1 | -0/+7 |
| | + Use version property for zk in parent pom | ||||
* | Don't use import wildcards. | gjoranv | 2019-09-10 | 1 | -9/+17 |
| | |||||
* | Cleanup tests, no functional changes | Harald Musum | 2019-09-03 | 57 | -457/+384 |
| | |||||
* | Guard critical cluster state version ZK writes with an atomic CaS | Tor Brede Vekterli | 2019-07-11 | 5 | -68/+170 |
| | Lets a controller discover that another controller also believes it is the leader by tracking the expected znode versions for cluster state version and bundle znodes. If a CaS failure is triggered, the controller will drop its database and election state, forcing a state refresh. | ||||
* | Do not allow states to be published when they have pending ZK writes | Tor Brede Vekterli | 2019-07-05 | 2 | -0/+15 |
| | Avoids a race condition where a bundle ZK write fails but we have not yet detected that ZK connectivity has been lost. This could lead to violating the invariant that published state versions are strictly increasing. | ||||
* | Don't use fake timers for master election tests | Tor Brede Vekterli | 2019-06-20 | 1 | -14/+14 |
| | When using fake timers there's a catch-22 edge case during transient ZooKeeper disconnects. The test will not progress and increase the fake timer value until ZK is connected, but ZK will not attempt to reconnect until the reconnection cooldown period has passed (which depends on the faked timer). | ||||
* | Increase ZK client session timeout from 10s to 30s | Tor Brede Vekterli | 2019-06-19 | 1 | -1/+1 |
| | Have a suspicion that 10s ends up being too short when ZK's write-ahead log flushing is taking a long time due to a heavily loaded CI container. Should still be short enough for a network hiccup to hopefully trigger a reconnect before the test itself times out. | ||||
* | Reduce ZK session timeout for master election unit tests | Tor Brede Vekterli | 2019-06-17 | 1 | -4/+4 |
| | Setup methods would previously override explicit timeouts set by the tests with a default of 2 minutes. It's suspected that this long timeout combined with spurious ZK client failures is a source of test failures. Reduce to 10 seconds for now. Let's see how well it fares on a heavily loaded CI container. | ||||
* | Keep the spec final. | Henning Baldersheim | 2019-05-28 | 1 | -1/+1 |
| | Create the address when needed in the async connect thread. Implement hash/equal/compareTo for Spec to avoid toString. Use Spec as key and avoid creating it every time. | ||||
* | Remove usage of deprecated Method constructor | Bjørn Christian Seime | 2019-05-23 | 2 | -21/+21 |
| | |||||
* | mockito-all => mockito-core | Henning Baldersheim | 2019-04-29 | 1 | -1/+1 |
| | |||||
* | Change interface from Mirror.Entry[] to List<Mirror.Entry> as you already ↵ | Henning Baldersheim | 2019-04-22 | 1 | -4/+9 |
| | have a list. Avoid having to do an array copy that is not necessary. | ||||
* | Address code review feedback for cluster controller changes | Tor Brede Vekterli | 2019-03-26 | 3 | -36/+42 |
| | |||||
* | Work around some Java generics snags in test mocks | Tor Brede Vekterli | 2019-03-22 | 1 | -0/+7 |
| |