path: root/orchestrator
Commit message  (Author, date, files changed, lines -/+)
* Clean up in orchestrator  (Valerij Fredriksen, 2019-06-07, 3 files, -19/+0)
|
* Comment zone-app specific code that should be removed after the migration  (Valerij Fredriksen, 2019-06-04, 3 files, -1/+3)
|
* Special handle the tenant-host application where needed  (Valerij Fredriksen, 2019-06-04, 3 files, -1/+15)
|
* Keep the spec final.  (Henning Baldersheim, 2019-05-28, 2 files, -4/+4)
|     Create the address when needed in the async connect thread.
|     Implement hash/equal/compareTo for Spec to avoid toString.
|     Use Spec as key and avoid creating it every time.
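
A minimal, illustrative sketch of the hash/equal/compareTo idea above, assuming a
simplified Spec with only a host and a port (the real class has more to it): the
methods work on the fields directly instead of going through toString().

    import java.util.Objects;

    // Simplified stand-in for Spec, for illustration only.
    final class Spec implements Comparable<Spec> {
        private final String host;   // assumed field; may be null for a wildcard spec
        private final int port;      // assumed field

        Spec(String host, int port) {
            this.host = host;
            this.port = port;
        }

        @Override public boolean equals(Object o) {
            if (this == o) return true;
            if (!(o instanceof Spec)) return false;
            Spec other = (Spec) o;
            return port == other.port && Objects.equals(host, other.host);
        }

        @Override public int hashCode() {
            return Objects.hash(host, port);
        }

        @Override public int compareTo(Spec other) {
            int byHost = Objects.toString(host, "").compareTo(Objects.toString(other.host, ""));
            return byHost != 0 ? byHost : Integer.compare(port, other.port);
        }
    }
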
* Orchestrator Support for metrics proxy  (Håkon Hallingstad, 2019-03-22, 3 files, -36/+5)
|
* Put accessors back inside MutableStatusRegistry  (Jon Marius Venstad, 2019-02-11, 10 files, -29/+48)
|
* Use a more specific ZK path for cache counter  (Jon Marius Venstad, 2019-02-11, 1 file, -2/+2)
|
* Add note  (Jon Marius Venstad, 2019-02-08, 1 file, -0/+5)
|
* Remove ReadOnlyStatusService  (Jon Marius Venstad, 2019-02-08, 8 files, -92/+32)
|
* Expose host status cache, and use it for all bulk operations  (Jon Marius Venstad, 2019-02-08, 13 files, -72/+153)
|
* Replace test usage of InMemoryStatusService  (Jon Marius Venstad, 2019-02-08, 3 files, -21/+27)
|
* Remove unused annotation  (Jon Marius Venstad, 2019-02-08, 3 files, -21/+0)
|
* Make all StatusService-es ZookeeperStatusService-es  (Jon Marius Venstad, 2019-02-08, 2 files, -41/+16)
|
* Remove InMemoryStatusService, and clean up curator usage in ZookeeperStatusServiceTest  (Jon Marius Venstad, 2019-02-08, 3 files, -165/+35)
|
* Fix locking in InMemoryStatusService  (Jon Marius Venstad, 2019-02-08, 1 file, -5/+6)
|
* Invalidate cache only on changes or exceptions  (Jon Marius Venstad, 2019-02-08, 1 file, -8/+17)
|
* Actually make sure cache is up-to-date while lock is held  (Jon Marius Venstad, 2019-02-08, 1 file, -0/+1)
|
* Simplify, using pre-computed host-to-application map  (Jon Marius Venstad, 2019-02-08, 1 file, -28/+3)
|
* Read host statuses in bulk per app, and cache until any changes  (Jon Marius Venstad, 2019-02-07, 1 file, -6/+40)
|
* Remove unused or pointless code  (Jon Marius Venstad, 2019-02-07, 2 files, -41/+4)
|
* Health rest API  (Håkon Hallingstad, 2019-01-31, 3 files, -0/+125)
|     Makes a new REST API /orchestrator/v1/health/<ApplicationId> that shows the
|     list of services that are monitored for health.
|
|     This information is currently a bit difficult to infer from
|     /orchestrator/v1/instances/<ApplicationInstanceReference> since it is the
|     combined view of health and Slobrok. There are already APIs for Slobrok.
|
|     Example content:
|
|     $ curl -s localhost:19071/orchestrator/v1/health/hosted-vespa:zone-config-servers:default | jq .
|     {
|       "services": [
|         {
|           "clusterId": "zone-config-servers",
|           "serviceType": "configserver",
|           "configId": "zone-config-servers/cfg6",
|           "status": {
|             "serviceStatus": "UP",
|             "lastChecked": 1548939111.708718,
|             "since": 1548939051.686223,
|             "endpoint": "http://cfg4.prod.cd-us-central-1.vespahosted.ne1.yahoo.com:19071/state/v1/health"
|           }
|         },
|         ...
|       ]
|     }
|
|     This view is slightly different from the application model view, just because
|     that's exactly how the health monitoring is structured (individual monitors
|     against endpoints). The "endpoint" information will also be added to
|     /instances if the status comes from health and not Slobrok.
* Metadata about /state/v1/health status  (Håkon Hallingstad, 2019-01-25, 3 files, -14/+26)
|     The service monitor uses /state/v1/health to monitor config servers and the
|     host admins (but not yet tenant host admins). This commit adds some metadata
|     about the status of a service:
|      - The time the status was last checked
|      - The time the status changed to the current one
|     This can be used to e.g. make more intelligent decisions in the Orchestrator,
|     such as only allowing a service to suspend if it has been DOWN longer than X
|     seconds (to avoid a spurious DOWN breaking redundancy and uptime guarantees).
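
A hedged sketch of how the new "since" timestamp could feed the suspension decision
mentioned above. The class, method, and threshold below are illustrative assumptions,
not the actual Orchestrator API.

    import java.time.Duration;
    import java.time.Instant;

    class DownLongEnoughCheck {
        // Assumed threshold; the commit only says "longer than X seconds".
        private static final Duration MIN_DOWN_DURATION = Duration.ofSeconds(30);

        // Treat a DOWN service as safe to suspend only once it has been down long
        // enough that a spurious DOWN would not break redundancy/uptime guarantees.
        static boolean downLongEnoughToSuspend(String serviceStatus, Instant since, Instant now) {
            if (!"DOWN".equals(serviceStatus)) return false;
            return Duration.between(since, now).compareTo(MIN_DOWN_DURATION) >= 0;
        }
    }
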
* Nonfunctional changes only  (Jon Bratseth, 2019-01-21, 9 files, -19/+24)
|
* 6-SNAPSHOT -> 7-SNAPSHOT  (Arnstein Ressem, 2019-01-21, 1 file, -2/+2)
|
* Avoid stacktrace in log on timeouts  (Håkon Hallingstad, 2018-11-09, 3 files, -8/+66)
|
* Fix probe message  (Håkon Hallingstad, 2018-11-05, 1 file, -2/+2)
|
* Send probe when suspending many nodes  (Håkon Hallingstad, 2018-11-05, 16 files, -121/+157)
|     When suspending all nodes on a host, first do a suspend-all probe that will
|     try to suspend the nodes as normal in the Orchestrator and cluster
|     controller, but without actually committing anything. A probe failure
|     results in the same failure as a non-probe failure: a 409 response with a
|     description is sent back to the client.
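
A rough sketch of the probe-then-commit flow described in the commit message above;
the names and the suspend(...) signature are assumptions for illustration only.

    // Run the full suspend path once as a dry run; commit only if the probe passes.
    interface BatchSuspender {
        // probe == true: evaluate Orchestrator and cluster-controller constraints,
        // but do not commit any host-status or cluster-state change.
        void suspend(String parentHostname, boolean probe) throws SuspensionDeniedException;

        default void suspendAllOn(String parentHostname) throws SuspensionDeniedException {
            suspend(parentHostname, true);   // a probe failure surfaces as the same 409 as a real one
            suspend(parentHostname, false);  // commit only after the probe succeeded
        }
    }

    class SuspensionDeniedException extends Exception {
        SuspensionDeniedException(String message) { super(message); }
    }
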
* Wrap CC HTTP failures in 409  (Håkon Hallingstad, 2018-11-01, 9 files, -159/+91)
|
* Revert "Revert "Revert "Revert "Enforce CC timeouts in Orchestrator 4""""Håkon Hallingstad2018-11-0119-308/+496
|
* Revert "Revert "Revert "Enforce CC timeouts in Orchestrator 4"""Håkon Hallingstad2018-11-0119-496/+308
|
* Revert "Revert "Enforce CC timeouts in Orchestrator [4]""Håkon Hallingstad2018-11-0119-308/+496
|
* Revert "Enforce CC timeouts in Orchestrator [4]"Harald Musum2018-10-3119-496/+308
|
* Retry twice if only 1 CC  (Håkon Hallingstad, 2018-10-30, 5 files, -20/+22)
|     Caught in the systemtests: if there's only one CC, there will be only one
|     request, with a timeout of ~5s, whereas today the real timeout is 10s. This
|     appears to make a difference to the systemtests, as converging to a cluster
|     state may take several seconds. There are two solutions:
|      1. Allocate ~10s to the CC call, or
|      2. Make another ~5s call to the CC if the first one fails.
|     (2) is simpler to implement for now. To implement (1), the timeout
|     calculation could receive the number of backends as a parameter, but that
|     would make the already complex logic here even worse. Or, we could reserve
|     only enough time for one call (abandoning the two-call logic). TBD later.
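
A sketch of option (2) above, with illustrative names only: when the list holds a
single cluster controller, call it twice so the effective budget is comparable
(~10s) to the multi-CC case.

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;

    class ClusterControllerRetry {
        interface Call { void run(String ccHost) throws IOException; }

        static void callWithRetry(List<String> clusterControllers, Call call) throws IOException {
            List<String> attempts = new ArrayList<>(clusterControllers);
            if (attempts.size() == 1) attempts.add(attempts.get(0));  // retry the lone CC once
            IOException lastFailure = null;
            for (String ccHost : attempts) {
                try {
                    call.run(ccHost);                 // each attempt gets its own ~5s timeout
                    return;                           // first success wins
                } catch (IOException e) {
                    lastFailure = e;
                }
            }
            throw lastFailure != null ? lastFailure : new IOException("no cluster controllers configured");
        }
    }
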
* Revert "Revert "Revert "Revert "Enforce CC timeouts in Orchestrator 2""""Håkon Hallingstad2018-10-3019-306/+492
|
* Revert "Revert "Revert "Enforce CC timeouts in Orchestrator 2"""Håkon Hallingstad2018-10-3019-492/+306
|
* Revert "Revert "Enforce CC timeouts in Orchestrator 2""Håkon Hallingstad2018-10-2919-306/+492
|
* Revert "Enforce CC timeouts in Orchestrator 2"Håkon Hallingstad2018-10-2919-492/+306
|
* Merge branch 'master' into hakonhall/enforce-cc-timeouts-in-orchestrator-2  (Håkon Hallingstad, 2018-10-26, 3 files, -38/+58)
|\
| * Use dynamic port in tests  (Harald Musum, 2018-10-24, 2 files, -38/+56)
| |
| * Add GET suspended status to application/v2  (Jon Bratseth, 2018-10-22, 1 file, -0/+2)
| |
* | Fixes after review round  (Håkon Hallingstad, 2018-10-26, 5 files, -34/+30)
| |
* | Enforce CC timeouts in Orchestrator 2  (Håkon Hallingstad, 2018-10-23, 19 files, -306/+496)
|/
* Replace 'tonytv' with full name in author tags  (Bjørn Christian Seime, 2018-07-05, 6 files, -6/+6)
|
* Correct share-remaining-time  (Håkon Hallingstad, 2018-06-25, 1 file, -1/+1)
|
* Remove unused method  (Håkon Hallingstad, 2018-06-25, 1 file, -4/+0)
|
* Avoid fatal first CC request timeout  (Håkon Hallingstad, 2018-06-25, 4 files, -9/+36)
|     If the first setNodeState to the "first" cluster controller times out, then
|     we'd like to leave enough time to try the second CC. This avoids making a
|     single CC a single point of failure.
|
|     The strategy is to set a timeout of 50% of the remaining time, so if
|     everything times out, the timeouts would roughly be 50%, 25%, and 12.5% of
|     the original timeout. An alternative strategy would be to use 33% for each,
|     which would be more democratic.
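
A small worked example of the 50%-of-remaining-time strategy above (illustrative
only; the real code uses the TimeBudget/Clock machinery referenced in the commits
further down):

    import java.time.Duration;

    class HalvingTimeouts {
        public static void main(String[] args) {
            Duration remaining = Duration.ofSeconds(60);       // assumed original budget
            for (int attempt = 1; attempt <= 3; attempt++) {
                Duration timeout = remaining.dividedBy(2);     // 50% of what is left
                System.out.println("attempt " + attempt + ": " + timeout.toMillis() + " ms");
                remaining = remaining.minus(timeout);          // worst case: the attempt used it all
            }
            // Prints 30000, 15000, 7500 ms, i.e. 50%, 25%, and 12.5% of the original 60 s.
        }
    }
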
* set-node-state timeout in CC  (Håkon Hallingstad, 2018-06-22, 4 files, -12/+8)
|
* Revert "Revert "Move TimeBudget to vespajlib and use Clock""Håkon Hallingstad2018-06-228-36/+49
|
* Revert "Move TimeBudget to vespajlib and use Clock"Harald Musum2018-06-218-49/+36
|
* Use UncheckedTimeoutException from guava  (Håkon Hallingstad, 2018-06-21, 2 files, -4/+5)
|