| Commit message | Author | Age | Files | Lines |
|
We already require 3 config server (and controller) nodes, but that is not
sufficient to protect the hosts from being left with only 1 healthy host: say
the config server host application contains 2 nodes. An upgrade of host-admin
on one of those nodes is allowed, since only the host is suspended and neither
of the 2 nodes is down. This is fixed by handling config server hosts similarly
to config servers: assume 3 nodes.
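The rule above can be sketched as follows. This is an illustrative sketch, not
the actual Orchestrator code, and the class and method names are hypothetical:
a config server host cluster is treated as if it always had 3 nodes, of which
a majority (2) must stay healthy, so at most one host may be out at a time.

```java
public class ConfigServerHostPolicy {
    // Assumption from the commit message: treat the cluster as if it has
    // 3 nodes, regardless of how many hosts the application actually has.
    private static final int ASSUMED_CLUSTER_SIZE = 3;

    // Keep a majority (2 of 3) healthy: at most one node may be down.
    private static final int MAX_DOWN = ASSUMED_CLUSTER_SIZE - 2;

    /**
     * Returns true if one more host may suspend, given how many hosts are
     * already down or suspended. With a 2-host application, a second host
     * is denied because the assumed 3-node cluster already has one out.
     */
    public static boolean maySuspendAnotherHost(int hostsAlreadyDown) {
        return hostsAlreadyDown + 1 <= MAX_DOWN;
    }
}
```

With this sketch, suspending the first host succeeds, while suspending a
second host while one is already out is denied.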
|
This PR introduces a new flag, group-suspension, which when true enables:
- Instead of allowing at most one storage node to suspend at any given time,
the Orchestrator will now ignore the storagenode, searchnode, and distributor
service clusters and rely on the cluster controller to allow or deny the
request to suspend. This will increase the load on the cluster controllers.
Combined with earlier changes to the cluster controller, this new flag
effectively guards the feature of allowing all nodes within a hierarchical
group to suspend concurrently.
I also took the opportunity to tune related policies:
- Allow at most one config server and one controller to be down at any given
time. This is actually a no-op, since it was effectively equal to the older
policy of 10% down.
- Allow 20% of all host-admins to be down, not just tenant host-admins. This
is effectively equal to the old policy of 10%, except that it may allow 2
proxy host-admins to go down at the same time. That should be fine.
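The two tuned limits above can be sketched like this. The class and method
names are hypothetical, and the rounding behavior (20% rounded down, but
always allowing at least one) is an assumption, not taken from the actual
policy code:

```java
public class SuspensionLimits {
    /** At most one config server or controller may be down at any given time. */
    public static boolean configServerMaySuspend(int alreadyDown) {
        return alreadyDown == 0;
    }

    /**
     * At most 20% of all host-admins may be down at once. Integer division
     * rounds down; the Math.max guarantees at least one is always allowed.
     */
    public static boolean hostAdminMaySuspend(int totalHostAdmins, int alreadyDown) {
        int maxDown = Math.max(1, totalHostAdmins / 5); // 20%, rounded down
        return alreadyDown < maxDown;
    }
}
```

For example, with 10 host-admins the 20% limit allows 2 to be down at once,
which matches the "2 proxy host-admins" observation in the message above.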
|
application-model/src/main/java/com/yahoo/vespa/applicationmodel/ClusterId.java
Co-authored-by: Harald Musum <musum@verizonmedia.com>
|
Adds a new REST API, /orchestrator/v1/health/<ApplicationId>, that shows the
list of services that are monitored for health. This information is currently
a bit difficult to infer from
/orchestrator/v1/instances/<ApplicationInstanceReference>, since that is the
combined view of health and Slobrok. There are already APIs for Slobrok.
Example content:
$ curl -s localhost:19071/orchestrator/v1/health/hosted-vespa:zone-config-servers:default | jq .
{
  "services": [
    {
      "clusterId": "zone-config-servers",
      "serviceType": "configserver",
      "configId": "zone-config-servers/cfg6",
      "status": {
        "serviceStatus": "UP",
        "lastChecked": 1548939111.708718,
        "since": 1548939051.686223,
        "endpoint": "http://cfg4.prod.cd-us-central-1.vespahosted.ne1.yahoo.com:19071/state/v1/health"
      }
    },
    ...
  ]
}
This view is slightly different from the application model view, simply
because that is exactly how the health monitoring is structured (individual
monitors against endpoints).
The "endpoint" information will also be added to /instances if the status
comes from health rather than Slobrok.
|
The service monitor uses /state/v1/health to monitor the config servers and
the host admins (but not yet the tenant host admins).
This commit adds some metadata about the status of a service:
- the time the status was last checked
- the time the status changed to its current value
This can be used, for example, to make more intelligent decisions in the
Orchestrator, such as only allowing a service to suspend if it has been DOWN
for longer than X seconds (to avoid spurious DOWN statuses breaking redundancy
and uptime guarantees).
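The kind of decision this metadata enables can be sketched as follows. The
class and method names are hypothetical; the epoch-second doubles mirror the
"since"/"lastChecked" format used by the health REST API elsewhere in this
log:

```java
public class DownGracePolicy {
    /**
     * Returns true if a service that went DOWN at downSinceEpochSec has
     * stayed DOWN for longer than graceSeconds as of nowEpochSec. A service
     * that only recently turned DOWN is not yet eligible to suspend, which
     * filters out spurious DOWN reports.
     */
    public static boolean hasBeenDownLongEnough(double downSinceEpochSec,
                                                double nowEpochSec,
                                                double graceSeconds) {
        return nowEpochSec - downSinceEpochSec > graceSeconds;
    }
}
```

With a 30-second grace period, a service DOWN for a minute qualifies, while
one DOWN for only a few seconds does not.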
|
DuperModel is (or will be) responsible for both active tenant applications
(through the SuperModel) and infrastructure applications. This PR is one step
in that direction:
- All infrastructure applications (config, confighost, controller,
controllerhost, and proxyhost) are owned and managed by the DuperModel.
- The InfrastructureProvisioner retrieves all possible infra apps from the
DuperModel (through a reduced API), and "activates" each of them if a
target is set and there are any nodes, etc.
- The InfrastructureProvisioner then notifies the DuperModel of which
apps have been activated, and with which hosts.
- The DuperModel can then artificially create an ApplicationInfo, which gets
translated into the application model, and finally the service model.
- The resulting service model has NOT_CHECKED for each host-admin service
instance. This is sufficient for goal 1 of this sprint.
- The config server application currently has health, so that is kept as-is
for now.
- Feature flags have been tried and work; they allow 1. disabling the addition
of the infra apps in the DuperModel, and 2. enabling the infra config server
instead of the currently created config server with health.
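The interaction described above can be sketched as a minimal in-memory model.
All names here are assumptions for illustration, not the actual Vespa API: the
DuperModel owns the set of infra applications, and the
InfrastructureProvisioner reports back which hosts each activated application
received.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

class DuperModelSketch {
    // The infra applications owned by the DuperModel (from the list above).
    private final List<String> infraApplications =
            List.of("config", "confighost", "controller", "controllerhost", "proxyhost");

    // Which hosts each activated application received.
    private final Map<String, List<String>> activatedHosts = new HashMap<>();

    /** The reduced API the InfrastructureProvisioner reads from. */
    List<String> getInfraApplications() {
        return infraApplications;
    }

    /** Called by the InfrastructureProvisioner after activating an application. */
    void infraApplicationActivated(String application, List<String> hostnames) {
        activatedHosts.put(application, List.copyOf(hostnames));
    }

    /** Hosts of an activated application, if any; the basis for building ApplicationInfo. */
    Optional<List<String>> hostsOf(String application) {
        return Optional.ofNullable(activatedHosts.get(application));
    }
}
```

An application that has not been activated yields an empty Optional, so no
ApplicationInfo would be created for it.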
|
With this PR, if the nodeAdminInContainer ConfigserverConfig has been set, the
service monitor will always report the node admin container service as UP,
thereby avoiding issues related to the standalone node admin seemingly being
down when not running as part of the application.
This postpones checking /state/v1/health until later.
|
- Add missing dependencies so that all provided non-yahoo jars
are listed in container-dependency-versions.
- Add relativePath for all child poms of parent.
|
- Add missing dependencies so that all provided non-yahoo jars
are listed in container-dependency-versions.
- Add relativePath for all child poms of parent.
|
- Add missing dependencies so that all provided non-yahoo jars
are listed in container-dependency-versions.
- Add relativePath for all child poms of parent.
|
This should be a no-op. The only changes that could actually have an impact
are those to how the cluster controllers are retrieved, but they should be
functionally equivalent.
This PR will make it easier to change the Orchestrator policy to allow
suspending several nodes (a NodeGroup) in an application on a single Docker
host.