Check node or cluster health in post-start script

Hi all

While changing the configuration of our RabbitMQ deployment recently, we ran into the following issue:
- The canary was updated. It came up (the process was running), but RabbitMQ could not start properly or crashed. This rendered the node useless. However, since the process was running, monit reported 'running' and the deployment moved on.
- The second node was updated as well. It also came up unhealthy, but with a running process.
- The third node had the same issue.

Result: The whole cluster was down. The post-deploy script then failed on the nodes, so we noticed something was wrong.

If we added a `post-start` script, this scenario could not happen.
Since `post-start` scripts run on each VM before BOSH marks it as healthy, an individual node failure would be caught before BOSH moves on to the next instance.
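
To make the idea concrete, here is a rough sketch of what such a `post-start` could look like. It is not taken from any existing release. The function name `wait_for_healthy`, the variables `HEALTH_CMD`, `TIMEOUT_SECONDS` and `INTERVAL_SECONDS`, and their defaults are all hypothetical. It also assumes `rabbitmqctl` is on the `PATH` and that `rabbitmqctl status` failing is a good enough signal for an unhealthy node.

```shell
#!/usr/bin/env bash
# Hypothetical post-start sketch: poll a health command until it
# succeeds or a timeout expires. All names and defaults are assumptions.

# Command whose exit code decides whether the node is healthy.
HEALTH_CMD="${HEALTH_CMD:-rabbitmqctl status}"
# Give up after this many seconds of waiting.
TIMEOUT_SECONDS="${TIMEOUT_SECONDS:-60}"
# Pause between attempts.
INTERVAL_SECONDS="${INTERVAL_SECONDS:-5}"

wait_for_healthy() {
  local waited=0
  # HEALTH_CMD is intentionally unquoted so it splits into command and arguments.
  until $HEALTH_CMD >/dev/null 2>&1; do
    if [ "$waited" -ge "$TIMEOUT_SECONDS" ]; then
      echo "node not healthy after ${TIMEOUT_SECONDS}s" >&2
      return 1
    fi
    sleep "$INTERVAL_SECONDS"
    waited=$((waited + INTERVAL_SECONDS))
  done
  return 0
}
```

The script would end with `wait_for_healthy || exit 1`, so that a node that never becomes healthy makes `post-start` exit non-zero and stops the rollout at the canary. A cluster-level check could be swapped in through `HEALTH_CMD` if a per-node check turns out to be too weak.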

UAA already does that in a similar fashion: https://github.com/cloudfoundry/uaa-release/blob/develop/jobs/uaa/templates/bin/post-start

What do you think? We're happy to provide the `post-start` as a PR.
