Hi all
While recently changing the configuration of our RabbitMQ deployment, we ran into the following issue:
- The canary was updated. It came up (the process was running), but RabbitMQ could not start properly or crashed. This rendered the node useless. However, since the process was running, monit reported 'running' and the deployment moved on.
- The second node was updated as well. It also came up unhealthy, but with a running process.
- The third node had the same issue.
Result: The whole cluster was down. The post-deploy script then failed on the nodes, so we noticed something was wrong.
If we added a post-start script, this scenario could not happen.
Since post-start scripts are run on each VM before BOSH marks it as healthy, individual node failures would be caught before the deployment moves on to the next instance.
UAA already does that in a similar fashion: https://github.com/cloudfoundry/uaa-release/blob/develop/jobs/uaa/templates/bin/post-start
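To make the proposal more concrete, here is a rough sketch of the retry logic such a post-start script could use. It is not taken from the UAA release or from any existing RabbitMQ job. The function name `wait_until_healthy`, the attempt count, and the sleep interval are all hypothetical, and the actual health check command is left to the caller.

```shell
#!/bin/bash
# Hypothetical helper for a post-start script (a sketch, not an existing job file).
# Usage: wait_until_healthy <max_attempts> <sleep_seconds> <command> [args...]
# Returns 0 as soon as the command succeeds, 1 if every attempt fails.
wait_until_healthy() {
  local max_attempts="$1"
  local sleep_seconds="$2"
  shift 2
  local attempt=1
  while [ "$attempt" -le "$max_attempts" ]; do
    # Run the caller-supplied health check, discarding its output.
    if "$@" >/dev/null 2>&1; then
      echo "health check passed on attempt ${attempt}"
      return 0
    fi
    # Only wait if another attempt is still to come.
    if [ "$attempt" -lt "$max_attempts" ]; then
      sleep "$sleep_seconds"
    fi
    attempt=$((attempt + 1))
  done
  echo "health check still failing after ${max_attempts} attempts" >&2
  return 1
}
```

A post-start script could then end with something like `wait_until_healthy 30 2 rabbitmq-diagnostics -q check_running || exit 1`. We are assuming here that `rabbitmq-diagnostics` is on the `PATH` of the job and that `check_running` is the right check for our case. A non-zero exit makes BOSH treat the instance as failed, so the deployment stops at the canary.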
What do you think? We're happy to provide the post-start script as a PR.