Check node or cluster health in post-start script #87

Open
@MatthiasWinzeler

Description

Hi all

While recently changing the configuration of our RabbitMQ deployment, we ran into the following issue:

  • The canary was updated and came up (the process was running), but RabbitMQ could not start properly or crashed. This rendered the node useless. However, since the process was running, monit reported 'running' and the deployment moved on.
  • The second node was updated as well. It also came up unhealthy, but with a running process.
  • The third node had the same issue.

Result: The whole cluster was down. The post-deploy script then failed on the nodes, so we noticed something was wrong.

If we added a post-start script, this scenario could not happen.
Since post-start scripts run on each VM before BOSH marks it as healthy, a failure on an individual node would be caught before BOSH moves on to the next instance.

UAA already does that in a similar fashion: https://github.com/cloudfoundry/uaa-release/blob/develop/jobs/uaa/templates/bin/post-start
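
To make the proposal concrete, here is a rough sketch of the kind of retry helper a post-start script could use. It is not taken from any existing release: the function name `wait_until_healthy`, its arguments, and the retry values are hypothetical and only illustrate the idea.

```shell
#!/usr/bin/env bash
# Hypothetical post-start helper: retry a health command until it succeeds.

# wait_until_healthy ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Returns 0 as soon as COMMAND exits 0, or 1 after ATTEMPTS failed tries.
wait_until_healthy() {
  local attempts="$1" delay="$2"
  shift 2
  local i
  for ((i = 1; i <= attempts; i++)); do
    # Run the health command; its output is discarded, only the exit code matters.
    if "$@" >/dev/null 2>&1; then
      return 0
    fi
    # Do not sleep after the final failed attempt.
    if ((i < attempts)); then
      sleep "$delay"
    fi
  done
  return 1
}
```

A post-start script could then end with something like `wait_until_healthy 30 10 rabbitmqctl status || exit 1`, where the 30 attempts and 10 second delay are placeholder values. A non-zero exit from post-start makes BOSH treat the instance as failed, which would stop the rollout at the canary. Note that `rabbitmqctl status` only shows that the node is running and responding; whether a stricter node or cluster check is needed is still open.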

What do you think? We're happy to provide the post-start as a PR.
