To reduce load on CMS, we should create CMS Requests with no more than N nodes in each batch.
N could easily be capped at 1000.
The new workflow could look like this:
- collect all nodes to restart and shuffle them randomly
- take N nodes from the shuffled list and put them in a CMS Request with partial_permissions_allowed=true
- restart nodes as CMS grants permissions for them
- if some nodes in the current batch cannot be restarted, move them to a separate batch of blocked nodes (like a DLQ in messaging)
- take the next batch and repeat the previous two steps
- after all batches have finished, split the blocked nodes into batches of at most N and try to finish the restart (availability mode changes could be applied here so that the restart eventually completes)
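
The workflow above could be sketched roughly as follows. This is only an illustration under assumptions, not the actual implementation: `restart_in_batches`, `chunked`, `try_restart_batch` and `max_retry_rounds` are hypothetical names. `try_restart_batch` stands in for "create a CMS Request with partial_permissions_allowed=true for this batch, restart the nodes that got permissions, and return the nodes that could not be restarted". Availability mode changes are not modeled; they could happen inside that callback or between retry rounds.

```python
import random
from typing import Callable, Iterable, List, Optional, Sequence


def chunked(items: Sequence[str], n: int) -> Iterable[List[str]]:
    """Yield consecutive batches of at most n items."""
    for start in range(0, len(items), n):
        yield list(items[start:start + n])


def restart_in_batches(
    nodes: Sequence[str],
    batch_size: int,
    try_restart_batch: Callable[[List[str]], List[str]],  # hypothetical CMS wrapper
    max_retry_rounds: int = 3,  # assumption: bound the retries of blocked nodes
    rng: Optional[random.Random] = None,
) -> List[str]:
    """Restart nodes in batches of at most batch_size.

    Returns the nodes that are still blocked after all retry rounds.
    """
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")

    rng = rng or random.Random()
    pending = list(nodes)
    rng.shuffle(pending)  # random order, as in the first step of the workflow

    # First pass: one CMS Request per batch; nodes that cannot be
    # restarted go to the blocked list (the "DLQ").
    blocked: List[str] = []
    for batch in chunked(pending, batch_size):
        blocked.extend(try_restart_batch(batch))

    # Retry passes: re-split the blocked nodes into batches of at most
    # batch_size and try again, up to max_retry_rounds times.
    for _ in range(max_retry_rounds):
        if not blocked:
            break
        retry, blocked = blocked, []
        for batch in chunked(retry, batch_size):
            blocked.extend(try_restart_batch(batch))

    return blocked
```

With N = 1000, a call would look like `restart_in_batches(all_nodes, 1000, try_restart_batch)`. A non-empty return value means some nodes stayed blocked after the retry rounds and need separate handling.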