Description
Relevant discussion in Discord: https://discord.com/channels/880272256100601927/884130225280139394/921455993135697972
This morning I was involved in a production deploy that led to some odd, one-off runtime errors. We work in a rather large arc app that has fairly consistent traffic and takes about 5 minutes to deploy the full CloudFormation stack. In this deploy we added a new @event. After deploying to prod, we saw the Lambda for a DynamoDB data stream trigger we have in place start erroring out with 'unknown event '. This particular line of code is the source of the error we were seeing: https://github.com/architect/functions/blob/main/src/events/publish-topic.js#L15-L19
Taking a look at the cloudformation event logs for the deploy, I did notice this particular timeline of events:
1. 2021-12-17 09:20:42 UTC-0500 NewEventTopic CREATE_IN_PROGRESS: the SNS topic begins to be created
2. 2021-12-17 09:20:43 UTC-0500 NewEventLambda CREATE_IN_PROGRESS: the lambda for the new event begins to be created
3. 2021-12-17 09:20:45 UTC-0500 DataStreamLambda UPDATE_IN_PROGRESS: code for the data stream lambda begins to be updated
4. 2021-12-17 09:20:52 UTC-0500 NewEventLambda CREATE_COMPLETE: new event lambda creation complete
5. 2021-12-17 09:20:53 UTC-0500 NewEventTopic CREATE_COMPLETE: SNS topic creation complete
6. 2021-12-17 09:20:55 UTC-0500 DataStreamLambda UPDATE_COMPLETE: data stream lambda update complete; new code in the data stream lambda that publishes to the new topic is now live at this point IIUC?
7. 2021-12-17 09:20:56 UTC-0500 NewEventTopicParam CREATE_IN_PROGRESS: the SSM parameter that informs @architect/functions' service map of where the new event SNS topic exists begins to be created
8. 2021-12-17 09:20:59 UTC-0500 NewEventTopicParam CREATE_COMPLETE: the SSM parameter informing the service map is now ready
So, if my snooping around is correct, between steps 6 and 8 (about a 4 second window), the new lambda code in the data stream is live but the SSM parameter informing arc/functions where the new event exists is not. Any executions of the data stream lambda in that window would not know where to look for the new event, and would trigger the ReferenceError inside arc/functions.
Two solutions considered:
- Can we front-load creation of SSM parameters during a deploy, if that's possible? Probably only via the use of DependsOn, and the only way I can see this working out is if we mark all Lambdas as depending on all SSM parameters. Probably overkill?
- Instead of erroring out immediately in arc/functions if something can't be found in the service map, could we retrieve the service map from SSM Parameter Store one more time? So instead of throwing a ReferenceError right away, we run the service discovery routine once more, and only if the event still can't be found do we error out.
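
To make the two options more concrete, here is a rough sketch of each. None of this is taken from the arc codebase. The function names (addParamDependencies, createTopicLookup), the discover callback, and the { events: { name: arn } } shape of the service map are all hypothetical, and I'm assuming the generated template uses the standard AWS::SSM::Parameter and AWS::Lambda::Function / AWS::Serverless::Function resource types. I also haven't checked whether blanket DependsOn entries could introduce circular dependencies in a real template.

```javascript
// Option 1 (sketch): make every Lambda wait for every SSM parameter.
// `template` is a plain CloudFormation template object. DependsOn is a
// standard CloudFormation attribute; the assumption here is that the
// generated template uses exactly these resource types.
function addParamDependencies (template) {
  const resources = template.Resources || {}
  const params = Object.keys(resources)
    .filter(name => resources[name].Type === 'AWS::SSM::Parameter')
  const lambdaTypes = ['AWS::Lambda::Function', 'AWS::Serverless::Function']
  for (const resource of Object.values(resources)) {
    if (!lambdaTypes.includes(resource.Type)) continue
    // DependsOn may be absent, a single string, or an array of strings
    const existing = [].concat(resource.DependsOn || [])
    resource.DependsOn = [...new Set([...existing, ...params])]
  }
  return template
}

// Option 2 (sketch): re-run service discovery once before giving up.
// `discover` is a hypothetical async function that returns a fresh
// service map shaped like { events: { eventName: topicArn } }.
function createTopicLookup (discover) {
  let cache = null
  return async function lookupTopic (name) {
    if (!cache) cache = await discover()
    if (!cache.events || !cache.events[name]) {
      // Miss: the parameter may have been created after the map was cached,
      // so fetch the service map one more time before erroring out
      cache = await discover()
    }
    const arn = cache.events && cache.events[name]
    if (!arn) throw new ReferenceError(`${name} event not found`)
    return arn
  }
}
```

With something like option 2, an invocation that lands inside the 4 second window would still fail if the parameter isn't there on the second fetch, but the next invocation after step 8 would pick it up instead of being stuck with a stale cached map. The cost is one extra SSM read per lookup of a name that isn't in the map.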