PagerDuty’s DevOps: Avoiding a Cyber Monday Fail
By Dana Oshiro
Last year an estimated $7.35 billion was spent online during the Black Friday and Cyber Monday weekend. With engineering teams often short-staffed because many people request the week off, the Thanksgiving weekend has the makings of a perfect storm. We caught up with PagerDuty’s DevOps Lead Arup Chakrabarti to hear his tips for managers during this peak shopping season.
In past years, Chakrabarti’s own team has actually been more responsive on Cyber Monday than during regular hours. He offers, “We reminded everyone of the importance of these shopping days and that they represented significant revenue for the entire year. This instills a sense of urgency and responsibility in everyone.”
Once teams are well aware of the significance of the Thanksgiving weekend, Chakrabarti also ensures that teams are well-prepared for what is likely to be an onslaught of requests.
- On-Call Schedules with Daily Rotation: Make sure that you have on-call schedules covered for all of your engineering teams. If you do not want someone to have to cover the entire holiday weekend, a daily rotation (instead of weekly) distributes that on-call load.
- Anticipate Traffic: Be mindful that Black Friday and Cyber Monday are major events and try to predict what your traffic pattern is going to look like. Will it be 10x, 100x, 1000x? Any engineering team that manages its operations properly will know these numbers, because they affect the way that you plan for these major events.
- Define Escalation Path: Have the appropriate business escalation contacts defined ahead of time. During these major events, if your systems are not performing adequately, a common tactic is to disable functionality until traffic dies down, but you need the input from your business partners to make the right decisions here.
- Have a Plan: Have your incident response plan ready. Do not try to invent one on the fly when your site is down. Make sure everyone knows what is expected of them before downtime occurs.
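
To make the daily rotation tip concrete, here is a minimal sketch of how a team might spread on-call duty across the holiday weekend. It is only an illustration, not how PagerDuty builds schedules: the function name `daily_rotation`, the engineer names, and the dates are all hypothetical.

```python
from datetime import date, timedelta
from itertools import cycle


def daily_rotation(engineers, start, end):
    """Assign one engineer per day, round-robin, from start to end inclusive."""
    if not engineers:
        raise ValueError("need at least one engineer")
    schedule = {}
    rota = cycle(engineers)  # loop back to the first engineer when the list runs out
    day = start
    while day <= end:
        schedule[day] = next(rota)
        day += timedelta(days=1)
    return schedule


# Hypothetical team and dates: Thanksgiving Thursday through Cyber Monday.
team = ["alice", "bob", "carol"]
weekend = daily_rotation(team, date(2024, 11, 28), date(2024, 12, 2))

for day, engineer in weekend.items():
    print(day.isoformat(), engineer)
```

With three engineers and five days, nobody carries more than two shifts, compared with a single person covering all five days under a weekly rotation.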