PagerDuty’s DevOps: Avoiding a Cyber Monday Fail
Last year an estimated $7.35 billion was spent online during the Black Friday and Cyber Monday weekend. Coupled with the fact that engineering teams are often short-staffed, with many engineers requesting the week off, the Thanksgiving weekend could be the makings of a perfect storm. We caught up with PagerDuty’s DevOps Lead Arup Chakrabarti to hear his tips for managers during this peak shopping season.
In past years, Chakrabarti’s own team has actually been more responsive on Cyber Monday than during regular hours. He offers, “We reminded everyone of the importance of these shopping days and that they represented significant revenue for the entire year. This instills a sense of urgency and responsibility in everyone.”
Once teams are well aware of the significance of the Thanksgiving weekend, Chakrabarti also ensures that teams are well-prepared for what is likely to be an onslaught of requests.
- On-Call Schedules with Daily Rotation: Make sure that you have on-call schedules covered for all of your engineering teams. If you do not want someone to have to cover the entire holiday weekend, a daily rotation (instead of weekly) distributes that on-call load.
- Anticipate Traffic: Be mindful that Black Friday and Cyber Monday are major events and try to predict what your traffic pattern is going to look like. Will it be 10x, 100x, 1000x? Any engineering team that focuses on managing its operations properly will know these numbers, because they affect the way that you plan for these major events.
- Define Escalation Path: Have the appropriate business escalation contacts defined ahead of time. During these major events, if your systems are not performing adequately, a common tactic is to disable functionality until traffic dies down, but you need the input from your business partners to make the right decisions here.
- Have a Plan: Have your incident response plan ready. Do not try to invent one on the fly when your site is down. Make sure everyone knows what is expected of them before downtime occurs.
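
To make the first two tips concrete, here is a minimal Python sketch of a daily on-call rotation and a rough peak-capacity estimate. It is only an illustration under stated assumptions: the function names, the engineer names, the baseline request rate, and the 25% headroom are all hypothetical, and none of this uses or represents PagerDuty's own scheduling API.

```python
from datetime import date, timedelta
from itertools import cycle


def daily_rotation(engineers, start, end):
    """Assign one engineer per calendar day from start to end (inclusive),
    cycling through the list in order. Hypothetical helper for illustration."""
    if not engineers:
        raise ValueError("need at least one engineer")
    days = (end - start).days + 1
    pool = cycle(engineers)
    return {start + timedelta(days=i): next(pool) for i in range(days)}


def required_capacity(baseline_rps, multiplier, headroom=0.25):
    """Estimate the peak requests per second to plan for.
    The headroom fraction is an assumed safety margin, not a recommendation."""
    return baseline_rps * multiplier * (1 + headroom)


if __name__ == "__main__":
    # Hypothetical team and dates covering a five-day holiday stretch.
    schedule = daily_rotation(
        ["alice", "bob", "carol"], date(2024, 11, 28), date(2024, 12, 2)
    )
    for day, engineer in schedule.items():
        print(day.isoformat(), engineer)

    # Assumed baseline of 200 requests per second and a predicted 10x spike.
    print(required_capacity(200, 10))
```

With three engineers and a five-day window, nobody in this sketch is on call for more than two days, which is the point of rotating daily rather than weekly. The capacity figure is only as good as the traffic multiplier you feed it, so it is worth computing for each scenario (10x, 100x, 1000x) before the weekend.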