- Greg Poirer
On May 12th, 2016, Heavybit member company Opsee hosted its first-ever WelpConf. The event featured guest speaker Andy Smith from Wercker and a panel moderated by Dan Turchin of BigPanda, featuring Andy Smith & Greg Poirer from Opsee. Videos of the talk and panel are below, along with further thoughts from Greg.
Have you ever been on-call? Has your phone beeped and buzzed at 2 a.m.? Have you ever looked at an alert and thought to yourself, “Welp?”
Striving for Stability
Systems are representations of reality that strive toward determinism, and fail. They are only perfect in our minds. While we make every effort to translate our desires into reality, we often fail to consider every possible input or address every possible failure scenario during implementation, and this leads to instability.
Every action between building locally on a developer’s laptop and deploying to production should be reproducible.
Only that which is reproducible can be stable. The unknown may never be fully known, but 99.99% availability is a laudable achievement. By working toward reproducibility in our actions, we can make decisions with some assurance that the system will behave in a predictable fashion. Docker is a powerful tool and a step toward reproducibility, but there are right ways to do things and there are wrong ways to do things.
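To put the 99.99% figure in concrete terms, the arithmetic below converts an availability target into a downtime budget. This is only a sketch: the function name and the 365-day year are assumptions made for illustration, not something taken from the talk.

```python
def downtime_budget_minutes(availability_pct: float, period_days: float = 365.0) -> float:
    """Return the minutes of downtime allowed in a period for an availability target.

    availability_pct is a percentage such as 99.99; period_days defaults to
    a 365-day year (an assumption, leap years are ignored).
    """
    if not 0.0 <= availability_pct <= 100.0:
        raise ValueError("availability_pct must be between 0 and 100")
    total_minutes = period_days * 24 * 60
    # The unavailable fraction of the period is (100 - target) / 100.
    return total_minutes * (100.0 - availability_pct) / 100.0


if __name__ == "__main__":
    print(round(downtime_budget_minutes(99.99), 2))
```

Under these assumptions, 99.99% leaves roughly 52.6 minutes of downtime per year, which is not much room for manual, unrepeatable steps.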
The IT Ops Journey: Then, Now, and Next
Computer operators, systems technicians, systems administrators, operations engineers, systems engineers, developers. Throughout the history of computers, the only constant has been operations. Someone is responsible for making sure things are doing what they should be doing. When the fire went out, Homo erectus would rekindle the flame.
Playbooks, battleplans, cron job failure e-mails, and scheduled restarts of services have been the tried-and-true methods of reacting to operational incidents since time immemorial. The sysadmin’s handbook of twenty years ago still very much describes the monitoring of today.
Monitoring has been the Big Five for as long as most of us can remember: ping tests, process checks, CPU utilization, memory utilization, and disk utilization.
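As a rough illustration of how little the Big Five ask of a system, the sketch below evaluates all five against a single host sample. The `HostSample` type, its field names, and the 90% threshold are hypothetical, invented here for illustration; they do not come from any particular monitoring tool.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class HostSample:
    """One hypothetical observation of a host (all fields are assumptions)."""
    ping_ok: bool          # did the host answer a ping test?
    process_running: bool  # is the watched process alive?
    cpu_pct: float         # CPU utilization, 0-100
    memory_pct: float      # memory utilization, 0-100
    disk_pct: float        # disk utilization, 0-100


def big_five_alerts(sample: HostSample, max_pct: float = 90.0) -> List[str]:
    """Return the names of the Big Five checks that fail for this sample."""
    alerts: List[str] = []
    if not sample.ping_ok:
        alerts.append("ping")
    if not sample.process_running:
        alerts.append("process")
    # The three utilization checks share one threshold in this sketch.
    for name in ("cpu", "memory", "disk"):
        if getattr(sample, f"{name}_pct") > max_pct:
            alerts.append(name)
    return alerts


if __name__ == "__main__":
    sample = HostSample(ping_ok=True, process_running=False,
                        cpu_pct=35.0, memory_pct=95.5, disk_pct=60.0)
    print(big_five_alerts(sample))
```

Checks like these report what is failing and where, but say nothing about why, or about what else is likely to fail as a result.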
In the past few years, intrepid developers have made the things under observation and the ways that we observe them increasingly sophisticated. We’ve gotten exceedingly good at being able to answer the question, “What is happening and where is it happening?” We have not, however, taken to going a step further.
Our distrust of the systems we build has left us in a position of constant manual intervention and ad hoc automation of remediation.
We need goals for IT operations and monitoring. We need to be able to build trust in the systems we build. Anything emitted by a system should be recorded, catalogued, and understood. Relationships between components should be well-defined and discoverable. The people building and deploying systems should have a comprehensive understanding of their failure domains, and the systems themselves should help inform that comprehension.
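One way to picture well-defined, discoverable relationships is a dependency map that a tool can query for a component's failure domain. The sketch below is a minimal example under stated assumptions: the service names, the shape of the `depends_on` mapping, and the `failure_domain` function are all hypothetical, and "failure domain" is taken here to mean everything that depends, directly or transitively, on the failed component.

```python
from collections import deque
from typing import Dict, List, Set


def failure_domain(depends_on: Dict[str, List[str]], failed: str) -> Set[str]:
    """Return every component that directly or transitively depends on `failed`."""
    # Invert the map: for each component, who depends on it?
    dependents: Dict[str, Set[str]] = {}
    for component, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, set()).add(component)

    # Breadth-first walk outward from the failed component.
    affected: Set[str] = set()
    queue = deque([failed])
    while queue:
        current = queue.popleft()
        for component in dependents.get(current, ()):
            if component not in affected:
                affected.add(component)
                queue.append(component)

    # With cyclic dependencies the walk can reach the failed component itself.
    affected.discard(failed)
    return affected


if __name__ == "__main__":
    # Hypothetical topology for illustration only.
    topology = {
        "web": ["api"],
        "api": ["db", "cache"],
        "worker": ["db"],
        "db": [],
        "cache": [],
    }
    print(sorted(failure_domain(topology, "db")))
```

In this made-up topology, losing `db` affects `api`, `web`, and `worker`, while losing `cache` affects only `api` and `web`. The value is less in the traversal than in having the relationships recorded somewhere a system can read them.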