SF Metrics: Richard Waid & Ben Hartshorne Ted Carstensen
Richard Waid, Director of Monitoring Infrastructure, LinkedIn
In the past 6 years, the Monitoring team at LinkedIn has dealt with an explosive change in scale from 12k to 850M individual metrics, as well as a migration from NOC-based escalation to direct remediation and escalation. In this talk, Richard gives a brief overview of how they accomplished that, how they fit into the overall engineering ecosystem, and what they’re doing for the next major evolution in their journey. Along the way Richard covers a few of their major lessons: protecting the ecosystem against well-meaning users, planning for explosive scaling, and the global vs. local optima challenge of self-service tooling.
Ben Hartshorne, Software Engineer, Honeycomb
The two main methods of reducing high volume instrumentation data to a manageable load are aggregation and sampling. Aggregation is well understood, but sampling remains a mystery.
In this talk, Ben starts by laying down the basic ground rules for sampling: what it means and how to implement the simplest methods. There are many ways to think about sampling, but with a good starting point, you gain immense flexibility. Once you have the basics of what it means to sample, Ben looks at some different traffic patterns and the effect of sampling on each. When do you lose visibility into your service with simple sampling methods? What can you do about it?
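As a rough illustration of the simplest method mentioned above, the sketch below keeps about 1 in N events at random and records the rate on each kept event so totals can be re-estimated later. The function names and the `sample_rate` field are assumptions made for this example, not taken from the talk or from any particular library.

```python
import random


def sample_events(events, sample_rate, rng=None):
    """Keep roughly 1 in `sample_rate` events, tagging each kept event with its rate."""
    rng = rng or random.Random()
    kept = []
    for event in events:
        # randrange(n) returns 0 with probability 1/n, so about 1 in n events survive.
        if rng.randrange(sample_rate) == 0:
            kept.append({**event, "sample_rate": sample_rate})
    return kept


def estimated_total(kept):
    """Each kept event stands in for `sample_rate` original events."""
    return sum(event["sample_rate"] for event in kept)
```

Storing the rate alongside each event is what keeps counts recoverable: the sum of the stored rates approximates the original volume. The weakness is that rare events, such as errors amid mostly successful traffic, are dropped at the same rate as everything else.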
Given the patterns of traffic in a modern web infrastructure, there are some solid methods to change how you think about sampling in a way that lets you keep visibility into the most important parts of your infrastructure while maintaining the benefits of transmitting only a portion of your volume to your instrumentation service.
Taking it a step further, you can push these sampling methods beyond their expected boundaries by using feedback from your service and its volume to affect your sampling rates! Your application knows best how the traffic flowing through it varies; allowing it to decide how to sample the instrumentation can give you the ability to reduce total throughput by an order of magnitude while still maintaining the necessary visibility into the parts of the system that matter most.
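One way to picture this feedback loop is a sampler that counts events per key (for example, HTTP status code) over a time window and uses those counts to set the next window's rates. This is only a sketch under assumed names (`DynamicSampler`, `target_per_key`, `roll_window`), not Honeycomb's actual implementation.

```python
import random
from collections import Counter


def rates_from_counts(counts, target_per_key):
    """Choose a rate per key so about `target_per_key` events survive per window."""
    return {key: max(1, n // target_per_key) for key, n in counts.items()}


class DynamicSampler:
    def __init__(self, target_per_key, rng=None):
        self.target_per_key = target_per_key
        self.rng = rng or random.Random()
        self.rates = {}          # rates in force for the current window
        self.window = Counter()  # per-key event counts seen in the current window

    def keep(self, key):
        """Return the sample rate if the event is kept, or 0 if it is dropped."""
        self.window[key] += 1
        rate = self.rates.get(key, 1)  # keys not seen before are always kept
        return rate if self.rng.randrange(rate) == 0 else 0

    def roll_window(self):
        """Feed the observed volume back into the rates for the next window."""
        self.rates = rates_from_counts(self.window, self.target_per_key)
        self.window = Counter()
```

With a target of 100 events per key, a key seen 10,000 times in one window is sampled at 1 in 100 in the next, while a key seen 5 times is kept in full. High-volume, uninteresting traffic is thinned heavily and rare traffic stays visible.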
Ben finishes by bringing up some examples of dynamic sampling in Honeycomb’s infrastructure and talks about how it lets them see individual events of interest while keeping only 1/1000th of the overall traffic.