Good afternoon. My name's Maarten Van Horenbeeck, I'm the chief information security officer of Zendesk. Also, I work for this organization called FIRST, which really has as a goal to educate individuals and organizations about how to respond to security incidents, and that's a little bit of what I want to do with you today.
In order to do that, I want to start off with defining a little bit what a security incident is. I'm going to take a slightly different approach than what I think you're used to. Typically, when you think about a security incident it's often when you're writing your security policies and you're thinking through "What am I going to actually declare an incident?" I want to take a little bit of a different angle.
In the mid-2000s, when people started talking a lot about this concept called "targeted attacks," which we now know as "advanced persistent threats," there was a group within Lockheed Martin that started working on putting together a model to really describe what an attack typically looks like. They called that the "Cyber Kill Chain," and the Cyber Kill Chain looks at an incident from the very first stages, where an attacker becomes interested in you as an organization, moves through different levels, and finally achieves all their objectives, gaining whatever it is that they want within your organization or from the data that you host.
It was a very interesting way for defenders to start looking at attacks, because it allowed us to split out the different pieces of an attack and, for instance, determine that there was a separate delivery stage where malware or code was being delivered to a system and then it was used on the system in question.
Now, we as incident responders also take a really interesting look at security incidents. We look at it from a completely different perspective. We look at it in the way that we go and mitigate an incident when it actually takes place, and we do that by also looking at a couple of different stages.
Essentially, we're always preparing for the next security incident. We're always thinking through "What is it that we need to do to improve, to make sure that we catch the next incident and we can actually successfully detect and respond to it?" When we start that response, we're always looking first at triaging what is actually happening, because quite often when you are a smaller organization and you don't have dedicated security responders, your people are going to look at every incident somewhat bewildered and somewhat concerned about what is happening. They're trying to figure out "How important is this?" When you have a piece of data that leaked externally, what is the impact on your company? What is the impact on your customers? When you have a user that maybe checked a secret into GitHub, what is the effect of that, and what can it lead to if that secret was in some way obtained by an attacker?
The next step we're going to take is we're going to try and mitigate the incident, and mitigation usually is about stopping the bleeding and stopping the attack right now, so we can take some more time and figure out what else it is we need to do. We'll contain the incident, we'll really figure out how we can stop it from spreading and "How do we make sure that additional systems don't get compromised?" And finally, we'll eradicate the incident by cleaning up those systems or replacing them, or maybe rebuilding or reinstalling machines and deploying them in the environment. Across that line we're going to be thinking about a couple of different things. We're going to be thinking about how we actually, as I said, stop that bleeding and then also how are we going to talk to customers? That's something that is often very difficult when you haven't really done it before.
At the end, we think about [inaudible], we think about "What actually happened and how can we improve?" I'll give you some hints on each of these individual steps in a second, but before I go there I want to talk a bit about the things that can go wrong.
I tried summarizing them in these six boxes. First of all, something that very commonly goes wrong is that customers don't really trust your assessment of the situation. That can happen because there might not be a single voice, maybe multiple people in your company are talking to the media or talking to externals or customers in different ways about what just happened to you. Maybe you've given some information that was incorrect and you have to go and correct it. This happens to everyone, but it's something to be better prepared for when you know that it's likely to happen in the fog of war of a security incident.
A second one I want to highlight, which I think is particularly important for all of us, is that we're often part of a supply chain that we don't recognize. I'll actually use an example of that in a second, but imagine that you as a company have access to someone else's source code or you may have access to another company's AWS secrets.
These types of things are really impactful when someone breaks into your systems to get access to those secrets, whereas you might think that they're actually after the data that you store specifically rather than this integration information, this data that can be used to gain access to someone else. Understanding how your product is being used and what it is being used for and the types of data that you store is really critical.
Your team may also simply not have the right experience to respond if you've never gone through an incident before. There's a lot of opportunities to learn when that happens. Quite often you also won't have all the data, but one of the wonderful things about technology today is that we generate tons and tons of logging data. We don't always have the time and space to store it or to have the knowledge to really analyze it. You may have also under or overreacted to the incident, maybe you thought it wasn't as important as it ended up being. Maybe you thought it was really important, you went out and then you realized that actually the impact was a little bit lower. It takes a bit of knowledge and a little bit of experience to get this right and understand what really the severity is, and if you can start preparing for that as you're building out your program it's usually very valuable.
Then finally, a very silly one, but you may not know how to contact your customers if you have customers sign up to your platform, but you don't actually validate their email because that seemed like a step that might build in some friction. You may not actually be able to reach out to them, even though they may still be able to log into your platform or to log into your application.
So there's a lot of things that can go wrong, and I'm going to use an example of a security incident that happened to a company in the Netherlands back in 2010. Before I do that, I want to be really clear that there's absolutely no intent here to name or shame people that were involved in that incident. It's just that it happens to be one of the very best examples to learn from, because so many things were new in that incident that hadn't happened before.
Back in 2010, there was a company that was a certificate authority, and many of you may be familiar with that. But if you think of your phone or your computer, there is a way that your phone or computer knows how to trust a website, and that's using public key infrastructure. Every phone and every computer has about 150 companies that it explicitly trusts will be able to attest that a particular website belongs to a particular individual. The way that works is that they've assessed that company and the processes that it follows, and that company, whenever someone buys a certificate from it, will actually go out and validate that you're the owner of that particular domain name. That may happen through a process called domain validation, or in a number of other ways. When that company trusts that you actually own your domain name, it will issue a TLS certificate that you can then use to prove that you're the owner of that website.
Almost 10 years ago, there was an incident where one of these companies that was trusted was compromised. It was a small company in the Netherlands, and nobody really thought very much of it in the context of the wider web because it was mostly used on a very regional basis. At about that time, there was a very smart engineer at Google who actually thought that there might be something risky about this entire model, because there are many different companies and even governments that are trusted by browsers and operating systems to issue these certificates. At the time, Google rolled out in Chrome a particular feature called "certificate pinning" that checked whether a certificate that was issued for google.com was in fact issued by the certificate authority that Google used. Within days of rolling out that feature, a user in Iran tried to access mail.google.com, got an error message, and posted that error message on the Google forums. Now the first people to notice this were actually [inaudible], which is the German National Incident Response Team, which is part of the German government. When they discovered it, they reached out to the browser vendors, to the operating system vendors, and to this company in the Netherlands and the Dutch National Security Response Team to let them know that this incident had unfolded. What was learned in the next few days was actually really interesting.
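The pinning check described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not how Chrome implements it: the function names and the placeholder byte strings standing in for certificates are hypothetical, and the sketch pins a SHA-256 hash of a whole certificate, whereas browsers generally pin hashes of public keys in the chain.

```python
import hashlib
import hmac
from typing import Iterable, Set


def fingerprint(der_bytes: bytes) -> str:
    """Return the SHA-256 fingerprint of a DER-encoded certificate as hex."""
    return hashlib.sha256(der_bytes).hexdigest()


def chain_is_pinned(chain_der: Iterable[bytes], pins: Set[str]) -> bool:
    """Accept a chain only if at least one certificate in it matches a pin."""
    for cert in chain_der:
        fp = fingerprint(cert)
        if any(hmac.compare_digest(fp, pin) for pin in pins):
            return True
    return False


# Placeholder byte strings standing in for real DER certificates (hypothetical).
EXPECTED_CA = b"expected-ca-certificate"
OTHER_TRUSTED_CA = b"some-other-trusted-ca-certificate"
LEAF = b"leaf-certificate-for-the-site"

# The site operator pins the CA it actually uses.
PINS = {fingerprint(EXPECTED_CA)}

good_chain = [LEAF, EXPECTED_CA]
rogue_chain = [LEAF, OTHER_TRUSTED_CA]  # valid in the trust store, but not pinned

print(chain_is_pinned(good_chain, PINS))
print(chain_is_pinned(rogue_chain, PINS))
```

The point of the sketch is that the second chain would pass ordinary validation, because its issuer is trusted, and is rejected only because the pin narrows trust to one expected issuer.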
First of all, it turns out that breach had been discovered before, and at the time the company had hired an incident response consultant and did a forensic investigation, and they determined the impact of the breach on themselves as a company. But they hadn't really thought through the implications of the breach on all of these other organizations, so for instance in that incident there was a certificate issued for mail.google.com, but also, for instance [inaudible]. Many different interesting websites had rogue certificates issued at the time. Now, luckily, thanks to a very specific protocol it was actually possible for the company and all the investigators that got interested in this to determine how widespread this exploitation was. It was pretty widespread. About 700,000 connections of people mostly in Iran trying to access their mail were backdoored, or at least in some way accessed without authorization, during that very specific incident. The browser vendors and the operating system vendors took immediate action to mitigate this, but in many ways they did it by blocking requests to use these certificates, actually blacklisting them in the browser or the operating system. This was a really fascinating incident because it was the first time that in the Netherlands there was a national crisis related to a cybersecurity incident. In fact, the prime minister went on TV at 1 AM to assure the citizens that the government was doing what it could to help make things better. It's a very interesting incident because this company probably never really realized the implications that these types of incidents could have, and I think this is something very important for all of us to think about.
A couple of tips that I wanted to leave you with today are, first of all, when you have an incident to make sure that you assign a very clear leader. When you are a smaller organization, when you're just a couple of people, you're probably going to be quite challenged and stressed when an incident happens. I would encourage you to write out a plan ahead of time, even at a high level, that's based on those steps that I talked about at the beginning. If it's clear to you what it is that you're going to be doing next, then you're not going to be as stressed during the incident and you're going to have a better handle on things when you're reacting. It's also very important to make sure that the person leading the response isn't the person doing the technical investigation.
I can't tell you the number of times I've seen people, and honestly I've done it myself, going into logs and then completely losing track of the fact that they just discovered something they have to escalate. Make sure that you have these roles clearly described and clearly allocated.
Also, make sure you have someone focused on communication. It's very difficult when you learn something new every 10 minutes to make sure that you keep a good understanding of what's actually happening, and are very crisp and coherent in how you communicate with others. It's the communication that gains you trust, whether it's from your CEO or your customers.
The second thing is to build relationships before you actually need them. There's a couple of different forums that you can participate in. There's FIRST, and there are also information sharing and analysis centers that focus on these types of things. Quite often it's relatively inexpensive for you to participate in these forums and learn from them. And finally, if you're not ready for that, go to a security conference and ask some of your peers about what it is that they do to prepare for an incident and how you can also help make that better.
One thing that I've learned from that incident that I talked about earlier in the Netherlands is that you really want to know the right people to be able to partner with when something happens. So make sure that you know your peers, your competitors, and that you connect with their security teams and don't compete on security, but try to make the pie bigger for everyone by making sure that customers trust SaaS services. As a result, they will trust you.
Retaining external support is really important as well. You're not going to have the ability to know everything from the legal side to the technical side of incident response, so make sure that you build these connections ahead of time and you look into what it is that you can potentially even contract ahead of time. There are retainer agreements you can sign with law firms and with forensic investigators, so they can come help you when an incident actually happens. Now, you might ask yourself the question, "Should I really be spending my money on this?"
And the answer is it depends a bit on the size and you may not. But even then, it's worth starting the conversation so you at least know how much it's going to cost you when an incident actually happens. And you'll know a little bit about the process, and maybe you can agree on terms for getting support before the incident happens so you don't end up locking yourself into agreements you disagree with.
Understand who it is that you need to report to, and this has become more and more important with rules and regulations like GDPR. Understand that there are differences between regions. In Europe, quite often you're expected to report to a competent authority, which could be a national CERT or could be a regulator. In the US you're most likely required to report to your customers or their customers in some way, so investigate what actually applies to you as an organization. What rules are important? What are the things that you really want to learn ahead of time so that they don't catch you by surprise? You can do that yourself, but I would actually highly recommend that you talk to your attorneys. You usually all will have an attorney that you work with on things unrelated to security incidents. Maybe have the conversation about what support you need when an incident strikes, so that you can really figure that out ahead of time, and also think about your culture.
Culture is one of the most critical things for a company because if you are very open with your customers and then you have an incident and you hide away and you don't communicate, your customers may lose a lot of trust. So make sure you have that conversation with your legal support ahead of time, so they also know what it is that you will want to do in terms of your customer relationships when a security incident strikes.
Communicate often and early, but always be correct and truthful. That can mean that if you don't know something for sure, you may actually want to tell your customers that you don't know it yet or that you're still investigating and that you don't have that information. But make sure that you're as truthful as you possibly can be, and make sure you create a place where customers can continue to learn new things. I have this little siren up here, and this is actually a really important concept in a sense that when an incident happens everyone will have an opinion about what happened, and they will all have some level of validity or reasonableness and things that people will want to know about and consider.
But I think it's really important as a company that's actually affected by this, that you have the right mechanisms to communicate in a way that your customers can actually find it and they can get authoritative information from the source.
There are also some interesting mechanisms here that you might want to think about. For instance, if you develop an SDK and you deliver software to your customers and they use it on premise in some way, you will want to think about how it is that they will actually discover that your software has vulnerabilities. Quite often that might be by making sure that vulnerabilities have a CVE number assigned. CVE stands for Common Vulnerabilities and Exposures, and it's a universal tracking number for vulnerabilities, typically requested by the person who found the vulnerability, but in some cases also by the vendor that's actually fixing it.
That's often the way that enterprise customers will learn about the vulnerability, because if there is a CVE, their vulnerability scanner vendors are going to write code to detect it. The software composition analysis tools will start building the right tooling to detect that vulnerability in your customers' code.
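To make that concrete, here is a rough sketch of how a scanner might use CVE IDs as the join key between an advisory feed and a customer's software inventory. Everything in it is an assumption for illustration: the function names, the package names, the versions, and the CVE IDs (which use a placeholder year) are hypothetical, and real tools match version ranges rather than exact version strings.

```python
import re
from typing import Dict, List, Set, Tuple

# CVE IDs have the form CVE-<year>-<sequence>, with four or more sequence digits.
CVE_ID = re.compile(r"\bCVE-\d{4}-\d{4,}\b")


def extract_cve_ids(text: str) -> List[str]:
    """Return the distinct CVE IDs mentioned in a text, in first-seen order."""
    seen: List[str] = []
    for match in CVE_ID.findall(text):
        if match not in seen:
            seen.append(match)
    return seen


def matching_cves(
    inventory: Dict[str, str],
    advisories: Dict[str, Tuple[str, Set[str]]],
) -> List[str]:
    """Return CVE IDs whose affected package and version appear in the inventory."""
    hits = []
    for cve_id, (package, bad_versions) in advisories.items():
        # inventory.get returns None for packages that are not installed.
        if inventory.get(package) in bad_versions:
            hits.append(cve_id)
    return sorted(hits)


# Hypothetical advisory feed: CVE ID -> (package, affected versions).
ADVISORIES = {
    "CVE-2099-0001": ("example-sdk", {"1.0.0", "1.0.1"}),
    "CVE-2099-0002": ("other-lib", {"2.3.0"}),
}

# Hypothetical customer inventory: package -> installed version.
INVENTORY = {"example-sdk": "1.0.1", "other-lib": "2.4.0"}

print(extract_cve_ids("Fixed CVE-2099-0001 and CVE-2099-0002; see CVE-2099-0001."))
print(matching_cves(INVENTORY, ADVISORIES))
```

Without an assigned ID there is nothing for the advisory and the inventory to join on, which is why a fix that ships without a CVE can go unnoticed by customers who rely on scanners.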
Know the basics. This is probably the single most important thing that I've learned in dealing with security incidents. You will be learning something new every hour, every day. It's very easy to get lost in what is happening there. There are a lot of details, and details may look very impactful and in the end they may not be.
So, have a couple of questions that you continuously ask yourself as you learn something new and you go through new things. Think about exactly what the impact on customers is. This new information that you just learned, did it tell you something that you didn't know yet? Did you learn something new about how access in this case was achieved? And does it actually impact customers? The best way that I found to do this is by having one document with a couple of paragraphs at the top, and actually calling it an "impact statement." You continuously update it whenever you learn something new so that everyone that's involved in the incident can open up the document, look at it, and see what the current status is. If you learn something new and you don't know how relevant it is, add it somewhere near the bottom and make sure that every few hours you review it, you look at it, and you see if there's something new that you need to integrate in the impact.
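The shape of that document can be sketched as a small data structure. This is only an illustration of the workflow, assuming a single summary at the top and a list of parked findings at the bottom; the class name, method names, and sample text are hypothetical, and in practice a shared document does the same job.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List


@dataclass
class ImpactStatement:
    summary: str                      # the paragraphs at the top
    updated_at: datetime
    unreviewed: List[str] = field(default_factory=list)  # parked at the bottom

    def note(self, finding: str) -> None:
        """Park a new finding whose relevance is not yet known."""
        self.unreviewed.append(finding)

    def integrate(self, new_summary: str, now: datetime) -> List[str]:
        """Rewrite the summary after a review and clear the parked findings."""
        reviewed, self.unreviewed = self.unreviewed, []
        self.summary = new_summary
        self.updated_at = now
        return reviewed

    def render(self) -> str:
        """Produce the text everyone on the incident reads."""
        lines = [
            "IMPACT STATEMENT (updated " + self.updated_at.isoformat() + ")",
            self.summary,
            "",
            "Unreviewed findings:",
        ]
        lines += ["- " + item for item in self.unreviewed] or ["- none"]
        return "\n".join(lines)


# Hypothetical usage during an incident.
statement = ImpactStatement(
    summary="No customer data is known to be affected.",
    updated_at=datetime(2020, 1, 1, 12, 0, tzinfo=timezone.utc),
)
statement.note("Unusual login on the build server")
print(statement.render())
```

Keeping the unreviewed list separate from the summary mirrors the advice above: new details are captured immediately, but they only change the stated impact after someone has reviewed them.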
That brings me to the final thing I really want to talk about, and that is that you should never, ever let a good security incident go to waste. You look at this graph, you might laugh a little bit, but this is typically where every incident responder is when they're dealing with a security incident versus the greatness that potentially awaits them if they learn all the lessons. I probably started dealing with security incidents in, I would say, 2002, and it was originally supporting large companies when they were affected by the early Windows worms or things like Code Red, Sasser, Blaster and so on.
Every time I've dealt with an incident since, I've felt like the little kitten up here. There's something that makes me feel really uncomfortable about what I am doing, and every time there's this opportunity to take what I just learned and turn into a lion that is going to deal better with a security incident the next time. That is where all of you are, because in the end the way that you build maturity to deal with security incidents is by dealing with security incidents.
One thing I always tell my team is you will likely have a security incident of some magnitude at some point in time, and it's probably going to be more frequent than you want even though it might not be very frequent. But it's sometimes good to just look at the things that didn't meet the threshold and treat them as a real incident, so you have an opportunity to go through the entire process and get everyone prepared. I found it to be incredibly valuable to do these little tabletops where I just take an hour of time and I come up with a scenario, I put some key leaders around the table and we talk through a security incident and they have to make decisions. Even if they don't make the right decision in the scenario, the next time a real incident strikes you can talk to each other about "Do you remember how we did this in the exercise? You made this decision and it worked out, or it didn't work out?" It's a really valuable time to learn.
Ways you can do that are by studying and documenting what you're doing and making sure that you do a retro with everyone involved. Get everyone around the table and make sure they know that you're not trying to find blame and that you're just all trying to improve. Get them to share what worked for them and what didn't work for them, and then spend a little bit of time thinking about how you can put all of that in place. Understand the root cause of your incident. Usually when you say the root cause is something, it's actually a couple of levels deeper. There's a method called the five whys that you can apply. Let's say that there's a root cause that you determine, which is a token that was put in a public GitHub repo. Think about why it ended up there, like what was it that made it possible for it to enter? Maybe you're not doing code reviews? Maybe you're missing some tooling that could actually do this checking automatically? Think about the different layers that led to that decision. Maybe you're not doing enough training, or maybe you shouldn't actually be using tokens in the first place and there's a better place to pull that authentication information.
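As an illustration of the kind of tooling that could do this checking automatically, here is a minimal sketch of a secret scan that might run before code is committed. It is an assumption-laden example, not a recommendation of a specific tool: the function name and the two patterns are illustrative only, and dedicated scanners use far larger rule sets. The sample key is the placeholder access key ID that AWS uses in its own documentation.

```python
import re
from typing import List, Tuple

# Illustrative patterns only; real scanners ship many more rules.
PATTERNS = {
    # Commonly documented shape of an AWS access key ID.
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    # PEM header that precedes a private key.
    "private_key_header": re.compile(r"-----BEGIN (?:RSA |EC )?PRIVATE KEY-----"),
}


def find_secrets(text: str) -> List[Tuple[int, str]]:
    """Return (line number, rule name) for every line that matches a pattern."""
    findings = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                findings.append((line_no, name))
    return findings


# Sample file content using AWS's documented placeholder key, not a real secret.
SAMPLE = 'region = "us-east-1"\naws_key = "AKIAIOSFODNN7EXAMPLE"\n'

for line_no, rule in find_secrets(SAMPLE):
    print(f"line {line_no}: possible secret ({rule})")
```

A check like this answers only one of the whys. It catches the token on its way into the repository, but it does not address why a long-lived token was needed in the first place.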
Next, communicate your needs and share your learnings. I'm going to be very honest with you, this is always difficult when you have a security incident. I just recently went through one of my own and it's very difficult to share everything that you learned immediately, but think about what you learned and find opportunities to share, even if you end up a little bit later in the cycle and a year later you can say, "Now I can actually talk about this." The only way that we're going to get better is by all sharing the mistakes that we make and learning from them. There is absolutely no shame in that. When people shame someone else who has a security incident, that's very unfortunate and we shouldn't be doing that, because the only way we're all going to make sure that we don't all have to make the same mistake again is by actually being transparent about what we did wrong, so others can learn about the different challenges that they are going to face. That's actually all that I wanted to share with you today. Thank you very much.