Guy Podjarny: Hello, everybody. Welcome back to The Secure Developer. Today we have Adrian Colyer with us. Adrian, welcome to the show.
Adrian Colyer: Thank you. Pleasure to be here.
Guy: It's good to have you, Adrian, here. Today we'll do a slightly different spin on what we do.
Typically we talk about this intersection of security and development or developers, and today we're actually going to talk about this intersection of secure development and security as a whole with science, with paper, with proper research.
That's because Adrian writes The Morning Paper. So, Adrian, can you tell us a bit about, just give us some background about yourself and about The Morning Paper and what it is?
Adrian: Sure, of course. My background was technical. I did a number of CTO roles for many years, most notably with a company called SpringSource that did something called the Spring framework in and around the Enterprise Java space.
Adrian: The little-known framework, and carried that journey on through VMWare and the formation of a company called Pivotal.
Then a couple of years into that journey, which is now three, four years ago, I left really to come back mostly to Europe, where I've always lived but not always worked, at least not full-time, to see what was happening in the startup scene.
Launched myself temporarily with a venture capital firm called Accel, here based out of London. Four years on, I'm still there now. So that's a little bit about my background.
The Morning Paper is a habit I slipped into by accident.
About four years ago, I was sitting on the train on my commute on the way into London with my fellow passengers looking at the Times and the Telegraph, and other newspapers that we have here in the UK. I happened to be reading an academic paper that morning. And I thought, "This is kind of fun. Everybody's reading their morning paper."
I tweeted the title of my paper with a little hashtag, #themorningpaper. And I'm not quite sure exactly how it happened, but I've done it every day since.
So it'll be four years this August that every weekday, bar Easter and Christmas breaks, etc., I have read a computer science research paper, typically in the morning, and then written up my thoughts and posted it as a summary towards the end of the day.
Guy: Definitely we're going to dig into that a little bit more about this unusual and quite impressive habit that you've developed there.
Today you and I prepped a little bit, a bunch of topics, mostly you. Some of the really interesting studies or papers that happened, or that you posted and wrote about that dealt with security or secure development. Let's dig through some of those, and at some point we'll come back a little bit to this morning routine of yours.
You write about a lot of papers, right? They cover many topics. They touch security, clearly a hot topic these days, including in the tech world. The first area maybe to tackle is just things around security that are more in the day-to-day. We read, we think oftentimes of these research papers as these far, theoretical issues.
Adrian: Exactly, yeah.
Guy: Are there any of the papers that you've read that are interesting, that are applicable, that people can absorb and use something in their day-to-day development jobs or security jobs?
Adrian: Yes, yes, there really are.
It's a really common misconception that there's a big gulf between work that might be done in academia and practical day-to-day stuff.
But I guess my bias is to select papers that have some more immediate relevance to practitioners anyway. You know, there really are a bunch of papers that open your eyes to what's going on and what's possible, things to think about and maybe even to take practical steps.
Thinking about this, we picked out a paper to begin with called Thou Shalt Not Depend On Me, which came from the NDSS conference in 2017. It's by Lauinger et al, and it's something that everybody can relate to. It is really straightforward and you hear the story, and you go like, "Oh, of course, yeah, it's not surprising." And yet it's kind of sad at the same time.
Adrian: They get good data on 11 or so of the 72 and then they do an interesting analysis which just says, how many of these sites now, you've got a big corpus, how many of them have at least one vulnerable library?
They start to analyze how good we are at keeping up to date with the libraries that we're including in a project, particularly on the browser side. What's going on there? Trying to understand people's patterns and practices.
I'd say the shocking but perhaps not surprising thing is that a huge percentage of even very popular sites have known vulnerable libraries included in their site.
It doesn't mean they're all directly exploitable, of course. But they've all got some vulnerability in there. Twenty-one percent of the Alexa Top 100 sites, to give you an idea. As you grow out to the full 75,000, we hit about 38%, which is pretty stunning.
What's interesting when they dig into this is, A, it's something to be aware of, but then there's a few extras. Often it's not the libraries you've directly included, though it can be, we can talk about that in a moment, but it's libraries that get indirectly included by something that you have pulled in.
And out of that, the very worst culprits turned out to be all of the ad trackers and analytics libraries, etc., that have a really bad habit of pulling out of date things into your stack and then you're kind of in a bad place.
Guy: Super interesting. I remember that piece. The debate made some headlines, and it got to Hacker News.
Adrian: That's right, yes.
Guy: It made some attention because, beyond the catchiness of the data bit, it's not always that you can summarize the key finding of a research paper in a--
Adrian: Yes, exactly. It's very immediately understandable, isn't it?
Guy: It's interesting. There's the insight itself, which is really interesting, and the secondary insight of the fact that they didn't pull it in.
But I find it interesting, this delta, like we oftentimes see, we do some of this ourselves, vendor-driven analysis, bulk-data analysis, that comes down and does it. How do you differentiate, or how do you see the difference between a research paper that comes out with this type of statement versus some non-research or a vendor entity?
Adrian: That's an interesting question. I mean, clearly, a vendor could equally have done this particular piece of research, just like the academic team could have done.
I think there's always this air of plausibility that comes with the academic thing. Or put it this way, if you want to look at it the other way, "If a vendor does it, it's obvious they've got a vested interest."
So nobody is that surprised when the result comes out and there's always an overtone of, "buy my tool," whereas when it's a pure piece of academic research, you can just look at the data and go, "Well, okay, at least it should be independently verifiable, it should be peer reviewed, etc. I can hopefully go and actually look at the datasets they use, in many cases, and verify this for myself."
Now you've got the question of, "Okay, what do I do about it?" And that's where, of course, "Wouldn't it be wonderful if somebody had a tool that would tell you if you had out-of-date dependencies?" Surprise, surprise, here we are.
That's the other interesting side, I think, of this work, for me, is that question about what you do about it. I mean, clearly it says you really do need something that's keeping you on top of these dependencies.
You can imagine what's going on in these larger companies. Somebody builds the site. It's done, it's deployed. Why would you go back and touch this and tamper with it, etc.? So they also drill into the stats around that, and that's again not surprising, but it is quite a revealing picture.
They found the median age of a library that's deployed on these sites in their study, behind the most current version, is something like 1,200-odd days, 1,177 days.
Semantic versioning, etc., has been adhered to, but you're going minor, you may be going major. So we actually have collectively a ton of work to do to bring this together. Even standing as an outsider, get it clean first and have some process to keep it clean is really the only way to do this.
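As a rough illustration of the two measurements discussed above (does a site include a library version with a known vulnerability, and how many days is that version behind the newest release), here is a minimal Python sketch. The release dates and the vulnerable version range are made-up placeholders for an imaginary library, not data from the study, and the lag definition used here (days between the used version's release and the newest release) is an assumption about how such a figure might be computed.

```python
from datetime import date

# Hypothetical release dates for an imaginary library; not data from the study.
RELEASES = {
    "1.0.0": date(2013, 1, 15),
    "1.1.0": date(2014, 3, 1),
    "2.0.0": date(2016, 6, 10),
}

# Hypothetical advisory: versions in [1.0.0, 1.1.0) are known to be vulnerable.
VULNERABLE_RANGES = [((1, 0, 0), (1, 1, 0))]


def parse(version):
    """Turn '1.2.3' into (1, 2, 3) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


def is_known_vulnerable(version):
    """True if the version falls inside any known-vulnerable range."""
    v = parse(version)
    return any(low <= v < high for low, high in VULNERABLE_RANGES)


def days_behind_latest(version):
    """Days between this version's release and the newest release."""
    latest = max(RELEASES, key=parse)
    return (RELEASES[latest] - RELEASES[version]).days


if __name__ == "__main__":
    for used in ("1.0.0", "2.0.0"):
        print(used, is_known_vulnerable(used), days_behind_latest(used))
```

Running a check like this on every build, rather than once at launch, is one way to "have some process to keep it clean."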
Guy: Do you find that when you read these papers, do they generally bias to have actionable recommendations? I mean, they in theory should be used to prove a thesis, right?
Adrian: That varies tremendously. There's a little bit of selection bias in the papers I choose to cover. I tend to go for ones that I think are relatable. But certainly not all papers actually are there with the purpose of telling you pragmatic, actionable steps. The first and primary purpose for any academic paper is to be published.
Adrian: And so anything beyond that is normally a bonus.
Guy: Yeah, it's just extra. So this is one example, a good, very concrete example. What other example comes to mind around practical--?
Adrian: I guess there's a lovely counterbalance to this. We've just talked about a paper and a piece of work that really says it's really important that you keep your dependencies up-to-date, and you need some process around that. If you don't, you're probably going to have way more vulnerabilities than you think.
The flip side is, how do we do that? Well, often we are using package managers, etc. And it makes me smile that we put an enormous amount of trust in running apt-get update or whatever, bundle install, or whatever it is.
Guy: The automatic action.
Adrian: Often, with privileges, it's the thing that we always do first that pulls, essentially, software that we don't really know quite what's in it, off of the internet and installs it in our precious machine.
Guy: Why did you trust something you just downloaded off the internet?
Adrian: Exactly, it's the one thing we are all trained to trust. And yet, obviously when you flip it round and think about it, this whole package manager coupled with continuous integration/continuous deployment is a wonderful, wonderful attack target.
If you can get something in that delivery stream somewhere, then we're all set up to automatically put it in production for you, which is a beautiful and scary pipeline.
The second paper is one called Diplomat, and it's got a long title, Using Delegations to Protect Community Repos. It's from 2016, and it looks at this problem of how do you know you can trust what you're getting when you npm install, when you gem install, when you use any of these package managers from the Docker Hub, from PyPI, whatever it is.
They analyze a bunch of these systems and look at what's going on. In particular, this work focuses on the signing of the packages in various ways, and looks at how do we sign today, what are the various strategies that are used.
For example, there's one single master key that's used by the repo, or maybe developers have their own keys, maybe there's some kind of delegation mechanism, how does it work? Obviously, there are ways a single compromised key can do a ton of damage.
Guy: Indeed, yes.
Adrian: As in fact has happened in some packages in the past. Again, this is a very pragmatically grounded paper. I guess if you're not writing a package manager, it's of less immediate use.
But it's something you really need to be aware of when you think about what you're pulling into your systems all the time. They devise a kind of a key delegation system. They tested it out with PyPI and I think with something behind Docker and a few others.
It really says, "Look, we need to think about how keys are managed for signing this stuff. It needs to be pragmatic so it actually works. We need to accept things like we wish developers always signed, but, you know what, sometimes they don't. So actually we, as the managers of the repository, are going to have to sign it, etc."
Really, it's very straightforward. It's a delegation hierarchy of trust, and they've got two basic mechanisms. You can basically have a prioritized list of signers. It's exactly what you think, first go to A, and if A doesn't have it, you can fall back to B.
Then they have a way of specifying, "This particular rule terminates the chain." So if you get as far as C, you should never look any further, so that you can stop cascading. Really they just say, "Look, given that, how could we pragmatically use these tools to do a lot better around the repos that we've got?"
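The two mechanisms just described, a prioritized list of signers and a rule that terminates the chain, can be sketched in a few lines of Python. This is only an illustration of the idea, not Diplomat's actual metadata format or API; the `Delegation` class, the `resolve_signer` function, and the role and package names are all hypothetical.

```python
from dataclasses import dataclass
from fnmatch import fnmatchcase
from typing import Dict, List, Optional, Set


@dataclass
class Delegation:
    role: str                  # who is trusted to sign (hypothetical role name)
    pattern: str               # which package names this rule covers
    terminating: bool = False  # if True, never look past this rule


def resolve_signer(
    package: str,
    delegations: List[Delegation],
    signed_by: Dict[str, Set[str]],
) -> Optional[str]:
    """Walk the rules in priority order and return the trusted signer, if any."""
    for rule in delegations:
        if not fnmatchcase(package, rule.pattern):
            continue  # rule does not cover this package
        if package in signed_by.get(rule.role, set()):
            return rule.role  # first matching role that signed it wins
        if rule.terminating:
            return None  # stop cascading: no fallback to later rules
    return None


# Highest priority first. The project team's rule terminates the chain,
# so nobody further down the list can vouch for a "spring-*" package.
RULES = [
    Delegation("spring-team", "spring-*", terminating=True),
    Delegation("repo-admins", "*"),
]
SIGNED = {
    "spring-team": {"spring-core"},
    "repo-admins": {"oldlib", "spring-evil"},
}

if __name__ == "__main__":
    for name in ("spring-core", "spring-evil", "oldlib"):
        print(name, "->", resolve_signer(name, RULES, SIGNED))
```

In this toy setup, `spring-evil` is rejected even though the repository admins signed it, because the terminating rule for `spring-*` prevents the fallback. That is the property that limits the damage from a single compromised key lower in the list.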
Their maximum security model, they have a legacy one as well, but let's talk about their maximum security model, basically has three buckets, right? So you've got what are called "claimed projects." Let's think of these as the healthy ones with active maintainers, and we know who they are, etc.
You can set up a proper key delegation to the people that own that collection of projects. Often there is a group that you go to. I think about Spring, for example. There's lots of things under the Spring umbrella; there's a collectively owned group of projects. So you can say these are claimed, known projects.
We can use offline keys that are owned by that particular team to do this, the developers themselves will do it, because they're active, they're engaged. The second bucket of packages, they call them the rarely updated, the forgotten ones. They're useful. They're still out there. They're downloaded, but they're kind of stable and mature, at least. Let's put that bit of light on it.
Guy: A positive note on them, yes.
Adrian: So for these ones they say, "Well look, actually, in that bucket, we use offline keys, again. The admins of the repository will do it. This is okay, because they are infrequently updated so we can manage tractability around that."
Guy: Might make it more cumbersome to deploy, but that doesn't happen that often.
Adrian: Exactly. Then you've got your problematic bucket, which is the new upcoming projects bubbling up. What do you do about that? For those, they keep the keys online. That's the solution, because they're coming through all the time.
Offline signing is a pain, but it's mitigated by saying, look, when a new one comes on, we'll sign it with online keys. But every two weeks, we rotate those down into the "claimed project" bucket. So you've always got a limited window.
It's actually a fairly simple-to-understand scheme. But they analyze, for example, what would happen with PyPI users, looking at the Python packages, what comes down if you'd have had this system in place? And they assume, it was quite interesting, a threat model, whereby an attacker basically takes over the repository, has all the keys, exists undetected for about a month.
I would imagine upfront that this is like, "Game over. We're all hosed." And actually they managed to protect, if they had that system in place, about 99% of PyPI users would still be kind of good, even under that kind of threat.
Guy: Pretty massive.
Adrian: So a few pragmatic things that can make a big difference. Maybe it's worth just very briefly talking about a related work called CHAINIAC, because it goes one step further.
Here they look at things like co-authorities, which is having multiple signatories all coming together, and fully fashionable to have a blockchain, actually. They have several blockchains underpinning it, so that we've got actually a proper use case.
I think this is a genuine valid use case: an immutable public record of the releases that have come out and the corresponding signatures. That makes a ton of sense.
Guy: And they don't say "blockchain" anywhere.
Adrian: They do use the word "blockchain."
Guy: But it's a different concept.
Adrian: It's not published as a flashing lights blockchain paper, you know. So, yeah, there is a lot of work in this area. The whole "software supply chain" is a phrase that likes to be used here. We've got a lot of work to do, I think, to secure it all the way up and down the line.
Guy: Yeah, indeed. This was awesome. For me, it almost on a natural level feels the most appropriate type of analysis for more academic mindsets to do, just given that it is, first of all, fundamental. You need a very comprehensive analysis to understand all the scenarios.
The word "pragmatic" that you used there is not often seen in there. So that's nice that that was added. But also the "entirely neutral," minus the desire for your work to be used. But there's no financial interest, at least, in play.
And subsequently there's probably core math elements here almost, right? Sort of architectural structures and just an understanding of what is--
Adrian: In many cases, to do the analysis of various kinds, yeah.
When you get a good project, these research papers are real treasure troves.
Because somebody, or maybe a team, has spent many, many months, normally, doing all this work and packaging it up, and then condenses all the learnings for you into a short, relatively bite-sized piece. So if you get one on a topic of interest, they can be terrific.
Guy: What's your experience been around seeing these papers manifest in real-world products or offerings, or open source projects that get used in earnest?
Adrian: That's a really interesting question. I'm often surprised. I have a bias towards picking, as I say, more practitioner-oriented papers. But even so, you expect that the bulk of the ideas are ahead of or a little bit left-field for the mainstream commercial industry.
But I do have anecdotal data from sending out papers to a few thousand people, now, every day. We're up to nearly 20,000 people, to my surprise, that at least get the email.
Guy: It's really interesting.
Adrian: They might not open it, who knows? But that's terrific. And reasonably often, two things happened. One is I'll get feedback like, "I never knew I needed this research, but it arrived at just the right moment and it really helps with something I'm doing."
There is an element of serendipity, where something arrives that helps somebody, that I couldn't have predicted, they couldn't have predicted in advance.
The other thing that is really interesting to me is, again, although I couldn't plan it, and I never know what element it's going to be, but obviously in my role with Accel I get to meet a lot of companies, hear a lot of business plans, see a lot of exciting tech stuff that's going on. And it's amazing how often what I've learned and picked up, just from trawling through some of the research, is highly relevant to those conversations.
Again, if you said to me, "Just do the selection of papers that are going to be relevant," I couldn't do it, obviously. But there is quite a high correlation. I think in general the gap between academia and industry, that transition time has been shrinking, as it has everywhere else.
The one I really always think about is the Berkeley AMPLab. All the projects that came out of that, Spark, etc., and how quickly they went from a pretty well-structured research agenda to open source to companies in a matter of no time.
Guy: Indeed, definitely different than the past. I think a part of it is the world learning to embrace the tech or the research, and a part of it is maybe at least a stream within the world of the academy that its bias is for more practically applicable research, that in a way we benefit from.
Adrian: Yeah, exactly. The entrepreneurial spirit has definitely infused academia in a way that it hadn't 10, 15 years ago.
Guy: Yeah, before.
Adrian: And I'm sure that helps.
Guy: Let's shift gears a bit, from the practical to the novel. Those are our concrete, practical things that we did today that we should change, but sometimes research is all about breaking limits, finding new avenues of thoughts.
In the world of security, what examples come to mind that have done something novel around this security activity?
Adrian: One of my favorite things in security papers is the sheer ingenuity of the researchers and the ways that they find to break things, which leave you simultaneously with this, "Well, that is so cool," feeling and, "Oh my God, that's terrifying."
Both things are true at the same time, and one of the papers that I came across that caught the imagination of a few folks in that arena was called When CSI Meets Public WiFi, which, actually, for an academic research paper, it's a very catchy title.
Guy: It's a very good title, yeah.
Adrian: So that they did terrific there. The CSI here actually stands for Channel State Information, and the headline of the paper is, you're using your mobile phone, you're interacting with some service that requires you to enter a PIN in order to, for example, validate a payment, something like that.
Simply by the way that your hand moves across the surface of your phone when you're tapping in the PIN, the researchers are able to infer what your PIN is, with surprisingly high accuracy.
I think it's like 60, 70% success rate, and they're recovering six-digit PINs in one test for the Alipay service. They give a stack ranking of the PIN could be this. Number one result is what the PIN actually is, and if you looked at their top three, top five scores, obviously, they're doing pretty well.
This is really quite stunning. It's like, okay, how the hell does this work? It's super ingenious. The setting is classic coffee shop setting, and you're sat at the table working. The attacker needs to be relatively close, sort of another table within the coffee shop, a few meters, something like that. It begins with the old chestnut of, "let's set up a rogue access point." So, we should all know about those rogue access points.
Guy: Indeed, we still fall for them, but yeah.
Adrian: We still fall for them. So, it's a rogue access point. Now, I'm not going to decrypt your traffic, though. It turns out that you can figure out when somebody's about to go to one of these payment services, because you've got to time the attack just right, simply by looking at the IP address.
They often use different IP addresses for the payment part of the service. They just look for traffic going to that IP address.
Guy: For security, actually, to use the other services for security purposes.
Adrian: Yeah, exactly. Doesn't this happen so many times? So, different address. They're relatively stable, a couple of weeks or so. I'm an attacker, I use the site, I go to that service. I use some kind of traffic sniffer, see the IP address. Great, got it.
When my victim in the coffee shop now goes to that IP address, what I start doing is sending a high rate of ICMP echo requests. They'll bounce back with little replies, about 800 a second, which sounds a lot, but actually the bandwidth requirement is such that nobody's going to notice this. This is completely surreptitious.
What happens is, in the network interface card, many, many of them will also freely make available to you what's called this "channel state information." So your wifi can go over a number of different channels, and the strength of the channels is based on the constructive and destructive interference and all sorts of things that goes on.
So you can imagine, in the phone, your hands are around the phone and moving across it, that's enough to interfere with these different channels, which is detectable in this CSI information inside the network card, and it turns out you can run a classifier on that.
In the paper there are pictures of the way it forms, and they're very identifiable. You can figure out which digit it is. Now, you do need to know how that particular user moves their hands, so you might think, "Oh, this is a weakness of the strategy." But, as they point out, you only need a bit of creativity to figure that out.
For example, one could throw up a captcha for using this particular wifi service, or something along those lines that happens to have digits in the captcha image, and there you go. You've caught, there are very, very creative ways of doing this.
Guy: You only need 60, 70% success rate.
If you normalize the patterns and you were successful 10% of the time, that's really good stats for attacks.
Adrian: Yeah, exactly. There are so many ingenious ways of recovering passwords and PINs. Microphones on the keyboard, accelerometers, they've done it with smart watches. They've done it with webcams, all sorts of things. But this one is kind of special, and it's zero access to the device. Nice remote, hands off.
Adrian: It shocks us all that, "Oh, wow, that's even possible. I just never thought of that."
Guy: It's this whole class of side-channel attacks.
Guy: Somebody's ability to use a seemingly benign channel that you think would have nothing to do with it, and you can still reconstruct it.
Adrian: Yes, this is a little bit off piste, but since you reminded me of that. There was a piece of work called CLKSCREW, along those lines, that also really caught my imagination, which looks at DVFS, the dynamic voltage and frequency scaling that does the power management for the chip, where to save your battery, you can downscale it a little bit, you can reduce the power, you can reduce the frequency, etc., or you can, of course, increase it back again.
It turns out there aren't good enough safeguards around that, and it's open, so you can push the thing into overclocking to cause occasional bit flips. There's this whole other wonderful story, and again it doesn't seem intuitive, but the way they do it is just incredible.
If you can flip one single bit, you're basically kind of hosed, it turns out, really. The way I intuitively came to understand that is, think about at the core something, like the factoring of prime numbers or something like that. That fundamental that sits right behind all of this, and imagine if you could just change one bit at the appropriate time.
The odds are you've got a much more factorable number. For example, it's just an example, it's not exactly how it all works, but there are many little things like that, where you time the bit flip such that all the guarantees you thought you had, they break.
Guy: They disappeared on it.
Adrian: Some of these side channel things are just incredible.
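The single-bit-flip intuition described above can be tried out with a toy experiment in Python. This is only a sketch of the intuition, not how CLKSCREW itself works: it multiplies two well-known primes to get a hard-to-factor number, flips one bit at a time, and checks whether the result now has a tiny prime factor. The primes are far smaller than anything used in real cryptography, and the function names are made up for this example.

```python
SMALL_PRIMES = [2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47]

# Two well-known primes (a Fermat prime and a Mersenne prime), chosen only
# for illustration; real RSA primes are vastly larger.
P = 65537
Q = 2147483647
N = P * Q  # toy "modulus" whose only prime factors are P and Q


def flip_bit(n, position):
    """Return n with the bit at the given position inverted."""
    return n ^ (1 << position)


def smallest_small_factor(n):
    """Return the smallest prime below 50 that divides n, or None."""
    for prime in SMALL_PRIMES:
        if n > prime and n % prime == 0:
            return prime
    return None


def count_easy_flips(n):
    """Count single-bit flips of n that yield a number with a tiny factor."""
    return sum(
        1
        for position in range(n.bit_length())
        if smallest_small_factor(flip_bit(n, position)) is not None
    )


if __name__ == "__main__":
    print("original has a tiny factor:", smallest_small_factor(N) is not None)
    print("bits in N:", N.bit_length())
    print("single-bit flips with a tiny factor:", count_easy_flips(N))
```

The untouched product has no tiny factors by construction, while flipping even the lowest bit makes it even, so it is immediately divisible by 2.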
Guy: Yeah, okay, so that's fascinating. Alarming, but fascinating.
Adrian: Yes, indeed.
Guy: Give us another example of an interesting--
Adrian: Another one I picked when we were thinking about this, it's also phone-based, it turns out. But Leave Your Phone at the Door is a really interesting one. This is actually about industrial espionage, or those kind of scenarios.
Imagine we're in an industrial manufacturing plant, and I've got some CNC milling or some 3D printing, or something like that going on. And there's a lot of IP, actually, in the way I construct and manufacture these objects.
So, if you're actually physically in the plant and you're near enough the machine, and I can either figure out a way to plant some kind of malware on your phone, or if that's too much hassle, I just call you up, and you're happy to talk to me, in the vicinity of the machine. Then I can use the phone's microphone. Or if I've actually got the phone, I can use the magnetometer as well.
These machines give off characteristic noises, depending on the angle of the head. So depending on what they're printing and when they move, for example, vertically up and down, you can imagine a characteristic whine noise. I can sort of hear what it might be like in my head.
Those are also fingerprintable, it turns out. I mean, it can be done reliably enough to be in the background of a phone call, and so you can actually, and the paper had some pretty amazing examples, of a particular shape being printed out, and reconstructing after the fact, "here's the shape that I think you printed."
And again, they're relatively simple shapes in the work, but the fact that it's possible at all is actually the astonishing thing. Hence again, the fun title in the paper. Really, one of the takeaway lessons here is, as we all know,
these phones are amazing spying devices, packed with sensors and all sorts of things, and people are going to the most creative ways of getting information out of them.
Really, it's just another example of what can be done.
Guy: Any information you provide can and will be used against you in those elements.
Adrian: Exactly, yes.
Guy: Okay, I know we had a challenge here, because there's just so many creative, I guess, that's really where the minds go wild a little bit when researchers can just explore the different paths.
Adrian: Yes, yes.
Guy: We've only mentioned a handful of these, but you collect this ridiculous number of papers that you read. I assume also you read more of those, and then you write and you summarize. I mean, how do you do it? How much time does it take? What are your sources?
Adrian: Yeah, some of the most frequently asked questions. Really short on how long does it take, which is people's most favorite question.
It probably takes me between two and three hours on average to read the paper, think about it, write it up, and then, especially if I took into account then the time to actually turn it into a blog post and an email newsletter and a couple of tweets that go with.
The whole packaging it up and pushing it out. It probably is closer to three hours a post. I try not to add up the total time too often, but it's somewhere around that order.
In general, I read the paper in the morning if I'm commuting in, often I'll use the commute time to do it. But otherwise, if I'm just at home, I'll read it in the morning. I like to let it mull around in the back of my mind. I mark it up quite heavily as I read it, and then later on, it's just one take.
If you do one every day, you haven't got time to be too precious about it. So I think that discipline actually helps.
You've got to do it. You've just got to start writing. I'll kind of outline the piece, my key thoughts. I'll try and figure out what's the story I really want to tell around this, and then get to it. So that's kind of the basics of the process. Paper selection, I guess I've honed it over the years. People say, "Where do you find the interesting papers?"
There's a number of ways of doing it. When you're just getting started, there are actually quite a few good lists of papers out on the internet. You'll get through that fairly quickly. But for seed material in a topic area, that's great.
Then you might look at recommended reading lists for university courses, etc., which will help you find some of the classic, test-of-time papers that give you a solid background.
Then the other thing that I guess has become the bedrock of my personal routine is you get to know both research groups and conferences that regularly publish work you like.
So I actually have a calendar with the main conferences I follow all marked on it. I know when they are in the year. I know right now is the time to go look at their proceedings, and I'll work through and do a first pass through the abstracts. "These ones might be interesting." Then I'll do the quick read, and then I'll have the final selection.
That is like the cornerstone now, of my year. I probably have about 20-odd conferences that I regularly follow. Plus the pressure, I suppose. In a sense, you've got to come up with one every day. I'm always on the lookout for an interesting paper. Anywhere I see, Twitter, news feeds, whatever, I stash them all away and then I work through that backlog.
Guy: Explore it. At this point, do you get a lot of papers sent to you? Do you get a lot of recommendations?
Adrian: I get some, and it's always very gratefully received. If anyone's listening that wants to send me a paper, those are always very welcome. I do get a few, sometimes from researchers saying, "Hey, we've just published this work. I think you might find it interesting."
Sometimes from researchers pointing out somebody else's work, which is always lovely, saying, "Hey, I saw this thing. I think it's really good." Sometimes from practitioners, yeah, so they do come in, but it's still a minority sourcing avenue for me.
Guy: Still a small amount, yeah, got it. Do you write them up in advance? Do you bulk write? Do you write seven of them, so you can have a day off?
Adrian: Yeah, I do, absolutely. This is one of the things that keeps me sane. My weeks are very hectic and varied. I could be off here, there and everywhere. So I don't live within a 24-hour pressure window to have the next day's post. I am normally one week in advance.
By the end of the weekend, I like to have all the posts for Monday to Friday all scheduled and good to go. People who follow regularly will notice they all come out, and the tweets, at exactly the same time every day. That's because they're all scheduled in advance.
It's funny, actually, this week I'm reading a collection of papers from a really terrific workshop, I'm excited to be able to share these with you, called Re-coding Black Mirror, if anybody's ever followed the Black Mirror show.
Guy: Talk about spooky.
Adrian: It's looking at many of these various scenarios, you know, how technology could go wrong, and the ethics around it. One of the papers I was reading was about what rights do people have to data once a person is deceased? How should we think about data rights?
It turned out to be a really interesting question. It caused me to think, yes, actually, if something happened to me, this is a very long answer, on average you'd still get two and a half posts. So yeah, fingers crossed this won't happen, but you're good for at least two and a half days if you die, on average.
Guy: Yeah, you'll take care of that soon. Fascinating. Yeah, I think we'll probably share a bunch of these links. Also a bunch of the stories that we don't have time to talk about today in the notes of this podcast.
Adrian: Yes, of course.
Guy: Because there's so many of them.
Let's shift back into the contents. Another category we chatted about was not just exploring the new attack techniques, but rather the other way round, the security of the new technologies and what do they imply. Are there a couple of interesting examples from that world?
Adrian: Yes, there's one that we bounced back and forth on that, again, is another "oh, of course" once you hear it. But I must admit to being quite naive and not thinking about this beforehand.
The paper is called Game of Missuggestions. Again, it's a bunch of researchers who analyze what auto-complete suggestions you get when you start typing in your web browser, when you go to Google and do a Google search, etc.
I guess I'd never really thought about that as an attack vector, but it turns out, not only is this an attack vector,
there's actually an entire service industry that will carry out this attack for you for a fee, ranging from about $300 to $2,500, depending on the sort of keywords you're interested in, etc.
Take the main example in the paper: "I'm interested in finding online backup software." It's a classic target. "I might want to provide you my online backup software, because that's probably going to have access to all your files," and that sounds interesting.
So you start typing "online backup," and in your auto-complete suggestions, you might get a shady vendor's "online backup free download" suggestion. To me at least, I'd imputed some degree of trust to what I was seeing. This is clearly a popular search. It must be a well-known thing.
It turns out that trust is completely unwarranted, because these services, even against Google, etc., are very effective. The researchers analyze this whole ecosystem, which turns out to be making about half a million dollars a week for some of the more popular manipulation services, or "online reputation management services," which I think is the phrase they might like to use.
Really what they do is dead straightforward. You say, "These are the search terms I'm interested in, this is the little phrase I want to pop up," and they'll go and use armies of crowdsourced workers and other things to just drive a high volume of search requests using those keywords.
The target must appear somewhere in the results, then the workers click on the appropriate link, and they do this over and over and over again. It turns out in the experiments that this takes some period of elapsed time, which can be up to a month.
But in the grand scheme of things, that isn't that long, and you can seriously game this and get your preferred phrase right up near the top of the results. It can stay there then for one to three months as well. So it's actually a pretty effective mechanism.
Guy: Yeah, that's why people pay good money for advertisements that are placed in that space.
Adrian: Exactly, yeah. Actually, the way they uncover how this is going is also kind of cool. This is worth a short digression.
Many people will have heard of this thing called Word2Vec that lets you take a word and turn it into a vector representation that somehow embodies its meaning. What they looked at is if the auto-complete suggestions are genuine, they probably ought to be fairly similar in space to true search results that you would get if you actually completed the search.
So they apply Word2Vec, and then they look at the distance in the vector space, and they find that, indeed, the ones that have been manipulated sit at a distance of about 0.7, on an arbitrary scale, and the ones that are genuine at about 0.5, something of that order. There's a real gap, anyway. These Word2Vec techniques can uncover it by looking at the similarity.
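To make the distance idea concrete, here is a minimal sketch in Python. It is not the paper's pipeline: the two-dimensional vectors are toy stand-ins for real Word2Vec embeddings, and the function names and the 0.6 threshold (picked only because it sits between the rough 0.5 and 0.7 figures mentioned above) are assumptions for illustration.

```python
import math

def cosine_distance(u, v):
    """Return 1 - cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def mean_vector(vectors):
    """Component-wise average of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def looks_manipulated(suggestion_vec, result_vecs, threshold=0.6):
    """Flag a suggestion whose embedding sits far from the search results.

    threshold is a hypothetical cut-off, not a value taken from the paper.
    """
    return cosine_distance(suggestion_vec, mean_vector(result_vecs)) > threshold

# Toy embeddings (hypothetical): the results cluster along the first axis.
results = [[1.0, 0.0], [0.8, 0.2]]
print(looks_manipulated([1.0, 0.1], results))  # suggestion close to the results
print(looks_manipulated([0.0, 1.0], results))  # suggestion far from the results
```

A real detector would embed the suggestion text and the result text with a trained model and calibrate the threshold on labelled data; the sketch only shows the comparison step.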
Guy: Yeah, which will probably imply that Google might employ these to identify those components, and then the attackers would find a different way to improve it.
Adrian: It's a never-ending arms race, yeah.
Every surface you expose, somebody will find a creative way to try and manipulate it to their end.
Guy: The attacks can scale to a magnitude where, if you think of Google as just this volume and size--
Adrian: Precisely, who'd have thought that--
Guy: That can no longer be affected by a single entity.
Adrian: That is the incredible thing, isn't it? That if you're targeting niche enough areas, you can game the system, it turns out.
Guy: That's amazing. It's interesting in many ways: the identification of it; the pattern itself, which is a little scary and makes you trust those recommendations less; the attack on Google at that scale; but also the attack on machine-learning data. This is a new methodology that we are also embracing, and it shows how, if you can manipulate or poison the data, you can poison the results.
Adrian: Precisely, that's a whole other area we probably haven't got time to fit in, but there's some great work on that. As you say, if you can influence the training data, or the training time in a model, etc., they're learning machines, and they will learn what you tell them.
Guy: You can teach them something wrong.
Adrian: You can absolutely bias what comes out of those systems, if you're feeling malevolent.
Guy: I think we have time for one more. Let's dig into one more interesting security of the new frontier.
Adrian: Okay, so my absolute favorite that I've recently read, it's like science-fiction to me, this particular one. It's called Securing Wireless Neurostimulators, is the paper title.
If you're not familiar with a neurostimulator, it's a medical device called an IMD, an implantable medical device that you often have implanted in your chest or somewhere around there, and it's directly wired to your brain. So already, this sounds like, "Ooh, this is kind of interesting."
Guy: Yeah, that sounds so.
Adrian: If you have certain illnesses like Parkinson's and things, this can be very therapeutic, delivering the right kind of voltage to the right parts of the brain at the right time.
Of course, it's implanted. It needs to be remotely programmable. It's kind of like an everything wrong with IoT/embedded security tale. If you get into the paper, it's "security by obscurity," and a protocol that's not documented but can be reverse-engineered. Once you've reverse engineered it, it turns out you could drive all these various attacks and things.
So, you think the obvious, "Ah yes, okay, once I can send commands to the device, yes, you can cause a person to not be able to move, to not be able to speak."
Maybe you could do permanent brain damage; you could probably kill somebody. I hope these devices have safety guards, but given all the other things I know, who knows?
Guy: And those might be overridable, if you know.
Adrian: They might be overridable. So there's that, but there are two things about this paper that made me kind of just go, "Wow." One of them was, what if you're not actually trying to change the signal, you're just trying to actually use it to get signal from the person's brain.
Now, this is something I hadn't twigged, but there's this brainwave called the P300 wave. The name comes from the fact that it appears about 300 milliseconds after you've visually seen something, and you can't really spoof this thing.
They've shown that when you're recording this wave, it's possible to see if you recognise something like a picture of your password, or your PIN, or a face that you claim never to have met that you do actually know, etc. And so this is like literally hacking your brain in a sense, getting secrets out, in a way.
Guy: Reading your thoughts, yeah.
Adrian: The newer generations of neurostimulators will expose this P300 wave information. So this is like, that's actually possible as a hack? This is amazing and scary, all of these things are.
And then the second really cool thing they do is like, "Come on, we need to do better. We can't have this kind of security. But actually, how would we generate a secure key for communicating between the implanted device and the programmer?"
Many people try different schemes, and the challenge is nearly always not only finding a sufficient source of randomness, but also then how you transmit it from inside the device to the programmer in a way that's secure, that can't be eavesdropped on, etc.
The really cool thing that the researchers do is they find something called the LFP, the Local Field Potential, which is like a physiological signal that can be read from your brain, to do with the fluids around the brain, and the electrical field in them and some other things I can't fully explain, and don't fully understand.
But suffice to say, it's pretty unique and pretty random. So they use this as a genuine source of randomness. So literally your brain is making the random encryption key.
Guy: Yeah, it's the seed generator, wow.
Adrian: Which is really cool. The other bit is then fairly straightforward. They require an actual explicit physical touch from the programmer to the skin, to have this sent from the wire that goes from the brain down to the device, and through the skin. So they've got a few protections around it.
But those two mechanisms, the P300 wave, and this LFP thing, they're both really kind of science-fictional.
Guy: Very cutting edge. Yeah, the world of medical IoT is a scary one, when you say "security" next to it. But its potential is amazing, the whole "with great power comes great responsibility" phrase.
Adrian: Yeah, indeed.
Guy: It really kind of hits a nerve here. These are fascinating, and they're really again just scratching the surface. We didn't get to talk too much about privacy. We were going to talk about it. Do you just want to mention in passing a couple of interesting papers here?
Adrian: Yeah, let's do a very quick one in passing. One interesting paper, again, quite a practical paper that I recently came across, is called Privacy Guide. This is the idea that, as we all know, and hopefully it's getting better with the GDPR, buried inside the terms of service is all sorts of stuff, and they're incredibly lengthy. Again, the researchers analyzed it.
It would take us about 244 days, I think, if you actually wanted to read the terms of service for the major services that you use.
Totally impractical. So they use machine-learning and NLP and various other techniques, and they read it for you.
They actually turn it into what I think of as a bit like a nutrition label. So they've got 11 categories that they've broken it down into, and you get a simple traffic-light scoring in each category, and they'll tell you basically whether the terms look good, or there's something to go and investigate. So that's a kind of really quick aid.
Guy: Yeah, sounds really useful.
Adrian: Underpinning that is my absolute favorite work on privacy notices that work, because we also know from research that you can tell somebody, "This is what this has access to." You can be as explicit as you like, and they'll still click "yes."
Guy: Yeah, they'll just push through, because they're trying to do something else, right?
Adrian: They just push through. It turns out there is one dialog that works. It's from a paper, they gave it a brilliant title, The Curious Case of the PDF Converter That Likes Mozart. It's a PDF converter app, as you can imagine. It's downloading, looking at your music, all sorts of stuff it shouldn't really need to do.
What they found was, the thing that gets people's attention is, instead of just saying, "It will have access to your photos, it will have access to this," they do say that, and then they say, "And this is what it can reveal about you."
They take some examples of information that could be inferred, like, "These are the faces of the top five people you communicate most with." Or, "We can figure out that you like X, and you frequently shop here."
Giving these insights that are gleanable from the data actually works. And it's, "Oh, right. Oh, hang on, I don't want that." Maybe that gets across the impact of what it actually means to reveal that information, because I don't think people have a good intuition about how much could be learned from fairly sparse data.
Let me finish with one last amazing-to-me tale, which ties into privacy and the GDPR and this discussion we're having around anonymization and pseudo-anonymity. God, that's a hard word to say. It's kind of like it's anonymous, but it isn't really.
To be declared fully anonymous, I think that the wording is something like it must be impossible to reverse engineer, which, if you read around a bit, that's also an arms race. I guess, the paper that brought home to me how clever people are at re-identifying from data is called Building Trajectory Out of Ash, something very close to that.
Guy: Yeah, Trajectory Recovery From Ash?
Adrian: Trajectory Recovery From Ash, yeah, there you go. Very briefly, maybe I'll leave this as a mental exercise so people can figure out how it's done.
Here's the set-up: you have aggregate data from cell towers, as a time series. So what you have is the cell tower, name/number/identifier, whatever it is, and the number of devices attached to that cell tower at this point in time.
You've just got that data for all your cell towers and the device counts. You would think that would be safe to release for analysis and research, etc.
In this paper, the researchers show step by step, and you work through it. And you go, "Oh, no, no, no, no, no, no." They can uniquely identify all the individual users' trajectories from this aggregate data. Once they've got back to the trajectories, it's actually trivial to find the person.
I mean, if you look at the accumulation of paths that are connected, the two most common addresses are probably home and work, and you're nearly there. Give me three most-frequently-used locations, I probably know who you are with a few extra scraps of information.
So they do this step by step, just from the aggregate data. It's super clever, and it's really ingenious. The essence of the idea is this: if you imagine two cell tower locations, let's do three, let's make them in a straight line so I could walk from A to B to C.
If there are five people at each of them at time one, and at time two there are six in the leftmost, let's say, and four in the center and five on the right, your most likely guess is that somebody walked from the central tower to the one on the left, if I've kept my analogy right.
You can do this and look at it as, essentially, a big optimization problem. Like, what's the least-cost movement that could cause this, with some other heuristics about time of day, and whether people like to move or not. And if you're going at a certain velocity, you're likely to keep going.
It works, and you can solve this set of constraints, and out falls highly reliable trajectories for people. So you're never as anonymous as you think. It's a pity, that.
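The three-tower example above can be sketched as a tiny least-cost matching in Python. This is an illustration of the intuition only, not the paper's algorithm: it assumes the towers sit on a straight line, that the cost of a move is the distance walked, and that the total device count is unchanged between the two time steps. Under those assumptions, pairing the sorted "before" and "after" positions gives a least-cost explanation. The function name and tower layout are hypothetical.

```python
from collections import Counter

def least_cost_moves(positions, counts_before, counts_after):
    """Infer the cheapest tower-to-tower moves explaining a change in counts.

    positions maps tower name -> coordinate on a line (an assumption of
    this sketch). counts_before and counts_after map tower name -> device
    count. Returns a Counter of (from_tower, to_tower) -> number of devices.
    """
    if sum(counts_before.values()) != sum(counts_after.values()):
        raise ValueError("sketch assumes no devices appear or disappear")

    def expand(counts):
        # One (coordinate, tower) entry per device, ordered along the line.
        return sorted(
            (positions[tower], tower)
            for tower, n in counts.items()
            for _ in range(n)
        )

    before = expand(counts_before)
    after = expand(counts_after)
    # In one dimension, matching devices in sorted order minimizes the
    # total distance moved.
    return Counter(
        (src, dst) for (_, src), (_, dst) in zip(before, after) if src != dst
    )

towers = {"A": 0, "B": 1, "C": 2}
time_one = {"A": 5, "B": 5, "C": 5}
time_two = {"A": 6, "B": 4, "C": 5}
print(least_cost_moves(towers, time_one, time_two))
# Counter({('B', 'A'): 1})
```

With real data the towers are spread over a plane and the extra heuristics mentioned above (time of day, velocity) would shape the cost, so a general assignment solver would replace the sorted pairing.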
Guy: No, you're never anonymous, and never safe, really, in all these side channels. Wow. So these are scary items.
I guess I'll remind everybody reading this, oftentimes there's some recommendation element here, right? So, one aspect of it is to highlight the concern. But another, more important thing is that practically every one of these papers, and you probably have that bias in selecting them, has some concrete suggestions, some advice.
Sometimes advice to you as a consumer of these things, and sometimes advice to the future creators of these types of entities, which is us, right? That's the tech industry, that's the development industry, or the developers, that element that can build things, that can learn from those, and build them correctly.
So, Adrian, before you entirely disappear on us here, I have a question I like to ask every guest. Which is, if you had one piece of security-related advice or a pet peeve that you'd like to share, what would that be?
Adrian: It's a great question. I guess I've nominally had the half an hour of this conversation to think about it, and it's been in the back of my mind.
Maybe the thing I would say is, the more of this you read and understand, the more it reinforces the impossibility of first coming up with a design, and then bolting on security after the fact.
I think security is an integral part of the design, and everything else that follows. That's the way that you need to be thinking about it.
A lot of the right things will fall out with that approach. So perhaps my pet peeve is the bolt-on security/security by obliviousness/we didn't really think about it but we'll probably get away with it. In today's highly networked world, with some of the things that are becoming connected, as we've talked about, it's just not responsible any more to have that approach.
Guy: Just doesn't fly, yeah. Cool, that's a really good tip, good advice. Well, this was fascinating. We've gone on longer than we intended, just because we can just go on and on.
We'll post a whole slew of links here, but tell us just quickly if somebody wants to subscribe to The Morning Paper, how do they do it? How do they find you?
Adrian: Yeah, great question, thank you. The simplest way would be to Google "The Morning Paper," and maybe my name, Adrian Colyer, which is spelled C-O-L-Y-E-R. Or it's at blog.acolyer.org.
Guy: Cool, yeah, and they can register, and I highly recommend it. I've been reading them for quite a while, and we'll post a bunch of these links so you can figure out the math for Trajectory Recovery From Ash, or you can actually read the paper, or at least Adrian's summary of it.
Adrian, this has been a pleasure. Thanks a lot for coming on.
Adrian: No, it was a ton of fun. Thank you so much.
Guy: And thanks everybody for joining us, and tune in for the next episode.