April 4, 2019
Hey guys. Thanks for coming. I'm Amber. I lead the project engineering team at Stripe. I've been there for around three years now and I spend a lot of my time there working on the API, building things on the API, generally trying to figure out how to scale our API.
So, for those of you who don't know, Stripe is a payments API. We want to build the payments infrastructure for the internet which basically means we want to give people the ability to create credit charges via an API, people to send transfers to bank accounts, generally being able to transact online and move money around as well as grow businesses online.
As a result, we have lots of APIs. As many APIs first start out we started off with one endpoint so that was our charges endpoint. I think at the time it was actually called like XQ charge or something like, that was before we had discovered REST.
So you would basically hit up a charges API, send us your credit card information, send us the amount you want us to charge the credit card for and we would go ahead and create a charge.
Over time and sort of as we introduced more functionality, so now we have things like recurring billing, we do subscriptions, we have marketplaces functionality so you can send money to recipients, sort of the number of API endpoints that we had, bloomed.
Pretty quickly we had to figure out how we scale our API, and I don't just mean in terms of performance scaling, but how do you scale your abstractions and how do you scale your codebase. I thought that would be a pretty interesting thing to talk about and what better way than to go through an example. So, let's build an API.
Stripe uses Sinatra. We're mostly a Ruby shop so a lot of the examples here are going to be in Sinatra and Ruby, but a lot of them can be translated to any language or frameworks so you can sort of imagine it in your favorite language or framework of choice.
So, here's a super simple example of a charges API. Here we are collecting the card information, so we've got our card number, we've got our amount, and we're doing some magical things to be able to actually create a charge, so behind the scenes that creates all of the relevant database objects, goes out to the card network and stuff like that.
Then we're returning an API response. So, all of Stripe's API responses have an ID and also return all of the relevant information for the actual charge. So in this case we got the amount, we got the redacted card number and whether or not this charge was successful or not.
So, we also have to do some validation, so in this case, figuring out if the card number and the amount are reasonable things to have been sent, and if they're not, then returning an error or something like that. This is a pretty simplistic example of what validations we have and also authentication.
So when someone is sending in an API request, we have to figure out who the user is, whether they have permissions to be able to access that API endpoint and so on.
So you can tell this endpoint is sort of getting complicated and crowded, but this should be all we need to actually create an API endpoint for charges. If we went ahead and curled this right now with the card number and amount we would return a charge object. So great, we have an API. That was really easy. But what next?
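A rough sketch of the kind of all-in-one endpoint described above might look like the following. It is plain Ruby rather than Sinatra so that it runs on its own, and everything in it (the `API_KEYS` table, the `create_charge_endpoint` name, the response fields, the stubbed charge creation) is a hypothetical stand-in, not Stripe's actual code.

```ruby
require 'json'
require 'securerandom'

# Hypothetical API key table; a real system would look this up in a database.
API_KEYS = { 'sk_test_123' => 'user_1' }.freeze

# One handler doing everything: authentication, validation, logic, response.
# Returns [http_status, json_body].
def create_charge_endpoint(api_key, params)
  # Authentication
  user = API_KEYS[api_key]
  return [401, { error: 'Invalid API key' }.to_json] if user.nil?

  # Validation
  card_number = params['card_number'].to_s
  amount = params['amount']
  unless card_number.match?(/\A\d{13,19}\z/)
    return [400, { error: 'Invalid card number' }.to_json]
  end
  unless amount.is_a?(Integer) && amount.positive?
    return [400, { error: 'Invalid amount' }.to_json]
  end

  # The "magic" is stubbed out: no database objects or card network here.
  charge_id = "ch_#{SecureRandom.hex(6)}"

  # The response shape is hard-coded right in the endpoint.
  redacted = ('*' * (card_number.length - 4)) + card_number[-4, 4]
  body = { id: charge_id, amount: amount, card_number: redacted, success: true }
  [200, body.to_json]
end

status, body = create_charge_endpoint(
  'sk_test_123', { 'card_number' => '4242424242424242', 'amount' => 100 }
)
puts status
puts body
```

Even in this small form, four separate concerns share one method, which is the crowding the rest of the talk sets out to untangle.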
So, now that we have the ability to be able to create a charge maybe you want to be able to retrieve a charge by that ID that we just got back. Maybe you want to be able to list all of the charges that you've ever created or something like that, you've got more endpoints.
More functionality, so like I said after we have our charge endpoint we then introduce things like subscriptions API and marketplaces API and all sorts of interesting things, so more functionality, more changes.
What happens when you want to deprecate something or rename an API property or parameter? As anybody who has ever created an API knows, this is really, really annoying. Once you add something to your API, you can basically never take it away ever again, and basically more problems is what you have.
You have more code basically by definition, you have more functionality and endpoints and stuff like that. Back to our example, a lot of things are getting conflated together, so now in the earlier example we had to do API key authentication and do validation of the parameters that are being passed in, do the actual API logic, then return API responses and return error messages and stuff like that; and at the same time, be able to support every operation for any combination of parameters or responses that has ever existed.
That's pretty hard to do. You get a lot of copy-pasta. Lots of code duplication everywhere as you start adding new endpoints and then new functionality and stuff like that. You also start to introduce a lot of dependencies to your API. So what I mean by this is now you have a lot of places where you have to go change when you change your API.
One good example of this is documentations. So, say you built your API like we did earlier then you wrote up your documentation, maybe you did that in the form of static HTML or markdown or something like that, and then you launch your API and everything is great.
Maybe a week later you decide you want to add an API parameter or something like that. Now you have to go change both the API and the documentation, which is going to be in a separate file, and with lots of code and lots of documentation, eventually you end up with something like this.
Someone's going to forget to update the docs. I have been that someone many times and things are gonna be out of sync. Maybe for documentation this means that someone doesn't know that they're allowed to use an API parameter or maybe they're using an API parameter that has been deprecated. But in the case of API libraries, so Stripe has API client libraries, if we forget to update that, then like the code just breaks, right?
So, for us especially because we're a payments API, broken integrations are really, really bad. Any customer whose integration breaks is literally losing money every single minute that they're down. So we need to figure out a way to support all of these things.
So, I just said a lot of scary things. How can we go about making this better? A lot of people talk about sort of like the best way to design your API, right? Like what are the best design decisions, what do the endpoints look like, how RESTful are you? That's sort of like a holy war in and of itself.
I mean, myself included. I actually gave a talk earlier last year which I'm happy to talk offline about if anybody is interested, about the design decisions that we made to make Stripe's API as easy to use for developers who are integrating on Stripe.
But I think that the thing that nobody tends to think that much about or at least nobody tends to talk that much about, is how do you make your API really easy and elegant to use for the engineers who are actually developing on top of it, and I think that's pretty understated, right?
How do you like, make your engineers happy and how do you be able to change things and add things without being afraid that things are gonna break?
Building APIs or building products or anything like that as everybody knows requires a lot of like iteration and feedback cycles. Nobody gets it perfect the first time so you're constantly going to be changing things.
Being able to optimize those steps, like being able to make it really easy for your engineers to add things or change things, translates to really significant speed gains and quality gains later down the road.
So yeah, so, the first thing, which is what we ended up doing, was to separate the different layers of logic. This is basically CS 101, right, like separate the responsibilities of different things.
In the earlier example, our endpoint is doing a lot of different things. So we're doing the authentication and doing the validation, doing the actual endpoint specific logic. So in our case it's payments but maybe you're sending e-mails or uploading files or something like that.
We're also constructing the response, so in our example we had basically hard coded it in our endpoint, which is fine for one example but once you start to add any other endpoints, things are gonna quickly fall down. And also doing error handling and stuff like that.
So error handling and authentication are two pretty easy things to pull out. You don't necessarily have to know in your API logic whether the user was able to authenticate or not, right? You should trust that somebody else is going to do that for you.
So a lot of frameworks have a notion of like before filters or something like that where you could put logic that runs before any other endpoint. We use Rack middleware for this which actually tends to work really well.
Similarly for error handling. So when you're in your API logic, you don't want to have to worry about what the like JSON response is supposed to look like. You just want to be able to do something like raise an error. So in our case we have user errors that we represent and have a different HTTP status code for and stuff like that.
So you want to be able to raise an error whenever something happens and know that somewhere down the line, somebody is gonna catch it and format it correctly for you.
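As a minimal sketch of that layering, the two middleware classes below follow the Rack convention of an object that wraps an app and responds to `call(env)`, returning a `[status, headers, body]` triple. No Rack gem is needed to run it. `UserError`, the `API_KEYS` table, the `api.user` env key, and reading `amount` straight from `env` are all simplifying assumptions for illustration.

```ruby
require 'json'

# Hypothetical error class for mistakes made by the API caller.
class UserError < StandardError
  attr_reader :http_status

  def initialize(message, http_status = 400)
    super(message)
    @http_status = http_status
  end
end

# Rack-style middleware: catches errors raised further down and formats them.
class ErrorHandling
  def initialize(app)
    @app = app
  end

  def call(env)
    @app.call(env)
  rescue UserError => e
    [e.http_status, { 'content-type' => 'application/json' },
     [{ error: { message: e.message } }.to_json]]
  end
end

# Rack-style middleware: resolves the API key to a user before the endpoint runs.
class Authentication
  API_KEYS = { 'sk_test_123' => 'user_1' }.freeze # hypothetical key store

  def initialize(app)
    @app = app
  end

  def call(env)
    key = env['HTTP_AUTHORIZATION'].to_s.sub(/\ABearer /, '')
    user = API_KEYS[key]
    raise UserError.new('Invalid API key', 401) if user.nil?

    env['api.user'] = user
    @app.call(env)
  end
end

# The endpoint only contains API logic and trusts the layers wrapped around it.
ENDPOINT = lambda do |env|
  raise UserError, 'Amount must be positive' unless env['amount'].to_i.positive?

  [200, { 'content-type' => 'application/json' }, [{ user: env['api.user'] }.to_json]]
end

APP = ErrorHandling.new(Authentication.new(ENDPOINT))

p APP.call({ 'HTTP_AUTHORIZATION' => 'Bearer sk_test_123', 'amount' => '100' })
p APP.call({ 'HTTP_AUTHORIZATION' => 'Bearer nope', 'amount' => '100' })
```

The error handler sits outermost so that a failure in authentication is formatted the same way as a failure inside the endpoint.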
So we talked about authentication and we talked about error handling, but what about things like validation and figuring out the API response and stuff like that?
So when you're in your API logic, you also don't want to have to worry about what the actual response is going to look like. You just want to know that you're going to have to render a charge for whatever that means later.
So, internally we represent these two things as API Methods and API Resources. So if you're familiar with MVC this is sort of like the C and the V in MVC, so controllers and views.
So, our API methods know which parameters they're supposed to accept so we've made our own sort of like DSL for this, Ruby is really good at that kind of thing. To be able to see really quickly which things are required, what the actual resource is that's going to be rendered and then whatever the execution block is gonna be.
The API resource itself similarly knows what things it's supposed to spit back out, and then you can do custom stuff. In the case of card numbers you're able to say, oh no, use the charge's redacted card number instead.
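Stripe's internal DSL is not shown in this transcript, so the following is only a guess at its general shape: a method class that declares its parameters, its resource, and an execution block, and a resource class that declares what gets rendered. The names `ApiMethod`, `ApiResource`, `required`, `resource`, `execute`, and `property` are all hypothetical.

```ruby
# Hypothetical base class for resources (the "view"): declares rendered fields.
class ApiResource
  def self.fields
    @fields ||= {}
  end

  # Declare a property; an optional block customizes how it is computed.
  def self.property(name, &block)
    fields[name] = block || ->(obj) { obj[name] }
  end

  def self.render(obj)
    fields.map { |name, getter| [name, getter.call(obj)] }.to_h
  end
end

# Hypothetical base class for methods (the "controller").
class ApiMethod
  class << self
    attr_reader :resource_class, :execute_block

    def params
      @params ||= {}
    end

    def required(name, type)
      params[name] = { type: type, required: true }
    end

    def resource(klass)
      @resource_class = klass
    end

    def execute(&block)
      @execute_block = block
    end

    # Validate the declared parameters, run the logic, render the resource.
    def invoke(input)
      params.each do |name, spec|
        value = input[name]
        raise ArgumentError, "Missing required param: #{name}" if value.nil? && spec[:required]
        raise ArgumentError, "Invalid type for #{name}" unless value.nil? || value.is_a?(spec[:type])
      end
      resource_class.render(execute_block.call(input))
    end
  end
end

class ChargeResource < ApiResource
  property :id
  property :amount
  # Custom rendering: expose only the redacted card number.
  property(:card_number) { |charge| charge[:redacted_card_number] }
end

class CreateCharge < ApiMethod
  required :card_number, String
  required :amount, Integer
  resource ChargeResource

  execute do |input|
    # Stand-in for the real charge creation.
    { id: 'ch_123', amount: input[:amount],
      redacted_card_number: "************#{input[:card_number][-4, 4]}" }
  end
end

p CreateCharge.invoke({ card_number: '4242424242424242', amount: 100 })
```

With validation and rendering declared rather than hand-written, the execution block is left holding only the endpoint-specific logic.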
So with these two things our routes start looking really nice here. So that's nice.
Back to my example about sort of like having the docs and having to update docs and things being out of sync and stuff like that; you basically want to make it really, really hard for you to mess up in those places. So this is things like consolidating all the logic so there's not like multiple places where you have to change something, removing a lot of the code duplication.
One of the things that I thought was really cool when I first joined Stripe was we actually document our API methods directly inside the code. Sorry, our API parameters directly inside the code. So in this example, in the API method that we just saw right underneath the declarations for what things are being accepted is the documentation for each of those things.
So if I went ahead and had to add another parameter after this, I would just add the documentation right under it, and this makes it almost impossible to forget to add the documentation because you don't have to go to another file, you don't have to go recompile something. Our documentation, so stripe.com/docs/api, actually will auto-generate itself, or auto-generate most of the documentation, based on these particular methods, so it will pull these in.
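One way to picture documentation living next to the declaration is a parameter DSL that takes the doc string as an argument, plus a generator that walks those declarations. This is a sketch under assumptions: `DocumentedMethod`, the `doc:` keyword, `generate_docs`, and the doc strings themselves are invented for illustration and say nothing about how Stripe's generator is actually built.

```ruby
# Hypothetical base class whose parameter declarations carry their own docs.
class DocumentedMethod
  def self.params
    @params ||= []
  end

  def self.required(name, type, doc:)
    params << { name: name, type: type, required: true, doc: doc }
  end

  def self.optional(name, type, doc:)
    params << { name: name, type: type, required: false, doc: doc }
  end
end

class CreateChargeDoc < DocumentedMethod
  required :card_number, String, doc: 'The card number to charge.'
  required :amount, Integer, doc: 'Amount to charge.'
  optional :description, String, doc: 'An arbitrary note attached to the charge.'
end

# The docs page is generated from the same declarations the API validates with.
def generate_docs(api_method)
  api_method.params.map do |param|
    flag = param[:required] ? 'required' : 'optional'
    "#{param[:name]} (#{param[:type].name.downcase}, #{flag}): #{param[:doc]}"
  end.join("\n")
end

puts generate_docs(CreateChargeDoc)
```

Because `doc:` is a required keyword here, declaring a parameter without documentation raises an error, which is one way to make the docs hard to forget.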
People often ask us how we do our backwards compatibility. So, when you're integrating with Stripe you shouldn't have to see something like "Oh, you're on version 2.154 so make sure to have that in your endpoint URL when you call us."
Instead, whenever someone makes API calls to Stripe, we effectively hide the fact that we have any backwards compatibility. So what they see is they just make an API request and from then on that API request always happens to work, we never broke them.
What we do on our side is we actually, secretly the first time they make an API request, we store what version they're on and from then on we sort of have a contract in our codebase that we're never gonna serve them something that's going to break them.
Turns out people really appreciate this kind of thing. This is generally how we end up thinking about API design and product design in general. So always do like the most sane thing for like the default.
So if somebody doesn't have to care or they don't want to care they shouldn't have to, right? If they don't want to know what version they're on they shouldn't have to, it should just like magically work for them all the time.
But if they do want to care, so in the example of versioning, we do allow people to send version overrides in the headers or upgrade their versions in the dashboard, but most people just don't have to do anything.
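The pinning behavior described here can be sketched as a small store that records a version the first time it sees a user. `VersionStore`, its method names, and the date strings are hypothetical; the override argument stands in for whatever the request header carries.

```ruby
# Hypothetical in-memory store of the API version each user is pinned to.
class VersionStore
  attr_accessor :current_version

  def initialize(current_version)
    @current_version = current_version
    @pinned = {}
  end

  # The first request pins the user to whatever version is current at that
  # moment. A per-request override wins, but does not change the stored pin.
  def version_for(user_id, override: nil)
    pinned = (@pinned[user_id] ||= @current_version)
    override || pinned
  end

  # Explicit upgrade, as a user might trigger from a dashboard.
  def upgrade(user_id)
    @pinned[user_id] = @current_version
  end
end

store = VersionStore.new('2000-01-01')   # made-up version dates
puts store.version_for('early_user')     # pinned on first request

store.current_version = '2000-06-01'     # the API moves on
puts store.version_for('early_user')     # still on the original version
puts store.version_for('new_user')       # new users get the latest
```

The sane default is that nobody has to know a version exists; the override and the upgrade are there only for people who do want to care.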
We also approach sort of our ... So that's what we do for our external users. So, hide backwards compatibility for them, they shouldn't ever have to care. Why not apply that to our internal codebase as well?
So if somebody is, if some developer or some engineer who's working on the API wanted to add something, they should never have to care about backwards incompatibility either.
So, how do we go about integrating this kind of thing? So we have one deploy for all of our- we have one deploy and one codebase for all of our API versions. We don't do a thing where we like deploy separate services for each person or anything like that. But that can pretty quickly turn into a mess of spaghetti conditionals, right?
Imagine some time in the future, I'm pretty sure this will never happen though, Stripe decides that all charges are only gonna be one dollar. We deprecated the amount parameter.
The sort of naive way to implement this would be to put it in your execution block. If the user is not on the first version, which means that they're not allowed to pass amount anymore, and they happen to pass it, then tell them they're not allowed to do that. Then do whatever normal stuff here. And then later, if the user is not on the first version, then they're not supposed to see the amount property in their response, so delete it from the response after.
So there are a couple things wrong with this. First of all, who has any idea what version one is actually supposed to be? You can sort of infer it by looking at this code like, "Oh okay, well if they're not in version one then okay, that's what that means." But it's very unintuitive, and who knows what this is actually supposed to do? So you're inviting a lot of bugs and regressions if you're doing it this way.
The second thing is you're conflating all of the backwards compatibility code with the regular logic. So if I wanted to come along as an engineer working on the API and I wanted to maybe optimize some way that we sent the charges to the card network or something like that, now I have to wade through all of this backwards compatibility code to be able to do that. And this doesn't look too bad, there's only two sets of conditionals here, but imagine you removed 15 response properties from there. Now it's completely crazy.
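Written out, the naive approach looks something like this. The method name, the error text, the response fields, and the choice of 100 as the fixed amount are illustrative assumptions.

```ruby
# Naive approach: version checks interleaved with the endpoint's real logic.
def create_charge_naive(user_version, params)
  # Request-side compatibility check.
  if user_version != 1 && params.key?(:amount)
    return { error: 'The amount parameter is no longer accepted.' }
  end

  # The "normal stuff": everyone off version 1 is charged the fixed amount.
  amount = params.fetch(:amount, 100)
  response = { id: 'ch_123', amount: amount, success: true }

  # Response-side compatibility check.
  response.delete(:amount) if user_version != 1
  response
end

p create_charge_naive(1, { amount: 500 })
p create_charge_naive(2, { amount: 500 })
p create_charge_naive(2, {})
```

Nothing in `user_version != 1` says what changed after version 1, and the real logic is sandwiched between two compatibility checks.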
So how did we end up doing that? We've implemented our versioning system around a series of gates. A gate allows you to do something. So if you are on an earlier version, and that meant that you were allowed to send the amount parameter when nobody else was allowed to, you're on the gate 'allows amount.'
We store all of these in a central YAML file, and basically what this does is keep all of the versions and the behaviors that are related to each version in one place. So if we had this version that was on September 24th where we deprecated the amount parameter, so here the description is that sending the amount is now deprecated, that means that we now need a gate that's called 'allows amount.'
And we have the description in here and everything because just like our docs, we auto-generate our version documentation based on this file. So everything's in one place.
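The file itself is not reproduced in the transcript, so the layout below is an assumption: a list of versions, each naming the gates it introduced along with a description. The years are made up, and `gates_for`, `gate?`, and `changelog` are hypothetical helper names. The idea is that a user pinned before a version keeps the gates that version introduced.

```ruby
require 'yaml'

# Hypothetical contents of the central versions file.
VERSIONS_YAML = <<~YAML
  - version: "2000-01-01"
    changes: []
  - version: "2000-09-24"
    changes:
      - gate: allows_amount
        description: Sending the amount parameter is now deprecated.
YAML

VERSIONS = YAML.safe_load(VERSIONS_YAML)

# A user holds every gate introduced by versions newer than their pinned one.
# ISO-style date strings compare correctly as plain strings.
def gates_for(pinned_version)
  VERSIONS
    .select { |v| v['version'] > pinned_version }
    .flat_map { |v| v['changes'].map { |c| c['gate'] } }
end

def gate?(pinned_version, gate)
  gates_for(pinned_version).include?(gate)
end

# The version changelog can be generated from the same file.
def changelog
  VERSIONS.flat_map do |v|
    v['changes'].map { |c| "#{v['version']}: #{c['description']}" }
  end
end

puts gate?('2000-01-01', 'allows_amount') # legacy user keeps the old behavior
puts gate?('2000-09-24', 'allows_amount') # current user does not
puts changelog
```

A check like `gate?(version, 'allows_amount')` names the behavior directly, which an opaque "not version one" comparison never did.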
So now it looks a lot nicer, right? So now we've replaced the 'not user version one' check with: if the user doesn't have 'allows amount' and they pass amount, then we send them an error. So now this looks a lot nicer where we have the behavior for what's supposed to happen along with the error messages and validations in the same place, so it's a lot easier to read.
But we still haven't fixed the conflating issue. If I wanted to add something unrelated to this API method, I still have to look through all of this code, which gets pretty messy over time.
We did this by adding a separate set of compatibility layers. So just like the layering list I had earlier with the authentication and the validation and stuff like that, we have two other compatibility layers.
So one is for request compatibility and the other one is for response compatibility. So now when API requests come in, so here's an API request, maybe I sent amount and maybe I didn't, it'll go through this filter layer, which will figure out if you're allowed to send something or will munge the request to, like, rename parameters or whatever. It sends that on to the actual API logic and the construct-API-response step, which construct the API response as normal. But now that API response goes through another compatibility layer, and this is where we actually remove the amount property.
So these two layers have a really nice property of now they no longer have to worry about any kind of compatibility, so they know it's already taken care of or will be taken care of. So those layers always represent the most current version of the API. It's not confusing at all and we abstract everything else to other places.
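A minimal sketch of that pipeline, with the gate list passed in directly: the request layer and response layer are the only places that mention gates, and the logic in the middle is written as if only one version existed. The function names, error text, and fixed amount of 100 are illustrative assumptions.

```ruby
# Request compatibility layer: decides whether a parameter may be sent at all.
def request_compat(gates, params)
  if params.key?(:amount) && !gates.include?(:allows_amount)
    raise ArgumentError, 'The amount parameter is no longer accepted.'
  end
  params
end

# API logic plus response construction: no version or gate checks anywhere.
def create_charge(params)
  { id: 'ch_123', amount: params.fetch(:amount, 100), success: true }
end

# Response compatibility layer: strips what this user should no longer see.
def response_compat(gates, response)
  return response if gates.include?(:allows_amount)

  response.reject { |key, _value| key == :amount }
end

# The full pipeline for one request.
def handle(gates, params)
  response_compat(gates, create_charge(request_compat(gates, params)))
end

p handle([:allows_amount], { amount: 500 }) # legacy user
p handle([], {})                            # current user
```

An engineer changing `create_charge` never has to read either compatibility function, which is the property the talk is after.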
So, a lot of the examples I gave in here were pretty trivial, just like taking some card information or returning a card object or sort of like fakes, so I'm pretty sure we're never going to deprecate the amount parameter, but what does this actually look like in practice?
So Stripe has 106 endpoints, 65 versions and six API clients. You can sort of do the combination math there. Without giving ourselves the ability to see an API version really easily, or see an API method and be able to change something without being worried that we're going to break somebody down the line, I don't think the Stripe API would have been what it is today.
So yeah, so design for yourself. Separate your layers of logic, make it really hard to mess up, hide your backwards compatibility. Do everything that you would do for your users but for yourselves instead and that will sort of like pay itself back.
What else? We're not really sure. We're still figuring it out as we go. We're certainly not perfect. We're learning things every day. There's still a lot of things in our API that annoy me to this day, but if you're ever interested in talking about this kind of stuff or interested in scaling API abstractions in general I would love to chat.
Cool, thanks very much.
So the question was how many gates do you have? We gave a lot of examples of how many versions we have but not how many gates we have. So typically gates are one to one with versions unless we want to bundle a lot of them up at the same time. So I guess greater than or equal to the number of versions, but maybe double or triple or something like that.
Yeah, so the question was, are API versions by date, was that a design decision from the beginning or did we change or something like that? So you'll actually notice that our API, I can go back to one of the earlier examples, but it's a POST to /v1/charges. So you've got sort of like multiple weird versioning things going on.
So actually the way we used the version was we had a V zero, which was our RPC API, and then we were like, oh, REST is totally a thing, we should make our API RESTful, so then we switched to our V one API. I think generally the versions that are in the endpoint don't really stand for anything right now.
Maybe if we did like a massive upgrade of everything, like we changed all the endpoints or something like that, we would upgrade it to like V two. We used to do it that way, we don't anymore, that doesn't really mean anything. We use the date versions more now just because they're very intuitive, right? Like how am I supposed to know if I was on version 2.45? Whereas on our upgrades list you can say, "Oh, when did you join or when did you start making your first API request?"
And those are keyed by date, and internally we know on which date each piece of functionality was introduced. So it wasn't always like this but I think it's been really helpful to do it by date.
The question was, "You said all of your versions and deploys are sort of in one service. Does that mean that you have one monolithic codebase?' I think that everybody starts off as a monolithic codebase. We're to the point where we're starting to figure out like which things should be broken out and stuff like that so we're starting to go down that path but the majority of our code is in one codebase right now.
Yeah, so the question is did you ever drop support for any of the versions that you guys have, either because nobody's using them or something like that? So I guess the answer to that is not to date but it's something that we do want to go into.
I know some other APIs they sort of have like a policy of they'll only keep around APIs versions for four years or something like that. We haven't been around for four years yet so we'll see if we decide to implement that policy going forward.
We have considered it. There are a handful of versions that only happen to have like zero or one person on them, and for the ones that cause a lot of technical debt, or cause us to keep things around that we don't want to keep around, we generally try to shepherd people off as soon as we can. But for the majority of versions that just remove or add a property, I think it's not causing us too much pain just because we abstract it out of the normal flow, so it's not hurting us too much for now but it's something that we would definitely consider going forward.
Yeah, so the question is how often does Stripe use data about how people are using the API to direct feature development?
It's actually funny that you asked that. I think in the very early days one thing that I think Stripe did really well was do the unscalable things for as long as we could. So one thing that we actually did was had, we e-mailed every API request that everybody ever made to ourselves and then sort of creepily watched as people made API requests and then we would proactively e-mail them to be like, "Oh, I think you like meant this other parameter," or something like that.
But we did also use that to sort of like fix our docs like oh, why is that person doing that? That's kind of weird. Maybe our docs like are bad in this area, right? Not for any like particularly large- actually that's not true. So for our marketplace, a good example is our marketplaces offerings. So for those of you who don't know for marketplaces you not only need to accept credit card payments but you also need to be able to pay things out.
So think Lyft, right? They want to be able to accept credit cards from their users and also pay their drivers out at the same time. We saw that HomeJoy, who is one of our users who's also a marketplace, was trying to use us for payouts way before we actually launched that feature, so I guess we looked at that kind of like, "Oh, seems like a lot of people are trying to do this thing, maybe we should build something around it?" So I guess less for very large projects, although it has happened before, but we do it all the time for fixing docs and things like that.
The question was what kind of instrumentation do you put behind your APIs? Do you mean in terms of performance or in terms of...? So yeah, so we currently have a very immature implementation of New Relic that we use to some extent.
A lot of the things that we've been doing recently are building our data infrastructure so that we can store a lot of the requests like that, so a lot of the queries that we do are through like Impala or something like that. But a lot of those efforts have been, are relatively young so we'll see. I'd love to talk about that more if you have thoughts there.
Yeah, so the question was, I mentioned the compatibility layers, where we try to abstract as much as we can into different layers so the actual API logic doesn't have to deal with any of that. But how do we handle parameters that are both new and need to interact with the API logic? And the answer there is nothing terribly impressive: we just have to use the gate inside of there. So in the actual code that would use that, we just mingle that in there a little bit. But those cases are relatively ... We don't do that very often I think so it's not too much of a problem.
Okay, so the question is what does ... Is it what does the gate do? So if you were on an old version would you get an error or something like that?
Okay yeah, maybe I didn't explain that correctly here. So what the gating allows us to do is allows all- so we have like special behaviors for everybody who is in like some legacy API, right?
If you were using a previous version of the API and we later changed the functionality, that would never break you. You would always be able to do what you were allowed to do, if that makes sense. But anybody who is making a new API request to the Stripe API would be forced to use the latest version.
So, basically we invisibly layer so there's lots of versions and various users are on every version depending on when they first made their API request. So you would always be able to make the same API request and have it work as when you first started. Does that make more sense? Okay, cool.
So I guess I misspoke earlier when I said that. So we do have one monolithic codebase but we do have separate services. So, the code is still in one repository but we have multiple services for different things I guess. So we have one for the API, we have one for the dashboard and one for the site and stuff like that. So it's only the codebase that is monolithic, not the service itself.
Yeah, yeah, they just use the same codebase. So, testing is sort of annoying in that if you want to deploy the API you have to wait for the site tests to pass, so we're exploring that there, but we do do separate services.
Yeah, so the question is, how do you manage testing all these versions and stuff like that? We have functional tests for both the current version and all previous versions as well, for the exact endpoint behavior. So in our tests, we would just mock out the gate, mock the user to have that gate, and then test for the actual change that was happening.
So it results in a lot of tests but usually the changes are quite small. It's like remove one property or add one property like that so the tests themselves look pretty simple even if there are a lot of them.
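To illustrate what "mock the user to have that gate" might look like, here is a sketch that uses a plain struct as the user double and a tiny hand-rolled `check` helper rather than any particular test framework. `FakeUser`, `render_charge`, and `check` are hypothetical names.

```ruby
# Hypothetical user double whose gates can be set directly in a test.
FakeUser = Struct.new(:gates) do
  def gate?(name)
    gates.include?(name)
  end
end

# Simplified behavior under test: renders a charge for a given user.
def render_charge(user)
  response = { id: 'ch_123', amount: 100 }
  response.delete(:amount) unless user.gate?(:allows_amount)
  response
end

# Minimal assertion helper standing in for a real test framework.
def check(description)
  raise "FAILED: #{description}" unless yield

  puts "ok: #{description}"
end

legacy_user = FakeUser.new([:allows_amount])
current_user = FakeUser.new([])

# One small test per gated behavior, for both sides of the gate.
check('legacy users still see amount') { render_charge(legacy_user).key?(:amount) }
check('current users no longer see amount') { !render_charge(current_user).key?(:amount) }
```

Each test only needs to set one gate and assert one difference, which is why the tests stay simple even when there are many of them.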
So the question was we talked about using Rack middleware for the authentication and error handling. Is that a separate codebase? It's in the same codebase.
So the question was, when we write new documentation, who writes it and what does the review process look like? So engineers who are building said feature write the documentation. So we don't have any PMs at least right now or anything.
All of our engineers are the PMs of their own projects. I think the thing that attracted me to Stripe the most is sort of this notion of like end-to-end ownership, right? You build your feature, you document it, you write the blog post, you write marketing copy and stuff like that.
So engineers end up writing the documentation for those. I think that's worked really well for us. Yeah. So for the technical copy review, we do it in the same way that we do code review, so in our GitHub PR process.
So the question was what's the relation between the external documentation and the documentation tags in the code? So we actually rolled our own documentation. We wrote the engine itself, I guess; we're not using any external documentation tool for that.
So, these are actually pulled in to the documentation. If you go to stripe.com/docs/api, a lot of the parts that never change, so maybe the introduction or the description for what a charge even is, those are written straight into the HTML. But when it comes to describing what the different request or response parameters are, since those are sort of in flux and dependent on the versions and stuff like that, those are pulled in directly from the code here, so everything is sort of intertwined.
And I guess like this is where a monolithic codebase helps where our API code is in the same area as our documentation code so it can pull in the different classes like this and generate them nicely.
The question was do you have any deeper validation beyond integer or like string or something like that? So yeah, this is a pretty simplistic example. We go much further than that. So, these ones are- so these two particular ones are just like type validations but we actually let people like write whatever validations that they want.
One example it could be like array separated by commas or something like that, and then you would write the validation method in sort of a different file and that would get pulled in depending on what you said here.
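As a sketch of validators that are looked up by name, the registry below holds the two type checks alongside one custom validator for comma-separated lists. In the setup described above the custom validator would live in its own file; here it is inlined so the example runs on its own. `VALIDATORS`, `valid?`, and the validator names are hypothetical.

```ruby
# Hypothetical registry of named validators, looked up by the name used in a
# parameter declaration.
VALIDATORS = {
  integer: ->(value) { value.is_a?(Integer) },
  string: ->(value) { value.is_a?(String) },
  # Custom validator: a comma-separated list with no empty entries.
  comma_separated_list: ->(value) {
    value.is_a?(String) &&
      !value.empty? &&
      value.split(',', -1).none? { |item| item.strip.empty? }
  }
}.freeze

# Raises KeyError for an unknown validator name, so typos fail loudly.
def valid?(validator_name, value)
  VALIDATORS.fetch(validator_name).call(value)
end

puts valid?(:comma_separated_list, 'a,b,c')
puts valid?(:comma_separated_list, 'a,,c')
puts valid?(:integer, 42)
```

The `-1` limit passed to `split` keeps trailing empty fields, so an input ending in a comma is rejected rather than silently accepted.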
The question is, that doesn't in any way render to the documentation itself? I think we do. That's a good question. I think right now, our documentation is slightly annoying in that we only represent the types for parameters in either of the request or response or something like that.
So right now, at least from this, it looks like we don't have that, but we are revamping our docs to allow for those and I assume that we would do whatever is necessary to add that there.
So the question is, is the documentation actually pulling out codes from these methods or is there another generation step or something like that? So yeah, so we've written our documentation system like ourselves.
Like I mentioned earlier, our monolithic codebase actually helps in this way and the code for the API's in the same area as the code for the documentation. So we actually do pull these like exact, these exact API methods and figure out where the documentation is, what the documentation should be for each parameter, and all of that is done automatically and we wrote that ourselves.
The question was so when someone loads the page is it dynamically generated and the answer is yes.
The question is what parts of our API don't we like? So one thing that we definitely have to work on, so now that we have all these versions, we haven't figured out a great way of versioning our API documentation. So our API documentation is kind of silly in that it only represents the current version of the API, and we're actually working on this right now.
Like how do we version our documentation, because now there's going to be different versions, like different branches, and maybe you have a little scroller where you can time lapse or something. We have no idea, but that's something I think is a flaw currently in our documentation and something we're working really hard to fix right now. I'd love to hear if you think that's cool or if you have any thoughts there also.
What do I see lacking in the documentation tools space? So I actually gave a variation of this talk last week at a conference, so I went to the API Strategy Conference, and a lot of people are actually really excited about this auto-generation, documentation generation thing, and a lot of the questions I got afterwards were like, "Oh, how did you implement?" and stuff like that, right?
And I basically said, "Oh, we built it ourselves so it was really easy to add whatever we wanted here." But a lot of people for their API documentation, they use something ... Basically what I imagine, I guess I've never used another tool, but like what I imagine to be like the WordPress of documentation generation, so like a CMS or something like that.
So I guess the ability to do things like this, the ability to customize it, to hook it into your own codebase so you can do things dynamically and stuff like that is I think what's lacking.
Thanks so much for having me.