Ben Porterfield
Data-Driven Product Changes with Venmo and Instacart

Ben Porterfield is the co-founder and VP of Engineering at Looker. He has worked in both engineering and product roles at numerous startups in Silicon Valley, including as the lead engineer at Sticky, Inc., later acquired by DeviceVM. He also co-founded Rally Up, a mobile startup acquired by AOL in 2010. Ben focuses on engineering process and expanding Looker’s traction as a leading business intelligence platform.

Introduction

Hey everyone, I'm Ben Porterfield. I'm the co-founder and VP of Engineering of Looker. Looker is a business intelligence platform. That means we help businesses make better decisions using data, which hopefully qualifies me to talk to you about business intelligence a bit.

Business Intelligence

I'm going to talk a little bit about business intelligence and I'm not going to talk about high level financial metrics like CAC and MRR and CMRR and magic number stuff. This is really important stuff for you to care about if you're a SaaS company, but changing these numbers won't help you understand how to change your business. Business intelligence will help you understand how to change your business.

So the two things we care about when it comes to business intelligence are visibility and control.

Visibility means you can see both at the aggregate level and the detail level what's happening in your business down to a specific customer. Control means you can understand, when you make a change, how that change affected your customers.

This was actually just in Salon yesterday. It's an interesting example of how visibility and control can help change your business. Netflix apparently went with House of Cards, two seasons, a hundred million dollars, just based on the fact that they could see that people who cared about the original series of House of Cards also cared about Kevin Spacey. Which I think is crazy, and it worked. It's totally working for them.

So how do we get visibility and control? I've broken the analytical process for visibility and control down into five parts: tracking, storing, merging, and retrieving data, which is kind of all the boring stuff, and then the analysis and decision making, which is the stuff that everybody actually cares about.

We're going to go over each of these parts briefly, and then talk about some lessons that I've learned from talking with customers about this sort of stuff.

Tracking

All right so tracking data, what do we track? The important things to track for business intelligence fall into two categories: transactional data and event data.

Transactional data you're probably already tracking because you have to because you're a technology company. That's how you run your business. You're tracking users, orders, and inventory. That's going into some sort of SQL database I imagine. Maybe it's Mongo.

Then event data is the kind of ancillary stuff that leads up to or is around transactions. Traditionally you'd think about tracking that stuff into KISSmetrics, Google Analytics, Mixpanel, maybe Keen IO, maybe Snowplow, or maybe Segment. That's all the other stuff that happens in your app. That's the clicks, the in-app button hovers, or whatever it is.

A couple points on tracking event data. A hugely important thing that most companies forget about is embedding tracking of event data in your product process.

What I mean by that is, any time you roll out a new feature, you should track enough information about that feature to understand the usage. Everyone forgets this, and no one actually understands when they roll stuff out how it's being used. That's super important.

Don't forget about server-side events. There's lots of information that happens on the server side that isn't necessarily transactional, but track that stuff too.

The taxonomy of that stuff actually matters in the long run when you have 1000 different types of events. You have a huge event space. You're bringing on new analysts or data scientists into your organization. You have a massive flat space of events and it's impossible to understand what they mean.

The important thing there is to actually think ahead of time how you're going to define things. Put them somewhere. Put them on a spreadsheet or something like that, and actually name space them so that you can help people understand the feature, what you've actually tracked.
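
As a rough sketch of what writing things down and namespacing them might look like, here is a tiny event catalog in Python. The event names, the `<area>.<feature>.<action>` naming scheme, and the helper functions are all made up for illustration; they are not from the talk or from any particular tracking tool.

```python
# Hypothetical event catalog. Names are namespaced as "<area>.<feature>.<action>"
# so a new analyst can tell which feature an event belongs to.
EVENT_CATALOG = {
    "checkout.cart.item_added": "User added an item to the cart",
    "checkout.cart.item_removed": "User removed an item from the cart",
    "checkout.payment.submitted": "User submitted payment details",
    "search.results.clicked": "User clicked a search result",
}


def events_in_namespace(prefix):
    """Return the sorted event names that live under a namespace prefix."""
    return sorted(name for name in EVENT_CATALOG if name.startswith(prefix + "."))


def validate_event_name(name):
    """Reject events that were never written down in the catalog."""
    if name not in EVENT_CATALOG:
        raise ValueError(f"Unknown event {name!r}; add it to the catalog first")
    return name
```

Calling `validate_event_name` from whatever code sends events is one way to keep the catalog and the actual tracking from drifting apart.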

Storing Data

Here are a couple points on storing data. This may be contentious. For transactional data you've probably all made this decision already, but if you haven't, or if you're adding new technologies, definitely go with SQL. It's super important for analytics in the long run. We can talk more about that later.

Store all the states. If you have any offline processes at all, it's really important that you track those processes. We quite often come into companies, say delivery companies where the drivers are delivering things, or recruiting companies that have interviews, and they aren't actually tracking what's happening during the deliveries or the interviews. If you aren't tracking those processes, then business users are going to have questions that you won't be able to answer.

This goes without saying: keep it clean. Messy schemas and messy data make it really impossible in the long run for people that come into your company new and fresh to understand what the hell is going on. So, if there are problems with your data, clean it up. If there are problems with your schema, clean it up ahead of time.

This is to really drive the first point home. This is Michael from Buffer. MongoDB was a great choice for Buffer from an engineering perspective, but turned out to be a terrible choice for them from an analytical perspective because other people, besides engineers, couldn't get the data back out.

A couple points on storing event data. This is also probably a little bit contentious, but I think you should own it. What I mean by that is, I think you should store all of your event data that you're tracking in your own SQL database somewhere.

You know at a minimum, if you can get it and you can merge it into your own SQL database, that's great too. But it's really, really beneficial if you can own it.

It's not to say that you can't use the ecosystem too. There are tons of great SaaS-based event-tracking systems. Mixpanel, Keen IO, Segment, Snowplow, there's tons of them. They're wonderful. I think you should use them too. I think that it's very smart for you to track to both places and build your own library that tracks to both places.

Don't forget when you're tracking event data to store everything.

If you're tracking any event that's happening, and if there's any transaction that's closely related to it, you should be tracking all the transactional IDs that are possibly related to it. So if there's a user doing something, track the user. If they spot something they're going to buy, track the thing that they're going to buy.

That is super important because eventually you're going to want to merge that data and understand the process by which the user got to a transaction. You can't do that if you aren't tracking the IDs.
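
Here is a minimal sketch of that kind of library, assuming a SQLite table standing in for your own SQL database and a plain list standing in for a third-party service. The `Tracker` class, the column names, and the event name are hypothetical; a real version would call the vendor's own client instead of `forwarded.append`.

```python
import json
import sqlite3
import time


class Tracker:
    """Hypothetical tracker: writes every event to our own SQL table and
    forwards the same payload to any third-party senders we register."""

    def __init__(self, conn, senders=()):
        self.conn = conn
        self.senders = list(senders)
        conn.execute(
            "CREATE TABLE IF NOT EXISTS events ("
            "name TEXT, user_id INTEGER, product_id INTEGER, "
            "order_id INTEGER, created_at REAL, properties TEXT)"
        )

    def track(self, name, user_id=None, product_id=None, order_id=None, **properties):
        # Transactional IDs get their own columns so events can be joined
        # to users, products, and orders later.
        event = {
            "name": name,
            "user_id": user_id,
            "product_id": product_id,
            "order_id": order_id,
            "created_at": time.time(),
            "properties": json.dumps(properties),
        }
        self.conn.execute(
            "INSERT INTO events VALUES "
            "(:name, :user_id, :product_id, :order_id, :created_at, :properties)",
            event,
        )
        for send in self.senders:
            send(event)
        return event


forwarded = []  # stand-in for a third-party event service
conn = sqlite3.connect(":memory:")
tracker = Tracker(conn, senders=[forwarded.append])
tracker.track("product.page.viewed", user_id=42, product_id=7, referrer="email")
```

The point of the design is just that one call records the event in both places, and that the IDs you will want to join on are first-class columns rather than buried in a blob.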

Merging Data

Which brings me to merging data. If you track transactional data, you can answer questions like, "How many sales did I make today?" and if you track event data you can answer questions like, "How many people landed on this specific page today?"

But if you actually put them both together, you can answer much more complicated business questions like, "How many people landed on a specific page, and then bought a product and it was the first time they've ever bought a product."

Being able to answer questions like that is crucial to being able to make innovative changes in your business. It's really important that you get to a point with your data that you can actually merge your transactional and event data.
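
To make that concrete, here is a small sketch of the "landed on a page and then bought for the first time" question, using SQLite from Python as a stand-in for an analytical database. The table layouts, page path, and rows are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE events (user_id INTEGER, name TEXT, page TEXT, created_at TEXT);
CREATE TABLE orders (id INTEGER, user_id INTEGER, created_at TEXT);

-- Three users land on the same page; user 3 never orders anything.
INSERT INTO events VALUES
  (1, 'page.viewed', '/spring-sale', '2015-03-01 09:00:00'),
  (2, 'page.viewed', '/spring-sale', '2015-03-01 10:00:00'),
  (3, 'page.viewed', '/spring-sale', '2015-03-01 11:00:00');

-- User 1 orders for the first time after the visit.
-- User 2 orders after the visit too, but had already ordered in February.
INSERT INTO orders VALUES
  (100, 1, '2015-03-01 09:30:00'),
  (101, 2, '2015-02-10 12:00:00'),
  (102, 2, '2015-03-01 10:15:00');
""")

# Join the event data to each user's first-ever order (transactional data)
# and keep only users whose first order came after the page view.
FIRST_TIME_BUYERS_FROM_PAGE = """
SELECT COUNT(DISTINCT e.user_id)
FROM events AS e
JOIN (
    SELECT user_id, MIN(created_at) AS first_order_at
    FROM orders
    GROUP BY user_id
) AS f ON f.user_id = e.user_id
WHERE e.name = 'page.viewed'
  AND e.page = ?
  AND f.first_order_at > e.created_at
"""

first_time_buyers = conn.execute(
    FIRST_TIME_BUYERS_FROM_PAGE, ("/spring-sale",)
).fetchone()[0]
```

Neither table can answer this on its own; it only works because both carry the same `user_id`.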

I'm a huge proponent of that and I think you should probably do it in what I like to call an "analytical database." That means an MPP database: Amazon Redshift, Aster Data (Teradata), Greenplum, Vertica, something like that. Or if you have tons of data, which you probably don't, but if you do, something like Spark, Hadoop, or Impala. Redshift is clearly the leader right now in the space.

If you want to try this out, it's certainly possible for you to do in Redshift. Redshift is super easy and it's super fast. Just a little side note, though: merging the data is not an easy thing to do. Often companies come to us wanting to ask questions, but it turns out they can't because their data isn't prepared for it.

They have to go away for three months, hire a consultant to do some crazy ETL thing to get their Mongo data, their Salesforce data, their Marketo data, and all their SQL data in the same place. Then they can ask the question. This is a big thing. It takes a long time, but it's important to think about from the start.

So a bit about why. In the past there weren't these analytical databases that actually made it possible to query in real time at a size of data and scale of data that made it useful to you. What we would do is we would have event data and transactional data and we'd ETL it all together and whittle it down until we had a small amount of data.

Then we'd put it into an enterprise-based data warehouse. We'd have a team that managed the data warehouse and made sure it didn't fall down. Those people would roll things into silos of data, and then the BI team would take that information and they would give it to the end user.

This is what we see across the board when we go to larger organizations. This is still happening. The problem here is that if you want to ask new questions, this is a really, really long innovation cycle. Especially if the ETL team is whittling away data that you thought you might care about and the end user says, "I have a question now," you're basically screwed. You have a three month iteration cycle on answering questions for business users.

The modern approach, the thing that we advocate, is throwing data into an MPP-based database. Usually it's Redshift or maybe it's Impala. Then the only people that care about this data are the data team. The only role of the data team, the data scientists or the analysts, in this specific case is to model this data. They translate it from how you think about it in the database to how you think about it from a business perspective. They then provide that information to the end user.

The innovation cycle is much shorter because you only have one team to work with. People are asking questions, and the data team's saying, "We can't answer that question yet, but here's where we're going to change the tool that you're using, and now you can answer the question."

Just a quick example: this is from an Asana blog post about their data infrastructure. This looks complicated. I've seen a Levi's data infrastructure diagram like this that was probably 20 times as big. This stuff gets really complicated, fast, and the reason it gets complicated fast is that in the past there weren't these MPP databases that made things easy, so plan ahead.

Retrieving Data

Now let's talk about getting it out. This is what Dollar Shave Club would call "the Juan problem." It's a problem that I guarantee all of you have today if you don't have a business intelligence tool. The problem is that you have a queue. Almost without fail, you're providing some tool to give your business users answers. The rest of the time they're going to either engineering, if you're a certain size, or analysts or data scientists, if you're at a larger size.

There's a queue of things and the business users want to get answers to questions and they can't. The reason they can't is because they get backlogged. When you get backlogged, you don't have access to data immediately and it's a bad scene. It's not just you guys, it's not just Dollar Shave Club.

Here's Venmo. They would write custom scripts every time they had requests and they'd repeat the process every time someone wanted to do something simple, like change the time frame. This ends up being a real problem in all companies. We've even had this problem at Looker before we had enough tools in place to support the business users.

This is an example of why. As your business grows and you get more sophisticated, people that are in certain departments ask more and more complicated questions. As they ask more complicated questions, the tools you provided to them probably don't provide the right solutions now.

If you're buying at an eCommerce store or if you're a retail buyer, at first you just care about what's selling, because you want to buy more of those things. Then later, when you're more sophisticated, you care about what's getting returned because maybe you don't want to buy as much of those things.

Then when you get even smarter about it, you want to ask yourself, "All right, what am I going to mail people they should buy? Because the chance of them buying that thing increases the likelihood of them buying something else." And this is just one department.

You can imagine if this type of complicated query starts stacking up with a bunch of other departments, with marketing and sales and whatnot. Things get tough really fast.

I'm definitely going to advocate for self-service. Giving business users a tool that allows them to ask these types of questions on their own is the most important thing you can do from a data retrieval standpoint.

This doesn't necessarily belittle data science, by the way. There are lots of decisions that business users need to make that don't require statistical analysis. They just require people to be able to answer very simple questions.

If we wanted to open a new market in Maine, for example, what are the markets like around Maine? We can look at that and then we can make better decisions.

Game-changing insights don't always come from the analysts or data science group. They often come from business users because the business users are the ones that are embedded in the product process. They understand the problem the best and they're the best at formulating the right questions to get the right answers. So we see the data science role transitioning to 50% supporting the business user and then 50% doing more complicated analysis that the business user can't necessarily do today.

Analysis & Decision Making

Let's talk about the fun stuff: analysis and decision making. A couple things I want to talk about that you can hopefully bring back to your data and that I think are interesting. One is clearly defining success metrics. Two is looking for low-hanging fruit. Three is going a level deeper with your analysis.

#1 Success Metrics

Let's talk about success metrics. What are success metrics? Success metrics are focused on outcome, which sounds so simple, but is actually so not common when we go talk to companies. Most of the time, this involves engagement and retention.

What do you want to happen and are people coming back? If people aren't coming back to your product and if people aren't using your product, then you're doomed. Engagement and retention are the most important things to focus on and, strangely enough, when we go into companies, quite often it's not the thing that people are focused on.

People are often focused on bad metrics and bad dashboards because they think visualizations are the key to understanding things, which they often aren't. They think that with weird metrics somehow they can get them to the right place.

For example, if you're a delivery company, something a lot of people jump to might be, "What's the average delivery time?" Right, that's an important metric for us to care about, but it turns out that's not really important at all. It totally depends on the radius of the delivery and maybe you want to cohort it by the delivery. Or maybe what you really care about is, are the people that you deliver to happy with the speed of delivery?

Sprig is a crazy good example of this. Every time you buy something through Sprig, they deliver you food, and the next time you open the app they're like, "How was it? Excellent? Good? Or bad?" That's the thing that you really care about. That's a real success metric; delivery time is not.

Engagement ends up being one of the most important success metrics. I want to talk a little bit about how to track it. So it's definitely not Google Analytics. It's definitely not page views and it's definitely not time on page, which Google Analytics is trying to push on you right now. I think you're going to have to invent it, usually, based on your business. It's going to really depend on the things that you care about for engagement.

Upworthy has blogged about this. They do a thing called "attention minutes." Attention minutes care about reading stuff. Their attention minutes are how far down does somebody get in the article? Are they moving their mouse around? Are they scrolling? Are they pressing play or pause on the video? And for their business, that's a thing that makes a lot of sense.

For Looker, we care about a thing called approximate usage, which is a term that we've invented, but what it really means is we care about specific events that suggest that people are engaged with data. We track all the events in the world, but some events mean that people are exploring through data and so those events are the ones that we care about.

This is an example of how we derive that actually. We derive approximate usage from our event table. By the way, this is exactly why I think you should track your own event data and own it yourself because when you own it yourself, you can actually care about things like engagement and event metrics like our approximate usage and you can modify it and change it and see how changing it actually affects your business.

So we're inventing approximate usage here based on our event table. We're using a select statement to pull out a table that looks at the user, the day, and how much time they've spent engaged in Looker. This is, to some degree, a simplified version. We actually only care about specific events. I'm looking at all the events, but regardless, this is an interesting way to care about engagement. How many times are people doing a thing you think is useful in your product.
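
The query itself isn't reproduced here, so this is only a guess at the shape of it. The sketch below assumes one way to turn raw events into "time engaged": count the distinct five-minute windows in which a user fired at least one event, per user per day, and multiply by five. The five-minute window, the table layout, and the event name are assumptions for illustration, not Looker's actual definition.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE events (user_id INTEGER, name TEXT, created_at TEXT);
INSERT INTO events VALUES
  (1, 'explore.run', '2015-03-01 09:00:10'),
  (1, 'explore.run', '2015-03-01 09:03:00'),
  (1, 'explore.run', '2015-03-01 09:07:00'),
  (1, 'explore.run', '2015-03-02 14:00:00'),
  (2, 'explore.run', '2015-03-01 16:20:00');
""")

# strftime('%s', ...) gives seconds since the epoch; integer-dividing by 300
# assigns each event to a five-minute window. Counting distinct windows
# approximates minutes of engagement without needing explicit sessions.
APPROXIMATE_USAGE = """
SELECT user_id,
       DATE(created_at) AS event_date,
       COUNT(DISTINCT CAST(strftime('%s', created_at) AS INTEGER) / 300) * 5
           AS approximate_usage_minutes
FROM events
GROUP BY user_id, DATE(created_at)
ORDER BY user_id, event_date
"""

usage = conn.execute(APPROXIMATE_USAGE).fetchall()
```

To restrict it to the specific events you care about, you would add a `WHERE name IN (...)` filter, which is exactly the kind of tweak you can only make when the event data lives in your own database.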

That approximate usage stuff is actually a smaller example of a much broader theme, which is derived tables.

Let's say you're putting a bunch of transactional data into your analytic system. You're putting a bunch of event data in your analytical system. That data's all good, but it's not analytical data, right? It's the data that your engineers are tracking to run your business and to track events.

The idea about derived tables is, you want to construct analytical information and put it back into the analytical database. Now to be fair, you can actually calculate this stuff on the fly with SQL if you have small amounts of data, but once you have large amounts of data, it's actually impossible because it takes too long to return.

Derived tables are a very big thing that we push on people all the time. I think it's super important to try them out. A bit about derived tables, it's very easy to start, you can just start with subselects if you're building analytical stuff.

Once that gets complicated, you can just do SQL on cron. Every night, you can look at your event table the same way we were calculating our derived metrics. Look at the event table, delete the old derived table, pull the stuff back out and push it back to the analytical database.

This is okay because analytical databases are built to hold lots of data. In the past that was a difficult thing to do. Now we can push all kinds of data to it.

This stuff's most useful at a row level. This isn't an aggregate, like a roll up sales by month or sales by week type of thing. This is actually adding more data to the database, but it's more useful analytical data that takes a long time to calculate.

This stuff's great for cohorts and sessionization. Tiered, derived dimensions versus some other metric end up being super important a lot of the time.

As an example of a derived table, here's user order facts. This is basically information about a user, based on the order table. It is information about orders that they've made, so information like the lifetime number of orders that a user has made, the first time they made an order, the last time they made an order, or the distinct number of months they have with orders.

So every night, for example, we might pull the stuff out of the database, munge it around, and then push it back to the analytical database. Then we can query the analytical database with this additional information later, which provides a lot more opportunity to do complicated analytics.
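
A minimal sketch of that nightly rebuild, again using SQLite from Python as a stand-in for the analytical database. The `orders` layout, the column names in `user_order_facts`, and the rows are assumptions for illustration; the function is the kind of thing you might run from cron.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (id INTEGER, user_id INTEGER, created_at TEXT);
INSERT INTO orders VALUES
  (1, 1, '2015-01-05'), (2, 1, '2015-01-20'), (3, 1, '2015-03-02'),
  (4, 2, '2015-02-14');
""")


def rebuild_user_order_facts(conn):
    """Drop and recreate the derived table from the raw orders table."""
    conn.executescript("""
    DROP TABLE IF EXISTS user_order_facts;
    CREATE TABLE user_order_facts AS
    SELECT user_id,
           COUNT(*) AS lifetime_orders,
           MIN(created_at) AS first_order_at,
           MAX(created_at) AS latest_order_at,
           COUNT(DISTINCT strftime('%Y-%m', created_at)) AS distinct_months_with_orders
    FROM orders
    GROUP BY user_id;
    """)


rebuild_user_order_facts(conn)
facts = conn.execute("SELECT * FROM user_order_facts ORDER BY user_id").fetchall()
```

Note that the result is one row per user, not a roll-up by week or month, so it can be joined back to users, events, or orders like any other table.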

Here's an example of one. I've actually tiered the lifetime number of orders per user, into buckets. So the T02 is people that have ordered one time, T03 is people that have ordered two times, T04 people that have ordered three to four times. Then I pivoted that out by traffic source and looked at order profit, which looks kind of silly, but here's a graph of it, which looks a little bit more reasonable.
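
Here is roughly how that tiering and pivot could be computed, sketched in plain Python. The T02, T03, and T04 buckets follow the description above; the T01 and T05 buckets, the traffic sources, and all the numbers are assumptions made up for the example.

```python
from collections import defaultdict


def order_tier(lifetime_orders):
    """Bucket a user's lifetime order count into a tier label."""
    if lifetime_orders == 0:
        return "T01 (0)"      # assumed bucket, not mentioned in the talk
    if lifetime_orders == 1:
        return "T02 (1)"
    if lifetime_orders == 2:
        return "T03 (2)"
    if lifetime_orders <= 4:
        return "T04 (3-4)"
    return "T05 (5+)"         # assumed bucket, not mentioned in the talk


# (traffic_source, lifetime_orders, total_order_profit) per user; made-up rows.
users = [
    ("facebook", 1, 4.0),
    ("facebook", 1, 6.0),
    ("organic", 2, 20.0),
    ("organic", 4, 55.0),
    ("email", 7, 120.0),
]


def profit_by_tier_and_source(users):
    """Pivot order profit with tiers as rows and traffic sources as columns."""
    pivot = defaultdict(lambda: defaultdict(float))
    for source, lifetime_orders, profit in users:
        pivot[order_tier(lifetime_orders)][source] += profit
    return {tier: dict(columns) for tier, columns in pivot.items()}


pivot = profit_by_tier_and_source(users)
```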

It's very obvious from this graph that Facebook does a terrible job of bringing in people that ever order anything useful. People coming through Facebook are not ordering enough. This is fake data, I made up the graph, you could have figured this out from the first tier, but in lots of cases, this really will matter.

A couple other things you can actually derive that are relevant to engagement and success metrics. Usage is super important. How likely is a person to purchase if they do a thing? That's a great thing to derive and figure out.

Retention. Are people coming back? When are people coming back, and why are they coming back? What type of people are coming back, did they come from a certain source? Repeat buyers is a huge one that we never see people look at until we go in and push them to look at it.

Why are people buying more than once? What happened in your app that made them buy more than once? What specific API call were they using? Churn is not necessarily just people going away forever, but what are they likely to never do again in any scenario? How unlikely are they to ever use a certain API call or spin up another server?

Time to transaction is also a huge one. Heroku looks at time to transaction in terms of activation. How long from when a person signed up until a person did a git push, and how long between that time and then the time that it took them to actually buy a thing? If you can look at those two windows and then figure out why the shorter ones are important, you can figure out who you need to target or how you need to lead your users that aren't doing that in the right direction.

Sometimes you have to invent stuff. Here's an example of that. This is Venmo. Venmo lets you pay or charge people with a mobile app or on the web. So Venmo once pushed a product change that made it a little bit confusing as to whether or not you were paying or charging a person. So the product team started getting a bunch of screen shots and emails about people overdrafting their accounts and whatnot. They gave it to their analyst team, "What's going on? Is this a real problem? We need to figure this out."

And so what they did was say, "All right, it seems like people are doing this pay/charge mistake thing, how are we going to figure that out?" So they invented a metric. They used a derived table and they called it a pay/charge mistake. The way they did that was they looked for people that had paid a certain amount, and then charged double that amount within two days. That implied that the person had paid the thing, thought, "Eh, I actually didn't mean to pay the thing, I meant to charge the thing," and then charged it. It's double because they need the original payment back plus the amount they meant to charge in the first place.
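
As a sketch of how a metric like that could be expressed, here is a self-join over a hypothetical `transfers` table in SQLite. The table layout, the requirement that the charge goes to the same counterparty, and the sample rows are assumptions for illustration; this is not Venmo's actual schema or query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE transfers (
    id INTEGER, actor_id INTEGER, target_id INTEGER,
    kind TEXT, amount_cents INTEGER, platform TEXT, created_at TEXT);
INSERT INTO transfers VALUES
  (1, 10, 20, 'pay',    2500, 'ios',     '2015-03-01 12:00:00'),
  (2, 10, 20, 'charge', 5000, 'ios',     '2015-03-01 12:05:00'),
  (3, 11, 21, 'pay',    1000, 'android', '2015-03-01 13:00:00'),
  (4, 11, 21, 'charge', 2000, 'android', '2015-03-09 13:00:00'),
  (5, 12, 22, 'pay',    3000, 'web',     '2015-03-02 08:00:00');
""")

# A "pay/charge mistake" here is a payment followed, within two days, by a
# charge from the same person to the same counterparty for exactly double.
PAY_CHARGE_MISTAKES = """
SELECT p.platform, COUNT(*) AS mistakes
FROM transfers AS p
JOIN transfers AS c
  ON c.actor_id = p.actor_id
 AND c.target_id = p.target_id
 AND c.kind = 'charge'
 AND c.amount_cents = 2 * p.amount_cents
 AND c.created_at > p.created_at
 AND julianday(c.created_at) - julianday(p.created_at) <= 2
WHERE p.kind = 'pay'
GROUP BY p.platform
"""

mistakes = conn.execute(PAY_CHARGE_MISTAKES).fetchall()
```

In the sample rows only the iOS pair counts: the Android charge is double the payment but arrives eight days later, so it falls outside the window.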

Once they made a derived table and graphed it, it was super obvious. If you look at the top chart that we're looking at, the orange line is the iOS change that happened, the other lines are Android and the web, so obviously something has gone wrong here. There's a bunch of pay/charge mistakes.

The bottom graph is the new iOS versus the old iOS. Nothing changed with the old iOS, but with the new iOS app, obviously there was a problem. So they invented this metric. They made a derived table out of it, and then it was super obvious to them. They needed to go back to the product team and say, "It's time to actually make a change to the app." They made a change, and everything went back to normal.

Briefly, a bit about inventing metrics. Identify a behavior. It could be a good or bad behavior; as long as you see something that's a little bit of an outlier, it's worth identifying. Measure the percent of the population that's doing that thing. Then experiment. You won't always get this right, but quite often when you invent stuff it ends up resulting in lots of cool insights.

#2 Low-Hanging Fruit

Next, let's talk about low hanging fruit. Hotel Tonight is an app that lets you book hotels the day of. I don't know if any of you remember what flash-sale sites were like, but Hotel Tonight was apparently invented in the heyday of flash-sale sites.

The way they work is you can't book a hotel room until noon. Or you couldn't. So Colin Zima here, who is now our chief analytics officer, was head of analytics at Hotel Tonight, which he joined through an acquisition. As soon as he got into Hotel Tonight he was thinking, "This is weird. It's a silly thing to force people to not be able to book until noon. Let's see if that impacts anything at all."

He made a graph. The X-axis of this graph is the time that people opened the app for the first time ever. So these people have never used the app before. They just downloaded and opened it. The Y-axis is the chance that the person will ever book a room on Hotel Tonight, ever in the world.

I don't know if you can read the numbers here, because it looks pretty small, but it's actually 9 AM because everything's time shifted. All the times are on PST. So 9 AM is noon on the East Coast. Where you see the question marks, that's noon on the East Coast, and things jump up and you see a spike. Then another spike at Central Time and another spike at Pacific Time.

Everyone that opens the app for the first time after the sale starts will happily book a room at a 20% greater chance than people that open it before. You'd intuitively think, "Oh maybe these people were invested, because they opened the app, so they'll go, then come back at noon and they'll book something." But they're losing basically 20% of their potential future lifetime value.
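
The calculation behind a graph like that is simple enough to sketch. Below, each user is reduced to the hour they first opened the app and whether they ever booked; the data is invented and the function name is hypothetical.

```python
from collections import defaultdict

# (hour of first-ever app open, did this user ever book a room); made-up rows.
first_opens = [
    (8, False), (8, False), (8, True), (8, False),
    (9, True), (9, False), (9, True), (9, False),
]


def booking_rate_by_first_open_hour(first_opens):
    """Share of users who ever booked, grouped by the hour of their first open."""
    opened = defaultdict(int)
    booked = defaultdict(int)
    for hour, ever_booked in first_opens:
        opened[hour] += 1
        booked[hour] += int(ever_booked)
    return {hour: booked[hour] / opened[hour] for hour in sorted(opened)}


rates = booking_rate_by_first_open_hour(first_opens)
```

Plotting `rates` with the hour on the X-axis gives the shape described above: a jump at the hour the sale opens.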

I guess it's less because that's just one booking, but that's the loss on that whole bottom left quadrant. So what they did was say, "We're still going to be a flash sale site," for whatever reason, but they moved the flash sale back to nine. That covers the entire three-hour window, basically half that quadrant. As soon as they did that, it moved straight up. That's the simplest decision you could ever make. The business was around for three years before Colin joined, and nobody thought, "Hey, maybe we should not do what we're doing?" It seems really easy.

This is Andrew at Instacart. He's an analyst at Instacart. This is a similar example about low-hanging fruit. Instacart, you probably know, you order stuff on your app and then people go to the store. They pick it up for you and they bring it to your house.

They were having a problem where if something you ordered isn't in the store, they have to refund you or they have to go find a similar thing, which is a pain in the ass. It ends up messing with efficiency and it makes customers unhappy. So all Andrew had to do when he joined, and Instacart has been around for a while, all he had to do was say, "Well, what are the types of products that we're selling that are causing this problem?"

It turns out, it's all seasonal, perishable items. It's all fruit, literally. All they had to do was say, "Okay, for fruit we're going to talk to our partners and as long as they have a certain volume of fruit movement we'll show that on the app, and if they don't, then we won't." What they got, on the graph, is the lift in their refunds. So simple. Simple stuff that makes the business way better.

Every company we go into has something like this that's super super simple and all you have to do is think smart and look at the right things.

#3 Going One Level Deeper

Now I'm going to talk about the complete opposite of that. Sometimes you look at really simple things and you make assumptions that are actually wrong. If you look a little bit deeper, it turns out that if you hadn't looked deeper, it would have been bad.

So YPlan is an example. YPlan lets you purchase group tickets for stuff. You can go, "There's this thing happening on Friday, we're all going to go," and you buy it all through YPlan. They make you not have to search on Facebook for stuff to do.

They didn't send me any charts, but in the top chart here, we have group size, and on the Y-axis we have the percentage chance of buying something. Groups of one or two people are way more likely to book on YPlan than groups of three or four.

So the immediate thing you would think when you look at this is, "Well, I guess people that are using our app are only going out on their own or with maybe one or two friends, so we should focus on events that are better for one or two people."

But intuitively they thought, "I don't know, that doesn't quite make sense, let's look at the time to book." It turns out that as soon as you add more people the time to book goes up exponentially. This is because people are jerks, right? You've asked five people to come to something on Friday. One guy might have a date and so he doesn't go, and if they don't all agree, they end up cancelling the whole booking and it all goes away.

The business decision that you would have made if you were just looking at the top level metrics is let's focus on events that are one or two person events. But instead, the business decision they made was, let's build a product that helps groups of people. Something that facilitates groups of people making decisions faster about whether or not they're going to events, which is a totally different thing and totally changes the entire nature of their business.

This is Michael Schiff from Heroku, he's the VP of business operations at Heroku. He sent me a couple slides that are also on the one level deeper vein. On the left here, we have signups by month, and on the right we have an activation rate. Activation rate in this case is what they care about, are people actually pushing within 30 days after they've signed up?

If you look at the last quarter, it looks like their signups by month is going down. That's a negative thing, you want more people to sign up not less people. If you look at the right, you'll see that their activation rate is actually going up. If we only cared about looking at signups by month, we might be thinking, "We're really scared."
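
Here is a small sketch of computing both numbers side by side, assuming activation means a first push within 30 days of signup as described above. The dates are made up and the function name is hypothetical.

```python
from collections import defaultdict
from datetime import date, timedelta

# (signup date, date of first push or None if they never pushed); made-up rows.
signups = [
    (date(2015, 1, 5), date(2015, 1, 6)),
    (date(2015, 1, 9), None),
    (date(2015, 1, 20), date(2015, 3, 15)),   # pushed, but after the 30-day window
    (date(2015, 1, 25), None),
    (date(2015, 2, 3), date(2015, 2, 10)),
    (date(2015, 2, 18), date(2015, 3, 1)),
]


def signups_and_activation_by_month(signups, window=timedelta(days=30)):
    """Return {signup month: (signup count, share activated within the window)}."""
    totals = defaultdict(int)
    activated = defaultdict(int)
    for signed_up, first_push in signups:
        month = signed_up.strftime("%Y-%m")
        totals[month] += 1
        if first_push is not None and first_push - signed_up <= window:
            activated[month] += 1
    return {m: (totals[m], activated[m] / totals[m]) for m in sorted(totals)}


report = signups_and_activation_by_month(signups)
```

In this toy data signups drop from four to two while the activation rate goes from 25% to 100%, which is the same pattern: the first number alone looks scary, the second tells you why it might not be.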

If you cared about a host of things, including activation rate, it's clear that things are probably not as bad as they seem. It turns out that two things are impacting this. One, in the last quarter, marketing focused on getting people that cared, that were more relevant to the platform. They were going for less numbers, more relevant numbers. Then they also did product changes that helped bring activation to the forefront.

What they do is, as soon as you sign up, they ask one question: "What is the language you're working in?" As soon as you answer that question, they bring you to the right place in the dev center and walk you through how to commit and push with that language. Those two changes together made the activation rate increase.

But if you're only looking at signups by month, it seems like a bad thing is happening. Many times it's important to look at more than just the high-level things, then go a little bit deeper. This is also why oftentimes we see dashboards that are terrible. You would almost never dashboard activation rate unless you're smart enough to start getting deep into things. People always do signups by month and they're thinking, "What's going on?"

Here's another example. On the left we have direct billings and on the right we have a waterfall. Direct billings is pretty obvious, right? It's just all the different types of billing that they're getting from all the different parts of their platform. Clearly that's up and to the right, so that's all positive. On the right though, we're looking at a waterfall of gains and losses. That's for each month. From the previous month, are people buying more? If they are, then it's green. A person from the month before, are they paying more or less?

If they're paying less, then it's red. If you look at December, which is the farthest right, clearly, there's more billings. Everything is positive, right? But if you look on the far right, it turns out that things actually look a little bit scary. A lot of people that are currently in the platform are paying less and there's a lot of people that are actually paying even less than they were paying the month before. People aren't buying as much and they're actually paying less.

Turns out that according to him, this is a seasonality thing; they expect that every December, so it's not that big of a deal. But if you don't look at this kind of stuff, then it would look like everything is great, and you wouldn't understand enough to dive deeper into the system. Part of the point of this is to not confuse an increase in a metric with success, or a decrease with failure.
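The waterfall idea can be sketched as a per-customer, month-over-month comparison. This is a rough illustration only: the customer names and amounts are invented, and counting a brand-new customer's billing as a gain is an assumption, not something the slide specifies.

```python
from collections import defaultdict

# Hypothetical billing rows: (customer, month, amount billed).
billings = [
    ("acme", "2013-11", 100), ("acme", "2013-12", 80),
    ("bolt", "2013-11", 50), ("bolt", "2013-12", 70),
    ("cora", "2013-11", 200), ("cora", "2013-12", 150),
    ("dyno", "2013-12", 120),  # new customer in December
]

def waterfall(rows, previous_month, current_month):
    """Split the month-over-month change in billings into gains and losses."""
    by_customer = defaultdict(dict)
    for customer, month, amount in rows:
        by_customer[customer][month] = by_customer[customer].get(month, 0) + amount

    gains = losses = total_previous = total_current = 0
    for months in by_customer.values():
        before = months.get(previous_month, 0)
        after = months.get(current_month, 0)
        total_previous += before
        total_current += after
        delta = after - before
        if delta > 0:
            gains += delta      # customer pays more than last month (green)
        else:
            losses += -delta    # customer pays less than last month (red)
    return {
        "total_previous": total_previous,
        "total_current": total_current,
        "gains": gains,
        "losses": losses,
    }

result = waterfall(billings, "2013-11", "2013-12")
print(result)
```

Here total billings go up, yet two of the three existing customers are paying less, which only the gains and losses split makes visible.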

We see this all the time, especially with virality metrics. A company invents some new way to convince their users to invite their friends to the system, and so then they're measuring invite numbers, which look great. "We're inviting a ton of people." Unfortunately, the people that you're inviting aren't actually sticking around. They're not coming back, which is the thing you actually want.

So again, success metrics and engagement are the most important things to care about. Not invites or whatever metric you thought you invented that was a cool thing that you wanted.

Takeaways

So things that I think are super important to take away from this: If you aren't putting data into an analytical database, doing that is the most important thing you can do. It'll take a while. It'll probably be more expensive than you want, but it will absolutely change your business. Give business users a tool that empowers them to answer questions on their own, as much as you can. Then define success metrics, and success metrics should absolutely focus on engagement and retention. Go for the shallow, low-hanging fruit, but also go deep. It really depends, just be smart about it. That's all I have. I'm happy to answer questions, and I'm ben@looker.com if you want to send me an email with questions as well.

Q&A

Why do you prefer SQL?

For a couple reasons. For transactional stuff, it's just easier. It's usually the right thing to do. For event stuff, I can go back and forth about it, but the problem is that eventually I'm really going to advocate for merging the data.

Merging data and moving data around is super hard. It's always a challenge, and it always breaks. If you're using SQL in both cases, then your chance of being successful with merging data is much higher.
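As a minimal sketch of why having both kinds of data in SQL helps, the example below joins transactional rows with event rows in one query. The `orders` and `events` tables and their contents are hypothetical, and SQLite stands in for whatever analytical database you actually use.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Transactional data (hypothetical): one row per order.
CREATE TABLE orders (user_id INTEGER, amount REAL);
-- Event data (hypothetical): one row per tracked product event.
CREATE TABLE events (user_id INTEGER, name TEXT);
INSERT INTO orders VALUES (1, 20.0), (1, 35.0), (2, 10.0);
INSERT INTO events VALUES
  (1, 'search'), (1, 'search'), (1, 'share'), (2, 'search'), (3, 'search');
""")

# Aggregate each source first, then join, so rows are not multiplied by the join.
MERGED_SQL = """
SELECT e.user_id,
       e.event_count,
       COALESCE(o.revenue, 0) AS revenue
FROM (SELECT user_id, COUNT(*) AS event_count FROM events GROUP BY user_id) AS e
LEFT JOIN (SELECT user_id, SUM(amount) AS revenue FROM orders GROUP BY user_id) AS o
       ON o.user_id = e.user_id
ORDER BY e.user_id
"""

merged = conn.execute(MERGED_SQL).fetchall()
for user_id, event_count, revenue in merged:
    print(user_id, event_count, revenue)
```

Because both sources speak SQL, the merge is a single join rather than a custom export and import step.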

Then once you have data someplace that speaks SQL that's analytical, all the tools that you care about that you're going to provide to your data scientists or your business users, etc. are going to speak SQL.

SQL is going to be the way that people talk about analytics in the long run. You see this with a bunch of MPP databases. IBM is investing in things, Amazon's investing in things, and Microsoft's investing in things that basically let people talk SQL.

SQL, it turns out, is the right way to ask questions about data. It's not your custom query language that you build, or your custom GET API. Ultimately, you're going to have more potential to use multiple tools and to be more successful if all of your data is somewhere that's SQL accessible.

Business Intelligence errors

We like to talk about being a data-driven company, but we definitely have done a terrible job up until now of actually tracking events in the product. That's why I said embed the tracking of events into your product process; that's a huge deal. The other day we were thinking, "Okay, we made a mistake with permissions, so we need to make this combination of permissions not work anymore because there's a security hole, right?" And I thought, "How many people are using this set of permissions?" We were like, "We don't track that right now, so we don't know."

Our best guess is maybe it's 12 companies, but who knows how many people we're going to impact. All we had to do was add events to that and we could have tracked it and understood exactly what was happening. But we couldn't. It's currently like this. It's happening right now. We're rolling out a release and we're thinking, "Hmmm, hopefully no one notices that this is a big deal."

If we had tracked that stuff and had embedded events into our product process, we would have known ahead of time what to expect. That's just one example. There's tons of examples like that. We're getting a lot smarter about it, but we need to do more. We preach it a lot, but we don't necessarily do it as much.
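A minimal sketch of what embedding events could look like for the permissions example follows. Everything here is hypothetical: the `track` helper, the event name, and the permission names are invented for illustration and are not Looker's actual code.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (company TEXT, user_name TEXT, name TEXT, detail TEXT)")

def track(company, user_name, name, detail=""):
    """Record one product event; feature code would call this as part of its work."""
    conn.execute("INSERT INTO events VALUES (?, ?, ?, ?)", (company, user_name, name, detail))

def combo_key(permissions):
    """Normalize a set of permissions so the same combination always has the same key."""
    return "+".join(sorted(permissions))

def apply_permissions(company, user_name, permissions):
    # The real permission change would happen here; the event is recorded alongside it.
    track(company, user_name, "permissions_applied", combo_key(permissions))

# Hypothetical usage by a few customers.
apply_permissions("acme", "ann", {"edit_models", "see_dashboards"})
apply_permissions("acme", "bob", {"see_dashboards"})
apply_permissions("bolt", "cy", {"see_dashboards", "edit_models"})
apply_permissions("cora", "di", {"run_queries"})

def companies_using(permissions):
    """Count distinct companies that have applied exactly this combination."""
    row = conn.execute(
        "SELECT COUNT(DISTINCT company) FROM events WHERE name = ? AND detail = ?",
        ("permissions_applied", combo_key(permissions)),
    ).fetchone()
    return row[0]

affected = companies_using({"edit_models", "see_dashboards"})
print(affected)
```

With events like these in place, "how many companies will this change affect?" becomes a query instead of a guess.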

Measuring engagement

For us? There's the example that I gave, approximate usage, which is the thing that we probably care about the most. We definitely care about how many people are using dashboards. We have different classes of users, right? It's hard to get into it without getting too deep into the way our product works. People at our customers write code. We're a language-based platform.

We care about how many times people are pushing new changes to that model that describes their data. That's a huge one for us. The more that you're changing the model, the more that you're engaged. That's from the analyst perspective. Then the more you're querying, the more you're engaged from the business user perspective. The more you're looking at dashboards, the more you're engaged with not necessarily a business user, but maybe a line-worker perspective.

It really depends, but we look at all those things and more across the board. As I said before, it's really going to depend on your business. You have to be smart about it and say, "What's the thing that we care about? What do we think should be happening?" That's the thing that you should go figure out how to measure.

Business Intelligence & small companies

So actually that slide was what I would call an old-school approach to business intelligence. That had an ETL team, which is Extract-Transform-Load. They were taking data from various places, munging it around, making it smaller, and then putting it into a data warehouse. There was an EDW, which is an Enterprise Data Warehouse team, and those people are protecting this core of data that's really important.

Then they were pushing that to a BI team, a business intelligence team. Those people get these silos of data that they can then build reports on. If you're a smaller organization, you should do the next slide, which was basically you throw all your data somewhere.

Even for large organizations, you can do this now. You couldn't in the past because databases couldn't actually handle as much data as they can now. Analytical databases are great because you can throw trillions of records into them and they still work, so you shouldn't have to care anymore.

All you need is an engineering team, or if you get a little more sophisticated, an analyst or a data science team. Those people can do everything now. You shouldn't have to care about moving that data around in weird ways like you had to in the past.

Moving data out of a NoSQL database

For Mongo I think there's a library called MoSQL that maybe Stripe wrote? There's one that people definitely use a reasonable amount. I think across the board, it's going to get easier and easier as time goes on. This is a thing that people have to care about, and they have to get that data into new places.

Or you hire a consultant. We've seen a lot of people do that. The problem with that is you want to move this data, but you have to deal with schema complications, time types, all sorts of casting. It ends up being a process. The sooner you do it, the better off you'll be. I don't have a good strategy for you, I'm sorry. I think MoSQL is a thing worth looking into though.
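To illustrate the schema, time type, and casting issues mentioned above, here is a rough sketch that flattens document-style records into a SQL table. It uses plain Python dictionaries rather than a real MongoDB client or MoSQL, and the field names and sample documents are assumptions.

```python
import sqlite3
from datetime import datetime, timezone

# Hypothetical documents with the kinds of inconsistencies a document store allows:
# amounts as strings or numbers, timestamps as epoch seconds or ISO strings,
# and an optional nested field.
documents = [
    {"_id": "a1", "amount": "12.50", "created": 1388534400, "user": {"name": "Ann"}},
    {"_id": "b2", "amount": 7, "created": "2014-01-02T03:04:05+00:00"},
]

def to_iso(value):
    """Cast either epoch seconds or an ISO string to one UTC ISO-8601 format."""
    if isinstance(value, (int, float)):
        return datetime.fromtimestamp(value, tz=timezone.utc).isoformat()
    return datetime.fromisoformat(value).astimezone(timezone.utc).isoformat()

def flatten(doc):
    """Turn one document into a fixed-shape row: (id, amount, created_at, user_name)."""
    return (
        str(doc["_id"]),
        float(doc["amount"]),
        to_iso(doc["created"]),
        doc.get("user", {}).get("name"),  # missing nested field becomes NULL
    )

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE payments (id TEXT PRIMARY KEY, amount REAL, created_at TEXT, user_name TEXT)"
)
conn.executemany("INSERT INTO payments VALUES (?, ?, ?, ?)", [flatten(d) for d in documents])

rows = conn.execute(
    "SELECT id, amount, created_at, user_name FROM payments ORDER BY id"
).fetchall()
for row in rows:
    print(row)
```

Each cast in `flatten` is a decision someone has to make per field, which is why the move ends up being a process rather than a one-off copy.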

Becoming a 'data driven' company

So it will never happen in your company until people that don't know SQL can ask questions on their own. I'm being a little self-serving when I say this, but we aren't the only tool that does this. Certainly people have to have the ability to ask questions and get results without having to go to a person and wait.

When they have to go to a person and wait, they end up just doing the thing anyways, without being data-driven. When a lot of people have to go to a person and wait, it generally means they have to wait for a long time, so people just give up.

The thing that I would definitely advocate, I think I had a slide on it, was to give people a tool that gives them the data. Quite often what people do is they think, "Okay, well, I'm an engineer. I can code myself out of this problem. I'll just build an ad hoc query tool that kind of works." That works for a little bit, and it's a reasonable thing to do when you're a smaller-sized company, but it's a crazy waste of resources in the long run, because it's not your core competency.

The other thing I would say is, a lot of people trend towards dashboardy tools because they think that makes them data-driven. It's almost the opposite. It's almost negative for the company because it makes people think they're looking at things to make decisions. Dashboards don't let you make any decisions for the most part. They're usually pretty bad for you.

What actually helps is tabular information that answers a question that you have about the business. You could literally have a wonderfully working data-driven company that didn't have a single dashboard in it, because they're basically useless. The questions are always ad hoc. They're always about making a specific decision at that time.

Looking at that stuff across time is almost never useful. Someone wants to do something. They want to create a new landing page. They want to open a new market somewhere. They want to hire more people. Whatever it is, they have a specific question and the only way you can help a bunch of people that don't know SQL is to answer specific questions.

Give them a tool that writes SQL for them. Which sounds kind of obnoxious, I understand, but that's just the way it is. You won't be data-driven until people can answer their own questions.
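As a toy sketch of "a tool that writes SQL for them", the example below lets someone pick a dimension and a measure by name and generates the query. The catalog, table, and column names are hypothetical, and this is not how Looker or any particular product works; it only shows the general idea.

```python
import sqlite3

# Hypothetical catalog, defined once by someone who knows SQL.
TABLE = "customers"
DIMENSIONS = {
    "market": "market",
    "signup_month": "strftime('%Y-%m', signed_up_at)",
}
MEASURES = {
    "customers": "COUNT(*)",
    "revenue": "SUM(revenue)",
}

def build_query(dimension, measure):
    """Write the SQL for 'measure by dimension' using only catalog entries."""
    if dimension not in DIMENSIONS or measure not in MEASURES:
        raise ValueError("unknown dimension or measure")
    return (
        f"SELECT {DIMENSIONS[dimension]} AS {dimension}, "
        f"{MEASURES[measure]} AS {measure} "
        f"FROM {TABLE} GROUP BY 1 ORDER BY 1"
    )

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (market TEXT, signed_up_at TEXT, revenue REAL);
INSERT INTO customers VALUES
  ('SF', '2014-01-05', 100), ('SF', '2014-02-01', 50), ('NYC', '2014-01-20', 70);
""")

def ask(dimension, measure):
    """What a business user calls: names in, table of results out."""
    return conn.execute(build_query(dimension, measure)).fetchall()

revenue_by_market = ask("market", "revenue")
print(revenue_by_market)
```

The person asking never sees the SQL, and because only catalog entries are interpolated, arbitrary input cannot end up in the query text.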

Thank you!
