
Ep. #86, 12 Years and 100 Million Customers with Amarilis Campos of Nubank
In episode 86 of o11ycast, Ken Rimple and Jessica Kerr sit down with Amarilis Campos from Nubank. They explore how Brazil’s largest digital bank uses observability to build resilience and autonomy at scale. Amarilis shares how Nubank’s culture of “acting like owners, not renters” and its journey from metrics and logs to tracing with Honeycomb helped the company grow to over 100 million customers.
Amarilis Campos is a Product Manager for Reliability and Observability at Nubank, one of the world’s largest digital banks. Based in Brazil, she leads efforts to help engineers run their applications with confidence, driving tools and culture that enable autonomous, data-driven development. Her work focuses on scaling observability across massive microservice ecosystems while fighting complexity through simplicity and reliability.
transcript
Amarilis Campos: Resilience means to really help every Nubanker run their applications with confidence. That's our mission here. So you see reliability not just as a technical tool, but as how you can really help the teams provide a better experience for your customer thinking, their customer journey itself.
Jessica "Jess" Kerr: Nubank has an interesting mission and architecture. So tell us, what's different about Nubank? What do you do?
Amarilis: The difference about Nubank, I think is how we have the autonomy to build things. So the teams here are very horizontal. In every pack, every squad, every pack, they have very high autonomy to really be developing on their own.
So firstly we work with microservices, so it's not monolith, which is very good for us. So you can build small parts of the application without affecting other parts. And that's what makes Nubank grow very fast in a very sustainable, scalable way.
And that's how we are having now over one hundred million customers in Brazil and we have operations also in Mexico, in Colombia. And this is a company that was created in 2013, so 12 years.
Ken Rimple: Wow, that's a lot of customers.
Amarilis: Yeah, yeah, a lot of customers. So the owners and the co-founders, they also mentioned that they never thought that you would have like over 100 million customers in Brazil in a very short period of time, and we were able to do this because of the culture and the autonomy that you have here.
Jess: So the engineering culture?
Amarilis: Yes, engineering culture, people culture. So everyone can give opinions about something. You don't need to be asking for something to do it. You can just do it and discuss it with people. And if something goes wrong, that's okay, just hold back and redo it or change the way you are doing something. So here, even junior people that just got in Nubank, they can have opinions about something.
We are very vocal about things. We do not just accept what somebody else is saying. We can be the owners of our own story here. So what we say here is that we truly act like owners and not renters, and we always change the status quo. That's the success behind the Nubank.
Change the status quo and act like owners, not renters. And I think that's what everybody does here, since day one.
Jess: Wow. And so your team helps everyone have the confidence to do all of that?
Amarilis: I think so. Yeah. I'm very confident that we really help people to do that. Once they are developing something, deploying new services, they are already able to go through the logs, metrics and now tracing to understand what's going on with their services, but not just with the microservices that they are responsible for, but how the other microservices are affecting their services and how the whole flow has been affected by the way you are deploying new things.
So with tracing and with Honeycomb, we really can understand the whole flow. And they have the end-to-end view of a journey, of a process, which was a bit harder before.
Ken: So what was the journey for you to get to the point where you were able to have tracing happen for you? Was it original metrics and logs? What was that migration? What did it look like for you?
Amarilis: Yeah, people are very used to using metrics and logs here. So when something happens, we have alerts and dashboards, we have some canonical dashboards, but also have some customized dashboards.
But metrics, it's okay, you can understand what's going on there, but you cannot have the whole view. And with logs, it's very hard to really see and understand the whole history behind a specific crash or incident.
So you really need to have traces to have a broader understanding of what is going on in your journey, in your flow. So we have been talking to the users here internally and they are really excited to have a tool that can really help them to understand the whole flow and the whole view.
Actually, during this week I always, as a product manager I have to talk to my users, to my customers. And today I have two calls to talk about their experience with Honeycomb and with traces. And they mentioned that they're really excited to use it and they mentioned how it was harder to do it before we implemented tracing with Honeycomb at Nubank.
So for example, one user mentioned that now it's much easier to understand the bottlenecks in a flow, in the latency and how easy it is to do a query at Honeycomb and how visual the things are, that pop up to them.
So with logs they could do it, but it's much harder. And the way you query, it's not that easy. And now with this experience with tracing with Honeycomb, it's much easier for them to really just follow the steps and boom, the magic happens.
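To make the "bottlenecks in a flow" idea concrete, here is a minimal sketch in Python of how a trace can point at the slowest step. It is not Nubank's or Honeycomb's implementation: the `Span` shape, the span names, and the timings are all hypothetical, and it assumes the children of a span run one after another rather than in parallel.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Span:
    """A simplified, hypothetical span: one timed step in a request."""
    name: str
    span_id: str
    parent_id: Optional[str]
    start_ms: float
    end_ms: float

    @property
    def duration_ms(self) -> float:
        return self.end_ms - self.start_ms


def self_times(spans):
    """Map span name -> time spent in the span itself, excluding direct children.

    Assumes children of the same parent do not overlap in time.
    """
    child_total = {}
    for s in spans:
        if s.parent_id is not None:
            child_total[s.parent_id] = child_total.get(s.parent_id, 0.0) + s.duration_ms
    return {s.name: s.duration_ms - child_total.get(s.span_id, 0.0) for s in spans}


def bottleneck(spans):
    """Return (name, self time in ms) of the span with the largest self time."""
    return max(self_times(spans).items(), key=lambda item: item[1])


# Hypothetical trace: mobile app -> gateway -> service -> database + external provider.
TRACE = [
    Span("mobile-app-request", "a", None, 0.0, 500.0),
    Span("api-gateway", "b", "a", 20.0, 480.0),
    Span("card-service", "c", "b", 40.0, 460.0),
    Span("datomic-query", "d", "c", 60.0, 400.0),
    Span("external-provider", "e", "c", 400.0, 450.0),
]

if __name__ == "__main__":
    name, ms = bottleneck(TRACE)
    print(f"slowest step: {name} ({ms:.0f} ms of self time)")
```

With logs alone, the parent and child relationships between these steps would have to be reconstructed by hand; a trace carries them already, which is why the query becomes a few steps instead of a search.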
Jess: And they see the whole flow of a transaction. And Nubank has particular reasons that the whole flow matters, right? Because you have to be very efficient?
Amarilis: Yes, we have to be very efficient because we have been growing a lot and also we need to attain some regulations. So there's some SLAs with the regulator that we need to be able to respond if something happens.
We have some SLAs that really need to be very restricted to it. And also with tracing and Honeycomb, we could be able to understand the whole flow from app from mobile, from app until the back end going through the database as well. And also understand if there's any gap with an external provider.
Jess: Right. And you're in the financial industry, right? Yes. I mean, I guess it's in your name that you're a bank. Nubank.
Amarilis: Yeah. Yeah.
Jess: But you're not a normal bank.
Amarilis: Yeah, it's not. We say it could be considered a financial technology. We are a technology company. Our goal here is to fight complexity. So yeah, it started to really help make the life of users much better.
So the first product was the credit card. So it was much easier to get a credit card with much less bureaucracy compared to the traditional banks. But Nubank is not just a regular bank. We really exist to fight complexity. It's much more of a money platform. Currently at Nubank, you can schedule trips. I have bought some flight tickets.
Jess: Schedule trips? Wow.
Amarilis: Yeah. Hotels too. So it's not just an application where you leave your money there. Also it's much easier for you to organize your life, financially speaking. You have some money boxes where you can save money over there and you have an objective for each box, a goal for each money box.
But also you can buy a bus ticket, for example, through the marketplace that you have inside the application.
Jess: Wow.
Ken: So you use the lifestyle of the user, you're making it easier for them to go through their day, get things done, organize how they're saving and what they're going to be putting money towards and giving them the ability to do all these things.
We talked in the beginning about microservices, we mentioned databases, but you also mentioned the apps. What are you observing, closest to the user?
How far do you get up in observability? Do you go all the way to like a mobile app and observe what's happening in there or a web browser or is it the services that are called from those. Where do you end up?
Amarilis: I guess a little bit of everything. Like mobile is very important for us. We are really focused to really improve the app experience. So to reduce the latency from the app to go to the microservices and how to see the connection between each service and how the things are connected in the back end.
And also understand all the databases, the Datomic, the latency and how the things are really internally connected. So I would say it's like from app, from mobile up to back end and going through the databases as well.
Jess: Yeah. And the database Datomic is kind of a unique database too. Can you talk about how that's different? And I hear you're instrumenting it?
Amarilis: Yeah, yeah we are instrumenting. We have the common Datomic that you already instrumented at Honeycomb and tracing. And again, this week I was talking to the Datomic team because they're really closing together to how you can. It's not just implementing the tracing, but being really careful about the data quality that you have in the platform.
So how can we really extract the most valuable thing from the platform? And people in Datomic, the users, the Datomic folks, they are extremely excited with the tool. Datomic is very powerful because it can help you on the audit side.
So financially speaking you can really understand the whole flow. You can really handle a massive amount of data. And now the Datomic company, actually it's part of Nubank.
Jess: Oh, nice.
Amarilis: Cognitect. Cognitect.
Ken: Cognitect, yeah. That's the vendor.
Amarilis: Yeah, yeah, that's the vendor. Cognitect is part of Nubank right now.
Jess: Nice.
Amarilis: We have basically two types of teams, let's say, like the Datomic core, the people responsible to really be developing and improving the Datomic, not just for Nubank but for other clients that they have.
And also you have the abstraction team responsible to do some abstraction focus on our internal use case like focus on Nubank. So talking about those two different teams.
And one, for them, and why tracing is very important for them, is talk about the first thing which is the Datomic core, they really use the tracing for developing new features to do new rollouts to really be improving, developing the Datomic itself as a product and how they can leverage Datomic.
So tracing is very helpful to leverage the Datomic capability and understanding how they can really improve as a product. And the abstraction teams, we have been working together to really improve the data quality that is going to be displayed at tracing at Honeycomb.
Jess: Oh, that's cool. So Datomic team and the team around it are implementing OpenTelemetry inside the database layers, it sounds like both so that the Datomic core team can improve Datomic, and so that the other teams at Nubank are getting really good data on how they're using it.
Amarilis: Yeah, yeah, that's correct. And it has been a game changer because the team uses it heavily to debug the distributed system and to optimize the performance. So with tracing, it can be really faster and more powerful than relying on logs and metrics.
So they're really working to make the data quality much better and they're really excited to be working with this because that's the tool that they can really do a faster debugging, understand the performance, and delivering the value much faster.
Jess: Sweet. I heard a talk by Cat Swetel at the Agile Conference this summer. She was talking about how important the performance of a whole transaction flow is at Nubank because part of the business model is being able to serve customers that aren't profitable for other banks.
And they do that by being really efficient. And that's your engineering teams making their microservices really efficient.
Amarilis: Yeah, that's true. Yeah. Currently if you think or if you check for a service isolated like okay, each service has their default alerts or custom alerts or dashboards, but seeing a problem isolated did not really bring value, cannot really improve. And sometimes you waste a lot of time trying to understand where the problem is, and where the issue is.
So my team now, we are thinking about not just providing a better reliability to the services itself, but how it's connected to the whole customer journey.
So if you think about some credit card transactions, or we have here some other type of lending transaction, lending operations or any other type of operations, we need to be able to think how healthy my service is. Not only that, but how my service is connected to the final user experience, thinking someone opened the app and trying to really make a lending operation or really trying to send their money to their parents, to their sister or to another person for them to be able to buy something that's really important.
So reliability plays a very important role here to provide, not a good observability for specific services, but how it's connected to the whole business and customer journey.
Ken: And I wonder if part of that is, going back to the front end again, part of that is you have this thing you're doing and your user is basically taking an activity and trying to complete it.
So from a, let's say a mobile app perspective, what are the kind of challenges that you run into in mobile app observability and how that affects having the front end involved?
How does that affect your visibility into what's going on and what might you do differently based on problems you see, like in one microservice versus another being called by that mobile application?
Amarilis: I would say maybe the massive amount of data, it can be one of the things because as you have a lot of data coming from different places, from app to back, it can be very challenging for us to be handling a very expressive volume of data that grows significantly month over month.
And when we are doing a debug or during a crash, how you can provide a data that really brings value? Because we can just display all the information in the software in the reliability applications, but is that the data that my user and my clients internally really needs to do a good debugging faster and not be wasting time trying to find where the issue is?
I guess for me the challenge is thinking about data governance and data quality and how to handle massive amounts of data and how you can really just display at the screen what's most valuable for the user so they will not be wasting time to trying to find where the problem is. That's the most challenging thing about the data governance and providing a good data quality.
Ken: So going through the whole journey of a user going through the application process, I'm sure one of your goals is to make it as quick and painless as possible, but I know that, especially even in the United States, we have lots of agreements you have to click through and take a look at as you step your way through the process.
So do you use observability right now to kind of track, did it go from step to step, step, step, step by customer and see where people stop and give up? That kind of thing? And then improve it based on that? Are you at that point in your journey yet?
Amarilis: That's something that I'm in the area where I provide capabilities for the teams to do that. We have so many business areas, business units and squads. And we are the area responsible to provide these tools like in observability, logs, metrics, tracings and provide the telemetry signals for people to be really able to build this journey and connect the signal, understand where the issue is.
So I'm not the person that's going to be doing this, but I'm the person responsible to provide the software and all capabilities so each team can have their autonomy to track the whole flow and understand where the bottlenecks are.
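As an illustration of the step-by-step tracking Ken asks about, the following Python sketch counts how many customers reached each step of a flow, given telemetry events. It is an assumption-laden example, not something described in the episode: the step names, the event shape `(customer_id, step)`, and the sample data are all hypothetical.

```python
from collections import Counter

# Hypothetical ordered steps of an onboarding flow.
STEPS = ["open_app", "enter_details", "accept_terms", "account_created"]


def furthest_step(events):
    """Map customer_id -> index of the furthest step that customer reached."""
    furthest = {}
    for customer_id, step in events:
        idx = STEPS.index(step)
        furthest[customer_id] = max(furthest.get(customer_id, -1), idx)
    return furthest


def funnel(events):
    """Return [(step, number of customers who reached it)] in flow order."""
    reached = Counter()
    for idx in furthest_step(events).values():
        # A customer who reached step idx also passed every earlier step.
        for i in range(idx + 1):
            reached[STEPS[i]] += 1
    return [(step, reached[step]) for step in STEPS]


# Hypothetical events, as a team might derive them from its own telemetry.
EVENTS = [
    ("c1", "open_app"), ("c1", "enter_details"), ("c1", "accept_terms"), ("c1", "account_created"),
    ("c2", "open_app"), ("c2", "enter_details"),
    ("c3", "open_app"),
    ("c4", "open_app"), ("c4", "enter_details"), ("c4", "accept_terms"),
]

if __name__ == "__main__":
    for step, count in funnel(EVENTS):
        print(f"{step}: {count}")
```

The split of responsibilities matches what Amarilis describes: a platform team would supply the signals, and each product team would decide which steps matter and where a drop between two counts is worth investigating.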
Jess: You said you've been at Nubank for over three years. Is that how long your team has been a thing?
Amarilis: No, my team has been around a little bit longer than that. We have been going through a lot of changes and we have been creating the scope.
So back in, I would say, maybe 2019 or 2018, there were just two or three people working in reliability and observability at Nubank. And now our team has over 20 people working in the reliability squad.
Jess: Wow.
Amarilis: Yeah.
Jess: Okay, so did the 20 people like work on internal tooling or are they out helping application teams?
Amarilis: Working on internal tooling, basically. That's something that we want to do and we think that's very important to do to be more enablers, working with the teams to help them to make a better observability.
Observability is a culture that sometimes can be really challenging for people to understand the value of it. Because at some point the engineers, they're just focused on deploying new services and deploying new features and deploying new applications. But our role here is really making the engineers understand that it's not just about deploying new features and deploying new services, but how you're going to observe and understand that your services are healthy.
Jess: Yeah. Whether they're working, whether they're being useful.
Amarilis: Yeah. So it's not just, "okay, just deploy a service and I'm not going to see these services anymore."
No. You need to build, run and maintain. And to maintain and to understand what's going on, observability is very important for you.
It's not just about writing a code and not coming back and review it. It's not about that. It's for the teams to really understand how important it is to maintain your code in your application. Your ways of doing this is through observability as well.
Jess: Yeah, that gets back to the "owners, not renters."
Amarilis: Exactly. So you need to act like an owner, not renter. That's correct.
Jess: Yeah.
Amarilis: And that's why we are much more like, we are responsible to provide the tool and make sure that the tool is resilient, is reliable. We have good data.
You know, you have our internal SLAs, SLIs and we have a good performance, good stability, scalability that you need to assure that as a platform owners of observability tools. But the other teams need to be responsible to set up their services and to build their dashboards, set up alerts, provide custom alerts and they should be responsible to understand the whole flow.
Jess: Nice. Okay, so each team does have their own alerts and also the whole flow. That's cool. So is everybody on call for their own code?
Amarilis: Yeah, we have been making good progress in the past month, to be honest. Every time that we create our services, there's some default alerts already. We have some default observability in the services.
So even though the users do not customize their observability, we already provide a basic and default observability for each service, which is great. So each service has default alerts, some default dashboards already and we have already some common libraries already set up in the services.
So even though people are not really keen to improve their observability, we already provide a default observability which makes us very fast to identify some issues and act faster in an incident.
Jess: Nice. So you can always tell when their service is having some problem?
Amarilis: Yes, we can always tell that based on the default setup that we already provide to users. Obviously if you customize, your services would be much better because as an observability and reliability team, I'd not be able to understand the customer and the business needs for each squad.
But as our goal is to help every Nubanker run their applications with confidence, we provide some default observability for each service, which can be enough in a lot of cases. But in other cases, obviously it's important too for each team to customize their own services based on their business needs.
Jess: That's where you get the really high quality data.
Amarilis: Yeah, exactly. Yeah.
Ken: You could find out it blew up, you could find out how slow it is, you could find out all these basic things, what the queries are coming in and such. But to really understand what they're doing, you have to inject that little bit of knowledge that will help you understand what the business need is for the thing you're working on.
Amarilis: Yeah, yeah.
Ken: That's the custom sauce right there, you know.
Jess: Yeah. Because some of them need a credit card number and an account ID and others of them need an airline and a flight number.
Amarilis: Yeah. And that's why the observability culture is very important across the company. And that's how we have been growing a lot as a team.
You know, like we have people of 2, 3, 4 and now you have over 20 people. And that's part of the game as well. Do not just provide the tools, but help them to run their applications and make their applications more reliable.
Jess: And that's necessary to support so much complexity. Because you said earlier that Nubank is a technology company and your goal is to fight complexity. Now, is that complexity for the customer?
Amarilis: Yeah, complexity for the whole population actually. Because you don't need to go to a place to open a new account, for example, you do everything online. You don't have a branch or bank agency across these streets. Everything's remote.
So that's part of fighting complexity and reduce the bureaucracy to even open a new account. We do not need to be talking to someone to open a new account. Do not need to call someone to really check your balance, you can do this through the app.
But for you to be able to access the app in a few steps, you can see you're checking your balances, you need to have a good performance, a good performance application, a good observability, a good reliability. So that's part of the system. Everything is connected.
So fight complexity. Think about the whole population, about the people that really want to make their life easier and access just the application through their phone and here internally.
Also, it's our mission to fight complexity in terms of data, handle a massive volume of data, and how you make this more reliable for our internal users as well.
Jess: And at the same time to reduce complexity for the population, you have to put a lot of that complexity into the software. It has to do a lot of things and have a lot of capabilities. And your team supports that.
Amarilis: Exactly. Yeah, yeah. That's the challenge. It's not something that everything needs to be connected and we are in a part of be very fragmented and be like Observability 1.0. And now you need to move to Observability 2.0 and then be moving to Observability 3.0.
So we need to always be changing, always be improving in the ways that we provide your services. Because yes, we do provide services as well as our reliability area, as our observability area, you should provide services.
And it's not only that, but understand the different use cases that you have internally because it needs to be handled with like fraud team that needs to be running multiple fraud rules to prevent any frauds in your app, but also should be dealing with other types of complexity.
So each use case in each business area has a very high complexity and we have just one tool to attain all those different types of users that we have currently. So that's a lot of work that we need to do here to ensure that Nubank will grow and scale in a very sustainable way.
Jess: Yeah, it sounds like you have a great mission. It also sounds like a great place to work.
Amarilis: It's a great place to work. Yeah. I really like to work here because, you know, as I mentioned earlier, you have a very high autonomy to work. And as we have, one of our slogans is to act like an owner, not a renter. That makes us feel excited about running things on our own.
And think that every day is still the day one. It makes us never relax or calm down like you still need things to improve. Okay, we reach our goal. What's the next step? What's our next goal? What's our next level?
So that's how Nubank works and that's how we have been growing a lot in the past 10 years. We are one of the largest digital banks in the world in a very short period of time and we're reaching over 100 million customers across three countries. That's a lot.
Jess: Yeah. What do you think is the next level? What's Observability 3.0?
Amarilis: Well, AI has been playing a very specific role here. So everybody wants to be AI-first and I feel that everyone's fearing missing out.
And we are also looking to implement some AI here. We are already doing it. We have our logging, so we developed an internal logging tool here and you have some AI assistants already. So we are already helping users on a daily basis to run faster, as I have mentioned here.
AI, I think in terms of observability, can really help us to be more proactive and less reactive and understand the trends. Not only the past trends, but how you can do some anomaly detection faster and how you can predict the future based on the past. AI can really help with that in a much faster way.
So we are working on that. That's the next step. Do some correlations, connect all the different telemetry signals, not be fragmented anymore. Like, how this whole ecosystem can be connected to each other and how AI can help us boost and leverage observability across Nubank, helping users be much more proactive and less reactive.
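The "predict based on the past" idea can be shown with a far simpler technique than the AI Amarilis has in mind. The Python sketch below flags a data point as anomalous when it sits far from the mean of the preceding window, measured in standard deviations (a rolling z-score). The function name, the window size, the threshold, and the latency values are all assumptions chosen for illustration.

```python
from statistics import mean, stdev


def find_anomalies(values, window=5, threshold=3.0):
    """Return indexes whose value is more than `threshold` standard
    deviations away from the mean of the previous `window` values."""
    anomalies = []
    for i in range(window, len(values)):
        history = values[i - window:i]
        mu = mean(history)
        sigma = stdev(history)
        if sigma == 0:
            # A perfectly flat history gives no scale to compare against.
            continue
        if abs(values[i] - mu) / sigma > threshold:
            anomalies.append(i)
    return anomalies


# Hypothetical per-minute latencies in milliseconds, with one spike.
LATENCIES_MS = [100, 102, 98, 101, 99, 100, 103, 97, 250, 101, 100]

if __name__ == "__main__":
    for i in find_anomalies(LATENCIES_MS):
        print(f"minute {i}: {LATENCIES_MS[i]} ms looks anomalous")
```

A check like this only looks at one signal at a time. Correlating it with other telemetry signals, which is the step Amarilis describes next, is what would turn a flagged spike into an explanation.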
Jess: Great.
Ken: I did have a question around, so you have your team of about 20 people, and there's an infinite number of platforms and languages out there in the world. Right? And in terms of OpenTelemetry, it's a smaller Venn diagram, there's a smaller number of languages that support it than the universe.
Are you standardized as a company on a separate set of platforms or is that part of what your team does is when someone, let's say they bring in Python for the first time or they bring in something else for the first time, is that what your platform team does is takes a look and says, "okay, this type of microservice running in this platform needs some observability wrappers around it. So we'll help your team get up and running with that."
And then now you have that as a baseline for the next team that comes along with like a Python based service?
Amarilis: Yeah, that's a good point. That's a good question because that's what we are exactly doing right now. We are very good for standardized and canonical stack, but as we are growing a lot, we need to support different types of language, new M&As, for example, different types of accounts that may be onboarded at Nubank.
So we must be a plug and play platform. So we should be able to onboard any type of language or any type of company in a very short period of time. That's what you're doing right now. We are really good at providing reliability for our standardized stack.
Jess: Is that Clojure?
Amarilis: Yeah.
Ken: I was waiting for that. Yeah, I knew it had to be.
Amarilis: Yeah, Clojure. We have Kubernetes. We are really good at providing that. However, we should not provide only that. We should provide different types of language, different types of technology and that is what we are doing right now.
Ken: So I guess you would have one of you partner with that team that's coming on board and build it up from the beginning with them, be a partner to that team, enable them, and then you can spread the word to others, right?
Amarilis: Yes. And that's exactly what we're doing with one of the teams. The team was responsible for external infrastructure, they have physical data centers and now we are onboarding them.
We should onboard them into our stack. Obviously it's going to take few months or even years to be fully onboarded but our goal is to think about how we can onboard new teams without taking so long and that's an example that I have right now.
That's a different team that they run infrastructure in a different way and they have some physical data centers and you should really be able to onboard them in our current stack.
Ken: Great.
Amarilis: Also Python, we should be able to support Python. We already do but there are some improvements that we need to be doing from our side. So yeah, as our goal is to really be a plug and play for different types of languages and technologies. Not only Clojure or Kubernetes but for any type of technology.
Jess: I guess using OpenTelemetry helps with that.
Amarilis: We have analyzed that, definitely. We have OpenTelemetry for tracing but not for logs and metrics. For Python, we have used OpenTelemetry as well, but for logs no. And for our canonical metrics we do not use OpenTelemetry, but that is something we are currently exploring to see if we can use OpenTelemetry.
Jess: Nice.
Amarilis: Mhm. And should be really helpful for us to be correlated with different types of telemetry signals and it's part of our product strategy.
Jess: Great. Amarilis, how can people learn more?
Amarilis: We have our Nubank website where people can really learn more about Nubank, access the social media or even the website. We have a lot of information about Nubank over there.
Jess: Nice. Nubank is N-u-B-a-n-k. Great. Well thank you so much.
Amarilis: Thank you.
Ken: Yeah, thank you so much.


