By Sooraj Rajmohan
When Gojek was still finding its feet, Fridays used to be a nightmare.
Jakarta, being the capital of the fourth most-populous country in the world, is home to many people who work in the city and travel to their hometowns on weekends. Many of them rely on Gojek as their preferred first mile connectivity option — resulting in a traffic spike on our systems on Friday evening. In those early days, this often triggered a system outage.
Every outage erodes the hard-earned trust we build with our customers and driver partners.
Enter the RCA
We knew we couldn’t fix the failures overnight, but we could learn from them. After all, so many of these mistakes were common and easily overlooked. So we decided to embrace the ‘Root Cause Analysis’ (RCA). If something related to Gojek’s Engineering division failed, the person(s) who attended the support call and had the most context on what happened would prepare a document. This document would contain a timeline, detail what went wrong, suggest corrective measures, and compile lessons learned.
This process ensured everyone across the organisation had visibility into what happened. As a result, even unaffected teams cited in the RCA could analyse their systems to ensure the same problem would not happen to them. More importantly, it provided a degree of accountability — and that’s important when you have 20+ products.
A post-mortem, and a prevention.
This post details what happens when a system fails at Gojek, and how that failure makes its way into an RCA.
1. The What
When something fails, it is important to understand the origin of the problem. Every team in Gojek sets up alerts which monitor the state of their systems. If a state change in the system causes a deviation from expected behaviour, an alerting service called PagerDuty automatically dials the phones of the people responsible for that part of the system.
Here’s an example:
When a booking is created, we find a list of driver partners and send the order details to them, at which point they get a pop-up with trip details like estimated duration and approximate earnings. The idea is to give driver partners enough information to make the decision to accept the trip. But there was a problem.
2. The Why
One of the fields this prompt contains is a Booking ID, which is stored as a 32-bit integer (in technical speak, that means there is a hard upper limit on the value it can hold). Unfortunately for us, a generated ID exceeded this limit.
Welcome to what we call Integer Overflow.
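To make the failure mode concrete, here is a minimal sketch in Python. It assumes the Booking ID was handled as a signed 32-bit integer (the post only says 32 bits), and the function names are hypothetical, not taken from the Gojek codebase. It shows the two ways an oversized ID typically misbehaves: a strict serializer rejects it outright, while a silent wraparound turns it into a negative number.

```python
import struct

# Assumption: a signed 32-bit field. Its largest value is 2**31 - 1.
INT32_MIN = -2**31
INT32_MAX = 2**31 - 1


def fits_in_int32(value: int) -> bool:
    """Return True if value can be stored in a signed 32-bit integer."""
    return INT32_MIN <= value <= INT32_MAX


def wrap_to_int32(value: int) -> int:
    """Simulate what a silent 32-bit wraparound would store instead."""
    return (value + 2**31) % 2**32 - 2**31


def pack_booking_id(booking_id: int) -> bytes:
    """Serialize a hypothetical booking ID as a 4-byte signed integer.

    struct.pack raises struct.error when the value is out of range,
    which is the "loud" version of the same failure.
    """
    return struct.pack(">i", booking_id)


if __name__ == "__main__":
    next_id = INT32_MAX + 1  # the first ID that no longer fits
    print(fits_in_int32(next_id))
    print(wrap_to_int32(next_id))
    try:
        pack_booking_id(next_id)
    except struct.error as exc:
        print(f"cannot serialize booking ID: {exc}")
```

Either outcome is bad news for a client that expects a valid, positive ID, which is why a range check like `fits_in_int32` (or a wider integer type) belongs at the boundary where IDs are generated.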
As a result of this, the driver app started crashing.
How bad can that be, you ask?
3. The Fallout
Drivers being unable to use the app means they can’t accept bookings. This means customers can’t book rides, send packages, get food, or use any service that depends on our driver partners.
As a result, Gojek’s order numbers (or what we call ‘concurrence’, if you want to get all technical about it) plummet. PagerDuty fires, and cell phones start ringing off their proverbial hooks.
With driver partners stuck with malfunctioning apps, multiple Gojek services start reporting errors. As customers try and figure out why the app is behaving this way, engineers scramble to do the same.
4. The Response
When the phones ring, the team whose alerts have been triggered immediately gets to work figuring out what happened. If they identify the problem quickly and debug it, they notify other teams. The team then gets to work using information from the alerts and system dashboards to prepare an RCA.
This is, of course, the best case scenario.
If the concerned team cannot find a fix, however, a war room is called.
The war room signifies a larger issue, and members of every available team drop what they’re doing and join the call. Sometimes, these are frantic Slack discussions and calls in the middle of the night. Other times, office boardrooms are blocked and everyone gathers to brainstorm collectively.
Devs, Team Leads, Product Managers, all hands on deck.
In a war room scenario, whoever has the most context on the situation takes charge and delegates tasks as required. This central person also plays a key role in documenting the happenings in the war room: how many people were present, which teams were represented, who was handling what, etc. All this information plays a key role in the RCA. While this is going down, Driver and Customer Care centres are also notified, bracing for the inevitable flurry of complaints.
The fix may take the form of a few simple temporary hacks or an hours-long war room, but in the end, there is always a fix.
And a sense of camaraderie.
Once the dust settles, the investigation begins. The person who managed the war room generally authors a document analysing what went wrong, using all the info from the alerts, dashboards, and firsthand accounts of the responders present. Typically every stakeholder in the organisation gets an email the next day with details of what went wrong — the RCA.
5. The Learnings
“Collaborate With Compassion”
These three words mean a lot at Gojek, and our RCAs reflect that. When you open an RCA mail, there is rarely even a mention of specific people, except to acknowledge those who responded to the distress call and played a role in finding a fix. Call out the ones who made the effort, never the ones responsible.
Most RCAs instead dwell on relevant, actionable information. Information that was being collected and monitored right from when the alert tripped:
The What: What was the problem?
The Why: Why did it happen?
The Fallout: Which services were affected, and for how long?
The Response: How was it fixed?
The Learnings: What can be done to avoid a repeat of this in future?
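The five headings above map naturally onto a small record type. The sketch below is one possible way to capture them in Python; it is an illustration only, not Gojek's actual RCA template, and the `RCA` class, its field names, and `render` are all hypothetical. Note that it deliberately has no field for who caused the incident.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class RCA:
    """Hypothetical container for the five RCA sections."""
    what: str       # What was the problem?
    why: str        # Why did it happen?
    fallout: str    # Which services were affected, and for how long?
    response: str   # How was it fixed?
    learnings: List[str] = field(default_factory=list)  # How to avoid a repeat

    def render(self) -> str:
        """Format the RCA as plain text, one section per line."""
        lines = [
            f"The What: {self.what}",
            f"The Why: {self.why}",
            f"The Fallout: {self.fallout}",
            f"The Response: {self.response}",
            "The Learnings:",
        ]
        lines.extend(f"- {item}" for item in self.learnings)
        return "\n".join(lines)


if __name__ == "__main__":
    # Filled in from the incident described in this post; the response and
    # learnings are left as placeholders because the post does not state them.
    rca = RCA(
        what="The driver app crashed when showing the booking prompt.",
        why="The Booking ID exceeded the limit of a 32-bit integer.",
        fallout="Driver partners could not accept bookings.",
        response="(describe the fix here)",
        learnings=["(add lessons here)"],
    )
    print(rca.render())
```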
This simple process has helped us scale more safely and efficiently. It also allows for early identification of potential vulnerabilities in other systems. As you may have noticed, there is no mention of who was responsible, no finger pointing, no blame games. Collaborate with compassion.
To get a better sense of how we write RCAs at Gojek, read a sample RCA.
If you’d like to start a culture of RCAs as well, here’s our RCA template, courtesy of GoPay CTO Ranjan Sakalley, who also occasionally drops invaluable insights in the ‘Learnings’ section.
RCAs have played an integral part in our journey to becoming a SuperApp. Investigating, analysing, and documenting problems in production help us build better, more scalable systems, and tackle new problems in a mature manner without fear of retribution.
The days of weekly production issues are now a thing of the past. While we won’t be so brash as to say we never have problems, embracing RCAs and a culture of compassionate collaboration has helped us get to where we are today.
We’ll be writing about more interesting case studies on issues faced in production. Keep an eye on this blog, or subscribe to our newsletter for updates on our stories in a neat little email.