Cutting our crash rate: a Sentry and Crashlytics triage workflow

The crash rate on one app I inherited was bad enough that I won't quote the exact number. The point is we got it to a place we were happy with, and people sometimes assume that means we found one catastrophic bug and squashed it. It almost never works that way. It was process: a boring, repeatable triage loop run every week until the graph went where we wanted.

Here's the loop.

First, agree on the metric

"Crash rate" is ambiguous. The number that matters is crash-free users (and crash-free sessions), not raw crash count. Raw counts mislead - a single user in a reboot loop can generate hundreds of crashes and make things look apocalyptic, while a bug hitting 2% of users quietly might generate fewer events but hurt far more people.

So we tracked crash-free users as the headline, with crash-free sessions alongside. Both tools report these. Pick one as your North Star and stop arguing about counts.
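
To make the difference concrete, here is a minimal sketch of how the three numbers diverge on the same data. The `SessionRecord` shape, the `crashMetrics` helper, and the sample data are all hypothetical; neither Sentry nor Crashlytics exposes data in exactly this form, and both tools compute these rates for you.

```typescript
// Hypothetical session record; neither tool exports data in exactly this shape.
interface SessionRecord {
  userId: string;
  crashed: boolean; // true if this session ended in a crash
}

interface CrashMetrics {
  rawCrashCount: number;
  crashFreeUsers: number; // fraction of users with no crashed session, 0..1
  crashFreeSessions: number; // fraction of sessions that did not crash, 0..1
}

function crashMetrics(sessions: SessionRecord[]): CrashMetrics {
  const allUsers = new Set<string>();
  const crashedUsers = new Set<string>();
  let crashedSessions = 0;

  for (const s of sessions) {
    allUsers.add(s.userId);
    if (s.crashed) {
      crashedUsers.add(s.userId);
      crashedSessions += 1;
    }
  }

  // An empty window counts as fully crash-free rather than dividing by zero.
  const crashFreeUsers =
    allUsers.size === 0 ? 1 : 1 - crashedUsers.size / allUsers.size;
  const crashFreeSessions =
    sessions.length === 0 ? 1 : 1 - crashedSessions / sessions.length;

  return { rawCrashCount: crashedSessions, crashFreeUsers, crashFreeSessions };
}

// Made-up data: 499 healthy users plus one user stuck in a crash loop.
const rebootLoop: SessionRecord[] = [
  ...Array.from({ length: 499 }, (_, i) => ({ userId: `user-${i}`, crashed: false })),
  ...Array.from({ length: 200 }, () => ({ userId: "stuck-user", crashed: true })),
];

// Made-up data: 500 users, one session each, 2% of them crash once.
const quietBug: SessionRecord[] = Array.from({ length: 500 }, (_, i) => ({
  userId: `user-${i}`,
  crashed: i < 10,
}));

const loopMetrics = crashMetrics(rebootLoop); // 200 raw crashes, 1 of 500 users affected
const quietMetrics = crashMetrics(quietBug); // 10 raw crashes, 10 of 500 users affected
```

On raw counts the reboot loop looks twenty times worse; on crash-free users the quiet bug is the one hurting more people, which is the ordering you actually want to act on.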

Two tools, on purpose

We ran both Sentry and Firebase Crashlytics, which sounds redundant but wasn't:

Crashlytics is excellent for native crashes - the hard, platform-level stuff - and its grouping and device breakdowns are strong.
Sentry shines for the JavaScript layer: it captures JS errors with source-mapped stack traces, breadcrumbs of what the user did before the crash, and release tracking. For a React Native app, a lot of your crashes are actually JS errors, and Sentry's breadcrumb trail is what turns "it crashed" into "it crashed right after they tapped checkout with an empty cart."

The combination meant native and JS crashes both had a good home rather than one tool doing both jobs poorly.
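
If breadcrumbs are new to you, the idea can be pictured as a small bounded log of recent user actions that gets attached to the crash report. The sketch below is a conceptual model only, not Sentry's implementation; `BreadcrumbTrail` and its fields are hypothetical names. In a real app you would let the SDK record breadcrumbs for you, or add your own through its `Sentry.addBreadcrumb` call.

```typescript
// Conceptual model of a breadcrumb trail; all names here are hypothetical.
interface Breadcrumb {
  timestampMs: number;
  category: string; // e.g. "ui.tap", "navigation", "http"
  message: string;
}

class BreadcrumbTrail {
  private crumbs: Breadcrumb[] = [];
  private readonly maxCrumbs: number;

  constructor(maxCrumbs: number = 100) {
    this.maxCrumbs = maxCrumbs;
  }

  add(crumb: Breadcrumb): void {
    this.crumbs.push(crumb);
    // Drop the oldest entry once the buffer is over capacity.
    if (this.crumbs.length > this.maxCrumbs) {
      this.crumbs.shift();
    }
  }

  // Crumbs recorded in the `windowMs` leading up to the crash, oldest first.
  before(crashTimeMs: number, windowMs: number): Breadcrumb[] {
    return this.crumbs.filter(
      (c) => c.timestampMs <= crashTimeMs && c.timestampMs >= crashTimeMs - windowMs
    );
  }
}

const trail = new BreadcrumbTrail(50);
trail.add({ timestampMs: 1000, category: "navigation", message: "opened cart" });
trail.add({ timestampMs: 4000, category: "ui.tap", message: "removed last item" });
trail.add({ timestampMs: 9500, category: "ui.tap", message: "tapped checkout" });

// Everything the user did in the 10 seconds before a crash at t = 10000 ms.
const leadUp = trail.before(10000, 10000).map((c) => c.message);
```

Read in order, `leadUp` tells the "tapped checkout with an empty cart" story that a bare stack trace never would.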

The weekly triage loop

Every week, same steps:

Sort by impact, not recency. Order issues by number of affected users. The top of that list is your work. A scary-looking stack trace affecting three people waits; a dull-looking one affecting 4% of users goes first.
Read the breadcrumbs. Before guessing, look at what the user did in the seconds before the crash. Sentry's breadcrumbs turned hours of guessing into minutes more than once.
Check the device and OS spread. A crash that's 100% on one Android version or one manufacturer is a very different bug from one spread evenly. OEM-specific crashes are real and the breakdown points right at them.
Fix, release, verify against the release. Both tools tag issues by app version. After shipping a fix you watch that version's crash-free rate, not the global blended number, to confirm the fix actually worked and didn't just get averaged away.
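
The sorting, device-spread, and per-release steps are mechanical enough to sketch. This is an illustration under assumptions, not how either dashboard works internally: the `Issue` and `ReleaseStats` shapes, the helper names, the 90% threshold, and all the sample numbers are made up.

```typescript
// Hypothetical shapes for data you might export from either tool.
interface Issue {
  id: string;
  affectedUsers: number;
  lastSeenMs: number;
  // Crash events per OS version or manufacturer, e.g. { "Android 12": 95 }.
  eventsBySegment: Record<string, number>;
}

interface ReleaseStats {
  release: string;
  users: number;
  crashedUsers: number;
}

// Sort by impact, not recency: affected users first, recency only breaks ties.
function rankByImpact(issues: Issue[]): Issue[] {
  return [...issues].sort(
    (a, b) => b.affectedUsers - a.affectedUsers || b.lastSeenMs - a.lastSeenMs
  );
}

// Device/OS spread: the segment holding at least `threshold` of the events, if any.
function dominantSegment(issue: Issue, threshold: number = 0.9): string | null {
  const entries = Object.entries(issue.eventsBySegment);
  const total = entries.reduce((sum, [, n]) => sum + n, 0);
  if (total === 0) return null;
  for (const [segment, n] of entries) {
    if (n / total >= threshold) return segment;
  }
  return null;
}

// Verify against the release: crash-free users per version, never blended.
function crashFreeByRelease(stats: ReleaseStats[]): Map<string, number> {
  const rates = new Map<string, number>();
  for (const s of stats) {
    rates.set(s.release, s.users === 0 ? 1 : 1 - s.crashedUsers / s.users);
  }
  return rates;
}

// Made-up issues: a scary native crash hitting 3 users, a dull JS error hitting 400.
const issues: Issue[] = [
  { id: "scary-native", affectedUsers: 3, lastSeenMs: 300,
    eventsBySegment: { "Android 13": 2, "Android 12": 1 } },
  { id: "dull-js", affectedUsers: 400, lastSeenMs: 100,
    eventsBySegment: { "Android 12": 95, "Android 13": 5 } },
];

const queue = rankByImpact(issues).map((i) => i.id); // "dull-js" comes first
const suspect = dominantSegment(issues[1]); // "Android 12" holds 95 of 100 events

// Made-up releases: the fix shipped in 2.4.0, which only 10% of users have so far.
const perRelease = crashFreeByRelease([
  { release: "2.3.0", users: 9000, crashedUsers: 360 },
  { release: "2.4.0", users: 1000, crashedUsers: 5 },
]);
```

With these invented numbers, 2.3.0 sits at 96% crash-free users and 2.4.0 at 99.5%, but the blended figure across both is only about 96.4%. That is the "averaged away" effect: the fix worked, and the global number barely shows it until adoption catches up.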

Source maps, or you're flying blind

The single highest-leverage setup step for the JS side: upload source maps to Sentry on every release. Without them, a JS crash stack trace is minified gibberish - a.b is not a function at index.android.bundle:1:428104. With them, you get the real file, function, and line. We automated the upload as part of the release pipeline so it could never be forgotten, because a release without source maps is a week of crashes you can't read.

```shell
# wired into the release build, not run by hand
sentry-cli releases files "$VERSION" upload-sourcemaps ./build \
  --rewrite
```
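
One way to back up "could never be forgotten" is a guard that fails the build when a bundle has no source map beside it. The sketch below is a hypothetical helper, not part of sentry-cli. It assumes maps are written as `<bundle>.map` next to each bundle and that the listed extensions cover your build output; adjust both if your build names things differently.

```typescript
// Assumed bundle extensions; change these to match your own build output.
const BUNDLE_EXTENSIONS = [".bundle", ".jsbundle", ".js"];

// Returns every bundle file that has no "<file>.map" in the same listing.
function bundlesMissingSourceMaps(files: string[]): string[] {
  const present = new Set(files);
  return files
    .filter((f) => BUNDLE_EXTENSIONS.some((ext) => f.endsWith(ext)))
    .filter((f) => !present.has(`${f}.map`));
}

// Made-up build directory listing: the iOS bundle was built without a map.
const buildFiles = [
  "index.android.bundle",
  "index.android.bundle.map",
  "main.jsbundle",
];

const missing = bundlesMissingSourceMaps(buildFiles); // ["main.jsbundle"]
```

The function only inspects a list of file names, so the pipeline step that calls it is responsible for listing the build directory and exiting non-zero when `missing` is not empty.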

What the fixes actually were

Unglamorous, mostly. A few categories covered the bulk of it:

Unhandled promise rejections in data fetching that crashed the JS thread. Wrapping the right boundaries and handling errors at the network layer killed a whole cluster.
Null access on data that's usually there but sometimes isn't - the API returns an empty list, or a field is missing for a subset of users, and the optimistic code assumes it's present.
A couple of genuinely native, OEM-specific crashes that the device breakdown made obvious and that a library update fixed.
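
The first two categories share a shape: code that assumes the happy path. A minimal sketch of the defensive version is below. The `safely` wrapper, the `Result` type, and the `CartResponse` payload are hypothetical illustrations, not the code from the app in question.

```typescript
// A rejection becomes a value the caller has to handle, not an unhandled rejection.
type Result<T> = { ok: true; value: T } | { ok: false; error: string };

async function safely<T>(load: () => Promise<T>): Promise<Result<T>> {
  try {
    return { ok: true, value: await load() };
  } catch (err) {
    const message = err instanceof Error ? err.message : String(err);
    return { ok: false, error: message };
  }
}

// Hypothetical payload: the list and the price can both be absent for some users.
interface CartResponse {
  items?: Array<{ name: string; priceCents?: number }> | null;
}

// Treats a missing cart, a missing list, and a missing price as zero instead of throwing.
function cartTotalCents(cart: CartResponse | null | undefined): number {
  const items = cart?.items ?? [];
  return items.reduce((sum, item) => sum + (item.priceCents ?? 0), 0);
}

// Usage sketch: `loadCart` stands in for whatever network call the app makes.
async function showCartTotal(loadCart: () => Promise<CartResponse>): Promise<string> {
  const result = await safely(loadCart);
  if (!result.ok) {
    return `Could not load cart: ${result.error}`;
  }
  return `Total: ${cartTotalCents(result.value)} cents`;
}
```

Returning a `Result` pushes the error decision to the call site, where there is enough context to show a retry or an empty state. The optional chaining and defaults in `cartTotalCents` cover the "usually there but sometimes isn't" data without a try/catch around every property access.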

No single fix moved the needle dramatically. Together, across a couple of months of weekly triage, they moved it a lot.

The lesson

Crash rate is a maintenance discipline, not a project. The teams I've seen with great crash-free numbers aren't smarter - they just look every week, sort by impact, fix the top of the list, and verify against the release. The tooling matters (source maps especially), but the real lever is showing up to the boring loop consistently instead of waiting for a fire.
