Unbreaking On-call // Tim's TILs

While we had a great mentality of sharing the burden of on-call duties by requiring everyone to participate, we had some seriously broken on-call hygiene.

An engineer would start an on-call shift, vanish into an abyss of alerts, and limp into work the next day with just enough motivation to wish the next victim the best of luck.

Some alerts were common enough that filters were put in place to silence the pages; others were flat-out ignored. “You can ignore that one” and “just silence those” were common hand-off quotes.

After an important alert was ignored - either intentionally or because it was lost in the noise - we had a conversation on Slack where we tried to figure out what happened.

One of our developers sparked a lot of conversation by saying: “Honestly, I don’t want to learn. I just want to follow instructions. I want to be competent on-call without changing careers.”

Reaction was strong. Anger, resentment, and an increasing divide between those who write applications and those who support them. Then, the real “a-ha!” moment: he’s not wrong.

We were failing one another, repeatedly.

Engineers weren’t owning the stability of their applications. There was an invisible barrier between what was considered “dev work” and “ops work.” The most striking realization was that being on-call had become a burden so great that it was seen as a career unto itself. Worse, ambitious engineers were feeling beaten down; the desire to push a button and go back to sleep was more alluring than fixing the underlying fault or adopting better instrumentation. We no longer measured customer satisfaction by delighting customers with features, but by doing just enough triage to keep errors from reaching them.

What we failed to do for ourselves and each other was to recognize that we were on a treadmill of pain: nothing changes if everything stays the same, and we weren’t in a position to change.

What did we do?

We realized that we had a lot of alerts that weren’t urgent at all. We put filtering in place to surface fewer of these “noise” alerts during evenings. Quality of sleep improved dramatically.
We started analyzing alert data. Patterns and trends are vital. We had a lot of tooling at our disposal, but we hadn’t done a good job of correlating any of it into something useful. Once we did, we could identify root causes more readily and take action with confidence that we’d be eliminating a pain point.
We started a formal bug-fix day where the entire team spent time working through backlogged defect work in place of new features.
We killed off some features. If a feature is causing operational pain, sometimes the best approach is to kill it off until it’s more reliable.
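
As a rough illustration of the first two items, here is a minimal Python sketch. It is not the tooling described above: the `Alert` fields, the 18:00 to 08:00 quiet window, the alert names, and the function names are all assumptions made for the example. It holds back non-urgent alerts during quiet hours and counts which alerts fire most often, which is about the simplest form of trend analysis.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import datetime, time

# Hypothetical quiet-hours window; the post only says "evenings".
QUIET_START = time(18, 0)
QUIET_END = time(8, 0)


@dataclass(frozen=True)
class Alert:
    name: str          # hypothetical alert identifier
    urgent: bool       # whether someone must act on it right away
    fired_at: datetime


def in_quiet_hours(moment: datetime) -> bool:
    """True when the time of day falls in the window, which wraps past midnight."""
    t = moment.time()
    return t >= QUIET_START or t < QUIET_END


def should_page(alert: Alert) -> bool:
    """Urgent alerts always page; non-urgent ones are held during quiet hours."""
    return alert.urgent or not in_quiet_hours(alert.fired_at)


def noisiest(alerts: list[Alert], top: int = 3) -> list[tuple[str, int]]:
    """Count how often each alert name fired, most frequent first."""
    return Counter(a.name for a in alerts).most_common(top)


if __name__ == "__main__":
    history = [
        Alert("disk-usage-warning", False, datetime(2024, 1, 1, 23, 30)),
        Alert("checkout-errors", True, datetime(2024, 1, 1, 23, 45)),
        Alert("disk-usage-warning", False, datetime(2024, 1, 2, 2, 10)),
        Alert("disk-usage-warning", False, datetime(2024, 1, 2, 10, 0)),
    ]
    for alert in history:
        print(alert.name, alert.fired_at, "page" if should_page(alert) else "hold")
    print(noisiest(history))
```

Holding an alert is different from silencing it: a held alert still lands in the history that `noisiest` reads, so the repeat offenders stay visible when it is time to decide what to fix.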

Did it work?

Yes and no. Product owners were unhappy that feature cadence slowed. Developers begrudgingly worked on technical debt from years ago. Ultimately, major bugs were found and eliminated and pager rotations were less painful.

In hindsight, we would have found more success if we had implemented good practices around on-call from the start: alerting, observability, useful metrics, and a healthy dose of humility throughout the entire stack.