Why you should embrace more incidents (seriously!)

We’re all looking for ways to improve our incident response. We investigate various metrics and methodologies—all in the name of making sure our customers see the reliable and performant systems we’ve sought to build.

In fact, all these efforts are leading us, as an industry, to finally recognize the power of surprising, anomalous events in our systems. They give us an opportunity to reexamine our expectations and see how our models of the sociotechnical system differ from reality. We note what isn’t working, and, sometimes equally important, what has worked to keep our systems able to bounce back from failure.

This is all for the better, but it’s not quite enough. We’re going to need more incidents.

Reframing incidents

You read that right—we need more. I imagine you might be confused by this. “Now, Will,” you may be asking yourself, “aren’t we trying to prevent incidents? They typically make our customers pretty unhappy, what with the inability to use our service and all.” 

I understand the skepticism. Management and on-call teams alike sweat a bit at the proposal. Plenty of us have a pretty good grasp on what we mean when we think of an incident, even if we’re a little fuzzy around the edges: We see a major failure mode in our production environment and swoop in to fix things.

But if I can put on my pedant hat for just a moment, let’s reconsider what we mean by “incident.”

We’ve all experienced plenty of events that defy our given assumptions. For example, your incident can have a negative start time, beginning well before anyone declares it. How’s that? Say your team is running their favorite frontend framework. They need a new feature in an updated version of the library, so they deploy it. A week later, it’s disclosed that there’s a vulnerability in the previous version you’ve just upgraded away from. You may even see sketchy requests in your access logs from scanners probing for said vulnerability. Was there an incident?

At this point, you might be thinking: “OK, Will, I’ll humor you for a second—we need more incidents. But are we just shutting off cloud instances and deploying bad code then? We need a better methodology than that.” Well, having more incidents doesn’t necessarily mean creating more incidents. The goal here isn’t to wreak havoc in our systems.

Instead, it’s worth looking at where and when we apply that label. We want to get more comfortable declaring incidents, so that the events we want to focus on are explicitly collected under the “incident” umbrella. So if you find yourself saying, “Well, that one doesn’t count because…,” that might be an indicator there’s something deeper to delve into.

What to analyze

Something else to consider: Are we hesitating to declare incidents? That’s an opportunity to explore why, and often the hesitation is a response to punitive measures attached to certain metrics.

Want to make sure you get that promotion? Hit your target for P1 severities this quarter? Maybe you bump that severity down or handle it out of public channels to obscure it from management. No need to scare folks, after all. 

But what if you choose to elevate these events instead (and ditch arbitrary metrics that become self-defeating), thereby removing the disincentive to sweep things under the rug? Experts become experts when they have a chance to learn from others, especially in the open.

So what good does declaring more incidents do? For one, we get to practice! Coordination during an incident can be challenging, and having more opportunities helps. They don’t have to be P1s, either. Low-risk incidents are a great place to let different people run as incident commander or to challenge existing assumptions outside of game days.

Normalizing incidents can also prove invaluable, even for the most veteran responders. Incidents are stressful enough as is—practice on less impactful events is a fantastic way to develop the confidence you’ll need to run through more trying ones.

We also declare an incident because it’s notable, and putting more eyes on a constructed series of events means we focus on uncovering truths about our system that we might not otherwise surface.

What to do with more incidents

Data is great. It helps inform and instruct, guiding us on how to plan for what’s next—even when we don’t have a full picture of what’s around the bend. So let’s say you’re coming around to the idea: We want to declare more incidents to gather more data on the invisible “socio” parts of our sociotechnical system. But how do we manage all this data? If we’re not ready to work with the data, what can we do with it?

For starters, you don’t have to do a full post-incident review for every failure in the system. One of the reasons we don’t declare incidents more often is that our time and energy are finite. If you’re doing on-call handoffs at the end of a given work week, having that data summarized for you is super useful. There are also plenty of incidents where past bumps that seemed small or trivial at the time wind up informing our actions or our understanding of the situation, the “how we got here.” Data is cheap enough to store these days, and having it at the ready, as mental indexes pointing to notable changes in the system, can help speed up that work.
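
To make that concrete, here’s a minimal sketch of one way to keep those mental indexes at the ready: recording a lightweight note as a Grafana annotation so it shows up alongside your dashboards at the next handoff. It assumes a Grafana instance reachable via a GRAFANA_URL environment variable, a service account token in GRAFANA_TOKEN, and the Python requests library; the tags and text are placeholders for whatever your team actually tracks, not a prescribed format.

    # Sketch: post a point-in-time annotation marking a notable incident or near miss.
    # Assumes GRAFANA_URL (e.g. https://your-stack.grafana.net) and GRAFANA_TOKEN
    # (a service account token) are set in the environment.
    import os
    import time

    import requests

    GRAFANA_URL = os.environ["GRAFANA_URL"]
    GRAFANA_TOKEN = os.environ["GRAFANA_TOKEN"]

    def annotate_incident(text: str, tags: list[str]) -> None:
        """Record a lightweight incident note via Grafana's annotations API."""
        resp = requests.post(
            f"{GRAFANA_URL}/api/annotations",
            headers={"Authorization": f"Bearer {GRAFANA_TOKEN}"},
            json={
                "time": int(time.time() * 1000),  # annotation timestamp, epoch milliseconds
                "tags": tags,                     # e.g. ["incident", "near-miss", "handoff"]
                "text": text,                     # the "how we got here" breadcrumb
            },
            timeout=10,
        )
        resp.raise_for_status()

    if __name__ == "__main__":
        annotate_incident(
            "Near miss: checkout latency spiked during the 14:00 deploy; rollback not needed.",
            ["incident", "near-miss", "handoff"],
        )

Even a quick note like this, dropped in once the dust settles, leaves a searchable marker you can surface during the on-call handoff instead of relying on memory.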

Fire drills, game days, chaos engineering—there are lots of useful exercises that help us plan for “the big one.” These smaller incidents can set up that planning with real-world data.

Unsure where to begin with testing your runbooks? Take a few minutes to review a low-stakes incident and see how the coordination during it fared. Worried about what escalation looks like during major failure modes? Practice escalating up the chain on the frequent small issues that crop up.

Even near misses can be useful, often showcasing parts of our system that feel scary to explore: for example, the code whose inner workings you’re not quite sure of, or haven’t had a chance to model yet. Lots of smaller incidents stay small because experienced folks on the sharp end handle them before they get worse, but the easy ones are fantastic for training up newer or less experienced members of a team.

Putting it into practice

The idea of adding more incidents will likely be a challenging sell. You’re asking your teams to coordinate and plan more, to put themselves in more stressful situations with greater frequency, and to absorb additional context switching. Resilience isn’t free, and its payoff rarely shows up directly, at least at first.

You can help assuage some of those fears through moderate experimentation with the frequency and impact of declared incidents, scoped to notable workloads and bounded time windows. You can also prove out the results with qualitative measures, such as assessing how the team feels about being on call or about how recent incidents were handled.

Wondering where to start? Try these steps:

  • Consider some “easy” ones—incidents that are commonplace and frequent, such as ones where your runbooks already cover the challenging parts.
  • Crunched for time? Take a 10-minute morning standup to level-set with your team on what went well during a near miss that narrowly avoided something much worse.
  • Incorporate a fun fact or learning in your on-call handoffs. Sort through the noise and pick out a notable dashboard or metric that helped inspire a key insight.
  • Declare an incident before your next release. Capture the coordination involved, especially if everything goes smoothly!

Your incident reviews don’t have to be large meetings—make one a lunch and learn, or a 1:1 on your way to an afternoon coffee.

Or you can wait for the next big one and see how it goes.

Interested in learning other ways to improve your incident management? Check out the new unified Grafana Cloud IRM app, which can help you respond to and fix incidents faster by putting alert groups and incidents in a single view in Grafana Cloud.