Incident estimation and tracking
I have a question on how you track incidents and requests that are received Sprint over Sprint.
1. Do you estimate them during planning, considering the history of incidents received by the team?
2. If you estimate, how do you predict complexity and size?
3. If not, which metric do you use to project incident effort, and how do you handle burndown and velocity in such cases?
Incidents and other requests are very different.
Most requests can and should go on the Product Backlog. There, the Product Owner can discuss the request with stakeholders to gain an understanding of it, ensure it aligns with the overall product vision, and order it among the other items on the Product Backlog. The team will be able to refine those items and, when appropriate, pull them into Sprints.
However, incidents, at least as the term is often used, are different. Incidents are quality reductions and service interruptions. Often, incidents can't wait to go through analysis, refinement, and planning. The team will likely need to take immediate action to restore services and return quality to acceptable levels.
Generally, I'm not a fan of estimating. However, if the team feels that estimating helps them plan and execute their Sprints, then it's a practice they can use. Estimation should be part of the refinement process. However, if there's an incident that cannot go through refinement and planning, I don't see the value in estimating it. Waiting only slows down the restoration of the service.
If you are estimating and using burndown and velocity, there are a few ways to handle interruptions like incidents. One is not to reflect the incident or unplanned work on the burndown chart or in velocity at all. Instead, the interruption shows up indirectly: the burndown of planned work stalls, and the team's velocity for the Sprint drops.
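For illustration, here is a minimal sketch (in Python, with invented item data) of that approach: velocity counts only planned items that reached Done, so the cost of unplanned incident work surfaces as a lower number rather than as extra points.

```python
# Hypothetical sketch: compute Sprint velocity from planned, Done items
# only. All names and numbers here are invented for illustration.
sprint_items = [
    {"points": 5, "planned": True,  "done": True},
    {"points": 3, "planned": True,  "done": True},
    {"points": 8, "planned": True,  "done": False},  # displaced by incident work
    {"points": 2, "planned": False, "done": True},   # unplanned incident fix
]

# Only planned work that was finished counts; the incident fix adds
# nothing, and the unfinished item drags the number down.
velocity = sum(i["points"] for i in sprint_items if i["planned"] and i["done"])
print(f"Velocity this Sprint: {velocity}")  # 8, not 10 or 18
```

Over a few Sprints, the gap between this number and the team's usual velocity gives a rough measure of how much capacity incidents are consuming.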
Presumably the incidents you describe are quality-related and show that the work is not in the state people thought it was. Burndowns and velocity measures therefore cannot be trusted at all, since the work is not Done.
In other words, you have technical debt:
- Improve the Definition of Done so defects cannot recur.
- Quantify the technical debt incurred so far on the Product Backlog.
- Come up with a policy of paying the debt off Sprint by Sprint (see the sketch below).
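To make that last point concrete, here is a minimal sketch (in Python, with invented numbers) of what a fixed pay-down policy could look like: the team quantifies the debt on the Product Backlog and reserves a set amount of capacity each Sprint until it's gone.

```python
# Hypothetical pay-down policy: all figures are invented; substitute
# the team's own quantified debt and reserved capacity.
total_debt_points = 40   # technical debt quantified on the Product Backlog
payoff_per_sprint = 5    # capacity the team agrees to reserve each Sprint

remaining = total_debt_points
sprint = 0
while remaining > 0:
    sprint += 1
    paid = min(payoff_per_sprint, remaining)
    remaining -= paid
    print(f"Sprint {sprint}: paid off {paid} points, {remaining} remaining")
```

The exact policy matters less than making it explicit, so progress against the debt stays transparent Sprint after Sprint.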
Hi Ian,
Thanks for your response. However, incidents mostly occur after a production deployment of a new feature, or long after the deployment of an existing feature. As they're not caught pre-deployment, there's no way we can mark the relevant feature as 'Not Done'. As incidents are unpredictable, we'd like clarification on how to estimate and track them as part of the Sprints, especially when the team gets a considerable number of incidents every Sprint. The count of incidents is predictable; however, the issues causing the incidents are unpredictable.
As they're not caught pre-deployment, there's no way we can mark the relevant feature as 'Not Done'.
The work isn't Done, regardless of how you currently mark it. That's what causes the unpredictability you are experiencing in the first place.
Work can appear to be Done and yet you can still incur technical debt, because the Definition of Done subsequently proves to be inadequate. It's best to establish transparency over the matter. Fix the Definition of Done, meet that continuously improving standard, and account for and resolve any newly discovered undone work.
Ian, one cannot have a "done" thing unless you test 100% of everything, everywhere, and good luck with that. There are defects that can fly under the radar no matter how thorough your testing skills and capabilities are.
So it's not a matter of the DoD, but a matter of your approach to risk management, given that such uncertainties exist.
Ian, one cannot have a "done" thing unless you test 100% of everything, everywhere, and good luck with that.
My advice is to challenge and improve the Definition of Done each Sprint, so technical debt is kept under control. The Sprint Retrospective provides a formal opportunity to do this.
There are defects that can fly under the radar no matter how thorough your testing skills and capabilities are.
True, but as @Ian implies, each one of them discovered is an opportunity to inspect the work that you say must be completed before an item can be called "done". Each defect found is an opportunity to inspect and adapt your Definition of Done. Even the ones found during the development process should be evaluated at some level to understand why they happened, and that inspection should lead to some kind of mechanism to prevent recurrence, whether that be a change to coding standards or a specific item on the Definition of Done.
I believe that your statement will always be true, even if we get to the point that AI is writing all of the code. There will ALWAYS be some way for code to be written such that it does not produce the exact results wanted. Is that a reason to just shrug our shoulders and say that it can't be helped? Maybe, but that doesn't mean we should just accept it and allow it to continue.
Is that a reason to just shrug our shoulders and say that it can't be helped?
Actually, it may be so, depending on your approach to risk management. If you go with the "acceptance" approach, then sure, ¯\_(ツ)_/¯, work another day, and maybe fix some bugs later. It is as viable an approach as the other three (avoidance, mitigation, and transfer), depending on your local context.