Estimating incidents

7 replies

03:27 am August 11, 2015

I am the Product Owner of a devops team that is responsible for a fairly complex Oracle application. One of our tasks is to resolve incidents. For the devops team it is hard to estimate the amount of work needed to resolve these incidents. The problem is that before you start, not much is known about what causes the incident. Once it has been analyzed, the resolution is usually no more than changing a few lines of code. Thinking about it, I see essentially two solutions:

- Do all analysis during backlog refinement and accept that this work is not estimated. I think this is not a very good idea.
- Estimate anyway and, if you really have no idea, accept a very inaccurate estimate based on previous experience with incidents.

Can you please share your experience with this problem? What works for you and may be a direction to choose for my team?

Ching-Pei Li

05:13 am August 11, 2015

Hi Rudolf,

Let me restate your question as follows:
There is a Scrum Team in your organization.
You are the Product Owner of the team.
The team is responsible for building and maintaining a complex application built upon Oracle.
Some runtime errors are being raised in the production environment. (Maybe technical debt.)

So,
Are you a Product Owner or a project manager?
Is there a Scrum Master in your Team? What does s/he think about these issues?
What kind of incidents are there? System domain? Application domain?
If these incidents stem from accumulated technical debt, the real problem is how to tackle that debt.

Roughly right is much better than precisely wrong.
Don't worry too much about precisely estimating the cost of tackling these incidents.

I have experience maintaining and upgrading a legacy system. My strategies are:
1. Conduct a series of unit tests and exploratory tests to identify potential technical debt.
2. Avoid accumulating new technical debt.
3. Pay off some of the debt in each Sprint.
4. For ad hoc issues and incidents, the Development Team just adds them to the to-do list in the Sprint Backlog. The Development Team always reserves some capacity to tackle this kind of ad hoc task.

Jitesh Dineschandra

05:31 am August 11, 2015

Hi Rudolf, how volatile is the queue of incidents (in terms of priority)? This may help the team decide whether Scrum is the right framework.

You may want to have a look at Kanban, which is more appropriate in environments where the work and its priorities vary a lot. Scrum is better suited to work that can be batched into timeboxed Sprints and then left alone, and at the moment the team is unable to do that.

In Kanban the team members simply pull the next unit of work from the backlog and proceed with implementing it (it's unit-based as opposed to batch-based). Estimation is optional, although some teams still choose to estimate in order to have more predictability. The focus here is on cycle time rather than velocity. Cycle time is the amount of time it takes for a unit of work to travel through the team's workflow (including any analysis work), from the moment work starts to the moment it ships. By optimizing cycle time, the team can confidently forecast the delivery of future work.
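As a rough sketch of how cycle time can feed a forecast, the snippet below computes cycle times from start and ship dates and reads off a nearest-rank percentile. The dates, the `cycle_times` and `percentile` helpers, and the choice of the 85th percentile are all assumptions made for illustration, not something prescribed by Kanban.

```python
import math
from datetime import date
from statistics import median

# Hypothetical history of finished work items: (work started, shipped).
history = [
    (date(2015, 7, 1), date(2015, 7, 3)),
    (date(2015, 7, 2), date(2015, 7, 9)),
    (date(2015, 7, 6), date(2015, 7, 7)),
    (date(2015, 7, 8), date(2015, 7, 13)),
    (date(2015, 7, 10), date(2015, 7, 14)),
]


def cycle_times(items):
    """Return the cycle time in calendar days for each finished item."""
    return [(shipped - started).days for started, shipped in items]


def percentile(values, pct):
    """Nearest-rank percentile: smallest value covering pct% of the items."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


days = cycle_times(history)
typical = median(days)          # half of the items finished within this time
cautious = percentile(days, 85)  # 85% of the items finished within this time
print(f"Typical: {typical} days, cautious forecast: {cautious} days")
```

A statement such as "85% of our items ship within N days" can then replace an up-front estimate for each individual item.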

Jitesh

Ching-Pei Li

05:52 am August 11, 2015

I agree with Jitesh.

If most of your work is maintenance and support, Kanban is probably the best choice.

If the system is under ongoing development by your team, Scrum plus Kanban-style WIP limits should be fine.
As mentioned above, the Development Team adds the ad hoc tasks to the Sprint Backlog. When team members may pull the next unit of work from the backlog should be governed by the WIP limit.
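To make the WIP rule concrete, here is a minimal sketch of a Sprint Backlog that refuses to hand out new work while the limit is reached. The `Board` class, its method names, and the limit of 2 are hypothetical and only illustrate the idea.

```python
class Board:
    """Toy Sprint Backlog with a work-in-progress (WIP) limit."""

    def __init__(self, wip_limit):
        self.wip_limit = wip_limit
        self.todo = []         # items waiting to be pulled, in priority order
        self.in_progress = []  # items currently being worked on

    def add(self, item):
        """Add an ad hoc task or incident to the to-do list."""
        self.todo.append(item)

    def pull_next(self):
        """Pull the next item, or return None if the WIP limit is reached."""
        if len(self.in_progress) >= self.wip_limit or not self.todo:
            return None
        item = self.todo.pop(0)
        self.in_progress.append(item)
        return item

    def finish(self, item):
        """Mark an item as done, freeing up one WIP slot."""
        self.in_progress.remove(item)


board = Board(wip_limit=2)
for name in ["incident A", "story B", "incident C"]:
    board.add(name)

first = board.pull_next()
second = board.pull_next()
blocked = board.pull_next()  # None: two items are already in progress
board.finish(first)
third = board.pull_next()    # a slot is free again
```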

Ian Mitchell

09:15 pm August 11, 2015

Is a single, coherent Sprint Goal being negotiated each Sprint, and does the goal provide value? Or is each work item essentially independent of the others in terms of the releasable value it provides?

Rudolf Jan Heijink

07:58 am August 13, 2015

Thank you all for your replies. My team is mainly involved in creating new or changed functionality (so I think Kanban is not a good solution). Scrum is fairly new for my company and for my team, including me and the Scrum Master. In the past, incidents did not have a high priority within the organization, for both valid and not so valid reasons, so some incidents remained unsolved for a long period (using workarounds).

For most of my team members the product is a fairly unknown area of business. It is extremely complex because of regulatory requirements, the nature of the product, and the company's culture.

So as a Product Owner I must find a proper balance between creating nice things for the business, increasing team capabilities, and resolving incidents. All three have high business value. People advising the team on estimation have different opinions on the best way to handle this. That's why I am interested in the experience of other companies. At the moment the devops team is experimenting with a way of working in which issues are pulled one by one into the Sprint Backlog, with a timeboxed period of at most 2 days to find out what's wrong. If they cannot locate the root cause of the problem within this period, we discuss how to proceed. The advantage is that this way the work we do is transparent to our stakeholders. The disadvantage is that the devops team is working on issues that are not refined well enough. My gut feeling tells me the best way to proceed is to keep the transparency and accept the uncertainty.

@Ian, yes, in most cases the incidents are not related to a specific Sprint Goal. We are still working on defining proper Sprint Goals. In most Sprints up till now we worked on two or three larger issues and some small ones.

Ian Mitchell

09:03 am August 13, 2015

Scrumban is sometimes used as a compromise, in an attempt to balance incident handling with substantial pieces of new work:

http://agiletv.com/why-stretched-teams-do-scrumban

In the past I have also addressed this class of problem with a split into two teams, a Kanban for incident response and a Small Change Scrum.

Andrzej Zińczuk

03:36 pm February 12, 2016

Why do you need to estimate incidents? In this case, more valuable data may come from history. From an incident-solving perspective it's better to provide the cycle time for a particular type or size of incident rather than focusing on predicting what kind of effort is involved.
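One possible way to get such figures from history is to group resolved incidents by type and summarize their cycle times. The incident types, the numbers, and the `cycle_time_by_type` helper below are made up for illustration; real data would come from the team's own incident tracker.

```python
from collections import defaultdict
from statistics import median

# Hypothetical resolved incidents: (incident type, cycle time in days).
resolved = [
    ("data fix", 1), ("data fix", 2), ("data fix", 1),
    ("interface", 4), ("interface", 6),
    ("batch job", 3), ("batch job", 9), ("batch job", 5),
]


def cycle_time_by_type(incidents):
    """Summarize historical cycle times per incident type."""
    grouped = defaultdict(list)
    for kind, days in incidents:
        grouped[kind].append(days)
    return {
        kind: {"count": len(days), "median": median(days), "max": max(days)}
        for kind, days in grouped.items()
    }


summary = cycle_time_by_type(resolved)
for kind, stats in summary.items():
    print(f"{kind}: median {stats['median']} days, "
          f"worst case {stats['max']} days over {stats['count']} incidents")
```

When a new incident of a known type arrives, the historical median and worst case give stakeholders an expectation without asking the team to guess at effort.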

Bugs whose cause is hard to track down might need a deeper dive (5 Whys?) to check which practices could help speed up analysis. Additional metrics, checkpoints, architectural changes, and so on can be added to the system to make it quicker to find out what is happening.