Scaleout problems with the definition of done

A sprint represents a window of time in which to deliver a defined set of work. The notion of delivery is tied up with the definition of done, the contract between the stakeholders which allows them to scope the work.

What happens when fulfilling your definition of done becomes onerous in the face of a changing reality? We found that scaleout issues made it increasingly difficult to deliver – and to estimate – work.

Much of our definition of done had implicitly been suffixed with “… and the change is delivered to the live environment”. My team’s board actually had an “in review or pipeline” column. Our working assumption for a long time was that the pipeline was reliable and took negligible time.

Over time our ability to deliver through our pipeline degraded. A combination of:

increasing dev head count
addition of features and related tests
underlying transient infra stability problems under load

meant that dealing with issues in the pipeline became increasingly whack-a-mole, with effects like

breaks affected more and more commits
release frequency went from times-per-day to times-per-fortnight
estimating work became impossible
tickets rolled from sprint to sprint, potentially with no work remaining
velocity was affected by multiple external factors

We struggled for too long before realising that a changing reality had broken the baseline assumptions we held about our work. We weren’t able to justify rolling tickets, and became unable to consistently estimate the number of story points we could deliver in a sprint.

In a similar situation, how can you retake control of the process?

One way is to change the definition of done. If we had a QA team, we’d be delivering to them, and any bugs or changes they raised would get new tickets. Treat the pipeline as the QA team and state that a ticket is done on merge. If there are breaks or changes needed because of the pipeline, that’s a new ticket.

The benefit is that you can decouple development work from delivery. Tickets then represent concrete problems with your own code, and your average velocity no longer depends on a wide variety of external factors.
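To make the effect on velocity concrete, here is a minimal sketch that counts the same sprint under both definitions of done. Everything in it is an illustrative assumption rather than anything from our actual tooling: the Ticket fields, the function names and the story point values are all made up.

```python
from dataclasses import dataclass


@dataclass
class Ticket:
    key: str        # hypothetical ticket id
    points: int     # story point estimate
    merged: bool    # change was merged during the sprint
    deployed: bool  # change reached the live environment during the sprint


def done_on_deploy(ticket):
    # Old, implicit definition: "... and the change is delivered to live".
    return ticket.merged and ticket.deployed


def done_on_merge(ticket):
    # Redefined: the pipeline is treated like a QA team, so merge is done.
    return ticket.merged


def velocity(tickets, is_done):
    # Sum the story points of tickets that meet the chosen definition of done.
    return sum(t.points for t in tickets if is_done(t))


# Made-up sprint: two tickets are merged but stuck behind pipeline breaks.
sprint = [
    Ticket("A-1", 3, merged=True, deployed=True),
    Ticket("A-2", 5, merged=True, deployed=False),
    Ticket("A-3", 2, merged=True, deployed=False),
    Ticket("A-4", 3, merged=False, deployed=False),  # genuinely unfinished
]

print(velocity(sprint, done_on_deploy))  # 3
print(velocity(sprint, done_on_merge))   # 10
```

Under the old definition, the two merged tickets roll into the next sprint even though no development work remains on them. Under the new one, only the unfinished ticket rolls, and any pipeline break would be raised as a separate ticket with its own estimate.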