The race to rebase

As the number of devs working on a project grows, so – hopefully – does the number of changes in the pipeline. Each time a change is merged into, or rebased onto, main, everyone else needs to merge it into their own work (or rebase in turn), effectively sending them through the build-and-approve cycle again. This can lead to a race between developers to make sure that their change is the next to get merged. In turn this means wasted time, both in the rebasing itself and in the context switching around it. Worse, it can lead to pathological behaviour – like agreements between devs to go easy on MR reviews just to finally get stuff out.

So what’s going on, and what can we do about it?

We can make a simple model of the situation. At the start of your CI pipeline you have a merge-request step that takes MRs from GitLab, builds them and runs the test suite over them. Only once that merge-request step completes successfully and the MR review is approved can a change be rebased onto the main code branch. When a change gets rebased onto the main branch there are \(N_{0}\) other MRs in flight. At worst, all \(N_{0}\) need to spend time rebasing and running tests locally, getting through the merge-request step and getting re-approved. For the lucky one, that means a single cycle. For the least lucky, it means \(N_{0}\) cycles.

Making some sweeping assumptions – that the time to rebase and the time to get approval are the same for every branch; that the build time is constant and unthrottled; and that builds never break – the total amount of dev time spent comes to

\[\sum_{N=1}^{N_0} N(r + m + a)\]

if \(r\) is the time to rebase locally, \(m\) the time for the merge-request step and \(a\) the time to approve; the factor \(N\) is the number of rebase candidates left, which drops by one after each cycle. Immediately our main concern is whether any of this effort is wasted, but I think the wider effect of repeated rebase misses on dev mindset is also important.
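
To put some entirely made-up numbers on that sum: with \(N_0 = 5\) MRs in flight, \(r = 15\) minutes to rebase and re-run tests locally, \(m = 20\) minutes for the merge-request step and \(a = 10\) minutes to chase and get re-approval,

\[\sum_{N=1}^{5} N(15 + 20 + 10)\,\text{min} = 15 \times 45\,\text{min} = 675\,\text{min},\]

or a little over eleven dev-hours spent to land five changes.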

Is any of this effort wasted? It’s easy for a developer to see the time spent in \(r\) as waste, as it doesn’t move their own MR on. It is, however, a baseline cost in any shared project. Moreover, it’s often better to make changes in many small efforts rather than one large effort: the complexity, and therefore difficulty, of a merge increases with the number of changes it has to absorb (I’m being glib; I’ll have to find out whether this has been quantified anywhere. The general advice is to sync feature branches with main frequently, which in practice usually means once a day). There are exceptions that, in hindsight, represent wasted time – changes that need to be reverted and so on – but even these generally aren’t wasted time at the time.

Time spent in approval \(a\) is slightly different. Re-approval is required after a branch is changed, and the act itself can be trivial if the MR is unaffected by other activity on main. However, this isn’t solely a “hard” technical cost: it’s a social activity that requires sustained attention from at least one developer, and can be delayed or extended because, for example, the original reviewer isn’t available (for an hour, or at all). Arguably approval is only required at the point where a branch is about to be rebased onto the main branch, and time spent on – and spent chasing – intermediate approvals is wasted.

Finally, there’s the time spent in the merge-request step \(m\). In principle this is hands-off time for the developer, as an automated process is running. However, if your merge-request step is mercifully short (say 10 minutes, maybe up to 30) then the developer isn’t left with a lot of time to get into something else before their attention is captured again, either to ensure that their MR gets into the pipeline proper, or to rebase and restart the cycle. For the unluckiest developer that means spending at least \(N_{0}m\) (in whatever units you’re counting) handling repeated interruptions by an automated task whose output is thrown away even though it succeeded.

This is the happy path; we assumed that there was no controversy, no prioritisation, and nothing ever broke, and yet it still leads to frustration, although your developers will get a lot of practice at rebasing and/or merging. Even this path potentially leads, especially for that \(N_{0}m\) developer, to pathological behaviour: developers extend their day or work weekends to hit quiet periods; they team up to make merge reviews easier; they throw their hands up in frustration and walk the dogs.

Gaaaahhhh.

At some stage, \(\sum_{N=1}^{N_0} N(r + m + a)\) exceeds the amount of time you have.
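
As a quick check on how fast that point arrives – using the same made-up \(r\), \(m\) and \(a\) as above – a few lines of Python give the worst-case cost for different numbers of in-flight MRs:

    # Worst-case dev time burned while N0 in-flight MRs race each other to land,
    # using the made-up per-cycle times from above (minutes).
    r, m, a = 15, 20, 10
    cycle = r + m + a

    for n0 in (2, 5, 10, 20):
        total = sum(n * cycle for n in range(1, n0 + 1))  # sum_{N=1}^{N0} N(r+m+a)
        print(f"N0 = {n0:>2}: {total:>5} min  (~{total / 60:.1f} dev-hours)")

The closed form is \(\frac{N_0(N_0+1)}{2}(r+m+a)\): the cost grows with the square of the number of in-flight MRs, so doubling the number of open changes roughly quadruples the time spent racing.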

Ultimately what developers need is a stretch of uninterrupted time to change both their mental model of the code and the code itself. At the same time, they need to manage delivery - we’d like to avoid having pipeline managers whose sole job it is to coordinate this process.

So what can be done?

Short term fixes

The problem is in communicating the state of the pipeline. None of the devs know whether they’ll be the one to get their change in next, so they all have to rebase and go through the cycle at the same time. If that process takes the same amount of time for everyone then at the end of it they each have a 1 in \(N_{0}\) chance of being next. Arguably it’s worth everyone investing the time to rebase at that stage. Bearing in mind that only one of the remaining \(N_{0}-1\) MRs is going to make it through the cycle after that, however, is it possible to reduce the need to waste attention on \(m\) or \(a\)?

People are part of the system. Redundant competition is inefficient. Help people cooperate and communicate.

One way of doing that is queuing. Say all MRs go into a queue at first approval, and the first in the queue is then taken through the merge-request step. Only when that MR is successfully rebased onto main (or the merge-request step fails and it gets dropped from the queue) does the next MR in the queue need to be rebased, and the cycle starts again. If the queue is visible, and the average cycle time is known, then it’s possible to estimate when a given MR will need attention again.
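
As a rough illustration of that estimate – the MR titles, queue positions and average cycle time here are all invented – a sketch of the sort of thing that tells each author how long they can safely ignore the race:

    from dataclasses import dataclass

    @dataclass
    class QueuedMR:
        title: str
        position: int  # 1 = currently in the merge-request step

    # Average time for one MR to go from the front of the queue to main,
    # i.e. roughly r + m + a from the model above (made-up value, in minutes).
    AVG_CYCLE_MIN = 45

    def minutes_until_attention(mr: QueuedMR) -> int:
        """Rough estimate of when this MR's author needs to rebase and re-run:
        everything ahead of it in the queue has to land (or drop out) first."""
        return (mr.position - 1) * AVG_CYCLE_MIN

    queue = [QueuedMR("fix-flaky-login-test", 1),
             QueuedMR("add-audit-logging", 2),
             QueuedMR("refactor-billing", 3)]

    for mr in queue:
        print(f"{mr.title:<25} needs attention in ~{minutes_until_attention(mr)} min")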

Deciding whether you can take a just-in-time approach to managing your delivery and only rebase at the last moment is a complex interplay between the attention required for another task and the complexity of other concurrent MRs. However, a queue does give you some level of choice; it frees the developer from the tyranny of feeling they must react to every new change to main with a rebase, just to stay in the race. It also frees them (and their reviewers) from having to drop out of other activities to re-approve, and it means that MRs can be prioritised, giving developers a framework for talking about better – or required – orderings of changes.

Practically speaking there are a number of ways of implementing this queue:

  • If the problem is infrequent then a relatively ad-hoc agreement on ordering within the interested group of developers is probably enough. Remember, though, that this is most effective if all developers are aware of what’s going on, or are locked out of the merge-request step – as we realised only yesterday. Having one channel to co-ordinate this, and using it, is pretty much essential.
  • If it’s a more frequent occurrence then maybe a more formal queuing step, with some tools for displaying queue state and metrics, is worth developing (there’s a sketch of the metrics side after this list).
  • For some organisations it’s enough of a problem that they have proprietary tools for managing their pipeline entry in this way – but that’s a longer-term solution.
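
If you do build that more formal step, the one number both a queue display and the attention estimate above really need is the average cycle time. A minimal sketch of deriving it from recent merge history – the timestamps here are invented, and in practice you’d pull them from your git host rather than hard-coding them:

    from datetime import datetime, timedelta

    # Invented history: (entered merge-request step, landed on main) per MR.
    history = [
        (datetime(2024, 5, 1, 9, 0),   datetime(2024, 5, 1, 9, 50)),
        (datetime(2024, 5, 1, 10, 5),  datetime(2024, 5, 1, 11, 0)),
        (datetime(2024, 5, 1, 11, 10), datetime(2024, 5, 1, 11, 45)),
    ]

    def average_cycle(history: list[tuple[datetime, datetime]]) -> timedelta:
        """Mean time from entering the merge-request step to landing on main."""
        total = sum((done - start for start, done in history), timedelta())
        return total / len(history)

    print(f"average cycle: {average_cycle(history)}")  # feeds the estimate above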

For me the problem’s just another scaling issue that an organisation will encounter repeatedly as it evolves. The most important thing is that it’s recognised and mitigated by putting a suitable structure in place.

Longer term fixes

You can go all the way and spend your dev time building a merge-request queue that’ll automate this process and present you with attractive metrics about merge throughput – and indeed people have: Shopify is a good example, with hundreds of developers contributing to a monolithic app, but see also similar problems at Microsoft. At that point, however, it’s worth questioning whether this solves the right problem, or whether you have so many developers because changing a monolithic system is such a complex activity.

The fix for ten people failing to ride the same horse to market is not necessarily a bigger horse.

Instead, it’s worth asking whether you can split projects, or solutions, or modules, and their related pipelines, so that internal work is independent. You don’t need to buy into some microservices architecture, but you do want to think about breaking the work up into chunks that a small team can hold in their heads. There will still be times of higher activity that mean some level of coordination is needed, but if the group of people who need to coordinate it is small then it’s manageable in a lightweight way. Communication is both essential and costly, and its cost rises roughly with the square of the number of people involved.

Pragmatically, the race to rebase is a sign: it’s the system telling you that you’ve hit a scaling threshold. It’s important to listen to it, but also important to realise that it’s not something you need to “solve” by implementing a technical solution which then itself requires support and maintenance. Do you want something formal all the time? Or do you just need a collaboration technique to reach for when you see frustration rise again?