Blue-green deployments in legacy batch environments
Reading a recently published article about continuous deployment, it struck me that the kind of large-scale legacy batch app you find in banking presents a few obstacles to the use of blue-green deployment. (As a heads-up, this post is more of an aide-mémoire for me than a how-to description of reliable mechanisms for handling blue-green deployments in large banking apps.)
From the same article, one description of blue-green deployment is:
“To reduce downtime and mitigate risks, organizations can consider what’s known as blue/green deployment. In this method, when a change is made and a new deployment (blue) is triggered, it is deployed in parallel to the old one (green). Both deployments run side by side, and initially, a small amount of traffic is routed to the blue deployment. If it is successful, the rest of the traffic is slowly routed to the blue deployment and the green deployment is gradually removed.”
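In routing terms, the gradual shift described in the quote boils down to a weight that starts near zero and drifts towards one. Here’s a minimal sketch of that idea (the pool names and weight are made up for illustration):

```python
import random

# Hypothetical instance pools for each deployment colour.
GREEN_POOL = ["green-1", "green-2", "green-3"]
BLUE_POOL = ["blue-1"]

# Fraction of traffic routed to the new (blue) deployment.
BLUE_WEIGHT = 0.05

def pick_backend() -> str:
    """Route one request to blue or green according to the current weight."""
    pool = BLUE_POOL if random.random() < BLUE_WEIGHT else GREEN_POOL
    return random.choice(pool)

# Completing the rollout is just a matter of nudging BLUE_WEIGHT towards 1.0
# and then retiring the green pool.
```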
This model is easily understood for a company that provides a web service. In that case you can readily devote a fraction of your compute capacity to handling a small subset of customers on the new version, because the work required for each client is small enough not to need the whole infrastructure. How would it work for a bank that relies on typically somewhat monolithic legacy batch compute?
In such systems releases generally couple providers and consumers: we can’t release to a subset of consumers, and we can’t do a latent release and gradually shift users over to it.
We need to find a way to cut the cost of rolling back a release to zero, and we need to find a way of decoupling users from the release. This is where blue-green deployment helps. The problem is that a monolithic batch system generates large quantities of data for consumption by all users, as opposed to being a set of processes that search and filter large quantities of data for a single user. For a typical bank batch system, generating that data within SLA times requires the whole system. That means there’s a prohibitive cost to rolling back, which tends to generate what’s been called Risk Management Theatre, or extremely strong controls on change.
Blue-green deployment could help, but these constraints mean we can only scale out in whole-system-sized steps to support it. Once you’ve fought the battle to get hold of the hardware in the first place, it’s typically hard to argue that you need the same again just to support a new deployment strategy.
However, just doubling your capacity and releasing a new version alongside the old is the easiest way of handling the problem, and means that a) your users are isolated from the release, and can go back to the old code whenever they want to, and b) you gain the ability to migrate subsets of users over to the new version.
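In a batch world, the equivalent of routing traffic is routing consumers to one environment’s output or the other. A minimal sketch, assuming hypothetical consumer names and output locations, might look like this:

```python
# Hypothetical mapping of consumers to the environment whose batch output
# they should read; anyone not listed stays on the old (green) system.
MIGRATED_CONSUMERS = {
    "risk-reporting": "blue",
    "regulatory-feed": "blue",
}

# Hypothetical output locations for each environment.
OUTPUT_PATHS = {
    "green": "/data/batch/green/latest",
    "blue": "/data/batch/blue/latest",
}

def output_path_for(consumer: str) -> str:
    """Return the batch output location a given consumer should read from."""
    environment = MIGRATED_CONSUMERS.get(consumer, "green")
    return OUTPUT_PATHS[environment]

# Rolling a consumer back is just removing its entry from MIGRATED_CONSUMERS;
# the green output is still being produced alongside the blue one.
print(output_path_for("risk-reporting"))   # /data/batch/blue/latest
print(output_path_for("treasury-desk"))    # /data/batch/green/latest
```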
Do you need to permanently double your capacity, though? That depends on your migration strategy. If you intend to move users over to the new system gradually, or if you’re not in control of that migration, then yes, you might. However, if you’re using blue-green deployment purely to remove the operational risk of rolling back a deployment, then you only need that extra capacity for as long as it takes to be confident that your release has worked. Flexible cloud hosting lets us meet that need at a lower cost, depending on a number of factors including the particular release cadence of your project.
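The shape of that short-lived doubling is roughly: stand up a parallel environment, run the batch in both, compare, cut over or throw the new one away, and release the extra capacity either way. Here’s a minimal sketch, in which every function is a hypothetical stand-in for whatever your platform actually provides (cloud provisioning APIs, schedulers, reconciliation tooling):

```python
def provision_environment(version: str) -> str:
    """Stand up a parallel (blue) environment running the new release."""
    print(f"provisioning blue environment for {version}")
    return f"blue-{version}"

def run_batch(environment: str) -> None:
    """Kick off the overnight batch in the given environment."""
    print(f"running batch in {environment}")

def outputs_match(blue: str, green: str) -> bool:
    """Compare blue and green outputs; real reconciliation logic goes here."""
    return True

def switch_consumers_to(environment: str) -> None:
    """Point downstream consumers at the given environment's output."""
    print(f"consumers now reading from {environment}")

def teardown_environment(environment: str) -> None:
    """Release the temporary capacity once it is no longer needed."""
    print(f"tearing down {environment}")

def blue_green_batch_release(new_version: str) -> None:
    """One release cycle: capacity is doubled only for the verification window."""
    blue = provision_environment(new_version)
    run_batch(blue)
    run_batch("green")  # the existing system keeps producing in parallel
    if outputs_match(blue, "green"):
        switch_consumers_to(blue)
        teardown_environment("green")
    else:
        # Rollback costs nothing: consumers never left green.
        teardown_environment(blue)

if __name__ == "__main__":
    blue_green_batch_release("v2.4.0")
```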
Another option is to break the monolithic app up so that you can deploy multiple versions of the same components on a single system, and use a specific chain of component versions to deliver for a specific client. This assumes a good separation between your underlying infrastructure and your business logic, and needs strong decoupling of your components so that different versions can coexist in the same space. Don’t forget that most monolithic batch systems also need file storage, which might change shape between versions, and often rely on relational databases, where the cost of schema change between versions might be high enough to push you towards supporting multiple versions from the same schema.
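One way to express that decoupling is a registry of component versions plus a per-client chain pinning which version of each component to run. A minimal sketch, with made-up component and client names:

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical registry: each component exists in several versions, all
# deployed side by side on the same system.
COMPONENTS: Dict[str, Dict[str, Callable[[list], list]]] = {
    "enrich": {
        "v1": lambda rows: rows,
        "v2": lambda rows: [r.upper() for r in rows],
    },
    "aggregate": {
        "v1": lambda rows: sorted(rows),
    },
}

# Hypothetical per-client chains: each client pins a specific version of each
# component, so clients can move between versions independently of one another.
CLIENT_CHAINS: Dict[str, List[Tuple[str, str]]] = {
    "client-a": [("enrich", "v1"), ("aggregate", "v1")],
    "client-b": [("enrich", "v2"), ("aggregate", "v1")],  # early adopter
}

def run_chain(client: str, rows: list) -> list:
    """Run the client's pinned chain of component versions over its data."""
    for component, version in CLIENT_CHAINS[client]:
        rows = COMPONENTS[component][version](rows)
    return rows

print(run_chain("client-a", ["b", "a"]))  # ['a', 'b']
print(run_chain("client-b", ["b", "a"]))  # ['A', 'B']
```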