Will Microservices Fix the Rollback?
This morning's release didn't go so well. There was a problem with account opening -- some unintended interaction caused by turning off a feature -- and we had to roll back the release. The problem wasn't in our team's systems; they didn't have an issue (this time), but we will still lose a day of work when we bring people in at midnight, for the second time this week, to push out the release.
Someone I was talking with said, "I wish we were further along at moving to microservices and a componentized architecture, because then we wouldn't have to roll back." I hear this sentiment a lot, but I think it is mistaken, and I'd like to explain why.
Having smaller, independently deployable components has several advantages, and one is that it makes it technically feasible to deploy some components of the architecture and not others. Even with just the level of separation we have today that would be possible -- our team's systems are independent of the ones that had the problem; we certainly could have kept our changes in place while they rolled back theirs. But we didn't, because we didn't know that it would work: we had never tested that combination.
If we were further along our journey to microservices then we would have a complex web of small interacting services and components. And in a situation like this morning's, where we were trying to do a large, coordinated deployment and found that one piece had to be rolled back, we would have only three options: leave the code in place, roll everything back, or go with some particular combination of service and component versions that had never been tested. We run a bank; accuracy and reliability are paramount; for us the better choice is to roll back and try again in a day or so.
Although "microservices" may not be a silver bullet that solves this problem, there is something else that DOES solve it (actually, a combination of two things): versioning and gradual rollout (closely related to blue-green deployment).
Versioning simply means this: any layer that is ever called by another system must avoid doing a sudden cutover to a new version; instead both the old API and the new one must exist side-by-side for at least some short period of time. For REST APIs that could mean providing a version number and keeping the old systems up until all clients stop calling them; for a database it might mean providing a new version of the stored procedures and only retiring the old SPs when all clients have stopped calling them; for the service layer that supports our mobile apps it might mean that changes must be 100% backward compatible (with default values for all new input fields and so forth) until every last customer using an older version of the mobile app has upgraded to the new one.
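To make the REST case concrete, here is a minimal sketch of what "both versions exist side-by-side" looks like. This is not our real service; the route paths and the `open_account_v1`/`open_account_v2` handlers are hypothetical, and the "new field with a default" mirrors the backward-compatibility rule described above.

```python
def open_account_v1(request: dict) -> dict:
    # Old contract: only "name" is required.
    return {"account_id": 1, "owner": request["name"]}

def open_account_v2(request: dict) -> dict:
    # New contract adds "branch", but with a default value so that
    # callers who have not upgraded yet still get a valid response.
    branch = request.get("branch", "main")
    return {"account_id": 1, "owner": request["name"], "branch": branch}

# Both versions stay routable; /v1/accounts is retired only after
# monitoring shows the last client has stopped calling it.
ROUTES = {
    "/v1/accounts": open_account_v1,
    "/v2/accounts": open_account_v2,
}

def handle(path: str, request: dict) -> dict:
    return ROUTES[path](request)
```

The point is that retiring v1 is a separate, later decision driven by observed client traffic, not something bundled into the v2 deployment.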
I use the term "gradual rollout" rather than "blue-green deployment" because the most common practice that people call "blue-green deployment" is not actually good enough. Blue-green deployment means that you keep two copies of the production environment (these days, with the cloud, that may just mean renting the second copy for the few hours of the deployment itself). One runs the current version of your system while you deploy the new version to the other environment and test it. Once you are happy with it, you cut over to the new version and shut down the old.
When I talk about "gradual rollout", I like every portion of that process except the final step where you cut over to the new version. I think you should start by testing out the new servers while they are taking 0% of the traffic, but then you should give them 1% of traffic for some time, then 20%, then eventually ramp it up to 100% of traffic and only after THAT do you shut down the older systems.
The gradual rollout approach has some disadvantages. First, it requires a load balancer capable of directing your traffic as described, and probably of keeping each user session on a single system -- this really MUST be an actual load balancer, not just a DNS-based traffic manager. Second, it mandates that you do versioning as I described above: otherwise it will be impossible to run the old and the new system side-by-side for a time. (But ANY rollout plan that does not involve an outage requires running side-by-side; the only difference is in how long the two versions run together.) The advantages of gradual rollout outweigh these disadvantages, though. First and foremost is the fact that with gradual rollout you will NEVER have a rollback.
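The routing decision itself can be sketched in a few lines. This is an illustration, not a real load balancer: the `route` function below is hypothetical, and it assumes you can bucket users deterministically by hashing a user id. Hashing (rather than picking randomly per request) gives you the session stickiness mentioned above, and it has a nice monotonic property: as the percentage ramps from 1 to 20 to 100, a user who has moved to the new servers never flips back.

```python
import hashlib

def route(user_id: str, new_traffic_pct: int) -> str:
    """Decide whether this user's traffic goes to 'old' or 'new'.

    Each user hashes to a stable bucket in 0..99; users whose bucket
    falls below the current rollout percentage go to the new servers.
    Because the bucket never changes, every request in a user's
    session lands on the same side, and raising the percentage only
    ever moves users from 'old' to 'new'.
    """
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100
    return "new" if bucket < new_traffic_pct else "old"
```

Ramping down works the same way: if a problem surfaces at 10%, setting the percentage back to 0 quietly returns everyone to the old servers, with no uninstall and no rollback.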
If we were using gradual rollout then in a situation like today's we would never have had to roll back the changes. Instead, the worst that could have happened would be that we were forced to delay the transition to the new system. We might have ramped the traffic up to 10% on the new servers before discovering the issue, then ramped it back down to 0% while we worked on a fix, but at no point would we have been forced to uninstall the new code and come back to try it all over again. For that matter, with gradual deployments we would not even have needed to come in during the night to do our deployments -- why bother, when the change does not risk a dramatic customer impact?
So anyway, I hope this explains why I am focusing some of my effort on load-balancing tools for gradual deployments, and not just on microservices.