Lift and Shift Lessons
I've been doing a lot of work lately on moving to the cloud. A couple of years ago we started using AWS, and our first move was a "lift-and-shift": we simply re-created at AWS the same servers we had in our data center. Two years on, we have learned a lot and now do things quite differently. Here are a few of our learnings:
Let me start with what NOT to do: don't just look at each server in the data center and spin up an EC2 instance at AWS to replace it. That will work – it will give us exactly the same behavior we had in our data center – but nothing more. And, to be honest, if we do exactly the same thing in each location, in the long run AWS hosting is probably going to be a little more expensive and roughly the same amount of trouble to maintain as our own data center. What we need is to change our architecture to take advantage of the cloud.
In the data center model, we buy a machine and set it up (which is a big effort), then install the software on it. We regularly update the software with newer versions. Over time, this machine accumulates a series of little changes: we installed version 4.7, then edited that one config file, then upgraded the OS, then installed version 4.8. And if there's another machine or two in the cluster (or copies of the machines in the QA environment), they may not experience the exact same series of changes. Most of the time, none of the behavior is path-dependent, but occasionally something is – you've experienced this any time most of the machines in a cluster were working fine but one wasn't.
The cloud way of thinking says: create yourself a script which sets up the machine and deploys the code. (Some people use just a “push” approach, others “push” some tool like Chef, then have that “pull” the appropriate version of the code. It doesn’t matter much which you choose.) Run this script every time you need a machine. Use the same script for Dev, QA, and Prod (it’s got a few environment-specific parameters). Once the machine is deployed, you don’t mess with it: you just leave it running. This approach eliminates all path dependency: we never re-use a machine, we only “throw them away” when done. It would be impossible if “get a new machine” were a big effort as it is in the traditional data center model, but that doesn’t apply in the cloud. This way, you get reliable, repeatable environments.
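To make the "same script for Dev, QA, and Prod, with a few environment-specific parameters" idea concrete, here is a minimal sketch of the configuration side of such a script. The names (`BASE`, `ENV_OVERRIDES`, `render_config`) and all the values are invented for illustration; in practice this logic would live in whatever provisioning tool you use (Chef, CloudFormation, etc.).

```python
# Every machine in every environment starts from one shared base config;
# each environment declares only the handful of parameters that differ.
BASE = {
    "app_version": "4.8",
    "instance_type": "t3.medium",
    "log_level": "info",
}

ENV_OVERRIDES = {
    "dev":  {"instance_type": "t3.micro", "log_level": "debug"},
    "qa":   {"instance_type": "t3.small"},
    "prod": {"min_servers": 4},
}

def render_config(env: str) -> dict:
    """Produce the full config for one environment, derived fresh every time
    a machine is created -- never edited by hand on a running server."""
    if env not in ENV_OVERRIDES:
        raise ValueError(f"unknown environment: {env}")
    return {**BASE, **ENV_OVERRIDES[env]}
```

Because the config is re-derived from the same source every time a machine is stood up, a new machine can never inherit an undocumented hand-edit – which is exactly what eliminates the path dependency described above.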
Also in the data center model, there are a couple of approaches to deployment. There's the "outage" approach: stop the machine, install the new code, then start it up again. There's the "blue-green" approach: buy twice as many machines as you need. While the app runs on "blue", deploy new code to "green". Shift traffic over (probably all at once, perhaps more gradually); once it's done, you can shut down the "blue" servers. For the next deployment, switch colors.
Cloud deployment takes "blue-green" a step further. Instead of keeping two banks of servers and swapping back and forth, spin up brand-new servers for every deploy. Once the switchover happens (and hopefully it's gradual, not all-at-once), you terminate the old instances. There's no need to own 2x as many servers (except during the brief switchover period), and no need to work on only one deploy at a time (if you happen to need three banks of servers during a busy deploy period, that's just fine).
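The shape of such a deploy can be sketched as a toy simulation. Note these are placeholder functions, not real AWS API calls – `launch_fleet` stands in for whatever your provider's "create instances" operation is, and the traffic plan is just a list of weights a real script would feed to its load balancer.

```python
def launch_fleet(version: str, size: int) -> list[str]:
    """Stand-in for provisioning a brand-new bank of servers for one version."""
    return [f"{version}-{i}" for i in range(size)]

def deploy(old_fleet: list[str], new_version: str, size: int, steps: int = 4):
    """Immutable deploy: fresh fleet, gradual traffic shift, then terminate
    the old fleet. Returns (new fleet, traffic plan, instances to terminate)."""
    new_fleet = launch_fleet(new_version, size)
    # Shift traffic to the new fleet gradually: with steps=4 that's
    # 25%, 50%, 75%, 100% -- never an all-at-once cutover.
    traffic_plan = [step / steps for step in range(1, steps + 1)]
    assert traffic_plan[-1] == 1.0  # old fleet is only retired at zero traffic
    return new_fleet, traffic_plan, list(old_fleet)
```

The key property is that the old instances appear only in the "to terminate" list: they are never patched, re-imaged, or reused, so nothing about the new fleet depends on the old fleet's history.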
In the data center model, we give each team (well, at least the well-funded teams) their own Dev environment which is primed and ready for their use whenever they happen to need it. In the cloud model we give NO ONE a dev environment that sits around in case they need it, but allow ANYONE to run a script that stands up a dev environment on demand. Occasionally a team may need 2 dev environments at once because they are working on two different issues; more often we SAVE servers because teams only create the dev environment when they actually need it. We might even turn them all off at night.
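One natural way to express "anyone can stand up a dev environment on demand, and nothing sits around afterward" is a create/tear-down pair with guaranteed cleanup. This is a sketch with invented names; `stand_up` and `tear_down` are placeholders for the real provisioning script.

```python
from contextlib import contextmanager

LIVE = set()  # stand-in for "what's currently running (and billing) in the cloud"

def stand_up(team: str) -> str:
    """Placeholder for the scripted environment setup described above."""
    env = f"dev-{team}"
    LIVE.add(env)
    return env

def tear_down(env: str) -> None:
    LIVE.discard(env)

@contextmanager
def dev_environment(team: str):
    """A dev environment exists only for as long as someone is using it."""
    env = stand_up(team)
    try:
        yield env
    finally:
        tear_down(env)  # always cleaned up, even if the work inside fails
```

Because tear-down is automatic, the default state is zero idle dev environments – the savings described above fall out of the structure rather than relying on anyone remembering to shut things off.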
Scaling in the data center consisted of determining how many servers you need for peak load and running a cluster of that many. In the cloud, scaling consists of adding servers as load picks up and removing them when it drops.
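The decision at the heart of cloud scaling can be written as a small function: given the current load, how many servers should be running right now? The parameter names and values here are illustrative, not drawn from any real autoscaler.

```python
import math

def desired_servers(current_load: float, capacity_per_server: float,
                    min_servers: int = 2, headroom: float = 1.25) -> int:
    """Size the fleet for current load (not peak load), with some headroom
    so a spike doesn't overwhelm the cluster before new servers come up."""
    needed = math.ceil(current_load * headroom / capacity_per_server)
    return max(min_servers, needed)  # never scale below a small floor
```

The data center model, by contrast, hard-codes this function's answer at its peak-load value and pays for it around the clock.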
Every one of these examples seems to stem from two basic principles: (1) fully scripted setup, and (2) treat servers as disposable. I would say those were my most important learnings. Lots of other things are different-but-equivalent (VPCs vs firewalls; security groups vs access controls) but these seem to change the way people actually work, and we didn’t really get the advantages of the cloud until we started working differently.