Dragons in the Algorithm
Adventures in Programming
by Michael Chermside

Circuit Breakers, Hystrix, And Dealing With Failing Back Ends

When you are writing middleware (be it SOAP services, REST APIs, or something else), an important point to realize is that *Back Ends Fail*. They fail in strange and interesting ways. Code for talking to back ends should always be robust: it should never make a call without a timeout, should always be prepared for the response to be badly formatted, and should always test whether fields are valid before relying on them. (Of course, when one of these things fails it is fine for the middleware service to return an error to the front end calling it -- it just isn't OK for the middleware service to do something like lock up a thread or crash the server.)
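To illustrate that defensive style, here is a small sketch of wrapping a back-end call in a timeout using standard Java concurrency tools (the class and method names are my own invention, and the "back end" is simulated with a sleep -- the point is the shape of the call, not the specific back end):

```java
import java.util.concurrent.*;

public class BackendCall {
    // Hypothetical back-end call, simulated here with a short delay.
    static String fetchFromBackend() throws Exception {
        Thread.sleep(50);
        return "{\"status\":\"ok\"}";
    }

    // Never make a call without a timeout: a hung back end must not
    // lock up the calling thread indefinitely.
    static String callWithTimeout(long timeoutMillis) {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        try {
            Future<String> future = pool.submit(BackendCall::fetchFromBackend);
            String response = future.get(timeoutMillis, TimeUnit.MILLISECONDS);
            // Always validate the response before relying on it.
            if (response == null || !response.contains("status")) {
                return "ERROR: malformed response";
            }
            return response;
        } catch (TimeoutException e) {
            return "ERROR: back end timed out";
        } catch (Exception e) {
            return "ERROR: back end call failed";
        } finally {
            pool.shutdownNow();
        }
    }

    public static void main(String[] args) {
        System.out.println(callWithTimeout(2000));
    }
}
```

Note that every failure mode maps to an error returned to the caller, not a hung thread or an unhandled crash.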

But despite all that defensive programming, sometimes back ends will fail in a way that causes errors. After all, those defensive code paths were probably not covered by your unit tests or test cases, and we all know that untested code sometimes fails. So what happens when the back end goes haywire and somehow starts bringing down your middleware?

Well, what happens is that the production support team springs into action. Situations like this are exactly why we have a team of skilled professionals who carry a beeper and provide 24/7 support for our critical systems. Our monitoring recognizes that a problem has occurred, and the people involved either recognize the problem ("Oh, look: TSYS is acting up again!") or try to rapidly diagnose it ("Quick: check the Oracle connections. It's affecting all the clusters so there's a chance that it's the database."). Once they know the problem they perform rapid triage (which, honestly, is usually just to take down or reboot the affected servers) and call in the Tier-3 support team to identify the root cause and provide a fix. Often these problems are short-lived or intermittent and after a few hours things start working again.

But could we do better? What if there were a way to partially automate the effort that the production support team makes in this case? We can't automate the judgement needed to understand the problem and to provide a fix, but we might be able to automate the process of shutting down the offending parts of the system.

That is exactly what the Circuit Breaker Pattern does. This pattern says to wrap any problematic code (like code that talks to a back end) in some code that manages the connection. It will count the number of errors, and when that exceeds some threshold it will assume that the back end is misbehaving. Then the circuit breaker STOPS TRYING TO CALL THAT CODE. Instead, all calls will immediately return with an error. The circuit can be restored manually (after the production support team decides that things are stable again) or automatically (allow through one attempt every x minutes and restore things if it works), depending on what behavior you desire.
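To make the pattern concrete, here is a minimal, single-threaded sketch of that behavior (the class and names are my own, and a real implementation such as Hystrix adds thread safety, metrics, and configuration on top of this core idea):

```java
import java.util.function.Supplier;

// A minimal circuit-breaker sketch: count failures, trip past a threshold,
// then fail fast until a cool-down period has passed.
public class SimpleCircuitBreaker {
    private final int threshold;          // failures before the circuit trips
    private final long retryAfterMillis;  // cool-down before one trial call
    private int failureCount = 0;
    private long openedAt = 0;
    private boolean open = false;

    public SimpleCircuitBreaker(int threshold, long retryAfterMillis) {
        this.threshold = threshold;
        this.retryAfterMillis = retryAfterMillis;
    }

    public String call(Supplier<String> backend) {
        // While open, return an error immediately instead of calling
        // the back end at all.
        if (open && System.currentTimeMillis() - openedAt < retryAfterMillis) {
            return "ERROR: circuit open, call rejected";
        }
        try {
            // After the cool-down this lets one trial call through.
            String result = backend.get();
            failureCount = 0;   // success restores the circuit
            open = false;
            return result;
        } catch (RuntimeException e) {
            failureCount++;
            if (failureCount >= threshold) {
                open = true;    // trip: stop calling the misbehaving back end
                openedAt = System.currentTimeMillis();
            }
            return "ERROR: back end failed";
        }
    }
}
```

Once the circuit is open, rejected calls return in microseconds rather than waiting out a timeout against a back end that is known to be failing.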

Netflix is a company famous for building software that is rugged and works even in difficult circumstances. These are the folks who invented and deployed Chaos Monkey, an application that literally runs around breaking things in their data center just to keep them on their toes. And they have released a library for implementing the circuit breaker pattern. The Hystrix library is a Java implementation of the pattern, and it has a good number of bells and whistles like the console you can see in the image to the right.

Within the company where I work, we have been using the Hystrix library for some time now. Since its introduction it has proved to be useful and reliable, so we have been expanding its use. I definitely recommend the library for those who want an automated means of recognizing problems and shutting them off quickly (within tens of milliseconds -- far faster than any production support person could possibly react) in order to limit the damage done by misbehaving systems.

Posted Wed 23 July 2014 by mcherm in Programming, Technology