Category Archives: DevOps

The cost of retries – part 1

In the previous post I discussed how bad it is for proxies to retry. In that post I mentioned offhandedly that the proxy retrying was not only going to make your app slower but also more expensive. This is a first look at that problem.

For your convenience I have created another visualization on GitHub.

Retries in a linear system

Imagine you have a perfect system: its response time is constant no matter how heavily you load it, and it never drops anything from its queue. If we hit this system with more requests than it can handle, it keeps processing them at the same speed, but the responses become ever more delayed because requests pile up in the request queue.
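This "perfect system" can be sketched in a few lines. The numbers below (service rate, timeout, spike shape) are illustrative assumptions, not the defaults from the GitHub visualization:

```python
# Minimal sketch of the perfect linear system: constant processing speed,
# unbounded queue, nothing ever dropped. Parameters are illustrative.
SERVICE_RATE = 10      # requests processed per second (constant)
TIMEOUT = 15           # client timeout in seconds

def simulate(load, seconds):
    """load(t) -> requests arriving in second t.
    Returns (queue_length, response_time) per tick."""
    queue = 0
    history = []
    for t in range(seconds):
        queue += load(t)
        queue = max(0, queue - SERVICE_RATE)
        # a request arriving now waits behind everything already queued
        response_time = queue / SERVICE_RATE
        history.append((queue, response_time))
    return history

# A load spike above capacity: queue and response time grow linearly,
# then drain only slowly after the spike ends.
spike = lambda t: 15 if 100 <= t < 200 else 5
history = simulate(spike, 300)
```

During the spike the queue grows by 5 requests per second; after the spike it drains at only 5 per second, which is why the failure window outlasts the overload.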

No retries

In the no-retry case, as we slowly load up the system, the rate of incoming requests eventually exceeds the rate at which they can be processed (at about 1500s with default settings). At this point both the queue length and the response time start increasing. Eventually the response time exceeds the timeout on the client (at about 1800s with default settings). From the client’s perspective, all requests now start failing. Note that this situation persists well past the point where the load has dropped back below what the system can handle (at about 2200s with default settings), because there are still so many – effectively dead – requests stuck in the queue. Only when the queue size drops significantly does the response time fall back below the 15s timeout (at about 2350s with default settings) and requests start succeeding again.

All this is bad, but it is expected. One could argue that in this model there is only a problem if we are running very close to the limit of the ability of the system to handle load and that at that point some failures are expected.

With retries

Of course, when a request fails, the client is likely to retry. In the simplest case we add a single retry. Everything remains the same until the first timeouts occur. At that point the number of requests in the system increases significantly, because for every request more than 15s old a new request is added to the queue. This can be observed in the steep increase of the overall length of the request queue (bold red line). At the same time the average response time for requests also increases (orange line), because there are now even more (still dead) requests in the queue that need to be processed – and dropped by the client – before recovery can happen.

The thin red lines show the individual contributions of the original requests and the retries. As expected, with a single retry these each contribute about half of the overall queued requests. If you look very carefully you will see that the contribution of the original requests initially almost follows the no-retry case, but then, as the retries increase, suddenly rises to about the same level as the retries. This may seem counterintuitive at first, but it is easy to understand once you consider that original requests and retries are indistinguishable in the queue, and the back end processes them as they come in. In other words, for every retry it processes, the back end does not process an original request, and as a result more and more original requests remain in the queue for longer.
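The single-retry dynamic can be sketched by tagging each queued request as original or retry. As before, the parameters and spike shape are illustrative assumptions, not the visualization's defaults:

```python
# Sketch of the single-retry model: every original stuck in the queue past
# the timeout spawns exactly one retry; the back end cannot tell them apart
# and serves the queue FIFO at a constant rate. Parameters are illustrative.
SERVICE_RATE = 10   # requests processed per second
TIMEOUT = 15        # seconds before the client retries (once)

def simulate_with_retry(load, seconds):
    queue = []       # entries: [enqueue_time, is_retry, has_spawned_retry]
    lengths = []
    for t in range(seconds):
        queue.extend([t, False, False] for _ in range(load(t)))
        retries = []
        for entry in queue:
            if not entry[1] and not entry[2] and t - entry[0] > TIMEOUT:
                entry[2] = True
                retries.append([t, True, False])
        queue.extend(retries)
        del queue[:SERVICE_RATE]    # FIFO processing at constant speed
        lengths.append(len(queue))
    return lengths

spike = lambda t: 15 if 100 <= t < 200 else 5
lengths = simulate_with_retry(spike, 400)
```

Once waits cross the timeout, the queue grows faster than arrivals alone would explain, and because the back end serves retries and originals alike, the originals linger correspondingly longer.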

To be continued …

I will leave you to ponder the implications of all this while I go off building a couple more models – specifically (1) a model where the system is running with a perfectly fine amount of headroom but gets hit by a requests spike – be it Black Friday or a network outage – and (2) a more realistic model where the response time is not constant with load to give you a more visceral sense of how much headroom a system really needs.


Proxies must not retry

You live in the cloud. Your app lives in the cloud. Mostly. You’ve decided to add access controls via a simple proxy. Your service is supposed to have “100%” uptime, so of course the proxy has to have “100%” uptime.

So far so good – except that the back end only has 99.9% uptime and your stupid ops people have set up alarms that check service uptime via your proxy. Since you don’t want to get dinged you figure you’ll retry. No alarms, no problem. Right?

Truth is you’ve just made your app slower. Probably a lot slower. And more expensive. And less stable.


Look at the data

Have a look at this picture. This is a test for a proxy that retries after 15s.


Let’s focus on the orange data. You’ll have to trust me when I say there are orange dots under the green dots. What you see is that the retry works really well: the typical response time is about 2s; if that fails we get responses after about 17s (15+2); if that fails, after about 32s (2*15+2); and if that fails, after about 47s (3*15+2). This is great! The proxy works!
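The pattern in the orange dots is just arithmetic, which can be made explicit (the 15s timeout and 2s back-end time are the numbers from the test above):

```python
# Client-observed wait when the proxy retries n times before a call succeeds.
PROXY_TIMEOUT = 15   # seconds before the proxy gives up and retries
BACKEND_TIME = 2     # typical back-end response time in seconds

def client_wait(proxy_retries):
    """Each failed attempt burns a full proxy timeout before the final,
    successful attempt completes."""
    return proxy_retries * PROXY_TIMEOUT + BACKEND_TIME

waits = [client_wait(n) for n in range(4)]   # → [2, 17, 32, 47]
```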

Does it though? What should the client do? Should it wait for 50s? Or should it retry 25 times after 2s in the hope that a single call will take the expected 2s? 10 times after 5s to account for some spread? Exponential backoff?

Based on the orange lines the client should absolutely retry every 3-5s. Of course that will kill your proxy and back end because each of the “timed out” calls will still go through the full proxy/back-end retry cycle. You just DoSed yourself.
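The self-DoS is easy to quantify with a back-of-the-envelope calculation. Taking the 47s worst-case proxy cycle from the plot above (an assumption for illustration), every abandoned call keeps consuming proxy and back-end capacity until its full cycle completes:

```python
import math

PROXY_CYCLE = 47   # worst-case proxy/back-end retry cycle seen above (3*15+2)

def server_side_amplification(client_retry_interval):
    """Overlapping server-side cycles per logical request when the client
    gives up and retries every `client_retry_interval` seconds."""
    return math.ceil(PROXY_CYCLE / client_retry_interval)

server_side_amplification(3)   # → 16 overlapping cycles per logical request
```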

Of course the blue data is more realistic. Under load there is actual spread; some calls really do take up to 15s. So really you want exponential backoff. But even now you are abandoning calls to the retry pattern and DoSing yourself – not as badly, but still.

In both of the above cases your client contains retry code. So why would you also have retry code in your proxy?

I don’t believe you!

Ok. Just for you I have created this cool little toy on GitHub which lets you walk through this step by step. Let’s say your server takes at least 2s to respond and at most 6s. Let’s model this as a Gaussian, because they are pretty:

Bad retry example
The blue line shows the instantaneous probability that your request will be served at this time. The green line is the integrated probability, i.e. the probability that your request will be served by this time. Basically, at 6s it is all but guaranteed that you have received a reply.

So far so good. Now let’s look at the red line and what happens if we retry. If we retry early, we give up on any chance of the old request being fulfilled and start the wait again from the beginning. What this shows very nicely is that for any retry before the point where completion is guaranteed, at 6s, your performance gets worse.
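A quick Monte Carlo version of the toy makes the same point numerically. The mean and standard deviation below are my assumptions for a Gaussian that roughly fits the 2s–6s window, clamped to keep the sketch inside it:

```python
import random

random.seed(0)
MEAN, SD = 4.0, 0.7   # assumed parameters for the 2s..6s Gaussian

def draw():
    # clamp so the toy respects "at least 2s, at most 6s"
    return min(6.0, max(2.0, random.gauss(MEAN, SD)))

def completion(retry_at):
    first = draw()
    if first <= retry_at:
        return first
    # retrying abandons the first request and restarts the wait from zero
    return retry_at + draw()

def mean_completion(retry_at, n=50_000):
    return sum(completion(retry_at) for _ in range(n)) / n

# Any retry before the 6s guarantee point makes the average wait worse:
# mean_completion(3.0) > mean_completion(4.0) > mean_completion(6.0)
```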

How is that different from the client doing the retry? Admittedly, it isn’t – except that the client now has to wait until it is guaranteed that the proxy would have returned!