The problem is that it's not very good at throttling your requests and avoiding errors; you could say it's appallingly bad. This graph is telling:
This is the result of a discrete event simulation (DES) where a client tries to perform 2000 requests (initially generated at a 1000/s rate) against a server that only allows up to `max_busy` = 50 requests at a time. I picked plausible values for the other simulation parameters, but the exact values don't affect the overall behavior. (The connection/request time is 100ms; successful responses are served in 500ms, and error responses in 50ms.) In real life, we do not have the luxury of knowing `max_busy`, and indeed, it will normally not be a fixed value: it can fluctuate over time (network conditions, overall load, busy neighbors…). I chose a round number for easier interpretation, and again, this does not affect the qualitative behavior in any way.
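The setup can be re-created in outline. The following is a minimal sketch under simplifying assumptions of my own (a slot-based server where only successful responses occupy a slot, rejections retried with capped exponential backoff and "full jitter"), not the original DES code:

```python
import heapq
import random

# Simulation parameters from the text; names are my own.
MAX_BUSY = 50        # server capacity (concurrent requests)
N_REQUESTS = 2000    # total requests to perform
GEN_RATE = 1000.0    # initial generation rate (requests/s)
CONN_TIME = 0.100    # connection/request time (s)
OK_TIME = 0.500      # service time of a successful response (s)
ERR_TIME = 0.050     # service time of an error response (s)
BASE_DELAY = 0.050   # initial retry delay (s)
MAX_DELAY = 30.0     # retry delay cap (s)

def simulate(seed=42):
    random.seed(seed)
    # Min-heap of (time, kind, next_retry_delay) events.
    events = []
    for i in range(N_REQUESTS):
        heapq.heappush(events, (i / GEN_RATE, "send", BASE_DELAY))
    busy = attempts = done = 0
    now = 0.0
    while events:
        now, kind, delay = heapq.heappop(events)
        if kind == "send":
            attempts += 1
            if busy < MAX_BUSY:
                # Accepted: server slot occupied until the response is served.
                busy += 1
                heapq.heappush(events, (now + CONN_TIME + OK_TIME, "done", delay))
            else:
                # Rejected: quick error, retry after a jittered, doubled delay.
                wait = CONN_TIME + ERR_TIME + random.uniform(0, delay)
                heapq.heappush(events, (now + wait, "send", min(delay * 2, MAX_DELAY)))
        else:  # "done": success completes, slot freed
            busy -= 1
            done += 1
    return now, attempts, done
```

Note this sketch does not model the server-side cost of producing error responses, which (as discussed below) makes the real picture worse.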
Note how the request rate grows very quickly within the first few seconds. The initial retry delay is 50ms, so after a couple of seconds the client is performing thousands of requests per second, and nearly all of them are retries bound to fail. The rate grows so wildly that a logarithmic scale shows better what's going on:
The logarithmic scale graph shows what the previous one couldn't: within the first few seconds, essentially no requests are completed because the server is swamped with retries. It is only later, after several doublings of the retry periods, that the request rate decreases enough to let the server perform actual work. Although it's not visible in the graph, a few requests complete around the 47s mark: these are "unfortunate" ones whose retry period maxed out at 30s.
Error rate, overhead and extra costs ($)
In the model I used for the simulation, error responses keep the server busy for a brief period of time; this captures the fact that (failed) requests have a performance cost and contribute to server load, or, as we’ve all suffered, the F5 effect (in practice, both the client and the network also have extra work, which makes things even worse). This is why essentially no actual work is completed in the first few seconds until the request rate decreases.
Performance cost is not the only burden: in the public cloud, each request entails a monetary cost, so overhead is not only a technical issue, but also an economic one. Consider how the number of attempts grows with that of operations:
Total attempts have superlinear growth relative to operations, and efficiency (percentage of successful requests vs. total) plummets the moment exponential backoff kicks in. It's under 50% with as few as 100 operations, and although not shown in the graph, it's under 10% when there are 5000 operations. In other words, each operation needs over 10 attempts on average, and costs an order of magnitude more than it should.
This looks bad enough as is, but it'd be even worse without jitter. As I was writing the DES code, before I added max bounds and jitter, retry periods grew so large they overflowed the 64-bit counters used to track time. Requests keep "colliding" (overloading the server) after each successive doubling, in a manner analogous to the thundering herd problem, and retry delays grow exponentially. Error rates (and the number of attempts) also grow, because the server is left with nothing to do most of the time, only to face extreme request rates in successive, short bursts.
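For reference, a capped exponential backoff delay with "full jitter" (one of several jitter schemes; the function name and defaults here are my own) can be computed as:

```python
import random

def backoff_delay(attempt, base=0.05, cap=30.0):
    """Delay before retry number `attempt` (0-based): capped
    exponential growth, drawn uniformly from [0, capped value)
    so that concurrent retries spread out instead of colliding."""
    exp = min(cap, base * (2 ** attempt))
    return random.uniform(0, exp)
```

Without the `random.uniform` call, all clients that failed together retry together, reproducing the collision at every doubling.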
Exponential backoff is a strategy for multiple concurrent clients. We don't always have 1000 clients sending one request each to a server, though; sometimes we instead have 1 client sending 1000 requests (or, more likely, 10 clients/processes sending 100 each). In more practical terms: you don't always have 1000 clients uploading one file each to AWS S3; sometimes you have a handful of them uploading 1000 files or file chunks concurrently (the Sync Appliance does both). Exponential backoff is appropriate for the former but, as shown above, not so much for the latter.
An early documented use of exponential backoff is controlling access to a shared medium, such as the radio spectrum or, in the original 10BASE5 Ethernet protocol, a coaxial cable. This happens in the medium access control (MAC) sublayer of the data link layer. Higher in the TCP/IP stack, the transport layer deals with a problem remote API throttling can be likened to: congestion control.
In TCP, a peer can a priori send up to the receive window (`rwnd`) of non-acknowledged bytes (this is the flow control mechanism), but sends fewer than that depending on the network conditions (the congestion window, `cwnd`). The following parallelism with the remote API problem can be established:
| TCP | Remote API |
| --- | --- |
| network capacity | max request rate supported by the server |
| congestion window (`cwnd`) | max outstanding requests |
| ACK | successful request |
| ACK timeout | error response |
TCP's congestion control algorithm looks basically as follows:

- `cwnd` is initialized to an agreed initial value: 1 to 10 maximum segment sizes (MSS)
- after each acknowledgement (ACK) is received, `cwnd` is increased by 1 MSS (slow start phase) until it reaches the `ssthresh` value, at which point it is increased by `MSS/cwnd` instead (congestion avoidance phase)
- on timeout, `ssthresh` is set to half the current `cwnd`, and `cwnd` is set to the initial value (TCP Tahoe) or to the new `ssthresh` (TCP Reno)
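Schematically, the window updates can be sketched as follows (a simplification in Python, with `cwnd` and `ssthresh` in units of MSS; real TCP implementations differ in details across flavors):

```python
def on_ack(cwnd, ssthresh, mss=1):
    """Window growth per ACK: exponential during slow start,
    ~1 MSS per round trip during congestion avoidance."""
    if cwnd < ssthresh:
        return cwnd + mss                # slow start
    return cwnd + mss * mss / cwnd      # congestion avoidance

def on_timeout(cwnd, ssthresh, initial=1, tahoe=True):
    """Multiplicative decrease on timeout."""
    ssthresh = max(cwnd / 2, 2)
    cwnd = initial if tahoe else ssthresh
    return cwnd, ssthresh
```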
This is how the congestion window behaves with TCP Tahoe:
(Luca Ghio, CC BY-SA 3.0)
Adapting TCP congestion control
Based on the above parallelism, we can come up with a suitable request throttling algorithm. The transposition is obvious, except for a couple of details:

- there's no equivalent to the receive window, so the expression used to update `cwnd` must be adjusted to prevent boundless growth when many requests are sent at a rate below the supported one; otherwise, `cwnd` would keep increasing on every successful response even though the window is never filled
- in TCP, a single ACK accounting for all previously sent bytes is expected, but in the case of individual requests, when a burst of requests is sent and the sustainable rate is exceeded, we will get several error responses in quick succession. This would quickly bring `ssthresh` down to 1 if it were always halved when an error response is obtained. Instead, error responses for outstanding requests must be ignored after we get the first error.
This is how this additive increase/multiplicative decrease algorithm fares, again with 2000 requests performed against a server that sustains at most 100 concurrent requests:
The algorithm keeps the request rate fairly close to, but nearly all the time below, the one that causes errors, leading to few repeated requests, good resource utilization and small overall latency. All requests complete within 25s, which is close to the optimal 20s and a large improvement on the 48s required in the above exponential backoff simulation. The client performs 2085 request attempts overall (i.e., 85 fail), compared to 17392. That's an order-of-magnitude decrease in costs related to the number of requests, in addition to the diminished processing load on the client.
The final algorithm looks as follows:

- use 2 scalar variables, `cwnd` and `ssthresh` (suitable starting values are e.g. 20 and 1024), two sets of request ids, `flying` and `ignored` (both initialized to the empty set), and a queue to hold pending requests
- add new requests to the queue
- every time a new request is queued, or you get a response to a request (both success and error), and as long as the number of outstanding requests (the size of `flying`) is less than the current value of `cwnd`, extract a request from the queue and send it. Add the ids of sent requests to `flying`, and remove them when you get the corresponding responses/errors.
- on successful response, set `cwnd` to `max(cwnd, min(size(flying) + 1, cwnd + 1))` if `size(flying) < ssthresh`, or `max(cwnd, min(size(flying) + 1, cwnd + 1.0/cwnd))` otherwise
- on error response, if the request id is not in `ignored`:
  - set `ssthresh` to `cwnd * 0.5` and `cwnd` to either the initial value ("Tahoe") or the new `ssthresh` ("Reno")
  - discard the current value of `ignored`, replacing it with the current value of `flying` (i.e., you need an immutable/functional structure or a deep copy)

If the allowed rate is supposed to be fairly stable, use a factor closer to 1 when updating `ssthresh` (say, 0.9) to make the request rate stay closer to the maximum on average.
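As a concrete illustration, here is a minimal sketch of those steps in Python. The class name, the `submit`/`on_success`/`on_error` entry points, and the convention of returning the requests the caller should actually send are assumptions of mine; re-queueing failed requests for retry is left to the caller:

```python
from collections import deque

class CongestionThrottle:
    """Sketch of the AIMD request throttler described above;
    not a reference implementation."""

    def __init__(self, initial_cwnd=20, initial_ssthresh=1024, factor=0.5):
        self.initial_cwnd = initial_cwnd
        self.cwnd = float(initial_cwnd)
        self.ssthresh = float(initial_ssthresh)
        self.factor = factor        # 0.5 (halving), or e.g. 0.9 for stable rates
        self.flying = set()         # ids of outstanding requests
        self.ignored = set()        # ids whose error responses we must ignore
        self.queue = deque()        # pending (request_id, payload) pairs

    def submit(self, request_id, payload):
        self.queue.append((request_id, payload))
        return self._pump()

    def _pump(self):
        """Dequeue requests while the window has room; the caller sends them."""
        to_send = []
        while self.queue and len(self.flying) < self.cwnd:
            rid, payload = self.queue.popleft()
            self.flying.add(rid)
            to_send.append((rid, payload))
        return to_send

    def on_success(self, request_id):
        self.flying.discard(request_id)
        self.ignored.discard(request_id)
        # Additive increase: +1 in "slow start", +1/cwnd afterwards, but
        # never grow the window beyond (outstanding + 1), so it can't
        # increase boundlessly while the sender never fills it.
        step = 1.0 if len(self.flying) < self.ssthresh else 1.0 / self.cwnd
        self.cwnd = max(self.cwnd, min(len(self.flying) + 1, self.cwnd + step))
        return self._pump()

    def on_error(self, request_id, tahoe=True):
        self.flying.discard(request_id)
        if request_id not in self.ignored:
            # Multiplicative decrease on the first error of a burst only:
            # errors from requests already in flight are ignored.
            self.ssthresh = max(self.cwnd * self.factor, 1.0)
            self.cwnd = float(self.initial_cwnd) if tahoe else self.ssthresh
            self.ignored = set(self.flying)  # copy, not an alias
        return self._pump()
```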
This is how `cwnd` and `ssthresh` behave in the above simulation with the two factors 0.9 and 0.5:
- exponential backoff is suitable in situations where you have many independent clients sending few requests to the server/cloud service. If you do use it, add jitter: it makes a huge difference if you have request bursts.
- if you have a limited number of clients sending more requests, exponential backoff is a terrible strategy that causes high error rates, lots of repeated requests (and thus costs) and increased latencies
- give a congestion control algorithm like the above a try: it finds and stays fairly close to (but below, nearly all the time) the sustainable rate, adapts quickly to changes, and reduces error responses, extra requests, and both individual and group latency