Exponential backoff is the go-to algorithm for dealing with transient remote API errors in client applications. Indeed, all the major cloud service providers document it (AWS, Google, Microsoft) and incorporate it in their respective SDKs (e.g. AWS’s JavaScript SDK). It is simple to explain (“double the retry delay after each error, limit it to a max value, add some jitter” – more on the latter below), trivial to implement, and it “just works”.
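For concreteness, here is a minimal sketch of the strategy (my own illustration, not any particular SDK’s implementation); the 50ms initial delay and 30s cap match the values used in the simulation below, and “full jitter” means picking a uniformly random delay below the current cap:

```python
import random
import time

def retry_with_backoff(operation, max_retries=8,
                       base_delay=0.05, max_delay=30.0):
    """Retry `operation` with capped exponential backoff and full jitter."""
    for attempt in range(max_retries):
        try:
            return operation()
        except Exception:
            if attempt == max_retries - 1:
                raise
            # Double the delay cap on each attempt, then pick a random
            # delay below it ("full jitter") to spread out retry bursts.
            delay = min(max_delay, base_delay * 2 ** attempt)
            time.sleep(random.uniform(0, delay))
```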
The problem is it’s not very good at throttling your requests and avoiding errors – you could say it’s appallingly bad. This graph is telling:
This is the result of a discrete event simulation (DES) where a client tries to perform 2000 requests (initially generated at a 1000/s rate) against a server that only allows up to `max_busy = 50` requests at a time. I picked plausible values for the other simulation parameters, but the exact values don’t affect the overall behavior. (The connection/request time is 100ms; successful responses are served in 500ms, and error responses in 50ms.) In real life, we do not have the luxury of knowing `max_busy`, and indeed, it will normally not be a fixed value: it can fluctuate over time (network conditions, overall load, busy neighbors…). I chose a round number for easier interpretation, and again, this does not affect the qualitative conclusions in any way.
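To make the setup concrete, here is a much simplified, hypothetical sketch of such a simulation – not the actual DES code behind these graphs. In particular, error responses here do not occupy the server, and the parameters are scaled down so it runs instantly:

```python
import heapq
import random

def simulate(num_requests=200, max_busy=50, rate=1000.0,
             base_delay=0.05, max_delay=30.0,
             success_time=0.5, error_time=0.05, seed=1):
    """Clients retry with capped exponential backoff (full jitter)
    against a server that serves at most `max_busy` requests at once.
    Returns (completion_time, total_attempts)."""
    random.seed(seed)
    events = []  # min-heap of (time, kind, retry_count)
    for i in range(num_requests):
        heapq.heappush(events, (i / rate, "send", 0))
    busy = attempts = done = 0
    now = 0.0
    while done < num_requests:
        now, kind, retries = heapq.heappop(events)
        if kind == "send":
            attempts += 1
            if busy < max_busy:
                busy += 1
                heapq.heappush(events, (now + success_time, "ok", 0))
            else:
                # Server overloaded: error response, retry after a
                # jittered, exponentially growing, capped delay.
                delay = min(max_delay, base_delay * 2 ** retries)
                retry_at = now + error_time + random.uniform(0, delay)
                heapq.heappush(events, (retry_at, "send", retries + 1))
        else:  # "ok": a request completed, freeing a server slot
            busy -= 1
            done += 1
    return now, attempts
```

Even at this reduced scale, the retries inflate the attempt count well beyond the number of operations.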
Request rate
Note how the request rate grows very quickly within the first few seconds. The initial retry delay is 50ms, so after a couple of seconds the client is performing thousands of requests per second, and nearly all of them are retries bound to fail. The rate grows so wildly that a logarithmic scale shows better what’s going on:
The logarithmic scale graph shows what the previous one couldn’t: within the first few seconds, essentially no requests are completed because the server is swamped with retries. It is only later, after several doublings of the retry periods, that the request rate decreases enough to let the server perform actual work. Although it’s not visible in the graph, a few requests are performed around the 47s mark – these are “unfortunate” ones which maxed out the retry period at 30s.
Error rate, overhead and extra costs ($)
In the model I used for the simulation, error responses keep the server busy for a brief period of time; this captures the fact that (failed) requests have a performance cost and contribute to server load, or, as we’ve all suffered, the F5 effect (in practice, both the client and the network also have extra work, which makes things even worse). This is why essentially no actual work is completed in the first few seconds until the request rate decreases.
Performance cost is not the only burden: in the public cloud, each request entails a monetary cost, so overhead is not only a technical issue, but also an economic one. Consider how the number of attempts grows with that of operations:
Total attempts grow superlinearly with the number of operations, and efficiency (percentage of successful requests vs. total) plummets the moment exponential backoff kicks in. It’s under 50% with as few as 100 operations, and although not shown in the graph, it’s under 10% with 5000 operations. In other words, each operation needs over 10 attempts on average, and costs an order of magnitude more than it should.
This looks bad enough as is, but it’d be even worse without jitter. As I was writing the DES code, and before I added max bounds and jitter, retry periods grew so large they overflowed the 64-bit counters used to track time. Requests keep “colliding” (overloading the server) after each successive doubling, in a manner analogous to the thundering herd problem, and retry delays grow exponentially. Error rates (and the number of attempts) also grow because the server is left with nothing to do most of the time, only to face extreme request rates in successive, short bursts.
Problem mismatch
Exponential backoff is a strategy for multiple concurrent clients. We don’t always have 1000 clients sending a request to a server, though. Sometimes we rather have 1 client sending 1000 requests (or, more likely, 10 clients/processes each sending 100). In more practical terms: you don’t always have 1000 clients each uploading a file to AWS S3; sometimes you have a handful of them uploading 1000 files or file chunks concurrently (the Sync Appliance does both). Exponential backoff is appropriate for the former, but, as shown above, not so much for the latter.
Inspiration
An early documented use of exponential backoff is controlling access to a shared medium, such as the radio spectrum or, in the original 10BASE5 Ethernet protocol, a coaxial cable. This happens in the medium access control (MAC) sublayer of the data link layer. Higher in the TCP/IP stack, the transport layer deals with a problem remote API throttling can be likened to: congestion control.
In TCP, a peer can a priori send up to the receive window (`rwnd`) of non-acknowledged bytes (this is the flow control mechanism), but sends fewer than that depending on the network conditions (congestion window, `cwnd`). A parallelism with the remote API problem can be established:
| TCP | Remote API |
| --- | --- |
| network capacity | max request rate supported by the server |
| congestion window (`cwnd`) | max outstanding requests |
| ACK | successful request |
| ACK timeout | error response |
TCP’s congestion control algorithm looks basically as follows:

- `cwnd` is initialized to an agreed initial value: 1 to 10 maximum segment sizes (MSS)
- after each acknowledgement (ACK) is received, `cwnd` is increased by 1 MSS (slow start phase) until it reaches the `ssthresh` value, at which point it is increased by `MSS / cwnd` instead
- on timeout, `ssthresh` is halved and `cwnd` is set to the initial value (TCP Tahoe) or to `ssthresh` (TCP Reno).
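The two update rules can be sketched as follows, with `cwnd` and `ssthresh` measured in units of MSS (a didactic sketch of the textbook algorithm, not a real TCP stack):

```python
def on_ack(cwnd, ssthresh):
    """Grow cwnd after an ACK: slow start below ssthresh, then
    congestion avoidance (roughly +1 MSS per round trip)."""
    if cwnd < ssthresh:
        return cwnd + 1.0          # slow start: +1 MSS per ACK
    return cwnd + 1.0 / cwnd       # congestion avoidance

def on_timeout(cwnd, ssthresh, initial_cwnd=1.0, tahoe=True):
    """On timeout: halve ssthresh; Tahoe restarts cwnd from the
    initial value, Reno restarts it from the new ssthresh."""
    new_ssthresh = cwnd / 2.0
    new_cwnd = initial_cwnd if tahoe else new_ssthresh
    return new_cwnd, new_ssthresh
```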
This is how the congestion window behaves with TCP Tahoe:
(Luca Ghio, CC BY-SA 3.0)
Adapting TCP congestion control
Based on the above parallelism, we can come up with a suitable request throttling algorithm. The transposition is obvious, except for a couple of details:

- there’s no equivalent to the receive window, so the expression used to update `cwnd` must be adjusted to prevent boundless growth when many requests are sent at a rate below the supported one; otherwise `cwnd` would keep increasing by `1.0 / cwnd`
- in TCP, a single ACK accounting for all previously sent bytes is expected, but in the case of individual requests, when a burst of requests is sent and the sustainable rate is exceeded, we will get several error responses in quick succession. This would quickly bring `ssthresh` down to 1 if it were always halved when an error response is obtained. Instead, error responses for outstanding requests must be ignored after we get the first error.
Performance
This is how this additive increase/multiplicative decrease algorithm fares, again with 2000 requests performed against a server that sustains at most 100 concurrent requests:
The algorithm keeps the request rate (fairly) close to, but below, the error-inducing one nearly all the time, leading to few repeated requests, good resource utilization and low overall latency. All requests complete within 25s, which is close to the optimal 20s, and a large improvement on the 48s required in the exponential backoff simulation above. The client performs 2085 request attempts overall (i.e., 85 fail), compared to 17392 with exponential backoff. That’s an order-of-magnitude decrease in request-related costs, in addition to the diminished processing load on the client.
Updated algorithm
The final algorithm looks as follows:

- use 2 scalar variables `cwnd` and `ssthresh` (suitable starting values are e.g. 20 and 1024), two sets of request ids `flying` and `ignored` (both initialized to the empty set), and a queue to hold pending requests
- add new requests to the queue
- every time a new request is queued, or you get a response to a request (both success and error), and as long as the number of outstanding requests (size of `flying`) is less than the current value of `cwnd`, extract a request from the queue and send it. Add the ids of sent requests to `flying`, and remove them when you get the corresponding responses/errors.
- on successful response, set `cwnd` to `max(cwnd, min(size(flying) + 1, cwnd + 1))` if `size(flying) < ssthresh`, or to `max(cwnd, min(size(flying) + 1, cwnd + 1.0/cwnd))` otherwise
- on error response, if the request id is not in `ignored`:
    - set `ssthresh` to `cwnd * 0.5` and `cwnd` to either the initial value (“Tahoe”) or the new `ssthresh` (“Reno”)
    - discard the current value of `ignored`, replacing it with the current value of `flying` (i.e., you need an immutable/functional structure or a deep copy).
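A minimal sketch of these rules in code might look like this (class and method names are my own; `submit`, `on_success` and `on_error` return the ids that should be sent next, and for simplicity `ignored` is rebuilt with a plain copy):

```python
import collections

class AimdThrottler:
    """Additive increase / multiplicative decrease request throttler."""

    def __init__(self, initial_cwnd=20.0, ssthresh=1024.0, tahoe=True):
        self.initial_cwnd = initial_cwnd
        self.cwnd = initial_cwnd
        self.ssthresh = ssthresh
        self.tahoe = tahoe
        self.flying = set()               # ids of outstanding requests
        self.ignored = set()              # error responses to ignore
        self.queue = collections.deque()  # pending request ids

    def submit(self, request_id):
        self.queue.append(request_id)
        return self._pump()

    def _pump(self):
        """Send queued requests while we are under cwnd."""
        to_send = []
        while self.queue and len(self.flying) < self.cwnd:
            rid = self.queue.popleft()
            self.flying.add(rid)
            to_send.append(rid)
        return to_send

    def on_success(self, request_id):
        self.flying.discard(request_id)
        step = 1.0 if len(self.flying) < self.ssthresh else 1.0 / self.cwnd
        # Growth is capped by the number of outstanding requests, so cwnd
        # cannot grow without bound when the request rate is low.
        self.cwnd = max(self.cwnd, min(len(self.flying) + 1, self.cwnd + step))
        return self._pump()

    def on_error(self, request_id):
        self.flying.discard(request_id)
        if request_id not in self.ignored:
            self.ssthresh = self.cwnd * 0.5
            self.cwnd = self.initial_cwnd if self.tahoe else self.ssthresh
            # Ignore errors from requests already in flight: one burst
            # of failures triggers only one backoff step.
            self.ignored = set(self.flying)
        return self._pump()
```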
If the allowed rate is supposed to be fairly stable, use a factor closer to 1 when updating `ssthresh` (say, 0.9) to make the request rate stay closer to the maximum on average.
This is how `cwnd` and `ssthresh` behave in the above simulation with the two factors 0.9 and 0.5:
Take-home message
- exponential backoff is suitable in situations where you have many independent clients sending few requests to the server/cloud service. If you do use it, add jitter: it makes a huge difference if you have request bursts.
- if you have a limited number of clients sending more requests, exponential backoff is a terrible strategy that causes high error rates, lots of repeated requests (and thus costs) and increased latencies.
- give a congestion control algorithm like the above a try: it finds and stays fairly close to (but below, nearly all the time) the sustainable rate, adapts quickly to changes, and reduces error responses, extra requests, and both individual and group latency.