The problem is that it's not very good at throttling your requests and avoiding errors; you could say it's appallingly bad. This graph is telling:
This is the result of a discrete event simulation (DES) where a client tries to perform 2000 requests (initially generated at a 1000/s rate) against a server that only allows up to `max_busy` = 50 requests at a time. I picked plausible values for the other simulation parameters, but the exact values don't affect the overall behavior. (The connection/request time is 100ms; successful responses are served in 500ms, and error responses in 50ms.) In real life, we do not have the luxury of knowing `max_busy`, and indeed, it will normally not be a fixed value: it can fluctuate over time (network conditions, overall load, busy neighbors…). I chose a round number for easier interpretation, and again, this does not affect the qualitative behavior in any way.
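The setup can be re-created in outline. The following is a minimal sketch under simplifying assumptions of my own (a slot-based server where only successful responses occupy a slot, rejections retried with capped exponential backoff and "full jitter"), not the original DES code:

```python
import heapq
import random

# Simulation parameters from the text; names are my own.
MAX_BUSY = 50        # server capacity (concurrent requests)
N_REQUESTS = 2000    # total requests to perform
GEN_RATE = 1000.0    # initial generation rate (requests/s)
CONN_TIME = 0.100    # connection/request time (s)
OK_TIME = 0.500      # service time of a successful response (s)
ERR_TIME = 0.050     # service time of an error response (s)
BASE_DELAY = 0.050   # initial retry delay (s)
MAX_DELAY = 30.0     # retry delay cap (s)

def simulate(seed=42):
    random.seed(seed)
    # Min-heap of (time, kind, next_retry_delay) events.
    events = []
    for i in range(N_REQUESTS):
        heapq.heappush(events, (i / GEN_RATE, "send", BASE_DELAY))
    busy = attempts = done = 0
    now = 0.0
    while events:
        now, kind, delay = heapq.heappop(events)
        if kind == "send":
            attempts += 1
            if busy < MAX_BUSY:
                # Accepted: server slot occupied until the response is served.
                busy += 1
                heapq.heappush(events, (now + CONN_TIME + OK_TIME, "done", delay))
            else:
                # Rejected: quick error, retry after a jittered, doubled delay.
                wait = CONN_TIME + ERR_TIME + random.uniform(0, delay)
                heapq.heappush(events, (now + wait, "send", min(delay * 2, MAX_DELAY)))
        else:  # "done": success completes, slot freed
            busy -= 1
            done += 1
    return now, attempts, done
```

Note this sketch does not model the server-side cost of producing error responses, which (as discussed below) makes the real picture worse.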
Note how the request rate grows very quickly within the first few seconds. The initial retry delay is 50ms, so after a couple of seconds the client is performing thousands of requests per second, and nearly all of them are retries bound to fail. The rate grows so wildly that a logarithmic scale shows better what's going on:
The logarithmic scale graph shows what the previous one couldn't: within the first few seconds, essentially no requests are completed because the server is swamped with retries. It is only later, after several doublings of the retry periods, that the request rate decreases enough to let the server perform actual work. Although it's not visible in the graph, a few requests complete around the 47s mark: these are "unfortunate" ones whose retry period maxed out at 30s.
Error rate, overhead and extra costs ($)
In the model I used for the simulation, error responses keep the server busy for a brief period of time; this captures the fact that (failed) requests have a performance cost and contribute to server load, or, as we’ve all suffered, the F5 effect (in practice, both the client and the network also have extra work, which makes things even worse). This is why essentially no actual work is completed in the first few seconds until the request rate decreases.
Performance cost is not the only burden: in the public cloud, each request entails a monetary cost, so overhead is not only a technical issue, but also an economic one. Consider how the number of attempts grows with that of operations:
Total attempts have superlinear growth relative to operations, and efficiency (percentage of successful requests vs. total) plummets the moment exponential backoff kicks in. It's under 50% with as few as 100 operations, and although not shown in the graph, it's under 10% when there are 5000 operations. In other words, each operation needs over 10 attempts on average, and costs an order of magnitude more than it should.
This looks bad enough as is, but it'd be even worse without jitter. As I was writing the DES code, before I added max bounds and jitter, retry periods grew so large they overflowed the 64-bit counters used to track time. Requests keep "colliding" (overloading the server) after each successive doubling, in a manner analogous to the thundering herd problem, and retry delays grow exponentially. Error rates (and the number of attempts) also grow, because the server is left with nothing to do most of the time, only to face extreme request rates in successive, short bursts.
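For reference, a capped exponential backoff delay with "full jitter" (one of several jitter schemes; the function name and defaults here are my own) can be computed as:

```python
import random

def backoff_delay(attempt, base=0.05, cap=30.0):
    """Delay before retry number `attempt` (0-based): capped
    exponential growth, drawn uniformly from [0, capped value)
    so that concurrent retries spread out instead of colliding."""
    exp = min(cap, base * (2 ** attempt))
    return random.uniform(0, exp)
```

Without the `random.uniform` call, all clients that failed together retry together, reproducing the collision at every doubling.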
Exponential backoff is a strategy for multiple concurrent clients. We don't always have 1000 clients sending one request each to a server, though; sometimes we instead have 1 client sending 1000 requests (or, more likely, 10 clients/processes sending 100 each). In more practical terms: you don't always have 1000 clients uploading one file each to AWS S3; sometimes you have a handful of them uploading 1000 files or file chunks concurrently (the Sync Appliance does both). Exponential backoff is appropriate for the former but, as shown above, not so much for the latter.
An early documented use of exponential backoff is controlling access to a shared medium, such as the radio spectrum or, in the original 10BASE5 Ethernet protocol, a coaxial cable. This happens in the medium access control (MAC) sublayer of the data link layer. Higher in the TCP/IP stack, the transport layer deals with a problem remote API throttling can be likened to: congestion control.
In TCP, a peer can a priori send up to the receive window (`rwnd`) of non-acknowledged bytes (this is the flow control mechanism), but sends fewer than that depending on the network conditions (the congestion window, `cwnd`). The following parallelism with the remote API problem can be established:
| TCP | Remote API |
| --- | --- |
| network capacity | max request rate supported by the server |
| congestion window (`cwnd`) | max outstanding requests |
| ACK | successful request |
| ACK timeout | error response |
TCP's congestion control algorithm looks basically as follows:

- `cwnd` is initialized to an agreed initial value: 1 to 10 maximum segment sizes (MSS)
- after each acknowledgement (ACK) is received, `cwnd` is increased by 1 MSS (slow start phase) until it reaches the `ssthresh` value, at which point it is increased by `MSS/cwnd` instead (congestion avoidance phase)
- on timeout, `ssthresh` is set to half the current `cwnd`, and `cwnd` is set to the initial value (TCP Tahoe) or to the new `ssthresh` (TCP Reno)
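Schematically, the window updates can be sketched as follows (a simplification in Python, with `cwnd` and `ssthresh` in units of MSS; real TCP implementations differ in details across flavors):

```python
def on_ack(cwnd, ssthresh, mss=1):
    """Window growth per ACK: exponential during slow start,
    ~1 MSS per round trip during congestion avoidance."""
    if cwnd < ssthresh:
        return cwnd + mss                # slow start
    return cwnd + mss * mss / cwnd      # congestion avoidance

def on_timeout(cwnd, ssthresh, initial=1, tahoe=True):
    """Multiplicative decrease on timeout."""
    ssthresh = max(cwnd / 2, 2)
    cwnd = initial if tahoe else ssthresh
    return cwnd, ssthresh
```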
This is how the congestion window behaves with TCP Tahoe:
(Luca Ghio, CC BY-SA 3.0)
Adapting TCP congestion control
Based on the above parallelism, we can come up with a suitable request throttling algorithm. The transposition is obvious, except for a couple of details:

- there's no equivalent to the receive window, so the expression used to update `cwnd` must be adjusted to prevent boundless growth when many requests are sent at a rate below the supported one; otherwise, `cwnd` would keep increasing on every successful response even though the window is never filled
- in TCP, a single ACK accounting for all previously sent bytes is expected, but in the case of individual requests, when a burst of requests is sent and the sustainable rate is exceeded, we will get several error responses in quick succession. This would quickly bring `ssthresh` down to 1 if it were always halved when an error response is obtained. Instead, error responses for outstanding requests must be ignored after we get the first error.
This is how this additive increase/multiplicative decrease algorithm fares, again with 2000 requests performed against a server that sustains at most 100 concurrent requests:
The algorithm keeps the request rate fairly close to, but nearly all the time below, the one that causes errors, leading to few repeated requests, good resource utilization and small overall latency. All requests complete within 25s, which is close to the optimal 20s and a large improvement on the 48s required in the above exponential backoff simulation. The client performs 2085 request attempts overall (i.e., 85 fail), compared to 17392. That's an order-of-magnitude decrease in costs related to the number of requests, in addition to the diminished processing load on the client.
The final algorithm looks as follows:

- use 2 scalar variables, `cwnd` and `ssthresh` (suitable starting values are e.g. 20 and 1024), two sets of request ids, `flying` and `ignored` (both initialized to the empty set), and a queue to hold pending requests
- add new requests to the queue
- every time a new request is queued, or you get a response to a request (both success and error), and as long as the number of outstanding requests (the size of `flying`) is less than the current value of `cwnd`, extract a request from the queue and send it. Add the ids of sent requests to `flying`, and remove them when you get the corresponding responses/errors.
- on successful response, set `cwnd` to `max(cwnd, min(size(flying) + 1, cwnd + 1))` if `size(flying) < ssthresh`, or `max(cwnd, min(size(flying) + 1, cwnd + 1.0/cwnd))` otherwise
- on error response, if the request id is not in `ignored`:
  - set `ssthresh` to `cwnd * 0.5` and `cwnd` to either the initial value ("Tahoe") or the new `ssthresh` ("Reno")
  - discard the current value of `ignored`, replacing it with the current value of `flying` (i.e., you need an immutable/functional structure or a deep copy)

If the allowed rate is supposed to be fairly stable, use a factor closer to 1 when updating `ssthresh` (say, 0.9) to make the request rate stay closer to the maximum on average.
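As a concrete illustration, here is a minimal sketch of those steps in Python. The class name, the `submit`/`on_success`/`on_error` entry points, and the convention of returning the requests the caller should actually send are assumptions of mine; re-queueing failed requests for retry is left to the caller:

```python
from collections import deque

class CongestionThrottle:
    """Sketch of the AIMD request throttler described above;
    not a reference implementation."""

    def __init__(self, initial_cwnd=20, initial_ssthresh=1024, factor=0.5):
        self.initial_cwnd = initial_cwnd
        self.cwnd = float(initial_cwnd)
        self.ssthresh = float(initial_ssthresh)
        self.factor = factor        # 0.5 (halving), or e.g. 0.9 for stable rates
        self.flying = set()         # ids of outstanding requests
        self.ignored = set()        # ids whose error responses we must ignore
        self.queue = deque()        # pending (request_id, payload) pairs

    def submit(self, request_id, payload):
        self.queue.append((request_id, payload))
        return self._pump()

    def _pump(self):
        """Dequeue requests while the window has room; the caller sends them."""
        to_send = []
        while self.queue and len(self.flying) < self.cwnd:
            rid, payload = self.queue.popleft()
            self.flying.add(rid)
            to_send.append((rid, payload))
        return to_send

    def on_success(self, request_id):
        self.flying.discard(request_id)
        self.ignored.discard(request_id)
        # Additive increase: +1 in "slow start", +1/cwnd afterwards, but
        # never grow the window beyond (outstanding + 1), so it can't
        # increase boundlessly while the sender never fills it.
        step = 1.0 if len(self.flying) < self.ssthresh else 1.0 / self.cwnd
        self.cwnd = max(self.cwnd, min(len(self.flying) + 1, self.cwnd + step))
        return self._pump()

    def on_error(self, request_id, tahoe=True):
        self.flying.discard(request_id)
        if request_id not in self.ignored:
            # Multiplicative decrease on the first error of a burst only:
            # errors from requests already in flight are ignored.
            self.ssthresh = max(self.cwnd * self.factor, 1.0)
            self.cwnd = float(self.initial_cwnd) if tahoe else self.ssthresh
            self.ignored = set(self.flying)  # copy, not an alias
        return self._pump()
```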
This is how `cwnd` and `ssthresh` behave in the above simulation with the two factors 0.9 and 0.5:
- exponential backoff is suitable in situations where you have many independent clients sending few requests to the server/cloud service. If you do use it, add jitter: it makes a huge difference if you have request bursts.
- if you have a limited number of clients sending more requests, exponential backoff is a terrible strategy that causes high error rates, lots of repeated requests (and thus costs) and increased latencies
- give a congestion control algorithm like the above a try: it finds and stays fairly close to (but below, nearly all the time) the sustainable rate, adapts quickly to changes, and reduces error responses, extra requests, and both individual and group latency