almessadi.

Retries Need Backoff, Jitter, and a Clear Budget

Retries improve resilience only when they spread load and stop at the right time. Without backoff and jitter, they can turn a transient failure into a broader outage.

Published: September 28, 2024
Reading time: 9 min read

Retries are one of the easiest resilience features to get wrong because they feel harmless at small scale.

A single service retrying a failed request is not a problem. Ten thousand instances retrying on the same schedule often is.

That is where retry logic turns from recovery mechanism into load amplifier.

The Naive Pattern

This is common and dangerous:

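// Anti-pattern: fixed delay, no jitter, and the last error is swallowed.
for (let attempt = 0; attempt < 3; attempt += 1) {
  try {
    return await callPaymentApi();
  } catch (error) {
    // Every caller waits exactly one second, so retries stay synchronized.
    await sleep(1000);
  }
}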

All callers fail together. All callers sleep for the same amount of time. All callers wake up together and hammer the dependency again.
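To make the lockstep effect concrete, here is a minimal sketch that counts how many distinct instants the first retry lands on when every client fails at time zero. The names `retryTimes` and `distinctInstants` are hypothetical and exist only for this illustration.

```typescript
// Hypothetical illustration: when does each client's first retry fire,
// assuming all clients failed together at t = 0?
function retryTimes(
  clients: number,
  delayMs: (client: number) => number,
): number[] {
  return Array.from({ length: clients }, (_, i) => Math.floor(delayMs(i)));
}

// Count how many different milliseconds the retries are spread across.
function distinctInstants(times: number[]): number {
  return new Set(times).size;
}

// Fixed delay: every client retries at the same millisecond.
const fixedTimes = retryTimes(10_000, () => 1000);

// Jittered delay: retries are spread across the 1000..1999 ms window.
const jitteredTimes = retryTimes(10_000, () => 1000 + Math.random() * 1000);

console.log(distinctInstants(fixedTimes));    // 1: a single synchronized burst
console.log(distinctInstants(jitteredTimes)); // many instants, varies per run
```

With a fixed delay the dependency sees all ten thousand retries in the same millisecond. With jitter, the same number of requests arrives spread over a full second.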

What Better Retry Logic Includes

Good retry behavior usually combines:

  • exponential backoff
  • jitter
  • a maximum retry budget

For example:

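function backoffMs(attempt: number): number {
  // Exponential growth from 250 ms, capped at 5 s *before* jitter is added.
  // Capping after the jitter would make every late attempt exactly 5000 ms,
  // which puts clients back in lockstep.
  const base = Math.min(250 * 2 ** attempt, 5000);
  // Up to 30% random jitter, so the longest possible delay is 6500 ms.
  const jitter = Math.random() * 0.3 * base;
  return base + jitter;
}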

The point is not mathematical elegance. The point is to stop thousands of clients from retrying in lockstep.
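The third ingredient, the budget, is easiest to see in a wrapper around the call. The following is a sketch under stated assumptions rather than a definitive implementation: `withRetry`, `RetryBudget`, `pause`, and `delayWithJitterMs` are hypothetical names, and the budget here limits both the attempt count and the total elapsed time.

```typescript
// Hypothetical helpers for illustration; not a library API.
const pause = (ms: number): Promise<void> =>
  new Promise((resolve) => setTimeout(resolve, ms));

// Capped exponential delay with up to 30% jitter on top.
function delayWithJitterMs(attempt: number): number {
  const base = Math.min(250 * 2 ** attempt, 5000);
  return base + Math.random() * 0.3 * base;
}

interface RetryBudget {
  maxAttempts: number;  // total tries, including the first call
  maxElapsedMs: number; // give up once retrying would exceed this duration
}

async function withRetry<T>(
  operation: () => Promise<T>,
  budget: RetryBudget,
  wait: (ms: number) => Promise<void> = pause, // injectable for tests
): Promise<T> {
  const startedAt = Date.now();
  let lastError: unknown;

  for (let attempt = 0; attempt < budget.maxAttempts; attempt += 1) {
    try {
      return await operation();
    } catch (error) {
      lastError = error;
      const delay = delayWithJitterMs(attempt);
      const isLastAttempt = attempt === budget.maxAttempts - 1;
      const overBudget = Date.now() - startedAt + delay > budget.maxElapsedMs;
      // Stop without sleeping when the budget is spent.
      if (isLastAttempt || overBudget) break;
      await wait(delay);
    }
  }

  // Surface the final failure instead of silently returning undefined.
  throw lastError;
}
```

A call might look like `await withRetry(() => callPaymentApi(), { maxAttempts: 4, maxElapsedMs: 10_000 })`. Rethrowing the last error matters: once the budget is exhausted, the caller needs to know the operation failed.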

The Trade-Off

Retries only make sense for errors that are likely to be transient. They are usually wrong for:

  • validation errors
  • permanent authorization errors
  • requests that are unsafe to repeat without idempotency

That is why retry policy and idempotency policy belong together.
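One way to keep the two policies together is to decide retryability from both the kind of failure and whether the request can be repeated safely. The sketch below assumes an HTTP dependency. The status-code grouping is an illustrative assumption rather than a universal rule, and `shouldRetry`, `OutgoingRequest`, and `idempotencyKey` are hypothetical names.

```typescript
// Hypothetical request shape for illustration.
interface OutgoingRequest {
  method: "GET" | "HEAD" | "PUT" | "DELETE" | "POST" | "PATCH";
  // Assumed convention: the server deduplicates requests that share a key.
  idempotencyKey?: string;
}

// Statuses that often signal a transient condition (timeouts, throttling,
// overloaded or unavailable upstreams). Adjust for the real dependency.
const TRANSIENT_STATUSES = new Set<number>([408, 429, 500, 502, 503, 504]);

// Methods that HTTP semantics define as idempotent.
const IDEMPOTENT_METHODS = new Set<string>(["GET", "HEAD", "PUT", "DELETE"]);

function shouldRetry(request: OutgoingRequest, status: number): boolean {
  // Validation (400, 422) and authorization (401, 403) failures stop here:
  // repeating the same request will not change the answer.
  if (!TRANSIENT_STATUSES.has(status)) return false;

  // A transient failure is retried only if repeating the request is safe.
  return (
    IDEMPOTENT_METHODS.has(request.method) ||
    request.idempotencyKey !== undefined
  );
}

console.log(shouldRetry({ method: "GET" }, 503));  // true
console.log(shouldRetry({ method: "POST" }, 503)); // false
console.log(shouldRetry({ method: "POST", idempotencyKey: "order-123" }, 503)); // true
console.log(shouldRetry({ method: "POST", idempotencyKey: "order-123" }, 422)); // false
```

The payment call from earlier is the motivating case: a `POST` that charges a card should only be retried when the server can recognize the repeat as the same operation.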
