What an Error Budget Means in SRE

An error budget is the operationalised form of a service level objective. Instead of stating a reliability target as an abstract percentage, an error budget expresses it as a concrete allowance — a quantity of downtime or failures that a service is permitted to accumulate before it violates its SLO. Teams track this allowance in real time to understand how much room remains and how fast it is being consumed.

The formula: turning an SLO into a budget

The calculation is straightforward:

Error budget = (1 − SLO target) × measurement window

For a service with a 99.9% availability SLO measured over a 30-day window:

Error budget = (1 − 0.999) × 30 days
             = 0.001 × 43,200 minutes
             = 43.2 minutes

That 43.2 minutes is the total amount of downtime permitted in the month before the service is out of compliance with its SLO. Every minute of outage draws from this budget. When it reaches zero, the budget is exhausted, and any further downtime in the window puts the service in breach of its SLO.
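The calculation above can be sketched in a few lines of Python. This is a minimal illustration for a time-based SLO only; the function name `error_budget_minutes` and its parameters are hypothetical, not part of any standard library or tool.

```python
def error_budget_minutes(slo_target: float, window_days: int) -> float:
    """Return the allowed downtime, in minutes, for a time-based SLO.

    slo_target is a fraction (0.999 for "three nines"), not a percentage.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be a fraction between 0 and 1")
    window_minutes = window_days * 24 * 60
    return (1.0 - slo_target) * window_minutes


# Compare a few common targets over the same 30-day window.
for slo in (0.99, 0.999, 0.9999):
    budget = error_budget_minutes(slo, 30)
    print(f"SLO {slo}: {budget:.2f} minutes of budget")
```

For the 99.9% case this reproduces the 43.2 minutes worked out above. Each additional nine shrinks the budget by a factor of ten.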

Budget consumed and budget remaining

At any point in the measurement window, the budget has two useful derived values:

  • Budget consumed = total downtime (or bad requests) accumulated so far in the window
  • Budget remaining = total budget − budget consumed

These two numbers tell you whether the service is within tolerance and how much headroom is left. A service that has consumed 5 minutes of a 43-minute budget midway through the month is in a healthy position. A service that has consumed 40 minutes in the first two weeks is at serious risk of breaching its SLO before the window closes.

Budget remaining is often expressed as a percentage of the total budget: (remaining / total) × 100. A remaining budget of 90% is healthy. A remaining budget of 10% late in the window is a signal to slow down risky changes.
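As a rough sketch, the two derived values and the percentage form might be tracked like this. The `BudgetStatus` class and its field names are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BudgetStatus:
    total_minutes: float     # full budget for the window
    consumed_minutes: float  # downtime accumulated so far

    @property
    def remaining_minutes(self) -> float:
        # Goes negative once the budget has been overspent.
        return self.total_minutes - self.consumed_minutes

    @property
    def remaining_percent(self) -> float:
        return self.remaining_minutes / self.total_minutes * 100.0


# The healthy example from the text: 5 minutes used of a 43.2-minute budget.
status = BudgetStatus(total_minutes=43.2, consumed_minutes=5.0)
print(f"{status.remaining_minutes:.1f} min left "
      f"({status.remaining_percent:.1f}% of budget)")
```

Letting the remaining value go negative, rather than clamping it at zero, keeps the size of an overspend visible.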

Burn rate: how fast the budget is being spent

Burn rate is the ratio of how fast the error budget is actually being consumed to the steady rate that would spend it exactly by the end of the window.

Burn rate = (budget consumed so far / total budget) / (elapsed time / total window)

A burn rate of 1 means the service is consuming its error budget at exactly the rate the SLO allows: it will arrive at the end of the window with zero budget left. A burn rate of 2 means the budget is being consumed twice as fast as intended, so at that rate it will be exhausted halfway through the window. A burn rate of 0.5 means the budget is being consumed at half the allowed rate, so half of it would still be unspent when the window closes.
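A small sketch of the burn rate formula, plus a projection of when the budget would run out if the current rate held. The function names are hypothetical. The projection assumes a constant burn rate, which real incidents rarely follow.

```python
def burn_rate(consumed: float, total_budget: float,
              elapsed: float, window: float) -> float:
    """Fraction of budget spent divided by fraction of window elapsed.

    consumed/total_budget share one unit (e.g. minutes);
    elapsed/window share another (e.g. days).
    """
    if total_budget <= 0 or elapsed <= 0 or window <= 0:
        raise ValueError("total_budget, elapsed and window must be positive")
    return (consumed / total_budget) / (elapsed / window)


def days_to_exhaustion(rate: float, window_days: float) -> float:
    """Days from the start of the window until the budget hits zero,
    assuming the burn rate stays constant."""
    if rate <= 0:
        raise ValueError("rate must be positive")
    return window_days / rate


# Half the budget gone a quarter of the way through a 30-day window.
rate = burn_rate(consumed=21.6, total_budget=43.2, elapsed=7.5, window=30)
print(f"burn rate {rate:.1f}, exhausted on day {days_to_exhaustion(rate, 30):.0f}")
```

The example gives a burn rate of 2, which matches the halfway-through-the-window case described above.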

Burn rate alerts are one of the most practical reliability alerting patterns from Google's SRE workbook. A fast burn over a short window (for example, burn rate > 14.4 over the last hour) indicates an active incident that needs immediate attention. A slower but sustained burn over a longer window (for example, burn rate > 6 over six hours) indicates a degradation that is not yet an emergency but will exhaust the budget well before the window closes if it continues.

How teams use error budgets operationally

Error budgets are useful beyond reliability reporting. In SRE practice, they serve as a mechanism for balancing feature velocity against reliability investment:

  • When the budget is healthy, teams have more confidence to ship changes, run experiments, and accept the risk that comes with deploying new code. The remaining budget functions as a calculated risk allowance.
  • When the budget is nearly exhausted, it provides a concrete justification for slowing deployment frequency, pausing risky infrastructure changes, and prioritising reliability fixes over new features.
  • When the budget is fully consumed, some teams implement a formal policy: no new feature deployments until the next measurement window opens or the underlying reliability problems are resolved.

This framing shifts reliability conversations away from subjective arguments about "how stable is stable enough" and toward a shared, quantified policy that both engineering and product teams can reason about.
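One way to make such a policy explicit is to encode it as a small function. The sketch below is purely illustrative: the `DeployPolicy` names and the 10% caution threshold are assumptions, and a real policy would be agreed between engineering and product teams.

```python
from enum import Enum


class DeployPolicy(Enum):
    NORMAL = "ship changes and experiments as usual"
    CAUTIOUS = "slow deployments, pause risky infrastructure changes"
    FREEZE = "reliability fixes only until the budget recovers"


def deploy_policy(remaining_percent: float,
                  caution_below: float = 10.0) -> DeployPolicy:
    """Map remaining budget (as a percentage) to a deployment stance.

    caution_below is a hypothetical threshold, not a standard value.
    """
    if remaining_percent <= 0.0:
        return DeployPolicy.FREEZE
    if remaining_percent < caution_below:
        return DeployPolicy.CAUTIOUS
    return DeployPolicy.NORMAL


for pct in (90.0, 8.0, 0.0):
    print(f"{pct:5.1f}% remaining -> {deploy_policy(pct).value}")
```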

Multi-window alerting in practice

A single burn rate alert can be noisy. Google's SRE workbook recommends pairing each long window with a much shorter one (for example, 1 hour with 5 minutes, or 6 hours with 30 minutes) and alerting only when the burn rate exceeds the threshold in both windows simultaneously. The long window filters out brief spikes that self-correct. The short window confirms the budget is still being burned, so the alert clears quickly once the problem is fixed.

A common implementation: page when the burn rate exceeds 14.4 over both the last hour and the last 5 minutes, or when it exceeds 6 over both the last six hours and the last 30 minutes. The first pair catches fast-burning incidents. The second pair catches slower degradations that would otherwise accumulate unnoticed. Against a 30-day window, these thresholds correspond to spending about 2% of the budget in one hour (fast burn) and 5% over six hours (slow burn).
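A rule that fires only when every window is over its threshold can be sketched as below. The function names, window labels, and sample values are assumptions. The example pairing (a 1-hour and a 5-minute window, both at 14.4) follows one pairing from the SRE workbook and is only illustrative. The helper `budget_fraction_consumed` shows the arithmetic that links a burn rate to the share of budget spent.

```python
from typing import Mapping


def burn_rate_from_error_ratio(error_ratio: float, slo_target: float) -> float:
    """Observed error ratio divided by the error ratio the SLO allows."""
    return error_ratio / (1.0 - slo_target)


def budget_fraction_consumed(rate: float, window_hours: float,
                             slo_window_hours: float = 720.0) -> float:
    """Share of the total budget spent by `rate` sustained for window_hours.

    720 hours is a 30-day SLO window.
    """
    return rate * window_hours / slo_window_hours


def should_alert(observed: Mapping[str, float],
                 thresholds: Mapping[str, float]) -> bool:
    """Fire only if every window's burn rate exceeds its threshold."""
    return all(observed[name] > limit for name, limit in thresholds.items())


thresholds = {"1h": 14.4, "5m": 14.4}  # illustrative pairing
print(should_alert({"1h": 16.0, "5m": 20.0}, thresholds))  # both windows hot
print(should_alert({"1h": 16.0, "5m": 1.0}, thresholds))   # spike already over
print(f"{budget_fraction_consumed(14.4, 1):.1%} of budget per hour at 14.4")
```

Requiring both windows to agree trades a little detection delay for fewer pages about problems that have already resolved.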

What counts as a failure?

Error budgets can be defined against different SLIs (service level indicators): availability (was the service responding?), latency (did it respond within the target time?), or error rate (what fraction of requests returned errors?). Each SLI produces its own error budget. Teams typically start with availability because it is the most straightforward to measure, then add latency budgets once the basic model is established.
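For request-based SLIs the same idea applies, with bad requests taking the place of downtime minutes. The sketch below assumes hypothetical request counts and SLO targets. It measures the allowance against requests seen so far, so the allowance grows with traffic.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RequestCounts:
    total: int   # all requests in the window so far
    failed: int  # requests counted as errors (definition is an assumption)
    slow: int    # requests slower than the latency target


def remaining_bad_requests(total: int, bad: int, slo_target: float) -> float:
    """Allowed bad requests minus bad requests observed so far."""
    allowed = (1.0 - slo_target) * total
    return allowed - bad


counts = RequestCounts(total=1_000_000, failed=400, slow=12_000)

# Separate budgets per SLI; the targets here are illustrative.
availability_left = remaining_bad_requests(counts.total, counts.failed, 0.999)
latency_left = remaining_bad_requests(counts.total, counts.slow, 0.99)

print(f"availability budget left: {availability_left:.0f} requests")
print(f"latency budget left: {latency_left:.0f} requests")
```

In this made-up sample the availability budget still has headroom while the latency budget is overspent, which is why each SLI needs its own budget.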

Frequently asked questions

What is the formula for an error budget?

Error budget = (1 − SLO target) × measurement window. For a 99.9% SLO over 30 days: (1 − 0.999) × 43,200 minutes = 43.2 minutes of allowed downtime.

What is burn rate?

Burn rate is how fast you are spending your error budget relative to the steady rate that would use it up exactly at the end of the window. A burn rate of 1 means you are on track to end the window with exactly zero budget remaining. A burn rate of 3 means you will exhaust the budget in one third of the window at your current pace.

What happens when the error budget runs out?

The service has used its entire allowance, and any further failures in the window put it in breach of the SLO. Some teams implement a policy of freezing non-critical deployments until the next measurement window opens or until reliability improves. The practical consequences depend on whether the SLO is internal or tied to a customer-facing SLA with financial penalties.

What is the difference between an SLO and an SLA?

An SLO (service level objective) is an internal reliability target. An SLA (service level agreement) is a contract with a customer that specifies what happens — usually service credits — if reliability falls below a stated level. SLOs are often set more strictly than SLAs so teams have a buffer before a customer-facing breach occurs.

How often does an error budget reset?

Typically monthly, though rolling windows are also common. A rolling 30-day window means the budget is always measured against the last 30 days rather than resetting on the first of each month. Rolling windows catch degradations that straddle month boundaries.
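The rolling-window behaviour can be sketched with a fixed-length queue of daily downtime totals. The `RollingBudget` class and its daily granularity are assumptions for illustration. Real systems usually work from finer-grained metrics.

```python
from collections import deque


class RollingBudget:
    """Track budget against the most recent `window_days` of downtime."""

    def __init__(self, slo_target: float, window_days: int = 30) -> None:
        self.total_minutes = (1.0 - slo_target) * window_days * 24 * 60
        # A deque with maxlen drops the oldest day when a new one is added.
        self._daily_downtime = deque(maxlen=window_days)

    def record_day(self, downtime_minutes: float) -> None:
        self._daily_downtime.append(downtime_minutes)

    @property
    def remaining_minutes(self) -> float:
        return self.total_minutes - sum(self._daily_downtime)


budget = RollingBudget(slo_target=0.999)
budget.record_day(10.0)          # an outage on day 1
for _ in range(29):
    budget.record_day(0.0)       # 29 clean days
print(f"day 30: {budget.remaining_minutes:.1f} min left")
budget.record_day(0.0)           # day 31: the outage falls out of the window
print(f"day 31: {budget.remaining_minutes:.1f} min left")
```

With a rolling window the budget recovers gradually as old incidents age out, rather than all at once on a reset date.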

What counts as a failure for error budget purposes?

It depends on the SLI (service level indicator) the budget tracks. For an availability SLI, a failure is a period where the service was unreachable or returning errors above a threshold. For a latency SLI, a failure is a request that exceeded the target response time.

Does error budget only apply to availability, or to latency too?

Both. You can have separate error budgets for different SLIs — one for availability and another for latency. Most teams start with availability because it is the simplest to measure and explain, then add latency budgets once the model is established.