Defining Error-Rate Thresholds That Trigger Rollback
Problem Statement
βRoll back if errors get badβ is not a plan β under pressure it produces hesitation while error rates climb. You need concrete error-rate numbers, evaluated over fixed windows, that page the right person or auto-fire the reversal. This is the detailed companion to Rollback Trigger Thresholds, covering how to set 5xx and 4xx ceilings, choose windows, and wire alerting that turns a metric into an action.
When to Use This Approach
- You are in the soak window after a cutover and need unambiguous abort criteria.
- Your monitoring exposes per-status-code request rates at edge and origin.
- You want certain breaches to auto-fire and others to require owner confirmation.
- You need to distinguish origin failures (5xx) from broken routing (4xx).
- Stakeholders demand a defensible, pre-agreed basis for any rollback decision.
Step-by-Step Instructions
1. Separate the 5xx and 4xx Ceilings
Treat the two error classes distinctly: 5xx signals origin/application failure, 4xx signals broken redirects or missing resources. Set a 5xx ratio ceiling (e.g. 2% over 5 minutes) and a separate 4xx ceiling relative to baseline.
# 5xx ratio across all requests, evaluated for a 5-minute window
sum(rate(http_requests_total{code=~"5.."}[5m]))
/ sum(rate(http_requests_total[5m])) > 0.02
2. Choose Windows That Filter Noise but Bound Damage
Pick a window long enough to ignore a cache-warm blip yet short enough to cap exposure. Five minutes suits 5xx; ten minutes suits 4xx, which is noisier from crawlers.
# Prometheus: require the breach to persist for the full window before firing
- alert: Rollback5xxBreach
expr: sum(rate(http_requests_total{code=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) > 0.02
for: 5m
labels: {severity: page, action: rollback}
3. Add a Fast Auto-Fire Tier for Severe Breaches
Define a second, stricter tier that auto-triggers without confirmation when the breach is unambiguous, so catastrophic failures do not wait on a human.
# Severe tier: 5xx above 5% for 2 minutes auto-triggers the reversal
- alert: Rollback5xxSevere
expr: sum(rate(http_requests_total{code=~"5.."}[2m])) / sum(rate(http_requests_total[2m])) > 0.05
for: 2m
labels: {severity: critical, action: auto-rollback}
4. Wire Alerting to a Pager, Not a Dashboard
Route the breach to the named rollback owner via pager and a synthetic backstop, so detection never depends on someone watching a screen.
# Synthetic backstop: page if the critical path returns 5xx three times in a row
fails=0
while true; do
c=$(curl -s -o /dev/null -w '%{http_code}' https://www.example.com/checkout)
[ "${c:0:1}" = "5" ] && fails=$((fails+1)) || fails=0
[ "$fails" -ge 3 ] && echo "PAGE: sustained 5xx ($c)" && break
sleep 20
done
Worked Example
A retailer cut over its checkout flow at 09:00. Baseline 5xx was 0.1% and 4xx was 1.2%. Thresholds were set at 5xx > 2% for 5 minutes (page), 5xx > 5% for 2 minutes (auto-rollback), and 4xx > baseline + 10 points for 10 minutes (page).
At 09:06 the new originβs connection pool saturated; 5xx climbed to 3.4% and held. The Rollback5xxBreach alert fired at 09:11 after the full 5-minute window, paging the release owner, who confirmed the abort. Separately, a broken redirect map pushed 4xx from 1.2% to 13%, breaching the 4xx ceiling at 09:16 β this pointed at routing rather than origin, so recovery used Redirect Rollback & Recovery rather than a DNS revert. Had 5xx instead spiked to 6%, Rollback5xxSevere would have auto-fired at 09:08, skipping the confirmation step entirely. The two distinct error classes drove two distinct recovery paths, exactly as the threshold design intended.
Verification
- Inject a controlled 5xx burst in staging and confirm
Rollback5xxBreachfires only after the full 5-minute window, not on a transient spike. - Confirm
Rollback5xxSevereauto-fires within ~2 minutes on a >5% burst, and that the page reaches the ownerβs device. - Drive a test path to 404 and confirm the 4xx alert fires while 5xx stays quiet, proving the classes are independent.
FAQ
What is a sensible default 5xx threshold for a migration soak? A ratio of 2% sustained over 5 minutes for the paging tier, plus a 5% over 2 minutes auto-fire tier. These work as absolute floors even without a clean baseline; tighten them toward baseline + a small margin once you have stable pre-migration numbers.
Why evaluate 4xx separately from 5xx? They indicate different failures. A 5xx surge means the origin or application is failing and a DNS or origin rollback is likely; a 4xx surge usually means redirects or resources are broken and the fix lives in the routing layer. Conflating them sends you down the wrong recovery path.
Should every error-rate breach automatically roll back? No. Reserve auto-fire for severe, unambiguous breaches. Moderate breaches should page the named owner for a confirmation call, because softer rises can stem from benign causes like a crawler hitting deprecated paths or analytics tagging breaking during the move.
Related
- Rollback Trigger Thresholds
- Migration Rollback Playbooks
- Redirect Rollback & Recovery
- DNS Rollback Procedures
β Back to Rollback Trigger Thresholds