Our payment service had 3 cascading failures in 6 months — all triggered by a downstream timeout. Should we add a circuit breaker, switch to async processing, or both?
Implement a circuit breaker using Resilience4j (Java)
Decision
Implement a circuit breaker using Resilience4j/Polly/equivalent: a library, not a new service. Configure:

- a 50% failure rate threshold over a 20-request sliding window
- a 30-second open duration, followed by 3 half-open probe requests
- a 5-second downstream call timeout, replacing the likely 30s+ default that causes thread pool exhaustion
- HTTP 503 with a Retry-After: 30 header returned while the circuit is open
- in-process retries with exponential backoff (2s, 4s, 8s, max 3 attempts) on the existing task queue or a scheduled executor, with no new infrastructure

Critical failure mode: intermittent failures at a ~40% error rate never trip the circuit. Mitigate this by adding an 80% slow-call rate threshold at 5 seconds alongside the failure rate threshold. The economics are clear: a 30-second false trip costs ~$375 in rejected transactions versus $180K per cascading failure outage. One part-time senior engineer can deliver this in 5-8 working days. This is a library-level change, not an architecture change.
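For concreteness, a minimal Resilience4j sketch of this configuration follows. The class and breaker names (PaymentGatewayBreaker, "paymentGateway") are illustrative assumptions, not part of the decision; the builder methods are standard Resilience4j API (io.github.resilience4j:resilience4j-circuitbreaker).

```java
import io.github.resilience4j.circuitbreaker.CircuitBreaker;
import io.github.resilience4j.circuitbreaker.CircuitBreakerConfig;
import io.github.resilience4j.circuitbreaker.CircuitBreakerConfig.SlidingWindowType;
import io.github.resilience4j.circuitbreaker.CircuitBreakerRegistry;
import java.time.Duration;

// Illustrative holder class; the name is an assumption, not prescribed.
public final class PaymentGatewayBreaker {

    public static CircuitBreaker build() {
        CircuitBreakerConfig config = CircuitBreakerConfig.custom()
                // Open when >= 50% of the last 20 recorded calls failed.
                .failureRateThreshold(50)
                .slidingWindowType(SlidingWindowType.COUNT_BASED)
                .slidingWindowSize(20)
                // Mitigation for the ~40% intermittent-failure mode: also open
                // when >= 80% of calls take longer than 5 seconds.
                .slowCallRateThreshold(80)
                .slowCallDurationThreshold(Duration.ofSeconds(5))
                // Reject calls for 30 seconds, then admit 3 half-open probes.
                .waitDurationInOpenState(Duration.ofSeconds(30))
                .permittedNumberOfCallsInHalfOpenState(3)
                .build();

        return CircuitBreakerRegistry.of(config).circuitBreaker("paymentGateway");
    }
}
```

Note that the 5-second call timeout itself is enforced outside the breaker, on the HTTP client's request timeout or via Resilience4j's TimeLimiter; it is what frees request threads quickly and prevents the pool exhaustion the decision describes.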
Next actions
- Write the circuit breaker configuration class using Resilience4j (or the stack equivalent) with these exact parameters: 50% failure rate threshold, 20-request sliding window, 80% slow-call rate at 5 seconds, 30-second open duration, 3 half-open probes. Wire it around the downstream payment gateway client with a 5-second call timeout replacing the current default, as sketched below.
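A hedged wiring sketch of that next action follows, reusing the PaymentGatewayBreaker sketch above and covering the open-circuit 503 mapping and the 2s/4s/8s retries. GatewayClient, ChargeHandler, and ServiceUnavailableException are hypothetical placeholders for the team's real types; Retry, IntervalFunction, and CallNotPermittedException are standard Resilience4j API.

```java
import io.github.resilience4j.circuitbreaker.CallNotPermittedException;
import io.github.resilience4j.circuitbreaker.CircuitBreaker;
import io.github.resilience4j.core.IntervalFunction;
import io.github.resilience4j.retry.Retry;
import io.github.resilience4j.retry.RetryConfig;
import java.util.function.Supplier;

public class ChargeHandler {

    /** Hypothetical stand-in for the real downstream gateway client. */
    interface GatewayClient {
        String charge(String paymentId);
    }

    /** Hypothetical exception the web layer maps to HTTP 503 + "Retry-After: 30". */
    static final class ServiceUnavailableException extends RuntimeException {
        ServiceUnavailableException(Throwable cause) { super(cause); }
    }

    private final GatewayClient gateway;
    private final CircuitBreaker breaker = PaymentGatewayBreaker.build();

    // Exponential backoff of 2s, 4s, 8s. Resilience4j's maxAttempts counts the
    // initial call, so maxAttempts(4) yields three retries. Calls rejected by
    // the open circuit are not retried.
    private final Retry retry = Retry.of("paymentGatewayRetry", RetryConfig.custom()
            .maxAttempts(4)
            .intervalFunction(IntervalFunction.ofExponentialBackoff(2000, 2.0))
            .ignoreExceptions(CallNotPermittedException.class)
            .build());

    ChargeHandler(GatewayClient gateway) {
        this.gateway = gateway;
    }

    String charge(String paymentId) {
        // Retry wraps the breaker so every attempt is recorded in its window.
        Supplier<String> call = Retry.decorateSupplier(retry,
                CircuitBreaker.decorateSupplier(breaker, () -> gateway.charge(paymentId)));
        try {
            return call.get();
        } catch (CallNotPermittedException circuitOpen) {
            // Circuit is open: shed load instead of tying up request threads.
            throw new ServiceUnavailableException(circuitOpen);
        }
    }
}
```

Blocking backoff holds the request thread for up to 14 seconds across retries; the decision's alternative of re-enqueueing attempts on the existing task queue or scheduled executor trades response latency for thread economy and may fit the checkout path better.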
Council notes
Evidence boundary
Observed from your filing
- Our payment service had 3 cascading failures in 6 months — all triggered by a downstream timeout. Should we add a circuit breaker, switch to async processing, or both?
Assumptions used for analysis
- The downstream payment gateway timeout is currently set to 30s+ and thread pool exhaustion is the cascading failure mechanism
- The team has access to a circuit breaker library (Resilience4j, Polly, or equivalent) compatible with their stack at zero additional cost
- The payment service processes enough requests that a 20-request sliding window provides meaningful signal (not so low-volume that the window covers hours of traffic)
- The $180K per outage estimate is roughly accurate, making the $375 false-trip cost an acceptable trade-off
- One part-time senior engineer is available for the 5-8 working days of implementation
- Current scale was not addressed in the filing; moderate scale is assumed by default
Inferred candidate specifics
- Chosen path (b003): the circuit breaker implementation stated under Decision above, with the same configuration parameters, failure-mode mitigation, economics, and timeline
- Selection rationale: b003 had the highest confidence (0.94), survived 3 rounds of adversarial debate (including splits and strengthening), named specific libraries, gave exact configuration parameters, quantified the failure-mode economics ($375 false trip vs $180K cascade), provided an implementation timeline (5-8 days), and identified two specific failure modes with mitigations. No other surviving branch approached this level of specificity.
- Rejected (bulkhead variant): lower confidence (0.80 vs 0.94), less specific configuration parameters, no cost analysis, no library recommendations; the bulkhead pattern adds complexity without addressing the core thread-exhaustion mechanism. Its inconsistent-state failure mode is real but unquantified.
- Rejected (service mesh, tagged [reframe]): a valid strategic consideration but not actionable against the immediate problem of 3 cascading failures in 6 months. It offers no specific implementation path, no timeline, and no failure-mode analysis, and a service mesh is a significant infrastructure investment that contradicts the implied resource constraints.
- Rejected (async payment pipeline, b005): architecturally superior long-term, but it requires extensive changes to the checkout state machine, introduces new UX complexity (delayed payment confirmation), and demands significantly more engineering effort than the immediate problem warrants. It would be the right move if cascading failures persist after the circuit breaker ships.
- Rejected (circuit breaker plus async queue): structurally redundant with b003, and it adds async queue complexity without specifying the queue technology, an idempotency key scheme, or how payment state consistency is maintained. Worst of both worlds: async complexity without async rigor.
- Rejected (upstream throttling reframe, b004): an interesting reframe (timeouts may signal upstream overload) but unsupported and low-confidence. Throttling legitimate checkout requests is a worse customer-experience trade-off than circuit-breaking failed payment calls.
Inferred specifics table
| Value | Kind | Basis | Where introduced |
|---|---|---|---|
| 50% failure rate threshold | threshold | synthetic | chosen_path |
| 20-request sliding window | estimate | synthetic | chosen_path |
| 30-second open duration | estimate | synthetic | chosen_path |
| 30s+ assumed current downstream timeout | estimate | synthetic | chosen_path |
| HTTP 503 with Retry-After: 30 on open circuit | estimate | synthetic | chosen_path |
| 2s / 4s / 8s retry backoff (max 3 attempts) | estimate | synthetic | chosen_path |
| ~40% error rate that never trips the circuit | threshold | synthetic | chosen_path |
| 80% slow-call rate threshold at 5 seconds | threshold | synthetic | chosen_path |
| ~$375 per 30-second false trip | estimate | synthetic | chosen_path |
| $180K per cascading failure outage | estimate | synthetic | chosen_path |
| 50% failure rate threshold | threshold | synthetic | next_action |
| 20-request sliding window | estimate | synthetic | next_action |
| 80% slow-call rate at 5 seconds | threshold | synthetic | next_action |
| 30-second open duration | estimate | synthetic | next_action |
| 0.94 (b003 confidence) | estimate | synthetic | selection_rationale |
| $375 false trip vs $180K cascade | estimate | synthetic | selection_rationale |
| 0.80 vs 0.94 (confidence comparison) | estimate | synthetic | rejected_alternatives.rationale |
Unknowns blocking a firmer verdict
- The actual current downstream timeout value is assumed to be 30s+ based on typical payment gateway defaults — the real value should be verified before configuring the 5-second replacement
- The $180K per outage figure and 4-hour outage duration are from the winning branch but are not verified against actual incident data — actual cost per outage should be measured
- Whether the downstream provider's failure pattern is truly random or correlated (e.g., end-of-month settlement spikes) affects whether a fixed count-based sliding window is the right detection mechanism; a time-based window alternative is sketched after this list
- The killed branch b005's async payment pipeline may be the correct long-term architecture if circuit breaker alone doesn't reduce failure frequency — this should be revisited after 3 months of circuit breaker operation
- The killed branch b004 raised a valid point that timeouts may signal upstream overload rather than downstream failure — root cause analysis of the 3 incidents should confirm the actual failure mechanism
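If incident analysis confirms correlated failure bursts, the detection window can be switched without any architectural change. A minimal sketch, assuming Resilience4j; the 60-second window and 10-call floor are illustrative starting points, not part of the decision:

```java
import io.github.resilience4j.circuitbreaker.CircuitBreakerConfig;
import io.github.resilience4j.circuitbreaker.CircuitBreakerConfig.SlidingWindowType;
import java.time.Duration;

final class TimeBasedWindowSketch {
    // Evaluate the failure rate over outcomes from the last 60 seconds instead
    // of the last 20 calls, so detection tracks a fixed time horizon rather
    // than a fixed call count.
    static CircuitBreakerConfig config() {
        return CircuitBreakerConfig.custom()
                .slidingWindowType(SlidingWindowType.TIME_BASED)
                .slidingWindowSize(60)      // interpreted as seconds when TIME_BASED
                .minimumNumberOfCalls(10)   // assumed floor before the rate is evaluated
                .failureRateThreshold(50)
                .slowCallRateThreshold(80)
                .slowCallDurationThreshold(Duration.ofSeconds(5))
                .build();
    }
}
```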