There’s a moment in every engineer’s career when you realize something uncomfortable: the system you’re responsible for is only as reliable as its weakest external dependency. And you don’t control that dependency.
Maybe it’s a payment gateway that times out during peak hours. Maybe it’s a legacy inventory system maintained by a team in another timezone that deploys on Fridays. Maybe it’s a shipping provider whose API returns 500s every time they push a “minor update.”
I’ve lived through all of these. And here’s what I’ve learned: you don’t design resilient systems by hoping everything works. You design them by assuming everything will break.
The Trust Boundary Concept
Before diving into patterns, let’s establish a mental model I use constantly: the trust boundary.
Every system has a boundary between what you control and what you don’t. Inside the boundary, you own the code, the deployments, the monitoring, and the on-call rotation. Outside the boundary, you’re at the mercy of someone else’s priorities, release schedule, and definition of “reliable.”
The mistake most teams make is treating external dependencies as if they’re inside the trust boundary. They call a third-party API and assume it’ll respond in 200ms. They integrate with an internal legacy service and assume the contract won’t change. They trust.
Don’t trust. Verify, isolate, and plan for failure.
Here’s how I draw trust boundaries in practice:
- Tier 1 (full trust): Services my team owns and deploys. We control SLAs.
- Tier 2 (conditional trust): Internal services from other teams. We have some influence but no control.
- Tier 3 (zero trust): Third-party APIs, legacy systems, anything behind a VPN we can’t monitor.
Each tier gets a different resilience strategy. Tier 3 dependencies get the full treatment.
Circuit Breaker Pattern: Stop Hammering Dead Services
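The circuit breaker is the first pattern you should reach for when dealing with unreliable dependencies. The concept is borrowed from electrical engineering: when a downstream service keeps failing, the breaker opens and you stop sending requests to it. After a cool-down period, it lets a trial request through to check whether the service has recovered.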
Here’s a simplified implementation I’ve used in Node.js services:
class CircuitBreaker {
  private failures = 0;
  private lastFailure: number = 0;
  private state: 'closed' | 'open' | 'half-open' = 'closed';
  constructor(
    private readonly threshold: number = 5, // consecutive failures before opening
    private readonly resetTimeout: number = 30_000 // ms to wait before a trial request
  ) {}
  async execute<T>(fn: () => Promise<T>, fallback: () => T): Promise<T> {
    if (this.state === 'open') {
      if (Date.now() - this.lastFailure > this.resetTimeout) {
        // Cool-down elapsed: let a trial request through.
        this.state = 'half-open';
      } else {
        // Still open: skip the call and answer from the fallback.
        return fallback();
      }
    }
    try {
      const result = await fn();
      this.onSuccess();
      return result;
    } catch (error) {
      this.onFailure();
      return fallback();
    }
  }
  private onSuccess(): void {
    // Any success resets the count and closes the circuit.
    this.failures = 0;
    this.state = 'closed';
  }
  private onFailure(): void {
    this.failures++;
    this.lastFailure = Date.now();
    // A failed trial request re-opens the circuit, because the count
    // is only reset on success.
    if (this.failures >= this.threshold) {
      this.state = 'open';
    }
  }
}
In an e-commerce system I worked on, we wrapped every payment gateway call with a circuit breaker. When Stripe had intermittent issues (yes, even Stripe has bad days), the circuit would open after 5 consecutive failures. Instead of queuing thousands of failed payment attempts and making users stare at a spinner, we’d immediately route to the fallback flow: show a “payment processing delayed” message and queue the charge for retry.
Key insight: The circuit breaker doesn’t fix the problem. It prevents cascade failures and gives users a better experience during outages.
Bulkhead Isolation: Don’t Let One Failure Sink the Ship
Bulkheads in ships are watertight compartments. If one compartment floods, the ship doesn’t sink. The same principle applies to software.
A real scenario: we had an e-commerce platform where the shipping rate calculator, the inventory checker, and the payment processor all shared the same HTTP connection pool. When the shipping provider started responding slowly, it consumed all available connections. Suddenly, payments couldn’t process either. One bad dependency took down an unrelated critical path.
The fix was bulkhead isolation:
- Separate connection pools per external dependency
- Separate thread pools (or in Node.js, separate queue workers) for different categories of work
- Separate timeouts tuned to each dependency’s expected behavior
const paymentClient = new HttpClient({
  maxConnections: 20,
  timeout: 5_000, // payments should be fast
  retries: 2,
});
const shippingClient = new HttpClient({
  maxConnections: 10,
  timeout: 15_000, // shipping APIs are notoriously slow
  retries: 3,
});
const inventoryClient = new HttpClient({
  maxConnections: 15,
  timeout: 3_000, // internal service, should be quick
  retries: 1,
});
This is not over-engineering. This is the difference between “the shipping API is slow today” and “the entire checkout is down.”
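The client options above assume an HTTP client that enforces its own connection limit. Where the library does not, the same isolation can be approximated with a per-dependency concurrency limiter. The following is a minimal sketch, not a production implementation: `Bulkhead`, `BulkheadFullError`, `maxConcurrent`, `maxQueued`, and the limits passed at the bottom are all illustrative assumptions.

```typescript
// A minimal bulkhead: caps in-flight calls per dependency and rejects
// new work once the waiting queue is full, so one slow dependency
// cannot absorb all capacity. Names and limits are assumptions.
class BulkheadFullError extends Error {
  constructor(name: string) {
    super(`bulkhead "${name}" is full`);
    this.name = 'BulkheadFullError';
  }
}

class Bulkhead {
  private active = 0;
  private readonly waiting: Array<() => void> = [];

  constructor(
    private readonly name: string,
    private readonly maxConcurrent: number,
    private readonly maxQueued: number
  ) {}

  async run<T>(fn: () => Promise<T>): Promise<T> {
    if (this.active >= this.maxConcurrent) {
      if (this.waiting.length >= this.maxQueued) {
        // Fail fast instead of letting requests pile up.
        throw new BulkheadFullError(this.name);
      }
      // Wait until a finishing call hands its slot over.
      await new Promise<void>((resolve) => this.waiting.push(resolve));
    } else {
      this.active++;
    }
    try {
      return await fn();
    } finally {
      const next = this.waiting.shift();
      if (next) {
        next(); // pass the slot directly to the next waiter
      } else {
        this.active--; // nobody waiting: free the slot
      }
    }
  }
}

// One compartment per external dependency (hypothetical limits).
const shippingBulkhead = new Bulkhead('shipping', 10, 20);
const paymentBulkhead = new Bulkhead('payment', 20, 50);
```

With this in place, a slow shipping provider can fill only the shipping compartment. Calls wrapped in `paymentBulkhead.run(...)` keep their own capacity.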
Timeout Strategies: The Silent Killer
Most production incidents I’ve debugged didn’t start with a crash. They started with a timeout that was either too generous or nonexistent.
Here’s my timeout hierarchy for external calls:
- Connect timeout: How long to wait for a TCP connection. Keep this aggressive (1-3 seconds). If a service isn’t accepting connections, waiting longer won’t help.
- Read timeout: How long to wait for a response after the connection is established. This depends on the operation, but I rarely go above 10 seconds for synchronous calls.
- Total timeout: The overall budget for the entire operation, including retries. This is the one most people forget.
The rule I follow: every HTTP call to an external service MUST have an explicit timeout. No exceptions. Default timeouts in HTTP libraries are often 30-120 seconds, and some libraries set no timeout at all. Either is an eternity in a request pipeline.
A practical example: in a checkout flow, the total budget is maybe 8 seconds before the user gets impatient. If you’re calling a payment API, an address validator, and a tax calculator in sequence, you don’t have 30 seconds per call. You have maybe 2-3 seconds each, with the payment getting the most generous allocation.
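The total timeout is the easiest of the three to get wrong, so here is a minimal sketch of a deadline shared across retries. It assumes hypothetical helpers `withTimeout` and `callWithBudget` and a hypothetical `BudgetOptions` shape. None of these come from a specific library.

```typescript
// Sketch: a total deadline shared across retries. Each attempt gets
// min(perAttemptMs, time left in the budget). All names are assumptions.
class TimeoutError extends Error {
  constructor(message: string) {
    super(message);
    this.name = 'TimeoutError';
  }
}

// Rejects if `work` has not settled within `ms` milliseconds.
function withTimeout<T>(work: Promise<T>, ms: number): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(
      () => reject(new TimeoutError(`timed out after ${ms}ms`)),
      ms
    );
    work.then(
      (value) => { clearTimeout(timer); resolve(value); },
      (error) => { clearTimeout(timer); reject(error); }
    );
  });
}

interface BudgetOptions {
  totalMs: number;      // overall budget, including retries
  perAttemptMs: number; // cap for a single attempt
  maxAttempts: number;
}

async function callWithBudget<T>(
  fn: () => Promise<T>,
  { totalMs, perAttemptMs, maxAttempts }: BudgetOptions
): Promise<T> {
  const deadline = Date.now() + totalMs;
  let lastError: unknown = new TimeoutError('budget exhausted before first attempt');

  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const remaining = deadline - Date.now();
    if (remaining <= 0) break; // no budget left: stop retrying
    try {
      return await withTimeout(fn(), Math.min(perAttemptMs, remaining));
    } catch (error) {
      lastError = error; // remember the failure and try again if time allows
    }
  }
  throw lastError;
}
```

Two caveats. `withTimeout` only stops waiting; it does not cancel the underlying request, so a real client should also abort the call where the HTTP library supports it. And retrying is only safe for operations that are idempotent, which matters most for the payment call.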
Fallback Mechanisms: Graceful Degradation in Practice
Here’s where it gets interesting. Fallbacks aren’t just about returning cached data. They’re about deciding what your system can still do without a dependency.
Real examples from systems I’ve built:
- Payment gateway down: Queue the order, show “processing” status, charge asynchronously when the gateway recovers. The user still completes checkout.
- Shipping rate calculator down: Show estimated flat rates based on historical data. Not perfect, but the user can still buy.
- Inventory service down: Allow the purchase but flag it for manual review. Accept the risk of overselling a few items vs. blocking all sales.
- Recommendation engine down: Show popular products instead of personalized ones. Nobody notices.
The pattern:
async function getShippingRates(order: Order): Promise<ShippingRate[]> {
  try {
    return await shippingCircuitBreaker.execute(
      () => shippingClient.calculateRates(order),
      () => getCachedFlatRates(order.destination)
    );
  } catch {
    return getDefaultFlatRates(); // last resort
  }
}
The business conversation matters here. Fallbacks have trade-offs. Showing flat rates when the real rates are different means you might lose money on some shipments. Queue-and-charge means you might charge a card that gets declined later. These are business decisions, not just technical ones. Have this conversation with product before the outage, not during.
Contract Testing: Catch Breaks Before Production
One of the most effective strategies I’ve adopted is consumer-driven contract testing. Instead of praying that the shipping API doesn’t change their response format, we codify what we expect and test it regularly.
Tools like Pact let you define contracts:
describe('Shipping API Contract', () => {
  it('should return rates with expected structure', async () => {
    // Assumes the expected request/response interaction has already
    // been registered on `provider`.
    await provider.executeTest(async (mockServer) => {
      const client = new ShippingClient(mockServer.url);
      const rates = await client.getRates(testOrder);
      expect(rates).toEqual(
        expect.arrayContaining([
          expect.objectContaining({
            carrier: expect.any(String),
            price: expect.any(Number),
            estimatedDays: expect.any(Number),
          }),
        ])
      );
    });
  });
});
This won’t prevent the external API from breaking. But it’ll flag a violated contract the next time the tests run, instead of leaving you to discover it through production errors at 2 AM.
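Contract tests run in CI. As a complement (a different technique, not part of Pact), the same expectations can be checked at runtime where the response crosses the trust boundary. Here is a minimal sketch, assuming hypothetical helpers `isShippingRate` and `parseShippingRates` and the three fields used in the test above:

```typescript
// Runtime check of the fields the contract test expects.
// The helper names are assumptions; the fields mirror the test above.
interface ShippingRate {
  carrier: string;
  price: number;
  estimatedDays: number;
}

function isShippingRate(value: unknown): value is ShippingRate {
  if (typeof value !== 'object' || value === null) return false;
  const v = value as Record<string, unknown>;
  return (
    typeof v.carrier === 'string' &&
    typeof v.price === 'number' &&
    typeof v.estimatedDays === 'number'
  );
}

// Throws instead of letting a malformed payload travel deeper into the system.
function parseShippingRates(payload: unknown): ShippingRate[] {
  if (!Array.isArray(payload) || !payload.every(isShippingRate)) {
    throw new Error('shipping API response violates expected contract');
  }
  return payload;
}
```

Throwing here means a format change shows up as one clear error at the boundary, which a circuit breaker or fallback can then handle like any other failure.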
The Trade-Off: Resilience vs. Simplicity
Here’s the thing I want to be honest about: all of this adds complexity. Circuit breakers, bulkheads, fallbacks, contract tests — they all add code, configuration, and cognitive overhead.
Not every system needs all of these patterns. A side project calling one external API? Just add a timeout and handle errors gracefully. An e-commerce platform processing thousands of orders per hour across a dozen external services? You need the full toolkit.
My rule of thumb:
- If downtime of the dependency costs money: circuit breaker + fallback + monitoring
- If downtime causes data loss: bulkhead isolation + retries with idempotency
- If the dependency is owned by another team: contract tests + explicit SLAs
- If it’s a third-party you can’t influence at all: all of the above
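The rule of thumb above can be written down as a small lookup, which makes it easy to review with the team. This is only a sketch of the four rules as stated: `DependencyProfile`, `Pattern`, and `recommendedPatterns` are hypothetical names.

```typescript
// The rule of thumb as code. Type and function names are assumptions.
type Pattern =
  | 'circuit-breaker'
  | 'fallback'
  | 'monitoring'
  | 'bulkhead'
  | 'idempotent-retries'
  | 'contract-tests'
  | 'explicit-sla';

interface DependencyProfile {
  downtimeCostsMoney: boolean;
  downtimeCausesDataLoss: boolean;
  ownedByAnotherTeam: boolean;
  thirdPartyNoInfluence: boolean;
}

const ALL_PATTERNS: Pattern[] = [
  'circuit-breaker',
  'fallback',
  'monitoring',
  'bulkhead',
  'idempotent-retries',
  'contract-tests',
  'explicit-sla',
];

function recommendedPatterns(dep: DependencyProfile): Pattern[] {
  // A third party you cannot influence gets everything.
  if (dep.thirdPartyNoInfluence) return ALL_PATTERNS.slice();

  const picked: Pattern[] = [];
  if (dep.downtimeCostsMoney) picked.push('circuit-breaker', 'fallback', 'monitoring');
  if (dep.downtimeCausesDataLoss) picked.push('bulkhead', 'idempotent-retries');
  if (dep.ownedByAnotherTeam) picked.push('contract-tests', 'explicit-sla');
  return picked;
}
```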
Closing Thoughts
The hardest lesson in system design isn’t learning patterns. It’s accepting that you can’t control everything. Your payment provider will have outages. The legacy system will change without warning. The team across the building will deploy a breaking change on a Friday.
Your job isn’t to prevent these failures. It’s to design systems that handle them with grace. Every resilience pattern is a bet: “When this dependency fails — and it will — here’s how we’ll keep going.”
The engineers who build truly reliable systems aren’t the ones who write the most clever code. They’re the ones who’ve been burned enough times to plan for the fire.