Designing Systems When You Don't Control All the Dependencies

Building resilient systems when third-party APIs, legacy services, and external teams are unreliable. Circuit breakers, bulkheads, fallbacks, and the trust boundary concept from real e-commerce experience.

There’s a moment in every engineer’s career where you realize something uncomfortable: the system you’re responsible for is only as reliable as its weakest external dependency. And you don’t control that dependency.

Maybe it’s a payment gateway that times out during peak hours. Maybe it’s a legacy inventory system maintained by a team in another timezone that deploys on Fridays. Maybe it’s a shipping provider whose API returns 500s every time they push a “minor update.”

I’ve lived through all of these. And here’s what I’ve learned: you don’t design resilient systems by hoping everything works. You design them by assuming everything will break.

The Trust Boundary Concept

Before diving into patterns, let’s establish a mental model I use constantly: the trust boundary.

Every system has a boundary between what you control and what you don’t. Inside the boundary, you own the code, the deployments, the monitoring, and the on-call rotation. Outside the boundary, you’re at the mercy of someone else’s priorities, release schedule, and definition of “reliable.”

The mistake most teams make is treating external dependencies as if they’re inside the trust boundary. They call a third-party API and assume it’ll respond in 200ms. They integrate with an internal legacy service and assume the contract won’t change. They trust.

Don’t trust. Verify, isolate, and plan for failure.

Here’s how I draw trust boundaries in practice:

Each tier gets a different resilience strategy. Tier 3 dependencies get the full treatment.

Circuit Breaker Pattern: Stop Hammering Dead Services

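The circuit breaker is the first pattern you should reach for when dealing with unreliable dependencies. The concept is borrowed from electrical engineering: when a downstream service is failing, stop sending requests to it. The breaker starts closed (calls pass through), opens after repeated failures (calls go straight to the fallback), and moves to half-open after a cool-down so a trial call can check whether the service has recovered.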

Here’s a simplified implementation I’ve used in Node.js services:

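class CircuitBreaker {
  private failures = 0;    // consecutive failures since the last success
  private lastFailure = 0; // timestamp (ms) of the most recent failure
  private state: 'closed' | 'open' | 'half-open' = 'closed';

  constructor(
    private readonly threshold: number = 5,        // failures before opening
    private readonly resetTimeout: number = 30_000 // ms to wait before a trial call
  ) {}

  async execute<T>(
    fn: () => Promise<T>,
    fallback: () => T | Promise<T> // may be async, e.g. queueing a retry
  ): Promise<T> {
    if (this.state === 'open') {
      if (Date.now() - this.lastFailure > this.resetTimeout) {
        // Cool-down elapsed: let this call through as a trial.
        this.state = 'half-open';
      } else {
        // Still open: skip the dependency entirely.
        return fallback();
      }
    }

    try {
      const result = await fn();
      this.onSuccess();
      return result;
    } catch {
      // A failed trial call reopens the circuit, because the failure
      // count is still at or above the threshold.
      this.onFailure();
      return fallback();
    }
  }

  private onSuccess(): void {
    this.failures = 0;
    this.state = 'closed';
  }

  private onFailure(): void {
    this.failures++;
    this.lastFailure = Date.now();
    if (this.failures >= this.threshold) {
      this.state = 'open';
    }
  }
}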

In an e-commerce system I worked on, we wrapped every payment gateway call with a circuit breaker. When Stripe had intermittent issues (yes, even Stripe has bad days), the circuit would open after 5 consecutive failures. Instead of queuing thousands of failed payment attempts and making users stare at a spinner, we’d immediately route to the fallback flow: show a “payment processing delayed” message and queue the charge for retry.

Key insight: The circuit breaker doesn’t fix the problem. It prevents cascade failures and gives users a better experience during outages.
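To make the state transitions concrete, here is a minimal sketch of the same idea with an injectable clock, so the open and half-open behavior can be exercised without waiting in real time. The `TestableBreaker` name, the `now` parameter, and the `demo` function are illustrative assumptions, not the production code described above.

```typescript
type BreakerState = 'closed' | 'open' | 'half-open';

// Hypothetical variant of the breaker: `now` is injectable so tests can
// move time forward instead of sleeping.
class TestableBreaker {
  private failures = 0;
  private openedAt = 0;
  state: BreakerState = 'closed';

  constructor(
    private readonly threshold = 5,
    private readonly resetTimeoutMs = 30_000,
    private readonly now: () => number = () => Date.now()
  ) {}

  async execute<T>(fn: () => Promise<T>, fallback: () => T): Promise<T> {
    if (this.state === 'open') {
      if (this.now() - this.openedAt < this.resetTimeoutMs) {
        return fallback(); // still cooling down: do not touch the dependency
      }
      this.state = 'half-open'; // cool-down elapsed: allow one trial call
    }
    try {
      const result = await fn();
      this.failures = 0;
      this.state = 'closed';
      return result;
    } catch {
      this.failures++;
      // A failed trial call in half-open reopens immediately.
      if (this.state === 'half-open' || this.failures >= this.threshold) {
        this.state = 'open';
        this.openedAt = this.now();
      }
      return fallback();
    }
  }
}

async function demo(): Promise<string[]> {
  let clock = 0;
  let calls = 0;
  const breaker = new TestableBreaker(2, 1_000, () => clock);
  const failing = async (): Promise<string> => {
    calls++;
    throw new Error('gateway down');
  };
  const healthy = async (): Promise<string> => {
    calls++;
    return 'charged';
  };
  const queued = () => 'queued';

  const log: string[] = [];
  log.push(await breaker.execute(failing, queued)); // failure 1, still closed
  log.push(await breaker.execute(failing, queued)); // failure 2, circuit opens
  log.push(await breaker.execute(healthy, queued)); // open: healthy() is never called
  clock = 1_000; // cool-down elapsed
  log.push(await breaker.execute(healthy, queued)); // trial call succeeds, circuit closes
  log.push(`calls=${calls} state=${breaker.state}`);
  return log;
}
```

The third call is the interesting one: the dependency has already recovered, but the breaker still answers from the fallback until the cool-down passes. That is the price of not hammering a service that might still be down.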

Bulkhead Isolation: Don’t Let One Failure Sink the Ship

Bulkheads in ships are watertight compartments. If one compartment floods, the ship doesn’t sink. The same principle applies to software.

A real scenario: we had an e-commerce platform where the shipping rate calculator, the inventory checker, and the payment processor all shared the same HTTP connection pool. When the shipping provider started responding slowly, its pending requests held every available connection. Suddenly, payments couldn’t process either. One bad dependency took down an unrelated critical path.

The fix was bulkhead isolation:

const paymentClient = new HttpClient({
  maxConnections: 20,
  timeout: 5_000,       // payments should be fast
  retries: 2,
});

const shippingClient = new HttpClient({
  maxConnections: 10,
  timeout: 15_000,      // shipping APIs are notoriously slow
  retries: 3,
});

const inventoryClient = new HttpClient({
  maxConnections: 15,
  timeout: 3_000,       // internal service, should be quick
  retries: 1,
});

This is not over-engineering. This is the difference between “the shipping API is slow today” and “the entire checkout is down.”
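If your HTTP client does not expose per-client connection limits, the same compartment idea can be approximated in application code. The sketch below is an assumption-laden illustration: `Bulkhead` is a hypothetical helper that caps in-flight calls per dependency and rejects the overflow right away.

```typescript
// Hypothetical helper: caps in-flight calls per dependency and rejects the
// overflow immediately instead of letting callers pile up.
class Bulkhead {
  private inFlight = 0;

  constructor(
    readonly name: string,
    private readonly maxConcurrent: number
  ) {}

  async run<T>(task: () => Promise<T>): Promise<T> {
    if (this.inFlight >= this.maxConcurrent) {
      throw new Error(`bulkhead "${this.name}" is full`);
    }
    this.inFlight++;
    try {
      return await task();
    } finally {
      this.inFlight--; // always release the slot, even when the task fails
    }
  }
}

async function demoBulkheads(): Promise<string[]> {
  const shipping = new Bulkhead('shipping', 2);
  const payments = new Bulkhead('payments', 2);

  // Simulate a shipping provider that does not answer until we release it.
  let release: () => void = () => {};
  const stuck = new Promise<void>((resolve) => {
    release = resolve;
  });
  const slowShipping = () => stuck.then(() => 'rates');

  // Two slow calls occupy both shipping slots.
  const pending = [shipping.run(slowShipping), shipping.run(slowShipping)];
  const results: string[] = [];

  // A third shipping call is rejected fast because both slots are taken.
  results.push(await shipping.run(slowShipping).catch((e: Error) => e.message));
  // Payments has its own compartment, so it is unaffected.
  results.push(await payments.run(async () => 'charged'));

  release();
  results.push(...(await Promise.all(pending)));
  return results;
}
```

Rejecting instead of queueing is a design choice: a rejected call can fall back immediately, while a queued one just moves the pile-up somewhere else.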

Timeout Strategies: The Silent Killer

Most production incidents I’ve debugged didn’t start with a crash. They started with a timeout that was either too generous or nonexistent.

Here’s my timeout hierarchy for external calls:

The rule I follow: every HTTP call to an external service MUST have an explicit timeout. No exceptions. Library defaults are rarely what you want: some clients wait 30-120 seconds, and others (axios, for example) apply no timeout at all unless you set one. Either is an eternity in a request pipeline.
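One way to enforce that rule when a client gives you no timeout option is a small wrapper. This is a sketch under assumptions: `withTimeout`, `TimeoutError`, and `demoTimeout` are hypothetical names, and the `AbortSignal` only cancels the underlying request if the wrapped work actually passes it on to its HTTP client.

```typescript
class TimeoutError extends Error {
  constructor(label: string, ms: number) {
    super(`${label} timed out after ${ms}ms`);
    this.name = 'TimeoutError';
  }
}

// Rejects if `work` has not settled within `ms`. The AbortSignal is handed
// to the work so a real HTTP client can cancel the in-flight request.
async function withTimeout<T>(
  label: string,
  ms: number,
  work: (signal: AbortSignal) => Promise<T>
): Promise<T> {
  const controller = new AbortController();
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(() => {
      controller.abort();
      reject(new TimeoutError(label, ms));
    }, ms);
  });
  try {
    return await Promise.race([work(controller.signal), timeout]);
  } finally {
    clearTimeout(timer); // avoid a stray timer keeping the process alive
  }
}

async function demoTimeout(): Promise<string[]> {
  const sleep = (ms: number) => new Promise<void>((r) => setTimeout(r, ms));

  // Finishes well inside its limit.
  const fast = await withTimeout('tax', 200, async () => {
    await sleep(5);
    return 'ok';
  });

  // Takes far longer than its limit, so the caller gets a TimeoutError.
  const slow = await withTimeout('shipping', 20, async () => {
    await sleep(200);
    return 'ok';
  }).catch((e: Error) => e.message);

  return [fast, slow];
}
```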

A practical example: in a checkout flow, the total budget is maybe 8 seconds before the user gets impatient. If you’re calling a payment API, an address validator, and a tax calculator in sequence, you don’t have 30 seconds per call. You have maybe 2-3 seconds each, with the payment getting the most generous allocation.
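The budget arithmetic can be sketched as a shared deadline: each call gets the smaller of its own cap and whatever is left of the total. The `Deadline` class and the specific caps below are illustrative assumptions built on the 8-second example; the clock is injected so the numbers are deterministic.

```typescript
// Hypothetical helper: tracks one overall deadline for a request and hands
// each sequential call the smaller of its own cap and the time remaining.
class Deadline {
  private readonly expiresAt: number;

  constructor(
    totalMs: number,
    private readonly now: () => number = () => Date.now()
  ) {
    this.expiresAt = this.now() + totalMs;
  }

  remaining(): number {
    return Math.max(0, this.expiresAt - this.now());
  }

  // Timeout to use for the next call; throws once the budget is spent.
  timeoutFor(capMs: number): number {
    const left = this.remaining();
    if (left === 0) throw new Error('checkout budget exhausted');
    return Math.min(capMs, left);
  }
}

function demoBudget(): number[] {
  let clock = 0; // fake clock in ms
  const deadline = new Deadline(8_000, () => clock);
  const timeouts: number[] = [];

  timeouts.push(deadline.timeoutFor(3_000)); // payment: the most generous cap
  clock += 3_000; // pretend the call used its full cap
  timeouts.push(deadline.timeoutFor(2_000)); // address validation
  clock += 2_000;
  timeouts.push(deadline.timeoutFor(2_000)); // tax calculation
  clock += 2_000;
  timeouts.push(deadline.timeoutFor(2_000)); // a retry only gets the 1s that is left
  return timeouts; // [3000, 2000, 2000, 1000]
}
```

In the worst case the three calls consume 7 of the 8 seconds, which is why a retry at the end of the chain gets 1 second rather than its nominal 2.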

Fallback Mechanisms: Graceful Degradation in Practice

Here’s where it gets interesting. Fallbacks aren’t just about returning cached data. They’re about deciding what your system can still do without a dependency.

Real examples from systems I’ve built:

The pattern:

async function getShippingRates(order: Order): Promise<ShippingRate[]> {
  try {
    return await shippingCircuitBreaker.execute(
      () => shippingClient.calculateRates(order),
      () => getCachedFlatRates(order.destination)
    );
  } catch {
    return getDefaultFlatRates();  // last resort
  }
}

The business conversation matters here. Fallbacks have trade-offs. Showing flat rates when the real rates are different means you might lose money on some shipments. Queue-and-charge means you might charge a card that gets declined later. These are business decisions, not just technical ones. Have this conversation with product before the outage, not during.
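To have that conversation with data, it helps to record which tier actually answered. The following is a minimal sketch, not the code from the systems above: `firstAvailable`, `Sourced`, and `RateSource` are hypothetical names, and the providers are stubs.

```typescript
type RateSource = 'live' | 'cached' | 'default';

interface Sourced<T> {
  value: T;
  source: RateSource;
}

// Tries each provider in order and records which one answered, so degraded
// responses can be counted and reviewed with product later.
async function firstAvailable<T>(
  providers: Array<{ source: RateSource; load: () => Promise<T> }>
): Promise<Sourced<T>> {
  let lastError: unknown = new Error('no providers configured');
  for (const { source, load } of providers) {
    try {
      return { value: await load(), source };
    } catch (error) {
      lastError = error; // fall through to the next, cheaper option
    }
  }
  throw lastError;
}

async function demoFallback(): Promise<Sourced<number[]>> {
  return firstAvailable<number[]>([
    // Stub: the live shipping API is failing.
    { source: 'live', load: async () => { throw new Error('shipping API 500'); } },
    // Stub: nothing cached for this destination.
    { source: 'cached', load: async () => { throw new Error('cache miss'); } },
    // Last resort: a flat rate (the amount is made up for the example).
    { source: 'default', load: async () => [5.99] },
  ]);
}
```

Logging or counting the `source` field tells you how often customers saw flat rates instead of real ones, which is the number product will ask for.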

Contract Testing: Catch Breaks Before Production

One of the most effective strategies I’ve adopted is consumer-driven contract testing. Instead of praying that the shipping API doesn’t change their response format, we codify what we expect and test it regularly.

Tools like Pact let you define contracts:

describe('Shipping API Contract', () => {
  it('should return rates with expected structure', async () => {
    // Assumes `provider` is a Pact mock provider with the expected interaction already registered.
    await provider.executeTest(async (mockServer) => {
      const client = new ShippingClient(mockServer.url);
      const rates = await client.getRates(testOrder);

      expect(rates).toEqual(
        expect.arrayContaining([
          expect.objectContaining({
            carrier: expect.any(String),
            price: expect.any(Number),
            estimatedDays: expect.any(Number),
          }),
        ])
      );
    });
  });
});

This won’t prevent the external API from breaking. But it’ll tell you immediately when the contract is violated, instead of discovering it through production errors at 2 AM.
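Contract tests run in CI. The same expectation can also be checked at the trust boundary at runtime, so that a malformed response is rejected before it spreads through the system. This is a hand-rolled sketch and not part of Pact: `ShippingRate` mirrors the fields asserted in the test above, while `isShippingRate` and `parseRates` are hypothetical helpers.

```typescript
// Shape taken from the fields the contract test asserts on.
interface ShippingRate {
  carrier: string;
  price: number;
  estimatedDays: number;
}

// Type guard: true only when every expected field has the expected type.
function isShippingRate(value: unknown): value is ShippingRate {
  if (typeof value !== 'object' || value === null) return false;
  const candidate = value as Record<string, unknown>;
  return (
    typeof candidate['carrier'] === 'string' &&
    typeof candidate['price'] === 'number' &&
    typeof candidate['estimatedDays'] === 'number'
  );
}

// Validates an untrusted payload and fails loudly on a contract violation,
// which is a natural place to trigger a fallback.
function parseRates(payload: unknown): ShippingRate[] {
  if (!Array.isArray(payload) || !payload.every(isShippingRate)) {
    throw new Error('shipping API response violated the expected contract');
  }
  return payload;
}
```

For example, `parseRates([{ carrier: 'acme', price: 9.5, estimatedDays: 3 }])` returns the array unchanged, while a payload whose `price` arrives as the string `'9.50'` throws.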

The Trade-Off: Resilience vs. Simplicity

I want to be honest about one thing: all of this adds complexity. Circuit breakers, bulkheads, fallbacks, and contract tests all add code, configuration, and cognitive overhead.

Not every system needs all of these patterns. A side project calling one external API? Just add a timeout and handle errors gracefully. An e-commerce platform processing thousands of orders per hour across a dozen external services? You need the full toolkit.

My rule of thumb:

Closing Thoughts

The hardest lesson in system design isn’t learning patterns. It’s accepting that you can’t control everything. Your payment provider will have outages. The legacy system will change without warning. The team across the building will deploy a breaking change on a Friday.

Your job isn’t to prevent these failures. It’s to design systems that handle them with grace. Every resilience pattern is a bet: “When this dependency fails, and it will, here’s how we’ll keep going.”

The engineers who build truly reliable systems aren’t the ones who write the most clever code. They’re the ones who’ve been burned enough times to plan for the fire.
