Why Most Backend Systems Fail at Scale (And the Mistakes I've Seen Firsthand)

Seven real production failures I've witnessed — from N+1 queries to cache stampedes — and the fixes that actually worked. No theory, just battle scars.

I’ve spent over six years building and scaling backend systems, and I can tell you this with absolute confidence: most systems don’t fail because of some exotic edge case. They fail because of well-known mistakes that teams keep repeating. Not because engineers are bad — because these problems are invisible at low traffic and only become lethal at scale.

Here are seven mistakes I’ve seen kill production systems, what actually happened, and how we fixed them. Every single one of these is from a real project.

1. The N+1 Query That Nobody Noticed

What happened: We had an endpoint that returned a list of orders with their associated products. At 50 orders, it was fine — 200ms response time. At 500 orders during a flash sale, it took 14 seconds and the database CPU hit 98%.

The ORM was doing exactly what we asked: load orders, then for each order, load its products. One query for orders, 500 queries for products. Classic N+1.

What we did: The immediate fix was eager loading — one query with a JOIN. Response time dropped to 180ms regardless of order count.

// Before: N+1 disaster
const orders = await orderRepo.find({ where: { userId } });
// This triggers 1 query per order to load products

// After: single query with relation loading
const orders = await orderRepo.find({
  where: { userId },
  relations: ['items', 'items.product'],
});

But the real fix was adding query logging in staging with a threshold alert. Any endpoint that executes more than 10 queries per request gets flagged automatically. We catch N+1s before they reach production now.
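
For illustration, here is a minimal sketch of what per-request query counting could look like in Node.js. It assumes `recordQuery` gets wired into whatever query logging hook your ORM or driver exposes. The names `withQueryBudget`, `recordQuery`, and `onExceeded` are made up for this example and are not part of any library.

```typescript
import { AsyncLocalStorage } from 'node:async_hooks';

// Threshold taken from the rule above: more than 10 queries gets flagged.
const QUERY_THRESHOLD = 10;

interface RequestStats {
  route: string;
  queryCount: number;
}

// Carries one counter per in-flight request across async boundaries.
const requestStats = new AsyncLocalStorage<RequestStats>();

// Hypothetical hook: call this once for every query the ORM executes.
function recordQuery(): void {
  const stats = requestStats.getStore();
  if (stats) stats.queryCount += 1;
}

// Runs a request handler and reports it if it went over the query budget.
async function withQueryBudget<T>(
  route: string,
  handler: () => Promise<T>,
  onExceeded: (stats: RequestStats) => void,
): Promise<T> {
  const stats: RequestStats = { route, queryCount: 0 };
  try {
    return await requestStats.run(stats, handler);
  } finally {
    if (stats.queryCount > QUERY_THRESHOLD) onExceeded(stats);
  }
}
```

In staging, `onExceeded` could log the route and the count or post to an alerting channel. The point is that the check runs on every request, so nobody has to remember to look.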

Lesson: ORMs make N+1 queries easy to write and hard to spot. You need automated detection, not just code reviews.

2. Missing Circuit Breakers

What happened: Our order service called a third-party shipping rate API. One afternoon, that API started responding with 30-second timeouts instead of its usual 200ms. Every single request to our checkout flow queued up waiting for shipping rates. Thread pool exhausted. Our entire checkout was down — not because OUR system broke, but because someone else’s did.

What we did: We implemented circuit breakers using a pattern similar to Netflix’s Hystrix (we used opossum in Node.js):

import CircuitBreaker from 'opossum';

const shippingBreaker = new CircuitBreaker(fetchShippingRates, {
  timeout: 3000,
  errorThresholdPercentage: 50,
  resetTimeout: 30000,
});

shippingBreaker.fallback(() => ({
  rates: getCachedShippingRates(),
  isFallback: true,
}));

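With this configuration, any call that takes longer than 3 seconds counts as a failure. Once roughly half of recent calls are failing, the circuit opens and we serve cached rates immediately, with a flag telling the frontend these are estimates. After 30 seconds the breaker lets a trial request through to check whether the API has recovered. Customers keep checking out. We reconcile actual shipping costs asynchronously.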
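
If the closed, open, and half-open states are new to you, a stripped-down state machine may help. This is a simplified sketch and not how opossum works internally: it counts consecutive failures instead of a rolling error percentage, and the `SimpleBreaker` class and its method names are invented for this example.

```typescript
type BreakerState = 'closed' | 'open' | 'half-open';

class SimpleBreaker {
  private state: BreakerState = 'closed';
  private failures = 0;
  private openedAt = 0;
  private readonly failureThreshold: number;
  private readonly resetTimeoutMs: number;
  private readonly now: () => number;

  constructor(failureThreshold: number, resetTimeoutMs: number, now: () => number = Date.now) {
    this.failureThreshold = failureThreshold;
    this.resetTimeoutMs = resetTimeoutMs;
    this.now = now;
  }

  // Returns true if a call may go to the dependency right now.
  canRequest(): boolean {
    if (this.state === 'open' && this.now() - this.openedAt >= this.resetTimeoutMs) {
      // Reset timeout elapsed: allow trial calls until one succeeds or fails.
      this.state = 'half-open';
    }
    return this.state !== 'open';
  }

  // A successful call closes the circuit and clears the failure count.
  recordSuccess(): void {
    this.failures = 0;
    this.state = 'closed';
  }

  // A failed trial call, or too many failures in a row, opens the circuit.
  recordFailure(): void {
    this.failures += 1;
    if (this.state === 'half-open' || this.failures >= this.failureThreshold) {
      this.state = 'open';
      this.openedAt = this.now();
    }
  }

  getState(): BreakerState {
    return this.state;
  }
}
```

The caller checks `canRequest()` before each call and goes straight to the fallback when it returns false. That is what keeps slow requests from piling up behind a dead dependency.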

Lesson: Every external dependency is a liability. If you don’t have a fallback strategy for each one, you’ve coupled your uptime to theirs.

3. Shared Database Between Services

What happened: Two services — orders and inventory — both read and wrote to the same PostgreSQL database. The inventory service ran a heavy analytical query every 5 minutes for stock reports. During peak hours, that query locked rows that the order service needed to update. Orders started failing with lock timeout errors.

What we did: We gave each service its own database. Inventory changes get published as events via SNS, and the order service maintains its own read model of available stock. Yes, it’s eventually consistent. But “eventually consistent with working checkout” beats “strongly consistent with failing checkout” every time.
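
To make the read model idea concrete, here is a minimal in-memory sketch. The event shape is an assumption for illustration, in particular the per-SKU `version` field. It is there because standard SNS topics deliver at least once and do not guarantee ordering, so the consumer has to tolerate duplicates and late arrivals.

```typescript
// Hypothetical event published by the inventory service.
interface StockChangedEvent {
  sku: string;
  available: number; // absolute stock level after the change
  version: number;   // assumed to increase with every change to this SKU
}

class StockReadModel {
  private readonly stock = new Map<string, { available: number; version: number }>();

  // Applies an event. Returns false if it was a duplicate or arrived out of order.
  apply(event: StockChangedEvent): boolean {
    const current = this.stock.get(event.sku);
    if (current && event.version <= current.version) return false;
    this.stock.set(event.sku, { available: event.available, version: event.version });
    return true;
  }

  // Unknown SKUs are treated as out of stock.
  available(sku: string): number {
    return this.stock.get(sku)?.available ?? 0;
  }
}
```

Carrying the absolute stock level rather than a delta is a deliberate choice in this sketch: replaying or dropping an old event can never corrupt the count. A real read model would live in the order service's own database instead of a `Map`.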

Lesson: A shared database is a shared fate. If two services share a database, they’re not two services — they’re a distributed monolith.

4. No Backpressure on Queues

What happened: We used SQS for processing order confirmations. Each message triggered an email, a PDF invoice generation, and an inventory update. During a promotional event, 50,000 orders came in within an hour. The SQS queue backed up to 200,000 messages (some orders generated multiple messages). Our consumer tried to process all of them at full speed, overwhelmed the email service, and started throwing 429 rate limit errors. Failed messages went to the dead-letter queue. 12,000 customers didn’t get their confirmation emails.

What we did: Three changes:

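// SQS consumer with controlled concurrency (sqs-consumer package)
import { Consumer } from 'sqs-consumer';

const consumer = Consumer.create({
  queueUrl: process.env.ORDER_QUEUE_URL,
  batchSize: 10, // SQS returns at most 10 messages per receive call
  visibilityTimeout: 60, // seconds a message stays hidden while we work on it
  handleMessage: async (message) => {
    // rateLimiter is an application-level helper (not shown here): wait until
    // the email service budget of 50 calls per second has room
    await rateLimiter.acquire('email-service', { maxPerSecond: 50 });
    await processOrderConfirmation(JSON.parse(message.Body));
  },
});

consumer.start();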
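
The `rateLimiter` in that snippet is not shown, so here is one way such a limiter could work: a small token bucket. This is a sketch under assumptions and not the helper from the project. The `TokenBucket` class is invented for this example, and it only limits a single process.

```typescript
class TokenBucket {
  private tokens: number;
  private lastRefill: number;
  private readonly maxPerSecond: number;
  private readonly now: () => number;

  constructor(maxPerSecond: number, now: () => number = Date.now) {
    this.maxPerSecond = maxPerSecond;
    this.now = now;
    this.tokens = maxPerSecond; // start with a full bucket
    this.lastRefill = now();
  }

  // Adds tokens for the time elapsed since the last refill, capped at capacity.
  private refill(): void {
    const current = this.now();
    const elapsedSeconds = (current - this.lastRefill) / 1000;
    this.tokens = Math.min(this.maxPerSecond, this.tokens + elapsedSeconds * this.maxPerSecond);
    this.lastRefill = current;
  }

  // Takes a token if one is available. Never blocks.
  tryAcquire(): boolean {
    this.refill();
    if (this.tokens < 1) return false;
    this.tokens -= 1;
    return true;
  }

  // Resolves once a token has been taken, sleeping until one should be free.
  async acquire(): Promise<void> {
    while (!this.tryAcquire()) {
      const waitMs = Math.ceil(((1 - this.tokens) / this.maxPerSecond) * 1000);
      await new Promise<void>((resolve) => setTimeout(resolve, waitMs));
    }
  }
}
```

With several consumer instances, each one would get its own bucket, so the combined rate could still exceed what the email service accepts. In that setup the budget has to be divided between instances or kept in a shared store.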

Lesson: A queue without backpressure is just a way to convert a traffic spike into a cascading failure with extra steps.

5. Ignoring Cold Starts

What happened: We moved a critical service to AWS Lambda for cost savings. Average response time: 80ms. But after periods of inactivity, the first request took 4-6 seconds due to cold starts. This was the authentication service. Users experienced random 5-second login delays. Support tickets piled up.

What we did: We evaluated three options:

Lesson: Serverless is fantastic — for the right workloads. Latency-sensitive, always-on services aren’t it. Match the compute model to the actual requirements, not the trend.

6. Cache Stampede

What happened: We cached our product catalog in Redis with a 5-minute TTL. At the exact moment the cache expired, 3,000 concurrent requests hit the endpoint. All 3,000 saw a cache miss. All 3,000 queried the database simultaneously. Database connection pool exhausted. 500 errors for everyone.

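What we did: We combined two techniques. Probabilistic early expiration makes requests increasingly likely to trigger a refresh as the entry approaches its expiry, and a distributed lock ensures only one process rebuilds the cache at a time: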

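// Small helper used by the retry path below
const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

async function getProductCatalog(categoryId: string): Promise<Product[]> {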
  const cacheKey = `catalog:${categoryId}`;
  const cached = await redis.get(cacheKey);

  if (cached) {
    const { data, expiry, delta } = JSON.parse(cached);
    const now = Date.now();

    // Probabilistic early recomputation
    // As we approach expiry, increasingly likely to refresh
    if (now - delta * Math.log(Math.random()) < expiry) {
      return data;
    }
  }

  // Distributed lock: only one process rebuilds
  const lockKey = `lock:${cacheKey}`;
  const acquired = await redis.set(lockKey, '1', 'EX', 30, 'NX');

  if (!acquired) {
    // Another process is rebuilding; serve stale if available
    if (cached) return JSON.parse(cached).data;
    // No stale data; wait briefly and retry
    await sleep(100);
    return getProductCatalog(categoryId);
  }

  try {
    const products = await productRepo.findByCategory(categoryId);
    const ttl = 300; // 5 minutes
    await redis.set(cacheKey, JSON.stringify({
      data: products,
      expiry: Date.now() + ttl * 1000,
      delta: ttl * 1000 * 0.1,
    }), 'EX', ttl + 60); // Extra buffer for stale serving
    return products;
  } finally {
    await redis.del(lockKey);
  }
}

Only one process rebuilds the cache. Everyone else gets slightly stale data. In an e-commerce catalog, nobody notices 30-second-old product data. Everyone notices a 500 error.
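
To get a feel for the early-refresh check, it helps to turn the condition into a probability. A request refreshes when `now - delta * Math.log(Math.random())` reaches `expiry`. For a uniform random number that happens with probability `exp(-(expiry - now) / delta)`. The helper below is a worked example of that arithmetic, and the function name is made up for illustration.

```typescript
// Probability that a single request triggers an early refresh, given the
// time left before expiry and the delta stored with the cache entry.
function earlyRefreshProbability(msUntilExpiry: number, deltaMs: number): number {
  if (msUntilExpiry <= 0) return 1; // already expired: always refresh
  return Math.exp(-msUntilExpiry / deltaMs);
}

const deltaMs = 300 * 1000 * 0.1; // 30 s, the same as ttl * 1000 * 0.1 above

for (const secondsLeft of [150, 60, 30, 5]) {
  const p = earlyRefreshProbability(secondsLeft * 1000, deltaMs);
  console.log(`${secondsLeft}s before expiry: ${(p * 100).toFixed(1)}% chance of refresh`);
}
```

With a 30-second delta this works out to roughly 0.7% at 150 seconds before expiry, 13.5% at 60 seconds, 36.8% at 30 seconds, and 84.6% at 5 seconds. Under heavy traffic, some request will almost certainly win the lock and rebuild the entry before it actually expires.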

Lesson: Cache expiration is a coordinated event that can create thundering herds. Design for it explicitly.

7. Not Designing for Failure

What happened: This is the meta-mistake that encompasses all the others. We built our first version of the platform assuming everything would work. Happy path everywhere. No retries, no fallbacks, no graceful degradation. The system worked perfectly in development and staging. Production taught us otherwise within the first week.

What we did: We adopted a principle: every external call must answer three questions before it’s written:

  1. What happens if this call fails?
  2. What happens if this call is slow (10x normal latency)?
  3. What happens if this call returns unexpected data?

We codified this into our code review checklist. Pull requests that add external calls without answering these three questions get sent back.
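
One way to make the three questions hard to skip is a wrapper that forces an answer to each of them. The sketch below is an illustration, not the checklist tooling from the project, and `guardedCall` and `GuardOptions` are hypothetical names.

```typescript
interface GuardOptions<T> {
  timeoutMs: number;                        // question 2: how slow is too slow?
  validate: (value: unknown) => value is T; // question 3: is the data what we expect?
  fallback: (reason: unknown) => T;         // question 1: what do we do on failure?
}

async function guardedCall<T>(
  call: () => Promise<unknown>,
  options: GuardOptions<T>,
): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(
      () => reject(new Error(`timed out after ${options.timeoutMs}ms`)),
      options.timeoutMs,
    );
  });

  try {
    // Whichever settles first wins: the real call or the timeout.
    const value = await Promise.race([call(), timeout]);
    if (!options.validate(value)) throw new Error('unexpected response shape');
    return value;
  } catch (reason) {
    // Failures, timeouts, and invalid data all end up in the fallback.
    return options.fallback(reason);
  } finally {
    if (timer !== undefined) clearTimeout(timer);
  }
}
```

Note that the timeout here only stops waiting. It does not cancel the underlying request, so a real implementation would also pass an `AbortSignal` or similar to the client.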

We also started running game days — intentionally injecting failures in staging using AWS Fault Injection Simulator. You’d be amazed at how many assumptions break when you kill a database replica or add 500ms latency to an internal service.

Lesson: Failure isn’t an edge case. It’s a feature of distributed systems. If you haven’t designed for it, you haven’t designed the system.

The Pattern Behind All These Failures

Look at all seven mistakes. There’s a common thread: they’re all invisible at low scale. N+1 queries are fast when N is small. Missing circuit breakers don’t matter when dependencies are healthy. Cache stampedes don’t happen with 10 concurrent users.

This is why load testing and chaos engineering aren’t optional luxuries — they’re how you find these problems before your customers do. We run load tests that simulate 3x our peak traffic before every major release. We run chaos experiments monthly. It’s not paranoia. It’s professionalism.

The systems that survive at scale aren’t the ones built by the smartest engineers. They’re the ones built by engineers who assumed everything would break and planned accordingly. Build for the failure case first. The happy path takes care of itself.
