Daniel Hidalgo - Redis in Production: Caching Strategies That Actually Work (And Some That Don't)

I’ve been running Redis in production on e-commerce platforms for several years now, and I can tell you that most caching tutorials leave out the parts that actually matter. They show you SET key value EX 300 and call it a day. Real production caching is about trade-offs, invalidation strategies, failure modes, and knowing when NOT to cache something.

This is everything I wish someone had told me before I learned it the hard way.

The Three Core Strategies (And When Each One Fits)

Cache-Aside (Lazy Loading)

This is the most common pattern, and for good reason. The application checks the cache first. On a miss, it reads from the database, writes to the cache, and returns the data.

async function getProduct(id: string): Promise<Product> {
  const cacheKey = `product:${id}`;

  // 1. Check cache
  const cached = await redis.get(cacheKey);
  if (cached) return JSON.parse(cached);

  // 2. Cache miss — read from DB
  const product = await productRepo.findById(id);
  if (!product) throw new NotFoundException();

  // 3. Populate cache
  await redis.set(cacheKey, JSON.stringify(product), 'EX', 600);

  return product;
}

When it works: Read-heavy workloads where stale data is acceptable for short periods. Product catalogs, user profiles, configuration data.

When it doesn’t: Data that changes frequently and must be immediately consistent. If you cache a user’s cart with cache-aside and they add an item, they’ll see the old cart until the TTL expires (unless you actively invalidate — more on that later).

Trade-off: Simple to implement, but every cache miss is a slow request. Cold cache after a deploy or Redis restart means a thundering herd to the database.
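A common way to soften that herd, not described in the original, is to coalesce concurrent misses for the same key so that only one of them queries the database. The sketch below works in-process only, and `singleFlight` is a hypothetical helper, not a library API. With many application instances, each instance still issues one query per key, so this shrinks the herd rather than eliminating it.

```typescript
// In-flight loads keyed by cache key (process-local).
const inFlight = new Map<string, Promise<unknown>>();

// Returns the pending promise for `key` if a load is already running;
// otherwise starts `load` and shares its promise with concurrent callers.
function singleFlight<T>(key: string, load: () => Promise<T>): Promise<T> {
  const pending = inFlight.get(key);
  if (pending) return pending as Promise<T>;

  // Remove the entry once the load settles (success or failure), so the next
  // miss after that starts a fresh load instead of reusing an old result.
  const promise = load().finally(() => {
    inFlight.delete(key);
  });
  inFlight.set(key, promise);
  return promise;
}
```

In `getProduct`, steps 2 and 3 would move into one hypothetical function (call it `loadAndCache(id)`), and the miss path would become `singleFlight(cacheKey, () => loadAndCache(id))`.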

Write-Through

Every write goes to the database and then, in the same request, to the cache. Reads are served from the cache, which already holds the value that was just written.

async function updateProduct(id: string, data: UpdateProductDto): Promise<Product> {
  // 1. Update database
  const product = await productRepo.update(id, data);

  // 2. Update cache (synchronously — part of the write path)
  const cacheKey = `product:${id}`;
  await redis.set(cacheKey, JSON.stringify(product), 'EX', 600);

  return product;
}

When it works: Data that’s read frequently and updated occasionally, AND where you need strong read consistency. Session data, feature flags, pricing rules.

When it doesn’t: Write-heavy workloads. You’re paying the latency cost of a Redis write on every single database write, even for data nobody might read.

Trade-off: Cache is always fresh, but writes are slower. And you still need a TTL as a safety net — if the write-through fails silently, the cache diverges forever without one.

Write-Behind (Write-Back)

Writes go to the cache first, and a background job flushes them to the database asynchronously. Redis doesn't do that flush for you; you build and operate it. This is the fastest option for writes but also the most dangerous.

When it works: High-frequency counters, analytics events, real-time metrics — where losing a few data points is acceptable.

When it doesn’t: Anything transactional. If Redis crashes before flushing to the database, that data is gone. I would never use this for orders, payments, or anything financial.

Trade-off: Maximum write performance, but you’re accepting potential data loss. In my experience, this pattern is rarely the right choice for core business data. We use it for view counts and analytics pipelines — that’s about it.
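Since there's no code for this pattern above, here is a minimal sketch of the view-count case under stated assumptions. Counters are incremented with INCRBY. A set (the hypothetical `views:dirty`) tracks which products have unflushed counts. A background job drains them with SPOP and GETDEL (GETDEL needs Redis 6.2+). `CounterStore`, `recordView`, `flushViews`, and `MemoryStore` are hypothetical names, and the in-memory store exists only so the sketch runs without a server.

```typescript
// Assumed subset of Redis commands (ioredis exposes commands with these names).
interface CounterStore {
  incrby(key: string, increment: number): Promise<number>;
  sadd(key: string, member: string): Promise<number>;
  spop(key: string): Promise<string | null>;
  getdel(key: string): Promise<string | null>;
}

const DIRTY_SET = 'views:dirty'; // hypothetical key name

// Hot path: one INCRBY plus one SADD, no database round-trip.
async function recordView(store: CounterStore, productId: string): Promise<void> {
  await store.incrby(`views:${productId}`, 1);
  await store.sadd(DIRTY_SET, productId);
}

// Background job (e.g. on a timer): drain dirty counters into the database.
async function flushViews(
  store: CounterStore,
  persist: (productId: string, delta: number) => Promise<void>,
): Promise<number> {
  let flushed = 0;
  while (true) {
    const id = await store.spop(DIRTY_SET);
    if (id === null) break;
    // GETDEL reads and clears in one step; later increments start a new
    // counter and re-mark the product dirty for the next flush.
    const raw = await store.getdel(`views:${id}`);
    if (raw === null) continue;
    // If persist throws here, this delta is lost: the trade-off described above.
    await persist(id, Number(raw));
    flushed++;
  }
  return flushed;
}

// In-memory stand-in for Redis so the sketch runs without a server.
class MemoryStore implements CounterStore {
  private strings = new Map<string, string>();
  private sets = new Map<string, Set<string>>();

  async incrby(key: string, increment: number): Promise<number> {
    const next = Number(this.strings.get(key) ?? '0') + increment;
    this.strings.set(key, String(next));
    return next;
  }

  async sadd(key: string, member: string): Promise<number> {
    const set = this.sets.get(key) ?? new Set<string>();
    this.sets.set(key, set);
    if (set.has(member)) return 0;
    set.add(member);
    return 1;
  }

  async spop(key: string): Promise<string | null> {
    const set = this.sets.get(key);
    if (!set || set.size === 0) return null;
    const member = set.values().next().value as string; // real SPOP picks at random
    set.delete(member);
    return member;
  }

  async getdel(key: string): Promise<string | null> {
    const value = this.strings.get(key) ?? null;
    this.strings.delete(key);
    return value;
  }
}
```

The data-loss window is visible in the code: anything between INCRBY and a successful `persist` lives only in Redis.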

TTL Strategies Beyond “Set It to 5 Minutes”

The number one TTL anti-pattern I see is slapping the same TTL on everything. Your product catalog and your real-time inventory count have wildly different freshness requirements. Treat them differently.

Here’s how we think about TTL design:

Static reference data (countries, categories, tax rates): TTL of 24 hours. Invalidate explicitly on change.
Semi-static data (product details, user profiles): TTL of 10-30 minutes. Stale data is annoying but not harmful.
Volatile data (inventory counts, pricing): TTL of 30-60 seconds. Or don’t cache at all — sometimes the database is the right answer.
Session/auth data: TTL matches the session duration. No more, no less.
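One way to keep these tiers in one place is to have call sites name a tier instead of a magic number. A minimal sketch: the tier names and the `ttlFor` / `sessionTtl` helpers are hypothetical, and the ranges come straight from the list. Picking a random value inside each range is an extra assumption, not something the list prescribes. It spreads out the expiry of keys written at the same moment.

```typescript
type DataClass = 'static' | 'semiStatic' | 'volatile';

// Ranges in seconds, taken from the tiers above.
const TTL_RANGES_SECONDS: Record<DataClass, [min: number, max: number]> = {
  static: [86_400, 86_400], // 24 hours; still invalidate explicitly on change
  semiStatic: [600, 1_800], // 10-30 minutes
  volatile: [30, 60], // 30-60 seconds (or don't cache at all)
};

// Returns a whole number of seconds in [min, max], inclusive.
function ttlFor(dataClass: DataClass, random: () => number = Math.random): number {
  const [min, max] = TTL_RANGES_SECONDS[dataClass];
  return min + Math.floor(random() * (max - min + 1));
}

// Session keys follow the session itself: the TTL is the time left on it.
// 0 means the session has already expired, so don't cache it at all.
function sessionTtl(sessionExpiresAt: Date, now: Date = new Date()): number {
  return Math.max(0, Math.ceil((sessionExpiresAt.getTime() - now.getTime()) / 1000));
}
```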

We also use adaptive TTLs based on access patterns:

async function getWithAdaptiveTTL<T>(key: string, fetcher: () => Promise<T>): Promise<T> {
  const cached = await redis.get(key);
  if (cached) {
    // Extend TTL on access (most-recently-used stays warm)
    const currentTTL = await redis.ttl(key);
    if (currentTTL < 120) {
      await redis.expire(key, 600); // Fewer than 2 min left: reset to 10 min
    }
    return JSON.parse(cached) as T;
  }

  const data = await fetcher();
  await redis.set(key, JSON.stringify(data), 'EX', 600);
  return data;
}

Frequently accessed data stays warm. Rarely accessed data simply expires. Your cache memory stays focused on what matters.

Cache Invalidation: The Real Hard Problem

Phil Karlton said there are two hard things in computer science: cache invalidation and naming things. He wasn’t wrong about the first one.

Here are the invalidation patterns we use, from simplest to most robust:

TTL-Based Expiration (Passive)

Let it expire naturally. This is fine when stale data is acceptable. It’s simple, and simple is underrated.

Explicit Invalidation (Active)

When data changes, delete the cache key:

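async function updateProduct(id: string, data: UpdateProductDto): Promise<Product> {
  const product = await productRepo.update(id, data);

  // Delete, don't update: let the next read repopulate
  await redis.del(`product:${id}`);

  // Don't forget related caches!
  await redis.del(`catalog:${product.categoryId}`);

  // DEL doesn't accept glob patterns, so find the search-cache keys with SCAN
  // (non-blocking, unlike KEYS) and delete each batch as it arrives.
  // On a cluster, run the scan against each primary, e.g. via cluster.nodes('master').
  for await (const keys of redis.scanStream({ match: 'search:products:*', count: 100 })) {
    await Promise.all((keys as string[]).map((key) => redis.del(key)));
  }

  return product;
}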

Important: Delete, don’t update. If you update the cache and the database write fails or rolls back, your cache has phantom data. Delete is idempotent and safe.

But notice the search:products:* — that’s where invalidation gets ugly. A product change might affect category listings, search results, recommendation feeds, and homepage features. Missing even one creates a stale data bug that’s incredibly hard to track down.

Event-Driven Invalidation

This is what we settled on for anything non-trivial. Database changes publish events, and a dedicated cache invalidation consumer handles the cleanup:

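import { Injectable } from '@nestjs/common';
import { OnEvent } from '@nestjs/event-emitter';
import Redis from 'ioredis';

// Event handler: separate service/consumer.
// Assumes an ioredis client is registered as a Nest provider for `Redis`.
@Injectable()
export class ProductCacheInvalidator {
  constructor(private readonly redis: Redis) {}

  @OnEvent('product.updated')
  async handleProductUpdate(event: ProductUpdatedEvent) {
    const { productId, categoryId, previousCategoryId } = event;

    const keysToInvalidate = [
      `product:${productId}`,
      `catalog:${categoryId}`,
      `homepage:featured`,
    ];

    // If category changed, invalidate old category too
    if (previousCategoryId && previousCategoryId !== categoryId) {
      keysToInvalidate.push(`catalog:${previousCategoryId}`);
    }

    // One DEL per key: a multi-key DEL fails with CROSSSLOT on a cluster
    await Promise.all(keysToInvalidate.map((key) => this.redis.del(key)));
  }
}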

This centralizes invalidation logic. When a new cache is added, you add its invalidation rule to the event handler — one place to maintain instead of scattered redis.del() calls across the codebase.
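The "one place to maintain" idea can go a step further: each cache registers its own rule, and adding a cache means appending a function rather than editing the handler body. A minimal sketch under assumptions: the event shape is inferred from the fields the handler destructures, and `InvalidationRule`, `productUpdatedRules`, and `keysToInvalidate` are hypothetical names. The handler would then delete whatever `keysToInvalidate(event)` returns.

```typescript
// Assumed shape of the event, based on the fields the handler reads.
interface ProductUpdatedEvent {
  productId: string;
  categoryId: string;
  previousCategoryId?: string;
}

// Each cache contributes one rule: given the event, which of its keys are stale?
type InvalidationRule = (event: ProductUpdatedEvent) => string[];

const productUpdatedRules: InvalidationRule[] = [
  (e) => [`product:${e.productId}`],
  (e) => [`catalog:${e.categoryId}`],
  (e) =>
    e.previousCategoryId && e.previousCategoryId !== e.categoryId
      ? [`catalog:${e.previousCategoryId}`]
      : [],
  () => ['homepage:featured'],
];

// Collects every rule's keys, de-duplicated, in rule order.
function keysToInvalidate(
  event: ProductUpdatedEvent,
  rules: InvalidationRule[] = productUpdatedRules,
): string[] {
  const keys = new Set<string>();
  for (const rule of rules) {
    for (const key of rule(event)) keys.add(key);
  }
  return Array.from(keys);
}
```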

Redis Cluster on AWS ElastiCache: What We Learned

Running Redis on ElastiCache in cluster mode has some sharp edges:

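Key distribution matters. Redis Cluster maps every key to one of 16,384 hash slots, and a multi-key command like MGET fails with a CROSSSLOT error if its keys land in different slots. Use hash tags ({product}:123, {product}:456) so related keys hash to the same slot, and therefore to the same shard. Keep tags narrow, though: every key that shares a tag lives on one shard, so a tag as broad as {product} can turn that shard into a hotspot.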
Failover isn’t instant. When a primary node fails, ElastiCache promotes a replica. This takes 10-30 seconds. Your application needs to handle connection errors gracefully during failover. We use ioredis with retryStrategy and reconnectOnError configured properly.
Memory management is critical. Set maxmemory-policy to allkeys-lru for caching workloads (on ElastiCache, through the cluster's parameter group). We've seen teams leave it at noeviction (the open-source Redis default) and then wonder why Redis starts rejecting writes with OOM errors when it fills up.
Monitor replication lag. If your read replicas fall behind, reads from replicas serve stale data. We alert at 1 second of replication lag.
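To check where keys will land before relying on MGET, the slot calculation is small enough to reproduce. Per the Redis Cluster specification, the slot is the CRC16 (XMODEM variant) of the key modulo 16384, and when the key contains a non-empty `{...}` section, only the text inside the first such braces is hashed. The sketch below implements that rule; `crc16` and `keySlot` are illustrative names, and `CLUSTER KEYSLOT <key>` on a live cluster is the authoritative answer to compare against.

```typescript
// CRC16-CCITT (XMODEM): polynomial 0x1021, initial value 0, no reflection.
function crc16(bytes: Uint8Array): number {
  let crc = 0;
  for (let i = 0; i < bytes.length; i++) {
    crc ^= bytes[i] << 8;
    for (let bit = 0; bit < 8; bit++) {
      crc = crc & 0x8000 ? ((crc << 1) ^ 0x1021) & 0xffff : (crc << 1) & 0xffff;
    }
  }
  return crc;
}

// Hash slot for a key, applying the hash-tag rule from the cluster spec.
function keySlot(key: string): number {
  const bytes = new TextEncoder().encode(key);
  const open = bytes.indexOf(0x7b); // first '{'
  if (open !== -1) {
    const close = bytes.indexOf(0x7d, open + 1); // first '}' after it
    if (close > open + 1) {
      // Non-empty tag: hash only what's between the braces.
      return crc16(bytes.subarray(open + 1, close)) & 16383;
    }
  }
  return crc16(bytes) & 16383; // & 16383 is the same as mod 16384
}
```

`keySlot('{product}:123')` and `keySlot('{product}:456')` both equal `keySlot('product')`, which is why an MGET across them is allowed.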

When NOT to Cache

This is the section most articles skip. Here’s when caching makes things worse:

Write-heavy, read-light data. If you write 10x more than you read, you’re paying the cache write cost without the read benefit.
Highly personalized data with low reuse. Caching a unique recommendation feed per user sounds good until you realize your cache hit rate is 3%. You’re using Redis as an expensive, volatile second database.
Data that must be real-time consistent. Account balances, inventory for limited-edition drops, auction bids. The cost of stale data exceeds the cost of a database query.
Large objects that rarely repeat. Caching a 2MB report PDF in Redis is a waste of memory. Put it in S3.

The rule of thumb: If your cache hit rate is below 80%, investigate whether that cache is earning its keep. Below 50%, it’s probably hurting more than helping.

Monitoring Cache Health

A cache you don’t monitor is a cache waiting to betray you. Here’s what we track:

Hit rate (target: >90%): cache_hits / (cache_hits + cache_misses). We expose this as a Prometheus metric and alert if it drops below 80%.
Latency percentiles: p50, p95, p99 for Redis operations. If p99 spikes, you might have a hot key or a network issue.
Memory usage: Percentage of available memory used. Alert at 80%.
Eviction rate: If keys are being evicted faster than expected, your cache is undersized for the workload.
Connection count: Suddenly high connection counts can indicate a connection leak in your application.
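Redis also reports server-wide numbers for some of these: `INFO stats` includes cumulative `keyspace_hits`, `keyspace_misses`, and `evicted_keys` counters. Here is a minimal parsing sketch, assuming you fetch the text yourself (with ioredis, `redis.info('stats')` should return it as a string). `parseInfo` and `intervalStats` are hypothetical helpers. The counters accumulate from server start and can't be split by key pattern, so application-side counters are still needed for per-cache hit rates.

```typescript
// Parses the "field:value" lines of INFO output (CRLF-separated, with
// "# Section" header lines) into a lookup table.
function parseInfo(info: string): Map<string, string> {
  const fields = new Map<string, string>();
  for (const line of info.split(/\r?\n/)) {
    if (line === '' || line.startsWith('#')) continue;
    const sep = line.indexOf(':');
    if (sep > 0) fields.set(line.slice(0, sep), line.slice(sep + 1));
  }
  return fields;
}

function counter(fields: Map<string, string>, name: string): number {
  return Number(fields.get(name) ?? '0');
}

interface IntervalStats {
  hitRate: number | null; // null when there were no lookups in the window
  evictions: number;
}

// INFO counters are cumulative, so diff two samples taken some time apart.
// (A server restart between samples resets them and breaks the diff.)
function intervalStats(before: string, after: string): IntervalStats {
  const a = parseInfo(before);
  const b = parseInfo(after);
  const hits = counter(b, 'keyspace_hits') - counter(a, 'keyspace_hits');
  const misses = counter(b, 'keyspace_misses') - counter(a, 'keyspace_misses');
  const lookups = hits + misses;
  return {
    hitRate: lookups > 0 ? hits / lookups : null,
    evictions: counter(b, 'evicted_keys') - counter(a, 'evicted_keys'),
  };
}
```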

We use OpenTelemetry to instrument all Redis calls and ship traces to our observability stack. Every cache operation gets a span with the key pattern (not the full key — you don’t want high-cardinality labels), the operation type, and whether it was a hit or miss.

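// Simplified OTEL instrumentation wrapper
import { trace } from '@opentelemetry/api';
import { Counter } from 'prom-client';
import Redis from 'ioredis';

const redis = new Redis();
const tracer = trace.getTracer('cache');
// Metric names are illustrative
const cacheHitCounter = new Counter({ name: 'cache_hits_total', help: 'Cache hits', labelNames: ['pattern'] });
const cacheMissCounter = new Counter({ name: 'cache_misses_total', help: 'Cache misses', labelNames: ['pattern'] });

async function cachedGet<T>(key: string, fetcher: () => Promise<T>, ttl: number): Promise<T> {
  // Low-cardinality label: "product:123" -> "product"
  const pattern = key.split(':')[0];
  const span = tracer.startSpan('cache.get', {
    attributes: { 'cache.key_pattern': pattern, 'cache.ttl': ttl },
  });

  try {
    const cached = await redis.get(key);
    if (cached) {
      span.setAttribute('cache.hit', true);
      cacheHitCounter.inc({ pattern });
      return JSON.parse(cached) as T;
    }

    span.setAttribute('cache.hit', false);
    cacheMissCounter.inc({ pattern });
    const data = await fetcher();
    await redis.set(key, JSON.stringify(data), 'EX', ttl);
    return data;
  } finally {
    span.end();
  }
}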

The Bottom Line

Redis is one of the most powerful tools in a backend engineer’s toolkit, but it’s not magic. It’s a trade-off machine: memory for speed, consistency for performance, simplicity for resilience. Every caching decision should start with “what’s the cost of stale data?” and “what’s the cost of a cache miss?” — not with “let’s cache everything and hope for the best.”

Start with cache-aside for most things. Add write-through only where consistency matters. Use event-driven invalidation for anything complex. Monitor your hit rates religiously. And never, ever forget: the fastest cache miss is the one you prevent by not caching data that shouldn’t be cached in the first place.
