Lessons From Running Production Systems With Distributed Teams

Real lessons from managing production systems across timezones — observability, incident management, runbooks, deployment confidence, and building a culture of ownership.

The 3 AM Alert That Changed How I Think About Teams

I was deep asleep when the PagerDuty alert hit. An e-commerce checkout service was throwing 500s at a rate that would make your monitoring dashboard look like a seismograph. The service ran on EKS, the queue was backed up in SQS, and our team was spread across three timezones — Ecuador, Argentina, and Spain.

That night taught me more about distributed systems AND distributed teams than any architecture book ever could. Because when production is on fire, it doesn’t care where your team lives.

Here’s what I’ve learned from years of running production systems with teams that never share an office.

Observability Is the Great Equalizer

When your team is distributed, nobody can tap a colleague on the shoulder and ask “hey, does this look weird to you?” Observability fills that gap. It becomes the shared language your team speaks across timezones.

We standardized on OpenTelemetry for instrumentation and structured logging everywhere. Not because it’s trendy — because when an engineer in Argentina picks up an incident at 9 AM their time, they need to see the SAME context that the engineer in Ecuador saw at 2 AM.

What actually works:

// Structured logging that actually helps during incidents
// Bad: logger.error("Payment failed")
// Good:
logger.error({
  event: 'payment_processing_failed',
  orderId: order.id,
  paymentProvider: 'stripe',
  errorCode: err.code,
  amount: order.total,
  currency: order.currency,
  traceId: context.traceId,
  customerSegment: order.customer.tier,
  retryAttempt: attempt,
  latencyMs: Date.now() - startTime,
});

The difference between those two log lines is the difference between “something broke” and “I know exactly what broke, for whom, and how to fix it.”
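
One way to keep those fields from drifting between services is to build the event through a typed helper. The sketch below is only an illustration: `buildPaymentFailureEvent`, `PaymentFailureContext`, and the injectable `nowMs` parameter are hypothetical names, not part of any logging library, and the field list simply mirrors the example above.

```typescript
// Hypothetical context type: the fields mirror the log example above.
interface PaymentFailureContext {
  orderId: string;
  paymentProvider: string;
  errorCode: string;
  amount: number;
  currency: string;
  traceId: string;
  retryAttempt: number;
  startTimeMs: number; // when the payment attempt started (epoch ms)
}

interface LogEvent {
  level: "error";
  event: string;
  timestamp: string;
  latencyMs: number;
  [field: string]: string | number;
}

// Builds one JSON-serializable event. `nowMs` is injectable so the
// function stays deterministic in tests.
function buildPaymentFailureEvent(
  ctx: PaymentFailureContext,
  nowMs: number = Date.now(),
): LogEvent {
  // Drop the raw start time; only the derived latency is logged.
  const { startTimeMs, ...fields } = ctx;
  return {
    level: "error",
    event: "payment_processing_failed",
    timestamp: new Date(nowMs).toISOString(),
    latencyMs: nowMs - startTimeMs,
    ...fields,
  };
}
```

Passing the result to `JSON.stringify` yields a single line that a log pipeline can index, and a missing `traceId` becomes a compile-time error instead of a gap discovered mid-incident.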

Incident Management Across Timezones

Incident management with a distributed team requires explicit structure. You can’t rely on “everyone jumps on a call” when it’s 3 AM for half your team.

Here’s the system we built:

The handoff is the hardest part. We use a template:

## Incident Handoff — [INC-2024-0847]
**Status:** Investigating / Mitigated / Monitoring
**Impact:** ~12% of checkout attempts failing in US-East
**What we know:** SQS consumer lag spiking on payment-processor queue
**What we've tried:** Scaled consumers to 8 replicas (no improvement)
**Current hypothesis:** Stripe API is rate-limiting our requests
**Next steps:** Check Stripe dashboard, consider circuit breaker activation
**Key links:** [Dashboard](link) | [Logs query](link) | [Trace](link)

That template has saved us hours of “wait, what happened while I was asleep?”
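
A template only helps if it is filled in completely. As a rough sketch, a handoff note could be checked for blank sections before it is posted. The `IncidentHandoff` shape and the `missingHandoffFields` function are hypothetical names that mirror the template fields above, not an existing tool.

```typescript
// Hypothetical shape mirroring the handoff template fields.
interface IncidentHandoff {
  id: string;
  status: "Investigating" | "Mitigated" | "Monitoring";
  impact: string;
  whatWeKnow: string;
  whatWeTried: string[];
  currentHypothesis: string;
  nextSteps: string[];
  keyLinks: string[];
}

// Returns the names of sections that are blank, so an incomplete handoff
// can be caught before the outgoing engineer goes offline.
function missingHandoffFields(h: IncidentHandoff): string[] {
  const textFields: [string, string][] = [
    ["impact", h.impact],
    ["whatWeKnow", h.whatWeKnow],
    ["currentHypothesis", h.currentHypothesis],
  ];
  const listFields: [string, string[]][] = [
    ["whatWeTried", h.whatWeTried],
    ["nextSteps", h.nextSteps],
    ["keyLinks", h.keyLinks],
  ];
  const missing: string[] = [];
  for (const [name, value] of textFields) {
    if (value.trim() === "") missing.push(name);
  }
  for (const [name, items] of listFields) {
    // A list counts as filled only if it has at least one non-blank entry.
    if (items.filter((item) => item.trim() !== "").length === 0) {
      missing.push(name);
    }
  }
  return missing;
}
```

An empty result means every section has content. Anything else is a list of sections to fill in before handing over.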

Runbooks That Actually Get Used

Most runbooks are graveyard documents. Written once, never updated, never consulted during an actual incident. I’ve been guilty of this. Here’s what changed.

Runbooks work when they’re:

# runbook: payment-queue-lag
trigger: sqs_queue_depth > 10000 for 5m
steps:
  - check: "Are consumers running?"
    command: "kubectl get pods -l app=payment-consumer -n production"
    if_no: "kubectl rollout restart deployment/payment-consumer -n production"
    if_yes: "continue"
  - check: "Is Stripe responding?"
    command: "curl -o /dev/null -s -w '%{http_code}' https://api.stripe.com/v1/health"
    if_5xx: "Activate circuit breaker: kubectl set env deployment/payment-consumer STRIPE_CIRCUIT=open -n production"
    if_200: "continue"
  - check: "Consumer error logs"
    command: "kubectl logs -l app=payment-consumer -n production --tail=100 | grep ERROR"
    action: "Analyze error pattern and escalate if unknown"
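
The `for 5m` part of the trigger matters: a single spike above 10000 should not page anyone, but a sustained backlog should. A minimal sketch of that rule follows. `Sample` and `isSustainedBreach` are hypothetical names, the samples are assumed to be sorted oldest first, and a real alerting system will have its own evaluation rules.

```typescript
interface Sample {
  timestampMs: number;
  value: number; // e.g. queue depth at that moment
}

// True only when the metric stayed above `threshold` for the whole
// trailing window. Requires a sample at or before the window start,
// so a short history never counts as a sustained breach.
function isSustainedBreach(
  samples: Sample[],
  threshold: number,
  windowMs: number,
  nowMs: number,
): boolean {
  const windowStart = nowMs - windowMs;
  let anchored = false;
  let relevant: Sample[] = [];
  for (const s of samples) {
    if (s.timestampMs > nowMs) break;
    if (s.timestampMs <= windowStart) {
      // Restart from the latest sample at or before the window start.
      relevant = [s];
      anchored = true;
    } else {
      relevant.push(s);
    }
  }
  if (!anchored) return false;
  return relevant.every((s) => s.value > threshold);
}
```

With one sample per minute, six consecutive readings above 10000 (minutes 0 through 5) satisfy a five-minute window, while a single dip to 9000 in the middle does not.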

“You Build It, You Run It” in Practice

This philosophy sounds great in conference talks. In practice, with a distributed team, it requires infrastructure.

You can’t expect developers to “run” what they build if:

What we actually implemented:

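# Canary deployment strategy on EKS
apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: checkout-service
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: checkout-service
  progressDeadlineSeconds: 600
  analysis:
    interval: 1m      # how often the metric checks run
    threshold: 5      # failed checks allowed before rollback
    maxWeight: 50     # maximum share of traffic (%) routed to the canary
    stepWeight: 10    # traffic increase (%) after each passing check
    metrics:
      - name: request-success-rate
        thresholdRange:
          min: 99     # percent of requests that must succeed
        interval: 1m
      - name: request-duration
        thresholdRange:
          max: 500    # milliseconds
        interval: 1m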
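
To see how `stepWeight`, `maxWeight`, and `threshold` interact, it can help to walk through the analysis loop by hand. The sketch below is a deliberately simplified model, not Flagger's actual implementation: `simulateCanary` and its types are hypothetical, each entry in `checks` stands for one analysis interval, and a failed check is assumed to hold the weight where it is.

```typescript
type CanaryOutcome = "promoted" | "rolled_back" | "in_progress";

interface CanaryConfig {
  stepWeight: number; // percentage points added after each passing check
  maxWeight: number;  // traffic percentage at which the canary is promoted
  threshold: number;  // failed checks that trigger a rollback
}

interface CanaryResult {
  outcome: CanaryOutcome;
  weights: number[]; // canary traffic weight after each check
}

// Simplified model: one boolean per analysis interval (true = metrics passed).
function simulateCanary(config: CanaryConfig, checks: boolean[]): CanaryResult {
  let weight = 0;
  let failed = 0;
  const weights: number[] = [];
  for (const passed of checks) {
    if (passed) {
      weight = Math.min(weight + config.stepWeight, config.maxWeight);
    } else {
      failed += 1; // weight is held while checks fail
    }
    weights.push(weight);
    if (failed >= config.threshold) return { outcome: "rolled_back", weights };
    if (weight >= config.maxWeight) return { outcome: "promoted", weights };
  }
  return { outcome: "in_progress", weights };
}
```

With the values from the config above (`stepWeight: 10`, `maxWeight: 50`, `threshold: 5`), five passing checks in a row move the canary through 10, 20, 30, 40, and 50 percent and end in promotion. One pass followed by five failures ends in a rollback with the weight never leaving 10 percent.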

Async Communication Patterns That Work

Distributed teams live and die by their async communication. Meetings across timezones are expensive and often unnecessary. Here’s what replaced most of ours:

The rule of thumb: if the information needs to survive longer than the meeting, it should never have been a meeting.

The Deployment Confidence Problem

The scariest moment in distributed team operations is deploying on a Friday afternoon when the team that built the feature is about to go offline for the weekend.

We solved this with layers of confidence:

The goal isn’t to slow down deployments. It’s to make every deployment boring. Boring deployments mean your system is healthy.

War Story: The Silent Queue

One Saturday, our order processing queue (SQS) stopped consuming messages. No errors. No alerts. Just… silence. Orders were piling up, but everything looked green on our dashboards.

The root cause? A Kubernetes node had been cordoned and drained during maintenance, which evicted the consumer pods, and the Horizontal Pod Autoscaler had a minimum replica count of zero. So when the pods were evicted, the HPA was perfectly happy with zero replicas processing zero messages.

What we changed after:

No blame. Just learning.
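
The general lesson from this kind of failure is that "no errors" is not the same as "healthy": a backlog with nobody consuming it produces no errors at all. A minimal sketch of a check that catches it follows. It assumes the inputs come from the SQS CloudWatch metrics `ApproximateNumberOfMessagesVisible` and `ApproximateAgeOfOldestMessage` plus a count of ready consumer pods. `checkQueueHealth`, `QueueSnapshot`, and the `maxAgeSec` parameter are hypothetical names, and this is an illustration rather than a record of the exact fix.

```typescript
interface QueueSnapshot {
  visibleMessages: number;     // messages waiting in the queue
  runningConsumers: number;    // consumer pods currently ready
  oldestMessageAgeSec: number; // age of the oldest unprocessed message
}

type QueueAlert = "no_consumers" | "stalled" | null;

// Flags the two silent failure modes: a backlog with zero consumers,
// and messages aging past the limit even though consumers exist.
function checkQueueHealth(s: QueueSnapshot, maxAgeSec: number): QueueAlert {
  if (s.visibleMessages > 0 && s.runningConsumers === 0) return "no_consumers";
  if (s.oldestMessageAgeSec > maxAgeSec) return "stalled";
  return null;
}
```

Neither condition depends on an error being logged, which is what makes it useful for a queue that has simply gone quiet.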

Building a Culture of Ownership, Not Blame

Blameless post-mortems aren’t just a nice idea. They’re a survival mechanism for distributed teams. Here’s why: if people fear blame, they hide problems. If they hide problems, problems grow. If problems grow across timezones, you find out when it’s already a disaster.

Our post-mortem template has one rule: no human names in the root cause section. The system failed. Not a person.

We focus on:

The result? Engineers proactively report near-misses. They flag “this feels fragile” without fear. And the whole team gets better because information flows freely.

Documentation Is a Production Dependency

I’ll say it plainly: if your system can’t be operated by someone who wasn’t in the room when it was built, your documentation is failing. And with distributed teams, someone is ALWAYS not in the room.

We treat documentation like we treat tests. It’s not optional. It’s not “nice to have.” It’s a production dependency. If you ship a new service without an updated README, operational runbook, and architecture diagram, your PR doesn’t get merged.

Is it extra work? Yes. Does it pay for itself the first time someone in a different timezone needs to debug your service at 3 AM? Absolutely.
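
A merge rule like this is easier to enforce when a machine checks it. As a rough sketch, a CI step could compare the files changed in a PR against the documents a service is expected to ship with. The file paths in `REQUIRED_DOCS` and the `missingServiceDocs` function are assumptions made for illustration; the rule itself only names the three documents.

```typescript
// Assumed file layout: adjust to whatever convention the repository uses.
const REQUIRED_DOCS = ["README.md", "docs/runbook.md", "docs/architecture.md"];

// Returns the required documents that the PR did not touch.
function missingServiceDocs(serviceDir: string, changedFiles: string[]): string[] {
  return REQUIRED_DOCS
    .map((doc) => `${serviceDir}/${doc}`)
    .filter((path) => changedFiles.indexOf(path) === -1);
}
```

A non-empty result would fail the check and tell the author exactly which documents still need an update.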

The Honest Truth

Running production systems with distributed teams is harder than doing it in a co-located office. Anyone who tells you otherwise is selling something. But it’s also possible, sustainable, and — when you get it right — incredibly rewarding.

The key isn’t fancy tools or complex processes. It’s building systems (both technical and human) that assume someone is always sleeping, someone is always debugging, and the documentation is always the first thing they’ll read.

Make that experience great, and everything else follows.
