Daniel Hidalgo - Lessons From Running Production Systems With Distributed Teams

The 3 AM Alert That Changed How I Think About Teams

I was fast asleep when the PagerDuty alert hit. An e-commerce checkout service was throwing 500s at a rate that would make your monitoring dashboard look like a seismograph. The service ran on EKS, the queue was backed up in SQS, and our team was spread across three timezones: Ecuador, Argentina, and Spain.

That night taught me more about distributed systems AND distributed teams than any architecture book ever could. Because when production is on fire, it doesn’t care where your team lives.

Here’s what I’ve learned from years of running production systems with teams that never share an office.

Observability Is the Great Equalizer

When your team is distributed, nobody can tap a colleague on the shoulder and ask “hey, does this look weird to you?” Observability fills that gap. It becomes the shared language your team speaks across timezones.

We standardized on OpenTelemetry for instrumentation and structured logging everywhere. Not because it’s trendy — because when an engineer in Argentina picks up an incident at 9 AM their time, they need to see the SAME context that the engineer in Ecuador saw at 2 AM.

What actually works:

Structured logging with correlation IDs. Every request gets a trace ID that flows through every service, every queue message, every database call. When something breaks, you follow one ID through the entire system. No guessing.
Business-level metrics, not just infra metrics. CPU usage is nice. “Orders per minute dropped 40% in the last 5 minutes” is actionable.
Dashboards that tell stories. We build dashboards around user journeys, not services. The “Checkout Health” dashboard shows everything from cart creation to payment confirmation in one view.

// Structured logging that actually helps during incidents
// Bad: logger.error("Payment failed")
// Good:
logger.error({
  event: 'payment_processing_failed',
  orderId: order.id,
  paymentProvider: 'stripe',
  errorCode: err.code,
  amount: order.total,
  currency: order.currency,
  traceId: context.traceId,
  customerSegment: order.customer.tier,
  retryAttempt: attempt,
  latencyMs: Date.now() - startTime,
});

The difference between those two log lines is the difference between “something broke” and “I know exactly what broke, for whom, and how to fix it.”
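
To make the "one ID through the entire system" idea concrete, here is a minimal sketch of carrying a trace ID across a queue hop. The `TraceContext` and `QueueMessage` shapes and all three function names are hypothetical, made up for illustration; in a real setup OpenTelemetry's context propagation would likely do this work for you.

```typescript
// Hypothetical shapes for illustration; not the API of any specific queue client.
interface TraceContext {
  traceId: string;
}

interface QueueMessage<T> {
  body: T;
  attributes: Partial<Record<string, string>>;
}

// Producer side: attach the current trace ID to the outgoing message.
function withTrace<T>(body: T, ctx: TraceContext): QueueMessage<T> {
  return { body, attributes: { traceId: ctx.traceId } };
}

// Consumer side: reuse the incoming trace ID, or start a new trace if it is missing.
function extractTrace(msg: QueueMessage<unknown>, newTraceId: () => string): TraceContext {
  return { traceId: msg.attributes["traceId"] ?? newTraceId() };
}

// Every log line carries the trace ID, so one query follows the whole request.
function logLine(ctx: TraceContext, event: string, fields: Record<string, unknown> = {}): string {
  return JSON.stringify({ event, traceId: ctx.traceId, ...fields });
}

const incoming: TraceContext = { traceId: "trace-abc123" };
const message = withTrace({ orderId: "order-42" }, incoming);
const consumerCtx = extractTrace(message, () => "trace-generated");
console.log(logLine(consumerCtx, "payment_processing_started", { orderId: message.body.orderId }));
```

The consumer never invents its own ID when one arrived with the message, which is what lets someone in another timezone search for a single value and see both sides of the queue.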

Incident Management Across Timezones

Incident management with a distributed team requires explicit structure. You can’t rely on “everyone jumps on a call” when it’s 3 AM for half your team.

Here’s the system we built:

Follow-the-sun on-call rotation. Each timezone has primary and secondary on-call. The incident stays with whoever picks it up until it’s resolved OR explicitly handed off with a written summary.
Incident channels, not threads. Every P1/P2 gets a dedicated Slack channel with a bot that timestamps every update. When the next timezone wakes up, they read the channel top to bottom and know exactly where things stand.
The 15-minute rule. If you can’t identify root cause in 15 minutes, escalate. Not because you’re incompetent — because you’re managing time, not ego. The sooner you pull in help, the sooner users are unblocked.
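
The 15-minute rule is simple enough to encode in an incident bot. This is a rough sketch, assuming a hypothetical `IncidentState` shape; the names are illustrative and not taken from any real tool.

```typescript
// The 15-minute limit comes from the rule above; everything else is a hypothetical sketch.
const ESCALATION_LIMIT_MS = 15 * 60 * 1000;

interface IncidentState {
  startedAt: Date;
  rootCauseIdentified: boolean;
}

// Escalate once 15 minutes have passed without an identified root cause.
function shouldEscalate(incident: IncidentState, now: Date): boolean {
  if (incident.rootCauseIdentified) {
    return false;
  }
  return now.getTime() - incident.startedAt.getTime() >= ESCALATION_LIMIT_MS;
}

const incident: IncidentState = {
  startedAt: new Date("2024-03-02T03:00:00Z"),
  rootCauseIdentified: false,
};
console.log(shouldEscalate(incident, new Date("2024-03-02T03:20:00Z")));
```

Having a bot ask the question takes the ego out of it: nobody has to decide whether they have "struggled long enough".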

The handoff is the hardest part. We use a template:

## Incident Handoff — [INC-2024-0847]
**Status:** Investigating / Mitigated / Monitoring
**Impact:** ~12% of checkout attempts failing in US-East
**What we know:** SQS consumer lag spiking on payment-processor queue
**What we've tried:** Scaled consumers to 8 replicas (no improvement)
**Current hypothesis:** Stripe webhook endpoint is rate-limiting us
**Next steps:** Check Stripe dashboard, consider circuit breaker activation
**Key links:** [Dashboard](link) | [Logs query](link) | [Trace](link)

That template has saved us hours of “wait, what happened while I was asleep?”
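
A template only helps if people fill it in. One option is to have the incident bot reject incomplete handoffs. The sketch below assumes a hypothetical `IncidentHandoff` type whose fields mirror the template sections; the field names are mine, not part of any Slack or PagerDuty API.

```typescript
// Field names mirror the template sections; the type itself is a hypothetical sketch.
interface IncidentHandoff {
  status: "Investigating" | "Mitigated" | "Monitoring";
  impact: string;
  whatWeKnow: string;
  whatWeTried: string;
  currentHypothesis: string;
  nextSteps: string;
  keyLinks: string[];
}

// Returns the names of sections that are still empty, so a bot can ask for them.
function missingHandoffFields(handoff: IncidentHandoff): string[] {
  const textFields: Array<[string, string]> = [
    ["impact", handoff.impact],
    ["whatWeKnow", handoff.whatWeKnow],
    ["whatWeTried", handoff.whatWeTried],
    ["currentHypothesis", handoff.currentHypothesis],
    ["nextSteps", handoff.nextSteps],
  ];
  const missing: string[] = [];
  for (const [name, value] of textFields) {
    if (value.trim() === "") {
      missing.push(name);
    }
  }
  if (handoff.keyLinks.length === 0) {
    missing.push("keyLinks");
  }
  return missing;
}

const draft: IncidentHandoff = {
  status: "Investigating",
  impact: "~12% of checkout attempts failing in US-East",
  whatWeKnow: "SQS consumer lag spiking on payment-processor queue",
  whatWeTried: "Scaled consumers to 8 replicas (no improvement)",
  currentHypothesis: "",
  nextSteps: "Check Stripe dashboard, consider circuit breaker activation",
  keyLinks: [],
};
console.log(missingHandoffFields(draft));
```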

Runbooks That Actually Get Used

Most runbooks are graveyard documents. Written once, never updated, never consulted during an actual incident. I’ve been guilty of this. Here’s what changed.

Runbooks work when they’re:

Linked directly from alerts. Every PagerDuty alert includes a runbook link. You don’t search for it — it’s right there in the notification.
Written as decision trees, not novels. “If metric X is above Y, do Z.” Not three paragraphs of background context.
Tested regularly. We run “runbook drills” monthly. Pick a scenario, follow the runbook, see if it still works. If it doesn’t, update it on the spot.
Owned by the on-call team. The people who use runbooks are the ones who maintain them. Not a separate documentation team.

# runbook: payment-queue-lag
trigger: sqs_queue_depth > 10000 for 5m
steps:
  - check: "Are consumers running?"
    command: "kubectl get pods -l app=payment-consumer -n production"
    if_no: "kubectl rollout restart deployment/payment-consumer -n production"
    if_yes: "continue"
  - check: "Is Stripe responding?"
    command: "curl -o /dev/null -s -w '%{http_code}' https://api.stripe.com/v1/health"
    if_5xx: "Activate circuit breaker: kubectl set env deployment/payment-consumer STRIPE_CIRCUIT=open -n production"
    if_200: "continue"
  - check: "Consumer error logs"
    command: "kubectl logs -l app=payment-consumer -n production --tail=100 | grep ERROR"
    action: "Analyze error pattern and escalate if unknown"
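
To show why the decision-tree shape matters, here is a small sketch of how a script (or a tired human) might walk a runbook like the one above. The `RunbookStep` model and `nextAction` function are hypothetical; the runbook format above is not tied to any specific tool.

```typescript
// Hypothetical model of a decision-tree runbook step.
interface RunbookStep {
  check: string;
  // Maps an observed outcome (e.g. "yes", "no") to an action, or "continue" to move on.
  outcomes: Record<string, string>;
}

// Walks the steps in order and returns the first action that is not "continue".
// `observe` stands in for a human or script answering each check.
function nextAction(steps: RunbookStep[], observe: (check: string) => string): string {
  for (const step of steps) {
    const outcome = observe(step.check);
    const action = step.outcomes[outcome];
    if (action === undefined) {
      return `Escalate: unexpected outcome "${outcome}" for "${step.check}"`;
    }
    if (action !== "continue") {
      return action;
    }
  }
  return "Escalate: runbook exhausted without a matching action";
}

const paymentQueueLag: RunbookStep[] = [
  {
    check: "Are consumers running?",
    outcomes: {
      yes: "continue",
      no: "kubectl rollout restart deployment/payment-consumer -n production",
    },
  },
  {
    check: "Is Stripe responding?",
    outcomes: { "200": "continue", "5xx": "Activate circuit breaker" },
  },
];

console.log(nextAction(paymentQueueLag, () => "no"));
```

Every path ends in either a concrete action or an escalation, which is the property you want at 3 AM.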

“You Build It, You Run It” in Practice

This philosophy sounds great in conference talks. In practice, with a distributed team, it requires infrastructure.

You can’t expect developers to “run” what they build if:

They don’t have access to production logs
They can’t deploy without filing a ticket
Alerts go to a separate ops team instead of the building team
There’s no way to safely roll back their changes

What we actually implemented:

Every team owns their service’s alerts, dashboards, and runbooks. Not ops. Not SRE. The team that writes the code.
Deployment pipelines with automated canary analysis. Engineers deploy their own code. The pipeline watches error rates for 10 minutes. If they spike, it rolls back automatically. This gives developers deployment confidence without requiring an ops gatekeeper.
Feature flags for everything user-facing. Deploy code anytime. Activate features during business hours when the team is awake and watching.

# Canary deployment strategy on EKS
apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: checkout-service
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: checkout-service
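  progressDeadlineSeconds: 600
  # Flagger also requires a service section; the port below is an assumed placeholder
  service:
    port: 80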
  analysis:
    interval: 1m
    threshold: 5
    maxWeight: 50
    stepWeight: 10
    metrics:
      - name: request-success-rate
        thresholdRange:
          min: 99
        interval: 1m
      - name: request-duration
        thresholdRange:
          max: 500
        interval: 1m
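
If the YAML above feels opaque, this sketch models what the analysis loop roughly does with those numbers. It is a simplified illustration under my own assumptions, not Flagger's actual implementation, and the type and function names are hypothetical.

```typescript
// Simplified model of progressive canary analysis; not Flagger's actual implementation.
interface CanaryConfig {
  threshold: number; // failed checks tolerated before rollback
  maxWeight: number; // traffic percentage at which the canary is promoted
  stepWeight: number; // traffic percentage added after each healthy interval
  minSuccessRate: number; // percent
  maxDurationMs: number; // latency ceiling in milliseconds
}

interface IntervalMetrics {
  successRate: number;
  durationMs: number;
}

type CanaryOutcome = "promoted" | "rolled_back" | "in_progress";

function analyzeCanary(
  config: CanaryConfig,
  intervals: IntervalMetrics[],
): { outcome: CanaryOutcome; weight: number } {
  let weight = 0;
  let failedChecks = 0;
  for (const m of intervals) {
    const healthy =
      m.successRate >= config.minSuccessRate && m.durationMs <= config.maxDurationMs;
    if (!healthy) {
      failedChecks += 1;
      if (failedChecks >= config.threshold) {
        return { outcome: "rolled_back", weight: 0 };
      }
      continue; // hold traffic at the current weight
    }
    weight = Math.min(weight + config.stepWeight, config.maxWeight);
    if (weight >= config.maxWeight) {
      return { outcome: "promoted", weight };
    }
  }
  return { outcome: "in_progress", weight };
}

// Values copied from the Canary spec above.
const checkoutCanary: CanaryConfig = {
  threshold: 5,
  maxWeight: 50,
  stepWeight: 10,
  minSuccessRate: 99,
  maxDurationMs: 500,
};
const healthyInterval: IntervalMetrics = { successRate: 99.8, durationMs: 220 };
console.log(analyzeCanary(checkoutCanary, [healthyInterval, healthyInterval]));
```

The point for developers is that the rollback decision is mechanical. Nobody has to be awake to make it.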

Async Communication Patterns That Work

Distributed teams live and die by their async communication. Meetings across timezones are expensive and often unnecessary. Here’s what replaced most of ours:

ADRs (Architecture Decision Records) for every significant decision. Not because we love documentation — because when your teammate in Spain wakes up, they need to understand WHY we chose SQS over Kafka, not just that we did.
Loom videos instead of meetings. A five-minute video walkthrough of a PR or a design doc. The receiver watches at 1.5x on their own schedule and responds in writing.
RFC documents with explicit deadlines. “Comments due by Thursday EOD UTC.” No ambiguity.
Daily async standups in Slack. Three lines: what I did, what I’m doing, any blockers. Posted when your day starts. Read when theirs does.

The rule of thumb: if the information needs to survive longer than the meeting, it should never have been a meeting.

The Deployment Confidence Problem

The scariest moment in distributed team operations is deploying on a Friday afternoon when the team that built the feature is about to go offline for the weekend.

We solved this with layers of confidence:

Feature flags — deploy code dark, activate when the team is online
Canary releases — automated traffic shifting with automatic rollback
Deployment windows — we deploy anytime, but activations happen during overlapping hours (our “golden window” is 10 AM - 1 PM UTC when all timezones are awake)
SNS notifications — every deployment triggers a notification to the team channel with commit summary, author, and rollback instructions

The goal isn’t to slow down deployments. It’s to make every deployment boring. Boring deployments mean your system is healthy.
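
Here is a minimal sketch of how the first and third layers can combine: code ships dark, and the flag only flips inside the golden window. The 10 AM to 1 PM UTC window is from the list above; the `FeatureFlag` shape and function names are hypothetical and not tied to any feature-flag product.

```typescript
// Golden window from the list above: 10 AM to 1 PM UTC.
const GOLDEN_WINDOW_START_HOUR_UTC = 10;
const GOLDEN_WINDOW_END_HOUR_UTC = 13;

// Hypothetical flag shape for illustration.
interface FeatureFlag {
  name: string;
  enabled: boolean;
}

function inGoldenWindow(now: Date): boolean {
  const hour = now.getUTCHours();
  return hour >= GOLDEN_WINDOW_START_HOUR_UTC && hour < GOLDEN_WINDOW_END_HOUR_UTC;
}

// Deploys can happen anytime; activation is refused outside the window.
function activate(flag: FeatureFlag, now: Date): FeatureFlag {
  if (!inGoldenWindow(now)) {
    return flag; // code stays deployed but dark
  }
  return { ...flag, enabled: true };
}

const newCheckout: FeatureFlag = { name: "new-checkout-flow", enabled: false };
console.log(activate(newCheckout, new Date("2024-05-10T21:00:00Z")).enabled);
console.log(activate(newCheckout, new Date("2024-05-10T11:30:00Z")).enabled);
```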

War Story: The Silent Queue

One Saturday, our order processing queue (SQS) stopped consuming messages. No errors. No alerts. Just… silence. Orders were piling up, but everything looked green on our dashboards.

The root cause? A Kubernetes node had been cordoned and drained during maintenance, which evicted the consumer pods, and the Horizontal Pod Autoscaler had a minimum replica count of zero. So when the pods were evicted, the HPA was perfectly happy with zero replicas processing zero messages.

What we changed after:

Minimum replica count for critical consumers is now 2. Always.
We added a “zero consumer” alert: if any critical SQS queue has zero active consumers for more than 2 minutes, page immediately.
Runbook updated same day.
Post-mortem shared across all teams, not just the affected one.
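
The "zero consumer" alert boils down to one condition. This is a sketch of that logic, assuming a hypothetical stream of `ConsumerSample` readings ordered oldest to newest; in practice this would probably live in your alerting tool's query language rather than in application code.

```typescript
// Hypothetical sample shape; samples are assumed to be ordered oldest to newest.
interface ConsumerSample {
  timestampMs: number;
  activeConsumers: number;
}

// The 2-minute limit comes from the alert described above.
const ZERO_CONSUMER_LIMIT_MS = 2 * 60 * 1000;

// Pages when the current unbroken run of zero-consumer samples has lasted more than the limit.
function shouldPage(samples: ConsumerSample[], nowMs: number): boolean {
  let zeroSince: number | null = null;
  // Walk backward from the newest sample to find where the current zero streak began.
  for (const sample of samples.slice().reverse()) {
    if (sample.activeConsumers > 0) {
      break;
    }
    zeroSince = sample.timestampMs;
  }
  return zeroSince !== null && nowMs - zeroSince > ZERO_CONSUMER_LIMIT_MS;
}

const readings: ConsumerSample[] = [
  { timestampMs: 0, activeConsumers: 3 },
  { timestampMs: 60_000, activeConsumers: 0 },
  { timestampMs: 120_000, activeConsumers: 0 },
];
console.log(shouldPage(readings, 200_000));
```

Note that this alerts on the absence of consumers rather than on errors, which is exactly the signal our green dashboards were missing.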

No blame. Just learning.

Building a Culture of Ownership, Not Blame

Blameless post-mortems aren’t just a nice idea. They’re a survival mechanism for distributed teams. Here’s why: if people fear blame, they hide problems. If they hide problems, problems grow. If problems grow across timezones, you find out when it’s already a disaster.

Our post-mortem template has one rule: no human names in the root cause section. The system failed. Not a person.

We focus on:

What controls were missing that allowed this to happen?
What signals did we miss and why?
What automation would have caught this before users noticed?

The result? Engineers proactively report near-misses. They flag “this feels fragile” without fear. And the whole team gets better because information flows freely.

Documentation Is a Production Dependency

I’ll say it plainly: if your system can’t be operated by someone who wasn’t in the room when it was built, your documentation is failing. And with distributed teams, someone is ALWAYS not in the room.

We treat documentation like we treat tests. It’s not optional. It’s not “nice to have.” It’s a production dependency. If you ship a new service without an updated README, operational runbook, and architecture diagram, your PR doesn’t get merged.
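
A merge gate like this can be a few lines in a CI step. The sketch below assumes hypothetical file paths for the README, runbook, and architecture diagram; substitute whatever your repositories actually use. It only checks that the files exist, so "updated" still needs a human reviewer.

```typescript
// Hypothetical file names; substitute whatever your repository uses.
const REQUIRED_DOCS: string[] = ["README.md", "docs/runbook.md", "docs/architecture.md"];

// Returns the required docs that are absent from the service's file list.
function missingDocs(serviceFiles: string[]): string[] {
  return REQUIRED_DOCS.filter((doc) => serviceFiles.indexOf(doc) === -1);
}

const newServiceFiles = ["README.md", "src/index.ts", "docs/runbook.md"];
const missing = missingDocs(newServiceFiles);
if (missing.length > 0) {
  console.log(`Blocking merge, missing docs: ${missing.join(", ")}`);
}
```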

Is it extra work? Yes. Does it pay for itself the first time someone in a different timezone needs to debug your service at 3 AM? Absolutely.

The Honest Truth

Running production systems with distributed teams is harder than doing it in a co-located office. Anyone who tells you otherwise is selling something. But it’s also possible, sustainable, and — when you get it right — incredibly rewarding.

The key isn’t fancy tools or complex processes. It’s building systems (both technical and human) that assume someone is always sleeping, someone is always debugging, and the documentation is always the first thing they’ll read.

Make that experience great, and everything else follows.
