Debugging Production Like a Detective
The worst bugs happen in production. They're non-reproducible, they affect real users, and your only evidence is whatever your logging infrastructure happened to capture. No breakpoints. No debugger. Just logs and timestamps.
At Avoca, I learned to debug production systems the hard way. AI voice agents making real phone calls, handling restaurant reservations in real time. When something broke, there was no "try again" — the caller was already gone. Here's the playbook I built.
structured logging is non-negotiable
The first time I tried to debug a production issue by searching free-text log messages in Datadog, I knew the approach was broken. I was searching "error processing reservation" across 200 log entries, each formatted slightly differently, each missing different context.
Structured logs fix this:
```typescript
logger.error("reservation_failed", {
  reservation_id: "res_abc123",
  restaurant_id: "rst_456",
  caller_phone: "+1XXXXXXXXXX",
  failure_reason: "slot_unavailable",
  attempted_time: "2025-11-15T19:30:00Z",
  latency_ms: 340,
  trace_id: "tr_789xyz",
});
```

Every field is queryable. I can find all reservation failures for a specific restaurant in a time window in one query. No regex. No parsing. Just structured data.
the trace ID pattern
Every inbound request gets a trace ID. That ID propagates through every function call, every service boundary, every log line. When something fails, I search for the trace ID and see the entire journey of that request.
```typescript
function createTraceContext() {
  const traceId = `tr_${crypto.randomUUID().slice(0, 12)}`;
  return {
    traceId,
    log: (event: string, data: Record<string, unknown>) => {
      logger.info(event, { ...data, trace_id: traceId });
    },
  };
}
```

At Avoca this was critical. A single phone call might touch the voice transcription service, the NLU pipeline, the reservation API, and the confirmation system. Without a trace ID tying them together, correlating logs across four services was impossible.
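Propagation itself is mundane: the trace ID rides along with every outbound call, and the receiving service attaches it to its own log lines. Here's a minimal sketch of one way to do that over HTTP. The `x-trace-id` header name and the `callService` helper are my own illustration, not a standard and not Avoca's actual code:

```typescript
// Hypothetical helper: forwards the trace ID to a downstream service via a
// header, so the receiving service can log with the same ID.
async function callService(
  ctx: { traceId: string; log: (event: string, data: Record<string, unknown>) => void },
  url: string,
  body: unknown,
): Promise<unknown> {
  ctx.log("downstream_request", { url });
  const res = await fetch(url, {
    method: "POST",
    headers: {
      "content-type": "application/json",
      "x-trace-id": ctx.traceId, // assumed header name; any consistent key works
    },
    body: JSON.stringify(body),
  });
  ctx.log("downstream_response", { url, status: res.status });
  return res.json();
}
```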
reproducing the unreproducible
Production bugs are often timing-dependent, data-dependent, or both. The trick is to reconstruct the exact conditions from logs.
My checklist when a bug report comes in:
- Find the trace. Get the trace ID from the error log. Pull every log line for that trace. Read them chronologically (see the sketch after this list).
- Find the divergence. Compare the failing trace to a successful one for the same operation. Where do they diverge? That's your investigation starting point.
- Check the inputs. What data was the function working with? If you logged the inputs (you should), you can reproduce locally with the exact same data.
- Check the timing. Was this a race condition? Look at timestamps. Were two operations closer together than expected? Was something slower than usual?
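To make the first step concrete, here's a rough sketch of pulling one trace out of a dump of newline-delimited JSON logs and reading it in order. The file format, field names, and `readTrace` helper are illustrative; in practice this is usually a saved query in your log tool rather than a script:

```typescript
import { readFileSync } from "node:fs";

// Hypothetical: logs exported as newline-delimited JSON, one object per line,
// each carrying trace_id and timestamp fields.
function readTrace(logFile: string, traceId: string): Record<string, unknown>[] {
  return readFileSync(logFile, "utf8")
    .split("\n")
    .filter(Boolean)
    .map((line) => JSON.parse(line) as Record<string, unknown>)
    .filter((entry) => entry.trace_id === traceId)
    .sort((a, b) => String(a.timestamp).localeCompare(String(b.timestamp)));
}
```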
One bug at Avoca: a restaurant's availability check was returning stale data, but only for a specific restaurant, only on Tuesday evenings. The logs showed the cache TTL was 5 minutes, but this restaurant updated their availability every 3 minutes during dinner rush. The cache was serving stale slots. Fix: per-restaurant cache TTL based on their update frequency.
Without structured logs with restaurant_id and cache_hit/miss fields, that would have taken days to find.
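The shape of that fix is easy to sketch. The names below (`updateFrequencyMs`, `getAvailability`) are hypothetical, not Avoca's code; the point is that the TTL comes from each restaurant's own update cadence instead of one global constant, and that every lookup logs `restaurant_id` and `cache_hit`:

```typescript
interface Restaurant {
  id: string;
  // How often this restaurant updates its availability, if known.
  updateFrequencyMs?: number;
}

const DEFAULT_TTL_MS = 5 * 60 * 1000;
const availabilityCache = new Map<string, { slots: string[]; fetchedAt: number }>();

// TTL derived from the restaurant's own update cadence, expiring a little
// before the next expected update so stale slots don't get served.
function ttlMsFor(restaurant: Restaurant): number {
  return restaurant.updateFrequencyMs
    ? Math.floor(restaurant.updateFrequencyMs * 0.8)
    : DEFAULT_TTL_MS;
}

async function getAvailability(
  restaurant: Restaurant,
  fetchFresh: (id: string) => Promise<string[]>,
  log: (event: string, data: Record<string, unknown>) => void,
): Promise<string[]> {
  const cached = availabilityCache.get(restaurant.id);
  const hit = !!cached && Date.now() - cached.fetchedAt < ttlMsFor(restaurant);
  log("availability_cache", { restaurant_id: restaurant.id, cache_hit: hit });
  if (cached && hit) return cached.slots;
  const slots = await fetchFresh(restaurant.id);
  availabilityCache.set(restaurant.id, { slots, fetchedAt: Date.now() });
  return slots;
}
```

The 0.8 factor is arbitrary; the real requirement is only that the cache expires before the restaurant's next expected update.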
the debugging story that sticks
We had a bug where the AI agent would occasionally say the wrong restaurant name during a call. Not often — maybe 1 in 200 calls. No pattern in the error logs because technically nothing was erroring.
I added a log line comparing the restaurant name in the initial context vs. the name in the spoken transcript. Found the issue in 30 minutes: when a caller was transferred between two restaurants (asking about one, getting redirected to another), the context object was mutated in place. The second restaurant's name overwrote the first one in the shared context.
The fix was making the context object immutable — creating a new copy for each restaurant instead of modifying the original. Classic mutation bug. Classic production-only symptom.
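A stripped-down version of the bug, with hypothetical names, looks something like this. The buggy path reuses one shared context object, so a transfer overwrites the name the agent is still speaking from; the fix builds a fresh copy per restaurant:

```typescript
interface CallContext {
  restaurantName: string;
  // ...other per-call state
}

// Buggy version (illustrative): mutates the shared context in place, so a
// transfer clobbers the restaurant name mid-call.
function transferBuggy(ctx: CallContext, newRestaurantName: string): CallContext {
  ctx.restaurantName = newRestaurantName;
  return ctx;
}

// Fixed version: the original context is left untouched; each restaurant
// gets its own copy.
function transferFixed(ctx: Readonly<CallContext>, newRestaurantName: string): CallContext {
  return { ...ctx, restaurantName: newRestaurantName };
}
```

`Readonly` doesn't prevent mutation at runtime, but it makes the buggy assignment fail to compile, which is usually enough to keep the bug from coming back.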
the minimal debugging toolkit
You don't need a full observability platform on day one. Start with:
- Structured JSON logs — not `console.log("something went wrong")`
- Trace IDs on every request — propagated through the entire call chain
- Request/response logging at service boundaries — what went in, what came out
- Error context — not just the error message, but the state that caused it (see the sketch after this list)
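The last two items deserve one concrete sketch: a hypothetical wrapper around a service-boundary call that logs what went in, what came out, and, when it fails, the state that caused the error. Every name here is illustrative:

```typescript
// Hypothetical boundary wrapper: logs the request, the response, and, on
// failure, the inputs that produced the error, not just its message.
async function checkAvailability(
  ctx: { traceId: string; log: (event: string, data: Record<string, unknown>) => void },
  input: { restaurantId: string; partySize: number; time: string },
  query: (input: { restaurantId: string; partySize: number; time: string }) => Promise<string[]>,
): Promise<string[]> {
  ctx.log("availability_request", { ...input });
  try {
    const slots = await query(input);
    ctx.log("availability_response", { restaurant_id: input.restaurantId, slot_count: slots.length });
    return slots;
  } catch (err) {
    ctx.log("availability_failed", {
      ...input,
      error: err instanceof Error ? err.message : String(err),
    });
    throw err;
  }
}
```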
That's it. Four things. They'll solve 80% of your debugging problems before you need any fancy tooling.
The remaining 20% is where Datadog, Sentry, or whatever else sits in your observability stack comes in. But the foundation is always the same: structured data with correlation IDs. Everything else is a query layer on top.