Platform Engineering

Failure-injection as a self-training methodology for platform engineers

How I use Claude Code as a chaos engineering partner to manufacture debugging reps on my own system — and what a silent notification drop taught me about where alerting really needs to live.

Eyal Levi

May 14, 2026


The dashboard was green. App Insights showed no exceptions. Event Hub consumer lag was flat. Cosmos writes were flowing normally. The GuardianLink platform — an Azure IoT system I built to monitor connected devices and alert operators when a vehicle crash is detected — looked healthy in every direction I pointed the cursor.

I ran the device simulator. Crash events flooded in. The classifier picked them up, scored them, emitted crash_confirmed events. The telemetry writer logged them to Cosmos. Everything worked.

And yet something felt off.

The itch

There was no alert. Nothing on screen told me to look closer. What caught my attention was the notification count, or rather the absence of a number I'd come to expect from watching this system run. The simulator was sending more events than usual, but the notification count wasn't climbing at the rate I'd seen before. Not dramatically wrong. Just quieter than it should have been.

That's the debugging signal no alert can produce: pattern recognition built from time spent watching a system behave normally. The only way to develop it is reps.

Tracing the thread

I opened App Insights and started tracing. The crash_confirmed custom events were arriving as expected — correctly tagged with device IDs and confidence scores. But when I pulled up notification_sent events alongside them, the counts diverged.

The query that made it visible:

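// Crash events the classifier confirmed in the last hour.
let confirmed = customEvents
    | where timestamp > ago(1h)
    | where name == "crash_confirmed"
    | project event_id = tostring(customDimensions["event_id"]), ts = timestamp;
// Notifications the notifier reported sending in the same window.
let sent = customEvents
    | where timestamp > ago(1h)
    | where name == "notification_sent"
    | project event_id = tostring(customDimensions["event_id"]);
// leftanti keeps only the confirmed events that have no matching notification.
confirmed
| join kind=leftanti sent on event_id
| project event_id, ts
| order by ts desc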

Crash events confirmed. No matching notification. A growing list of them.

I picked one event_id and traced it through the dependency chain in App Insights. The notifier function had been invoked. It had returned success. But there was no outbound SendGrid dependency call in the trace — the function had completed without making a network request.

I opened the notifier code. There it was:

try:
    response = sendgrid_client.send(message)
except Exception:
    pass

An except Exception: pass wrapped around the SendGrid call. No log. No metric. No re-raise. Any failure on the SendGrid side (a 5xx, a timeout, anything) would be swallowed silently. The function would report success to the host, the notification_sent event would never be emitted, and the system would look healthy.

The realization

My first reaction was to fix the exception handler. My second was to stop and think about what finding it meant.

Telemetry ingestion, event classification, and storage are all infrastructure. The pipeline exists to serve one purpose: getting a notification to the right person when a vehicle crash is detected. That notification is the product. A crash event that never becomes an alert means the system failed at the only moment that actually mattered.

And yet the most critical path in the system had no dedicated alerting. No alert fired when notification_sent fell behind crash_confirmed. No SLO tracked notification delivery rate. No runbook described what to do if the notifier went silent. Everything downstream of the classifier was effectively unwatched.

The infrastructure looked healthy because the infrastructure was healthy. The product was broken.

How I found it

Here I have to be honest about the methodology, because I didn't find this through heroic debugging instinct. I found it because I was playing a game.

A few weeks earlier, I had built a failure catalog for GuardianLink — a structured list of plausible failure scenarios, each designed to be injectable into the running system. The catalog is managed by Claude Code, which acts as the chaos engineering partner: it picks a failure from the list, injects it without telling me what it changed, and records the details in a gitignored state file I'm not allowed to read until I've identified the root cause myself.

This was failure #1 from the catalog. I didn't know that while I was debugging. I just knew something was wrong and that I had to find it starting from symptoms, not diffs.

Each entry in the catalog follows the same format: a description of what the system will look like from the outside during the failure, an implementation hint for the AI (how to introduce the fault believably), and a definition of what "finding the root cause" means — not just naming the symptom, but tracing it to the specific code or configuration responsible.
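A minimal sketch of how such an entry and the hidden state file could be modeled. The field names, the FailureScenario class, and the record_active_failure helper are hypothetical and chosen for illustration; they are not taken from the GuardianLink repo.

```python
import json
import random
from dataclasses import asdict, dataclass
from pathlib import Path


@dataclass(frozen=True)
class FailureScenario:
    """One catalog entry (hypothetical field names)."""
    scenario_id: int
    outside_view: str           # what the system looks like during the failure
    implementation_hint: str    # how to introduce the fault believably
    root_cause_definition: str  # what counts as "finding the root cause"


def record_active_failure(catalog, state_path, rng=None):
    """Pick a scenario and write it to the state file only.

    Returns None on purpose, so the caller learns nothing about the choice.
    """
    rng = rng or random.Random()
    scenario = rng.choice(catalog)
    Path(state_path).write_text(json.dumps(asdict(scenario), indent=2))
    return None
```

The state file path would need to be listed in .gitignore for the blind to hold; actually introducing the fault is left out of the sketch.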

Rules of the game

The discipline only works if the rules are followed. From the catalog:

For Claude Code: Never reveal the active failure in chat. Write it to the state file only. Never add comments in committed code that hint at the injected fault. The fault must be plausible — the kind of thing that happens in real systems, not a contrived mess. Prefer faults that manifest as symptoms distant from the cause. That's the interview skill. After the user identifies the root cause, do a short post-mortem together: what signals would have caught it faster? Is there a test or alert worth adding?

For the user: Time yourself. Write down the timestamp when you start debugging. Write it down again when you find the cause. Do not read code diffs first. Start from symptoms — alerts, dashboards, logs. Keep a debugging journal. What did you check? What threw you off? What would have caught it sooner?
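The timing and journal rules can be kept with something as small as the sketch below. DebugSession and its fields are hypothetical names; a paper notebook does the same job.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List, Optional


@dataclass
class DebugSession:
    """One timed debugging session with a running list of what was checked."""
    started_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))
    found_at: Optional[datetime] = None
    checked: List[str] = field(default_factory=list)

    def note(self, what: str) -> None:
        # Record each thing looked at, in order, including dead ends.
        self.checked.append(what)

    def mark_found(self, when: Optional[datetime] = None) -> None:
        # Stamp the moment the root cause was identified.
        self.found_at = when or datetime.now(timezone.utc)

    def minutes_to_root_cause(self) -> Optional[float]:
        # None until the root cause has been marked as found.
        if self.found_at is None:
            return None
        return (self.found_at - self.started_at).total_seconds() / 60
```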

The rule about symptoms distant from the cause is the important one. A fault that produces an obvious error in an obvious place teaches you almost nothing. The skill that comes up in senior engineering interviews — walk me through how you diagnosed an incident — requires the ability to navigate from a downstream symptom to an upstream cause across system boundaries. That's what the catalog is designed to train.

Why this works

Building a system teaches you how it behaves when it's working. It doesn't teach you how it fails.

You can read documentation about distributed systems failure modes. You can study public post-mortems. Both are valuable. But neither builds the pattern recognition that lets you notice a notification count that's slightly quieter than expected, without an alert telling you to look.

That recognition comes from reps — from having debugged the same class of failure before, from knowing what normal looks like well enough to notice deviation. The failure injection methodology is a way to manufacture those reps deliberately, on your own system, on your own schedule, without waiting for a production incident to do the teaching.

The debugging journal is the other half. Writing down what you checked, what threw you off, and what would have caught it sooner converts each session from an experience into a lesson. After failure #1 I had two concrete observations: I'd spent too long looking at infrastructure metrics before checking business-level event counts, and there was no alert that would have caught this without a manual check. Both have direct action items.

What changed — and what hasn't yet

The immediate fix was straightforward: proper exception handling on the SendGrid call, structured logging on failure, and re-raising so the function host could retry with appropriate backoff.
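A sketch of what that handler could look like, not the code that shipped. It assumes the client exposes send(message) and that the response carries a status_code attribute; the send_notification wrapper, the logger name, and the log message names are illustrative.

```python
import logging

logger = logging.getLogger("notifier")


def send_notification(client, message, event_id):
    """Send one notification; log and re-raise on any failure."""
    try:
        response = client.send(message)
    except Exception:
        # Log with the event_id so the failure can be matched to its
        # crash_confirmed event, then re-raise so the host sees the failure
        # and its retry policy can apply.
        logger.exception("notification_failed", extra={"event_id": event_id})
        raise

    status = getattr(response, "status_code", None)
    if status is not None and status >= 400:
        # Treat an error status as a failure even if the client did not raise.
        logger.error(
            "notification_rejected",
            extra={"event_id": event_id, "status_code": status},
        )
        raise RuntimeError(f"notification rejected for {event_id}: status {status}")

    logger.info("notification_sent", extra={"event_id": event_id})
    return response
```

The key design choice is that the function can no longer return normally unless the send went through, so a success reported to the host means a notification actually left.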

The bigger fix, a dedicated alert on the divergence between crash_confirmed and notification_sent counts, is on the backlog but not yet shipped. That is the honest status. The methodology surfaces gaps; closing them is ongoing work. The point of the session wasn't to achieve a green dashboard faster. It was to learn that the gap existed and understand why it had gone unnoticed.
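The logic behind such an alert is small. The sketch below is one possible shape for it, not the backlog item itself: it takes the event IDs from both event streams for the same time window and flags when the delivery rate drops. The function names and the 0.99 threshold are assumptions.

```python
def delivery_gap(confirmed_ids, sent_ids):
    """Return (missing event IDs, delivery rate) for one time window."""
    confirmed = set(confirmed_ids)
    missing = confirmed - set(sent_ids)
    # With no confirmed crashes there is nothing to deliver, so the rate is 1.0.
    rate = 1.0 if not confirmed else 1 - len(missing) / len(confirmed)
    return missing, rate


def should_alert(confirmed_ids, sent_ids, min_delivery_rate=0.99):
    """True when too many confirmed crashes have no matching notification."""
    _, rate = delivery_gap(confirmed_ids, sent_ids)
    return rate < min_delivery_rate
```

A real alert would also need a grace period for events confirmed only seconds ago, which have not had time to produce a notification yet.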

Try it yourself

The failure catalog for GuardianLink is in the repo (docs/failure-scenarios.md). The rules of the game are public. The methodology scales to any system: you need a working platform, a list of plausible failure modes specific to your stack, and the discipline not to read the state file until you've found the root cause.

The catalog currently has eight scenarios — silent drops, partition hot spots, cascading retries, bad deploys with subtle regressions, key rotation timing windows, poison messages, cost explosions, and auth misconfigurations that only a proactive security review will catch. Each one is a class of failure that happens in real Azure systems. Each one has taught me something I couldn't have learned by building.

The infrastructure builds itself. The operational knowledge has to be earned.