Chaos engineering is easy to trivialize.
If all you hear is "randomly kill servers in production," it sounds irresponsible. If all you do is shut down one pod in staging, it sounds performative.
The useful middle ground is this:
Test the failure modes your architecture claims to tolerate.
The Right Question
If the platform says it is resilient across regions, availability zones, or replicas, you should be able to answer:
- what detects failure?
- what triggers failover?
- what state is lost or delayed?
- how does traffic shift?
- how do operators know whether it worked?
If those answers only exist on a diagram, the system is not proven yet.
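The first two questions, detection and triggering, can be made concrete with a small sketch. This is a hypothetical illustration, not a real platform's API: `probe_ok` stands in for whatever health signal you actually have (load-balancer checks, replication heartbeats), and the threshold logic is the simplest possible detector.

```python
from dataclasses import dataclass

# Minimal sketch of a failure detector: declare the primary down only after
# several consecutive failed probes, to avoid failing over on a single blip.
# All names here are illustrative placeholders.

@dataclass
class Detector:
    threshold: int = 3   # consecutive failed probes before triggering failover
    failures: int = 0

    def observe(self, probe_ok: bool) -> bool:
        """Record one probe result; return True when failover should trigger."""
        self.failures = 0 if probe_ok else self.failures + 1
        return self.failures >= self.threshold
```

Even a toy like this forces the questions above into the open: what counts as a probe, who runs it, and what happens on the third failure.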
Why Game Days Work Better Than Randomness
Structured failure exercises are usually more valuable than blind disruption.
They let you:
- define a hypothesis
- choose a safe blast radius
- capture metrics before and after
- learn something actionable
That is a much better engineering loop than "break things and hope."
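The loop above can be sketched as a small harness. `inject_fault`, `revert_fault`, and `snapshot_metrics` are placeholders for your own tooling (a fault-injection CLI, a metrics API); only the structure of the experiment is the point.

```python
import time

def run_experiment(hypothesis, inject_fault, revert_fault, snapshot_metrics):
    """Run one game-day experiment: record metrics, inject, record, revert."""
    before = snapshot_metrics()
    start = time.monotonic()
    try:
        inject_fault()              # constrained blast radius: one dependency, one AZ
        after = snapshot_metrics()
    finally:
        revert_fault()              # always restore steady state, even on error
    return {
        "hypothesis": hypothesis,
        "duration_s": time.monotonic() - start,
        "metrics_before": before,
        "metrics_after": after,
    }
```

The `finally` block is the safety property: the experiment ends with the fault reverted no matter what the injection did.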
Start With Real Dependencies
For most systems, the first useful experiments are not dramatic regional takedowns. They are targeted failures such as:
- database primary unavailable
- queue backlog growth
- dependency timeout increase
- DNS or service-discovery failure
- one region marked unhealthy
These tests reveal whether failover logic is real or just assumed.
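For the "dependency timeout increase" case, one common technique is wrapping the dependency call with injected latency. This is a hypothetical sketch, not a library API: it delays a fraction of calls so you can observe whether callers' timeouts, retries, and circuit breakers actually engage.

```python
import random
import time

def with_latency(call, extra_seconds=2.0, ratio=0.5, rng=random.random):
    """Return a wrapper around `call` that delays roughly `ratio` of invocations."""
    def wrapped(*args, **kwargs):
        if rng() < ratio:
            time.sleep(extra_seconds)   # simulate a slow dependency
        return call(*args, **kwargs)
    return wrapped
```

Wrapping at the client side like this keeps the blast radius to one caller; proxies or service meshes can do the same at the network layer.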
The Goal Is Confidence, Not Theater
Chaos engineering is valuable when it turns resilience claims into observed behavior.
If the experiment teaches you:
- which alert fires
- how long recovery takes
- whether the system degraded safely
then it did its job.
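Those observations are most useful as numbers you can compare across repeated game days. A small sketch, with hypothetical field names, of the two durations most worth tracking:

```python
def recovery_report(t_fault, t_alert, t_recovered):
    """Summarize one experiment from three timestamps (seconds since epoch)."""
    return {
        "time_to_detect_s": t_alert - t_fault,       # how long until an alert fired
        "time_to_recover_s": t_recovered - t_fault,  # how long until behavior was safe again
    }
```

If these numbers shrink over successive exercises, the resilience claim is becoming observed behavior rather than a diagram.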