Designing systems for 'unknown unknowns'

January 22, 2026

Intro

In large-scale distributed systems, anything can fail at some point. As part of system design, engineers are expected to list the anticipated failure modes and describe how their design mitigates and recovers from them. We then cover the failure modes we know about with automated tests.

However, that only covers what we know. Once the system is deployed into the wild, something will still fail in unexpected ways: failure is inevitable because of unknown unknowns.

Measuring resilience

Once we acknowledge that we cannot prevent all failures with tests, what matters for these failures is how fast we recover from them. There are standard metrics that measure this more formally:

  • MTTD (mean time to detection): how quickly do we find out something failed?
  • MTTA (mean time to acknowledgement): how quickly do we start responding?
  • MTTR (mean time to recovery): how quickly do we recover?

For example,

\mathrm{MTTD} = \frac{\sum_{i=1}^{N} \left( t^{(i)}_{\text{detection}} - t^{(i)}_{\text{start}} \right)}{N}

where N is the number of incidents, t^{(i)}_{\text{detection}} is the detection time of incident i, and t^{(i)}_{\text{start}} is its start time.

The smaller these numbers are, the more resilient your system is to failures.
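As a rough illustration, here is a minimal sketch of how these averages could be computed from incident records. The Incident structure and its timestamp fields are hypothetical (they would come from your incident tracker), and treating MTTA as the gap between detection and acknowledgement is just one common convention.

    from dataclasses import dataclass
    from datetime import datetime
    from statistics import mean

    @dataclass
    class Incident:
        # Hypothetical incident record; timestamps come from your incident tracker.
        started_at: datetime       # when the failure actually began
        detected_at: datetime      # when monitoring (or a human) noticed it
        acknowledged_at: datetime  # when someone started responding
        recovered_at: datetime     # when the impact was mitigated or resolved

    def minutes(delta) -> float:
        return delta.total_seconds() / 60

    def resilience_metrics(incidents: list[Incident]) -> dict[str, float]:
        """Return MTTD, MTTA, and MTTR in minutes, averaged over all incidents."""
        return {
            "MTTD": mean(minutes(i.detected_at - i.started_at) for i in incidents),
            "MTTA": mean(minutes(i.acknowledged_at - i.detected_at) for i in incidents),
            "MTTR": mean(minutes(i.recovered_at - i.started_at) for i in incidents),
        }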

How do we make these numbers smaller?

How do we improve?

As mentioned above, unknown unknowns are inevitable. We cannot forecast them upfront, so a system's resilience to these failures comes from how fast we detect, acknowledge, and recover when something breaks.

Fault injection

One way to deal with unknown unknowns is fault injection, in which engineers intentionally introduce failures into a system and observe how it behaves.

Different companies take different approaches to injecting failures into their systems. Netflix developed Chaos Monkey, a tool that randomly terminates production servers to force services to tolerate server failures.

At AWS, this idea is applied through a game day process. A game day is an exercise where failures are simulated and teams respond using the same tools and processes they would use during a real incident, exposing gaps in detection, response, and recovery.

This is directly related to the metrics discussion. During a game day, you can measure how long it takes before someone notices something is wrong (MTTD), how long it takes before someone actively starts responding (MTTA), and how long it takes to mitigate or recover (MTTR).

Fault injection in practice

For example, in one of the game days I led, I introduced an artificial delay in a database client, which caused messages to pile up in a message queue. Error rates did not change, and the system looked “healthy” from the dashboards. After a few hours, the issue was eventually noticed through a report from a dependent team whose data was no longer getting refreshed.
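The delay itself was nothing fancy. Below is a minimal sketch of that kind of injection, assuming a hypothetical database client that exposes a query method; the class and parameter names are illustrative, not the actual client from that game day.

    import random
    import time

    class SlowDatabaseClient:
        """Wraps a database client and adds artificial latency to its calls.

        `inner` is any object exposing a query() method; the names here are
        illustrative rather than a specific driver's API.
        """

        def __init__(self, inner, delay_seconds=5.0, probability=1.0):
            self.inner = inner
            self.delay_seconds = delay_seconds
            self.probability = probability  # inject on only a fraction of calls if desired

        def query(self, *args, **kwargs):
            if random.random() < self.probability:
                time.sleep(self.delay_seconds)  # the injected fault: extra latency
            return self.inner.query(*args, **kwargs)

    # During the game day, the real client is swapped for the wrapped one, and you
    # watch whether dashboards, alarms, and on-call actually notice the slowdown.
    # db = SlowDatabaseClient(real_db_client, delay_seconds=5.0)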

That game day showed that the team would lose most of its time in detection (poor MTTD) and acknowledgement (poor MTTA), which in turn meant a poor MTTR. As a result, we took action items based on what we learned: we added an alarm on the age of the oldest message in the queue, turning the unknown unknown into a known unknown. The next time there is high database latency in production, the system will detect it early.
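If the queue happens to be Amazon SQS (an assumption for this sketch, not a detail from the incident), that alarm can be expressed against the built-in ApproximateAgeOfOldestMessage metric in CloudWatch. Queue name, threshold, and SNS topic are placeholders to tune for your own traffic.

    import boto3

    cloudwatch = boto3.client("cloudwatch")

    # Alarm when the oldest message in the queue has been sitting there too long,
    # which is exactly the symptom of a slow or stuck consumer.
    cloudwatch.put_metric_alarm(
        AlarmName="orders-queue-oldest-message-age",          # placeholder name
        AlarmDescription="Messages are piling up: consumers are slow or stuck",
        Namespace="AWS/SQS",
        MetricName="ApproximateAgeOfOldestMessage",
        Dimensions=[{"Name": "QueueName", "Value": "orders-queue"}],  # placeholder queue
        Statistic="Maximum",
        Period=60,                 # evaluate the metric every minute
        EvaluationPeriods=5,       # must breach for 5 consecutive minutes
        Threshold=900,             # 15 minutes, in seconds
        ComparisonOperator="GreaterThanThreshold",
        AlarmActions=["arn:aws:sns:us-east-1:123456789012:oncall"],   # placeholder topic
    )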

Each game day surfaces gaps, and fixing them improves how the system responds next time.

Voilà, you have just improved your MTTD, MTTA, and MTTR!

This was originally a series of posts in my Telegram channel.

