Issue №7 · June 23, 2026

The Ratio

A weekly newsletter on reliability economics

The Number

24 of 32

Three in four financial services organizations are under-investing in reliability prevention — the worst rate of any measured industry, and a direct contradiction of the assumption that regulation drives prevention spend.

24 of 32 financial services organizations in our data are classified as under-investing in reliability prevention. Only 2 of 32 are over-investing. Their failure costs run at more than 4x their prevention spend. In technology, the split is nearly even: 29 under-investing, 27 over-investing out of 57.
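For readers who want to check the arithmetic, the shares follow directly from the counts above. Below is a minimal Python sketch. The "neither" bucket is simply what is left after the two reported buckets, and we are assuming it holds organizations classified as neither under- nor over-investing. The function and variable names are ours.

```python
from fractions import Fraction

# Counts as reported above: (under-investing, over-investing, total organizations).
SPLITS = {
    "financial services": (24, 2, 32),
    "technology": (29, 27, 57),
}


def shares(under: int, over: int, total: int) -> dict[str, float]:
    """Each bucket's share of the industry total, in percent.

    'neither' is the remainder after the two reported buckets, assumed here
    to be organizations classified as neither under- nor over-investing.
    """
    neither = total - under - over
    return {
        "under": 100 * under / total,
        "over": 100 * over / total,
        "neither": 100 * neither / total,
    }


for industry, (under, over, total) in SPLITS.items():
    s = shares(under, over, total)
    print(
        f"{industry}: {Fraction(under, total)} under-investing "
        f"({s['under']:.1f}%), {s['over']:.1f}% over, {s['neither']:.1f}% neither"
    )
```

Financial services reduces to exactly 3/4, which is the "three in four" figure. Technology comes out at roughly 51% under-investing against 47% over-investing.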

The Ratio Take: Firefighting

Financial services carries more exposure from outages than any other industry: regulatory, financial, reputational. You'd assume compliance pressure alone would push these organizations toward over-investing in prevention. The data says the opposite. FinServ is the industry most likely to be absorbing massive failure costs while running lean on the spend that would prevent them. This is the equivalent of an insurance company writing policies against flood damage while refusing to reinforce the levee. The premium income looks fine until the water arrives.

The most regulated industry in the benchmark is also the most under-invested in preventing the failures that trigger those regulations.

This Week in Reliability

Runtime Truth Goes Production

AI agents and autonomous tooling are forcing a fundamental shift: the old CI/CD + observability stack can't see what's actually executing in production until it's already failed. Runtime instrumentation is becoming the new control plane.

Deep Reads

Why CI/CD Pipelines Miss Runtime Failures

Lightrun · Primary evidence—CI/CD gap

CI/CD pipelines validate code through static analysis and tests. They miss runtime failures, such as reflection-based type mismatches, that only surface during execution. Lightrun's MCP (Model Context Protocol) integration connects AI coding assistants to live production state, including variables, call stacks, and execution counts, without a redeploy.
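This failure class is easy to reproduce. The Python sketch below is minimal and hypothetical: the handler, method names, and payloads are ours and do not come from Lightrun's article. A handler is looked up by name with `getattr`, the test feeds it the payload shape the author expected, and production then sends a slightly different shape. Static checks and the test both pass, and the mismatch only appears once the code runs.

```python
import json


class RefundHandler:
    """Hypothetical handler; names are illustrative only."""

    FEE_CENTS = 30

    def handle_refund(self, amount_cents: int) -> int:
        # Annotated as int, and every unit test passes an int.
        return amount_cents + self.FEE_CENTS


def dispatch(handler: object, raw_event: str) -> int:
    event = json.loads(raw_event)
    # Reflection: the method name is built at runtime. To a static type checker
    # both the getattr result and the JSON payload are untyped (Any), so nothing
    # flags a string arriving where handle_refund expects an int.
    method = getattr(handler, f"handle_{event['type']}")
    return method(event["amount"])


# What the test suite exercises: the payload shape the author expected.
assert dispatch(RefundHandler(), '{"type": "refund", "amount": 1250}') == 1280

# What production sends after an upstream change: the amount as a string.
try:
    dispatch(RefundHandler(), '{"type": "refund", "amount": "12.50"}')
except TypeError as exc:
    print(f"runtime failure: {exc}")
```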

The Ratio Take: Prevention

This is the gap AI code generators are about to make catastrophic. When agents ship dozens of changes per day, the 'compiles clean, tests pass, fails in prod' cycle becomes your entire operational budget. Runtime observability isn't a nice-to-have anymore. It's the only way to instrument what's actually running before customers find the failures.

Green pipelines hide what reflection does at runtime.

The next era of software needs runtime control

LaunchDarkly · Vendor response—agent control

LaunchDarkly is launching AgentControl to manage not just feature flags in production code but also the AI agents acting autonomously on behalf of engineering teams. The solution extends runtime control from static deployments to dynamic agent behavior.
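To make "runtime control of agent behavior" concrete, here is a minimal Python sketch of a kill switch that is checked each time an agent acts, not once at deploy time. `RuntimeFlags`, `run_agent_action`, and the flag-key scheme are hypothetical stand-ins. They are not LaunchDarkly's SDK or AgentControl's API.

```python
import threading
from typing import Callable


class RuntimeFlags:
    """Hypothetical in-process flag store standing in for a real flag service."""

    def __init__(self) -> None:
        self._values: dict[str, bool] = {}
        self._lock = threading.Lock()

    def set(self, key: str, value: bool) -> None:
        with self._lock:
            self._values[key] = value

    def enabled(self, key: str) -> bool:
        # Unknown keys are denied: an agent action nobody approved does not run.
        with self._lock:
            return self._values.get(key, False)


def run_agent_action(flags: RuntimeFlags, agent: str, action: str,
                     execute: Callable[[], str]) -> str:
    # Checked at execution time, not deploy time, so flipping the flag
    # stops the very next action without a redeploy.
    key = f"agent.{agent}.{action}"
    if not flags.enabled(key):
        return f"blocked: {key}"
    return execute()


flags = RuntimeFlags()
flags.set("agent.deploy-bot.merge_pr", True)
print(run_agent_action(flags, "deploy-bot", "merge_pr", lambda: "merged"))  # merged
flags.set("agent.deploy-bot.merge_pr", False)  # an operator pulls the switch
print(run_agent_action(flags, "deploy-bot", "merge_pr", lambda: "merged"))  # blocked: agent.deploy-bot.merge_pr
```

Denying unknown keys by default is a deliberate choice in this sketch. A new agent capability stays off until someone turns it on.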

The Ratio Take: Prevention

Feature flags were the first runtime control plane; now we need a second one for the autonomous systems shipping the flags. The economics are brutal: agents move faster than approval workflows, so you either instrument them at runtime or discover their mistakes at customer scale. This is what 'shift left' looks like when left keeps moving.

Control what agents do, not just what code does.

Why Deterministic AI Engineering Requires Runtime Truth

Lightrun · Agent-specific runtime requirements

AI agents need runtime sensors to ground their work in live production truth, instrumenting and querying running systems on demand without restarts or redeploys.
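Here is one way to picture "instrumenting a running system on demand". In a dynamic language you can swap a live function for a wrapper that counts executions and snapshots arguments, then swap it back, all without restarting the process. The Python sketch below illustrates that idea under stated assumptions. `Probe`, `attach`, and the `service` object are hypothetical names, and this is not a description of how Lightrun implements its sensors.

```python
import functools
import types
from typing import Any, Callable, Tuple


class Probe:
    """Hypothetical runtime probe: counts calls and records the latest arguments."""

    def __init__(self) -> None:
        self.count = 0
        self.last_args: Tuple[Any, ...] = ()

    def wrap(self, fn: Callable[..., Any]) -> Callable[..., Any]:
        @functools.wraps(fn)
        def wrapper(*args: Any, **kwargs: Any) -> Any:
            self.count += 1        # execution count
            self.last_args = args  # snapshot of the latest call's arguments
            return fn(*args, **kwargs)
        return wrapper


def attach(target: Any, name: str, probe: Probe) -> Callable[[], None]:
    """Swap target.name for a probed wrapper; return a function that undoes it."""
    original = getattr(target, name)
    setattr(target, name, probe.wrap(original))

    def detach() -> None:
        setattr(target, name, original)

    return detach


# A stand-in for a live service module (hypothetical).
service = types.SimpleNamespace(price_for=lambda sku: {"A": 10, "B": 25}[sku])


def handle_request(sku: str) -> int:
    # Looks the function up on every call, so a swapped-in wrapper takes effect at once.
    return service.price_for(sku)


probe = Probe()
detach = attach(service, "price_for", probe)
handle_request("A")
handle_request("B")
print(probe.count, probe.last_args)  # 2 ('B',)
detach()
handle_request("A")  # runs unprobed again
print(probe.count)  # 2
```

Callers that captured a direct reference earlier (for example via `from service import price_for`) keep the original function and never see the probe. This trick only covers call sites that look the function up at call time.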

The Ratio Take: Prevention

Determinism in AI engineering isn't about the model—it's about knowing what the deployed artifact is actually doing right now.

Ground AI agents in runtime state, not assumptions.

Anyshift meets ServiceNow: production context for incident workflows

Anyshift · Runtime data for reactive workflows

ServiceNow manages incident workflows; Anyshift adds production runtime context—cause, blast radius, owner attribution—so those workflows operate on actual execution state instead of ticket metadata.

The Ratio Take: Firefighting

Incident management without runtime context is just expensive form-filling; this closes the loop between 'who owns the ticket' and 'what's actually broken in prod.'

Tickets finally know what's running.

Anyshift meets Postman: production-impact API checks before release gates run

Anyshift · Runtime input for prevention workflows

Postman runs API test collections; Anyshift adds live production context showing which API paths, consumers, owners, and monitors are in active use before release gates execute.
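One reading of "test what's used" is to rank release-gate tests by how much live traffic each endpoint serves. Below is a minimal Python sketch. `ApiTest`, `prioritize`, the shape of the traffic map, and the threshold are all hypothetical, and this is not the Postman or Anyshift API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ApiTest:
    name: str
    method: str
    path: str


def prioritize(tests: list[ApiTest], live_traffic: dict[tuple[str, str], int],
               min_requests: int = 1) -> tuple[list[ApiTest], list[ApiTest]]:
    """Split tests into (gating, informational) by observed production traffic.

    live_traffic maps (method, path) to a request count over a recent window;
    where that data comes from is assumed, not specified. Gating tests are
    ordered busiest endpoint first.
    """
    def hits(t: ApiTest) -> int:
        return live_traffic.get((t.method, t.path), 0)

    gating = sorted((t for t in tests if hits(t) >= min_requests), key=hits, reverse=True)
    informational = [t for t in tests if hits(t) < min_requests]
    return gating, informational


tests = [
    ApiTest("create order", "POST", "/orders"),
    ApiTest("legacy export", "GET", "/v1/export"),
    ApiTest("get order", "GET", "/orders/{id}"),
]
traffic = {("GET", "/orders/{id}"): 48_000, ("POST", "/orders"): 9_500}
gating, informational = prioritize(tests, traffic)
print([t.name for t in gating])         # ['get order', 'create order']
print([t.name for t in informational])  # ['legacy export']
```

Tests for untrafficked endpoints are kept as informational and not deleted. An endpoint with no traffic today may carry tomorrow's launch.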

The Ratio Take: Prevention

Pre-deployment API tests that don't know which endpoints are actually serving traffic are theater; this makes the gate intelligent.

Test what's used, not what's defined.

The Crowd Favorite

Superstition - Single Version · Stevie Wonder: Correlation in your dashboards is not causation in prod. Chase the wrong signal and you burn 40 minutes while the real fault compounds.
One More Time · Daft Punk: Retry without exponential backoff and jitter turns a five-second hiccup into a thundering herd. The loop amplifies the fault it was built to absorb.
Piano Man · Billy Joel: A weekly traffic peak that still pages is a capacity model that was never written.
Jump - 2015 Remaster · Van Halen: A feature flag with no kill switch is a deploy with no rollback path. Blast radius grows every minute it stays live.
With Or Without You - Remastered 2007 · U2: Circular service dependencies without circuit breakers mean one downstream timeout cascades through both callers.
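Two of the patterns named above are small enough to sketch. The first is exponential backoff with full jitter: each wait is drawn uniformly between zero and an exponentially growing cap, so clients that failed together do not retry together. The second is a bare-bones circuit breaker. This is a minimal Python illustration that assumes a single-threaded, synchronous caller. The names and defaults (`backoff_delays`, `retry`, `CircuitBreaker`, the 0.1 s base, 5 s cap, 3-failure threshold) are illustrative, not tuned recommendations.

```python
import random
import time
from typing import Callable, List, Optional, TypeVar

T = TypeVar("T")


def backoff_delays(retries: int, base: float = 0.1, cap: float = 5.0,
                   rng: Optional[random.Random] = None) -> List[float]:
    """Exponential backoff with full jitter: delay n is drawn uniformly from
    [0, min(cap, base * 2**n)], so clients that failed together retry apart."""
    if rng is None:
        rng = random.Random()
    return [rng.uniform(0, min(cap, base * 2 ** n)) for n in range(retries)]


def retry(call: Callable[[], T], attempts: int = 5,
          sleep: Callable[[float], None] = time.sleep,
          rng: Optional[random.Random] = None) -> T:
    """Run `call`, retrying failures after jittered, exponentially growing waits."""
    for delay in backoff_delays(attempts - 1, rng=rng):
        try:
            return call()
        except Exception:
            sleep(delay)
    return call()  # last attempt: let its exception reach the caller


class CircuitBreaker:
    """After `threshold` consecutive failures, fail fast for `reset_after`
    seconds instead of adding load to a struggling dependency."""

    def __init__(self, threshold: int = 3, reset_after: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.threshold = threshold
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at: Optional[float] = None

    def call(self, fn: Callable[[], T]) -> T:
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None  # half-open: let the next call through as a trial
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = self.clock()  # (re)open the circuit
            raise
        self.failures = 0  # a success closes the circuit fully
        return result
```

Injecting `sleep`, `rng`, and `clock` keeps both pieces testable without real waiting. In production, a resilience library your stack already uses is usually a better choice than hand-rolled versions.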

The Ratio Take: Prevention

Prevents the cascade before it starts.

The Challenger — The Over-Engineering Award

This week's winner: the team that replaced a two-line cron job with a distributed saga orchestrator, three queues, a custom retry-state machine, and its own Slack channel for alerts. All to run a database cleanup every six hours.

Prevention investment climbed. Incidents didn't move. The system built to eliminate toil became the toil.

Here's the trap. Prevention spend feels virtuous, so nobody questions it the way they question reactive spend. A war room gets a postmortem. A gold-plated pipeline gets a high-five. But spend that doesn't lower failure isn't prevention. It's complexity wearing prevention's badge.

The test is brutal and simple: did failure go down?

If prevention rose and incidents stayed flat, you didn't buy reliability. You bought overhead. And you'll pay to maintain it forever.
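The test above comes down to two comparisons. Below is a minimal Python sketch. `Quarter`, `over_investment_signal`, the quarter-over-quarter framing, and the 10% bar for "incidents went down" are assumptions for illustration. This is not how the benchmark classifies over-investment.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Quarter:
    prevention_spend: float  # e.g. dollars or engineering hours
    incidents: int


def over_investment_signal(before: Quarter, after: Quarter,
                           min_drop: float = 0.10) -> bool:
    """True when prevention spend rose but incidents did not fall by at least
    `min_drop` (a fraction of the earlier count). The 10% default is arbitrary."""
    spend_rose = after.prevention_spend > before.prevention_spend
    if before.incidents == 0:
        drop = 0.0  # nothing left to drop; treat as flat
    else:
        drop = (before.incidents - after.incidents) / before.incidents
    return spend_rose and drop < min_drop


# The cron-job team: spend up, incidents flat.
print(over_investment_signal(Quarter(40_000, 6), Quarter(95_000, 6)))  # True
# Spend up, incidents halved: failure went down.
print(over_investment_signal(Quarter(40_000, 6), Quarter(95_000, 3)))  # False
```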

The Ratio Take: The Ratio

Over-investment signal.

The Ratio is a weekly newsletter by Florian Hoeppner.

Take the assessment → reliabilityeconomics.com/benchmark
Reply to this email with your take.
