The Ratio
A weekly newsletter on reliability economics
The Number
24 of 32
Three in four financial services organizations are under-investing in reliability prevention — the worst rate of any measured industry, and a direct contradiction of the assumption that regulation drives prevention spend.
24 of 32 financial services organizations in our data are classified as under-investing in reliability prevention. Only 2 of 32 are over-investing. Their failure costs run at more than 4x their prevention spend. In technology, the split is nearly even: 29 under-investing, 27 over-investing out of 57.
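The shares behind those counts are quick to check. This is a minimal sketch using only the counts quoted above; the `shares` helper and the "unlabeled" remainder bucket are illustrative names, not categories from the benchmark.

```python
def shares(under: int, over: int, total: int) -> dict[str, float]:
    """Return each group's share of the sample as a percentage."""
    rest = total - under - over  # organizations the text does not label
    return {
        "under": round(100 * under / total, 2),
        "over": round(100 * over / total, 2),
        "unlabeled": round(100 * rest / total, 2),
    }


# Counts as quoted in the text above.
finserv = shares(under=24, over=2, total=32)
tech = shares(under=29, over=27, total=57)

print("financial services:", finserv)
print("technology:", tech)
```

On these counts, 75% of the financial services sample is under-investing, against roughly 51% in technology.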
The Ratio Take: Firefighting
Financial services carries more exposure from outages than any other industry. Regulatory, financial, reputational. You'd assume compliance pressure alone would push these organizations toward over-investing in prevention. The data says the opposite. FinServ is the industry most likely to be absorbing massive failure costs while running lean on the spend that would prevent them. This is the equivalent of an insurance company writing policies against flood damage while refusing to reinforce the levee. The premium income looks fine until the water arrives.
The most regulated industry in the benchmark is also the most under-invested in preventing the failures that trigger those regulations.
This Week in Reliability
Runtime Truth Goes Production
AI agents and autonomous tooling are forcing a fundamental shift: the old CI/CD + observability stack can't see what's actually executing in production until it's already failed. Runtime instrumentation is becoming the new control plane.
Deep Reads
Why CI/CD Pipelines Miss Runtime Failures
Lightrun · Primary evidence—CI/CD gap
CI/CD pipelines validate code through static analysis and tests but miss runtime failures like reflection-based type mismatches that only surface during execution. Lightrun's MCP integration connects AI coding assistants to live production state—variables, call stacks, execution counts—without redeployment.
The Ratio Take: Prevention
This is the gap AI code generators are about to make catastrophic. When agents ship dozens of changes per day, the 'compiles clean, tests pass, fails in prod' cycle consumes your entire operational budget. Runtime observability isn't a nice-to-have anymore; it's the only way to see what's actually running before customers find the failure.
Green pipelines hide what reflection does at runtime.
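A minimal sketch of the failure class described above, written in Python rather than any language the article names. The `PaymentHandler` class and the `dispatch` helper are hypothetical. The point is only that a method resolved by name at runtime is invisible to static checks, and to any test that never exercises that input.

```python
class PaymentHandler:
    def handle_refund(self, amount_cents: int) -> str:
        return f"refunded {amount_cents / 100:.2f}"


def dispatch(handler: object, action: str, payload: object) -> str:
    # The method is looked up by name at runtime, so a type checker cannot
    # verify that it exists or that `payload` has the type it expects.
    method = getattr(handler, f"handle_{action}")
    return method(payload)


handler = PaymentHandler()

# The path the tests cover: an integer amount works.
ok = dispatch(handler, "refund", 1250)

try:
    # A string amount, as might arrive from an unvalidated request body.
    dispatch(handler, "refund", "1250")
    failure = None
except TypeError as exc:
    # Only surfaces when this exact call executes.
    failure = exc

print(ok)
print(type(failure).__name__)
```

Both calls pass a lint and a type check on `dispatch`; only the second one fails, and only when it runs.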
The next era of software needs runtime control
LaunchDarkly · Vendor response—agent control
LaunchDarkly is launching AgentControl to manage not just feature flags in production code but also the AI agents acting autonomously on behalf of engineering teams. The solution extends runtime control from static deployments to dynamic agent behavior.
The Ratio Take: Prevention
Feature flags were the first runtime control plane; now we need a second one for the autonomous systems shipping the flags. The economics are brutal: agents move faster than approval workflows, so you either instrument them at runtime or discover their mistakes at customer scale. This is what 'shift left' looks like when left keeps moving.
Control what agents do, not just what code does.
Why Deterministic AI Engineering Requires Runtime Truth
Lightrun · Agent-specific runtime requirements
AI agents need runtime sensors to ground their work in live production truth, instrumenting and querying running systems on demand without restarts or redeploys.
Determinism in AI engineering isn't about the model—it's about knowing what the deployed artifact is actually doing right now.
Ground AI agents in runtime state, not assumptions.
Anyshift meets ServiceNow: production context for incident workflows
Anyshift · Runtime data for reactive workflows
ServiceNow manages incident workflows; Anyshift adds production runtime context—cause, blast radius, owner attribution—so those workflows operate on actual execution state instead of ticket metadata.
Incident management without runtime context is just expensive form-filling; this closes the loop between 'who owns the ticket' and 'what's actually broken in prod.'
Tickets finally know what's running.
Anyshift meets Postman: production-impact API checks before release gates run
Anyshift · Runtime input for prevention workflows
Postman runs API test collections; Anyshift adds live production context showing which API paths, consumers, owners, and monitors are in active use before release gates execute.
Pre-deployment API tests that don't know which endpoints are actually serving traffic are theater; this makes the gate intelligent.
Test what's used, not what's defined.
The Crowd Favorite
- Superstition - Single Version — Stevie Wonder — Correlation in your dashboards is not causation in prod. Chase the wrong signal and you burn 40 minutes while the real fault compounds.
- One More Time — Daft Punk — Retry without exponential backoff turns a five-second hiccup into a thundering herd. The loop amplifies the fault it was built to absorb.
- Piano Man — Billy Joel — A weekly traffic peak that still pages is a capacity model that was never written.
- Jump - 2015 Remaster — Van Halen — A feature flag with no kill switch is a deploy with no rollback path. Blast radius grows every minute it stays live.
- With Or Without You - Remastered 2007 — U2 — Circular service dependencies without circuit breakers mean one downstream timeout cascades through both callers.
The Ratio Take: Prevention
Prevents the cascade before it starts
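The retry item in the list above can be sketched as capped exponential backoff with full jitter. This is an illustration, not a production client: the `retry` and `backoff_delays` names and the default `base` and `cap` values are assumptions chosen for the example.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def backoff_delays(attempts: int, base: float = 0.5, cap: float = 30.0) -> list[float]:
    """Upper bound for the wait before each retry: base * 2**n, capped."""
    return [min(cap, base * 2 ** n) for n in range(attempts)]


def retry(
    call: Callable[[], T],
    attempts: int = 5,
    base: float = 0.5,
    cap: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run `call`, retrying up to `attempts` times with jittered backoff."""
    for bound in backoff_delays(attempts, base, cap):
        try:
            return call()
        except Exception:  # a real client would catch only retryable errors
            # Full jitter: a random wait below the bound keeps clients that
            # failed together from retrying together.
            sleep(random.uniform(0, bound))
    return call()  # final attempt; any exception propagates to the caller


print(backoff_delays(5))
```

The `sleep` parameter is injectable so the loop can be tested without waiting. Without the doubling and the jitter, every client retries on the same schedule, which is the thundering herd the list item describes.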
The Challenger — The Over-Engineering Award
This week's winner: the team that replaced a two-line cron job with a distributed saga orchestrator, three queues, a custom retry-state machine, and its own Slack channel for alerts. All to run a database cleanup every six hours.
Prevention investment climbed. Incidents didn't move. The system built to eliminate toil became the toil.
Here's the trap. Prevention spend feels virtuous, so nobody questions it the way they question reactive spend. A war room gets a postmortem. A gold-plated pipeline gets a high-five. But spend that doesn't lower failure isn't prevention. It's complexity wearing prevention's badge.
The test is brutal and simple: did failure go down?
If prevention rose and incidents stayed flat, you didn't buy reliability. You bought overhead. And you'll pay to maintain it forever.
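That test fits in a few lines. A minimal sketch, assuming you can compare prevention spend and incident counts across two periods; the `prevention_verdict` function, its labels, and the sample figures are hypothetical, and treating any drop in incidents as "failure went down" is a simplification.

```python
def prevention_verdict(
    spend_before: float,
    spend_after: float,
    incidents_before: int,
    incidents_after: int,
) -> str:
    """Label a change in prevention spend by what happened to failures."""
    if spend_after <= spend_before:
        return "no added prevention spend"
    if incidents_after < incidents_before:
        return "prevention"  # spend rose and failure went down
    return "overhead"  # spend rose, failure stayed flat or got worse


# Hypothetical figures: spend up 40%, incident count unchanged.
verdict = prevention_verdict(100_000, 140_000, 12, 12)
print(verdict)
```

A real version would need a tolerance for noise in incident counts and a measure of incident cost, neither of which is specified here.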
The Ratio Take: The Ratio
Over-investment signal
The Ratio is a weekly newsletter by Florian Hoeppner.
Take the assessment → reliabilityeconomics.com/benchmark
Reply to this email with your take.