The AI Endurance Test: Why Task Half-Life Is a Strategic Imperative

The defining tension in 2025-era AI is that the same systems demonstrating brilliant capability on specific tasks remain brittle when run continuously. Five-minute demos succeed; eight-hour deployments break. This is not anecdote; it is now a measured pattern.

Figure 1: The length of task that AI agents can complete at a 50% success rate shows a clear upward trend over time. Chart adapted from Kwa et al. (2025), via Ord (2025).

Recent work by Toby Ord at Oxford, building on empirical analysis by METR (Kwa et al., 2025), proposes a useful framework for the pattern: the AI success half-life. Understanding what this implies for system design — and what it implies for which companies become structurally defensible — is where most of the venture-scale leverage in AI now sits.

What the half-life describes

Imagine an AI agent attempting a complex engineering task. The Ord model proposes that for every unit of time a human would need on the task, the agent has a small, constant probability of failing. The longer the task, the more these failure probabilities compound. The result is an exponential drop in success, much like radioactive decay.

The consequence is non-intuitive. Success probabilities multiply rather than add. If the eight-hour task succeeds 50% of the time, the sixteen-hour task succeeds roughly 25% of the time. Each additional half-life of task length halves the odds again, so a thirty-two-hour task succeeds roughly 6% of the time.
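The arithmetic can be checked with a minimal sketch of the constant-hazard model. The function names and the eight-hour half-life are illustrative assumptions, not values taken from Ord or METR.

```python
import math

def success_probability(task_hours: float, half_life_hours: float) -> float:
    """Chance of finishing a task under a constant per-hour failure rate."""
    return 0.5 ** (task_hours / half_life_hours)

def horizon_at(reliability: float, half_life_hours: float) -> float:
    """Longest task (in hours) that still succeeds with the given probability."""
    return half_life_hours * math.log(reliability) / math.log(0.5)

# Hypothetical agent: 50% success on an eight-hour task.
HALF_LIFE = 8.0

for hours in (1, 4, 8, 16, 32):
    print(f"{hours:>2} h task: {success_probability(hours, HALF_LIFE):.1%}")

# Demanding 80% reliability instead of 50% shrinks the usable horizon.
print(f"80% horizon: {horizon_at(0.80, HALF_LIFE):.2f} h")
```

Under these assumptions, the 80% horizon comes out at roughly a third of the 50% horizon. Asking for higher reliability therefore shortens the tasks an agent can be trusted with.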

Figure 2: AI model success probability curves illustrating the 'half-life' effect: success probability falls as tasks lengthen, and the achievable task horizon shortens as higher reliability is required. Charts adapted from Ord (2025), based on Kwa et al. (2025) data.

Four structural responses

If AI has a half-life, the response is to build systems that reduce the per-step failure probability, that “reset the clock” by bounding the consequence of any individual failure, or both. The most defensible AI companies over the next five years will be the ones doing both well.

Resilience and modularity. Complex tasks decompose into chains of sub-tasks; a single bad link breaks the chain. The opportunity lies in systems with robust error detection, recovery mechanisms, and the ability to learn from sub-task failures without compromising the overall mission. This is an engineering problem, not a capability problem.
Human-AI collaboration. Ord’s analysis notes that human performance on long tasks decays more slowly than AI’s. Humans excel at course correction and adaptive reasoning over extended periods. The strongest systems are symbiotic: AI handles high-speed segments; humans manage oversight, integration, and recovery from novel failures.
Realistic roadmaps. A 50% success rate on a one-hour task is a milestone. What is the path to 90% on an eight-hour operational task? That trajectory — measured, sustained, demonstrable — is where venture-scale value compounds. Demo capability is not deployment capability.
Architectural change. The current half-life shape is not fixed. Better memory handling, sub-agent decomposition, externalised state, and novel verification approaches all flatten the decay curve. The next generation of foundationally different AI companies will be the ones rebuilding the curve, not riding the current one.
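One way to see what “resetting the clock” buys is a small sketch comparing a single uninterrupted run with a checkpointed run. It rests on strong simplifying assumptions that are not from the source: every segment failure is detected at the checkpoint, retries are independent, and a failed attempt never corrupts the saved state. All names and numbers are hypothetical.

```python
def single_shot(task_hours: float, half_life_hours: float) -> float:
    """Success probability for one uninterrupted run (constant-hazard model)."""
    return 0.5 ** (task_hours / half_life_hours)

def checkpointed(task_hours: float, half_life_hours: float,
                 segments: int, attempts: int) -> float:
    """Success probability when the task is split into equal segments,
    each verified at a checkpoint and retried up to `attempts` times.

    Assumes perfect failure detection and independent retries.
    """
    segment_hours = task_hours / segments
    p_segment = 0.5 ** (segment_hours / half_life_hours)
    # A segment succeeds unless every one of its attempts fails.
    p_segment_with_retries = 1.0 - (1.0 - p_segment) ** attempts
    # The whole task succeeds only if every segment eventually succeeds.
    return p_segment_with_retries ** segments

# Hypothetical agent with an eight-hour half-life on a sixteen-hour task.
print(f"single shot:            {single_shot(16, 8):.1%}")
print(f"16 segments, 1 attempt: {checkpointed(16, 8, 16, 1):.1%}")
print(f"16 segments, 3 attempts: {checkpointed(16, 8, 16, 3):.1%}")
```

With one attempt per segment the result matches the single-shot figure, so decomposition alone does not move the curve in this model. The gain comes from detecting a failure and retrying from the last good checkpoint, which is why verification quality matters as much as modularity. Retries also cost extra compute and time, which this sketch ignores.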

What to look for

Three patterns separate the AI companies that compound from the ones that compound briefly and plateau.

Architectures built for resilience. Modularity, sophisticated error handling, deliberate human-in-the-loop integration — not as fallback, as design.
Pragmatic, time-bounded roadmaps. Founders who can articulate why the curve flattens for their problem, and what intermediate milestones will demonstrate it. Vague “we’re working on reliability” answers do not survive contact with the half-life pattern.
High reliability on bounded, valuable problems. Better to be 99.5% reliable on a four-hour task with clear ROI than 50% reliable on an eight-hour task with a long PR cycle. The economic value of bounded reliability is consistently underestimated.
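A roadmap claim can be sanity-checked by asking what half-life a reliability target implies. The sketch below assumes the same constant-hazard model. The starting point of a one-hour half-life is a hypothetical example, and the helper names are illustrative.

```python
import math

def required_half_life(task_hours: float, target_reliability: float) -> float:
    """Half-life (hours) needed to hit the target on a task of this length."""
    return task_hours * math.log(0.5) / math.log(target_reliability)

def doublings_needed(current_half_life: float, required: float) -> float:
    """How many times the current half-life must double to reach the target."""
    return max(0.0, math.log2(required / current_half_life))

# Hypothetical starting point: 50% success on a one-hour task.
CURRENT_HALF_LIFE = 1.0

targets = [
    ("50% on an 8 h task", 8, 0.50),
    ("90% on an 8 h task", 8, 0.90),
    ("99.5% on a 4 h task", 4, 0.995),
]

for label, hours, reliability in targets:
    needed = required_half_life(hours, reliability)
    steps = doublings_needed(CURRENT_HALF_LIFE, needed)
    print(f"{label}: half-life of {needed:,.1f} h, about {steps:.1f} doublings")
```

Under this model the high-reliability targets imply half-lives far longer than the task itself. One reading is that such reliability is usually reached by narrowing the problem and adding verification and recovery, rather than by waiting for raw model horizons to grow. A founder's roadmap should say which of those levers it is pulling.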

Closing observation

The half-life concept is grounding. It reframes the AI race from a sprint of capability demos to a sustained engineering exercise in dependability. The most consequential AI companies of the next decade will not be the ones that produced the most impressive five-minute demonstrations. They will be the ones that quietly figured out how to extend the operational half-life of useful work — and then did it, repeatedly, for problems that mattered.

Acknowledgments: The framework on AI endurance draws on Toby Ord’s “Is there a Half-Life for the Success Rates of AI Agents?” and the empirical research from METR (Kwa et al., 2025).