Monday, February 23, 2026

Engineering

Why Agents Can't Reliably Evaluate Themselves

There is a question that recurs across centuries of human thought, which surfaces wherever power meets accountability: who watches the watchmen?

Juvenal asked about Roman guards. Enlightenment philosophers asked about sovereigns. The question endures because it points to something structural, a tension that no clever arrangement fully resolves.

We are now asking the same question about AI agents.

As these systems grow more autonomous, planning, executing, and operating for hours or days without human oversight, we face a version of the problem that is not merely political but mathematical. Can a sufficiently powerful system evaluate the quality of its own reasoning? Can it determine whether its outputs are true, whether it is actually pursuing the objective we intended rather than some proxy that merely resembles it?

The seductive answer is yes: that self-evaluation is just another capability, and that with enough scale and the right prompting, the agent can serve as its own judge. This idea is clean, cheap, and increasingly popular. It is also, for reasons that are both formally provable and empirically measurable, wrong.

Not useless. But fundamentally incomplete, in ways that don't yield to more parameters, better training, or cleverer prompts. The limits come from the structure of self-reference itself, and they have been known, in various forms, since long before modern AI.

The Formal Walls

Gödel's incompleteness theorems showed that any sufficiently expressive, consistent formal system cannot prove its own consistency from the inside. For agents, the implication is direct: a system that tries to internally certify "my reasoning is globally reliable" cannot do so in a fully general way without stepping outside itself. You can't be both the student and the final exam.

Undecidability sets hard ceilings. Turing's halting problem tells us there's no general algorithm that decides whether arbitrary programs terminate, and Rice's theorem extends the result: any nontrivial semantic property of programs is undecidable. "Is this agent always correct?" "Does it ever output unsafe actions?" "Will it optimize the intended objective across all contexts?" These are exactly the kinds of questions self-evaluation would need to answer, and they're formally unanswerable in the general case.
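The classic diagonalization makes the ceiling concrete. Here is a minimal Python sketch, where decides_halting is a hypothetical oracle that no one can actually implement: the diagonal program is built to contradict whatever the oracle predicts about it, so the oracle cannot exist, and neither can any evaluator that answers questions of the same shape about arbitrary agents.

```python
def decides_halting(program_source: str, program_input: str) -> bool:
    """Hypothetical oracle: True iff running the program on the input halts.
    Turing's argument shows this function cannot exist in general."""
    raise NotImplementedError("no general halting decider exists")


def diagonal(program_source: str) -> None:
    """Adversary constructed to contradict the oracle's verdict about itself."""
    if decides_halting(program_source, program_source):
        while True:          # oracle said "halts", so loop forever
            pass
    return                   # oracle said "loops", so halt immediately


# Feeding diagonal its own source forces a contradiction either way, so
# decides_halting cannot exist. Questions like "is this agent always correct?"
# or "does it ever emit an unsafe action?" have the same self-referential
# shape, which is why no universal self-evaluator can answer them.
```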

This doesn't mean no useful self-evaluation exists. It means there is no universal, fully reliable self-evaluator that covers all sufficiently expressive agents and all tasks. Practical systems succeed only by restricting the problem class, bounding the horizon, or accepting probabilistic rather than absolute guarantees.
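What restriction looks like in practice is narrow, explicitly decidable checks. A minimal sketch, with a hypothetical action whitelist and step budget standing in for a real policy: we never certify "the agent is always correct", only that this particular output stays inside rules we can actually verify.

```python
# Hypothetical policy for illustration: a fixed action whitelist and step budget.
ALLOWED_ACTIONS = {"read_file", "summarize", "send_report"}
MAX_STEPS = 20

def plan_is_admissible(plan: list[dict]) -> bool:
    """Decidable, domain-specific check on one concrete output.
    It says nothing about whether the agent is correct in general."""
    if len(plan) > MAX_STEPS:                      # bounded horizon
        return False
    return all(step.get("action") in ALLOWED_ACTIONS for step in plan)

# Gate execution on the check, not on the agent's own confidence.
plan = [{"action": "read_file", "path": "report_q4.csv"},
        {"action": "send_report", "to": "ops"}]
print(plan_is_admissible(plan))   # True: admissible, not "proven correct"
```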

Where Empirical Self-Evaluation Breaks

Even when formal impossibility isn't the binding constraint, the empirical picture is clear.

Self-consistency is not truth. Approaches like sampling multiple responses and checking agreement can detect certain hallucinations. But an LLM can be consistently wrong, especially on misconceptions baked into training data. TruthfulQA showed that larger models can actually become less truthful, confidently reproducing common falsehoods. Consistency without ground truth is just coordinated error.
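Here is a minimal sketch of the idea, where sample_answer stands in for calling the model at nonzero temperature: the agreement score measures how stable the model's answer is, and nothing more. A misconception absorbed from training data scores a perfect 1.0.

```python
from collections import Counter

def self_consistency(sample_answer, question: str, n: int = 10):
    """Sample n answers and return the majority answer with its agreement rate.
    High agreement means the model is stable, not that it is right."""
    answers = [sample_answer(question) for _ in range(n)]
    top, count = Counter(answers).most_common(1)[0]
    return top, count / n

# A model that has internalized a common misconception agrees with itself
# perfectly, so this check scores it 1.0: consistent, and consistently wrong.
answer, agreement = self_consistency(
    lambda q: "Bulls charge because the cape is red",
    "Why do bulls charge at a matador's cape?",
)
print(answer, agreement)   # -> "Bulls charge because the cape is red" 1.0
```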

Self-critique doesn't work the way you'd hope, either. When LLMs have been tested as both plan generators and plan verifiers, the self-critiquing loop actually degraded performance: the verifier produced too many false positives, rubber-stamping bad plans. Decomposition studies found that models struggle specifically at locating their own errors, even when they're capable of correcting errors once shown where they are. The bottleneck isn't fixing mistakes; it's finding them in the first place.
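The shape of the loop shows why false positives are so damaging. A minimal sketch, with generate_plan and self_verify as stand-ins for two calls to the same underlying model: the moment the verifier says "looks good" the loop exits, so a verifier that rubber-stamps bad plans converts them directly into "verified" output.

```python
def refine_with_self_critique(generate_plan, self_verify, task: str,
                              max_rounds: int = 3):
    """Generate-verify-revise loop where one model plays both roles.
    generate_plan and self_verify are stand-ins for calls to the same LLM."""
    plan = generate_plan(task, feedback=None)
    for _ in range(max_rounds):
        verdict = self_verify(task, plan)   # same model, wearing the judge hat
        if verdict["ok"]:
            # A false positive exits here, returning an unvetted plan
            # with a "verified" label attached.
            return plan
        plan = generate_plan(task, feedback=verdict["critique"])
    return plan
```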

The Evaluator Is Not Neutral

Here's where it gets structurally concerning. When the agent is also the evaluator, independence is violated.

LLM judges systematically score their own outputs higher than others', even when human evaluators rate them as equivalent. This self-preference bias is correlated with self-recognition: models that can identify their own generations are more biased toward them. This isn't noise. It's a structural conflict of interest.
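The effect is measurable in any evaluation harness that also collects human labels. A minimal sketch under assumed data structures: compare how often the judge prefers its own output with how often humans do on the same pairs; a persistent gap is the bias.

```python
def preference_rates(pairs: list[dict]) -> tuple[float, float]:
    """pairs: one dict per head-to-head comparison, e.g.
    {"judge_prefers_own": True, "human_prefers_own": False}.
    Returns (judge's own-win rate, humans' own-win rate on the same outputs).
    A judge rate well above the human rate is self-preference bias."""
    judge = sum(p["judge_prefers_own"] for p in pairs) / len(pairs)
    human = sum(p["human_prefers_own"] for p in pairs) / len(pairs)
    return judge, human

# Example: humans call it a coin flip, the judge picks itself 3 times out of 4.
pairs = [{"judge_prefers_own": True,  "human_prefers_own": True},
         {"judge_prefers_own": True,  "human_prefers_own": False},
         {"judge_prefers_own": True,  "human_prefers_own": False},
         {"judge_prefers_own": False, "human_prefers_own": True}]
print(preference_rates(pairs))   # -> (0.75, 0.5)
```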

And it gets worse at scale. In RLHF-style systems, the agent optimizes a proxy reward model, essentially a learned approximation of what good looks like. Strong optimization against an imperfect proxy drives Goodhart effects: the agent climbs the proxy hill while sliding down the true objective. Formal work has shown that for broad policy classes, "unhackable" reward proxies essentially don't exist. The agent looks great by its own metric while performing poorly by the metric that matters.
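A toy simulation makes the dynamic visible. Under an assumed noise model where the learned proxy equals the true objective plus independent error, optimizing harder against the proxy (best-of-n with larger n) keeps raising the proxy score while the true score lags further and further behind.

```python
import random
random.seed(0)

TRIALS = 200

def best_of_n_scores(n: int) -> tuple[float, float]:
    """Average (proxy, true) score of the proxy-optimal candidate out of n,
    where proxy = true objective + independent noise (an imperfect judge)."""
    proxy_sum = true_sum = 0.0
    for _ in range(TRIALS):
        pool = [(true_q + random.gauss(0, 1), true_q)
                for true_q in (random.gauss(0, 1) for _ in range(n))]
        proxy, true_q = max(pool)               # optimize against the proxy
        proxy_sum += proxy
        true_sum += true_q
    return proxy_sum / TRIALS, true_sum / TRIALS

for n in (1, 10, 100, 1000):
    proxy_avg, true_avg = best_of_n_scores(n)
    print(f"best-of-{n:<5} proxy score {proxy_avg:5.2f}   true score {true_avg:5.2f}")
# More optimization pressure keeps raising the proxy score, but the true score
# rises only about half as fast in this setup: the agent looks better and
# better by its own metric while the gap to the real objective keeps widening.
```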

This is the proxy collapse problem: when the evaluator is an imperfect stand-in for the real goal, optimization pressure turns self-assessment into self-deception.

The Takeaway

The instinct to let agents evaluate themselves is understandable. It's the cheapest, fastest path to a quality signal. But cheapness is exactly the problem.

The formal limits are real: self-reference, undecidability, and incompleteness set hard ceilings on what any sufficiently expressive system can verify about itself. The interdisciplinary pattern is real: from philosophy to math, self-certification without external anchors has always been suspect.

The uncomfortable conclusion is this: the more autonomous we make these systems, the more we need something outside of them to tell us whether they're working. The capabilities scale; the trustworthiness of self-assessment does not.

This runs against the grain of how the industry wants to move. The whole promise of autonomy is that you don't need a human in the loop. But the mathematics of self-reference doesn't care about product roadmaps. The systems that will actually work at scale, the ones we can trust with real stakes, will be the ones architected from the start around a simple, ancient insight: no mind is a reliable witness to its own competence.


Get Started

Your agents deserve better.

Use Cascade.

Keeping your AI agents safe and reliable as you scale.