Most AI claims don’t hold up because the way we evaluate systems doesn't reflect real-world conditions.

Every week brings a new headline announcing that AI has “solved” another domain, and yet, paradoxically, industry adoption remains slow.

There are many factors hampering AI adoption, but a major driver of the gap between hype and real-world performance is evaluation: how we measure AI systems and what conclusions we draw from those measurements. In product development, nuance matters. To build robust evaluation, you need to keep asking whether you are actually measuring what you claim to measure, and you need to test your assumptions rather than take them for granted.

Most AI claims don’t hold up because the way we evaluate systems doesn't reflect real-world conditions. At UnlikelyAI, trust in the systems we build means a lot to us: trust in the claims we make, and trust that our products will work for users in the messy complexity of reality. Trust requires a fundamentally different approach to evaluation than the one driving today’s hype cycles.

Beware of interpretative leaps

We are in an era of rapid AI development, where the ease of running models on pre-packaged benchmarks encourages sweeping claims about what AI can do. Let’s break down a few of the dangerously deceptive evaluation mistakes that lead to confident promises in the lab and disappointing failures in the real world.

Benchmarks don’t tell the whole story

One of the most frequent evaluation mistakes is to draw unfounded conclusions from benchmark performance: equating good metrics on a benchmark with good performance in the real world. Benchmarks are a well-established tool for evaluating model performance: they often target a specific challenge, allow easy comparison between different models and are convenient to use. Someone has already put in the work of compiling data, golden outputs, and measurement criteria, so all you need to do is run your system over them. However, when your goal is to develop a useful product, rather than to compare architectures academically, you need to be careful: if the data and task don’t genuinely reflect your use-case, the results may stop being meaningful once you try to carry them over into practice.

Often, benchmark tasks are quite artificial. They are optimised for ease of comparison between models, not for reflecting the messy data of the real world, let alone the specifics of the context you want to deploy your system in. For a production system, robustness and consistency of results at scale are crucial, and a benchmark score alone tells you little about either.
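To make the gap concrete, here is a minimal sketch in Python. Everything in it is hypothetical and invented for illustration: the toy keyword-based system, the `Example` type, and both tiny datasets. It simply scores the same system on clean, benchmark-style inputs and on messier inputs of the kind real users might write.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class Example:
    text: str
    expected: str


def accuracy(system: Callable[[str], str], examples: Sequence[Example]) -> float:
    """Fraction of examples where the system output equals the expected label."""
    if not examples:
        raise ValueError("need at least one example")
    correct = sum(system(ex.text) == ex.expected for ex in examples)
    return correct / len(examples)


def toy_system(text: str) -> str:
    """Hypothetical stand-in for a model: a naive keyword rule."""
    return "refund" if "refund" in text.lower() else "other"


# Illustrative, made-up data: clean benchmark-style phrasing...
benchmark = [
    Example("I want a refund", "refund"),
    Example("Where is my order?", "other"),
    Example("Please refund my purchase", "refund"),
    Example("How do I reset my password?", "other"),
]

# ...versus the kind of phrasing a real deployment might see.
in_domain = [
    Example("pls send my money back", "refund"),
    Example("Charged twice, need this reversed", "refund"),
    Example("REFUND!!! order 1234", "refund"),
    Example("app keeps crashing", "other"),
]

benchmark_acc = accuracy(toy_system, benchmark)
in_domain_acc = accuracy(toy_system, in_domain)
print(f"benchmark: {benchmark_acc:.2f}, in-domain: {in_domain_acc:.2f}")
```

The toy system is perfect on the benchmark-style set and gets only half of the in-domain set right. The numbers themselves mean nothing; the point is that the second dataset is the one that resembles the deployment context.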

Partial evaluations are not enough for complex real-life systems

A frequent source of gaps between evaluation and performance is the fallacy of evaluating only those parts of the system where it’s comparatively easy to define and measure correctness. But to make plausible product claims to your users, you need to measure end-to-end performance in a realistic setting.

For example, if you're building a system for medical image classification, you may focus your evaluation on the “core task” of whether your model correctly predicts the presence of a disease given a scan. While this is easy to define and you can lean on existing data, it is only one piece of a real-life application. What if your system is so slow, or its interface so off-putting, that medical professionals stop using it? The success of your system depends on factors beyond the model itself, and you need to evaluate those too. Before you can make claims about time saved or improved outcomes for doctors and patients, you need to measure whether those claims hold up in practice.
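As a rough sketch of the difference, the Python below scores the same set of cases two ways: by model correctness alone, and by an end-to-end criterion. The `CaseResult` fields, the latency budget, and all the data are assumptions made up for illustration, not measurements from any real system.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class CaseResult:
    prediction_correct: bool      # did the model classify the scan correctly?
    latency_seconds: float        # how long the clinician waited for a result
    clinician_used_output: bool   # did the clinician actually act on the output?


def component_accuracy(results: Sequence[CaseResult]) -> float:
    """Accuracy of the model in isolation (the 'core task')."""
    return sum(r.prediction_correct for r in results) / len(results)


def end_to_end_success_rate(
    results: Sequence[CaseResult], max_latency_seconds: float
) -> float:
    """A case counts only if the prediction was correct, arrived within the
    latency budget, and was actually used by the clinician."""
    successes = sum(
        r.prediction_correct
        and r.latency_seconds <= max_latency_seconds
        and r.clinician_used_output
        for r in results
    )
    return successes / len(results)


# Hypothetical results for five cases.
results = [
    CaseResult(True, 2.0, True),    # correct, fast, used
    CaseResult(True, 45.0, False),  # correct but too slow, so it was ignored
    CaseResult(True, 3.0, False),   # correct and fast, but not used
    CaseResult(False, 2.5, True),   # wrong prediction
    CaseResult(True, 4.0, True),    # correct, fast, used
]

model_acc = component_accuracy(results)
e2e_rate = end_to_end_success_rate(results, max_latency_seconds=10.0)
print(f"model accuracy: {model_acc:.2f}, end-to-end success: {e2e_rate:.2f}")
```

With this invented data the model looks 80% accurate while only 40% of cases succeed end to end. What counts as end-to-end success is a product decision you have to define for your own setting; the three conditions here are just one plausible choice.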

Task framing and measurement protocol can completely change results

Another issue driving unsubstantiated performance claims is the measurement itself: what is being measured, how it is measured, and what conclusions are drawn from it.

Consider the recent announcements that models from Google DeepMind (Gemini) and OpenAI reached gold-medal-level performance on International Mathematical Olympiad problems. Even if we avoid the sensational conclusion that models can now replace mathematicians, many readers naturally interpret this as “the task is solved” and assume building a robust, production-ready maths system is trivial. But as Fields Medalist Terence Tao points out, differences in task definition, available resources, allowed assistance, and evaluation protocol can lead to massively different performance outcomes, even for the “same” task.

Layer unreliable methods on top of this, such as LLMs judging their own outputs, or haphazard prompting experiments run under a single set of conditions that doesn’t match production constraints, and you end up with conclusions that the evaluation does not actually support.
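A small sketch shows how much the protocol alone can move a headline number. The Python below takes one fixed set of recorded attempts (entirely made up) and scores it under two protocols: counting only the first attempt, and counting a problem as solved if any of the first k attempts is correct. Note that the second protocol quietly assumes something can reliably pick out the correct attempt, which is itself a condition that may not hold in production.

```python
from typing import Mapping, Sequence


def solve_rate(attempts: Mapping[str, Sequence[bool]], k: int) -> float:
    """Fraction of problems with at least one correct result among the
    first k recorded attempts."""
    if k < 1:
        raise ValueError("k must be at least 1")
    if not attempts:
        raise ValueError("need at least one problem")
    solved = sum(any(results[:k]) for results in attempts.values())
    return solved / len(attempts)


# Hypothetical per-problem attempt outcomes (True = correct), in the
# order the attempts were made. Same system, same outputs throughout.
recorded_attempts = {
    "problem_1": [True, True, False],
    "problem_2": [False, True, False],
    "problem_3": [False, False, True],
    "problem_4": [False, False, False],
}

first_attempt_rate = solve_rate(recorded_attempts, k=1)
best_of_three_rate = solve_rate(recorded_attempts, k=3)
print(f"first attempt: {first_attempt_rate:.2f}, best of 3: {best_of_three_rate:.2f}")
```

Nothing about the system changed between the two numbers (0.25 and 0.75 on this toy data); only the measurement protocol did. A claim like “solves 75% of problems” is uninterpretable without knowing which protocol produced it.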

Real-world product evaluation is hard

The mistakes above are common, but they are risky and they erode trust. They lead to the mismatch we are seeing between hyped claims about AI and disillusioned users who cannot get it to work for them.

There is now a growing awareness that something is going wrong in AI evaluations and that there is a gap between academic datasets and product development. This has driven recent efforts to create more “real-life” benchmarks. For example, PromptQL’s collaboration with UC Berkeley claims to reflect “the reliability demands of mission-critical decisions and the real questions businesses ask every day”, and Samsung’s TRUEbench is described as containing “a comprehensive set of metrics to measure how large language models (LLMs) perform in real-world workplace productivity applications”. It’s tempting to think that simply using these benchmarks will address our evaluation concerns. But we still need to be careful: just because a benchmark claims to be grounded in business use-cases, does it actually overlap perfectly with yours? Will it truly measure exactly what you care about?

There is no one-size-fits-all approach to evaluation

The hard truth is that there is no generic, robust evaluation out of the box. It is easy to generate impressive numbers. It is far harder to ensure those numbers meaningfully predict future system behaviour.

To avoid making claims that you later can’t uphold, you need to invest care and thought. When building a tech product, it is easy for a team of engineers to get more excited about building cool tech than about verifying that it behaves as intended. The result of cutting corners on evaluation is discovering painful gaps too late and failing to achieve the amazing results you were hoping for.

Evaluation needs to be a part of your design from the very beginning, not just an afterthought. Good evaluation should be transparent, reliable, and meaningful. It lets you be confident in your work, and makes you trustworthy. It ensures the product will actually deliver value for users.

Evaluation is an important development tool. As you build a product, you rely on evaluation to understand how every change (a new prompt, a modified component, an architectural adjustment) affects the system as a whole. But this is only meaningful when your evaluation criteria are well-defined and specific to your use-case.
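One lightweight way to use evaluation like this is to compare per-case results before and after a change, rather than only the aggregate score. The Python below is a minimal sketch under that assumption; the case names and pass/fail outcomes are hypothetical.

```python
from typing import Dict, List, Mapping


def pass_rate(run: Mapping[str, bool]) -> float:
    """Fraction of evaluation cases that passed in a run."""
    return sum(run.values()) / len(run)


def compare_runs(
    baseline: Mapping[str, bool], candidate: Mapping[str, bool]
) -> Dict[str, List[str]]:
    """List the cases that flipped between two runs over the same case set."""
    if baseline.keys() != candidate.keys():
        raise ValueError("both runs must cover the same evaluation cases")
    regressions = sorted(c for c in baseline if baseline[c] and not candidate[c])
    fixes = sorted(c for c in baseline if not baseline[c] and candidate[c])
    return {"regressions": regressions, "fixes": fixes}


# Hypothetical results on a small use-case-specific evaluation set,
# before and after a prompt change (True = case passed).
baseline_run = {
    "invoice_date": True,
    "multi_currency": False,
    "scanned_pdf": True,
    "handwritten_note": False,
}
candidate_run = {
    "invoice_date": True,
    "multi_currency": True,
    "scanned_pdf": False,
    "handwritten_note": False,
}

diff = compare_runs(baseline_run, candidate_run)
print(pass_rate(baseline_run), pass_rate(candidate_run), diff)
```

In this toy example the overall pass rate is 0.5 both before and after the change, yet one case was fixed and another broke. Whether that trade is acceptable depends on which cases matter most in your use-case, which is exactly the kind of decision an aggregate number hides.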

Generic prompts and off-the-shelf datasets may be convenient, but they rarely capture the behaviours you actually need to measure, and they won’t guide you toward the best product decisions. In many cases, doing evaluation properly requires grappling with a harder question: what does “correct” behaviour even look like in the context of your full, complex process?

In the next post, we will share practical guidance for designing robust evaluation within the real constraints of product development. We will outline a lightweight approach teams can adopt immediately to test assumptions, measure what really matters, and build AI systems they can make trustworthy, defensible claims about.