What does good evaluation actually mean in the context of AI product development?
We’ve already seen how it can go wrong: evaluations that over-index on benchmark scores, ignore user needs because they’re harder to quantify, and rely on inconsistent task framing and measurement protocols. These setups generate impressive-looking numbers but deeply unreliable claims about system performance. Worse, they push teams away from the hard work required to build trustworthy AI. We’ve broken down these evaluation anti-patterns in a previous blog post; you can see full details here.
We’ve shown that to make trustworthy performance claims, you need evaluation that is transparent, reliable and meaningful. This takes work and thought, and does not come out of the box: good evaluation needs to be specific to your use-case. But what does that look like concretely? We want to share our approach to evaluation design for product development: keeping evaluation trustworthy while actually getting something built!
So you’ve got a shiny new idea for a product you want to build. How can you design evaluation so that you can make trustworthy claims about its capabilities, but also guide decisions during the development process?
To design your evaluation, you need to deeply understand what it is you are trying to build and measure. The following questions can guide you through that process. First, break down your system to first principles:
These questions might seem basic but it is easy to get carried away and skip fundamentals. Next, you need to understand the data involved, both end-to-end and for any sub-components:
This will allow you to plan what kind of data needs to be involved in your evaluation. Finally, you need to start thinking about your measurements:
By answering these questions, you will get a much clearer view of what you are trying to build, and how you can evaluate it meaningfully.
Understanding your product and your evaluation will require close collaboration with Subject Matter Experts (SMEs). It’s critical to involve them early in the process. They help validate your assumptions, surface what actually matters in practice, and check whether you’re even asking the right questions.
In many cases, you’ll discover that users care about something quite different from what you thought, that there are additional desiderata you hadn’t considered, or that there are simplifying constraints you can include. SMEs are equally essential when it comes to designing your evaluation: as we know, generic evaluation does not exist, and you need to sculpt yours around what an SME would consider to be ground truth.
The questions above aren’t just a prerequisite for evaluation design; they’re also a forcing function for team alignment on what you are building. Often, trying to design good evaluation can “flush out” hidden assumptions and expectations that haven’t been aligned yet. You need a shared understanding of what “good” actually means for your full, complex system and, by extension, which behaviors need measuring. Only then can you design evaluations that are genuinely informative rather than performative.
But this is usually where implementation starts to bite. You now have your ideal evaluation and all desiderata, but you may not have any relevant ground truth data, nuanced metrics may be hard to capture, and solving these problems takes time. Your development is blocked in the meantime because you can’t measure progress or compare design choices. At this stage, perfect evaluation can seem uncomfortably far away. How do you still proceed?
The most important next step is to unblock your development. In our experience, this works best with a parallel approach: investing part of our effort in building out the high-quality end-to-end evaluation, while in parallel using strategic, partial evaluations to power immediate development. Are there any sub-problems you know you need to solve that you can already evaluate? Start with those. Next, try to find the most urgent questions you need answers to that require some evaluation you are currently missing. Can you design intermediate evaluation that helps you answer those, even if it does not cover your entire use-case? You can use multiple approaches and datasets to measure different things, and it may be possible to answer immediate questions facing your design by using proxies or simplifying assumptions.
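One lightweight way to keep track of this is to write down, for each partial evaluation, which development question it answers and whether it relies on a proxy. The sketch below is only an illustration: the `PartialEval` structure, the `unanswered` helper and the example questions for an imagined retrieval-based assistant are all hypothetical, not a prescribed format.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class PartialEval:
    """One partial evaluation and the development question it answers.

    Hypothetical structure, used here purely for illustration.
    """
    question: str
    dataset: str
    metric: str
    is_proxy: bool = False
    assumptions: list[str] = field(default_factory=list)


def unanswered(questions: list[str], evals: list[PartialEval]) -> list[str]:
    """Return the development questions that no evaluation covers yet."""
    covered = {e.question for e in evals}
    return [q for q in questions if q not in covered]


# Illustrative questions for an imagined retrieval-based assistant.
questions = [
    "Does retrieval surface the right document?",
    "Which prompt variant gives more faithful answers?",
    "Are end-to-end answers acceptable to an SME?",
]

evals = [
    # A sub-problem that can already be evaluated directly.
    PartialEval(
        question=questions[0],
        dataset="50 hand-labelled queries",
        metric="recall@5",
    ),
    # An urgent question answered with a proxy, with the assumption written down.
    PartialEval(
        question=questions[1],
        dataset="public QA set (proxy for our domain)",
        metric="automatic faithfulness score",
        is_proxy=True,
        assumptions=["Ranking of prompt variants transfers to our domain"],
    ),
]

still_blocked = unanswered(questions, evals)
proxies = [e for e in evals if e.is_proxy]

print("Questions without any evaluation:", still_blocked)
for e in proxies:
    print(f"Proxy in use for {e.question!r}; assumes: {e.assumptions}")
```

Keeping the list this explicit makes it easy to see which questions are still blocked on the end-to-end evaluation and which answers currently rest on assumptions.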
You will often find that all your evaluation desiderata are hard to fulfil within the constraints of the development questions facing you right here, right now. It may be expensive and slow to source ground truth data based on your specific use-case, fuzzy or nuanced metrics may be difficult to capture, or the speed and cost of human annotations may not match your current requirements. For your strategic, partial evaluation, you may therefore need to use proxies or simplify, for example:
It is OK (and often necessary) to compromise, but you need to be very careful and deliberate about it. Every time you do, you should take the following steps:
In this way, you can keep moving forward: unblock your development with strategic compromises, whilst building out evaluation that comes closer and closer to all the criteria you arrived at when you first planned. You can use proxies to bootstrap your development, and then, as you build out your evaluation, replace them with better data to verify your assumptions and get more meaningful results. This way, you can iterate, refine and improve: working towards your ideal evaluation, but still getting meaningful and trustworthy results on the way.
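As a rough illustration of verifying a proxy once better data arrives, the sketch below compares a proxy's pass/fail verdicts with SME judgements on the same small sample. It assumes both are available as parallel lists of labels; the `agreement_report` helper and the data are made up for the example. It reports raw agreement, Cohen's kappa (agreement corrected for chance) and the items where the two disagree, which are usually the most useful ones to review with an SME.

```python
def agreement_report(proxy_labels, sme_labels):
    """Compare proxy verdicts with SME verdicts on the same items."""
    if len(proxy_labels) != len(sme_labels) or not proxy_labels:
        raise ValueError("Need two non-empty label lists of equal length")

    n = len(proxy_labels)
    pairs = list(zip(proxy_labels, sme_labels))

    # Fraction of items where proxy and SME give the same label.
    observed = sum(p == s for p, s in pairs) / n

    # Agreement expected by chance, from each rater's label frequencies.
    labels = set(proxy_labels) | set(sme_labels)
    expected = sum(
        (proxy_labels.count(label) / n) * (sme_labels.count(label) / n)
        for label in labels
    )

    # Cohen's kappa; if both raters always use one identical label,
    # chance agreement is 1 and kappa is undefined, so report 1.0.
    kappa = 1.0 if expected == 1 else (observed - expected) / (1 - expected)

    disagreements = [i for i, (p, s) in enumerate(pairs) if p != s]
    return {"agreement": observed, "kappa": kappa, "disagreements": disagreements}


# Made-up sample: True means "output judged acceptable".
proxy = [True, True, False, True, False, False, True, True]
sme = [True, False, False, True, False, True, True, True]

report = agreement_report(proxy, sme)
print(report["agreement"])      # 0.75
print(report["disagreements"])  # [1, 5]
```

How much agreement is "enough" depends on the decision the proxy is supporting; the point is to make the check explicit rather than to assume the proxy holds.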
Throughout this process, it is helpful to keep yourself grounded by holding the following questions in mind every time you measure something:
With this approach, it is possible to develop and build while staying trustworthy, so you can make strong claims with confidence. Even if you need to make compromises and be pragmatic, by being deliberate and explicit about it, you can still be confident in your results along the way. Taking the time and care to clearly define what good evaluation even means in your case is crucial, so that you can make smart decisions on how to evaluate, know what assumptions you are making, and be guided on how best to proceed at every step. This way, you know you are measuring exactly what matters to you, your stakeholders and your customers.
Building out robust evaluation for your AI system is hard work, but it is worth it! It allows you to be in a position where your claims match up with real performance and you can confidently deliver on your promises.