Build Trustworthy AI Evaluations For Product Development

What does good evaluation actually mean in the context of AI product development?

We’ve already seen how it can go wrong: evaluations that over-index on benchmark scores, ignore user needs because they’re harder to quantify, and rely on inconsistent task framing and measurement protocols. These setups generate impressive-looking numbers but deeply unreliable claims about system performance. Worse, they push teams away from the hard work required to build trustworthy AI. We broke down these evaluation anti-patterns in a previous blog post, where you can find the full details.

We’ve shown that to make trustworthy performance claims, you need evaluation that is transparent, reliable and meaningful. This takes work and thought, and does not come out of the box: good evaluation needs to be specific to your use-case. But what does that look like concretely? We want to share our approach to evaluation design for product development: keeping evaluation trustworthy while actually getting something built!

How to approach your evaluation

So you’ve got a shiny new idea for a product you want to build. How can you design evaluation so that you can make trustworthy claims about its capabilities, but also guide decisions during the development process?

Start by getting an overview of the end-to-end process

To design your evaluation, you need to deeply understand what it is you are trying to build and measure. The following questions can guide you through that process. First, break down your system to first principles:

What problem am I trying to solve?
Who are my users?
What does an end-to-end flow through my system look like?
Are there any separate stages involved in what I’m building, or any sub-components?

These questions might seem basic but it is easy to get carried away and skip fundamentals. Next, you need to understand the data involved, both end-to-end and for any sub-components:

What are my inputs and outputs?
What modality am I working over? Text, images, sound, PDF, multi-modal, ...?
Will the data be messy or clean?
Do I know the input in advance, or is it open? Are there any constraints on it?
What is my output space? What are possible labels or results?

This will allow you to plan what kind of data needs to be involved in your evaluation. Finally, you need to start thinking about your measurements:

What are my criteria for goodness or success? Is it getting the right label? Explanation quality? Style? Speed? Correctness? ...?
What measurements can I make to capture each of these?
How can I make these measurements? Can I get them programmatically, or do they require judgement or human annotation?

By answering these questions, you will get a much clearer view of what you are trying to build, and how you can evaluate it meaningfully.
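One way to keep the answers to these questions in a form the whole team can review is to write them down as a small, structured plan. The following is a minimal sketch, not a prescribed format: the class names (`Method`, `Criterion`, `EvalPlan`), the helper `needs_human_input`, and the support-ticket product are all hypothetical and chosen purely for illustration.

```python
from dataclasses import dataclass, field
from enum import Enum


class Method(Enum):
    """How a measurement is obtained."""
    PROGRAMMATIC = "programmatic"  # computed directly from system outputs
    LLM_JUDGE = "llm_judge"        # scored by a model acting as a judge
    HUMAN = "human"                # needs SME judgement or annotation


@dataclass
class Criterion:
    name: str         # what "good" means, e.g. "correct queue"
    measurement: str  # how it is captured, e.g. "exact-match accuracy"
    method: Method


@dataclass
class EvalPlan:
    problem: str
    users: str
    stages: list[str]
    criteria: list[Criterion] = field(default_factory=list)

    def needs_human_input(self) -> list[str]:
        """Names of criteria that cannot be measured programmatically yet."""
        return [c.name for c in self.criteria if c.method is not Method.PROGRAMMATIC]


# Hypothetical example product: routing support tickets and drafting replies.
plan = EvalPlan(
    problem="Route support tickets and draft a first reply",
    users="Support agents",
    stages=["routing", "reply drafting"],
    criteria=[
        Criterion("correct queue", "exact-match accuracy", Method.PROGRAMMATIC),
        Criterion("reply latency", "seconds per ticket", Method.PROGRAMMATIC),
        Criterion("reply quality", "1-5 rubric score", Method.HUMAN),
    ],
)

print(plan.needs_human_input())
```

Listing the criteria that still depend on judgement or annotation makes it easier to see early which measurements will be slow or expensive to obtain.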

Involve SMEs early and frequently

Understanding your product and your evaluation will require close collaboration with Subject Matter Experts (SMEs). It’s critical to involve them early in the process. They help validate your assumptions, surface what actually matters in practice, and check whether you’re even asking the right questions.

In many cases, you’ll discover that users care about something quite different from what you expected, that there are additional desiderata you hadn’t considered, or that there are simplifying constraints you can include. SMEs are equally essential when it comes to designing your evaluation: as we know, generic evaluation does not exist, and you need to sculpt yours around what an SME would consider to be ground truth.

A clear view on your evaluation

The questions above aren’t just a prerequisite for evaluation design; they’re also a forcing function for team alignment on what you are building. Trying to design good evaluation often “flushes out” hidden assumptions and expectations that the team has not yet aligned on. You need a shared understanding of what “good” actually means for your full, complex system and, by extension, which behaviors need measuring. Only then can you design evaluations that are genuinely informative rather than performative.

But this is usually where implementation starts to bite. You now have your ideal evaluation and all desiderata, but you may not have any relevant ground truth data, nuanced metrics may be hard to capture, and solving these problems takes time. Your development is blocked in the meantime because you can’t measure progress or compare design choices. At this stage, perfect evaluation can seem uncomfortably far away. How do you still proceed?

Iterate evaluation strategy and develop in parallel

The most important next step is to unblock your development. In our experience, this works best with a parallel approach: investing part of our effort in building out the high-quality end-to-end evaluation, while in parallel using strategic, partial evaluations to power immediate development. Are there any sub-problems you know you need to solve that you can already evaluate? Start with those. Next, try to find the most urgent questions you need answers to that require some evaluation you are currently missing. Can you design intermediate evaluation that helps you answer those, even if it does not cover your entire use-case? You can use multiple approaches and datasets to measure different things, and it may be possible to answer immediate questions facing your design by using proxies or simplifying assumptions.
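As an illustration of a strategic, partial evaluation, the sketch below scores a single sub-component (a ticket router) on a small labelled set and uses the result to compare two design choices. Everything in it is an assumption made for the example: the `accuracy` helper, the two toy routers, and the four labelled tickets are hypothetical, and exact-match accuracy is only one possible metric.

```python
from typing import Callable, Sequence


def accuracy(predict: Callable[[str], str],
             examples: Sequence[tuple[str, str]]) -> float:
    """Fraction of examples where the prediction equals the expected label."""
    if not examples:
        raise ValueError("need at least one labelled example")
    correct = sum(1 for text, expected in examples if predict(text) == expected)
    return correct / len(examples)


# Two hypothetical design choices for the routing stage.
def router_a(text: str) -> str:
    lowered = text.lower()
    return "billing" if ("invoice" in lowered or "charged" in lowered) else "technical"


def router_b(text: str) -> str:
    return "billing" if "invoice" in text.lower() else "technical"


# Small, hand-labelled sample covering only the routing stage.
examples = [
    ("I was charged twice", "billing"),
    ("Refund my last invoice", "billing"),
    ("The app crashes on login", "technical"),
    ("Cannot reset my password", "technical"),
]

for name, router in [("router_a", router_a), ("router_b", router_b)]:
    print(name, accuracy(router, examples))
```

A comparison like this says nothing about the end-to-end system, but it can answer an immediate design question while the fuller evaluation is still being built.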

You can use proxies and simplify, but you need to be deliberate about it

You will often find that it is hard to fulfil all your evaluation desiderata within the constraints of the development questions facing you right here, right now. It may be expensive and slow to source ground truth data based on your specific use-case, fuzzy or nuanced metrics may be difficult to capture, or the speed and cost of human annotations may not match your current requirements. For your strategic, partial evaluation, you may therefore need to use proxies or simplify, e.g.:

Metrics: if it’s hard to measure exactly what you want, you may need to find a different metric that comes close but is easier to measure
Data: instead of sourcing your own data, you may rely on existing benchmarks or datasets that are similar or measure part of what you care about
Annotations: you may need speedier annotation signals, e.g. replacing human annotation with LLM-as-judge

It is ok (and often necessary) to compromise, but you need to be very careful and deliberate about it. Every time you do, you should take the following steps:

1. Explicitly note your assumptions: What are you leaving out? What are you assuming is “close enough”? What are you simplifying, and how? What are you trading off? What are you substituting?
2. Verify: can you try proving any of the assumptions made in step 1? Can you get confirmation from SMEs? Can you run a quick experiment that shows your proxy maps onto the real thing satisfactorily? Can you iterate your system so the assumptions are true for now by decreasing design scope to match?
3. Gauge impact: can you measure or quantify how reliable your proxies are? How much are you losing, and what is getting dropped? Is this acceptable in the context of the decision you are trying to make with this evaluation, or do you need to go back and find a different method?
4. Communicate clearly: be explicit about the outcome of the above steps when communicating about what you’re building with stakeholders, end-users and customers. The more explicit you are about your assumptions, compromises and proxies, the better you can still make trustworthy claims about what your system can do, and why you believe that to be true.
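The "verify" and "gauge impact" steps can often start with a very small experiment. The sketch below assumes you have a handful of items labelled both by a human annotator and by a proxy such as an LLM-as-judge, and computes raw agreement plus Cohen's kappa, which corrects agreement for chance. The labels are made-up illustration data, the function names are hypothetical, and what level of agreement counts as "good enough" is a judgement for you and your SMEs to make.

```python
from collections import Counter
from typing import Sequence


def agreement_rate(a: Sequence[str], b: Sequence[str]) -> float:
    """Fraction of items on which the two label sequences agree."""
    if len(a) != len(b) or not a:
        raise ValueError("label sequences must be non-empty and equally long")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cohens_kappa(a: Sequence[str], b: Sequence[str]) -> float:
    """Agreement corrected for the agreement expected by chance."""
    observed = agreement_rate(a, b)
    n = len(a)
    counts_a, counts_b = Counter(a), Counter(b)
    # Chance agreement: probability both raters pick the same label
    # if each labelled independently at their own observed rates.
    expected = sum(counts_a[label] * counts_b[label]
                   for label in set(a) | set(b)) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Made-up labels for eight items: human ground truth vs. proxy judge.
human = ["pass", "pass", "fail", "fail", "pass", "fail", "pass", "fail"]
judge = ["pass", "pass", "fail", "pass", "pass", "fail", "pass", "fail"]

print(agreement_rate(human, judge))  # 0.875
print(cohens_kappa(human, judge))    # 0.75
```

Recording these numbers next to the assumption they test gives you something concrete to share when you communicate what the proxy does and does not show.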

Keep iterating and refining, and stay grounded

In this way, you can keep moving forward: unblock your development with strategic compromises, whilst building out evaluation that comes closer and closer to all the criteria you arrived at when you first planned. You can use proxies to bootstrap your development, and then, as you build out your evaluation, replace them with better data to verify your assumptions and get more meaningful results. This way, you can iterate, refine and improve: working towards your ideal evaluation, but still getting meaningful and trustworthy results on the way.

Throughout this process, it is helpful to keep yourself grounded by holding the following questions in mind every time you measure something:

What am I measuring, and why?

Is what I’m actually measuring the same as what I’m trying to measure or prove?

What assumptions am I currently making, and have I made them explicit?

Why do I believe this evaluation will show something meaningful? What do I expect it to show?

Staying trustworthy and confident

With this approach, it is possible to develop and build while staying trustworthy, so you can make strong claims with confidence. Even if you need to make compromises and be pragmatic, by being deliberate and explicit about it, you can still be confident in your results along the way. Taking the time and care to clearly define what good evaluation even means in your case is crucial, so that you can make smart decisions on how to evaluate, know what assumptions you are making, and be guided on how best to proceed at every step. This way, you know you are measuring exactly what matters to you, your stakeholders and your customers.

Building out robust evaluation for your AI system is hard work, but it is worth it! It allows you to be in a position where your claims match up with real performance and you can confidently deliver on your promises.
