The Problem
When I talk to teams that are making money with LLM-based products, “evaluation” is almost universally their largest challenge. We’ve talked about the challenges with evaluation at length before, but here’s a quick overview of the problems:
At the end of the day, “evaluation” is answering the question of “Does my LLM-based application ‘work’?”. This has two really tricky problems buried in it:
LLM applications typically have a huge input space, as they let users input natural language (and users can come up with an unimaginable array of natural language!)
LLMs (mostly) output unstructured information, and defining what “works” is much more complicated than in traditional software.
There are a handful of design variables at play when a developer tries to build an LLM application, namely:
The prompt
The model
(optionally) The information retrieval strategy
Changing any of these design variables can fundamentally alter the system, meaning that to iterate on any one of these variables, you need some ability to answer the fundamental question, “How well does my LLM application work?”.
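To make that concrete, here is a minimal sketch (with hypothetical names, not any particular framework) of those design variables bundled into a single config; swapping out any one field gives you a meaningfully different system to evaluate.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

@dataclass
class AppConfig:
    """Hypothetical bundle of the three design variables discussed above."""
    prompt_template: str                                    # the prompt
    model: str                                              # e.g. "gpt-4" vs. "gpt-3.5-turbo"
    retriever: Optional[Callable[[str], List[str]]] = None  # optional retrieval strategy

def build_prompt(config: AppConfig, user_input: str) -> str:
    """Assemble the final prompt; changing any field of the config changes system behavior."""
    context = "\n".join(config.retriever(user_input)) if config.retriever else ""
    return config.prompt_template.format(context=context, question=user_input)
```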
For most teams, one of those design variables has been fixed (so far): most people are using OpenAI models. Most early design work has been at the intersection of prompts and information retrieval.
The Solutions (so far)
I break down the approaches that teams have taken to evaluate their LLM applications into four categories:
Offline, human evaluation
What is it: The most common form of evaluation, human evaluation, involves collecting a list (often a spreadsheet!) of sample user inputs and having a human check how “good” the LLM application’s outputs are. The complexity can scale from ~5 examples that a single developer eyeballs whenever a change is made, to armies of contracted workers who generate thousands of example inputs and rate the outputs. (A tiny harness for this workflow is sketched after the pros and cons below.)
Pros: The most accurate way to evaluate, in particular for challenging tasks (‘How good is this summary?’). Easy to get started.
Cons: Slow (read: expensive, slow iterations), hard to scale.
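As a rough illustration, here is a minimal harness for the spreadsheet workflow described above. The `generate` function is a stand-in for your actual LLM application, and the sample inputs are made up.

```python
import csv

# Hypothetical sample user inputs a team might keep in a spreadsheet.
SAMPLE_INPUTS = [
    "Summarize this support ticket in one sentence.",
    "What is our refund policy for annual plans?",
]

def generate(user_input: str) -> str:
    """Stand-in for your actual LLM application call."""
    return "<model output goes here>"

def export_for_human_review(path: str = "eval_outputs.csv") -> None:
    """Write input/output pairs with empty rating columns for a human to fill in."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["input", "output", "rating (1-5)", "notes"])
        for user_input in SAMPLE_INPUTS:
            writer.writerow([user_input, generate(user_input), "", ""])
```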
Offline, deterministic evaluation
What is it: In some cases, deterministic metrics can be used to evaluate LLMs. If you’re generating code, maybe you can check the generated code against unit tests. If you’re building a chatbot, you can measure the length of responses as a proxy for ‘conciseness.’ Many of the popular LLM benchmarks use these kinds of metrics (see HELM). (Both checks are sketched in code after the pros and cons below.)
Pros: Fast, well-studied, allow for quick iteration on prompts.
Cons: Require you to build datasets of examples, which can be expensive. Many tasks don’t have suitable metrics, so this approach fundamentally doesn’t work for every task.
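For instance, here is a sketch of both deterministic checks mentioned above, assuming a hypothetical code-generation task (“write an `add` function”) and a chatbot reply; neither check needs a human or a model in the loop.

```python
def passes_unit_test(generated_code: str) -> bool:
    """Run the generated code and check a known test case (assumed task: write `add`)."""
    namespace: dict = {}
    try:
        exec(generated_code, namespace)      # execute the candidate solution
        return namespace["add"](2, 3) == 5   # the unit test for this example task
    except Exception:
        return False

def is_concise(reply: str, max_words: int = 50) -> bool:
    """Response length as a crude proxy for 'conciseness'."""
    return len(reply.split()) <= max_words

# Example usage with hard-coded candidates:
print(passes_unit_test("def add(a, b):\n    return a + b"))   # True
print(is_concise("Sure, here's a short answer."))             # True
```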
Offline, model-driven evaluation
What is it: Many have been tempted to use LLMs to ‘simulate’ human evaluation, for example asking a model to judge the quality of a summary. Although this seems to work okay at face value, early research suggests you shouldn’t always rely on it. (The pattern is sketched in code after the pros and cons below.)
Pros: Fast, scalable, not too expensive, measures even complicated tasks.
Cons: Still somewhat expensive (LLMs are expensive!), might not be reliable, hard to gain trust in the outputs.
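Here is a minimal sketch of the LLM-as-judge pattern, using the OpenAI Python client (v1+); the judging prompt, model name, and 1-5 scale are assumptions for illustration, not a standard recipe.

```python
from openai import OpenAI  # assumes the openai>=1.0 client and OPENAI_API_KEY in the environment

client = OpenAI()

JUDGE_PROMPT = """Rate the following summary of the source text on a 1-5 scale
for faithfulness and clarity. Reply with only the number.

Source text:
{source}

Summary:
{summary}"""

def judge_summary(source: str, summary: str, model: str = "gpt-4") -> int:
    """Ask a model to grade a summary; treat the score as a noisy signal, not ground truth."""
    response = client.chat.completions.create(
        model=model,
        temperature=0,
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(source=source, summary=summary)}],
    )
    return int(response.choices[0].message.content.strip())
```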
Online evaluation
What is it: Rather than building evaluation datasets (which is hard!), some teams have chosen to monitor real-world user inputs and try to pick out failure cases automatically. The field of product analytics has shown us that we can measure user sessions to track a range of user engagement metrics. If you tie those metrics to the LLM completions, you may be able to find and debug edge cases. (The logging side is sketched after the pros and cons below.)
Pros: In some sense, the ultimate source of truth: what your users like. Great if you already have something in prod. No need to build ground truth datasets.
Cons: You need to put something in production first; no ability to gain trust in the system offline. Not all metrics can be easily calculated. Slow iteration: you need to collect production traffic to compare designs.
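Here is a sketch of what the logging side might look like, with hypothetical names: record each completion alongside the session id and whatever engagement signal the product already tracks (a thumbs up/down here), then pull out completions from sessions that went badly.

```python
import json
import time
from typing import List, Optional

LOG_PATH = "completions.log"  # hypothetical append-only log, one JSON record per line

def log_completion(session_id: str, user_input: str, completion: str,
                   thumbs_up: Optional[bool] = None) -> None:
    """Append the completion together with the engagement signal you already collect."""
    record = {
        "ts": time.time(),
        "session_id": session_id,
        "input": user_input,
        "completion": completion,
        "thumbs_up": thumbs_up,
    }
    with open(LOG_PATH, "a") as f:
        f.write(json.dumps(record) + "\n")

def likely_failures(path: str = LOG_PATH) -> List[dict]:
    """Surface completions from sessions with a negative signal for manual debugging."""
    with open(path) as f:
        records = [json.loads(line) for line in f]
    return [r for r in records if r["thumbs_up"] is False]
```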
My thoughts
Solving these evaluation challenges is the most important outstanding challenge for the broad adoption of LLMs. Being able to iterate quickly is the difference between a cool demo and a product. In particular, I think solving the offline case is critical — figuring out how to effectively evaluate novel tasks will be central when building LLM applications.