It's LLMs All the Way Down: A Practical Guide to GenAI Evals

Generative AI (GenAI) systems and Large Language Models (LLMs) are empowering us to tackle new types of problems and enabling the implementation of smart, autonomous (or semi-autonomous) systems. However, the history of responsible machine learning and data science is rooted in the need to quantify and monitor the performance of the models we use.

In the brave new world of GenAI, new challenges arise because the models are more complex and inherently non-deterministic, and because evaluating their outputs is much more nuanced. In this scenario, we cannot rely purely on classic numerical metrics such as Recall and Precision, which are often derived from exact matching of strings. GenAI solutions can range from simple single-shot calls to an LLM to complex Agentic AI workflows that incorporate Retrieval-Augmented Generation (RAG), deterministic tools, and sub-agents; each component needs its own form of evaluation in addition to an end-to-end performance measurement.

Choosing Our Evaluation

The first thing we have to decide is what we are going to measure. There is no one-size-fits-all answer here, as we need to consider what our solution is designed to achieve and what we care about with regard to its performance.

We must decide which metrics are relevant to our solution. We also need to draw a dividing line between evaluating a solution during development and prior to production release (so that we have a realistic expectation of what our users are going to experience) and ongoing monitoring/guardrails for protecting the system from drift. The metrics we will discuss are relevant in both scenarios, but the practicalities of how they are implemented are a little different; henceforth, we will simply assume we are evaluating during development to avoid confusion. So, let's explore a couple of example scenarios.

Moving Beyond N-Grams: Statistical Scoring

Before the rise of generative LLMs, the primary tasks for language models were more constrained, such as machine translation and text summarisation. To evaluate these tasks, metrics were developed to measure the similarity between a model-generated text and a set of high-quality human-written reference texts.

The most common of these are BLEU and ROUGE, both of which work by counting the overlap of n-grams (sequences of 'n' words) between the candidate (model output) and the reference (human "gold standard") texts.

BLEU (Bilingual Evaluation Understudy): This is a precision-focused metric that measures how many n-grams from the model's output also appear in the human reference. It answers: "Of the words in the generated text, how many were correct?".
ROUGE (Recall-Oriented Understudy for Gisting Evaluation): This is a recall-focused metric primarily used for text summarisation. It measures how many n-grams from the human reference also appear in the model's output. It answers: "How much of the essential information from the reference summary did the model capture?".
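
To make the precision versus recall distinction concrete, here is a minimal sketch of clipped n-gram overlap in plain Python. It is not a full BLEU or ROUGE implementation (there is no brevity penalty, no averaging over several n-gram orders, and only one reference); the function names ngrams and ngram_overlap are illustrative assumptions.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return a multiset (Counter) of the n-grams in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def ngram_overlap(candidate, reference, n=1):
    """Return (precision, recall) of clipped n-gram overlap.

    Precision is the BLEU-style view: the share of candidate n-grams that
    also appear in the reference. Recall is the ROUGE-style view: the share
    of reference n-grams that also appear in the candidate.
    """
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    # Counter intersection keeps the minimum count per n-gram ("clipping"),
    # so repeating a word does not inflate the score.
    overlap = sum((cand & ref).values())
    precision = overlap / max(sum(cand.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    return precision, recall


# A short candidate that is entirely "correct" but misses half the reference.
precision, recall = ngram_overlap("the cat sat", "the cat sat on the mat")
print(f"unigram precision={precision:.2f}, recall={recall:.2f}")
```

In this toy case every candidate word appears in the reference, so precision is perfect, while recall is low because half of the reference is missing. This is why the two metrics are usually read together.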

For tasks like machine translation and summarisation, modern metrics offer a more nuanced approach by looking at meaning rather than simple n-gram counting:

METEOR (Metric for Evaluation of Translation with Explicit ORdering): This balanced metric incorporates both Precision and Recall (with heavier weighting on recall) and goes beyond exact word matching by including stemming and synonymy (using resources like WordNet). Crucially, it includes a fragmentation penalty to reward correct word order.
BERTScore: A more modern approach that leverages contextual word embeddings from a pre-trained transformer model (like BERT). Instead of counting overlaps, it calculates the cosine similarity between the vector representations of the tokens. This allows it to answer: "Do the generated text and the reference text convey the same meaning, even if they use entirely different words?".
COMET (Cross-lingual Optimised Metric for Evaluation of Translation): A more recent advancement that acts like a multilingual expert. Unlike other metrics that only compare a model's output to a human reference, COMET also looks at the original source text. This "triple check" ensures that the translation is not just fluent in the target language, but a faithful reflection of the original intent.
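
The embedding-based idea behind BERTScore can be sketched with a toy example. The two-dimensional vectors below are invented purely for illustration (a real metric would take contextual embeddings from a pre-trained transformer), and toy_bertscore is a hypothetical name, not the API of any library.

```python
import math

# Hypothetical 2-d "embeddings", hand-picked so that synonyms point in
# similar directions. Real systems use high-dimensional contextual vectors.
TOY_EMBEDDINGS = {
    "quick": (0.9, 0.1),
    "fast": (0.85, 0.2),
    "car": (0.1, 0.9),
    "automobile": (0.15, 0.95),
}


def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))


def greedy_match(source_tokens, target_tokens):
    """Average, over source tokens, of the best similarity found in target."""
    best = [
        max(cosine(TOY_EMBEDDINGS[s], TOY_EMBEDDINGS[t]) for t in target_tokens)
        for s in source_tokens
    ]
    return sum(best) / len(best)


def toy_bertscore(candidate, reference):
    """Return (precision, recall, f1) using greedy token matching."""
    cand, ref = candidate.split(), reference.split()
    precision = greedy_match(cand, ref)  # candidate tokens matched to reference
    recall = greedy_match(ref, cand)     # reference tokens matched to candidate
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# No word is shared, so any n-gram metric would score this pair as zero.
p, r, f1 = toy_bertscore("fast car", "quick automobile")
print(f"precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
```

Because the score depends on vector similarity rather than shared strings, the pair scores highly despite having no words in common.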

Generative AI Metrics: LLM-as-a-Judge

"LLMs all the way down"

With GenAI, the traditional statistical metrics often fail to capture some of the most important qualities of the LLM output, such as factual accuracy, coherence, and safety. The most prominent and scalable new approach uses an LLM itself as the judge: the LLM-as-a-Judge paradigm. Whether Russell or Pratchett is your preferred philosopher, you may be familiar with the phrase, "it's turtles all the way down." When we use a powerful LLM to assess the outputs of another LLM, it can feel like we are simply building a stack of models on top of models. While this sounds recursive, it is incredibly effective, provided we remember that the "bottom turtle" must still be grounded by us. This is why human-in-the-loop sampling and human-annotated "Golden Datasets" remain the ultimate anchor for these judge-led processes.

In this paradigm, we use a powerful or more specialised LLM to assess and critique the output of another LLM, crucially returning both a score and the reasoning behind it. It typically operates in one of several modes:

Pointwise Scoring (with a rubric): The judge LLM is given a single model output and a detailed rubric. This is excellent for checking correctness against defined rules.
Pairwise Comparison: The judge LLM is shown two different model outputs (e.g., from Model A and Model B) and asked to decide which one is better and why. This is great for A/B testing prompts or different models.
Jury Voting: A diverse panel of distinct LLMs evaluates the same content to reach a consensus. By aggregating these individual judgements (e.g. via majority vote), we create an ensemble effect that helps smooth out specific model biases and improves reliability.
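
As a rough sketch of how Jury Voting aggregates individual judgements, the snippet below takes a majority vote over a panel of judges. The three judges are stand-in functions with hard-coded rules so that the example runs offline; in practice each would wrap a call to a different LLM. All names here are illustrative assumptions.

```python
from collections import Counter


def jury_vote(judges, question, answer):
    """Collect one verdict per judge and return the majority label.

    Returns (winning_label, agreement_ratio, all_votes). Ties resolve to the
    label that was seen first, so an odd-sized panel is the safer choice.
    """
    votes = [judge(question, answer) for judge in judges]
    winner, top_count = Counter(votes).most_common(1)[0]
    return winner, top_count / len(votes), votes


# Stand-in judges: each would be a separate LLM call in a real pipeline.
def judge_a(question, answer):
    return "pass" if "complexity" in answer.lower() else "fail"


def judge_b(question, answer):
    return "pass"  # a deliberately lenient judge


def judge_c(question, answer):
    return "pass" if "O(" in answer else "fail"


question = "What is the time complexity of the Quicksort algorithm in the average case?"
answer = "Quicksort is an efficient, in-place sorting algorithm."

winner, agreement, votes = jury_vote([judge_a, judge_b, judge_c], question, answer)
print(winner, f"{agreement:.2f}", votes)
```

The agreement ratio is worth logging alongside the verdict: a narrow majority is a natural trigger for human review.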

Core Generative AI Metrics

The primary focus of modern LLM evaluation is managing quality and mitigating real-world risks. As we move from simple chatbots to agentic workflows, the stakes for accuracy become significantly higher.

Factual Accuracy and Hallucination

The most fundamental requirement for many LLM applications is that their outputs be reliable. However, "reliability" creates a dichotomy between what is true in the world and what is true according to your internal data.

Correctness (vs. Ground Truth): This is the most traditional accuracy measurement. It evaluates whether the generated output is factually correct when compared against a known, verifiable "Gold Standard" or world knowledge. This is an assessment of the model's external knowledge. This requires a pre-existing dataset of questions and their correct answers.
Faithfulness (Contextual Adherence): This is a critical metric, especially for RAG systems. It measures whether the claims made in a response are supported exclusively by the provided source context. It does not measure correctness against the real world, but rather how well the model "stays in its lane" regarding your internal data.

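Why the distinction matters: Imagine a corporate chatbot that is given an outdated travel policy document stating that the dinner allowance is £25, while the model knows from its training that the industry standard is now £40. An answer of £40 would be correct against world knowledge but unfaithful to the supplied policy; an answer of £25 would be faithful to the policy but would not match the industry figure.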

Evaluation in Action: The “Judge” Call

To measure both, we provide an “LLM Judge” with the user’s question, the document (context), the ground truth, and the original model’s response, and we prompt the “Judge” with:

“Compare the Model Response against the Provided Context and the Ground Truth. Rate the response from 0 to 1 for Correctness (truth in the real world) and Faithfulness (adherence to the document).
User Question: …
Provided Context: …
Ground Truth: …
Model Response: …”

Component	Content
User Question	“What is my dinner allowance?”
Provided Context	“Internal Policy v1.2: Employees are entitled to a £25 dinner expense limit”
Ground Truth	“The current industry standard meal stipend is £40”
Model Response	“The allowance is usually £40”

Judge's Verdict

Correctness Score: 1/1 ✅
- Reasoning: The model’s answer matches the real-world ground truth.
Faithfulness Score: 0/1 ❌
- Reasoning: The model ignored the provided context (£25) and used its own training data instead. This is a “hallucination” relative to the source material.

We often prioritise Faithfulness to ensure the AI does not override contextual documents/data with its own historic training data.
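
The judge call above could be wired up roughly as follows. This is a sketch under stated assumptions: the instruction to reply in JSON, the key names, and the fake_judge stand-in (which returns a canned verdict instead of calling a model) are all illustrative additions rather than part of the prompt shown earlier.

```python
import json

JUDGE_TEMPLATE = """Compare the Model Response against the Provided Context and the Ground Truth.
Rate the response from 0 to 1 for Correctness (truth in the real world) and Faithfulness (adherence to the document).
Reply with JSON using the keys "correctness", "faithfulness" and "reasoning".
User Question: {question}
Provided Context: {context}
Ground Truth: {ground_truth}
Model Response: {response}"""


def build_judge_prompt(question, context, ground_truth, response):
    """Fill the judge template with one evaluation case."""
    return JUDGE_TEMPLATE.format(
        question=question, context=context,
        ground_truth=ground_truth, response=response,
    )


def parse_verdict(raw):
    """Parse the judge's JSON reply and check both scores lie in [0, 1]."""
    verdict = json.loads(raw)
    scores = {}
    for key in ("correctness", "faithfulness"):
        score = float(verdict[key])
        if not 0.0 <= score <= 1.0:
            raise ValueError(f"{key} score out of range: {score}")
        scores[key] = score
    scores["reasoning"] = str(verdict.get("reasoning", ""))
    return scores


def fake_judge(prompt):
    """Stand-in for a real LLM call; returns a canned verdict for the demo."""
    return json.dumps({
        "correctness": 1,
        "faithfulness": 0,
        "reasoning": "Matches the ground truth but contradicts the provided context.",
    })


prompt = build_judge_prompt(
    question="What is my dinner allowance?",
    context="Internal Policy v1.2: Employees are entitled to a £25 dinner expense limit",
    ground_truth="The current industry standard meal stipend is £40",
    response="The allowance is usually £40",
)
verdict = parse_verdict(fake_judge(prompt))
print(verdict)
```

Validating the score range before storing a verdict is a cheap safeguard, since a judge model can return malformed or out-of-range output.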

Hallucination

A hallucination is the generation of information that sounds plausible but is factually incorrect, nonsensical, or not grounded in any provided source data. Hallucinations are among the most significant challenges facing the reliable deployment of LLMs.

The Air Canada Case: The business and legal risks associated with hallucinations are not merely theoretical. In a widely publicised case, a customer interacting with Air Canada's support chatbot was told that they could apply for a bereavement fare retroactively, based on a policy the chatbot invented. A Canadian tribunal ruled that the airline was responsible for all information on its website, whether from a static page or a chatbot, and ordered the airline to honour the hallucinated policy. This demonstrates that organisations can be held liable for the erroneous outputs of their AI systems.

Detection Techniques: In addition to LLM-as-a-judge and faithfulness checks, techniques include Self-Consistency, which generates multiple responses to the same prompt (sometimes from multiple models) and checks them for stability, and benchmarks like TruthfulQA, designed to measure a model's propensity to generate answers that mimic common human falsehoods.
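
A Self-Consistency check can be sketched as an agreement ratio over repeated samples. The sampled answers below are made up for illustration, and exact matching after light normalisation is a crude proxy; free-form answers would need a semantic comparison (for example, a judge model) instead.

```python
from collections import Counter


def normalise(answer):
    """Lower-case, collapse whitespace and drop a trailing full stop."""
    return " ".join(answer.lower().split()).rstrip(".")


def self_consistency(answers):
    """Return the most common answer and the share of samples that agree."""
    counts = Counter(normalise(a) for a in answers)
    top_answer, top_count = counts.most_common(1)[0]
    return top_answer, top_count / len(answers)


# Hypothetical samples from asking the same question five times.
samples = ["£25", "£25.", "£40", "£25", "  £25 "]
top_answer, agreement = self_consistency(samples)
print(top_answer, agreement)
```

A low agreement ratio does not prove a hallucination, but it is a useful signal that the model is unsure and the response deserves closer inspection.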

Relevance and Coherence

Beyond being factually correct, a high-quality response must also be relevant to the user's needs and presented in a logical, understandable manner.

Answer Relevancy: Evaluates how effectively the generated response addresses the user's specific query and intent. It penalises answers that are tangential, overly broad, or fail to address the core question, even if the information provided is factually correct.
Case Example: For the query, "What is the time complexity of the Quicksort algorithm in the average case?", the relevant answer is "The average-case time complexity of Quicksort is O(n log n)." An irrelevant answer, though factually correct, might only state: "Quicksort is an efficient, in-place sorting algorithm," failing to address the core question.
The Evaluation Prompt:

"Analyse the Generated Answer against the User Query. Does the answer directly address the specific question asked? Penalise answers that are technically correct but fail to provide the requested information; score the answer from 0-1.
Generated Answer: …
User Query: … "

Component	Content
User Query	“What is the time complexity of the Quicksort algorithm in the average case?”
Generated Answer	“Quicksort is a highly efficient, in-place sorting algorithm developed by Tony Hoare. It is widely used in standard libraries.”

Judge's Verdict

Relevancy Score: 0.2/1 ❌
Reasoning: While the answer provides factually correct information about Quicksort's history and efficiency, it completely fails to state the time complexity (O(n log n)) requested by the user. The response is tangential and does not satisfy the user's intent.

Semantic Coherence: Evaluates the internal logical flow and consistency of the generated text. A coherent response is well-structured, with ideas and sentences connecting logically. An incoherent response may feel disjointed, repetitive, or contradictory.
Case Example: For the prompt, "Explain why overfitting is a problem in machine learning."
A Coherent Answer: "Overfitting occurs when a model learns the training data too well, capturing noise rather than the underlying pattern. Consequently, the model performs poorly on unseen data because it fails to generalise."
An Incoherent Answer: "Overfitting learns the noise. The data is training data. It is a problem for the model. Generalisation is failing. The pattern is not captured. It works well."
While the keywords are present, the response is disjointed, robotic, and lacks the logical connective tissue to form a persuasive explanation.

Safety and Responsibility

Ensuring outputs are safe, ethical, and unbiased is a critical evaluation dimension, especially for user-facing applications.

Toxicity: Measures the presence of any harmful, offensive, or inappropriate content in the model's output. Benchmarks like ToxiGen are used to evaluate a model's ability to detect and avoid generating explicit and, more subtly, implicit hate speech.
Bias: Quantifies the extent to which a model's outputs exhibit unfair prejudice or stereotyping related to demographic attributes.

Case Example: A classic test for gender bias involves prompts like "The doctor spoke to the nurse and <pronoun> said...". A biased model might consistently complete the sentence with "she," reinforcing the stereotype that nurses are female. Datasets like BOLD (Bias in Open-Ended Language Generation Dataset) provide a large set of prompts designed to surface and measure biases across various domains.
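
One simple way to quantify the pronoun test is to sample many completions and look at the pronoun distribution. The completions below are invented for illustration and are not real model outputs; pronoun_distribution is a hypothetical helper name.

```python
import re
from collections import Counter

PRONOUNS = ("he", "she", "they")
PRONOUN_PATTERN = re.compile(r"\b(he|she|they)\b")


def pronoun_distribution(completions):
    """Return the share of completions whose first pronoun is he/she/they."""
    counts = Counter()
    for text in completions:
        match = PRONOUN_PATTERN.search(text.lower())
        if match:
            counts[match.group(1)] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    return {p: counts[p] / total for p in PRONOUNS}


# Hypothetical completions for the <pronoun> slot in the prompt above.
completions = [
    "she said the patient was stable",
    "she said the ward was full",
    "he said the results were back",
    "she said to wait outside",
]
distribution = pronoun_distribution(completions)
print(distribution)
```

A heavily skewed distribution across a large, varied prompt set is the signal of interest; a handful of completions like this is only enough to show the mechanics.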

Tiered Evaluation for RAG and Agentic Workflows

For sophisticated systems like RAG and multi-step agents, a tiered evaluation approach is essential, as a failure at an early stage usually propagates to the final output.

Before diving into the metrics, let’s take a quick detour to clarify what a RAG system actually does. Think of a standard LLM as a brilliant student taking an exam based only on their memory; Retrieval-Augmented Generation (RAG) is like giving that student an open-book exam. Instead of relying solely on its original training, the model first "retrieves" specific, relevant documents from your internal database and then "augments" its response using that fresh information. This significantly reduces the risk of the model hallucinating and helps ground its answers in your specific, up-to-date data.

RAG Evaluation: Retrieval Quality

The quality of the retrieval stage sets the performance ceiling for the entire RAG system.

Contextual Precision: Measures the signal-to-noise ratio of the retrieved context. It asks: "Of the context that was retrieved, how much of it was actually useful?".
Contextual Recall: Measures the completeness of the retrieved information. It asks: "Did we find all the relevant information that exists in our knowledge base?".
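
Read literally, the two questions above reduce to simple set arithmetic once each retrieved chunk has a relevance label. The sketch below assumes those labels already exist (from human annotation or a judge model); the document identifiers are made up for illustration.

```python
def contextual_precision(retrieved, relevant):
    """Share of retrieved chunks that are actually relevant."""
    if not retrieved:
        return 0.0
    hits = sum(1 for doc_id in retrieved if doc_id in relevant)
    return hits / len(retrieved)


def contextual_recall(retrieved, relevant):
    """Share of the relevant chunks in the knowledge base that were retrieved."""
    if not relevant:
        return 1.0  # nothing to find, so nothing was missed
    return len(set(retrieved) & set(relevant)) / len(relevant)


# Hypothetical retrieval result for one query.
retrieved = ["policy_v1.2", "holiday_faq", "expenses_guide", "canteen_menu"]
relevant = {"policy_v1.2", "expenses_guide", "travel_addendum"}

precision = contextual_precision(retrieved, relevant)
recall = contextual_recall(retrieved, relevant)
print(f"precision={precision:.2f}, recall={recall:.2f}")
```

Here half of the retrieved context is noise, and one relevant document was never found. This set-based version ignores ranking; a rank-aware variant would additionally reward placing relevant chunks near the top.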

Agentic Evaluation

Agentic workflows involve multiple steps and stages, and can include LLM-based sub-agents, deterministic tools, and LLM orchestration. In addition, there may be dynamic workflows which add more complexity to how the system completes its task. Hence, depending on the implementation, various metrics and evaluations can be incorporated to assess the system.

Completion Success Rate: This is the ultimate, bottom-line metric for an agent's effectiveness. It is defined as the percentage of tasks or workflows that the agent completes successfully end-to-end. For example, if a scheduling agent successfully books the correct appointment for 85 out of 100 requests, its success rate is 85%.
Task-Specific Metrics: Many agentic workflows have unique definitions of success that require custom rubrics, often evaluated by an LLM-as-a-judge.
- Case Example: For a travel agent asked to "Plan a 7-day budget trip to Rome for a history buff," we assess more than just "did it produce an itinerary?". We check Preference Adherence (Is it actually 7 days? Is it low budget?), Logical Flow (Are the travel times realistic?), and Novelty (Did it find unique historical sites?).
Tool Selection & Call Correctness: This evaluates the agent's ability to interface with the external world. It measures:
- Tool Selection Accuracy (did it choose the right tool?)
- Syntactic Accuracy (was the API call formatted correctly?)
- Semantic Correctness (were the parameter values, like city_name, actually correct?).
Invocation Accuracy (Decision Quality): This evaluates the agent’s initial decision regarding whether a tool is required at all. An agent should not invoke tools unnecessarily. For instance, if a user says "Thank you," the correct action is to reply politely, not to trigger a search tool.
Trajectory Efficiency: This measures the "path" the agent took. Two agents might both arrive at the correct answer, but one might take a direct route while the other takes the "scenic route," wasting time and tokens. Efficiency metrics include comparing step counts against an optimal "Golden Trajectory" and identifying redundant loops.
User-Centric Metrics: Even a technically successful agent can be annoying. These qualitative metrics ask: "Was the interaction helpful and pleasant?" This is often measured via direct user feedback (Thumbs Up/Down) or by using an LLM-judge to analyse conversation logs for sentiment and empathy.
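
Several of the quantitative agent metrics above can be computed directly from logged runs. The sketch below assumes each run records whether it succeeded, the tools the agent called, and a "Golden Trajectory" of expected tools; the Run structure and tool names are hypothetical, and comparing tools position by position is a deliberately crude alignment.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Run:
    succeeded: bool          # did the task complete correctly end-to-end?
    calls: List[str]         # tool names the agent actually invoked, in order
    golden_calls: List[str]  # tool names in the optimal "Golden Trajectory"


def completion_success_rate(runs):
    """Fraction of runs that completed successfully."""
    return sum(run.succeeded for run in runs) / len(runs)


def tool_selection_accuracy(runs):
    """Share of golden steps where the agent chose the expected tool."""
    hits = total = 0
    for run in runs:
        for position, expected in enumerate(run.golden_calls):
            total += 1
            if position < len(run.calls) and run.calls[position] == expected:
                hits += 1
    return hits / total if total else 1.0


def trajectory_efficiency(run):
    """Golden step count divided by actual step count, capped at 1.0."""
    if not run.calls:
        return 1.0 if not run.golden_calls else 0.0
    return min(1.0, len(run.golden_calls) / len(run.calls))


golden = ["search_slots", "book_appointment"]
runs = [
    Run(True, ["search_slots", "book_appointment"], golden),
    Run(True, ["search_slots", "search_slots", "book_appointment"], golden),
    Run(False, ["book_appointment"], golden),
]

print(f"success rate: {completion_success_rate(runs):.2f}")
print(f"tool selection accuracy: {tool_selection_accuracy(runs):.2f}")
print(f"efficiency of run 2: {trajectory_efficiency(runs[1]):.2f}")
```

The second run succeeds but repeats a search, so it scores well on completion and poorly on efficiency, which is exactly the "scenic route" case described above.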

Summary

Evaluation is not a secondary task for Generative AI; it is the central discipline that underpins responsible and reliable deployment. We must move beyond simple string-matching to embrace semantic, factual, and safety-focused metrics. There are also many curated datasets for specific task types that can be used to help in evaluation, but ultimately nothing beats a task-specific, human-curated dataset that represents exactly what your AI is likely to “see” and what you would consider to be good outputs.

The LLM-as-a-Judge paradigm is now the dominant, scalable method for assessing free-form text, but it requires continuous validation against human-annotated data. For complex systems like RAG and multi-step agents, a tiered approach evaluating retrieval, generation, and tool use is essential to ensuring end-to-end success. By combining traditional statistical metrics with modern generative ones, we can confidently steer these powerful models towards safer, more accurate, and ultimately more valuable real-world applications.

Paradigm	Focus	Core Metrics & Goal
Traditional Statistical Evals	Constrained NLP (Translation, Summarisation)	BLEU (Precision), ROUGE (Recall), BERTScore (Semantic Similarity)
Generative AI Evals	Free-form text output	LLM-as-a-Judge is the dominant approach, used for Pointwise Scoring, Pairwise Comparison, and Jury Voting.
RAG Evals	Information retrieval quality	Contextual Precision (signal-to-noise ratio) and Contextual Recall (completeness of retrieved information).
Agentic Evals	Multi-step workflows & Tool use	The overall metric is Completion Success Rate, complemented by Tool Selection Accuracy and Task-Specific Custom Rubrics.