AI hallucinations are not random bugs — they are systematic failure modes that can be measured, benchmarked and reduced. Here is how to build a rigorous hallucination testing programme for LLM-powered applications.
When a large language model confidently states that a drug has no known side effects, that a court case was decided in favour of the wrong party, or that a financial regulation says something it does not say — that is a hallucination. It is not a glitch. It is not an edge case. It is a fundamental characteristic of how LLMs generate text, and in production applications where users act on AI-generated content, hallucinations carry real business and legal consequences.
The term hallucination covers a spectrum of AI failure modes: factual errors presented with high confidence, fabricated citations and sources, invented product names, prices or specifications, incorrect summaries of real documents, and plausible-sounding but entirely made-up statistics. What unifies them is that the model generates fluent, confident output that is factually wrong — and that users have no reliable way to detect the error without independent verification.
LLMs are not databases. They do not retrieve stored facts — they predict the most statistically likely next token given their training data and the current context. This means they can generate text that sounds like a fact without any underlying ground truth to verify against. When asked about something outside their training distribution, at the edge of their knowledge, or in domains where training data was sparse or contradictory, models fill gaps with plausible-sounding fabrications rather than expressing uncertainty.
Several factors increase hallucination risk: knowledge cutoffs (models trained before a regulatory change will confidently state outdated rules), domain specificity (medical, legal and technical domains have high hallucination rates because precise factual accuracy matters and errors are subtle), retrieval failures in RAG systems (when retrieved context is incomplete or irrelevant, models may hallucinate rather than admit insufficient information), and prompt design (leading questions and insufficient context push models toward confident confabulation).
Effective hallucination testing requires a structured approach across three dimensions: benchmark dataset construction, evaluation methodology, and continuous monitoring. Point-in-time testing before deployment establishes a baseline hallucination rate but does not address model drift, prompt changes or retrieval failures that emerge in production. A mature programme covers all three.
Benchmark dataset construction begins with identifying the highest-risk factual domains for your specific application — the categories of claim where a hallucination would cause the most harm. For a legal research tool, this means jurisdiction-specific case law and statutory interpretation. For a medical information platform, it means drug interactions, dosing guidelines and diagnostic criteria. For a financial services chatbot, it means regulatory requirements, product terms and compliance rules. Generic hallucination benchmarks exist, but domain-specific test sets are essential for production risk assessment.
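A domain-specific test set like the one described above can be represented quite simply. The sketch below is illustrative only: the schema, field names and example entries are assumptions, not a standard format, and the sample ground-truth strings stand in for answers verified against authoritative sources.

```python
from dataclasses import dataclass

# Hypothetical schema for one entry in a domain-specific hallucination
# benchmark. Field names are illustrative, not from any standard.
@dataclass
class BenchmarkCase:
    query: str             # question posed to the model
    ground_truth: str      # verified answer from an authoritative source
    source: str            # citation for where the ground truth was checked
    domain: str            # high-risk category, e.g. "drug-interactions"
    severity_if_wrong: str # "minor", "material" or "critical"

def cases_for_domain(cases: list[BenchmarkCase], domain: str) -> list[BenchmarkCase]:
    """Select the test cases covering one high-risk domain."""
    return [c for c in cases if c.domain == domain]

cases = [
    BenchmarkCase(
        query="Can warfarin be taken with ibuprofen?",
        ground_truth="Combining warfarin with ibuprofen increases bleeding risk.",
        source="placeholder citation",
        domain="drug-interactions",
        severity_if_wrong="critical",
    ),
    BenchmarkCase(
        query="What notice period does the product's cancellation clause require?",
        ground_truth="placeholder verified answer",
        source="placeholder citation",
        domain="product-terms",
        severity_if_wrong="material",
    ),
]
print(len(cases_for_domain(cases, "drug-interactions")))  # → 1
```

Tagging each case with a severity makes it possible to weight the resulting metrics by harm rather than treating every error equally.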
Manual evaluation by domain experts remains the gold standard for hallucination testing — a subject matter expert reviewing model outputs against ground truth sources provides the most accurate signal. However, manual evaluation does not scale to the thousands of test cases needed for comprehensive coverage. KiwiQA's AI testing practice combines three evaluation methods: expert manual review for high-stakes domains and for calibrating the automated methods, automated factual consistency checking using reference documents and structured databases, and LLM-as-judge evaluation where a second model evaluates factual accuracy with explicit rubrics.
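The LLM-as-judge method can be sketched as follows. This is a minimal illustration, not KiwiQA's actual implementation: the rubric wording is an assumption, and `call_judge_model` is a placeholder for whatever LLM client the team uses (stubbed here so the example is self-contained).

```python
import json

# Illustrative rubric. A production rubric would be far more detailed
# and calibrated against expert manual review.
RUBRIC = (
    "You are a fact-checking judge. Compare the RESPONSE to the GROUND TRUTH. "
    'Reply with JSON: {"verdict": "supported" | "contradicted" | '
    '"unverifiable", "reason": "..."}'
)

def judge_response(response: str, ground_truth: str, call_judge_model) -> dict:
    """Ask a second model to grade factual accuracy against ground truth."""
    prompt = f"{RUBRIC}\n\nGROUND TRUTH: {ground_truth}\nRESPONSE: {response}"
    return json.loads(call_judge_model(prompt))

# Stub judge for illustration; a real system would send `prompt` to a
# second model and parse its reply.
def stub_judge(prompt: str) -> str:
    return json.dumps({"verdict": "contradicted", "reason": "the dates differ"})

result = judge_response(
    "The regulation took effect in 2019.",
    "The regulation took effect in 2021.",
    stub_judge,
)
print(result["verdict"])  # → contradicted
```

Forcing the judge into a fixed verdict vocabulary ("supported", "contradicted", "unverifiable") is what makes its output aggregatable into the metrics discussed later; free-text grades cannot be counted reliably.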
Retrieval-Augmented Generation (RAG) systems introduce a distinct hallucination pattern: the model generates text that contradicts or extends beyond the retrieved context. Testing RAG systems requires evaluating not just factual accuracy but faithfulness to retrieved content — does the model's response stay within what the retrieved documents actually say, or does it supplement with model knowledge that may be incorrect or outdated?
KiwiQA's RAG testing methodology covers three layers: retrieval quality (are the right documents being retrieved for each query?), context utilisation (is the model accurately synthesising the retrieved content?), and boundary adherence (is the model avoiding generation beyond the provided context?). Each layer requires separate test cases and metrics. A model with excellent retrieval but poor boundary adherence will still hallucinate — and vice versa.
Hallucination rate is typically measured as the percentage of responses containing at least one factual error across a defined test set. More granular metrics include error severity (distinguishing minor inaccuracies from materially misleading statements), error type distribution (which categories of hallucination are most prevalent), confidence calibration (does the model express appropriate uncertainty when it is likely to be wrong?), and hallucination rate by query category (which topic areas have the highest error rates?). These metrics together provide a hallucination risk profile that can guide both technical remediation and deployment decisions.
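Aggregating per-response evaluation results into that risk profile is straightforward. In this sketch each record is a hypothetical (category, error count, severity) tuple produced by the evaluation step; the data is invented for illustration.

```python
from collections import Counter

# Illustrative evaluation results: (query category, factual errors, severity).
results = [
    ("dosing",     1, "material"),
    ("dosing",     0, None),
    ("regulatory", 0, None),
    ("regulatory", 1, "minor"),
    ("regulatory", 1, "material"),
]

total = len(results)
# Hallucination rate: share of responses with at least one factual error.
hallucinated = sum(1 for _, errs, _ in results if errs > 0)
overall_rate = hallucinated / total

# Error severity distribution, over hallucinated responses only.
severity_dist = Counter(sev for _, errs, sev in results if errs > 0)

# Hallucination rate by query category.
rate_by_category = {}
for cat in {c for c, _, _ in results}:
    subset = [r for r in results if r[0] == cat]
    rate_by_category[cat] = sum(1 for _, e, _ in subset if e > 0) / len(subset)

print(f"{overall_rate:.0%}")       # → 60%
print(rate_by_category["dosing"])  # → 0.5
```

The per-category breakdown is usually the most actionable output: it tells you which topic areas to exclude, gate behind human review, or target for retrieval improvements.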
Pre-deployment testing establishes a baseline hallucination rate but does not guarantee ongoing accuracy. Model providers update base models, retrieval indices go stale, and real-world query distributions shift over time — all of which can increase hallucination rates after deployment. KiwiQA recommends implementing continuous hallucination monitoring using a sample of production queries evaluated against ground truth on a weekly or monthly cadence, with alerting when hallucination rates exceed defined thresholds. This monitoring is a standard component of KiwiQA's AI Assurance managed service offering.
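A monitoring job of this shape might look like the following sketch. The function name, threshold and sampling approach are assumptions for illustration; `evaluate` stands in for whichever of the evaluation methods above is run against ground truth.

```python
import random

def weekly_hallucination_check(production_queries, evaluate,
                               sample_size=100, threshold=0.05, seed=None):
    """Sample production queries, measure hallucination rate, flag breaches.

    `evaluate(query)` should return True when the response is factually
    sound against ground truth, False when it hallucinates.
    """
    rng = random.Random(seed)
    sample = rng.sample(production_queries, min(sample_size, len(production_queries)))
    errors = sum(1 for q in sample if not evaluate(q))
    rate = errors / len(sample)
    # In production, a breach would page the on-call team or open a ticket.
    return {"alert": rate > threshold, "rate": rate}

# Stub evaluator: treats queries containing a known-bad marker as hallucinated.
queries = ["ok"] * 90 + ["bad"] * 10
report = weekly_hallucination_check(queries, lambda q: q != "bad",
                                    sample_size=100, threshold=0.05, seed=1)
print(report["alert"])  # → True
```

The key design choice is evaluating a fixed sample against ground truth on a cadence rather than attempting to grade every production response, which keeps expert review costs bounded while still detecting drift.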
Hallucination testing is not a one-time pre-deployment activity — it is an ongoing quality discipline for any organisation deploying LLM-powered applications in production. The organisations that manage AI quality rigorously are the ones that can deploy AI confidently, knowing their hallucination risk is measured, understood and within acceptable bounds for their specific use case and user population.