AI Testing

Testing Agentic AI Systems: QA for Multi-Step AI Workflows

Agentic AI — systems that plan, use tools and take sequential actions to complete goals — breaks every assumption traditional testing was built on. Here is how to build a testing strategy from scratch.

KiwiQA AI Practice

KiwiQA Engineering

18 Feb 2025

11 min read

Agentic AILLMAI TestingK-ASCI

Agentic AI represents the most significant shift in AI deployment since the emergence of large language models. Where conventional AI applications produce text responses to user prompts, agentic AI systems take actions in the real world — browsing the web, writing and executing code, sending emails, making API calls, modifying databases, controlling software applications and coordinating other AI agents in multi-agent pipelines. The consequentiality of this shift for quality engineering cannot be overstated.

What Makes Agentic AI Different to Test

Conventional AI testing validates that outputs are accurate, fair and safe. Agentic AI testing must validate that actions are correct, appropriately authorised, reversible or confirmed before execution, and safe within the full chain of consequences they can trigger. An agentic system that misclassifies a document produces a wrong answer — recoverable. An agentic system that incorrectly executes a financial transaction, sends communications to wrong recipients, or modifies production data based on a misunderstood instruction produces consequences that may be irreversible.

The Five Dimensions of Agentic AI Testing

KiwiQA's agentic AI testing framework covers five dimensions absent from conventional AI testing. Task decomposition accuracy — does the agent correctly break complex goals into executable subtasks? Tool use correctness — does the agent select and use the right tools in the right sequence with the right parameters? Guardrail effectiveness — do safety controls prevent harmful or unintended actions under adversarial and edge-case conditions? Multi-agent coordination — do agents in multi-agent systems communicate state correctly and avoid conflicting actions? Failure recovery — does the agent detect failures, retry appropriately, and escalate to humans when it cannot proceed?

Critical Risk: Multi-step agentic pipelines can amplify errors through cascading dependencies — an incorrect output from step 1 becomes the input to step 2, compounding the error. Testing must validate not just individual agent steps but the full pipeline behaviour under error conditions.

Security Testing for Agentic Systems

Security testing for agentic AI requires the adversarial mindset of penetration testing combined with AI-specific threat models. Indirect prompt injection — malicious instructions embedded in external content the agent processes — is particularly dangerous for agentic systems because successful injection can result in tool-use actions, not just text responses. KiwiQA tests every external data source the agent accesses for injection resistance: web pages, documents, emails, database records, API responses. Privilege escalation testing validates that agents cannot acquire permissions beyond their defined scope. Data exfiltration prevention testing confirms that agents cannot be manipulated into exposing sensitive information through their tool-use capabilities.

Observability: The Foundation of Agentic Testing

Agentic AI systems require comprehensive observability to be testable. Every agent action — tool call, external API request, decision branch, intermediate output — must be logged in a format that enables test assertions about agent behaviour, not just final outcomes. KiwiQA implements structured logging for agentic pipelines that captures: the full chain of actions taken for each task; decision rationale at each choice point; external calls made and responses received; tokens consumed and latency at each step; and error conditions encountered and recovery actions taken. This logging enables both automated assertions in test suites and human review of agent behaviour under complex scenarios.

Human-in-the-Loop Testing

Well-designed agentic systems include human checkpoints for high-consequence actions — the agent pauses and requests human confirmation before sending an email, deleting a record, executing a financial transaction or taking any action that is difficult to reverse. Testing human-in-the-loop mechanisms validates that: the triggering conditions for human review are correctly identified; the information presented to the human for review is accurate and sufficient; human approvals and rejections are correctly handled; and timeout behaviour (what happens if the human doesn't respond) is safe and defined.

Performance and Cost Testing

Agentic AI systems have performance and cost characteristics that differ fundamentally from conventional applications. Multi-step reasoning chains consume significantly more tokens than single-prompt responses, with cost scaling super-linearly with task complexity. Parallel tool-use execution introduces latency profiles that depend on the slowest tool in the chain rather than aggregate throughput. Performance testing for agentic systems measures end-to-end task completion time, token consumption per task type, cost-per-task at production volume, and latency distribution under concurrent user load — metrics that don't exist in the performance testing vocabulary for non-AI systems.

Frequently Asked Questions

Enjoyed this? Explore more below.