Verification of Agentic Products
The V&V Problem
Verification answers: did we build the product right? Validation answers: did we build the right product? For conventional products, these questions are answerable through established methods — testing against requirements, analysis, inspection, demonstration.
For agentic products, the questions remain the same but the methods break down. You cannot test all behaviors of a product that makes its own decisions in an open-ended environment. You cannot enumerate all states of a learning system. You cannot inspect the decision logic of a neural network the way you inspect the logic of a rule set.
This does not mean verification is impossible. It means no single verification approach is sufficient. The solution is a portfolio of approaches, each covering what others cannot.
Approach 1: Scenario-Based Testing
The most intuitive approach: define scenarios that the product should handle, execute those scenarios (in simulation or physical test), and verify that the product's behavior meets requirements.
How It Works
- Define a set of test scenarios — specific situations the product will encounter (normal operations, degraded conditions, edge cases, adversarial inputs)
- Execute each scenario with the product in the loop
- Evaluate whether the product's response meets acceptance criteria
- Record results with full traceability to requirements
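The four steps above can be sketched as a small test harness. This is a minimal illustration, not a reference implementation: `toy_controller`, the scenario names, and the `REQ-…` identifiers are all hypothetical stand-ins for a real product and its requirements baseline.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Scenario:
    name: str
    requirement_id: str              # hypothetical requirement tag, kept for traceability
    inputs: dict
    accept: Callable[[dict], bool]   # acceptance criterion applied to the response


def toy_controller(inputs: dict) -> dict:
    """Hypothetical product under test: brake if an obstacle is closer than 10 m."""
    distance = inputs.get("obstacle_distance_m")
    if distance is None:
        # Degraded condition (sensor dropout): fall back to a safe stop.
        return {"action": "safe_stop"}
    return {"action": "brake" if distance < 10.0 else "cruise"}


def run_campaign(product, scenarios):
    """Execute every scenario and record a result traceable to a requirement."""
    results = []
    for s in scenarios:
        response = product(s.inputs)
        results.append({
            "scenario": s.name,
            "requirement": s.requirement_id,
            "response": response,
            "passed": s.accept(response),
        })
    return results


SCENARIOS = [
    Scenario("clear road", "REQ-001", {"obstacle_distance_m": 50.0},
             lambda r: r["action"] == "cruise"),
    Scenario("close obstacle", "REQ-002", {"obstacle_distance_m": 4.0},
             lambda r: r["action"] == "brake"),
    Scenario("sensor dropout", "REQ-003", {"obstacle_distance_m": None},
             lambda r: r["action"] == "safe_stop"),
]

results = run_campaign(toy_controller, SCENARIOS)
for r in results:
    status = "PASS" if r["passed"] else "FAIL"
    print(f'{r["requirement"]}  {r["scenario"]:<15} {status}')
```

Keeping the requirement identifier on every result row is what makes the campaign auditable: a failed row points directly at the requirement it puts in question.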
What It Proves
Scenario-based testing proves that the product behaves correctly in the tested scenarios. If the scenarios are representative of operational reality, this provides confidence that the product will perform acceptably in the field.
What It Cannot Prove
It cannot prove correct behavior in untested scenarios. For an agentic product operating in an open-ended environment, the number of possible scenarios is effectively infinite. No test campaign — however large — can cover them all.
The coverage problem: How many scenarios are enough? For an open-ended scenario space there is no generally accepted stopping rule. Doubling the test campaign doubles the cost but yields diminishing marginal confidence. The scenarios that matter most are often the ones no one thought to test, because they are novel combinations of conditions that were considered individually but never together.
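A rough count illustrates why combinations get missed. The sketch below assumes six hypothetical condition dimensions with a handful of values each (the names and counts are invented for illustration) and compares a campaign that varies one condition at a time against one that exercises every combination.

```python
import math

# Hypothetical condition dimensions and how many values are considered for each.
DIMENSIONS = {
    "weather": 4,
    "lighting": 3,
    "traffic density": 5,
    "road surface": 3,
    "sensor health": 2,
    "payload": 4,
}


def one_at_a_time(dims: dict) -> int:
    """One baseline scenario, then vary a single dimension at a time."""
    return 1 + sum(v - 1 for v in dims.values())


def exhaustive(dims: dict) -> int:
    """Every combination of every value."""
    return math.prod(dims.values())


print("one-at-a-time scenarios:", one_at_a_time(DIMENSIONS))
print("exhaustive scenarios:   ", exhaustive(DIMENSIONS))
```

Under these assumed numbers, every individual condition is "considered" in 16 scenarios, yet 1440 combinations exist. Real operating environments have far more dimensions, many of them continuous, so the gap is much larger in practice.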
When to Use
Scenario-based testing is essential as the first layer of verification. It catches the obvious failures and validates behavior in known conditions. But it must not be the only approach — it provides necessary but not sufficient evidence.
Approach 2: Formal Methods
Mathematical techniques that prove properties of the product's decision-making logic — not by testing examples but by analyzing the logic itself.
How It Works
- Express the product's decision logic as a formal model suited to the chosen technique (model checking, theorem proving, abstract interpretation)
- Express the desired properties as formal specifications (e.g., "the system shall never enter state X" or "if condition A holds, action B is always performed within T seconds")
- Use automated tools to prove or disprove that the logic satisfies the properties across all possible inputs
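As a toy illustration of these steps, the sketch below performs explicit-state model checking: it enumerates every reachable state of a small model and checks a safety invariant in each one. The model (a two-light intersection controller) and all names are hypothetical, and production tools use far more scalable techniques than this breadth-first search.

```python
from collections import deque

INITIAL = ("red", "red")  # (north-south light, east-west light)


def successors(state, guarded=True):
    """All states reachable in one step; either light may advance.

    With guarded=True a red light may turn green only if the other is red.
    guarded=False models a design with that interlock missing.
    """
    result = []
    for i in (0, 1):
        me, other = state[i], state[1 - i]
        if me == "green":
            nxt = "yellow"
        elif me == "yellow":
            nxt = "red"
        elif not guarded or other == "red":
            nxt = "green"
        else:
            continue  # red light is blocked by the interlock
        new_state = list(state)
        new_state[i] = nxt
        result.append(tuple(new_state))
    return result


def never_both_green(state):
    """Safety property: the system shall never show green in both directions."""
    return state != ("green", "green")


def check_invariant(initial, next_states, invariant):
    """Breadth-first search over all reachable states.

    Returns (holds, counterexample_trace, states_explored).
    """
    parent = {initial: None}
    queue = deque([initial])
    while queue:
        state = queue.popleft()
        if not invariant(state):
            trace = []
            while state is not None:
                trace.append(state)
                state = parent[state]
            return False, trace[::-1], len(parent)
        for nxt in next_states(state):
            if nxt not in parent:
                parent[nxt] = state
                queue.append(nxt)
    return True, [], len(parent)


ok_guarded = check_invariant(INITIAL, successors, never_both_green)
ok_unguarded = check_invariant(
    INITIAL, lambda s: successors(s, guarded=False), never_both_green)
print("guarded design:  ", ok_guarded)
print("unguarded design:", ok_unguarded)
```

Unlike a test run, a successful check here covers every reachable state of the model, and a failed check returns a concrete trace leading to the violation. The guarantee extends only as far as the model matches the real controller.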
What It Proves
Formal methods prove that the verified properties hold for all inputs, not just tested ones. If the proof succeeds, the property is guaranteed — within the scope of the formal model.
What It Cannot Prove
State space explosion: As system complexity grows, the number of states to analyze can grow exponentially with the number of interacting components and variables. Formal methods that work well for embedded controllers with bounded state spaces become intractable for systems with continuous state spaces, complex environmental models, or learned components.
Abstraction gap: Formal methods verify properties of a formal model of the product, not the product itself. If the formal model is an incomplete or inaccurate abstraction of the real system, the proof may not apply to the actual product.
Learned components: Formal methods are effective for rule-based and model-based architectures where the decision logic is explicit. For neural networks and other learned components, formally verifying properties is an active research area with limited practical tools today.
When to Use
Formal methods are most valuable for critical properties of inspectable decision logic — safety invariants, deadlock freedom, bounded response time. They complement scenario-based testing by providing guarantees that testing alone cannot.
Approach 3: Simulation-Based V&V
Using the digital twin as a test bed — running the product (or its software) in a simulated environment that models the physical world and the operational context.
How It Works
- Build a simulation environment that models the product's operating context — physics, sensor models, environmental conditions, other agents (other vehicles, pedestrians, network traffic)
- Connect the product's decision-making software (or a model of it) to the simulation
- Run thousands or millions of simulation episodes, varying initial conditions, environmental parameters, and scenario elements
- Evaluate product behavior across all episodes against acceptance criteria
- Identify failures, edge cases, and performance boundaries
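The workflow above can be sketched as a small Monte Carlo campaign. Everything here is an assumption made for illustration: the "simulation" is a one-line stopping-distance formula rather than a physics engine, and the parameter names and ranges are invented.

```python
import random

G = 9.81  # gravitational acceleration, m/s^2


def run_episode(speed_mps, friction, detect_range_m, latency_s):
    """Toy episode: does the vehicle stop before reaching a detected obstacle?

    Stopping distance = distance travelled during reaction latency
                      + braking distance v^2 / (2 * mu * g).
    Returns True if the acceptance criterion (no collision) is met.
    """
    stopping_m = speed_mps * latency_s + speed_mps ** 2 / (2 * friction * G)
    return stopping_m <= detect_range_m


def campaign(n_episodes, seed=0):
    """Run many episodes with randomized conditions; return the failing ones."""
    rng = random.Random(seed)  # seeded so the campaign is reproducible
    failures = []
    for _ in range(n_episodes):
        params = {
            "speed_mps": rng.uniform(5.0, 20.0),
            "friction": rng.uniform(0.3, 0.9),
            "detect_range_m": rng.uniform(30.0, 60.0),
            "latency_s": rng.uniform(0.1, 0.5),
        }
        if not run_episode(**params):
            failures.append(params)
    return failures


N = 10_000
failures = campaign(N)
print(f"{len(failures)} failures in {N} episodes")
if failures:
    # A crude performance boundary: the lowest speed at which any failure occurred.
    print("lowest failing speed (m/s):", min(f["speed_mps"] for f in failures))
```

Recording the full parameter set of each failing episode is what allows failures to be replayed, clustered into edge cases, and promoted into named regression scenarios.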
What It Proves
Simulation-based V&V can explore far more scenarios than physical testing — orders of magnitude more. It can test dangerous conditions (near-misses, equipment failures, adversarial inputs) that are unsafe or impractical to test physically. It can provide statistical confidence in the product's behavior across a distribution of operating conditions.
What It Cannot Prove
The reality gap: Simulation is an approximation of reality. Sensor models simplify real sensor noise and artifacts. Physics models approximate real-world dynamics. Environmental models cannot capture every possible real-world condition. If the simulation omits a phenomenon that matters in operation (glare, rain on a sensor lens, unexpected electromagnetic interference), the verification is blind to that failure mode.
Distribution mismatch: Simulation scenarios are drawn from a distribution designed by engineers. If the real-world distribution of conditions differs — if events occur that the simulation designers did not consider — the verification may not cover the actual operating envelope.
When to Use
Simulation-based V&V is the primary tool for scaling verification of agentic products. It enables statistical verification (run enough simulations to estimate failure probability) and regression testing (automatically re-verify after every software update). But it must be paired with physical testing to validate the simulation itself — ensuring the reality gap is bounded.
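To make "run enough simulations" concrete, the sketch below uses a standard zero-failure binomial bound: if n independent episodes all pass, the per-episode failure probability p satisfies (1 - p)^n >= alpha at confidence 1 - alpha, which gives p <= 1 - alpha^(1/n). The function names are hypothetical, and the result assumes episodes are independent and drawn from the true operational distribution, which is exactly what the reality gap and distribution mismatch put in doubt.

```python
import math


def zero_failure_upper_bound(n_episodes: int, confidence: float = 0.95) -> float:
    """Upper confidence bound on per-episode failure probability
    after n_episodes independent episodes with zero observed failures."""
    alpha = 1.0 - confidence
    return 1.0 - alpha ** (1.0 / n_episodes)


def episodes_needed(target_failure_prob: float, confidence: float = 0.95) -> int:
    """Failure-free episodes required to claim p <= target at the given confidence.

    Solves (1 - p)^n <= alpha for n.
    """
    alpha = 1.0 - confidence
    return math.ceil(math.log(alpha) / math.log(1.0 - target_failure_prob))


for target in (1e-3, 1e-6):
    print(f"target p <= {target:g}: {episodes_needed(target):,} failure-free episodes")
print("bound after 10,000 clean episodes:", zero_failure_upper_bound(10_000))
```

At 95% confidence the requirement is roughly 3/p failure-free episodes (the "rule of three"), which is why very small target failure rates are only reachable in simulation and not in physical testing.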
Approach 4: Runtime Monitoring
Verifying the product during operation rather than before deployment. Monitoring the product's decisions and actions in real time and flagging or intervening when behavior falls outside acceptable bounds.
How It Works
- Define runtime monitors — specifications of acceptable behavior that can be checked during operation. Examples: "vehicle speed shall never exceed X in zone Y," "control action shall not exceed Z rate of change," "confidence score of perception system shall remain above threshold"
- Implement monitors as software components that observe the product's internal state and outputs
- Define interventions when monitors trigger — degraded mode operation, graceful shutdown, handoff to human operator, alert to remote monitoring center
- Log all monitor activations for post-incident analysis and continuous improvement
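A minimal sketch of this pattern follows. The monitor names, thresholds, state fields, and intervention labels are all hypothetical; a real monitor would run on the target platform against the product's actual telemetry and would be verified in its own right.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Monitor:
    name: str
    ok: Callable[[dict], bool]   # True while behavior is within acceptable bounds
    intervention: str            # action requested when the monitor triggers


# Interventions ordered from least to most severe (assumed ordering).
SEVERITY = ["none", "degraded_mode", "handoff_to_operator", "graceful_shutdown"]

MONITORS = [
    Monitor("zone speed limit",
            lambda s: s["speed_mps"] <= s["zone_limit_mps"],
            "degraded_mode"),
    Monitor("steering rate limit",
            lambda s: abs(s["steer_rate_dps"]) <= 30.0,
            "degraded_mode"),
    Monitor("perception confidence",
            lambda s: s["perception_conf"] >= 0.6,
            "handoff_to_operator"),
]


def check(state: dict, monitors, log: list) -> str:
    """Evaluate all monitors against one state snapshot.

    Logs every activation and returns the most severe intervention requested.
    """
    triggered = [m for m in monitors if not m.ok(state)]
    for m in triggered:
        log.append({"monitor": m.name,
                    "intervention": m.intervention,
                    "state": dict(state)})
    return max((m.intervention for m in triggered),
               key=SEVERITY.index, default="none")


activation_log = []
nominal = {"speed_mps": 8.0, "zone_limit_mps": 10.0,
           "steer_rate_dps": 5.0, "perception_conf": 0.92}
anomalous = {"speed_mps": 12.0, "zone_limit_mps": 10.0,
             "steer_rate_dps": 5.0, "perception_conf": 0.41}

print(check(nominal, MONITORS, activation_log))
print(check(anomalous, MONITORS, activation_log))
print(len(activation_log), "activations logged")
```

Keeping each monitor a simple, side-effect-free predicate makes it cheap to evaluate every control cycle and simple enough to verify independently of the component it watches.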
What It Proves
Runtime monitoring provides an operational safety net. It catches failures that pre-deployment verification missed — because the scenario was untested, the simulation did not model the condition, or the formal proof did not cover the specific interaction.
What It Cannot Prove
Reactive, not proactive: Monitors detect violations as they occur or, at best, just before they occur. They do not prevent the product from entering the problematic situation in the first place. The intervention must be fast enough and safe enough to prevent harm, which is a design challenge in itself.
Monitor completeness: The monitors only check what they are designed to check. If a failure mode was not anticipated and no monitor watches for it, it will not be caught. Designing comprehensive monitors requires the same domain expertise as designing comprehensive test scenarios.
Performance overhead: Monitors consume computational resources. Complex monitors on resource-constrained embedded systems may interfere with the product's primary functions. Monitor design must balance coverage with resource constraints.
When to Use
Runtime monitoring is essential for any agentic product deployed in safety-relevant environments. It is the last line of defense when pre-deployment verification cannot provide complete coverage — which, for truly agentic products, is always the case.
The Portfolio Approach
No single verification approach is sufficient for agentic products. The solution is a portfolio:
- Scenario-based testing covers known conditions and validates behavior in representative situations
- Formal methods prove critical safety properties for inspectable components
- Simulation-based V&V scales testing to millions of scenarios and provides statistical confidence
- Runtime monitoring provides operational safety when pre-deployment verification is incomplete
The portfolio must be designed as a system — each approach covering what others cannot. Gaps between approaches must be identified and accepted as residual risk, with mitigations in place.
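One way to make gap identification explicit is a coverage matrix that maps each identified hazard to the approaches that address it. The sketch below is illustrative only: the hazard list and the coverage assignments are invented, and in practice they would come from the product's hazard analysis.

```python
# Hypothetical hazards from a hazard analysis.
HAZARDS = [
    "known obstacle in path",
    "mode-logic deadlock",
    "rare weather combination",
    "unmodeled sensor artifact",
    "novel adversarial input",
]

# Hypothetical mapping: which hazards each approach provides evidence for.
COVERAGE = {
    "scenario_testing": {"known obstacle in path"},
    "formal_methods": {"mode-logic deadlock"},
    "simulation": {"known obstacle in path", "rare weather combination"},
    "runtime_monitoring": {"known obstacle in path", "unmodeled sensor artifact"},
}


def coverage_report(hazards, coverage):
    """For each hazard, list the approaches that cover it."""
    return {
        h: sorted(a for a, covered in coverage.items() if h in covered)
        for h in hazards
    }


def residual_risks(report):
    """Hazards that no approach covers: accept explicitly or mitigate."""
    return [h for h, approaches in report.items() if not approaches]


def single_layer(report):
    """Hazards that depend on exactly one approach."""
    return [h for h, approaches in report.items() if len(approaches) == 1]


report = coverage_report(HAZARDS, COVERAGE)
for hazard, approaches in report.items():
    print(f"{hazard:<28} {', '.join(approaches) or 'UNCOVERED'}")
print("residual risk:", residual_risks(report))
print("single layer: ", single_layer(report))
```

Hazards with no covering approach are the residual risk to be accepted or mitigated, and hazards covered by a single approach show where the portfolio has no redundancy.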
Assessment
Why is no single verification approach sufficient for agentic products? (Select all that apply)
You are responsible for the verification strategy of a surgical robot that uses a learned perception model to identify tissue types in real time. Design a verification portfolio for this product. For each approach (scenario testing, formal methods, simulation, runtime monitoring), describe what specifically you would verify and acknowledge what each approach cannot cover.