CAISI Field Notes / Benchmark Series

How to Evaluate Agentic Control

The market has plenty of agent demos and not nearly enough evaluation language. Evaluators are being asked to compare products that can change code, systems, and delivery workflows without a stable way to talk about action risk, control quality, proof quality, or pilot design. This five-part series is our attempt to make that language more usable.

Benchmark language evaluators can reuse

Control quality over demo quality

Evidence before expansion

Why a separate benchmark series

CAISI already has one case-study series, one report-interpretation series, and two implementation series. This collection does something different. It turns the lessons from those measured artifacts into an evaluation and pilot language that other teams can use internally.

OpenClaw already gives us a clean example of runtime control efficacy: stop behavior, non-executable outcomes, destructive-action blocking, and evidence verification. The sprawl report already gives us a clean example of approval opacity and evidence weakness in public artifacts. What the market still lacks is a simple, reusable vocabulary that says which scenarios to test, which metrics to compare, and what evidence needs to exist before a team should trust a tool near a real write path.

The 5 posts

Benchmark Post 1

Leadership

Why Evaluators Still Cannot Evaluate Agentic Control Clearly

Why the market still compares agentic systems in product language instead of benchmark language, and what a better comparison model looks like.

Benchmark Post 2

AppSec

Agent Action Risk Scenarios: The Minimum Test Set Every Evaluator Should Use

A practical scenario set for evaluating delete, share, write, restart, and approval-dependent agent actions before a pilot turns into guesswork.

Benchmark Post 3

Control quality

How to Measure Control Efficacy for AI Agents

Which control metrics actually matter when evaluators need to know whether the system changes what can execute.

Benchmark Post 4

Evidence

Proof Completeness for AI Agent Changes

A practical completeness model for the evidence a reviewer, auditor, or incident responder should receive with an autonomous change.

Benchmark Post 5

Pilot design

How to Run an Evaluation-Grade Agent Pilot

How to structure a pilot so that it measures control quality, evidence quality, and operational fit instead of demo theater.

What this series standardizes

Risk scenarios

What should be tested

The minimum action classes and failure modes a serious evaluator should require before a control claim sounds credible.

Control efficacy

What should be measured

Metrics that describe whether a control changes runtime behavior, not just whether the interface or policy language sounds mature.
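
As a rough illustration of a runtime-behavior metric, the sketch below computes a block rate: the share of scenarios that should have been stopped where the action did not in fact execute. The names ScenarioResult, should_block, executed, and block_rate are hypothetical and chosen for this example; they are not taken from the OpenClaw report or from any CAISI tooling, and the sample results are invented.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class ScenarioResult:
    """One observed run of a test scenario (hypothetical record shape)."""
    scenario: str       # action class, e.g. "delete", "share", "write", "restart"
    should_block: bool  # the evaluator expected the control to stop this action
    executed: bool      # whether the action actually ran at runtime


def block_rate(results: Sequence[ScenarioResult]) -> Optional[float]:
    """Share of should-block scenarios in which the action did not execute.

    Returns None when no should-block scenarios were run, so that
    "no data" is not mistaken for a perfect or a failing score.
    """
    risky = [r for r in results if r.should_block]
    if not risky:
        return None
    blocked = sum(1 for r in risky if not r.executed)
    return blocked / len(risky)


# Invented sample data: three risky scenarios, one of which slipped through.
results = [
    ScenarioResult("delete", should_block=True, executed=False),
    ScenarioResult("share", should_block=True, executed=True),
    ScenarioResult("write", should_block=False, executed=True),
    ScenarioResult("restart", should_block=True, executed=False),
]

print(f"block rate: {block_rate(results):.2f}")
```

The metric is computed from what executed, not from what the policy said would happen, which is the distinction this part of the series is concerned with.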

Proof completeness

What should be evidenced

The minimum artifact set a reviewer should be able to inspect without relying on screenshots, memory, or vendor trust.
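
One way to make a minimum artifact set checkable is to score each change against a required list and report what is missing. The sketch below assumes a hypothetical set of four artifact names (action_request, policy_decision, approval_record, execution_log) and an invented evidence bundle; the actual artifact set is defined in the proof completeness post, not here.

```python
from typing import Mapping, Optional, Sequence, Tuple, List

# Hypothetical artifact names, used only for illustration.
REQUIRED_ARTIFACTS = (
    "action_request",
    "policy_decision",
    "approval_record",
    "execution_log",
)


def proof_completeness(
    bundle: Mapping[str, Optional[str]],
    required: Sequence[str] = REQUIRED_ARTIFACTS,
) -> Tuple[float, List[str]]:
    """Return (fraction of required artifacts present, names of missing ones).

    An artifact counts as present only if the bundle maps its name to a
    non-empty reference (for example a file path or record identifier).
    """
    missing = [name for name in required if not bundle.get(name)]
    score = (len(required) - len(missing)) / len(required)
    return score, missing


# Invented evidence bundle for a single autonomous change.
bundle = {
    "action_request": "req-001.json",
    "policy_decision": "decision-001.json",
    "approval_record": None,  # referenced but never captured
}

score, missing = proof_completeness(bundle)
print(f"completeness: {score:.2f}, missing: {missing}")
```

Listing the missing artifacts alongside the score keeps the result inspectable by a reviewer instead of reducing it to a single number.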

Pilot discipline

How teams should compare

A practical framework for paired lanes, scoped scenarios, exit criteria, and cross-functional review so pilots create durable learning.