How to Measure Control Efficacy for AI Agents

"We have controls" is one of the least useful sentences in this market because it can mean almost anything. A policy file exists. A reviewer eventually signs off. A dashboard shows a verdict somewhere. None of that answers the question an evaluator actually cares about: when the system tries to do something risky, what changes at execution time, how quickly does it change, and what proof survives afterward?

In this piece

Research grounding What control efficacy actually means The failure mode The five metrics that matter Why security leaders care Why platform and engineering care Concrete artifact: a control efficacy scorecard What to do next

Series home | All field notes

Research grounding

OpenClaw gives us a clean starting set for control efficacy: 100% post-stop executable-call rate in the baseline lane, 100% governed destructive-action block rate, 99.96% governed evidence verification rate, and 0 seconds stop-to-halt p95 in the governed lane. Those are not marketing metrics. They are examples of what evaluators should be asking to compare. See the OpenClaw report for the measured run context.

What control efficacy actually means

Control efficacy is not the presence of a policy file, a dashboard, or a reviewer in the loop somewhere. It is the measured ability of the system to change runtime behavior before a side effect happens and to leave behind proof strong enough for a third party to inspect later.

That definition matters because it narrows the evaluation conversation to the point where control becomes real. If the system can still execute the unsafe action while policy gets discussed later, the control may be useful for learning, but it is weak as an execution boundary.

The easiest way to remember the distinction is this: control presence tells you what is advertised or configured. Control efficacy tells you what the runtime actually prevented, delayed, or required in a tested scenario. Evaluators should care about the second much more than the first.

The failure mode

The anti-pattern is to accept proxies for control. Prompt guidance, reviewer expectations, and post-hoc logs can all be helpful. None of them answer the core team question: did the system change what could execute at the moment the action crossed the boundary?

Another weak pattern is to rely on one number that sounds impressive. A low block count might mean the tool is safe. It might also mean the scenarios were timid or the policy was too permissive. A high pass rate might signal maturity. It might also mean the control was never tested against the actions that matter most.

Efficacy needs a small set of related metrics, not one vanity number.

A third weak pattern is to separate efficacy from operational cost. A team can produce a gate that blocks aggressively and still fail the organization if every review turns into an exception queue. Teams should not confuse "blocks a lot" with "works well." A credible control has to change runtime behavior and remain operable enough that teams will actually keep it turned on.

The five metrics that matter

Evaluators should start with five efficacy metrics.

Non-allow non-executable rate. When the control says block or require approval, how often does the action actually stay non-executable? This is the core fail-closed test.
Stop-to-halt behavior. After a stop signal, how quickly and reliably does the runtime stop making new side effects possible? This is where "stop" becomes a control contract instead of a UI affordance.
Destructive-action block rate. Across delete, share, restart, or similar actions, how often does the system hold the unsafe action? This shows whether the boundary works on the classes evaluators care about most.
Approval mediation quality. Does approval change execution state, or does it only leave a process record? This is how evaluators tell the difference between workflow theater and actual mediation.
Evidence verification rate. After the decision, how often is the proof artifact complete and verifiable enough to support review? This is how efficacy survives outside the live run.

Teams do not need to start with perfect statistical sophistication. They do need to stop comparing control quality with slogans. These five measures already force much better conversations than "they have policy-as-code" or "they support approvals."

The useful habit is to read the metrics together, not one by one. A tool can show strong destructive blocking and weak proof. It can show strong proof and weak approval mediation. It can even show strong blocking and still be operationally brittle because the stop contract or human escalation path is too vague. Evaluators are not looking for a single magic number. They are looking for a believable control profile.

Scenario: agent attempts workflow-file change
Policy verdict: require_approval
Runtime result: change held non-executable until approval
Credential context: repo-scoped CI token, no production secrets
Proof artifact: PR link, policy verdict, approver, workflow run, validation result
Operational note: approval completed in 14 minutes; no exception path used

Why security leaders care

Security leaders need efficacy metrics because policy language by itself is cheap. A control program only deserves trust if it can show which unsafe actions became non-executable, how approval changed runtime behavior, and whether stop and proof worked under the tested conditions.

Those metrics also let security describe residual risk honestly. A tool can have strong destructive-action blocking and still have weak proof completeness. It can have great proof and weak stop behavior. Evaluators need that kind of precision if they want to expand safely.

That precision is what lets AppSec say yes to a narrower scope without pretending the entire category is solved. It is also what keeps the approval posture defensible later. A security leader should be able to point to the measured control profile and say exactly why a workflow was approved, what was withheld, and what would have to improve next.

Why platform and engineering care

Platform and engineering leaders benefit because efficacy metrics make tradeoffs visible. If the gate is effective but generates too many operational delays, that is something the team can tune. If the tool feels fast but the non-allow outcomes still execute, that is not a tuning issue. It is a boundary issue.

Good metrics also make re-testing possible. Once the organization has a stable scorecard, it can compare versions, policy changes, and pilot expansions with less hand-waving.

This is especially important for platform teams that will own the system after the pilot ends. They need to know whether a control is merely present, truly effective, or effective but too costly in its current form. Efficacy metrics turn that into engineering work instead of organizational argument.

Concrete artifact: a control efficacy scorecard

A usable scorecard should capture more than "pass" or "fail." It should explain the denominator, the runtime effect, and the operational consequence.

Scenario denominator

Which action classes were tested, and how many attempts existed in scope?

Boundary outcome

How often did block or approval-required outcomes actually stay non-executable, and where is that runtime verdict recorded?

Evidence quality

How often did the resulting record remain reviewable and verifiable later?

Operational cost

What human intervention, delay, or exception handling was required to keep the control effective?

What to do next

Take one current agent pilot and rewrite the success criteria in control-efficacy terms. The goal is to make the next steering conversation about what the system did at the boundary, not just how useful it felt to the operator.

Define the denominator for each risk scenario.
Track whether non-allow outcomes were actually non-executable, with a threshold defined before the run.
Track whether stop behavior changed runtime state in time, including p95 stop-to-halt if the system claims runtime halt.
Track whether the proof package remained reviewable, using the CI/CD guide where delivery workflows are involved.
Track the human cost required to keep those controls effective.

If the pilot cannot answer those five questions, it still knows more about usability than it does about control.