Benchmark Series / Post 3 of 5 / Control Quality

How to Measure Control Efficacy for AI Agents

"We have controls" is one of the least useful sentences in this market because it can mean almost anything. A policy file exists. A reviewer eventually signs off. A dashboard shows a verdict somewhere. None of that answers the question a buyer actually cares about: when the system tries to do something risky, what changes at execution time, how quickly does it change, and what proof survives afterward?

Research grounding

OpenClaw gives us a clean starting set for control efficacy: a 100% post-stop executable-call rate in the baseline lane, a 100% governed destructive-action block rate, a 99.96% governed evidence verification rate, and a stop-to-halt p95 of 0 seconds in the governed lane. Those are not marketing metrics. They are examples of the numbers buyers should ask vendors to report so tools can actually be compared.
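To make those rates concrete, here is a minimal sketch of how a buyer might compute them from a runtime event log. The event schema and field names are assumptions for illustration, not OpenClaw's actual format.

```python
from dataclasses import dataclass

# Hypothetical event schema; a real agent runtime will differ.
@dataclass
class ActionEvent:
    action_class: str        # e.g. "destructive" or "read_only"
    executed: bool           # did a side effect actually happen?
    after_stop: bool         # was this attempted after a stop signal?
    evidence_verified: bool  # did the record pass later verification?

def efficacy_metrics(events: list[ActionEvent]) -> dict[str, float]:
    """Compute three of the headline rates from an event log."""
    destructive = [e for e in events if e.action_class == "destructive"]
    post_stop = [e for e in events if e.after_stop]
    return {
        # Share of destructive attempts that stayed non-executable.
        "destructive_block_rate": (
            sum(not e.executed for e in destructive) / len(destructive)
            if destructive else float("nan")
        ),
        # Share of post-stop attempts that still executed (lower is better).
        "post_stop_executable_rate": (
            sum(e.executed for e in post_stop) / len(post_stop)
            if post_stop else float("nan")
        ),
        # Share of all attempts whose evidence record verified later.
        "evidence_verification_rate": (
            sum(e.evidence_verified for e in events) / len(events)
            if events else float("nan")
        ),
    }
```

The point of the sketch is the denominators: each rate is defined over a specific population of attempts, which is exactly the detail that slogans about "having controls" leave out.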

What control efficacy actually means

Control efficacy is not the presence of a policy file, a dashboard, or a reviewer in the loop somewhere. It is the measured ability of the system to change runtime behavior before a side effect happens and to leave behind proof strong enough for a third party to inspect later.

That definition matters because it narrows the buying conversation to the point where control becomes real. If the system can still execute the unsafe action while policy gets discussed later, the control may be useful for learning, but it is weak as an execution boundary.

The easiest way to remember the distinction is this: control presence tells you what the product claims to have. Control efficacy tells you what the runtime actually prevented, delayed, or required in a tested scenario. Buyers should care about the second much more than the first.

The failure mode

The anti-pattern is to accept proxies for control. Prompt guidance, reviewer expectations, and post-hoc logs can all be helpful. None of them answer the core buyer question: did the system change what could execute at the moment the action crossed the boundary?

Another weak pattern is to rely on one number that sounds impressive. A low block count might mean the tool is safe. It might also mean the scenarios were timid or the policy was too permissive. A high pass rate might signal maturity. It might also mean the control was never tested against the actions that matter most.

Efficacy needs a small set of related metrics, not one vanity number.

A third weak pattern is to separate efficacy from operational cost. A team can produce a gate that blocks aggressively and still fail the organization if every review turns into an exception queue. Buyers should not confuse "blocks a lot" with "works well." A credible control has to change runtime behavior and remain operable enough that teams will actually keep it turned on.

The five metrics that matter

We think buyers should start with five efficacy metrics, all of which appear in this series: the destructive-action block rate, the post-stop executable-call rate, stop-to-halt latency, the evidence verification rate, and the operational cost of keeping the control turned on.

Buyers do not need to start with perfect statistical sophistication. They do need to stop comparing control quality with slogans. These five measures already force much better conversations than "they have policy-as-code" or "they support approvals."

The useful habit is to read the metrics together, not one by one. A tool can show strong destructive blocking and weak proof. It can show strong proof and weak approval mediation. It can even show strong blocking and still be operationally brittle because the stop contract or human escalation path is too vague. Buyers are not looking for a single magic number. They are looking for a believable control profile.

Why security leaders care

Security leaders need efficacy metrics because policy language by itself is cheap. A control program only deserves trust if it can show which unsafe actions became non-executable, how approval changed runtime behavior, and whether stop and proof worked under the tested conditions.

Those metrics also let security describe residual risk honestly. A tool can have strong destructive-action blocking and still have weak proof completeness. It can have great proof and weak stop behavior. Buyers need that kind of precision if they want to expand safely.

That precision is what lets AppSec say yes to a narrower scope without pretending the entire category is solved. It is also what keeps the approval posture defensible later. A security leader should be able to point to the measured control profile and say exactly why a workflow was approved, what was withheld, and what would have to improve next.

Why platform and engineering care

Platform and engineering leaders benefit because efficacy metrics make tradeoffs visible. If the gate is effective but generates too many operational delays, that is something the team can tune. If the tool feels fast but the non-allow outcomes still execute, that is not a tuning issue. It is a boundary issue.

Good metrics also make re-testing possible. Once the organization has a stable scorecard, it can compare versions, policy changes, and pilot expansions with less hand-waving.
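One way to make re-testing routine is to diff two metric snapshots and flag regressions automatically. A minimal sketch, assuming illustrative metric names and a simple higher-is-better / lower-is-better split:

```python
# Compare two metric snapshots (e.g. pilot v1 vs v2) and flag regressions.
# The metric names and direction sets below are illustrative assumptions.

HIGHER_IS_BETTER = {"destructive_block_rate", "evidence_verification_rate"}
LOWER_IS_BETTER = {"post_stop_executable_rate", "stop_to_halt_p95_s"}

def regressions(before: dict[str, float], after: dict[str, float],
                tolerance: float = 0.0) -> list[str]:
    """Return the names of metrics that got worse between two test runs."""
    worse = []
    for name in sorted(before.keys() & after.keys()):
        delta = after[name] - before[name]
        if name in HIGHER_IS_BETTER and delta < -tolerance:
            worse.append(name)
        elif name in LOWER_IS_BETTER and delta > tolerance:
            worse.append(name)
    return worse
```

Running this after every policy change or version bump is what turns "less hand-waving" into a repeatable check rather than a judgment call.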

This is especially important for platform teams that will own the system after the pilot ends. They need to know whether a control is merely present, truly effective, or effective but too costly in its current form. Efficacy metrics turn that into engineering work instead of organizational argument.

Concrete artifact: a control efficacy scorecard

A usable scorecard should capture more than "pass" or "fail." It should explain the denominator, the runtime effect, and the operational consequence.

Scenario denominator

Which action classes were tested, and how many attempts existed in scope?

Boundary outcome

How often did block or approval-required outcomes actually stay non-executable, and where is that runtime verdict recorded?

Evidence quality

How often did the resulting record remain reviewable and verifiable later?

Operational cost

What human intervention, delay, or exception handling was required to keep the control effective?
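The four scorecard dimensions above can be encoded as a simple record per action class. The field names here are assumptions chosen to mirror the questions, not a standard schema:

```python
from dataclasses import dataclass

# Illustrative scorecard row; one row per tested action class.
@dataclass
class ScorecardRow:
    action_class: str           # scenario denominator: what was tested
    attempts: int               # how many attempts were in scope
    stayed_non_executable: int  # boundary outcome: non-allow verdicts that held
    verdict_location: str       # where the runtime verdict is recorded
    verifiable_records: int     # evidence quality: records that verified later
    human_interventions: int    # operational cost: escalations and exceptions

    def boundary_hold_rate(self) -> float:
        """Share of attempts where the boundary outcome stayed non-executable."""
        return (self.stayed_non_executable / self.attempts
                if self.attempts else float("nan"))

    def evidence_rate(self) -> float:
        """Share of attempts whose record remained verifiable later."""
        return (self.verifiable_records / self.attempts
                if self.attempts else float("nan"))
```

A row like `ScorecardRow("destructive", 200, 200, "gateway audit log", 199, 12)` captures in one line what "pass" hides: a perfect boundary hold, one unverifiable record, and twelve human interventions to keep the control effective.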

What to do next

Take one current agent pilot and rewrite the success criteria in control-efficacy terms. The goal is to make the next steering conversation about what the system did at the boundary, not just how useful it felt to the operator.

If the pilot cannot answer the scorecard's questions, it still knows more about usability than it does about control.