Benchmark Series / Post 2 of 5 / AppSec

Agent Action Risk Scenarios: The Minimum Test Set Every Buyer Should Use

Vendor demos are almost always optimized for the moment everyone in the room nods. The task completes, the workflow looks clean, and the operator seems productive. None of that tells a buyer what will happen when the system tries to delete, share, restart, or write without the right boundary in place. If the tool can take actions with real side effects, the evaluation needs a scenario set designed for pressure.

It does not need more applause.

Research grounding

OpenClaw already gives us four durable scenario families: 214 inbox-delete actions after stop, 155 public-share actions, 87 finance write-class actions without approval, and 260 operations restart attempts in the ungoverned lane. That matters because those are not abstract categories. They are concrete action classes buyers can test.

Why scenario design matters

Buyers often talk about control in general and scenarios in passing. That order should be reversed. A control claim only becomes meaningful when a real action class is in scope. "We enforce policy" tells you very little until you ask: against which actions, under which conditions, and with what proof?

Scenario design is what keeps evaluations from becoming theater. It forces the buyer and vendor to agree on the actions that matter before the tool is judged by ergonomics or fluency. It also gives the internal platform ally a more defensible way to explain what the pilot did and did not actually test.

That matters because different action classes reveal different failure modes. Stop behavior tests whether the runtime can change state under pressure. Approval-dependent writes test whether process actually mediates execution. Share or publication tests exposure widening. Operational mutation tests whether the tool can affect the systems around the code, not just the code itself. Proof reconstruction tests whether the organization could explain the run later. Without a scenario family for each, buyers end up generalizing from the easiest case.

The failure mode

The anti-pattern is to let the vendor choose only the safest or most flattering workflow. A tidy refactor, a documentation update, or a clean code-generation task may prove the tool is useful, but none of them proves how the control layer behaves when the action is harder to unwind or more expensive to mishandle.

Another weak pattern is to test only one risk class. A stop demo does not tell you anything about approval-dependent writes. A finance-write demo does not tell you whether public sharing gets blocked. Buyers need a small but varied set of scenarios that reveal different failure modes.

A third weak pattern is to treat a scenario as one scripted moment instead of a family with a denominator. One successful approval demo is not the same as repeated approval-dependent attempts across a defined sample. Buyers need enough repetition to learn whether the control is consistent or just capable of passing a stage-managed case.
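The idea of a family with a denominator can be sketched as a simple consistency check: judge the control by its rate across a defined sample, not by one staged pass. This is an illustrative sketch; the sample size and pass counts below are invented, not drawn from any vendor data.

```python
# Hypothetical sketch: a control's consistency is the fraction of repeated
# attempts in which it held, over an explicit denominator.
def control_consistency(outcomes):
    """outcomes: list of booleans, True when the control held on that attempt."""
    if not outcomes:
        raise ValueError("a scenario family needs a denominator")
    return sum(outcomes) / len(outcomes)

# Invented example: 20 approval-dependent write attempts, control held in 18.
attempts = [True] * 18 + [False] * 2
rate = control_consistency(attempts)
print(f"control held in {rate:.0%} of {len(attempts)} attempts")
```

A single demo is the degenerate case of a denominator of one, which is exactly why it tells a buyer so little.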

The minimum test set

We think a serious buyer should require at least five scenario families: stop behavior under an active task, approval-dependent writes, share or publication actions, operational mutation, and proof reconstruction.

Those scenarios are deliberately simple. The point is not exhaustive coverage in the first round. The point is a minimum test set that reveals whether the control layer can handle the kinds of actions that change organizational risk, not just the kinds of actions that make a good demo.

"Minimum" matters here. Buyers do not need a thousand-case harness before they can learn something useful. They do need enough range to avoid being fooled by a tool that looks controlled only because it was tested on low-consequence workflows. The right first step is a small, explicit set of scenarios with clear expected outcomes and evidence requirements.
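One way to make the minimum set explicit before a pilot is to write it down as data. The family names below follow the action classes discussed in this post, and the one-line question attached to each is a paraphrase of what that family reveals; treat this as a starting template, not a prescribed schema.

```python
# Illustrative minimum test set: five scenario families and the question
# each one answers about the control layer.
MINIMUM_TEST_SET = {
    "stop_behavior":            "can the runtime change state under pressure?",
    "approval_dependent_write": "does process actually mediate execution?",
    "share_or_publication":     "does exposure widen without a check?",
    "operational_mutation":     "can the tool affect systems around the code?",
    "proof_reconstruction":     "could the organization explain the run later?",
}

for family, question in MINIMUM_TEST_SET.items():
    print(f"{family}: {question}")
```

Writing the set down this way forces the buyer and vendor to agree on scope before the first demo rather than after it.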

Why security leaders care

Security leaders need scenario coverage because risk language gets vague very quickly without it. A defined scenario set lets AppSec ask whether the vendor or internal platform actually tested the action classes the organization is worried about, or whether it only proved the system can be polite in low-risk situations.

It also makes approvals cleaner. Security can say which scenarios are required before a pilot widens and which action classes remain out of scope. That is much more useful than one vague approval statement attached to the entire category.

It is also how a security leader avoids getting trapped between two bad positions. Without scenarios, the team either says no to everything because the unknowns are too large, or says yes to a pilot that never tested the actions most likely to matter later. A stable scenario set creates a middle path: selective control, not blanket slowdown.

Why platform and engineering care

Platform and engineering leaders benefit because a stable scenario set reduces argument-by-anecdote. Instead of defending a pilot with general claims, they can show which scenario families were tested, how the control behaved, and where the current workflow still needs stronger boundaries or better evidence.

Scenario discipline also makes pilots more reusable. Once one team has a credible action-risk matrix, the next team does not have to start the buying conversation from zero.

There is a design benefit too. Scenario families help platform teams decide what the orchestrator, repo contract, and validation path should own. If a stop scenario keeps failing, that points to runtime state. If proof reconstruction keeps failing, that points to evidence packaging. A scenario set is not just a buying tool. It is a roadmap for the next control improvements.

Concrete artifact: a scenario matrix

A useful scenario matrix should answer four questions for each action class and make the denominator explicit.

Action

What exact behavior is being tested: delete, share, write, restart, or stop-handling?

Expected control behavior

Should the system allow, block, or require approval before the action can execute?

Expected evidence

What proof should exist if the buyer later needs to review the scenario outcome cold?

Threshold to widen

What result would be strong enough to expand the pilot and what result would hold the scope narrow?
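The four questions above, plus the explicit denominator, can be captured as a small record per action class. This is a hypothetical sketch: the field names mirror the matrix columns, and the filled example row is invented for illustration.

```python
from dataclasses import dataclass

@dataclass
class ScenarioRow:
    action: str              # exact behavior under test
    expected_control: str    # "allow", "block", or "require_approval"
    expected_evidence: str   # proof needed to review the outcome cold
    threshold_to_widen: str  # result that justifies expanding the pilot
    denominator: int         # how many attempts make up the sample

# Invented example row for an approval-dependent finance write.
row = ScenarioRow(
    action="finance write without prior approval",
    expected_control="require_approval",
    expected_evidence="approval record linked to the attempted write",
    threshold_to_widen="control holds on every attempt in the sample",
    denominator=20,
)
print(row.action, "->", row.expected_control)
```

A filled matrix of rows like this is the artifact a security reviewer can evaluate cold, without having watched the demo.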

What to do next

Before the next pilot, ask the tool owner to write down the five scenario families above and fill in the matrix before the first demo. If the team objects that this is too much process, that is usually a sign the current evaluation is too dependent on performance theater.

If the pilot team cannot answer those questions in advance, the control story is still too soft for a serious buying decision.