Benchmark Series / Post 2 of 5 / AppSec

Agent Action Risk Scenarios: The Minimum Test Set Every Buyer Should Use

Vendor demos are almost always optimized for the moment everyone in the room nods. The task completes, the workflow looks clean, and the operator seems productive. None of that tells a buyer what will happen when the system tries to delete, share, restart, or write without the right boundary in place. If the tool can take actions with real side effects, the evaluation needs a scenario set designed for pressure.

It does not need more applause.

Research grounding

OpenClaw already gives us four durable scenario families: 214 inbox-delete actions after stop, 155 public-share actions, 87 finance write-class actions without approval, and 260 operations restart attempts in the ungoverned lane. That matters because those are not abstract categories. They are concrete action classes buyers can test.

Why scenario design matters

Buyers often talk about control in general and scenarios in passing. That order should be reversed. A control claim only becomes meaningful when a real action class is in scope. "We enforce policy" tells you very little until you ask: against which actions, under which conditions, and with what proof?

Scenario design is what keeps evaluations from becoming theater. It forces the buyer and vendor to agree on the actions that matter before the tool is judged by ergonomics or fluency. It also gives the internal platform ally a more defensible way to explain what the pilot did and did not actually test.

That matters because different action classes reveal different failure modes. Stop behavior tests whether the runtime can change state under pressure. Approval-dependent writes test whether process actually mediates execution. Share or publication tests exposure widening. Operational mutation tests whether the tool can affect the systems around the code, not just the code itself. Proof reconstruction tests whether the organization could explain the run later. Without a scenario family for each, buyers end up generalizing from the easiest case.

The failure mode

The anti-pattern is to let the vendor choose only the safest or most flattering workflow. A tidy refactor, a documentation update, or a clean code-generation task may prove the tool is useful, but none of them proves how the control layer behaves when the action is harder to unwind or more expensive to mishandle.

Another weak pattern is to test only one risk class. A stop demo does not tell you anything about approval-dependent writes. A finance-write demo does not tell you whether public sharing gets blocked. Buyers need a small but varied set of scenarios that reveal different failure modes.

A third weak pattern is to treat a scenario as one scripted moment instead of a family with a denominator. One successful approval demo is not the same as repeated approval-dependent attempts across a defined sample. Buyers need enough repetition to learn whether the control is consistent or just capable of passing a stage-managed case.
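The idea of a family with a denominator can be sketched as a simple consistency check: judge the control by its rate across a defined sample, not by one staged pass. This is an illustrative sketch; the sample size and pass counts below are invented, not drawn from any vendor data.

```python
# Hypothetical sketch: a control's consistency is the fraction of repeated
# attempts in which it held, over an explicit denominator.
def control_consistency(outcomes):
    """outcomes: list of booleans, True when the control held on that attempt."""
    if not outcomes:
        raise ValueError("a scenario family needs a denominator")
    return sum(outcomes) / len(outcomes)

# Invented example: 20 approval-dependent write attempts, control held in 18.
attempts = [True] * 18 + [False] * 2
rate = control_consistency(attempts)
print(f"control held in {rate:.0%} of {len(attempts)} attempts")
```

A single demo is the degenerate case of a denominator of one, which is exactly why it tells a buyer so little.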

The minimum test set

We think a serious buyer should require at least five scenario families: stop behavior under an active task, approval-dependent writes, share or publication actions, operational mutation, and proof reconstruction.

Those scenarios are deliberately simple. The point is not exhaustive coverage in the first round. The point is a minimum test set that reveals whether the control layer can handle the kinds of actions that change organizational risk, not just the kinds of actions that make a good demo.

"Minimum" matters here. Buyers do not need a thousand-case harness before they can learn something useful. They do need enough range to avoid being fooled by a tool that looks controlled only because it was tested on low-consequence workflows. The right first step is a small, explicit set of scenarios with clear expected outcomes and evidence requirements.
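One way to make the minimum set explicit before a pilot is to write it down as data. The family names below follow the action classes discussed in this post, and the one-line question attached to each is a paraphrase of what that family reveals; treat this as a starting template, not a prescribed schema.

```python
# Illustrative minimum test set: five scenario families and the question
# each one answers about the control layer.
MINIMUM_TEST_SET = {
    "stop_behavior":            "can the runtime change state under pressure?",
    "approval_dependent_write": "does process actually mediate execution?",
    "share_or_publication":     "does exposure widen without a check?",
    "operational_mutation":     "can the tool affect systems around the code?",
    "proof_reconstruction":     "could the organization explain the run later?",
}

for family, question in MINIMUM_TEST_SET.items():
    print(f"{family}: {question}")
```

Writing the set down this way forces the buyer and vendor to agree on scope before the first demo rather than after it.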

Why security leaders care

Security leaders need scenario coverage because risk language gets vague very quickly without it. A defined scenario set lets AppSec ask whether the vendor or internal platform actually tested the action classes the organization is worried about, or whether it only proved the system can be polite in low-risk situations.

It also makes approvals cleaner. Security can say which scenarios are required before a pilot widens and which action classes remain out of scope. That is much more useful than one vague approval statement attached to the entire category.

It is also how a security leader avoids getting trapped between two bad positions. Without scenarios, the team either says no to everything because the unknowns are too large, or says yes to a pilot that never tested the actions most likely to matter later. A stable scenario set creates a middle path: selective control, not blanket slowdown.

Why platform and engineering care

Platform and engineering leaders benefit because a stable scenario set reduces argument-by-anecdote. Instead of defending a pilot with general claims, they can show which scenario families were tested, how the control behaved, and where the current workflow still needs stronger boundaries or better evidence.

Scenario discipline also makes pilots more reusable. Once one team has a credible action-risk matrix, the next team does not have to start the buying conversation from zero.

There is a design benefit too. Scenario families help platform teams decide what the orchestrator, repo contract, and validation path should own. If a stop scenario keeps failing, that points to runtime state. If proof reconstruction keeps failing, that points to evidence packaging. A scenario set is not just a buying tool. It is a roadmap for the next control improvements.

Concrete artifact: a scenario matrix

A useful scenario matrix should answer four questions for each action class and make the denominator explicit.

Action

What exact behavior is being tested: delete, share, write, restart, or stop-handling?

Expected control behavior

Should the system allow, block, or require approval before the action can execute?

Expected evidence

What proof should exist if the buyer later needs to review the scenario outcome cold?

Threshold to widen

What result would be strong enough to expand the pilot and what result would hold the scope narrow?
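The four questions above, plus the explicit denominator, can be captured as a small record per action class. This is a hypothetical sketch: the field names mirror the matrix columns, and the filled example row is invented for illustration.

```python
from dataclasses import dataclass

@dataclass
class ScenarioRow:
    action: str              # exact behavior under test
    expected_control: str    # "allow", "block", or "require_approval"
    expected_evidence: str   # proof needed to review the outcome cold
    threshold_to_widen: str  # result that justifies expanding the pilot
    denominator: int         # how many attempts make up the sample

# Invented example row for an approval-dependent finance write.
row = ScenarioRow(
    action="finance write without prior approval",
    expected_control="require_approval",
    expected_evidence="approval record linked to the attempted write",
    threshold_to_widen="control holds on every attempt in the sample",
    denominator=20,
)
print(row.action, "->", row.expected_control)
```

A filled matrix of rows like this is the artifact a security reviewer can evaluate cold, without having watched the demo.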

What to do next

Before the next pilot, ask the tool owner to write down the five scenario families above and fill in the matrix before the first demo. If the team objects that this is too much process, that is usually a sign the current evaluation is too dependent on performance theater.

If the pilot team cannot answer those questions in advance, the control story is still too soft for a serious buying decision.