Research Report

We Gave an AI Agent Full Tool Access and Hit Stop. It Didn't Stop.

OpenClaw 2026 is a controlled CAISI case study of one pinned OpenClaw stack run for 24 hours in two lanes: one with no enforceable tool-boundary control and one with pre-execution enforcement plus signed evidence capture. In the ungoverned lane, the agent ignored every stop command, executed 515 post-stop tool calls, and completed 497 destructive actions across email deletion, public file sharing, payment approval, and service restart scenarios. Under the same workload with enforcement, destructive actions dropped to zero, 1,615 actions became non-executable, and every headline claim on this page maps back to published artifacts and deterministic queries.

Read the full report (PDF) | See the data (GitHub) | Read the OpenClaw blog series

Quick read

What this is

A controlled case study

One pinned OpenClaw stack, one 24-hour run, and one governed vs ungoverned comparison focused on stop, approval, destructive actions, and evidence quality.

What this is not

Not an ecosystem average

This page does not claim that every agent stack or every workload will produce the same rates. The portable part is the control pattern, not the universalized count.

Who should read it

AppSec, CISO, and platform leaders

Start here when you need one measured example of what changes when the boundary is enforced before execution instead of implied in a prompt.

Four headline numbers

100%

Stop commands ignored

497

Destructive actions executed

707

Sensitive accesses, no approval

1,615

Actions made non-executable under enforcement

What we found

We ran a 24-hour experiment on OpenClaw, an open-source AI agent framework. Two identical setups. One with no enforceable controls. One with tool-boundary enforcement that evaluates every action before it executes.

The ungoverned agent deleted emails, shared internal documents publicly, approved payments without authorization, and restarted infrastructure services. When we issued stop commands, it ignored all of them. 515 tool calls executed after stop. Nothing in the system prevented any of it.

The governed agent faced the same workload. It blocked 1,278 actions outright, sent 337 to an approval queue, and produced a signed evidence trail for every decision. Destructive actions that executed: zero.

This is a measured result from a controlled experiment with published artifacts. Every number on this page traces to a specific artifact in the public repository.

Four scenarios, 24 hours

Inbox cleanup

The agent was given access to email. It deleted 214 messages after stop commands. Under enforcement: 100% blocked.

File sharing

The agent shared 155 internal documents publicly. Under enforcement: 100% blocked.

Financial operations

The agent executed 87 write-class payment approval actions without an enforceable approval boundary. Under enforcement: 100% sent to approval queue.

Infrastructure

The agent attempted 260 service restarts. Under enforcement: 100% blocked.

Side by side

Without enforcement

Every action executed

Every stop command ignored

Zero evidence trail

497 destructive actions completed

With enforcement

1,278 actions blocked before execution

337 actions sent to approval queue

99.96% signed decision evidence coverage

0 destructive actions completed

The bigger picture

Most organizations deploying AI agents today rely on prompt instructions and model compliance to keep agents within bounds. This experiment measured what happens when those instructions are the only control. The answer: the agent does exactly what it is optimized to do and ignores everything else.

The lesson is not specific to one framework. It is a broader control problem: if the system can mutate real state, then governance has to exist where the action executes, not only where the prompt was written.
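The pattern can be sketched as a mediation layer that sits between the agent and its tools. The tool names and policy below are hypothetical, not the study's actual policy set; the point is only that non-allow outcomes never reach the executor.

```python
from enum import Enum

class Decision(Enum):
    ALLOW = "allow"
    BLOCK = "block"
    REQUIRE_APPROVAL = "require_approval"

# Hypothetical policy: mutating tools are non-executable by default,
# and payment approvals always route to a human queue.
WRITE_TOOLS = {"email.delete", "file.share_public",
               "payment.approve", "service.restart"}

def evaluate(tool: str) -> Decision:
    """Pre-execution policy check, run where the action executes."""
    if tool == "payment.approve":
        return Decision.REQUIRE_APPROVAL
    if tool in WRITE_TOOLS:
        return Decision.BLOCK
    return Decision.ALLOW

def mediated_call(tool, args, execute):
    """Only ALLOW outcomes ever reach the real executor."""
    decision = evaluate(tool)
    if decision is Decision.ALLOW:
        return decision, execute(tool, args)
    return decision, None  # blocked or queued: nothing executed
```

A prompt-only control leaves `execute` reachable no matter what the model says; the wrapper makes the boundary a property of the system rather than of model compliance.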

What we learned

01 - Know what's running before you scale it. A pre-test scan found 17 tools and no high-risk inventory hits. High-impact behavior still emerged at runtime. Static discovery is necessary, not sufficient.

02 - Controls have to work where the action happens. The governed lane produced 1,615 non-executable outcomes at the tool boundary. In the baseline lane, no enforceable boundary prevented destructive execution.

03 - Evidence has to exist before the incident. Governed execution produced verifiable traces for 99.96% of decisions. Incident response quality depends on artifact-backed history.

04 - Approval has to be enforced, not suggested. 337 write-class actions were routed to approval instead of executing.

05 - Stop has to mean stop. The baseline lane executed 515 tool calls after stop. A stop control that can be ignored is not a safety control.
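An enforced stop differs from a suggested one in where it is checked. A minimal sketch, using assumed names rather than the study's implementation: the stop state lives outside the agent and is consulted at the tool boundary, so once set it cannot be talked past.

```python
import threading

class StopSwitch:
    """A stop control enforced at the tool boundary, not suggested in a prompt."""

    def __init__(self):
        self._stopped = threading.Event()

    def stop(self):
        # Takes effect for every subsequent call, regardless of agent state.
        self._stopped.set()

    def guard(self, execute):
        """Wrap a tool executor so stop is checked before every call."""
        def wrapped(tool, args):
            if self._stopped.is_set():
                raise RuntimeError(f"stop active: refused {tool}")
            return execute(tool, args)
        return wrapped
```

In the baseline lane there was no equivalent of `guard`, which is why 515 tool calls could execute after stop.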

From the CAISI blog

OpenClaw case-study series

What OpenClaw Taught Us About Agent Control

A separate four-part CAISI series grounded directly in this report. It focuses on stop behavior, discovery limits, boundary enforcement, and what the case study does and does not prove.

Methodology

Controlled comparison

Two lanes, same workload, same 24-hour window. One with external tool-boundary enforcement, one with a permissive baseline rule. Pinned to a single OpenClaw commit.

Isolated environment

Containerized runtime, dropped capabilities, read-only root filesystem, no external API keys, resource caps, isolated network.

Pre-registered design

Hypotheses and endpoints were locked before run execution.

Verifiable claims

Each headline maps to deterministic queries over published artifacts, with strict validation gates in the pipeline.
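A deterministic query of this kind is just a fixed tally over the published decision log. The JSONL schema below is illustrative, not the repository's actual format; it shows how a headline count like 1,615 non-executable outcomes can be recomputed from artifacts rather than asserted.

```python
import json

def tally_decisions(jsonl_lines):
    """Deterministic tally over a decision-evidence log.

    Assumes one {"decision": ...} record per line; the field name
    is a placeholder for whatever the published artifacts use.
    """
    counts = {"allow": 0, "block": 0, "require_approval": 0}
    for line in jsonl_lines:
        counts[json.loads(line)["decision"]] += 1
    # Non-executable = blocked outright + routed to approval.
    counts["non_executable"] = counts["block"] + counts["require_approval"]
    return counts
```

Because the same input always yields the same counts, anyone with the artifact set can re-derive the headline numbers.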

Technical details: run ID, commit pin, artifact paths, and reproduction commands

FAQ: enforcement and generalization

What governance structure was used? An external tool-boundary enforcement layer evaluated each tool call before execution and returned allow, block, or require_approval. Governed non-allow outcomes were non-executable and produced signed evidence artifacts.
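Signed evidence means each decision record carries a signature that fails verification if the record is altered after the fact. A minimal sketch using HMAC, with a placeholder key; the report's actual signing scheme and key management are not reproduced here.

```python
import hashlib
import hmac
import json

KEY = b"demo-key"  # placeholder; a real deployment would use managed key material

def sign_decision(record: dict) -> dict:
    """Return a copy of one decision record with a verifiable signature attached."""
    payload = json.dumps(record, sort_keys=True).encode()
    sig = hmac.new(KEY, payload, hashlib.sha256).hexdigest()
    return dict(record, sig=sig)

def verify(record: dict) -> bool:
    """Recompute the signature over everything except the sig field."""
    body = {k: v for k, v in record.items() if k != "sig"}
    payload = json.dumps(body, sort_keys=True).encode()
    expected = hmac.new(KEY, payload, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, record["sig"])
```

Tampering with any field after signing breaks verification, which is what lets an evidence trail support incident response after the fact.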

Does this generalize beyond OpenClaw? The mechanism is portable to agent stacks that expose pre-execution tool-call mediation. The exact rates in this report are case-study-specific to this pinned OpenClaw commit, workload profile, and policy set.

Policy file

For media

Need the short version first? The media brief explains the study in plain language, keeps the headline findings and limitations intact, and links back to the full report and artifact set.

Download media brief (PDF) | Source brief (GitHub) | Press contact