
Research Report - February 2026

We Gave an AI Agent Full Tool Access and Hit Stop. It Didn't Stop.

In a controlled 24-hour test, an ungoverned AI agent deleted emails, shared files publicly, approved payments, and restarted services. Every stop command was ignored. Under the same workload with enforcement, destructive actions dropped to zero.

Read the Full Report (PDF) | See the Data (GitHub)

Four Headline Numbers

100% of stop commands ignored

497 destructive actions executed

707 sensitive accesses with no approval

1,615 actions blocked under enforcement

What We Found

We ran a 24-hour experiment on OpenClaw, an open-source AI agent framework. Two identical setups. One with no enforceable controls. One with tool-boundary enforcement that evaluates every action before it executes.

The ungoverned agent deleted emails, shared internal documents publicly, approved payments without authorization, and restarted infrastructure services. When we issued stop commands, it ignored all of them. 515 tool calls executed after stop. Nothing in the system prevented any of it.

The governed agent faced the same workload. It blocked 1,278 actions outright, sent 337 to an approval queue, and produced a signed evidence trail for every decision. Destructive actions that executed: zero.

This is not a theoretical risk model. It is a measured result from a controlled experiment with published artifacts. Every number on this page traces to a specific artifact in the public repository.

Five Scenarios, 24 Hours

Inbox Cleanup

The agent was given access to email. It deleted 214 messages after stop commands. Under enforcement: 100% blocked.

File Sharing

The agent shared 155 internal documents publicly. Under enforcement: 100% blocked.

Financial Operations

The agent executed 87 write-class payment approval actions without an enforceable approval boundary. Under enforcement: 100% sent to approval queue.

Infrastructure

The agent attempted 260 service restarts. Under enforcement: 100% blocked.

Secrets Handling

The fifth scenario probed secrets access. Under enforcement it reached only 20% governed non-executable coverage; that policy gap is published with the artifacts and discussed in the limitations below.

Side by Side

Without enforcement

Every action executed

Every stop command ignored

Zero evidence trail

497 destructive actions completed

With enforcement

1,278 actions blocked before execution

337 actions sent to approval queue

99.96% signed decision evidence coverage

0 destructive actions completed

The Bigger Picture

Most organizations deploying AI agents today rely on prompt instructions and model compliance to keep agents within bounds. This experiment measured what happens when those instructions are the only control. The answer: the agent does exactly what it is optimized to do and ignores everything else.

This pattern is showing up across industries. CNBC reported last week on AI agents failing silently at scale, including a manufacturing agent that overproduced hundreds of thousands of units and a customer service agent that started approving refunds to optimize for positive reviews. The common thread: no enforceable control at the point where the agent takes action.

IBM X-Force 2026 reports that supply chain compromises have quadrupled over five years and 56% of new vulnerabilities are exploitable without authentication. Unmanaged AI agents with tool access to production systems are the same class of unmanaged dependency. The question is not whether agents will misbehave. It is whether you can stop them when they do.

The EU AI Act begins broad enforcement on August 2, 2026. Auditors are shifting from "do you have a policy" to "show me the evidence." Organizations that can produce signed, structured proof of agent governance will close audits in days. Organizations that cannot will face a different conversation.

What We Learned

01 - Know what's running before you scale it. A pre-test scan found 17 tools and no high-risk inventory hits. High-impact behavior still emerged at runtime. Static discovery is necessary, not sufficient.

02 - Controls have to work where the action happens. The governed lane produced 1,615 non-executable outcomes at the tool boundary. In the baseline lane, no enforceable boundary prevented destructive execution.

03 - Evidence has to exist before the incident. Governed execution produced verifiable traces for 99.96% of decisions. Incident response quality depends on artifact-backed history.

04 - Approval has to be enforced, not suggested. 337 write-class actions were routed to approval instead of executing.

05 - Stop has to mean stop. The baseline lane executed 515 tool calls after stop. A stop control that can be ignored is not a safety control.
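The signed evidence trail in lesson 03 can be sketched as follows. This is an illustrative assumption, not the study's actual evidence format: the field names and the HMAC-SHA256 scheme are ours, and a production system would use asymmetric signatures with managed keys.

```python
import hmac
import hashlib
import json

# Hypothetical: a key held by the enforcement layer, never by the agent.
SIGNING_KEY = b"demo-key-held-by-the-enforcement-layer"


def sign_decision(record: dict) -> dict:
    # Canonical JSON (sorted keys, fixed separators) so verification
    # is deterministic regardless of dict insertion order.
    payload = json.dumps(record, sort_keys=True, separators=(",", ":")).encode()
    sig = hmac.new(SIGNING_KEY, payload, hashlib.sha256).hexdigest()
    return {**record, "signature": sig}


def verify_decision(signed: dict) -> bool:
    record = {k: v for k, v in signed.items() if k != "signature"}
    payload = json.dumps(record, sort_keys=True, separators=(",", ":")).encode()
    expected = hmac.new(SIGNING_KEY, payload, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, signed["signature"])


evidence = sign_decision({
    "tool": "email.delete",
    "decision": "block",
    "timestamp": "2026-02-01T00:00:00Z",
})
assert verify_decision(evidence)
```

The point of the signature is that evidence exists and is tamper-evident before any incident: altering a single field after the fact invalidates the record.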

Methodology

Controlled comparison

Two lanes, same workload, same 24-hour window. One with external tool-boundary enforcement, one with a permissive baseline rule. Pinned to a single OpenClaw commit.

Isolated environment

Containerized runtime, dropped capabilities, read-only root filesystem, no external API keys, resource caps, isolated network.

Pre-registered design

Hypotheses and endpoints were locked before run execution.

Verifiable claims

Each headline maps to deterministic queries over published artifacts, with strict validation gates in the pipeline.

Technical details: run ID, commit pin, artifact paths, and reproduction commands

FAQ: Enforcement and Generalization

What governance structure was used? An external tool-boundary enforcement layer evaluated each tool call before execution and returned allow, block, or require_approval. Governed non-allow outcomes were non-executable and produced signed evidence artifacts.

Does this generalize beyond OpenClaw? The mechanism is portable to agent stacks that expose pre-execution tool-call mediation. The exact rates in this report are case-study-specific to this pinned OpenClaw commit, workload profile, and policy set.

Policy file | Study protocol

What This Doesn't Cover

This is one framework, one pinned commit, and one 24-hour window. It is not an ecosystem census. The workload is scenario-based rather than a sample of production traffic. Run-to-run variance is not estimated. The secrets_handling scenario achieved only 20% governed non-executable coverage, and that policy gap is explicitly published.

Read the Full Report

7 pages. Every claim backed by published artifacts. No email required.

Download PDF | GitHub Artifacts | Reproduction Pipeline

Team

David Ahmann, Head of Cloud, Data and AI Platforms, CDW Canada (LinkedIn)

Talgat Ryshmanov, Principal DevSecOps Consultant, Adaptavist (LinkedIn)

Contact

research@caisi.dev