How to Run an Evaluation-Grade Agent Pilot

The pilot ends, the team is impressed, and the real decision still is not defensible. Everyone learned that a strong operator could get useful output. Almost nobody can say what was proven about control, proof, or widening safety. That is not a useless pilot. It is just not an evaluation-grade one. A serious pilot has to answer a harder question than "was this useful?" It has to answer "what would justify expanding this safely?"

In this piece

Research grounding What most pilots actually prove The failure mode A practical pilot framework What a serious pilot should leave behind Why security, engineering, and platform should co-own it Concrete artifact: a pilot scorecard What to do next

Series home | All field notes

Research grounding

OpenClaw used matched governed and ungoverned lanes under one pinned workload and policy set. The sprawl report stayed explicit about what a public artifact set could and could not prove. Those two habits are the beginnings of a serious pilot framework: controlled comparison and clear scope discipline. See the OpenClaw report and AI Tool Sprawl report.

What most pilots actually prove

Most pilots test productivity, task completion, and how quickly one strong operator can get good output. Those are reasonable things to learn. They are not sufficient for an evaluation decision once the system is expected to interact with real delivery paths.

The missing questions are usually the ones that matter later. What action classes were tested? What did the control layer do when the action got risky? What proof traveled with the run? What exactly would make the team comfortable expanding the pilot from one repo to many, or from read assistance to write-capable work?

Many pilots are still trying to settle three different questions with one loose exercise: is the tool useful, is it governable, and is it worth integrating? Those questions overlap, but they do not collapse into each other.

A pilot that proves usefulness may still leave control quality ambiguous. A pilot that proves control quality may still be too operationally heavy to justify rollout. A serious framework makes those tradeoffs visible instead of hiding them under one thumbs-up or thumbs-down verdict.

The failure mode

The anti-pattern is productivity theater. The tool completes a few convenient tasks, the stakeholders see enough upside to stay excited, and nobody defines the control, proof, or scenario standards that would justify a broader rollout. The pilot ends with enthusiasm but no durable decision framework.

A second weak pattern is to let one team own the pilot alone. If security is absent, the pilot misses boundary and evidence questions. If platform is absent, the pilot misses operability and integration questions. If engineering management is absent, the exit criteria stay too vague to drive a real deployment decision.

A third weak pattern is to postpone the exit memo until after the pilot. That sounds efficient but usually guarantees ambiguity. Teams end up reverse-engineering success criteria from a run they already want to like. The right time to decide what would count as expand, hold narrow, or fail is before the first workflow is exercised.

A practical pilot framework

An evaluation-grade pilot should have five defined elements before it starts, with exit criteria written before the team sees results.

Scope. Which repos, workflows, and action classes are in scope, and which are explicitly out of scope? Start from the Agent Action BOM.
Scenario set. Which risk scenarios will be exercised, including stop, approval-dependent write, share, restart, or proof reconstruction?
Control scorecard. Which efficacy metrics will be measured during the run?
Proof rubric. What evidence fields must travel with the run or resulting change?
Exit criteria. What results would justify expansion, pause, or failure?

Notice what is missing from that list: a generic question about whether the tool "felt good." Usability matters, but it should sit beside control quality and evidence quality, not replace them.

The reason this framework works is that it produces a decision shape, not just a learning exercise. Security can see whether the control assumptions held. Platform can see whether the workflow is operable enough to maintain. Engineering leadership can see whether the upside justifies the integration cost. Procurement gets a cleaner record of why the organization expanded, paused, or walked away.

What a serious pilot should leave behind

A good pilot should end with something the organization can reuse. Not a vague recommendation, but a small package of artifacts: the scenario matrix, the control efficacy scorecard, the proof completeness review, the known gaps, and the explicit decision on whether the pilot should widen or stay narrow.

That is what turns a pilot into organizational knowledge. The next vendor or internal platform experiment should not have to restart the benchmark conversation from the beginning.

One more requirement belongs in the output: name the next control investment plainly. If the pilot stayed narrow because stop behavior was weak, say that. If it stayed narrow because the proof packet was incomplete, say that. If it failed because the human review path was too heavy to scale, say that. The best pilot outputs are useful even when they do not recommend expansion.

Why security, engineering, and platform should co-own it

Security, engineering, and platform should co-own the pilot because they are judging different truths about the same system. Security needs to know whether the boundaries hold and whether the evidence is defensible. Engineering and platform teams need to know whether the system is operable, reusable, and worth integrating. No one group can answer the others' questions by itself.

When both groups are in the design from the start, the organization is much less likely to end the pilot with two conflicting narratives: one saying the tool is powerful, and one saying the risk is still too unclear to proceed.

Engineering leadership matters here too because someone has to own the deployment decision and the widening economics. A control that works only with constant heroics is not mature enough yet. A productive tool that nobody can explain after the fact is not mature enough either. Co-ownership keeps the pilot anchored to both truths at once.

Concrete artifact: a pilot scorecard

A useful scorecard should force one final decision, not just collect observations. The categories should be defined before the pilot starts, not interpreted after the fact.

Expand

Scenario coverage, control efficacy, and proof completeness all met the widening threshold for the next scope.

Hold narrow

The tool is useful, but one or more benchmark dimensions are not strong enough for broader rollout yet.

Fail

The pilot left too many gaps in runtime control, proof, or operational fit to justify expansion.

Next control investment

The scorecard should name the single most important mechanism or artifact the team must improve before widening.

What to do next

For the next pilot, write the exit memo before the work starts. If that feels premature, it usually means the team has not decided what the pilot is supposed to prove.

Define the three possible outcomes: expand, hold narrow, fail.
State which scenario, control, and proof conditions correspond to each outcome.
Make security, engineering, and platform sign off on the criteria before the first run.
Name the next control investment that would matter most if the pilot stays narrow.

If the team cannot do that, it is still gathering impressions. A evaluation-grade pilot should leave behind an exit memo with the scenario results, control scorecard, proof gaps, and expansion decision.