Benchmark Series / Post 4 of 5 / Evidence

Proof Completeness for AI Agent Changes

The proof problem usually shows up late. The pilot looked good, the change merged, and everyone moved on. Then a reviewer, auditor, or incident responder asks a simple question about what happened and the answer turns into screenshots, memory, and partial logs. A green CI badge is a useful signal. It is not enough evidence for a write-capable agent workflow.

Research grounding

The sprawl report found that 47.08% of completed targets lacked verifiable governance evidence. OpenClaw's governed lane measured a 99.96% evidence verification rate. Those numbers matter because they describe two ends of the same category: proof that is missing, and proof that survives verification.

Why proof completeness is its own category

Teams often treat evidence as a byproduct of controls. It deserves to be evaluated directly. A workflow can have approvals and still fail to explain what actually happened. It can have logs and still fail to produce a coherent review packet. That is why proof completeness should be judged as its own dimension, not implied from policy or CI alone.

This matters most for buyers because incomplete proof creates hidden costs. The system may look fine during the pilot. The pain appears later, when a reviewer, auditor, or incident responder tries to reconstruct the change and discovers that the workflow left a trail of fragments instead of one inspectable packet.

We keep coming back to the same rule at CAISI: the approval is not the proof. Authorization tells you what the system was allowed to try. Proof tells you what actually happened. Mature buyers need both because they answer different questions and they tend to fail in different ways.

The failure mode

The anti-pattern is to confuse receipts, logs, and screenshots with a proof packet. Approval receipts are useful. Runtime logs are useful. Screenshots may be temporarily useful. None of them alone tells a third party what the run knew, what policy applied, what executed, which validations ran, and what residual risk remained.

Another weak pattern is to let the vendor define proof in a dashboard-native way. If the evidence only makes sense while you are logged into one product surface, the buyer still has an explanation problem. Proof should survive time, tool changes, and disputes.

A subtler anti-pattern is to optimize proof only for happy-path review. That creates neat summaries for successful runs and brittle evidence for the moments that matter most: contested outcomes, failed validations, disputed approvals, or questions that arrive weeks later. Buyers should judge proof by how it behaves under scrutiny, not just by how cleanly it renders in a demo.

The seven fields of a complete proof packet

We think a buyer-grade proof packet should include seven fields.

Buyers do not need every packet to be verbose. They do need it to be complete enough that a third party could review the run cold. That is the standard that separates a proof artifact from a collection of traces.

Another useful test is portability. If the packet only works while you are logged into one product and already know the internal naming, it is not complete enough yet. Proof has to survive time, vendor boundaries, and the fact that future reviewers will not share the original operator's context.
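To make the shape of such a packet concrete, here is a minimal sketch in Python. The field names are illustrative assumptions, mapped from the reviewer questions named earlier (what the run knew, what policy applied, what executed, which validations ran, what residual risk remained) plus approvals and artifacts; the post does not prescribe a schema, and a real packet would follow whatever field definitions the buyer adopts.

```python
import json
from dataclasses import dataclass, field, asdict

# Illustrative sketch only: these seven field names are assumptions,
# derived from the questions a cold reviewer asks, not a prescribed schema.
@dataclass
class ProofPacket:
    run_context: str = ""   # what the run knew: inputs, scope, task
    policy_applied: str = ""  # which policy governed the run
    approvals: list = field(default_factory=list)        # who authorized what
    execution_trace: list = field(default_factory=list)  # what actually executed
    validations: list = field(default_factory=list)      # which checks ran
    artifacts: list = field(default_factory=list)        # diffs, outputs, receipts
    residual_risk: str = ""   # what remains uncertain after the run

    def missing_fields(self):
        """Return the fields a cold reviewer would find empty."""
        return [name for name, value in asdict(self).items() if not value]

    def to_portable_json(self):
        """Serialize so the packet survives outside one product surface."""
        return json.dumps(asdict(self), indent=2)
```

The `missing_fields` check is the cold-review test in code form: anything it returns is context the original operator is still carrying in their head, and `to_portable_json` is one way to satisfy the portability test, since the packet can be read without logging into any product.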

Why security leaders care

Security leaders care because proof is where many control claims fail their first real audit. The team can describe the policy. The vendor can describe the feature. But when the reviewer asks what executed and why, the answer still depends on someone who remembers the run.

A completeness model changes that. It lets AppSec state what evidence must travel with the workflow before the organization expands the approved scope.

It also changes incident posture. When proof is incomplete, security has to reconstruct events by interviewing people and stitching logs. When proof is complete, the team can start with one coherent packet and spend more of its time deciding what to do next instead of arguing about what happened.

Why platform and engineering care

Platform and engineering teams benefit because complete proof reduces review friction and debugging time. A strong packet helps a successful run merge faster and helps a failed run get understood faster. In both cases the team spends less time reconstructing context and more time making decisions.

It also makes pilot comparison easier. Once the organization has a stable proof rubric, two tools can be compared on whether they produce reviewable artifacts, not just on whether they completed the task.

This is one of the least appreciated productivity benefits in the category. Good proof is not extra paperwork layered on top of engineering work. It is the thing that reduces re-review, cuts cold debugging time, and keeps later questions from turning into archaeology.

Concrete artifact: a proof completeness rubric

A simple rubric can score each field as missing, partial, or reviewable, but the important part is that the rubric is portable enough to use across tools and pilots.

Missing

The reviewer would need side-channel context, screenshots, or memory to understand the field.

Partial

The field exists, but it is incomplete, ambiguous, or trapped in one product surface.

Reviewable

The field is portable, specific, and sufficient for a third party to inspect without guessing.

Residual risk

One field deserves a special note: residual risk is not a score level, it is the part of the packet that names what remains uncertain, so the reviewer does not mistake completeness for certainty.
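The rubric above can be sketched as a small scoring helper. This is a hedged illustration, not a prescribed tool: the field list reuses the hypothetical names from the reviewer questions earlier in the post, and unscored fields are treated as missing, which matches the conservative reading of the rubric.

```python
from enum import Enum

class Score(Enum):
    MISSING = 0     # reviewer needs side-channel context, screenshots, or memory
    PARTIAL = 1     # field exists but is incomplete or trapped in one surface
    REVIEWABLE = 2  # portable, specific, inspectable without guessing

# Hypothetical field list, named after the reviewer questions in the post.
FIELDS = ["run_context", "policy_applied", "approvals",
          "execution_trace", "validations", "artifacts", "residual_risk"]

def score_packet(scores: dict) -> dict:
    """Summarize one rubric pass; any field not scored counts as MISSING."""
    results = {f: scores.get(f, Score.MISSING) for f in FIELDS}
    # A packet is only as reviewable as its weakest field.
    worst = min(results.values(), key=lambda s: s.value)
    return {
        "fields": results,
        "verdict": worst.name,
        "gaps": [f for f, s in results.items() if s is not Score.REVIEWABLE],
    }
```

Taking the worst field as the overall verdict reflects the point of the rubric: one missing field is enough to force a reviewer back to interviews and stitched logs, no matter how clean the other six are.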

What to do next

Take the last autonomous or semi-autonomous change your team produced and score it against the seven fields above. Do it with someone who was not involved in the original run. That is usually when the gaps become obvious.

Done that way, the exercise quickly shows whether the workflow produces proof or merely leaves traces behind.