Benchmark Series / Post 4 of 5 / Evidence

Proof Completeness for AI Agent Changes

The proof problem usually shows up late. The pilot looked good, the change merged, and everyone moved on. Then a reviewer, auditor, or incident responder asks a simple question about what happened and the answer turns into screenshots, memory, and partial logs. A green CI badge is a useful signal. It is not enough evidence for a write-capable agent workflow.

Research grounding

The sprawl report found that 47.08% of completed targets lacked verifiable governance evidence. OpenClaw's governed lane measured a 99.96% evidence verification rate. Those numbers matter because they describe two ends of the same category: proof that is missing, and proof that survives verification.

Why proof completeness is its own category

Teams often treat evidence as a byproduct of controls. It deserves to be evaluated directly. A workflow can have approvals and still fail to explain what actually happened. It can have logs and still fail to produce a coherent review packet. That is why proof completeness should be judged as its own dimension, not implied from policy or CI alone.

This matters most for buyers because incomplete proof creates hidden costs. The system may look fine during the pilot. The pain appears later, when a reviewer, auditor, or incident responder tries to reconstruct the change and discovers that the workflow left a trail of fragments instead of one inspectable packet.

We keep coming back to the same rule at CAISI: the approval is not the proof. Authorization tells you what the system was allowed to try. Proof tells you what actually happened. Mature buyers need both because they answer different questions and they tend to fail in different ways.

The failure mode

The anti-pattern is to confuse receipts, logs, and screenshots with a proof packet. Approval receipts are useful. Runtime logs are useful. Screenshots may be temporarily useful. None of them alone tells a third party what the run knew, what policy applied, what executed, which validations ran, and what residual risk remained.

Another weak pattern is to let the vendor define proof in a dashboard-native way. If the evidence only makes sense while you are logged into one product surface, the buyer still has an explanation problem. Proof should survive time, tool changes, and disputes.

A subtler anti-pattern is to optimize proof only for happy-path review. That creates neat summaries for successful runs and brittle evidence for the moments that matter most: contested outcomes, failed validations, disputed approvals, or questions that arrive weeks later. Buyers should judge proof by how it behaves under scrutiny, not just by how cleanly it renders in a demo.

The seven fields of a complete proof packet

We think a buyer-grade proof packet should include seven fields.

Buyers do not need every packet to be verbose. They do need it to be complete enough that a third party could review the run cold. That is the standard that separates a proof artifact from a collection of traces.

Another useful test is portability. If the packet only works while you are logged into one product and already know the internal naming, it is not complete enough yet. Proof has to survive time, vendor boundaries, and the fact that future reviewers will not share the original operator's context.
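To make the shape of such a packet concrete, here is a minimal sketch in Python. The field names are illustrative assumptions, mapped from the reviewer questions named earlier (what the run knew, what policy applied, what executed, which validations ran, what residual risk remained) plus approvals and artifacts; the post does not prescribe a schema, and a real packet would follow whatever field definitions the buyer adopts.

```python
import json
from dataclasses import dataclass, field, asdict

# Illustrative sketch only: these seven field names are assumptions,
# derived from the questions a cold reviewer asks, not a prescribed schema.
@dataclass
class ProofPacket:
    run_context: str = ""   # what the run knew: inputs, scope, task
    policy_applied: str = ""  # which policy governed the run
    approvals: list = field(default_factory=list)        # who authorized what
    execution_trace: list = field(default_factory=list)  # what actually executed
    validations: list = field(default_factory=list)      # which checks ran
    artifacts: list = field(default_factory=list)        # diffs, outputs, receipts
    residual_risk: str = ""   # what remains uncertain after the run

    def missing_fields(self):
        """Return the fields a cold reviewer would find empty."""
        return [name for name, value in asdict(self).items() if not value]

    def to_portable_json(self):
        """Serialize so the packet survives outside one product surface."""
        return json.dumps(asdict(self), indent=2)
```

The `missing_fields` check is the cold-review test in code form: anything it returns is context the original operator is still carrying in their head, and `to_portable_json` is one way to satisfy the portability test, since the packet can be read without logging into any product.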

Why security leaders care

Security leaders care because proof is where many control claims fail their first real audit. The team can describe the policy. The vendor can describe the feature. But when the reviewer asks what executed and why, the answer still depends on someone who remembers the run.

A completeness model changes that. It lets AppSec state what evidence must travel with the workflow before the organization expands the approved scope.

It also changes incident posture. When proof is incomplete, security has to reconstruct events by interviewing people and stitching logs. When proof is complete, the team can start with one coherent packet and spend more of its time deciding what to do next instead of arguing about what happened.

Why platform and engineering care

Platform and engineering teams benefit because complete proof reduces review friction and debugging time. A strong packet helps a successful run merge faster and helps a failed run get understood faster. In both cases the team spends less time reconstructing context and more time making decisions.

It also makes pilot comparison easier. Once the organization has a stable proof rubric, two tools can be compared on whether they produce reviewable artifacts, not just on whether they completed the task.

This is one of the least appreciated productivity benefits in the category. Good proof is not extra paperwork layered on top of engineering work. It is the thing that reduces re-review, cuts cold debugging time, and keeps later questions from turning into archaeology.

Concrete artifact: a proof completeness rubric

A simple rubric can score each field as missing, partial, or reviewable, but the important part is that the rubric is portable enough to use across tools and pilots.

Missing

The reviewer would need side-channel context, screenshots, or memory to understand the field.

Partial

The field exists, but it is incomplete, ambiguous, or trapped in one product surface.

Reviewable

The field is portable, specific, and sufficient for a third party to inspect without guessing.

Residual risk

One field deserves a special note: residual risk is not a score level, it is the part of the packet that names what remains uncertain, so the reviewer does not mistake completeness for certainty.
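The rubric above can be sketched as a small scoring helper. This is a hedged illustration, not a prescribed tool: the field list reuses the hypothetical names from the reviewer questions earlier in the post, and unscored fields are treated as missing, which matches the conservative reading of the rubric.

```python
from enum import Enum

class Score(Enum):
    MISSING = 0     # reviewer needs side-channel context, screenshots, or memory
    PARTIAL = 1     # field exists but is incomplete or trapped in one surface
    REVIEWABLE = 2  # portable, specific, inspectable without guessing

# Hypothetical field list, named after the reviewer questions in the post.
FIELDS = ["run_context", "policy_applied", "approvals",
          "execution_trace", "validations", "artifacts", "residual_risk"]

def score_packet(scores: dict) -> dict:
    """Summarize one rubric pass; any field not scored counts as MISSING."""
    results = {f: scores.get(f, Score.MISSING) for f in FIELDS}
    # A packet is only as reviewable as its weakest field.
    worst = min(results.values(), key=lambda s: s.value)
    return {
        "fields": results,
        "verdict": worst.name,
        "gaps": [f for f, s in results.items() if s is not Score.REVIEWABLE],
    }
```

Taking the worst field as the overall verdict reflects the point of the rubric: one missing field is enough to force a reviewer back to interviews and stitched logs, no matter how clean the other six are.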

What to do next

Take the last autonomous or semi-autonomous change your team produced and score it against the seven fields above. Do it with someone who was not involved in the original run. That is usually when the gaps become obvious.

Done that way, the exercise quickly shows whether the workflow produces proof or merely leaves traces behind.