AI Engineering Operating Notes / Post 10 of 10
Independent research and operating notes on AI agent governance.
Most teams are not arguing about whether AI engineering will happen. They are arguing about how much autonomy they can defend this quarter. That is not a model-selection problem. It is a maturity problem: how much autonomy can current controls, evidence, and operating discipline justify today?
We see the same pattern across teams. They start with interactive prompting and a few strong engineers. Then they add repository guidance, scripts, and some workflow structure. Eventually they want orchestration, isolated workspaces, hidden evaluation, and proof-rich shipping. The trap is trying to skip the middle. That usually produces more autonomy than the team can actually govern.
A maturity model helps because it turns vague optimism into a concrete adoption path. It gives AppSec a framework for approval. It gives platform teams a roadmap for implementation. It gives engineering leaders a way to invest in the next capability without pretending that every repo needs the final state immediately.
It also prevents a common failure mode: pilot success in one repo gets misread as enterprise readiness across all repos.
The anti-pattern is autonomy inflation: assuming that once an agent is useful interactively, the organization is ready for background execution at scale. That leap ignores everything in between: repo contract quality, sandbox isolation, policy boundaries, hidden evals, proof packets, and operational ownership.
It also creates bad conversations between security and engineering. One side sees unacceptable risk. The other sees arbitrary friction. A maturity model gives both sides a shared language for what must be true before the next level is justified.
Without this shared language, escalation becomes personal and political instead of technical and measurable.
The better pattern is staged capability growth. We find it useful to think in five levels.
None of these levels is a value judgment. They are implementation states. A team can be effective at one level for a long time. Trouble starts when the organization claims a higher level than its controls support.
Each level should also have an exit test. Do not claim repo-aware if critical commands are still discovered by memory. Do not claim governed delivery if non-allow outcomes can still execute. Do not claim dark-factory capability if approvals, retries, and proof still depend on manual reconstruction after the fact.
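Those exit tests can be written down as predicates over observed controls rather than left as prose. The level names below come from the text; the control fields and the specific predicate logic are illustrative assumptions, not a fixed scheme.

```python
# Sketch: maturity exit tests as predicates over observed controls.
# Level names come from the text; control fields are illustrative.

CONTROLS = {
    "commands_pinned": True,           # critical commands come from config, not memory
    "policy_blocks_non_allow": False,  # non-allow outcomes cannot execute
    "proof_auto_collected": False,     # approvals, retries, proof captured during the run
}

EXIT_TESTS = {
    "repo-aware": lambda c: c["commands_pinned"],
    "governed delivery": lambda c: c["commands_pinned"] and c["policy_blocks_non_allow"],
    "dark factory": lambda c: all(c.values()),
}

def claimable(level: str, controls: dict) -> bool:
    """A maturity label is only meaningful if its exit test passes."""
    return EXIT_TESTS[level](controls)

print(claimable("repo-aware", CONTROLS))         # True
print(claimable("governed delivery", CONTROLS))  # False: policy layer not in place yet
```

The useful property is that a claim becomes a function of evidence: when someone asserts "governed delivery," the question is no longer who is being difficult but which predicate fails.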
A maturity model becomes theater when it turns into a labeling exercise. The point is not to declare a level in a slide deck. The point is to connect each level to technical prerequisites, evidence requirements, and approval posture. If a team says it is running governed delivery but cannot show isolated workspaces, deterministic validation, and proof packets, the label is not meaningful.
Used properly, the model becomes a planning tool. It tells leaders what to build next, what not to claim yet, and which classes of work are safe to widen. That is much more useful than broad statements about being "AI ready."
The practical pattern is to pair every maturity claim with three artifacts: an owner, an evidence requirement, and an exit test.
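That pairing can be enforced mechanically: a claim record that refuses to exist without all three artifacts. The field names here are assumptions for illustration, not an established schema.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class MaturityClaim:
    """A maturity claim is admissible only with all three artifacts attached."""
    level: str
    owner: str       # who is accountable for the claim
    evidence: str    # what artifact backs it (e.g. proof packets, eval results)
    exit_test: str   # what must be true before claiming the next level

    def __post_init__(self):
        for artifact in (self.owner, self.evidence, self.exit_test):
            if not artifact.strip():
                raise ValueError("claim rejected: missing owner, evidence, or exit test")

claim = MaturityClaim(
    level="governed delivery",
    owner="platform team",
    evidence="proof packets for every autonomous change",
    exit_test="non-allow outcomes cannot execute",
)
```

A rejected constructor is a feature here: a level asserted in a slide deck with no owner or evidence should fail loudly, not circulate quietly.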
AppSec and security engineering need this model because approval posture should change with maturity. Level 1 may allow interactive assistance on low-risk work. Level 4 may justify background change generation inside bounded repos. Level 5 may justify parallel autonomous work only if the evidence, isolation, and policy layers are already mature.
The value of the model is that it makes "no" more precise and "yes" more defensible. A security team does not have to reject autonomy as a whole. It can say, "You are ready for the next level once these control conditions are met."
Platform and engineering teams need a roadmap that matches implementation reality. The first investments should usually be repo contract quality, deterministic commands, and isolated execution. Then come blueprinting, orchestration, proof packets, and layered evaluation. That order matters because later stages depend on the earlier ones being stable.
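The dependency ordering can be made explicit instead of argued from intuition. A minimal sketch using Python's standard graphlib, with capability names taken from the text; the exact dependency edges are an illustrative assumption:

```python
from graphlib import TopologicalSorter

# Each capability maps to the capabilities it depends on.
# Names come from the text; the edges are an illustrative assumption.
deps = {
    "repo contracts": set(),
    "deterministic commands": {"repo contracts"},
    "isolated execution": {"repo contracts"},
    "blueprinting": {"deterministic commands"},
    "orchestration": {"blueprinting", "isolated execution"},
    "proof packets": {"deterministic commands", "isolated execution"},
    "layered evaluation": {"orchestration", "proof packets"},
}

build_order = list(TopologicalSorter(deps).static_order())
print(build_order)  # repo contracts first; orchestration only after its prerequisites
```

Sequencing the roadmap this way makes "novelty-first" plans visibly invalid: orchestration simply cannot be scheduled before the contracts and validators it depends on.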
This is also the right way to think about product fit. Wrkr-style discovery and orchestration makes more sense once the repo is legible. Gait-style policy control makes more sense once the workflow has a real execution boundary to mediate. The products are not the maturity model. They map onto it.
The roadmap should be sequenced by dependency, not by novelty. Building advanced orchestration before repo contracts and deterministic validators are stable usually creates fragile complexity.
Most teams do not need to build a dark factory in one quarter. They do need to move out of the prompt-first zone deliberately.
Days 1-30: Clean up repo contracts, pin deterministic commands, and define path boundaries for one or two high-value workflows.
Next: Split workflows into blueprints, add isolated workspaces, and define the minimum proof packet for autonomous changes.
Then: Introduce orchestration states, holdout evaluation, and a narrow approval posture for background runs in bounded repos.
At each step, assess one team or one repo against the five levels and be strict about what evidence counts.
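The "minimum proof packet" for autonomous changes can start as nothing more than a checked bundle of run evidence. The fields below are an illustrative assumption, not a fixed schema; the point is that an incomplete packet means an unreviewable change.

```python
from dataclasses import dataclass

@dataclass
class ProofPacket:
    """Minimum evidence bundle attached to an autonomous change.

    Field names are illustrative; a change without a complete packet
    is not reviewable, so it is not shippable.
    """
    diff_summary: str
    validation_commands: list  # deterministic commands that were run
    validation_passed: bool
    policy_decisions: list     # allow/deny outcomes recorded during the run
    workspace_id: str          # which isolated workspace produced the change

    def complete(self) -> bool:
        return bool(self.diff_summary and self.validation_commands
                    and self.policy_decisions and self.workspace_id)

packet = ProofPacket(
    diff_summary="bounded refactor in payments/",
    validation_commands=["make test", "make lint"],
    validation_passed=True,
    policy_decisions=["allow: write within payments/"],
    workspace_id="ws-1",
)
print(packet.complete())  # True
```

Notice that the packet is assembled during the run, not reconstructed afterwards; manual reconstruction after the fact is exactly the exit-test failure the dark-factory level rules out.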
The point of maturity is not prestige. The point is to let the organization adopt more autonomy with less guesswork and fewer political collisions.
That is the core belief behind this series. The goal is not fully autonomous everything. The goal is governed, observable, high-leverage autonomy that security can approve and engineering can actually use.