AI Engineering Operating Notes / Post 10 of 10
Independent research and operating notes on AI agent governance.
Most teams are not arguing about whether AI engineering will happen. They are arguing about how much autonomy they can defend this quarter. That is not a model-selection problem. It is a maturity problem: how much autonomy can current controls, evidence, and operating discipline justify today?
We see the same pattern across teams. They start with interactive prompting and a few strong engineers. Then they add repository guidance, scripts, and some workflow structure. Eventually they want orchestration, isolated workspaces, hidden evaluation, and proof-rich shipping. The trap is trying to skip the middle. That usually produces more autonomy than the team can actually govern.
A maturity model helps because it turns vague optimism into a concrete adoption path. It gives AppSec a framework for approval. It gives platform teams a roadmap for implementation. It gives engineering leaders a way to invest in the next capability without pretending that every repo needs the final state immediately.
It also prevents a common failure mode: pilot success in one repo gets misread as enterprise readiness across all repos.
The anti-pattern is autonomy inflation: assuming that once an agent is useful interactively, the organization is ready for background execution at scale. That leap ignores everything in between: repo contract quality, sandbox isolation, policy boundaries, hidden evals, proof packets, and operational ownership.
It also creates bad conversations between security and engineering. One side sees unacceptable risk. The other sees arbitrary friction. A maturity model gives both sides a shared language for what must be true before the next level is justified.
Without this shared language, escalation becomes personal and political instead of technical and measurable.
The better pattern is staged capability growth. We find it useful to think in five levels.
None of these levels is a value judgment. They are implementation states. A team can be effective at one level for a long time. Trouble starts when the organization claims a higher level than its controls support.
Each level should also have an exit test. Do not claim repo-aware if critical commands are still discovered by memory. Do not claim governed delivery if non-allow outcomes can still execute. Do not claim dark-factory capability if approvals, retries, and proof still depend on manual reconstruction after the fact.
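Those exit tests can be written down as predicates over observed controls rather than left as prose. The level names below come from the text; the control fields and the specific predicate logic are illustrative assumptions, not a fixed scheme.

```python
# Sketch: maturity exit tests as predicates over observed controls.
# Level names come from the text; control fields are illustrative.

CONTROLS = {
    "commands_pinned": True,           # critical commands come from config, not memory
    "policy_blocks_non_allow": False,  # non-allow outcomes cannot execute
    "proof_auto_collected": False,     # approvals, retries, proof captured during the run
}

EXIT_TESTS = {
    "repo-aware": lambda c: c["commands_pinned"],
    "governed delivery": lambda c: c["commands_pinned"] and c["policy_blocks_non_allow"],
    "dark factory": lambda c: all(c.values()),
}

def claimable(level: str, controls: dict) -> bool:
    """A maturity label is only meaningful if its exit test passes."""
    return EXIT_TESTS[level](controls)

print(claimable("repo-aware", CONTROLS))         # True
print(claimable("governed delivery", CONTROLS))  # False: policy layer not in place yet
```

The useful property is that a claim becomes a function of evidence: when someone asserts "governed delivery," the question is no longer who is being difficult but which predicate fails.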
A maturity model becomes theater when it turns into a labeling exercise. The point is not to declare a level in a slide deck. The point is to connect each level to technical prerequisites, evidence requirements, and approval posture. If a team says it is running governed delivery but cannot show isolated workspaces, deterministic validation, and proof packets, the label is not meaningful.
Used properly, the model becomes a planning tool. It tells leaders what to build next, what not to claim yet, and which classes of work are safe to widen. That is much more useful than broad statements about being "AI ready."
The practical pattern is to pair every maturity claim with three artifacts: an owner, an evidence requirement, and an exit test.
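That pairing can be enforced mechanically: a claim record that refuses to exist without all three artifacts. The field names here are assumptions for illustration, not an established schema.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class MaturityClaim:
    """A maturity claim is admissible only with all three artifacts attached."""
    level: str
    owner: str       # who is accountable for the claim
    evidence: str    # what artifact backs it (e.g. proof packets, eval results)
    exit_test: str   # what must be true before claiming the next level

    def __post_init__(self):
        for artifact in (self.owner, self.evidence, self.exit_test):
            if not artifact.strip():
                raise ValueError("claim rejected: missing owner, evidence, or exit test")

claim = MaturityClaim(
    level="governed delivery",
    owner="platform team",
    evidence="proof packets for every autonomous change",
    exit_test="non-allow outcomes cannot execute",
)
```

A rejected constructor is a feature here: a level asserted in a slide deck with no owner or evidence should fail loudly, not circulate quietly.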
AppSec and security engineering need this model because approval posture should change with maturity. Level 1 may allow interactive assistance on low-risk work. Level 4 may justify background change generation inside bounded repos. Level 5 may justify parallel autonomous work only if the evidence, isolation, and policy layers are already mature.
The value of the model is that it makes "no" more precise and "yes" more defensible. A security team does not have to reject autonomy as a whole. It can say, "You are ready for the next level once these control conditions are met."
Platform and engineering teams need a roadmap that matches implementation reality. The first investments should usually be repo contract quality, deterministic commands, and isolated execution. Then come blueprinting, orchestration, proof packets, and layered evaluation. That order matters because later stages depend on the earlier ones being stable.
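The dependency ordering can be made explicit instead of argued from intuition. A minimal sketch using Python's standard graphlib, with capability names taken from the text; the exact dependency edges are an illustrative assumption:

```python
from graphlib import TopologicalSorter

# Each capability maps to the capabilities it depends on.
# Names come from the text; the edges are an illustrative assumption.
deps = {
    "repo contracts": set(),
    "deterministic commands": {"repo contracts"},
    "isolated execution": {"repo contracts"},
    "blueprinting": {"deterministic commands"},
    "orchestration": {"blueprinting", "isolated execution"},
    "proof packets": {"deterministic commands", "isolated execution"},
    "layered evaluation": {"orchestration", "proof packets"},
}

build_order = list(TopologicalSorter(deps).static_order())
print(build_order)  # repo contracts first; orchestration only after its prerequisites
```

Sequencing the roadmap this way makes "novelty-first" plans visibly invalid: orchestration simply cannot be scheduled before the contracts and validators it depends on.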
This is also the right way to think about product fit. Wrkr-style discovery and orchestration makes more sense once the repo is legible. Gait-style policy control makes more sense once the workflow has a real execution boundary to mediate. The products are not the maturity model. They map onto it.
The roadmap should be sequenced by dependency, not by novelty. Building advanced orchestration before repo contracts and deterministic validators are stable usually creates fragile complexity.
Most teams do not need to build a dark factory in one quarter. They do need to move out of the prompt-first zone deliberately.
Days 1-30: Clean up repo contracts, pin deterministic commands, and define path boundaries for one or two high-value workflows.
Next: Split workflows into blueprints, add isolated workspaces, and define the minimum proof packet for autonomous changes.
Then: Introduce orchestration states, holdout evaluation, and a narrow approval posture for background runs in bounded repos.
At each step, assess one team or one repo against the five levels and be strict about what evidence counts.
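The "minimum proof packet" for autonomous changes can start as nothing more than a checked bundle of run evidence. The fields below are an illustrative assumption, not a fixed schema; the point is that an incomplete packet means an unreviewable change.

```python
from dataclasses import dataclass

@dataclass
class ProofPacket:
    """Minimum evidence bundle attached to an autonomous change.

    Field names are illustrative; a change without a complete packet
    is not reviewable, so it is not shippable.
    """
    diff_summary: str
    validation_commands: list  # deterministic commands that were run
    validation_passed: bool
    policy_decisions: list     # allow/deny outcomes recorded during the run
    workspace_id: str          # which isolated workspace produced the change

    def complete(self) -> bool:
        return bool(self.diff_summary and self.validation_commands
                    and self.policy_decisions and self.workspace_id)

packet = ProofPacket(
    diff_summary="bounded refactor in payments/",
    validation_commands=["make test", "make lint"],
    validation_passed=True,
    policy_decisions=["allow: write within payments/"],
    workspace_id="ws-1",
)
print(packet.complete())  # True
```

Notice that the packet is assembled during the run, not reconstructed afterwards; manual reconstruction after the fact is exactly the exit-test failure the dark-factory level rules out.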
The point of maturity is not prestige. The point is to let the organization adopt more autonomy with less guesswork and fewer political collisions.
That is the core belief behind this series. The goal is not fully autonomous everything. The goal is governed, observable, high-leverage autonomy that security can approve and engineering can actually use.