Safe fan-out
Independent research and operating notes on AI agent governance.
AI Engineering Operating Notes / Post 7 of 10
Queue volume rises, more agent workers start, and everyone assumes throughput should climb with them. Then the collisions begin: overlapping edits, stale assumptions, retries against invalid state, and review handoffs nobody can merge cleanly. Parallelism is where agent programs either become a platform capability or a coordination crisis.
Most repositories contain work that could run in parallel if the system actually knew what was safe. A docs change and a detector change may not conflict. A frontend refactor and a billing schema migration probably should not run at the same time without coordination. Humans have a rough intuition for these boundaries. Agents need the orchestrator to encode them.
Without that encoding, concurrency produces a familiar mess: duplicate edits, branch conflicts, stale assumptions, wasted retries, and review handoffs nobody can merge cleanly. Teams then conclude that parallel agents are inherently chaotic when the real problem is the missing control layer above them.
The practical signal is simple: queue volume rises, but completed and mergeable work does not rise with it.
The anti-pattern is unmanaged fan-out. An issue queue is full, the tooling can start multiple runs, and teams assume the system should do so aggressively. That treats concurrency as a scheduling question when it is really a claims question: who owns which paths, which tasks depend on others, and what should happen when issue state changes underneath an active run?
If the orchestrator cannot answer those questions, more agents simply means more hidden contention. The queue appears busy while actual delivery quality gets worse.
The sharpest failure mode is stale parallelism: a run keeps going even after upstream issue state has changed and invalidated its assumptions.
Safe concurrency starts with explicit claims. Each run should declare what paths it intends to modify, which upstream work it depends on, and which branch or artifact state it assumes. The orchestrator then uses those claims to decide what can run in parallel, what must wait, and what should be cancelled or retried when conditions change.
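A claim can be as small as a record each run submits before starting. Here is a minimal sketch in Python; the field names (`paths`, `depends_on`, `base_ref`) and example run IDs are illustrative assumptions, not a fixed schema:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Claim:
    """What a run declares before it starts. Field names are illustrative."""
    run_id: str
    paths: frozenset                      # path prefixes the run intends to modify
    depends_on: frozenset = frozenset()   # work items that must finish first
    base_ref: str = "main"                # branch or artifact state the run assumes

# Hypothetical docs-only and detector-only runs:
docs = Claim("run-docs", frozenset({"docs/"}))
detector = Claim("run-detector", frozenset({"detectors/"}))
print(docs.paths.isdisjoint(detector.paths))  # True: no shared path prefixes
```

Because the claim is data, not convention, the orchestrator can compare claims mechanically instead of trusting each run's judgment.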
A dependency DAG is not overkill here.
It is the minimum structure required to let autonomous work scale safely. Once the graph exists, concurrency becomes a policy decision instead of a guess. You can block overlapping claims, reuse workspaces across retries, and re-evaluate the run when a parent issue or dependency changes state.
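That policy layer can be sketched in a few lines, assuming claims are sets of path prefixes. The prefix-containment rule and the reason strings here are assumptions for illustration, not a standard:

```python
def paths_overlap(a: set, b: set) -> bool:
    """Two claims conflict if any path prefix in one contains one in the other."""
    return any(p.startswith(q) or q.startswith(p) for p in a for q in b)

def admit(paths: set, deps: set, active: list, completed: set) -> str:
    """Decide whether a run may start against currently active claims."""
    if any(paths_overlap(paths, other) for other in active):
        return "hold:path-overlap"
    if not deps <= completed:            # all dependencies must be completed
        return "hold:unmet-dependency"
    return "start"

active = [{"billing/schema/"}]
print(admit({"docs/"}, set(), active, set()))                   # start
print(admit({"billing/schema/001.sql"}, set(), active, set()))  # hold:path-overlap
```

The decision returns a reason, not just a boolean, so the hold can be explained later in review or post-incident analysis.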
The rule is straightforward: concurrency should be earned by explicit non-overlap, not assumed by default.
From a security perspective, unmanaged concurrency creates invisible risk. Two changes may each look safe in isolation and become unsafe together. One run may invalidate assumptions another run depends on. If no system owns those dependencies, accountability gets diluted the moment multiple autonomous changes are in flight.
Safe parallelism is not about slowing things down. It is about keeping ownership explicit. A change packet should tell a reviewer not only what the run did, but what else it was allowed to overlap with and why that overlap was considered safe.
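One hedged sketch of what that disclosure might look like inside a change packet; every field name and value here is hypothetical:

```python
import json

# Hypothetical change-packet fields recording overlap disclosure for a reviewer.
change_packet = {
    "run_id": "run-docs",
    "modified_paths": ["docs/"],
    "allowed_overlap_with": ["run-detector"],
    "overlap_rationale": "disjoint path claims, no shared dependencies",
}
print(json.dumps(change_packet, indent=2))
```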
This is the real throughput multiplier for engineering. One isolated agent run can be useful. A controlled system that can run independent work items at the same time is where output actually bends upward. That only happens if cancellation, retry backoff, restart recovery, and workspace reuse are designed into the orchestrator instead of improvised by humans.
Platform leaders should think of this as traffic control, not raw compute scaling. The value is not "more runs." The value is more safe work completed per review cycle.
This also improves accountability. When claims and dependencies are explicit, post-incident reviews can explain why overlapping work was allowed, blocked, retried, or cancelled.
Concurrency should be a claim-aware decision, not an optimistic default.
A docs-only task and a detector-only task claim different paths and share no dependencies, so they can run at the same time.
Two tasks claim the same service directory, or one task depends on schema changes from another. In either case the orchestrator holds one run.
If the blocking run fails or the issue changes state, the held run is retried or cancelled with a clear reason code.
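The hold-then-resolve step above can be sketched as a single decision function; the reason codes are illustrative, not a fixed vocabulary:

```python
def resolve_held_run(blocker_outcome: str, issue_still_valid: bool) -> str:
    """Decide what happens to a held run once its blocker finishes."""
    if not issue_still_valid:
        return "cancel:issue-state-changed"  # upstream issue invalidated the run
    if blocker_outcome == "failed":
        return "retry:blocker-failed"        # re-queue, typically with backoff
    return "start:blocker-completed"         # claims no longer conflict

print(resolve_held_run("failed", issue_still_valid=True))   # retry:blocker-failed
print(resolve_held_run("merged", issue_still_valid=False))  # cancel:issue-state-changed
```

Emitting a reason code at every transition is what makes the audit trail in the next paragraph possible.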
Before adding more parallel workers, add a concurrency model.
Concurrency is a force multiplier only when the orchestrator owns the claims and the conflict rules.
Otherwise it is a faster path to branch-level confusion.
The next post shifts from scaling work to trusting outcomes. Visible tests matter, but they are not enough when agents can optimize against the exact checks they can see.