What I learned trying to run multiple AI agents on real projects, and what I built because of it.
February 2025
Software engineering has gone through four transitions in rapid succession. Each one changed not just how we write code, but where.
The first three had a natural home. Autocomplete lives in the editor. A sidebar agent lives in the IDE. A terminal agent lives in the terminal. But the fourth is different.
I wanted to run multiple side projects in parallel — have agents working on one while I focused on another. When you're running three agents simultaneously, you're not editing alongside them. You're managing them. And the right form factor for managing a team isn't an IDE or a terminal.
The obvious answer is tmux — split your terminal into panes, run an agent in each one. I tried this. It works until it doesn't. Three agents means three panes to watch, with no unified view of what's done, what's stuck, and what needs you. You're manually tracking which agent finished, which branch to review, which to merge first. There's no shared state between them, no way to coordinate handoffs, and when you come back after twenty minutes, you're reading scrollback in three panes trying to reconstruct what happened. It scales to about two agents before the cognitive overhead eats the productivity gain.
What I needed was a dashboard I could check — delegate work, switch to my own editor, glance back when something needed my attention. I also needed a server process that could stay alive, manage long-running agent sessions, and eventually respond to webhooks. And I wanted to talk to my team — voice input is natural for giving instructions and nearly impossible in a terminal.
I started by giving multiple Claude Code instances their own branches and letting them work. The agents were fine. They could read code, write implementations, run tests. The problem was everything around them.
Two agents touching the same file. An agent that decided to `git rebase` on its own and corrupted the branch history. Worktrees left behind after a crash. Merge conflicts that appeared only after both agents had finished. A merge that moved `main` forward while I had uncommitted changes, leaving my working tree in a confusing state.
I spent more time in `git reflog` cleaning up messes than I saved by running agents in parallel. The productivity gain was negative.
The solution was isolating agents completely. Each agent works in its own git worktree with a dedicated branch, forked from `main` at the moment the task is created. Agents are forbidden from rebasing, merging, or checking out branches — enforced at the OS permission level, not just in the prompt. Merges happen through a separate process: create a temporary branch, rebase onto current `main`, run the test pipeline, fast-forward `main` in one atomic operation. If anything fails, nothing is modified. The repository is never corrupted.
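To make the merge step concrete, here's a minimal sketch of that flow in Python on top of plain git commands. It's an illustration of the approach described above, not Delegate's actual implementation; `merge_task`, `test_cmd`, and the temporary-worktree layout are my own choices.

```python
# Illustrative sketch of the atomic merge flow described above, not Delegate's
# real code. Shape: rebase the agent's branch onto current main in a throwaway
# worktree, run the test gate there, then fast-forward the main ref. If any
# step fails, main is never touched.
import os
import subprocess
import tempfile
import uuid

def git(*args: str, cwd: str, check: bool = True) -> str:
    result = subprocess.run(
        ["git", *args], cwd=cwd, check=check, capture_output=True, text=True
    )
    return result.stdout.strip()

def merge_task(repo: str, task_branch: str, test_cmd: list[str]) -> None:
    tmp_branch = f"merge-queue/{uuid.uuid4().hex[:8]}"
    tmp_dir = os.path.join(tempfile.gettempdir(), tmp_branch.replace("/", "-"))
    base = git("rev-parse", "main", cwd=repo)  # main as of queue time
    try:
        # The candidate lives in its own worktree, so nobody's checkout is touched.
        git("worktree", "add", "-b", tmp_branch, tmp_dir, task_branch, cwd=repo)
        git("rebase", base, cwd=tmp_dir)
        # Quality gate: a failing pipeline means we never reach the fast-forward.
        subprocess.run(test_cmd, cwd=tmp_dir, check=True)
        new = git("rev-parse", "HEAD", cwd=tmp_dir)
        # Atomic fast-forward: update refs/heads/main only if it still equals
        # `base`, i.e. nothing else landed while the pipeline was running.
        git("update-ref", "refs/heads/main", new, base, cwd=repo)
    finally:
        git("worktree", "remove", "--force", tmp_dir, cwd=repo, check=False)
        git("branch", "-D", tmp_branch, cwd=repo, check=False)
```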
This infrastructure is invisible when it works. But without it, multi-agent coding is a liability.
Once git was solid, the next problem surfaced. An agent finishes coding. Now what? Who reviews it? How do I know it's ready? What if the review finds problems — does it go back to the same agent? What if I want human approval before anything merges?
Each project I worked on had slightly different needs. One needed strict review. Another was a prototype where I wanted auto-merge. A third needed tests to pass before review even started.
I was rebuilding a different ad-hoc pipeline for every project. What I actually needed was a way to define workflows — stages, transitions, assignment rules, quality gates — that could vary per project but share the same execution engine. And those workflows needed to be able to reach external systems: post to Slack when a task is ready for review, create a GitHub PR when code is approved, update Linear when a task ships.
Delegate is what came out of all this. It's a browser-based multi-agent tool with a configurable workflow engine running against your local git repository. You talk to a manager agent who decomposes your request into tasks, assigns them to engineering agents, coordinates reviews, and orchestrates merges. You review and approve before anything lands on `main`.
```bash
# install delegate
$ pip install delegate-ai

# start delegate, go to the browser and tell your team what to build
$ delegate start
```
The interaction model is async. Describe what you want built, switch to your own editor, come back when the tab says (2) Delegate — two tasks need your attention. The interface shows what every agent is doing in real time — files being read, tests being run, commits being made — without flooding you with messages. For the times you need a quick terminal command without switching windows, there's `/shell`.
Workflows are defined in Python. Each stage specifies who it's assigned to, what conditions must be met to enter it, and what happens on entry and exit. Transitions are enforced by the engine — no LLM can skip a stage or bypass a guard.
```python
# API is alpha — this is the direction, not the final syntax
class InReview(Stage):
    label = "In Review"

    def guard(self, ctx):
        ctx.require_clean_worktree()
        ctx.require_commits()

    def assign(self, ctx):
        # Never assign to the author
        return ctx.pick(
            role="engineer",
            exclude=ctx.task.dri
        )

    def enter(self, ctx):
        ctx.slack.post(
            f"Task {ctx.task.id} ready for review"
        )
```
Stages can be assigned to an AI agent, a human, or the system itself. Integrations with external systems — Slack, GitHub, Linear — are hooks on stage transitions. The workflow engine is the core of the product. Agents are workers plugged into it.
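To make that concrete, here is a hypothetical wiring of a full workflow in the same alpha-API style as the example above. Everything beyond the `InReview` stage (the `Workflow` class, the assignee markers, `ctx.github`, `ctx.merge_queue`) is my own sketch of the direction, not the shipped API, and as noted below, the external connectors don't exist yet.

```python
# Hypothetical sketch in the same alpha-API style as the example above.
# Workflow, the assignee markers, ctx.github, and ctx.merge_queue are guesses
# at the direction, not shipped API; the GitHub connector is not built yet.

class Implement(Stage):
    label = "In Progress"
    assignee = "agent"              # an engineering agent writes the code

class Approval(Stage):
    label = "Awaiting Approval"
    assignee = "human"              # a person signs off before anything merges

    def enter(self, ctx):
        ctx.github.open_pr(ctx.task.branch)   # integration hook on the transition

class Merge(Stage):
    label = "Merging"
    assignee = "system"             # the merge queue, not an LLM, lands the change

    def enter(self, ctx):
        ctx.merge_queue.enqueue(ctx.task)

# The engine owns the transitions: an agent can report "done", but it cannot
# skip Approval or bypass a guard on the way to Merge.
workflow = Workflow(stages=[Implement, InReview, Approval, Merge])
```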
Delegate is in early alpha. It works for single-player local development — one human managing a team of AI agents against a local git repository. The git orchestration, merge queue, and review pipeline are functional and handling real work.
The workflow API is alpha and will change. Integrations with external systems are designed for but not yet built — the hook points exist in the workflow engine, the connectors don't.
A few things I've learned the hard way:
Agent quality is a function of context, not just model. A cheaper model with a persistent session often outperforms an expensive model with a fresh context window each turn. The agent that remembers reading a file doesn't re-read it, doesn't re-derive the architecture, and doesn't make contradictory decisions. Per-turn cost is a misleading metric. Per-correct-outcome cost is what matters.
The merge queue is everything. Without automated rebase, pipeline gates, and atomic fast-forward merges, multi-agent development is multi-agent chaos. Every bug I've fixed in the merge pipeline has been worth ten prompt engineering improvements.
Reviews catch what prompts miss. Having a separate agent review code with fresh eyes catches bugs the author agent is blind to. This doesn't require AI coordination magic — it's the same reason human code review works.
Local-first is the foundation, not the ceiling. The workflow engine is designed to support hybrid teams — human engineers and AI agents working side by side, with configurable workflows that encode each team's process. The same stage can be assigned to a person or an agent depending on the task.
I'm particularly interested in what happens when the infrastructure is right and the models keep improving. Faster inference means the manager responds in seconds, not minutes. Better reasoning means agents produce code that passes review on the first try. The infrastructure I'm building today — worktree isolation, workflow engine, merge orchestration — becomes more valuable as agents get smarter, not less.
Delegate is MIT licensed and open source.
I'm looking for early users who want to push on this with me. What workflows would you define? What integrations matter most? What's broken about multi-agent coding that you wish someone would fix?