What I learned trying to run multiple AI agents on real projects, and what I built because of it.
February 2025
Software engineering has gone through four transitions in rapid succession. Each one changed not just how we write code, but where.
The first three transitions had a natural home. Autocomplete lives in the editor. A sidebar agent lives in the IDE. A terminal agent lives in the terminal. But the fourth is different.
I wanted to run multiple side projects in parallel — have agents working on one while I focused on another. When you're running three agents simultaneously, you're not editing alongside them. You're managing them. What's the right form factor for managing a team of agents?
One obvious answer is tmux — split your terminal into panes, run an agent in each one. I tried this. It works until it doesn't. Three agents means three panes to watch, with no unified view of what's done, what's stuck, and what needs you. You're manually tracking which agent finished, which branch to review, which to merge first. There's no shared state between them, no way to coordinate handoffs, and when you come back after twenty minutes, you're reading scrollback in three panes trying to reconstruct what happened. It scales to about two agents before the cognitive overhead eats the productivity gain.
What I needed was a dashboard I could check — delegate work, switch to my own editor, glance back when something needed my attention. I also needed a server process that could stay alive, manage long-running agent sessions, and eventually respond to webhooks.
I started by giving multiple Claude Code instances their own branches and letting them work. The agents were fine. They could read code, write implementations, run tests. The problem was everything around them.
Two agents touching the same file. An agent that decided to git rebase on its own and corrupted the branch history. Worktrees left behind after a crash. Merge conflicts that appeared only after both agents had finished. A merge that moved main forward while I had uncommitted changes, leaving my working tree in a confusing state.

I spent more time in git reflog cleaning up messes than I saved by running agents in parallel. The productivity gain was negative.
The obvious answer is to isolate agents completely in their own worktrees and branches. That's easier said than done: agents are creative and resourceful, and may bypass prompt instructions. Some form of sandboxing is needed to stop them from doing things they shouldn't. And come merge time, every operation needs to be atomic and idempotent.
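To make the worktree-per-agent idea concrete, here is a minimal sketch of idempotent worktree provisioning. It assumes the git CLI is on PATH; the function name, branch naming scheme, and directory layout are illustrative, not Delegate's actual implementation. Worktrees go in a sibling directory rather than inside the repo, for reasons that come up later with dependency resolution.

```python
import subprocess
from pathlib import Path

def ensure_agent_worktree(repo: Path, agent: str, base: str = "main") -> Path:
    """Give an agent a dedicated branch + worktree; safe to call repeatedly."""
    # Keep worktrees *outside* the repo so tools that walk up the directory
    # tree don't accidentally resolve against the parent checkout.
    path = repo.parent / f"{repo.name}-agents" / agent
    branch = f"agent/{agent}"
    if path.exists():                       # already provisioned: no-op
        return path
    path.parent.mkdir(parents=True, exist_ok=True)
    # Reuse the branch if a previous run created it; otherwise branch off base.
    have_branch = subprocess.run(
        ["git", "-C", str(repo), "rev-parse", "--verify", "--quiet", branch],
        capture_output=True,
    ).returncode == 0
    cmd = ["git", "-C", str(repo), "worktree", "add"]
    cmd += [str(path), branch] if have_branch else ["-b", branch, str(path), base]
    subprocess.run(cmd, check=True, capture_output=True)
    return path
```

Calling it twice for the same agent returns the same path instead of failing, which is exactly the property you want when a crashed run is retried.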
After git, the next problem is environment isolation. Agents all share one environment: Agent A's pip install clobbers Agent B's dependency. Agent A runs tests, and Agent B's test run collides on the same port. Two agents both try to npm install into the same node_modules directory. Everything interferes with everything.
Creating an automated environment per agent, unsurprisingly, turned out to be hard. Git worktrees share a single .git directory, which means tools like pip install -e . that depend on the repo root get confused. Python venvs contain hardcoded absolute paths. Node resolves node_modules by walking up the directory tree, so a worktree nested inside the repo picks up the wrong one. Every language ecosystem has its own assumptions about where it lives, and many of those assumptions break when you have five copies of the same repo checked out simultaneously.
What I wanted was automatic setup of an isolated environment for the repository, one that agents themselves can edit and improve as needed.
I wasn't comfortable with multiple, potentially unsupervised agents having full access to my machine. What if they corrupted my git repo? Or mistakenly deleted my database? Or downloaded malware?
I wanted a reasonable level of sandboxing that prevented them from doing harm while still letting them do their job. I could run each of them in a separate Docker container, and maybe that's the right answer, but I wanted something simpler and lighter weight.
Once git and environment isolation were solid, the next problem surfaced. An agent finishes coding. Now what? Who reviews it? How do I know it's ready? What if the review finds problems — does it go back to the same agent? What if I want human approval before anything merges?
Each project I worked on had slightly different needs. One needed strict review. Another was a prototype where I wanted auto-merge. A third needed tests to pass before review even started.
I was rebuilding a different ad-hoc pipeline for every project. What I actually needed was a way to define workflows — stages, transitions, assignment rules, quality gates — that could vary per project but share the same execution engine. And those workflows needed to be able to reach external systems: post to Slack when a task is ready for review, create a GitHub PR when code is approved, update Linear when a task ships.
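One way such per-project workflows could be declared is as plain data: stages, transitions, and quality gates, interpreted by a shared engine. The sketch below is hypothetical; the class names, field names, and the "strict review" example are my own illustration, not Delegate's workflow API.

```python
from dataclasses import dataclass, field

@dataclass
class Stage:
    name: str
    assignee: str                                # "engineer-agent", "reviewer-agent", "human", ...
    gates: list = field(default_factory=list)    # checks that must pass before leaving the stage

@dataclass
class Workflow:
    stages: dict       # name -> Stage
    transitions: dict  # (stage, outcome) -> next stage

    def next_stage(self, stage: str, outcome: str) -> str:
        return self.transitions[(stage, outcome)]

# A strict-review workflow: tests gate coding, a reviewer agent gets fresh
# eyes, and a human approves before anything merges.
strict_review = Workflow(
    stages={s.name: s for s in [
        Stage("coding", "engineer-agent", gates=["tests-pass"]),
        Stage("review", "reviewer-agent"),
        Stage("approval", "human"),
        Stage("merge", "system"),
    ]},
    transitions={
        ("coding", "done"): "review",
        ("review", "changes-requested"): "coding",
        ("review", "approved"): "approval",
        ("approval", "approved"): "merge",
    },
)
```

The prototype-with-auto-merge variant would simply drop the "approval" stage and wire ("review", "approved") straight to "merge"; the execution engine stays the same.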
Delegate is what came out of all this. It's a browser-based multi-agent tool with a configurable workflow engine running against your local git repository. You talk to a manager agent who decomposes your request into tasks, assigns them to engineering agents, coordinates reviews, and orchestrates merges. You review and approve before anything lands on main.
```shell
# install delegate
$ pip install delegate-ai

# start delegate, go to the browser and tell your team what to build
$ delegate start
```
Each agent works in its own git worktree on a dedicated branch. Permissions are enforced via a sandbox: agents can't add or remove worktrees or branches, and can't rebase or reset. All such operations are handled by deterministic Python code.
Delegate auto-generates a bash setup script for each repo that creates a working environment from scratch: installing dependencies, creating virtualenvs, setting up build tools. It works out of the box for common project layouts, though it's not foolproof. Agents can iterate on the script themselves as they discover what's missing. The script is committed to the repo (.delegate/setup.sh), so you can edit it directly if needed.
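The generation step can be pictured as simple layout detection that emits a starting-point script. The detection rules and script contents below are assumptions for illustration, not Delegate's actual generator; only the .delegate/setup.sh path comes from the text above.

```python
from pathlib import Path

def generate_setup_script(repo: Path) -> str:
    """Detect common project layouts and emit a first-draft setup script."""
    lines = ["#!/usr/bin/env bash", "set -euo pipefail"]
    if (repo / "requirements.txt").exists():
        lines += [
            "python -m venv .venv",
            ". .venv/bin/activate",
            "pip install -r requirements.txt",
        ]
    if (repo / "pyproject.toml").exists():
        lines += ["pip install -e ."]
    if (repo / "package.json").exists():
        lines += ["npm ci"]
    script = "\n".join(lines) + "\n"
    out = repo / ".delegate" / "setup.sh"
    out.parent.mkdir(exist_ok=True)
    out.write_text(script)
    return script
```

Because the script is committed, an agent that hits a missing system dependency can append the fix, and every subsequent environment benefits.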
Agents run in a secured sandbox that limits write access to Delegate's own folders, the temp directory, and the project's .git directory (but importantly, not your working directory). In practice, that means they can write to the Delegate-managed worktrees and to your project's .git folder. They are also disallowed from running git operations that mutate branch topology (rebase, merge, reset, and the like), so even with access to the .git folder they cannot corrupt the project. Network access is limited to an allowlist of domains, defaulting to common package managers and git forges, and can be further customized via the CLI.
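The git side of that policy amounts to inspecting the subcommand before letting a command through. A minimal sketch, with an illustrative blocklist (the real sandbox enforces this at a lower level and the exact set of blocked subcommands is an assumption):

```python
# Subcommands that mutate branch topology or worktree layout -- reserved
# for Delegate's deterministic Python code, never run by agents directly.
BLOCKED_GIT_SUBCOMMANDS = {"rebase", "merge", "reset", "worktree", "branch", "filter-branch"}

def git_command_allowed(argv: list[str]) -> bool:
    """argv is a full command line, e.g. ["git", "-C", ".", "rebase", "main"]."""
    if not argv or argv[0] != "git":
        return True                     # not a git call: out of scope here
    # Skip global flags like `-C <path>` or `-c key=val` to find the subcommand.
    i = 1
    while i < len(argv):
        if argv[i] in ("-C", "-c"):
            i += 2
        elif argv[i].startswith("-"):
            i += 1
        else:
            return argv[i] not in BLOCKED_GIT_SUBCOMMANDS
    return True
```

Reads, commits, and pushes to the agent's own branch pass; anything that could rewrite shared history is refused.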
Describe what you want built, switch to your own editor, and come back when the tab says (2) Delegate — two tasks need your attention. The interface shows what every agent is doing in real time — files being read, tests being run, commits being made — without flooding you with messages. For the times you need a quick terminal command without switching windows, there's /shell.
Even though I love being in the terminal, I've come to believe that this async model requires observing many agents live, which demands deeper UI capabilities than a terminal can provide. I've bundled a PWA for a native-app experience, but hope to build a real native app (using Tauri?) at some point.
The workflow API is early and actively evolving. Currently it hardcodes the "standard" workflow: coding, peer review, approval, revise, merge. From here, I want to make the workflow API more flexible and powerful.
Delegate is in early alpha. It works for single-player local development — one human managing a team of AI agents against a local git repository. The git orchestration, merge queue, and review pipeline are functional and handling real work.
The workflow API is functional but still evolving. Integrations with external systems are designed for but not yet built — the hook points exist in the workflow engine; the connectors don't. Environment setup works for some common cases but needs to become a lot more robust.
A few things I've learned the hard way:
Agent quality is a function of context, not just model. A cheaper model with a persistent session often outperforms an expensive model with a fresh context window each turn. The agent that remembers reading a file doesn't re-read it, doesn't re-derive the architecture, and doesn't make contradictory decisions. Per-turn cost is a misleading metric. Per-correct-outcome cost is what matters.
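The per-turn vs. per-correct-outcome distinction is easy to make concrete. All figures below are hypothetical, chosen only to show how the cheaper-per-turn option and the cheaper-per-outcome option can differ:

```python
def cost_per_correct_outcome(cost_per_turn: float, turns: float, success_rate: float) -> float:
    """Expected spend to get one accepted result."""
    return cost_per_turn * turns / success_rate

# Expensive model, fresh context each turn: re-reads files, takes more
# turns, contradicts itself more often. (Hypothetical numbers.)
fresh = cost_per_correct_outcome(cost_per_turn=0.40, turns=12, success_rate=0.6)

# Cheaper model, persistent session: fewer turns, higher success rate.
persistent = cost_per_correct_outcome(cost_per_turn=0.10, turns=8, success_rate=0.8)

assert persistent < fresh
```

Under these assumptions the "cheap" model is 8x cheaper per correct outcome, even before counting the human time spent untangling the contradictory decisions a fresh-context agent makes.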
The merge queue is everything. Without automated rebase, pipeline gates, and atomic fast-forward merges, multi-agent development is multi-agent chaos. Every bug I've fixed in the merge pipeline has been worth ten prompt engineering improvements.
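The atomic fast-forward step can be sketched with two git primitives: merge-base --is-ancestor as the gate (the branch must already contain main, i.e. it has been rebased), and update-ref with an expected old value as the compare-and-swap. This is an illustration of the technique, not Delegate's actual merge code.

```python
import subprocess

def _git(repo: str, *args: str) -> subprocess.CompletedProcess:
    return subprocess.run(["git", "-C", repo, *args], capture_output=True, text=True)

def try_fast_forward(repo: str, branch: str, target: str = "main") -> bool:
    """Advance `target` to `branch` only if it's a pure fast-forward."""
    # Gate: target must be an ancestor of branch; otherwise rebase first.
    if _git(repo, "merge-base", "--is-ancestor", target, branch).returncode != 0:
        return False
    old = _git(repo, "rev-parse", target).stdout.strip()
    new = _git(repo, "rev-parse", branch).stdout.strip()
    # Atomic compare-and-swap: fails if target moved since we read `old`,
    # so two queued merges can never silently clobber each other.
    return _git(repo, "update-ref", f"refs/heads/{target}", new, old).returncode == 0
```

The merge queue's job is then just serialization: rebase the next branch onto main, run the pipeline gates, and attempt the fast-forward; if the compare-and-swap fails, somebody else merged first and the cycle repeats.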
Peer reviews catch what prompts miss. Having a separate peer agent review code with fresh eyes catches bugs the author agent is blind to. This doesn't require AI coordination magic — it's the same reason human code review works.
Local-first is the foundation, not the ceiling. The workflow engine is designed to support hybrid teams — human engineers and AI agents working side by side, with configurable workflows that encode each team's process. The same stage can be assigned to a person or an agent depending on the task.
I'm particularly interested in what happens when the infrastructure is right and the models keep improving. Faster inference means the manager responds in seconds, not minutes. Better reasoning means agents produce code that passes review on the first try. The infrastructure I'm building today — worktree isolation, workflow engine, merge orchestration — becomes more valuable as agents get smarter, not less.
Delegate is MIT licensed and open source.
I'm looking for early users who want to push on this with me. What workflows would you define? What integrations matter most? What's broken about multi-agent coding that you wish someone would fix?