The hard problem isn't the agents —
it's everything between them

What I learned trying to run multiple AI agents on real projects, and what I built because of it.

February 2025

How we got here

Software engineering has gone through four transitions in rapid succession. Each one changed not just how we write code, but where.

Write code in an IDE (VS Code, IntelliJ)
AI autocompletes lines (Copilot, early Cursor)
Agent in the sidebar (Cursor, Windsurf)
Agent in the terminal (Claude Code, Codex)
Multi-agent teams (???)

Each of the first four stages had a natural home. Autocomplete lives in the editor. A sidebar agent lives in the IDE. A terminal agent lives in the terminal. But the fifth is different.

I wanted to run multiple side projects in parallel — have agents working on one while I focused on another. When you're running three agents simultaneously, you're not editing alongside them. You're managing them. And the right form factor for managing a team isn't an IDE or a terminal.

The obvious answer is tmux — split your terminal into panes, run an agent in each one. I tried this. It works until it doesn't. Three agents means three panes to watch, with no unified view of what's done, what's stuck, and what needs you. You're manually tracking which agent finished, which branch to review, which to merge first. There's no shared state between them, no way to coordinate handoffs, and when you come back after twenty minutes, you're reading scrollback in three panes trying to reconstruct what happened. It scales to about two agents before the cognitive overhead eats the productivity gain.

What I needed was a dashboard I could check — delegate work, switch to my own editor, glance back when something needed my attention. I also needed a server process that could stay alive, manage long-running agent sessions, and eventually respond to webhooks. And I wanted to talk to my team — voice input is natural for giving instructions and nearly impossible in a terminal.

The first wall: git

I started by giving multiple Claude Code instances their own branches and letting them work. The agents were fine. They could read code, write implementations, run tests. The problem was everything around them.

Two agents touching the same file. An agent that decided to git rebase on its own and corrupted the branch history. Worktrees left behind after a crash. Merge conflicts that appeared only after both agents had finished. A merge that moved main forward while I had uncommitted changes, leaving my working tree in a confusing state.

I spent more time in git reflog cleaning up messes than I saved by running agents in parallel. The productivity gain was negative.

The solution was isolating agents completely. Each agent works in its own git worktree with a dedicated branch, forked from main at the moment the task is created. Agents are forbidden from rebasing, merging, or checking out branches — enforced at the OS permission level, not just in the prompt. Merges happen through a separate process: create a temporary branch, rebase onto current main, run the test pipeline, fast-forward main in one atomic operation. If anything fails, nothing is modified. The repository is never corrupted.
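
The merge sequence can be sketched as a plan of plain git commands. This is a simplified illustration of the discipline described above, not Delegate's actual internals; the `merge-queue/` branch naming is invented for the example:

```python
def merge_plan(task_branch: str, main: str = "main") -> list[str]:
    """Return the git commands an atomic merge would run, in order.

    Illustrative only: the real pipeline runs the test suite between
    the rebase and the fast-forward, and aborts if anything fails,
    leaving main untouched.
    """
    tmp = f"merge-queue/{task_branch}"  # temporary branch, invented naming
    return [
        f"git branch {tmp} {task_branch}",   # fork a temporary branch
        f"git rebase {main} {tmp}",          # replay it onto current main
        # ... run the test pipeline here; stop if it fails ...
        f"git checkout {main}",
        f"git merge --ff-only {tmp}",        # fast-forward only: no merge commit,
                                             # refuses to run if history diverged
        f"git branch -D {tmp}",              # clean up the temporary branch
    ]
```

The `--ff-only` flag is what makes the final step atomic: main either moves forward to the already-tested commit or the command fails and nothing changes.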

This infrastructure is invisible when it works. But without it, multi-agent coding is a liability.

The second wall: process

Once git was solid, the next problem surfaced. An agent finishes coding. Now what? Who reviews it? How do I know it's ready? What if the review finds problems — does it go back to the same agent? What if I want human approval before anything merges?

Each project I worked on had slightly different needs. One needed strict review. Another was a prototype where I wanted auto-merge. A third needed tests to pass before review even started.

I was hand-rolling a different ad-hoc pipeline for every project. What I actually needed was a way to define workflows — stages, transitions, assignment rules, quality gates — that could vary per project but share the same execution engine. And those workflows needed to reach external systems: post to Slack when a task is ready for review, create a GitHub PR when code is approved, update Linear when a task ships.

Introducing Delegate

Delegate is what came out of all this. It's a browser-based multi-agent tool with a configurable workflow engine running against your local git repository. You talk to a manager agent who decomposes your request into tasks, assigns them to engineering agents, coordinates reviews, and orchestrates merges. You review and approve before anything lands on main.

# install delegate
$ pip install delegate-ai 

# start delegate, go to the browser and tell your team what to build
$ delegate start 

The interaction model is async. Describe what you want built, switch to your own editor, come back when the tab says (2) Delegate — two tasks need your attention. The interface shows what every agent is doing in real time — files being read, tests being run, commits being made — without flooding you with messages. For the times you need a quick terminal command without switching windows, there's /shell.

Workflows are defined in Python. Each stage specifies who it's assigned to, what conditions must be met to enter it, and what happens on entry and exit. Transitions are enforced by the engine — no LLM can skip a stage or bypass a guard.

Todo → Planning → In Progress → In Review → Approval → Merging → Done
workflow.py
# API is alpha — this is the direction, not the final syntax

class InReview(Stage):
    label = "In Review"

    def guard(self, ctx):
        ctx.require_clean_worktree()
        ctx.require_commits()

    def assign(self, ctx):
        # Never assign to the author
        return ctx.pick(
            role="engineer",
            exclude=ctx.task.dri
        )

    def enter(self, ctx):
        ctx.slack.post(
            f"Task {ctx.task.id} ready for review"
        )

Stages can be assigned to an AI agent, a human, or the system itself. Integrations with external systems — Slack, GitHub, Linear — are hooks on stage transitions. The workflow engine is the core of the product. Agents are workers plugged into it.
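
In the same alpha-API style as the block above — so `Stage`, `ctx`, and helpers like `ctx.human` and `ctx.github.open_pr` are directional sketches, not a final API — a human-gated stage with an integration hook might look like:

```python
# Sketch only; same caveat as above: alpha API, not final syntax.

class Approval(Stage):
    label = "Approval"

    def assign(self, ctx):
        # This stage is owned by a human, not an agent
        return ctx.human("me")

    def exit(self, ctx):
        # Integration hook on the transition out of the stage:
        # open a GitHub PR once the human approves
        ctx.github.open_pr(ctx.task.branch)
```

Because assignment is just another stage property, swapping the human for an agent (or the reverse) is a one-line change to the workflow, not a change to the engine.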

Where it stands

Delegate is in early alpha. It works for single-player local development — one human managing a team of AI agents against a local git repository. The git orchestration, merge queue, and review pipeline are functional and handling real work.

The workflow API is alpha and will change. Integrations with external systems are designed for but not yet built — the hook points exist in the workflow engine, the connectors don't.

A few things I've learned the hard way:

Agent quality is a function of context, not just model. A cheaper model with a persistent session often outperforms an expensive model with a fresh context window each turn. The agent that remembers reading a file doesn't re-read it, doesn't re-derive the architecture, and doesn't make contradictory decisions. Per-turn cost is a misleading metric. Per-correct-outcome cost is what matters.

The merge queue is everything. Without automated rebase, pipeline gates, and atomic fast-forward merges, multi-agent development is multi-agent chaos. Every bug I've fixed in the merge pipeline has been worth ten prompt engineering improvements.

Reviews catch what prompts miss. Having a separate agent review code with fresh eyes catches bugs the author agent is blind to. This doesn't require AI coordination magic — it's the same reason human code review works.

What's next

Local-first is the foundation, not the ceiling. The workflow engine is designed to support hybrid teams — human engineers and AI agents working side by side, with configurable workflows that encode each team's process. The same stage can be assigned to a person or an agent depending on the task.

I'm particularly interested in what happens when the infrastructure is right and the models keep improving. Faster inference means the manager responds in seconds, not minutes. Better reasoning means agents produce code that passes review on the first try. The infrastructure I'm building today — worktree isolation, workflow engine, merge orchestration — becomes more valuable as agents get smarter, not less.

Delegate is MIT licensed and open source.

Shape what Delegate becomes

I'm looking for early users who want to push on this with me. What workflows would you define? What integrations matter most? What's broken about multi-agent coding that you wish someone would fix?