In my first article, I described how a chaotic collection of AI tools evolved into a structured development team with 15 specialized modes. In the second part, I covered the sophisticated handover system, quality gates, and theoretical foundations.
But reality often differs from theory. In this article, I share the problems that occurred during intensive development and the learnings that emerged. Spoiler: 36.4% of all commits were refactoring commits — and for good reason. Building a multi-agent system is an iterative process where you learn more from solving problems than from the initial design.
Before diving into problems, some good news: Shortly after the first two articles, I discovered the RunVSAgent plugin for JetBrains IDEs. In my first article, I described how switching from JetBrains to VSCode felt like a necessary evil — I'd been a committed JetBrains user for years and the switch felt like a loss. With RunVSAgent, I could return to my beloved IDEs — PyCharm, PhpStorm, WebStorm — while still using Roo Code. The narrative arc closes: JetBrains → VSCode → JetBrains. Sometimes problems solve themselves in unexpected ways if you give them time.
The Biggest Architecture Change: From Parallel to Sequential
One of the most fundamental insights came on August 8th — a day I internally marked as "Critical Recognition Day." Sometimes there are days in projects where fundamental assumptions turn out to be wrong. This was one of those days.
In the original architecture, I had planned parallel handover patterns. The idea was elegant and convincing on paper: The Team Lead delegates multiple tasks simultaneously to different specialists working in parallel. Backend Developer implements the API while Frontend Developer builds the UI simultaneously. QA Engineer writes tests concurrently. In theory, this should drastically reduce throughput time — after all, real teams work in parallel too.
The problem was fundamental and unavoidable: Roo Code doesn't support parallel processing. The platform works sequentially — one mode after another, one conversation after another. My beautiful parallel workflows were useless. I was frustrated because I'd invested significant time designing parallel patterns. Boomerang tasks, coordinated checkpoints, merge strategies for parallel results — all for nothing.
The rebuild was extensive and touched the entire system. I had to remove parallel handover patterns from README and global instructions, switch QA automation strategy from parallel to sequential execution, replace parallel_tasks tracking with sequential_tasks, and remove all parallel handover templates along with their validation tests. The standardization settled on the coordinated_sequential pattern, where the Team Lead regains control after each completed specialist task.
What initially seemed like a setback turned into an improvement: Sequential workflows are more predictable. Feedback loops are clearer because only one mode is active at a time. Debugging improved dramatically — when something goes wrong, I know exactly which mode is responsible. No race conditions, no merge conflicts between parallel agents. The forced sequentialization paradoxically made the system more robust and easier to understand.
Fighting Over-Engineering: When AI Does Too Much
A pattern that persisted throughout development was the system's tendency toward over-engineering. This problem is tricky because it stems from a positive trait: AI agents tend to do more than asked — and that's not always a good thing.
Concretely, this is what it looked like: The system would take a simple bug-fix request and also "improve" surrounding code, add tests nobody asked for, and expand documentation. A simple "Fix the null pointer on line 42" became a comprehensive refactoring session. Modes added features that were "nice to have" but nobody had requested. "You could add error handling here" — sure I could, but I just wanted to fix the bug. A task like "Fix the login bug" turned into "Fix the login bug and refactor the entire auth layer and add logging and document everything."
The problem isn't that the improvements are bad. They're often sensible. But they cost time, create unexpected changes that need to be reviewed, and distract from the actual goal. When I want to fix a bug, I want to fix a bug — not revolutionize the system.
The solution consisted of three interconnected policies that reinforce each other.
Minimal Code Changes Policy
An explicit "Surgical Precision" guideline allows only the changes necessary for the actual task, demands strict prevention of scope creep, preserves the existing working system, enforces test continuity, and requires a justification for every change. The justification requirement was particularly effective — when an agent has to explain why it's making a change, it thinks twice.
Feature Addition Control
Before implementing a new feature, a mode must explicitly ask via ask_followup_question: "Should I also implement X?" This sounds like overhead that interrupts the workflow. But it prevents hours of undoing unwanted features. In practice, most "spontaneous" feature ideas from agents aren't needed — and the few that are sensible are actually implemented better through the explicit question because the context is clear.
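Conceptually, the control is a simple gate. In this sketch, the ask_user callback and the scope set are my illustrative stand-ins (Roo Code's ask_followup_question tool works differently in detail): any feature not listed in the task's scope must be explicitly approved before implementation.

```python
def feature_gate(feature: str, scope: set[str], ask_user) -> bool:
    """Allow a feature only if it is in scope or the user explicitly
    approves it via a follow-up question."""
    if feature in scope:
        return True
    answer = ask_user(f"Should I also implement {feature!r}?")
    return answer.strip().lower().startswith("y")

scope = {"fix login bug"}
proposed = ["fix login bug", "refactor auth layer", "add logging"]

# With a user who declines everything, only the in-scope task survives.
approved = [f for f in proposed if feature_gate(f, scope, ask_user=lambda q: "no")]
```

The asymmetry is the point: saying "yes" to a spontaneous feature costs one answer; undoing an unrequested one costs hours.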
Scope Expansion Control
A 5-step scope expansion process with mandatory user approval. Scope expansions for features, optimizations, and enhancements must be explicitly approved. Quality gates were extended with scope validation requirements, and scope boundaries are now documented in the handover context. This makes transparent what belongs to the task and what doesn't.
The overarching learning here is remarkable: AI agents have a built-in perfectionism bias. They don't just want to solve the problem — they want to deliver the best possible result. That sounds good but leads to "gold-plating" — polishing everything far beyond what the task requires. The policies had to explicitly limit this bias. It's interesting and somewhat ironic that you have to teach AI to do less. The natural tendency is maximization, not optimization.
Mode Drift and Orchestrator Problems: When Modes Get Confused
Sometimes the wrong mode was activated or stayed active when the task required a different specialist. Backend Developer tried solving UI problems — producing questionable CSS. QA Engineer started writing code instead of testing. Documentation Writer attempted to fix bugs. This phenomenon called "Mode Drift" occurs when task boundaries aren't clear or when a mode "drifts" into an area that isn't its domain.
The problem is subtle: You often notice drift only when results look strange. Why did the Backend Developer suddenly write React components? Because the task was called "API endpoint with frontend integration" and he took that to mean the integration was his responsibility too. These ambiguities had to be closed.
My first implementation was typically over-engineered — an 8-step process with drift detection algorithm and context similarity scoring, a decision matrix for mode correction vs. delegation, circuit breaker pattern with attempt limits and cooldown, and graceful fallback strategies for edge cases. On the whiteboard, it looked impressive. In practice, it was far too complicated, hard to debug, and created new problems of its own.
The simplified solution reduced the 8 steps to a 3-case decision matrix that anyone can understand:
- Wrong mode active but right category (e.g., wrong Developer instead of right Developer) — correct internally. The code mode recognizes it's actually doing backend tasks and calls the Backend Developer.
- Wrong mode active and wrong category (e.g., QA instead of Developer) — delegate to Team Lead. That's a bigger context switch requiring coordination.
- If uncertain — ask the user. Better to ask once too often than take a wrong turn.
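The whole matrix fits in one small function. Mode and category names below are illustrative, and for the sketch I map "uncertain" to "mode not found in the known categories":

```python
def resolve_drift(active_mode: str, required_mode: str,
                  category_of: dict[str, str]) -> str:
    """3-case decision matrix for mode drift:
    same category -> correct internally; different category ->
    delegate to the Team Lead; unknown mode -> ask the user."""
    if active_mode not in category_of or required_mode not in category_of:
        return "ask_user"                   # case 3: uncertain
    if category_of[active_mode] == category_of[required_mode]:
        return "switch_internally"          # case 1: right category, wrong mode
    return "delegate_to_team_lead"          # case 2: wrong category

categories = {
    "code": "developer",
    "backend-developer": "developer",
    "qa-engineer": "qa",
}
resolve_drift("code", "backend-developer", categories)        # switch_internally
resolve_drift("qa-engineer", "backend-developer", categories) # delegate_to_team_lead
resolve_drift("docs-writer", "backend-developer", categories) # ask_user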
The learning was clear and transferable to many engineering decisions: Simplicity beats complexity. The 8-step process was "academically interesting" but practically unmaintainable. The 3-case matrix is understandable, debuggable, and works better. It solves 95% of cases, and the remaining 5% are caught by the "ask the user" option.
The Orchestrator Return Problem
A related problem I couldn't fully solve concerns returning to the original orchestrator. The problem occurred mid-task: When Roo Code encountered a problem that didn't match the current mode, it started a new orchestrator instead of returning the problem to the original one.
Concretely: Backend Developer hits a security problem during work. Instead of returning to Team Lead (who could call Security Engineer), the system starts a completely new Team Lead. All previous context is lost — the new Team Lead doesn't know the original task, completed steps, or decisions made.
In practice, it looked like this: Team Lead A starts and delegates to Backend Developer. Backend Developer works, encounters a problem outside his expertise. Instead of going back to Team Lead A, a new Team Lead B is started. Context and knowledge from Team Lead A are lost. Team Lead B has to laboriously reconstruct the context — if it even realizes it's not the first. This led to inconsistencies, duplicates, and lost work.
The implemented solution was "Team Lead Exclusivity" — only one Team Lead active at a time, loop protection with a maximum of 3 delegation levels. It helped limit the impact but was only a workaround. A clean architectural solution would have required deep changes to the mode-switching logic — more effort than the expected benefit justified.
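The workaround is easy to sketch as a small guard enforcing Team Lead exclusivity plus the delegation cap. The limit of 3 matches what I implemented; everything else is illustrative naming:

```python
class DelegationGuard:
    """Workaround guard: Team Lead exclusivity plus a delegation-depth cap."""
    MAX_DEPTH = 3  # maximum of 3 delegation levels

    def __init__(self) -> None:
        self.team_lead_active = False
        self.depth = 0

    def start_team_lead(self) -> None:
        if self.team_lead_active:
            # Never spawn Team Lead B while A is running: return to A instead.
            raise RuntimeError("A Team Lead is already active")
        self.team_lead_active = True

    def delegate(self, mode: str) -> str:
        if self.depth >= self.MAX_DEPTH:
            raise RuntimeError("Delegation limit reached: escalate to the user")
        self.depth += 1
        return mode

guard = DelegationGuard()
guard.start_team_lead()
guard.delegate("backend-developer")  # depth 1, fine
```

The guard doesn't fix the context loss itself — it only prevents the silent proliferation of Team Leads that made the loss invisible.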
Context Overflow and Template Evolution: The Platform's Limits
15 modes sound like great flexibility and specialization. In practice, that means 15 whenToUse sections that Roo Code automatically writes into the system context. Each section describes when the respective mode should be used, its capabilities, and restrictions. Even the 8 standard modes of the "Standard Team" are already heavyweight in context.
The result is a classic trade-off problem: Available context for the actual task shrinks dramatically. LLMs have a maximum context length, and when a large portion is used for mode descriptions, less remains for code, task descriptions, and conversation history. With complex projects and long code files, context becomes the bottleneck. The agent can no longer "see" all relevant files simultaneously.
I tried various approaches with mixed success. The whenToUse sections were shortened and optimized — every unnecessary word removed. Unused modes were temporarily disabled — but that required manual management. Context compression prompts were adjusted to prioritize important information. But Roo Code's fundamental architecture requires all mode descriptions to be available in system context. There's a feature request for selectively enabling/disabling modes, but until that's implemented, the problem persists.
This limitation causes the framework to hit its limits with larger projects. For smaller, focused tasks, it works well — but once multiple long files are simultaneously relevant, context becomes the bottleneck. The framework itself isn't bad, but the combination of many modes and complex projects requires careful context management.
Template Standardization
Parallel to these challenges, the handover system evolved positively. Originally, some modes had custom handover templates. The Documentation Writer had a documentation-focused template with fields for target audience and documentation type. The Git Specialist had a commit-focused template. The QA Engineer had a test-focused template with fields for test coverage and test strategies. The Team Lead had a coordination template with project overview.
The problem became visible during mode transfers: Information was lost because templates were incompatible. The Documentation Writer handed over to the Git Specialist, but the documentation template had no fields for commit information. Information had to be "translated," and context was lost in the process.
The solution was radical but effective: All custom templates were removed. Every mode now uses the same standardized template with sections for Task, Mode, Context, Files, Expected Outcomes, Success Criteria, and Constraints. No special cases, no exceptions. The template is lean enough not to consume too much context but structured enough to transfer all necessary information. Standardization also enabled automatic validation — a handover missing required fields is rejected.
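Mechanical validation of such a template is straightforward. This sketch assumes the handover arrives as a simple dict keyed by the seven sections named above; the actual representation in Roo Code is different:

```python
REQUIRED_FIELDS = {
    "task", "mode", "context", "files",
    "expected_outcomes", "success_criteria", "constraints",
}

def validate_handover(handover: dict) -> list[str]:
    """Return the missing required fields; an empty list means valid.
    A handover with missing fields is rejected before the mode switch."""
    return sorted(REQUIRED_FIELDS - set(handover))

handover = {
    "task": "Fix null pointer in login",
    "mode": "backend-developer",
    "context": "NPE on line 42 of auth.py",
    "files": ["auth.py"],
    "expected_outcomes": "Login no longer crashes",
    "success_criteria": "All auth tests pass",
    "constraints": "No refactoring outside auth.py",
}
validate_handover(handover)   # [] -> accepted
```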
Task Chain Tracking
Another learning was the necessity of task chain tracking. Without visibility into workflow progression, you lose track during complex tasks. Who did what? Where are we in the process? The solution was systematic tracking in the handover template — a structured table documenting the path from user request through all mode handovers. This enables debugging and traceability even after long chains of handovers.
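A task chain is essentially an append-only table of handover hops. A minimal sketch, with hypothetical mode names:

```python
from dataclasses import dataclass

@dataclass
class Hop:
    step: int
    from_mode: str
    to_mode: str
    task: str

chain: list[Hop] = []

def record_handover(from_mode: str, to_mode: str, task: str) -> None:
    """Append one row to the task chain table carried in the handover context."""
    chain.append(Hop(len(chain) + 1, from_mode, to_mode, task))

record_handover("user", "team-lead", "Fix the login bug")
record_handover("team-lead", "backend-developer", "Patch auth.py")
record_handover("backend-developer", "team-lead", "Report the result")
# chain now documents the full path and travels with every handover
```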
100% Template Compliance
The final escalation was targeting 100% template compliance: Template validation before every operation, automatic rejection on template violations, circuit breaker for repeated non-compliance, and compliance monitoring with measurable metrics. It sounds bureaucratic, and yes, it adds overhead. But without this rigor, the system quickly drifts into chaos. The metrics showed that after introducing strict compliance, the handover success rate increased significantly.
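The circuit breaker part can be sketched as a failure counter with a threshold (the threshold of three and all names here are my assumptions, not Roo Code internals):

```python
class ComplianceBreaker:
    """Trip after repeated template violations; once open, stop retrying
    and escalate to the user instead of looping."""
    def __init__(self, threshold: int = 3) -> None:
        self.threshold = threshold
        self.failures = 0
        self.open = False

    def record(self, compliant: bool) -> str:
        if compliant:
            self.failures = 0   # any compliant handover resets the breaker
            self.open = False
            return "proceed"
        self.failures += 1
        if self.failures >= self.threshold:
            self.open = True
            return "escalate_to_user"
        return "retry_with_template_reminder"

breaker = ComplianceBreaker()
breaker.record(False)  # retry_with_template_reminder
breaker.record(False)  # retry_with_template_reminder
breaker.record(False)  # escalate_to_user, breaker is now open
```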
Small Learnings: Anecdotes from Development
Not all problems were architectural. Sometimes the small things cost surprising amounts of time.
The Naming Conflict
One day I noticed strange system behaviors. Tasks were being misassigned, and logs showed peculiar patterns. After some analysis, I found the reason — almost embarrassingly simple: Frontend Developer and Tech Educator were both named "Sophia." The system couldn't reliably distinguish which "Sophia" was meant, especially in contexts where the name was used without a role designation. The solution was equally simple: Frontend Developer became "Maya." The learning is transferable: Unique identities for all modes are critical. Names aren't just labels — they're part of the system architecture.
The Regex Problem
The QA Engineer initially had an overly restrictive fileRegex pattern that defines which files a mode can access. Test infrastructure files were blocked because they didn't match the pattern. The QA Engineer wanted to check jest.config.js — but wasn't allowed. Several iterations followed — from extension-based (too restrictive, blocked many things) through comprehensive patterns with excludes (worked better but complex) to the final version with test_ and spec_ prefixes and a more generous base rule. The learning: Regex patterns for AI agents must be more generous than for humans. Humans can say "I need this file too." Agents strictly follow defined boundaries — or fail.
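The pattern evolution is easy to illustrate in Python, since fileRegex uses ordinary regular expression syntax. The concrete patterns below are my reconstruction of the idea, not quotes from the actual config:

```python
import re

# Iteration 1: extension-based, too restrictive. It blocks
# jest.config.js and other test infrastructure files.
v1 = re.compile(r".*\.(test|spec)\.(js|ts)$")

# Final iteration: test_/spec_ prefixes plus a more generous base rule
# that also admits test configuration files.
v3 = re.compile(
    r"(^|/)(test_|spec_).*"                # test_/spec_ prefixes in the path
    r"|.*\.(test|spec)\.(js|ts)$"          # classic *.test.ts / *.spec.js names
    r"|.*(jest|vitest)\.config\.(js|ts)$"  # test infrastructure configs
)

files = ["auth.test.ts", "jest.config.js", "test_utils.py", "auth.py"]
[f for f in files if v1.search(f)]  # only auth.test.ts
[f for f in files if v3.search(f)]  # everything except auth.py
```

Note that the agent-facing pattern still excludes production code like auth.py: "more generous" means covering the mode's whole domain, not removing the boundary.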
The ToDo Framework
Without structured issue tracking, I eventually lost track of all the problems, open items, and improvement ideas. The list grew but without prioritization. The solution was a priority-based ToDo framework with four levels: P1 for System Integrity (critical system functionality — if this breaks, nothing works), P2 for Task Continuity (task continuity through escalation — if this breaks, tasks get stuck), P3 for Mode Boundaries and Quality Gates (expert domains and standards — if this is missing, quality drops), and P4 for Efficiency (performance and resource utilization — nice to have). This framework helped me keep focus on what matters and not drown in the flood of improvement ideas.
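The four levels map naturally onto an ordered enum, so the backlog sorts itself mechanically. The example todos are invented:

```python
from enum import IntEnum

class Priority(IntEnum):
    P1_SYSTEM_INTEGRITY = 1   # if this breaks, nothing works
    P2_TASK_CONTINUITY = 2    # if this breaks, tasks get stuck
    P3_MODE_BOUNDARIES = 3    # if this is missing, quality drops
    P4_EFFICIENCY = 4         # nice to have

todos = [
    ("tune context compression", Priority.P4_EFFICIENCY),
    ("fix escalation loop", Priority.P2_TASK_CONTINUITY),
    ("repair handover validation", Priority.P1_SYSTEM_INTEGRITY),
]
todos.sort(key=lambda item: item[1])  # P1 first, P4 last
```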
The Language Setting
A surprising problem was the language setting. The system sometimes answered in English while I was working in German. In the middle of a German conversation, an English response would suddenly appear. The reason was hardcoded German in the prompts — "Antworte auf Deutsch" ("Answer in German") — but Roo Code had its own language setting that was independent. The systems weren't talking to each other. The solution came with Roo Code v3.9.1+: Language Forwarding Integration. The system now respects the Roo Code language setting and passes it to the modes. A small detail, but it significantly improved the user experience.
Conclusion: What I Learned
After several weeks of intensive development and iteration, I draw a mixed but overall positive assessment. The many refactoring commits — a good third of all changes — weren't signs of poor planning but of iterative learning. Every refactoring was a lesson.
What worked: Specialized modes with clear responsibilities reduce mental load and improve quality. Structured handovers with unified templates ensure no information is lost. Quality gates as checkpoint systems prevent errors from being passed on. Explicit policies against over-engineering force AI restraint and make behavior predictable.
What remains challenging: Context overflow from too many modes is a platform limitation I can't solve — it requires changes to Roo Code itself. The orchestrator return problem remains unfinished — sometimes you have to accept "good enough." Roo Code's sequential limitation forced a complete architecture rebuild — which actually turned out to be an improvement.
What's still open: The tendency toward over-engineering is contained by policies but not eliminated — the system keeps testing boundaries. Several implemented features like Mode Drift Correction and Template Compliance haven't been systematically tested under production load yet. Context compression sometimes loses important information like the original task or tool specifications. And the question of whether a code mode should be allowed to modify test files remains unresolved. These open points aren't bugs — they're areas for further iteration.
My overarching learning: AI agent systems aren't "fire and forget" solutions. They require continuous iteration, clear boundaries, and the courage to simplify. The many refactoring commits aren't waste — they're the path to a system that actually works. Or as a colleague aptly put it: "The first system you build to learn. The second you build to use."

