We gave the same sprint scope to two teams: one running AI Code Factory, one running traditional. Same deadline, same codebase. We measured everything. Here are the results.
Author: Tom Bergström
Published: 21 May 2026
Reading time: 8 min read
Topics: nordic-tech, architecture, scaling
We had a rare opportunity: two teams, same client codebase, same two-week sprint scope, same deadline. One team ran our AI Code Factory methodology: structured agents, SKILL.md files, guardrails, hooks. The other team ran the way most engineering teams run: good developers, GitHub, and a standup. We measured everything we could measure. The AI Code Factory team of four roughly matched the output of the 11-person team in the first sprint (17 stories against 18) and exceeded it in the next two. Here's the data.
The context: a client with a 14-person product engineering team needed to scale. Rather than simply adding headcount, they agreed to run a structured comparison. We defined a scope of three features and a data migration and ran it on two tracks. Track A: a four-person AI Code Factory team using our full methodology. Track B: 11 engineers from their existing team running normally, staffed at approximately the same seniority level.
The same product spec was given to both teams on day one. Neither team knew the other's sprint velocity — we kept the build logs separate. The sprint was two weeks. We measured: stories completed, test coverage on new code, review cycles per PR, post-merge defects in the first week after sprint, and total developer hours logged. Both teams had access to the same codebase and the same development environment. The only difference was methodology.
This wasn't a laboratory experiment with perfectly controlled variables. It was a real sprint on a real codebase, which is exactly what makes the data meaningful. Real conditions include distractions, unclear requirements, and integration surprises. The AI Code Factory needs to work in those conditions — not ideal ones.
We ran three consecutive two-week sprints with the same structure. The results were consistent enough that by sprint two, the pattern was clear. The AI Code Factory team of four produced equivalent story point output to the traditional team of 11 in sprint one. In sprints two and three, the AI team exceeded the traditional team's output while the traditional team's velocity stayed flat. The AI team's velocity increased as the SKILL.md library expanded with each sprint's learnings.
| Metric | Traditional Team (11 engineers) | AI Code Factory (4 engineers) |
|---|---|---|
| Stories completed (Sprint 1) | 18 | 17 |
| Stories completed (Sprint 2) | 19 | 22 |
| Stories completed (Sprint 3) | 18 | 24 |
| Avg. review cycles per PR | 3.2 | 1.4 |
| Post-merge defects (7 days) | 11 | 3 |
| Test coverage (new code) | 61% | 92% |
| Avg. lint errors on merge | 4.3 | 0 (blocked by hooks) |
| Documentation generated | Manual, inconsistent | Automated on every PR |
The review cycle difference was the most immediately visible. The traditional team's PRs averaged 3.2 back-and-forth cycles before approval. The AI Code Factory team averaged 1.4, because the PR review agent caught issues before human review began. Reviewers on the AI team were reviewing code that had already been automatically checked for standards compliance, type safety, and coverage thresholds. Their cognitive load was lower, and their feedback focused on logic and architecture rather than style and coverage gaps.
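To make the idea of a pre-review gate concrete, here is a minimal sketch of the kind of check that could run before a human sees a PR. It is not the actual PR review agent: the `PullRequestChecks` type, the `pre_review_findings` function, and the 90% coverage threshold are all hypothetical, since the article does not describe the agent's internals or the threshold the team used.

```python
from dataclasses import dataclass


@dataclass
class PullRequestChecks:
    """Results of automated checks on one PR (hypothetical shape)."""
    lint_errors: int
    type_errors: int
    new_code_coverage: float  # fraction between 0 and 1


# Hypothetical threshold; the article does not state the value the team used.
MIN_NEW_CODE_COVERAGE = 0.90


def pre_review_findings(pr: PullRequestChecks) -> list[str]:
    """Return blocking findings; an empty list means the PR can go to a human."""
    findings = []
    if pr.lint_errors > 0:
        findings.append(f"{pr.lint_errors} lint error(s)")
    if pr.type_errors > 0:
        findings.append(f"{pr.type_errors} type error(s)")
    if pr.new_code_coverage < MIN_NEW_CODE_COVERAGE:
        findings.append(
            f"new-code coverage {pr.new_code_coverage:.0%} "
            f"is below {MIN_NEW_CODE_COVERAGE:.0%}"
        )
    return findings


# A PR resembling the traditional team's averages is held back...
print(pre_review_findings(PullRequestChecks(4, 0, 0.61)))
# ...while one resembling the AI team's averages goes straight to review.
print(pre_review_findings(PullRequestChecks(0, 0, 0.92)))
```

The design point is that style, typing, and coverage findings are resolved before the first human review cycle, which is one plausible reason the cycle count drops.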
Sprint one was roughly even. Sprint three saw the AI team 33% ahead. The compounding mechanism is the SKILL.md library. After each sprint, we updated the domain skill files to reflect what we learned: new coding patterns that worked well in this codebase, common pitfalls in the client's data layer, integration patterns for their specific API structure. Those learnings are codified and reloaded into agents at the start of the next sprint. The agents get smarter about this specific codebase with each iteration.
The traditional team's velocity didn't increase because their learnings were in people's heads — not systematically captured and reloadable. When a team member was out for a day, that knowledge went with them. When a PR review comment addressed a pattern issue, it was addressed in that PR but not necessarily applied to the next developer who made the same pattern choice. Knowledge in the AI Code Factory accumulates in the library. Knowledge in a traditional team accumulates in individuals.
ALWAYS join on global_cust_id, not email or company_name. Email changes on account updates. Company names diverge across systems. The canonical join field is:
customers.global_cust_id
Attribution fields (acquisition_channel, referral_source) are nullable. Never assume a value exists. Always coalesce with 'direct' as fallback:
COALESCE(c.acquisition_channel, 'direct') AS channel
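The two rules above can be tried out in a small, self-contained sketch. It uses Python's built-in sqlite3 module as a stand-in for the client's real databases, and the `orders` table, its columns, and the sample rows are hypothetical; only `customers.global_cust_id`, `acquisition_channel`, and the `COALESCE` expression come from the skill file text.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (
        global_cust_id TEXT PRIMARY KEY,
        email TEXT,
        acquisition_channel TEXT   -- nullable, per the skill file
    );
    -- 'orders' is a hypothetical second table, used only to show the join.
    CREATE TABLE orders (
        order_id INTEGER PRIMARY KEY,
        global_cust_id TEXT,
        email_at_purchase TEXT
    );
    INSERT INTO customers VALUES
        ('C-1', 'ana@new-domain.example', 'paid_search'),
        ('C-2', 'ben@example.com', NULL);
    INSERT INTO orders VALUES
        (1, 'C-1', 'ana@old-domain.example'),
        (2, 'C-2', 'ben@example.com');
""")

# Rule 1 and rule 2 together: join on the canonical ID, coalesce the channel.
rows = conn.execute("""
    SELECT o.order_id,
           COALESCE(c.acquisition_channel, 'direct') AS channel
    FROM orders AS o
    JOIN customers AS c ON c.global_cust_id = o.global_cust_id
    ORDER BY o.order_id
""").fetchall()
print(rows)  # [(1, 'paid_search'), (2, 'direct')]

# The pitfall: joining on email silently drops order 1, whose customer
# changed email address after the purchase.
email_join = conn.execute("""
    SELECT o.order_id
    FROM orders AS o
    JOIN customers AS c ON c.email = o.email_at_purchase
    ORDER BY o.order_id
""").fetchall()
print(email_join)  # [(2,)]
```

The email join returns no error, just fewer rows, which is why a rule like this is worth writing down rather than leaving to review comments.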
"The first sprint is when the AI team is learning the codebase. The second sprint is where the compounding starts. By sprint four, you typically see a performance gap that doesn't close — because the skill library keeps accumulating context the traditional team doesn't have." — Tom Bergström, COO, Indpro AB
Our stated 2.8× delivery speed improvement is a floor measured across multiple engagements, not a ceiling and not this specific experiment's peak. In this parallel sprint, the AI team of four came within one story of 11 engineers in sprint one (17 against 18, roughly 2.6× per-engineer output; an exact match would be 11/4 = 2.75×) and exceeded them in sprints two and three. The 2.8× figure is conservative because it's measured across all engagements including early-stage ones where the skill library is thin and the team is still calibrating.
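For readers who want to check the per-engineer arithmetic, the sketch below recomputes it from the stories-completed rows of the results table. It assumes that a completed story is a fair unit of output, which the table itself does not guarantee; the function and variable names are illustrative only.

```python
# Story counts and team sizes copied from the results table.
TEAM_SIZE = {"traditional": 11, "ai_code_factory": 4}
STORIES = {
    1: {"traditional": 18, "ai_code_factory": 17},
    2: {"traditional": 19, "ai_code_factory": 22},
    3: {"traditional": 18, "ai_code_factory": 24},
}


def per_engineer_ratio(sprint: int) -> float:
    """AI team stories per engineer divided by traditional stories per engineer."""
    s = STORIES[sprint]
    ai = s["ai_code_factory"] / TEAM_SIZE["ai_code_factory"]
    trad = s["traditional"] / TEAM_SIZE["traditional"]
    return ai / trad


def team_gap_pct(sprint: int) -> float:
    """How far the AI team's total is ahead (+) or behind (-), in percent."""
    s = STORIES[sprint]
    return (s["ai_code_factory"] / s["traditional"] - 1) * 100


for sprint in sorted(STORIES):
    print(
        f"Sprint {sprint}: {per_engineer_ratio(sprint):.2f}x per engineer, "
        f"{team_gap_pct(sprint):+.1f}% team total"
    )
# Sprint 1: 2.60x per engineer, -5.6% team total
# Sprint 2: 3.18x per engineer, +15.8% team total
# Sprint 3: 3.67x per engineer, +33.3% team total
```

On these counts the per-engineer ratio starts slightly below the 2.75× that an exact match would imply and rises to about 3.7× by sprint three, which matches the 33% team-level lead quoted for that sprint.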
The caveat worth stating: the traditional team in this experiment was running a typical client codebase without structured guardrails. If your traditional team already has strong automated testing, enforced code quality standards, and a well-documented codebase, the gap will be smaller. The AI Code Factory's advantage is largest relative to a baseline that is common in the industry — ad hoc Copilot usage, manual review, inconsistent coverage.
Want to see these numbers applied to your specific team and codebase? We run a free 2-week diagnostic sprint before any engagement.
Story points and coverage metrics are measurable. Some meaningful differences between the teams were harder to quantify. The traditional team had deeper institutional knowledge of the client's business domain: they'd been working in the codebase for two years. In the first sprint, this showed in lower-level design choices: the AI team occasionally proposed technically correct solutions that were at odds with product conventions the traditional team knew intuitively. By sprint two, those patterns were in the SKILL.md files. By sprint three, they weren't a factor.
The other dimension not captured: the AI Code Factory team required a more structured process around knowledge handoff. At the end of each sprint, updating the SKILL.md files was a real task — typically 90 minutes for the lead engineer. That's overhead the traditional team doesn't carry. It's worth it, but it's not zero cost.
The takeaway from this experiment: The AI Code Factory doesn't replace developer judgment. It systematises what good developers know and makes it compounding rather than siloed. The velocity gap is real. So is the setup cost in the first sprint.
Was the comparison fair — wasn't the traditional team disadvantaged by size?
The traditional team had 11 engineers; the AI team had four. The comparison isn't "same size team, different tools" — it's "what can a structured AI team achieve relative to a traditional team running at normal scale." If anything, the larger traditional team had an advantage in raw capacity for parallel work streams. The AI team's advantage was per-engineer output and quality metrics.
What codebase was this run on?
We're not disclosing the client. The codebase was a TypeScript/Next.js frontend with a Node.js API layer and a PostgreSQL database — a very common modern SaaS stack. The data migration component involved dbt and Snowflake. The results are representative of this stack; different stacks may vary.
How long does it take an AI Code Factory team to reach full velocity?
Based on this experiment and others, we see full velocity — where the SKILL.md library has sufficient codebase knowledge to guide agents accurately — by sprint two or three (4–6 weeks). The first sprint is productive but runs at roughly equivalent-to-traditional output. The compounding starts in sprint two.

CTO & Co-Founder
Tom leads Indpro's technology strategy and engineering standards. With 20+ years of experience building and leading engineering teams across the Nordic region, he ensures every engagement delivers at the highest technical level.