Automated PR review AI: our agent flagged 23 issues in week one while human reviewers caught 4. Full breakdown by severity, false positives, and what this means for your review process.
Author
Tom Bergström
Published
21 May 2026
Reading time
7 min read
Topics
nordic-tech, architecture, scaling
In its first week on a production codebase, our AI Code Factory PR review agent flagged 23 issues across 31 pull requests. In the same week, human reviewers independently caught 4 issues, all of which the agent had already flagged. The overlap is not a coincidence: it indicates that the agent catches the same things experienced reviewers catch, plus 19 additional findings the reviewers didn't comment on.
This is the breakdown: what the 23 issues were, how we categorized them, how many were false positives, and what happened to the 4 that both the agent and humans caught. If you're evaluating automated code review, this is a real data set from a real production week.
Agent findings: 23
Human findings: 4
False positives: 3
PRs reviewed: 31
Not all 23 findings were equal in severity. We categorized them into four buckets: critical (would cause production failures), high (incorrect behavior in edge cases), medium (security or performance risks), and low (code quality, maintainability). The distribution tells you what automated review is best at — and where it still needs human backup.
The 3 critical findings were the most important: an unhandled Promise rejection in a payment confirmation handler that would cause a silent failure with no user feedback, a missing null check in the subscription cancellation flow, and a race condition in the workspace creation endpoint that would occasionally create duplicate records. All three were fixed before the PRs merged.
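The first of these, the unhandled Promise rejection, has a recognizable shape. The sketch below is a hypothetical TypeScript illustration, not the code from the reviewed PR: `confirmPayment`, `CapturePayment`, and `ConfirmResult` are assumed names, and the payment provider call is stubbed out as a function argument.

```typescript
// Hypothetical types and names, for illustration only.
type ConfirmResult =
  | { status: "confirmed"; paymentId: string }
  | { status: "failed"; paymentId: string; reason: string };

// Stand-in for a payment provider call that may reject.
type CapturePayment = (paymentId: string) => Promise<void>;

// The risky pattern would be calling `capture(paymentId)` without `await`
// or `.catch(...)`: a rejection then never reaches the caller, and the
// user sees no error.
//
// This version awaits the call and turns a rejection into an explicit
// result that the route can render as an error message.
async function confirmPayment(
  paymentId: string,
  capture: CapturePayment,
): Promise<ConfirmResult> {
  try {
    await capture(paymentId);
    return { status: "confirmed", paymentId };
  } catch (err) {
    const reason = err instanceof Error ? err.message : String(err);
    return { status: "failed", paymentId, reason };
  }
}
```

Returning a tagged result rather than rethrowing is one design choice among several; the point is only that the failure becomes visible to the caller.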
The 7 high-severity findings included the most interesting category: auth bypass risk. The agent flagged a route where a workspace admin check was present but could be bypassed by manipulating the request body. It was subtle — the kind of issue that passes review when a human is moving quickly through a large PR. The agent caught it because it checks every route against the authorization pattern defined in the skill file, regardless of PR size.
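The article does not show the flagged route, so the following is only one plausible shape of such a bypass, sketched in TypeScript under assumed names (`WorkspaceRequest`, `isAdminBypassable`, `isAdminForRoute`). The check exists in both versions; what differs is whether it authorizes against client-controlled data.

```typescript
// Hypothetical request shape; field names are assumptions.
interface WorkspaceRequest {
  user: { id: string; adminOf: string[] }; // set by trusted auth middleware
  params: { workspaceId: string }; // taken from the route path
  body: { workspaceId?: string }; // fully client-controlled
}

// Bypassable: the check prefers the id in the request body, while the
// handler acts on the id in the route. A caller can pass the check by
// sending a workspace they do administer in the body.
function isAdminBypassable(req: WorkspaceRequest): boolean {
  const checkedId = req.body.workspaceId ?? req.params.workspaceId;
  return req.user.adminOf.includes(checkedId);
}

// Safer: authorize against the same id the handler will act on, using
// only data the client cannot override.
function isAdminForRoute(req: WorkspaceRequest): boolean {
  return req.user.adminOf.includes(req.params.workspaceId);
}

// A forged request: the route targets someone else's workspace, the body
// names the caller's own.
const forged: WorkspaceRequest = {
  user: { id: "u1", adminOf: ["ws-own"] },
  params: { workspaceId: "ws-victim" },
  body: { workspaceId: "ws-own" },
};

isAdminBypassable(forged); // true: the check passes for the wrong workspace
isAdminForRoute(forged); // false: the request is rejected
```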
3 of the 23 findings were false positives — the agent flagged something that looked like a bug but was intentional code. False positive rate: 13%. That's within an acceptable range for a first week of deployment, but it tells you something about where skill file calibration matters most.
All three false positives came from the same category: the agent flagged the absence of a Zod validation schema on a route that intentionally accepts unstructured data (a webhook receiver). The skill file said "all routes require Zod validation" — which is correct for 99% of routes. The webhook route was the exception, and the skill file didn't encode exceptions.
Fix: we added an annotation to the webhook route file (// @skip-validation-check — webhook receiver, validated by signature) and updated the agent configuration to respect it. False positives in week 2 for the same category: zero.
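How the agent configuration honors the annotation is not described here, so the following is a minimal TypeScript sketch of the idea under stated assumptions. Only the `@skip-validation-check` marker comes from the fix above; `RouteFile`, `hasValidationSchema`, and both function names are hypothetical, and detecting whether a route has a Zod schema is assumed to happen in an earlier step.

```typescript
// Marker taken from the article; everything else is a hypothetical sketch.
const SKIP_ANNOTATION = "@skip-validation-check";

interface RouteFile {
  path: string;
  source: string;
  hasValidationSchema: boolean; // assumed to be computed by an earlier step
}

// True when a line comment in the file carries the skip marker.
function hasSkipAnnotation(source: string): boolean {
  return source
    .split("\n")
    .some((line) => line.trim().startsWith("//") && line.includes(SKIP_ANNOTATION));
}

// Returns the paths that should still be flagged for missing validation.
function routesMissingValidation(files: RouteFile[]): string[] {
  return files
    .filter((f) => !f.hasValidationSchema && !hasSkipAnnotation(f.source))
    .map((f) => f.path);
}
```

Requiring the marker to sit in a comment, with a reason next to it, keeps each exception visible in the route file itself rather than hidden in agent configuration.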
The agent missed nothing the human reviewers caught that week: the 4 human findings were all issues the agent had already flagged. But this is a single week of data, not a generalized claim. There are categories of issue that automated review handles poorly: UX edge cases, product logic correctness, architectural decisions, and anything requiring business domain knowledge.
"Automated PR review replaces the mechanical part of code review — pattern matching, standard violations, known anti-patterns. What it doesn't replace is engineering judgment about whether the code does the right thing for the product. Those are different questions." — Aash, Engineering Lead, Indpro AB
In week 2, a human reviewer caught an issue the agent couldn't have found: an endpoint that worked correctly by all technical measures but produced a response structure inconsistent with the client's mobile app expectations. The agent had no way to know that. That's the right division of responsibility — agent handles the technical layer, humans handle the product layer.
We deployed the PR review agent in advisory mode for the first three weeks — it flagged issues as comments but didn't block merges. This gave the team time to calibrate their response, tune the false positive rate, and build confidence in the output quality. After week 3, we flipped it to blocking mode for critical and high severity findings.
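The difference between the two modes can be expressed as a small merge-gate rule. This is a sketch in TypeScript with assumed names (`Mode`, `Finding`, `shouldBlockMerge`), not the agent's actual configuration format.

```typescript
type Severity = "critical" | "high" | "medium" | "low";
type Mode = "advisory" | "blocking";

interface Finding {
  severity: Severity;
  message: string;
}

// Severities that block a merge once the agent is in blocking mode,
// matching the rollout described above.
const BLOCKING_SEVERITIES: readonly Severity[] = ["critical", "high"];

function shouldBlockMerge(mode: Mode, findings: Finding[]): boolean {
  if (mode === "advisory") return false; // comment only, never block
  return findings.some((f) => BLOCKING_SEVERITIES.includes(f.severity));
}
```

Medium and low findings stay as comments in either mode, so a flip to blocking changes the outcome only for PRs that carry critical or high findings.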
The transition was smooth because we'd spent three weeks on calibration. Teams that flip to blocking mode immediately tend to get pushback — not because the agent is wrong, but because the team hasn't had time to build trust in its judgment. Advisory first, blocking second. The transition period is where you tune the skill file exceptions and teach the team that the agent is catching real problems.
Before the agent: average 2.1 human review cycles per PR on this project. After the agent (blocking mode, month 2): 1.4 cycles, a reduction of about one third. The agent pre-resolves the issues that were driving the extra cycles: the straightforward catches that reviewers would have flagged anyway. By the time a PR reaches a human reviewer, the low-hanging-fruit issues are already gone, so human review time focuses on higher-order concerns.
Q: How does the AI Code Factory PR review agent differ from standard GitHub Copilot review features?
The core GitHub Copilot review features provide general suggestions. The AI Code Factory's PR review agent is configured with project-specific SKILL.md files — it checks against your codebase's specific patterns, guardrails, and architectural decisions. It knows your authorization model, your data access patterns, your validation requirements. That specificity is what makes the false positive rate manageable and the catch rate high.
Q: What's an acceptable false positive rate for automated PR review?
In our experience, below 15% in advisory mode is workable. Below 5% is the target for blocking mode. The difference is calibration — tuning the skill file exceptions for your codebase's intentional deviations from standard patterns. We typically reach below 5% false positives by the end of a 6-week implementation sprint.
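As a worked example, the week-one figures from this article (3 false positives out of 23 findings) can be run against those two thresholds. The function and constant names below are hypothetical, and treating the thresholds as strict upper bounds is an assumption.

```typescript
// Thresholds quoted in the answer above.
const ADVISORY_MAX_RATE = 0.15;
const BLOCKING_MAX_RATE = 0.05;

type Readiness = "ready-for-blocking" | "advisory-only" | "needs-calibration";

function falsePositiveRate(falsePositives: number, totalFindings: number): number {
  if (totalFindings <= 0) throw new RangeError("totalFindings must be positive");
  return falsePositives / totalFindings;
}

function readiness(rate: number): Readiness {
  if (rate < BLOCKING_MAX_RATE) return "ready-for-blocking";
  if (rate < ADVISORY_MAX_RATE) return "advisory-only";
  return "needs-calibration";
}

// Week one: 3 false positives out of 23 findings, roughly 0.13.
const weekOne = falsePositiveRate(3, 23);
readiness(weekOne); // "advisory-only"
```

By this rule, the week-one rate is workable for advisory mode but still above the target for blocking mode.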
Q: Can the PR review agent check for business logic correctness?
No. The agent checks structural correctness: patterns, security, performance anti-patterns, type safety. Business logic correctness requires domain knowledge that is hard to encode reliably in a skill file. That division is intentional: automated review handles the mechanical layer, human review handles the product layer. Trying to encode business logic in a skill file produces high false positives and low developer trust.

CTO & Co-Founder
Tom leads Indpro's technology strategy and engineering standards. With 20+ years of experience building and leading engineering teams across the Nordic region, he ensures every engagement delivers at the highest technical level.