The other day, I dropped a 2,000-line MR on a colleague. Our guidelines say we should be aiming at 10-20% of that size, but in the process of creating it, it hadn't _felt_ like writing 2,000 lines. It was the culmination of a fair bit of planning and prompting, and the code itself came together in just a few hours. I felt a little silly, but I'm certainly not the first one to run into this problem: we can generate code a lot more quickly now.

Because LLMs have given us such a tremendous speedup in generating new code and revising existing code, it is extremely attractive to have LLMs handle verifying that the code works, too. Why torment my colleagues with analyzing all of that for subtle bugs when a machine could do it? Sadly, while I think they can be a valuable aid, I do not think LLMs alone can validate software well enough to remove humans from QA. Our practices will have to evolve more drastically than just "hand that step to the machine."

## Why would you think this works?

Thinking about humans, we know a couple of things about what makes code review work:

- Humans who are good at programming are generally good at code review because, for humans, "code comprehension is fundamental to code review" ("[Code Review Comprehension: Reviewing Strategies Seen Through Code Comprehension Theories](https://arxiv.org/html/2503.21455v1)"). It's the same sort of reasoning, so performance in one area improves performance in the other.
- Experience shows that even less experienced engineers can provide useful review and verification, because the reviewer still brings a mental model of how the code operates AND brings a different set of experiences and perspectives to the task.
- Throwing more reviewers at code has a similar effect to throwing more experienced reviewers at it ([Kononenko et al. (2015)](https://plg.uwaterloo.ca/~migod/papers/2015/icsme15-OleksiiOlgaLatifa.pdf)), which feels like "well, I'll use several models and prompts!"
Anyway, humans aren't that good at review to begin with: in Kononenko et al., 54% of the merge requests made it through review with bugs. So, if we see LLMs as "less-good" human developers (and their performance in Zone 1 makes this an attractive way to look at LLMs), one could reason that throwing several of them at a problem should work pretty well. But more humans help because each one brings genuinely different experiences and perspectives; **multiple LLMs share the same training biases** and will tend to make the same mistakes and the same judgments. Unfortunately, the intuition that "more LLMs = more reviewers" is wrong both theoretically and empirically. Using different models (say, having Claude generate and ChatGPT review) doesn't solve this either; [Kim, Garg & Peng (ICML 2025)](https://arxiv.org/pdf/2506.07962) showed that across 350+ LLMs, error correlation increases with model accuracy, even across different providers and architectures --- more capable models converge on the same blind spots.

## Theoretical understanding

Recapping the insights from [[Wayfinding the jagged frontier]], we can think of LLM inference as generating responses to prompts that encode what someone from a relevant community would plausibly say when presented with the prompt, constrained by the training material available to the LLM. That constraint is pretty important: it means that LLMs tend to "know" a lot generically about domains like programming and QA, but don't have much (or any) training on domain-specific problems, and have much thinner material to draw on for complex, multi-part problems.

Building from the above, we'd expect that when throwing LLMs at code for review and QA, we might find:

1. Test generation and review judgment will both be driven by generic training material, making the model inconsistently sensitive to domain specifics.
2. Coding competence does not transfer to review competence in LLMs, because different tasks activate different regions of the training topography.

And if both of those hold, a compound risk emerges: a QA→repair loop may amplify domain insensitivity, with the QA agent flagging essential behavior as bugs and the repair agent unable to distinguish essential from accidental.

That second hypothesis is subtle but important: LLMs do not form mental models of code the way we do. They generate responses to prompts based on their training material. So a model that's competent at code generation won't necessarily be competent at review or verification, because those are different tasks, discussed differently in the training material. The communities of practice are adjacent, but discussions of code review are dominated by examples of finding defects and generic techniques for finding them. The corpus of material on programming, by contrast, is dominated by solutions to specific problems and techniques for deriving solutions.

## What the research shows

Once I had a theoretical posture that matched my past experience, I figured there must be a body of research around this --- it's too popular an idea for academics to leave to my intuitions. So, I ran a couple of research prompts with Claude (Opus 4.6, a mix of normal and deep research mode) to find relevant, recent papers that used frontier models (a nice zone-1 task of "write a lot of search queries and summarize the results"), and then read the most interesting-looking ones myself.

**Spoiler alert:** The research is honestly worse than I expected even from my theoretical stance. There's pretty direct evidence for both hypotheses --- that test generation and review are driven by generic training material, and that coding competence doesn't transfer to review competence.
I couldn't find any studies that considered the agentic "identify and repair" loop directly, but given the evidence for the first two, the compound risk is quite well supported. For a sampling of what I mean, here are some of the most interesting/relevant papers I came across.

### "Are LLMs Reliable Code Reviewers? Systematic Overcorrection in Requirement Conformance Judgement" ([Jin & Chen, 2026](https://arxiv.org/abs/2603.00539))

Jin & Chen took _correct code_ --- algorithm implementations with clear specifications and clear, correct solutions --- added bugs, and then had LLMs review both versions. The LLMs did pretty well at identifying the buggy versions, with low (under 10%) false negative rates of letting buggy versions through. BUT they do terribly on false positives: even the best-performing model finds (made-up) "bugs" in 32-40% of the correct samples if you give it a detailed review prompt, and the worst hit 73-88%.

This feels really weird, because this is zone 1 code that I'd expect LLMs to be great at generating. But **they aren't generating code; they're generating code reviews** based on the code-review training material. That material is full of examples of finding bugs, so, well, they "find" bugs.

This behavior also raises the risk that adding an LLM-review step _introduces_ novel defects that didn't exist in the original code. If review makes up bugs, a remediation agent is supposed to fix them, and they get to loop, you could end up with some very weird results that go beyond amplifying domain insensitivity and land on "random, hard-to-understand bugs unrelated to any known reality."
### "TestExplora: Benchmarking LLMs for Proactive Bug Discovery via Repository-Level Test Generation" ([Liu et al., 2026](https://arxiv.org/pdf/2602.10471))

This paper tests something very close to what you'd actually want: given repo-level context (documentation, function signatures, dependency graphs, and, in the whitebox arm, full source code), can a model write tests that find bugs? The models do _terribly_: in the blackbox arm, they find under 8% of the bugs; in the whitebox arm, they can't break 12%. Even in an agentic setting where models can freely explore the repository, the best result (Claude Sonnet 4.5 with SWEAgent) only reaches about 16% --- roughly double, but still finding fewer than one in six bugs. The tested models are very bad at finding bugs by writing tests, which is an even more severe case of my first hypothesis.

One interesting positive result: the LLMs often did figure out which codepaths they needed to exercise; they then just wrote bad assertions that didn't actually test the thing they needed to test.

### "Evaluating LLM-Based Test Generation Under Software Evolution" ([Haroon, Khan & Gulzar, 2026](https://arxiv.org/abs/2603.23443))

Haroon et al. took code with adequate test coverage and made two types of changes: semantics-preserving changes (e.g., rename a variable, add a no-op `else` block) and semantics-altering changes (e.g., an algorithm that processes a list now processes all but the last item). In both cases, as the code base evolves, test coverage and quality fall. What's weird is that for semantics-altering changes, the models wrote tests that passed under the _previous_ behavior, "indicating residual alignment with the original behavior rather than adaptation to updated semantics," as the authors put it. For semantics-preserving mutations, the LLMs tended to generate many new tests and toss old ones that were still valid.
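To make the two mutation types concrete, here is a hypothetical pair in the spirit of the mutations the paper describes; the function names and bodies are my own invention, not taken from the benchmark:

```python
# Original: sums the squares of every item in a list.
def sum_of_squares(items):
    return sum(x * x for x in items)

# Semantics-preserving mutation: rename a variable and add a no-op
# `else` branch. Behavior is identical, but the diff looks large.
def sum_of_squares_v2(values):
    total = 0
    for v in values:
        if True:
            total += v * v
        else:
            pass  # unreachable; purely structural noise
    return total

# Semantics-altering mutation: now processes all but the last item.
# A tiny diff, but the meaning has genuinely changed.
def sum_of_squares_v3(items):
    return sum(x * x for x in items[:-1])
```

A test suite written against `sum_of_squares` should survive the v2 change untouched and be deliberately updated for v3; the paper observes models doing roughly the opposite.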
The authors think this is because the semantics-preserving changes tend to be _bigger_ (more lines changed), which then makes the model respond with bigger changes to the test corpus, "suggesting sensitivity to lexical changes rather than true semantic impact." Both behaviors support the hypotheses around domain sensitivity; **the models respond to structural features and slide into their training material** as the functions mutate to handle new scenarios.

### "Deep Learning-based Code Reviews: A Paradigm Shift or a Double-Edged Sword?" ([Tufano et al., ICSE 2025](https://arxiv.org/abs/2411.11401))

Tufano et al. studied 29 human reviewers over 50 hours of code reviews, augmenting some with LLM-generated reviews. The abstract puts it starkly:

> We show that reviewers consider valid most of the issues automatically identified by the LLM and that the availability of an automated review as a starting point strongly influences their behavior: Reviewers tend to focus on the code locations indicated by the LLM rather than searching for additional issues in other parts of the code. The reviewers who started from an automated review identified a higher number of low-severity issues while, however, not identifying more high-severity issues as compared to a completely manual process. Finally, the automated support did not result in saved time and did not increase the reviewers' confidence.

The LLM's findings skewed heavily toward low-severity issues --- it could occasionally catch real logic bugs, but its output was dominated by style and maintainability concerns. The bummer is that this **anchors the human reviewers to the same distribution**: they spend their time on the LLM's low-severity findings while missing high-severity ones they might have found on their own. This particular study used only ChatGPT Plus (GPT-4); I'd love to see whether a more sophisticated model at least shifts the severity distribution. The human angle, though, is worth remembering.
### "Are 'Solved Issues' in SWE-bench Really Solved Correctly? An Empirical Study" ([Wang, Pradel & Liu, ICSE 2026](https://arxiv.org/pdf/2503.15223))

Wang et al. looked at SWE-bench results and found that nearly 8% of accepted patches don't actually solve the problem, and nearly 30% of "plausible patches induce different behavior than the ground truth patches. These behavioral differences are often due to similar, but divergent implementations (46.8%) and due to generated patches that adapt more behavior than the ground truth patches (27.3%). Our manual inspection shows that 28.6% of behaviorally divergent patches are certainly incorrect."

This gets at my concerns about the correction loop of a fully agentic verification pipeline --- just as agents are great at ignoring failing tests or replacing good tests with `return true`, they also adjust code unnecessarily when fixing identified bugs. Combine that with the way they identify spurious defects in working code, and you run a real risk of **the review phase _lowering_ the quality of the code**.

## Can I just solve this with more context?

No, for three reasons.

Foremost, if context could solve the problem, you probably wouldn't even need a separate review/QA step. I cannot imagine any additional context that you'd give to the QA/review step that the implementing step doesn't have. The implementing step is also playing to the model's strengths (implementing), has a human to guide it and ask questions, and should be doing some sort of self-review as it goes. The review step is at a series of structural disadvantages: as posited, it should be fully autonomous, and it doesn't have the detailed planning that the implementer worked out with the human.

Secondly, when we see the model inventing errors in code that doesn't have any (as it did in [Jin & Chen, 2026](https://arxiv.org/abs/2603.00539)), it's really unclear what additional context the model could need to avoid that.
Those models had lots of useful context, too --- they got a description of what the function was supposed to do and where to look in the code. The problem is that the model is being pulled toward what it's trained to do, and sometimes that training will overcome everything else in the context window. Similarly, given a clear example of the semantically mutated code, the model still wants to test the versions of the functions from its training material.

Third, at least in the codebases I work on, it's almost impossible to tell _which_ specific domain constraints are important to any given feature _ex ante_. This leaves the model needing to surf your entire company context and, using what it knows about the task, surface the places where you've written up the key domain constraints that apply to all aspects of your system. Once it's found all that data, it needs to reason abductively about which constraints might have been violated, and then test the code to see whether it conforms. It takes months to onboard humans to do that, and there's no evidence that models can perform domain-grounded abductive reasoning from knowledge they haven't been trained on.

In all these cases, you're fighting gravity. The evaluative corpus of review and QA has lots of examples of surface-level bugs and of finding bugs frequently, and no examples of the kinds of bugs you typically introduce in your domain. Models infer from patterns in their training data; your company's priorities aren't in there, and we have no evidence they can perform the domain-grounded abductive reasoning you'd need to generalize from your company's knowledge base to this ticket.

## That's depressing. Does anything actually work?

The above leads me to think pure-LLM review and verification is not possible in systems that matter. In systems that don't matter, well, don't waste the money.
I would frame the experimental results as **"there is no zone 1 in code review and QA."** All code review and QA appears to need a human to help figure out _what_ to test and to create reliable oracles that tell you when something is actually broken. The advantage of using LLMs is that, once you have determined what to test, the LLM can help _execute_ the test. Specifically, it can

- help write tooling for exercising the code manually (based on a human's judgment of what to exercise) AND solve environment issues that make those tools hard to use normally,
- write linters/SAST tools and conformance tools that would have taken far too long in the past,
- write tests to check for erroneous behavior, assuming a human can confirm the tests themselves are doing what they claim.

I think there may also be some fairly specific tooling you could build around RAG-assisted vulnerability analysis (like [Vul-RAG](https://arxiv.org/abs/2406.11147)), tailored to the kind of software you write, to create essentially better SAST tools --- but that's not generalized review and QA.

Using models for this task feels similar to Zone 3 work, honestly --- you're not depending on the model for much of the actual planning, but it can build you tools and do experiments that would take too long on your own. So while models may not be terribly good at code review or validation, they may make humans great at it, with a little creativity.