A day of work for an LLM-assisted software engineer can feel emotionally chaotic. It's not uncommon to have all of the following thoughts in the same workday:

- I just did half a day of work in 45 minutes. This is incredible.
- I just did half a day of work in 45 minutes. I am going to lose my job soon.
- The approach the LLM suggested looks really good, but something feels off. I can't put my finger on it, though, so I should go with it. Oh no.
- The LLM just came up with a really clever implementation of what would have been a tricky function for me.
- I asked the LLM to apply a clear pattern (and even just copy code) from one module to a new one. It ... did its own thing entirely. I need to `git checkout --` all of this and start over.
- I have no idea whether this would have been faster to do by hand.

I have encountered plenty of talk of the [jagged frontier of LLM capabilities](https://www.oneusefulthing.org/p/centaurs-and-cyborgs-on-the-jagged), and while that framing certainly captures the feeling and reduces the loneliness, I have seen much less about how to find your way along that frontier (other than by trial and error) and understand where the machine can be more and less helpful. My goal here is to begin to build those intellectual tools.

While I suspect these tools apply across knowledge-work fields, I will focus on the domain I know and love best: software engineering. I should also note that what's here is an overview; almost every assertion in this essay gets far more complicated as you dig in, and I'll be adding hyperlinks for digging as I do the writing. I also don't cover the ethical issues LLM use raises; those will require entirely different essays, later.

## What is going on in there?

First, the materially obvious: LLMs are trained on huge amounts of language that was mostly written by humans for other humans.
This language is preprocessed into tokens, and then, through training across billions of parameters organized in layers of attention and transformation, the model learns contextual numerical representations, where the representation of any token depends on every other token surrounding it.

The interesting question is: what on earth do those representations encode? I don't think there's a single answer, because how you answer depends on the kinds of questions you want to answer next. For me, right now, those are practical questions about LLM capabilities. So one useful answer is that the representations encode, as [Dan Silver](https://thesilverlining3.substack.com/p/kicking-away-the-ladder-of-reference) claims, "what someone, under comparable discursive conditions, would plausibly say. What looks like a purely statistical operation over tokens is also a reconstruction of social judgements."

Moving a little outside Silver's framing: the model has learned not just what people say, but what communities recognize as worth saying. Concretely, for programming, the model has encoded what counts as "good code to solve problems" from the written record of developers discussing, reviewing, and sharing code. It's predictably good where that record is dense and predictably misleading where the record is sparse or the knowledge tacit. That these discussions happened in social communities with their own standards and conventions further complicates what's actually encoded; programmers certainly don't agree in general on what "good code" is, so the model has picked up a lot of community-specific judgments that nuance exactly what it will generate when asked for "good code to solve some problem, in my context." When I prompt an LLM, it generates the response that would best survive scrutiny from a relevant community of practice, because that community's standards are recoverable from the textual record.
We can think of the matrix of token relations as a landscape generated from the training data: the model compresses this record into a high-dimensional space whose topography encodes not just what was said but what counted as competent, valid, or appropriate to say in context. A prompt situates the model in this landscape. Generation traces a path through it — not retrieving stored answers, but following the grooves worn by, say, millions of developers arguing about what's good code and what isn't.

## A frontier without a map

Working out the implications of the above does not give us a map, but it does give us a sense of the landscape and tools for navigating the frontier, even as it shifts under us with each new model release. For me, a given generation will work from one of three zones of model competence, each requiring its own approach.

### Zone 1, or embarrassingly-solved problems

Some problems are extremely well represented in the training data and have well-understood solutions. Zach Pearson calls them "[embarrassingly solved problems](https://zjpea.substack.com/p/embarrassingly-solved-problems)," and I kind of love that. That "I just did half a day of work in 45 minutes" moment almost certainly means you used the LLM to apply what it has learned about embarrassingly solved problems in a context where the solution fits really well. The model only needed to use some technique that's appeared in a lot of tutorials, write some tests that everyone says you should write, maybe apply a pattern that's clearly well represented in your own code. I think of things like, "I need to query a model and filter its results by some parameters from the request (which map clearly to the model's fields), then JSON-encode the results. I need tests to show my approach is correct." The community validation signal is overwhelming: good implementations are upvoted, forked, and reused; bad ones are corrected or ignored.
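To make that Zone 1 shape concrete, here's a minimal, framework-free sketch of the filter-and-serialize task (the data, field names, and `list_books` helper are all invented for illustration; a real version would sit behind an ORM and a web framework, which is exactly the kind of glue the model has seen thousands of times):

```python
import json

# Hypothetical in-memory "model"; in a real app this would be an ORM query.
BOOKS = [
    {"id": 1, "title": "Dune", "author": "Herbert", "year": 1965},
    {"id": 2, "title": "Hyperion", "author": "Simmons", "year": 1989},
    {"id": 3, "title": "Endymion", "author": "Simmons", "year": 1996},
]

def list_books(params):
    """Filter records by request params that map 1:1 to model fields,
    then JSON-encode the results: the canonical Zone 1 shape."""
    allowed = {"author", "year"}  # whitelist params that map to fields
    filters = {k: v for k, v in params.items() if k in allowed}
    results = [b for b in BOOKS if all(b[k] == v for k, v in filters.items())]
    return json.dumps(results)

# The tests "everyone says you should write":
assert json.loads(list_books({"author": "Simmons"})) == [BOOKS[1], BOOKS[2]]
assert json.loads(list_books({})) == BOOKS
assert json.loads(list_books({"author": "Nobody"})) == []
```

Nothing here requires knowing anything about the surrounding business; every piece is densely attested in tutorials, which is why the model handles it at the speed of token generation.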
Using an LLM for an actual Zone 1 problem is always going to feel like a win. This is also why some people love LLMs early in greenfield projects --- almost everything you do early is generic "best practice" stuff that the LLM can reproduce at the speed of token generation. The tricky part is _knowing_ you're solving a Zone 1 problem without doing the intellectual work of actually solving it; as the [Programmer's Credo](https://x.com/Pinboard/status/761656824202276864) states, "we do these things not because they are easy, but because we thought they were going to be easy."

### Zone 2, the engineerable zone

Zone 2 is where you're staring at something messy and realizing it's actually several things you know how to do. The work is decomposition — turning and transforming a problem until it breaks into pieces that each match a known pattern. The pieces are Zone 1; the assembly is Zone 2.

The model can be genuinely helpful here, because developers write about _how_ to decompose problems all the time — refactoring strategies, pattern selection, case studies with code. The model has absorbed a lot of that. But your specific problem, in your codebase, with your domain's constraints, won't match anything in the training data directly. The model is working by analogy to well-documented processes, not by recognition. It has no way to understand your domain, but in Zone 2 it doesn't have to; Zone 2 problems decompose into pieces where domain understanding isn't the binding constraint.

For example, I once refactored scattered authorization logic into a cleanly defined module. The extraction itself was straightforward — the model knows how to pull code into a module. But our authorization had business-specific quirks that mattered: odd edge cases that looked like bugs if you'd learned authorization from tutorials. The model kept "fixing" them — normalizing the logic toward what a well-written auth tutorial would look like.
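To make that kind of quirk concrete, here's a hypothetical sketch (the function, the fields, and the regulatory rule are all invented for illustration):

```python
def can_view_invoice(user, invoice):
    """Authorization check with a deliberate business quirk.

    Hypothetical rule: suspended users normally lose all access, but a
    (made-up) regulatory requirement says they must still be able to read
    their own past invoices. To a model that learned auth from tutorials,
    the suspended branch looks like a bug to "fix" during refactoring.
    """
    if user["suspended"]:
        # Looks wrong; is load-bearing. Keep intact while restructuring.
        # Note suspension also strips admin powers, another quirk.
        return invoice["owner_id"] == user["id"]
    return user["is_admin"] or invoice["owner_id"] == user["id"]

invoice = {"owner_id": 7}
assert can_view_invoice({"id": 7, "suspended": True, "is_admin": False}, invoice)
assert not can_view_invoice({"id": 8, "suspended": True, "is_admin": True}, invoice)
assert can_view_invoice({"id": 8, "suspended": False, "is_admin": True}, invoice)
```

A tutorial-shaped rewrite would check `is_admin` first unconditionally, which is exactly the normalization the model kept proposing in my real refactor.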
The Zone 2 work was keeping those quirks intact while restructuring around them, which meant I had to understand _why_ each quirk existed well enough to defend it against the model's instinct to clean it up.

There's a subtler version of this same dynamic. Zone 2 work has two phases: _recognizing_ that a simpler structure is available, and _executing_ the restructuring. I needed to recognize both that the refactoring was necessary and what the cleanly defined module's structure should be. The mechanics of actually moving the code, applying a context manager where appropriate, and so on, the model can do. But the first part — looking at three functions and seeing that they share a structure that could be collapsed — requires reading _your_ code through a lens the training data doesn't provide. The model has absorbed plenty of advice about refactoring _in general_, but almost nothing about the specific moment when _this_ code reveals a latent simplification. That recognition is nearly always tacit; developers do it in their heads and never write it down. We see this [empirically](https://arxiv.org/abs/2511.04427): LLMs are [net adders](https://www.gitclear.com/ai_assistant_code_quality_2025_research) of code. They do not find simplifying abstractions on their own. If you don't propose the shape, the model will propose _adding_ code instead.

Zone 2 problems are local to a system — the relevant constraints are visible in a small enough area of the codebase that the model can actually hold what it's working with in context. That's a real boundary: once the binding constraints are spread across many modules or services, or live in domain knowledge that isn't coherent in the code, you've crossed into Zone 3. Zone 2 problems are also, ideally, problems you can write acceptance or integration tests for. There's a short feedback loop from "possible solution" to verification, which means the model can iterate its way to success — propose, check, revise.
Setting up that process largely determines your success in Zone 2 (welcome to "I have no idea whether this would have been faster to do by hand").

### Zone 3, the sparse danger zone

Not all problems decompose well into Zone 1 problems. Think about abstractions that coordinate multiple modules or systems around domain-specific constraints — the kind of architectural decision that reflects a deep understanding of a particular business's problem space. Every business has its own domain. There are rarely many examples of "several solutions to this domain problem" in the public record, because even the business facing the problem hasn't solved it adequately if someone is reaching for a new abstraction. The training data is sparse by necessity. Often, these are problems where you won't know you were right for weeks or months.

I think back to my time on DirectFile, where I set up a declarative system for navigating paths through the tax logic we had encoded and turning those paths into a web app. I made a bet that the declarative approach would pay for its implementation complexity by allowing us to quickly onboard many developers. It mostly worked in year 1; then, in year 2, the team had to add features I'd never thought through and that, it turned out, the abstractions I'd designed couldn't cleanly accommodate. The [traces of some of that thinking](https://github.com/IRS-Public/direct-file/blob/main/docs/adr/adr-frontend-architectural-constraints-spring-2023.md) exist in an [open source repo](https://github.com/IRS-Public/direct-file), but I never wrote a solid analysis of how things worked out, and knowing which constraints we actually needed to optimize for requires a great deal of unarticulated knowledge of the sociotechnical environment — and judgment that is won by designing at this level over and over and seeing what has and has not worked. There is not a well-agreed-upon process for these sorts of decisions (beyond "write an RFC!"
and "write an ADR!"), nor a good process for decomposing them into well-understood sub-problems.

Zone 3 tasks don't exist in isolation — they sit adjacent to the densely populated landscape of _generic software architecture advice_. When the model encounters a Zone 3 problem, it doesn't signal uncertainty. It slides into the nearby dense region and produces the solution that a well-read software engineer who doesn't know the domain would suggest. It generates what would get upvotes on Stack Overflow from outsiders. This is worse than mere absence of knowledge. It is _systematically misleading_ competence — the output has the right shape, uses the right vocabulary, and pattern-matches to something that sounds architecturally principled. But it misses the essential constraints that make the domain what it is.

Researchers have come to a similar conclusion. A [Carnegie Mellon/SEI assessment (Ivers and Ozkaya, ICSA 2025)](https://www.sei.cmu.edu/documents/6252/ICSA_2025_NEMI_preprint.pdf) systematically evaluated generative AI against common software architecture activities. The research team rated LLMs well at tasks involving brainstorming alternatives and applying known patterns. Every task involving contextual reasoning, comparison of alternatives in system-specific context, or assessment of goodness of fit received the lowest possible rating.

## Cool, then what do I do?

**Zone 1 is pretty easy** --- if you are confident the problem is generic, then the LLM has just automated away some drudgery for you. Neat. If you're iffy on that question, you probably need to work a level deeper than your typical prompt and try to decompose the problem yourself into what you'd need to implement, to make sure it's all stuff you're pretty sure you'd find the right answer for by searching the web and then making really straightforward changes.

**Zone 3 is also weirdly easy** --- don't believe the model's lies.
Use the model to explore what the generic solutions might look like, but at this point you're doing something like real software architecture work: you are going to have to work out a design, get it detailed enough that it decomposes into Zone 1 and Zone 2 work, and then you probably can have the model speed up the implementation. Honestly, one of my favorite things about LLMs is that, once I have the big picture of "how these systems should interact" and "what's the rough API surface," I can have the model generate a couple of different implementations so I can see how the details of my design might play out. For me to build all those proofs of concept myself would often have taken too long, but now I can get a real sense of whether I have the right knobs and switches earlier in the process. This work is also pretty important: a huge part of knowing whether we got an abstraction _right_ is seeing whether it's frustrating to work with, and in LLM-heavy development you've got to sense that frustration yourself, because the model cannot.

**Zone 2 is all about keeping the model on track, starting with the shape of the solution.** The tricky thing is that Zone 2 has two faces. It's where the model is surprisingly competent (because the _types_ of moves needed to solve these problems are well documented in the training material), but it can also easily miss key details of the problem and then quickly solve the wrong thing. You've got a lot working against you here: your own understanding is probably fuzzy, since you haven't solved the problem yet; you're probably working in an existing codebase that is going to light up all kinds of parts of the landscape as the model pulls the integration point into the context window; and the model wants to give you a solution _like_ what's written about online, which likely ignores the important details of your domain (which are not written about extensively online).
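As a concrete, hypothetical instance of the kind of latent simplification only you can spot: three exporters that grew up independently, and the parameterized helper they collapse into (all names and data here are invented):

```python
# "Before" (hypothetical): three exporters written months apart that
# share one latent structure -- fetch, filter, serialize:
#
#   def export_users(db):  return to_csv([r for r in db["users"] if r["active"]])
#   def export_orders(db): return to_csv([r for r in db["orders"] if r["status"] != "draft"])
#   def export_items(db):  return to_csv([r for r in db["items"] if r["stock"] > 0])
#
# "After": one parameterized helper that makes the shared structure explicit.

def to_csv(rows):
    """Toy serializer: comma-joined values, one row per line."""
    return "\n".join(",".join(str(v) for v in row.values()) for row in rows)

def export(db, table, keep):
    """The collapsed helper: fetch a table, filter with a predicate, serialize."""
    return to_csv([row for row in db[table] if keep(row)])

db = {
    "users": [{"name": "ada", "active": True}, {"name": "bob", "active": False}],
    "orders": [{"id": 1, "status": "draft"}, {"id": 2, "status": "paid"}],
}

assert export(db, "users", lambda r: r["active"]) == "ada,True"
assert export(db, "orders", lambda r: r["status"] != "draft") == "2,paid"
```

The execution here is trivial; the valuable step was noticing the three functions were one function. That noticing is the part the training data can't do for you.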
The first move is truly yours: propose the shape of the final structure of the code that you want. If you need three scattered functions collapsed into one parameterized helper, say so. If the right move is extracting a module, describe what belongs in it and what doesn't. The model will execute a restructuring you describe (with help), but left to its own devices it defaults to _adding_ code — a new function, a new wrapper, a new layer — rather than simplifying what's there.

Once you have the shape, the key move is to give the model some way of checking its work (an oracle) and to foreclose the possibility of ignoring that oracle (foreclosure). People are doing this intuitively --- they write plans in markdown documents so they can clear context and get the model back on track; they document acceptance criteria and goals carefully via "[spec-driven development](https://martinfowler.com/articles/exploring-gen-ai/sdd-3-tools.html)"; they put wild things in their system instructions about lives depending on all the tests actually passing; they tell the model to practice [red/green development](https://simonwillison.net/guides/agentic-engineering-patterns/red-green-tdd/). What we're trying to do is preemptively shape the landscape the model will navigate during generation, so that it sees the most plausible generations as ones that solve the actual problem, rather than ones that (1) solve a similar but better-attested problem from the training data or (2) ignore the results of our tests and parts of our specs. As programmers, we're at a disadvantage here too; the model has seen _a lot_ of posts about why it's fine that some test fails, or why it's okay to skip a tricky requirement under deadline pressure. We have a lot of community practices around ignoring our own oracles, and the LLMs have picked those up.
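Here's a minimal sketch of what an executable oracle can look like (the `slugify` task and the expected outputs are invented for illustration). Foreclosure is everything around it: running the checks automatically on every iteration, and telling the model the oracle file itself is off limits.

```python
import re

def slugify(title: str) -> str:
    """A candidate implementation the model proposes (hypothetical task:
    turn a title into a URL slug). This is the part that gets revised."""
    return re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")

# The oracle: acceptance checks written *before* generation, phrased as
# executable assertions rather than prose the model can reinterpret.
ORACLE = [
    ("Hello, World!", "hello-world"),
    ("  spaces  everywhere  ", "spaces-everywhere"),
    ("Already-Slugged", "already-slugged"),
]

for raw, expected in ORACLE:
    got = slugify(raw)
    assert got == expected, f"{raw!r} -> {got!r}, wanted {expected!r}"
```

The propose-check-revise loop runs this script after every generation; a candidate that fails the oracle never reaches you, and the model can't talk its way past an `AssertionError`.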
The engineering problem here is designing useful oracles with actual payoff (if the oracle takes more time to design than the problem would take to solve, it's pretty useless), and then designing our prompts so that they foreclose ignoring the oracle. This is not a solved problem, but seeing it in those terms makes it tractably crisp for me.

## Oh right, a conclusion

The above analysis offers me plausible explanations for a lot of confusing behaviors I see every day:

- Why do models sometimes ignore tests? Because a lot of communities of programmers sometimes ignore tests.
- Why is it so hard to get an LLM to copy an inelegant-but-battle-tested solution from one service to another (or to a new library)? Because other solutions to the problem are better represented in the training data, and it reproduces those instead.
- Why does context seem to degrade? Because the context is folded back into every prompt, and it includes your previously failed attempts, your corrections to those, and the approaches you abandoned. Each of these activates a different evaluative regime, and the model has to satisfy all of them simultaneously — a constraint-satisfaction problem that gets harder as the constraints multiply and conflict.

This also helps me see a path for human software engineers that I don't think LLMs will be able to walk on their own: those who stay in the field will have Zone 2 work creating valid oracles for models and foreclosing ignoring them, and Zone 3 work developing the abstractions that keep complex systems maintainable over time. It is a really different job from one dominated by Zone 1 and 2 work with a smattering of Zone 3 for a treat. I have no idea whether we'll need as many developers doing the work, or somehow more (hey, [Jevons](https://en.wikipedia.org/wiki/Jevons_paradox)).
But I would predict that current LLM technology will not eliminate the work outlined above, even with more compute and training data.