
Scaling LLMs to Larger Codebases

How do we scale LLMs to larger codebases? Nobody knows yet. But by understanding how LLMs contribute to engineering, we realize that investments in guidance and oversight are worthwhile.

Investing in guidance

When an LLM can generate a working, high-quality implementation in a single try, that is called one-shotting. This is the most efficient form of LLM programming.

[Image: An arrow hitting the center of a dart board.]

The opposite of one-shotting is rework. This is when you fail to get a usable output from the LLM and must manually intervene.2 This often takes longer than just doing the work yourself.

[Image: Multiple arrows on a dart board. They have all missed the center.]

So how do we create more opportunities for one-shotting? Better guidance.

Better guidance

LLMs are choice generators. Every set of tokens is a choice added to your codebase: how a variable is named, where a function lives, whether to reuse, extend, or duplicate existing functionality to solve a problem, whether Postgres should be chosen over Redis, and so on.

Often, these choices are best left up to the designer (e.g., via the prompt). However, it's not efficient to exhaustively list all of these choices in a prompt. It's also not efficient to rework an LLM output whenever it gets these choices wrong.

In an ideal world, the prompt captures only the business requirements of a feature. The rest of the choices are either inferable from context or already encoded.

Write a prompt library

A prompt library is a set of documentation that can be included as context for an LLM.

Writing this is simple: collate documentation, best practices, a general map of the codebase, and other context an engineer needs to be productive in your codebase.3

Making a prompt library useful requires iteration. Every time the LLM is slightly off target, ask yourself, "What could've been clarified?" Then, add that answer back into the prompt library.

A prompt library needs to strike the right balance between comprehensive and lean.
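As a concrete illustration, here is a minimal sketch of wiring a prompt library into an agent's context, assuming the library lives as plain markdown files under a hypothetical docs/prompts/ directory. The path, file names, and the build_context helper are illustrative, not a prescribed layout.

```python
# Minimal sketch: a prompt library is just plain files prepended to the task prompt.
# The docs/prompts/ location and file layout are hypothetical.
from pathlib import Path

PROMPT_LIBRARY = Path("docs/prompts")

def build_context(task_prompt: str) -> str:
    """Prepend every prompt-library document to the task prompt."""
    sections = [
        f"## {doc.stem}\n{doc.read_text()}"
        for doc in sorted(PROMPT_LIBRARY.glob("*.md"))
    ]
    return "\n\n".join(sections + [f"## Task\n{task_prompt}"])

if __name__ == "__main__":
    print(build_context("Add pagination to the /orders endpoint."))
```

Tools differ in how they pick this context up (some read a single convention file, others accept explicit attachments), but the iteration loop is the same: when the LLM is off target, edit these files, not the individual prompt.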

The environment is your context

A peer at Meta told me that they weren't in a position to make Zuckerberg's engineering-automation claims a reality. The reason: their codebase is riddled with technical debt. He wasn't surprised by this; Meta has (apparently) historically not prioritized paying down its debt.

Compare this to the mentality from the Cursor team:

I think ultimately the principles of clean software are not that different when you want it to be read by people and by models. When you are trying to write clean code you want to not repeat yourself, not make things more complicated than they need to be.

I think taste in code... is actually gonna become even more important as these models get better because it will be easier to write more and more code and so it'll be more and more important to structure it in a tasteful way.

This is the garbage-in, garbage-out principle in action. The utility of a model is bottlenecked by its inputs: the more garbage you feed it, the more likely it is to hallucinate.

Here's an LLM literacy dipstick: ask a peer engineer to read some code they're unfamiliar with. Do they understand it? Do they struggle to navigate it? If it's a module, can they quickly see what the module exposes? Do they know the implications of using a certain function and the side effects they must be aware of? No? Then the LLM won't either.

Here's another dipstick: ask an LLM agent to tell you how certain functionality works. You should know the answer before asking. Is its answer right? More importantly, how did it go about answering your question? Follow the LLM's trail and document its snags. You'll notice it tends to grep, ls, and cat its way through the repository. How can you give it a map so it isn't left to rediscover the codebase on each new prompt? And when a map can't be given, how do you make the codebase easier for it to navigate?
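One way to hand the agent that map, sketched below under the assumption of a Python codebase rooted at src/: walk the tree once and write each module's public functions and classes to a single file the agent can read, instead of leaving it to rediscover structure on every prompt. The paths and the CODEBASE_MAP.md name are illustrative.

```python
# Sketch: generate a codebase map an agent can read instead of grepping around.
# Assumes a Python project under src/; both paths are illustrative.
import ast
from pathlib import Path

SRC = Path("src")

def public_names(path: Path) -> list[str]:
    """Top-level functions and classes whose names don't start with an underscore."""
    tree = ast.parse(path.read_text())
    return [
        node.name
        for node in tree.body
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef))
        and not node.name.startswith("_")
    ]

def build_map() -> str:
    lines = ["# Codebase map (generated; do not edit by hand)"]
    for module in sorted(SRC.rglob("*.py")):
        exports = ", ".join(public_names(module)) or "(no public names)"
        lines.append(f"- {module}: {exports}")
    return "\n".join(lines)

if __name__ == "__main__":
    Path("CODEBASE_MAP.md").write_text(build_map())
```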

How you make the environment better suited for LLM literacy depends on the tech stack and domain, but general principles apply: modularity, simplicity, clear naming, encapsulated logic. Be consistent, and encode these conventions in your prompt library.

Investing in oversight

We need guidance and oversight. A 3-ton truck with a middle-schooler behind the wheel puts people in the hospital (and in jail). This is why the mentality of automating engineers is objectionable. We should be fostering our teams, not discarding them.

Remember, engineers operate on two timelines. As overseers of implementation, we must plan for the future of the codebase. If an LLM makes a choice, the overseer should be able to discern whether it was a good one or a bad one. For example, let's say the LLM opted to use Redis over Postgres to store some metadata. Was that a good choice? The overseer should know.

An investment in oversight is an investment in team, alignment, and workflows.

For team, it's worth investing in elevating everyone's design capabilities.

Design produces architecture. Architecture is a bet on the future: a bet that structuring a program a certain way will make future feature development easier.

Architects are usually made through experience. A career of shooting yourself in the foot builds intuition, and that intuition keeps new software from repeating old mistakes.

Aside: Some thoughts on how to grow design skills

Read books, blogs, and code. Watch conference talks. Replicate masterworks. Practice by writing programs that you don't know how to build.

On replicating masterworks:

  • Student artists are often required to replicate masterworks. A masterwork is an art piece that an expert artist made. Through the process of replicating this masterwork, an artist gains practical experience at the bleeding edge of art. (This experience also builds confidence, which is a bonus.)
  • The same is true for engineering. I've learned more by writing a programming language than I have from reading a hundred blog posts.
  • Why does this work?
    • You understand a layer of abstraction deeper than the layer you're working at (this is a Cantrill-ism, but I can't find the quote).
    • Masters use techniques and workflows that are best learned via practice. Thorsten Ball taught me how to break down large problems into tractable phases. Each phase had a contract and each contract could be tested.

On reading code:

  • A good way to expand your vocabulary of solutions.
  • From Steve Ruiz's V1 of TLDraw, I learned the patterns I later needed to implement session-based undo/redo for an internal tool at work.
  • Reading code from leaders in the field is also a good way to build taste.

Oversight is not only about architecture, but also temperament, alignment to values, and workflows. Operators need to be both technical and product experts. Without a deep understanding of the product, it's easy to accidentally build the wrong solution.

Automating oversight

Some design concerns can be checked programmatically.

Moving more implementation feedback from humans to computers improves the chance of one-shotting. Agents can get feedback directly from their environment (e.g., type errors).

Think of these as bumper rails. You can increase the likelihood of an LLM reaching the bowling pins by making it impossible to land in the gutter.
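Here's a minimal sketch of those rails, assuming a Python project that uses mypy and pytest (substitute your own stack's tools): run the checks the agent can't skip and hand any failure output straight back to it, so the environment produces the first round of feedback instead of a reviewer.

```python
# Sketch: environment-as-feedback. Run non-negotiable checks and collect their
# output so an agent can self-correct before a human reviews the change.
# mypy and pytest are assumptions; swap in whatever your stack uses.
import subprocess

CHECKS = [
    ["mypy", "src"],    # type errors
    ["pytest", "-q"],   # business-logic tests
]

def run_checks() -> str | None:
    """Return combined failure output to feed back to the agent, or None if clean."""
    failures = []
    for cmd in CHECKS:
        result = subprocess.run(cmd, capture_output=True, text=True)
        if result.returncode != 0:
            failures.append(f"$ {' '.join(cmd)}\n{result.stdout}{result.stderr}")
    return "\n\n".join(failures) or None

if __name__ == "__main__":
    feedback = run_checks()
    print(feedback or "All checks passed; nothing to feed back.")
```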

One way to do this is through writing safety checks. But what is safety? Safety is protecting your abstractions. Pierce's Types and Programming Languages has my favorite definition of safety:

Informally, though, safe languages can be defined as ones that make it impossible to shoot yourself in the foot while programming.

Refining this intuition a little, we could say that a safe language is one that protects its own abstractions.

Safety refers to the language's ability to guarantee the integrity of these abstractions and of higher-level abstractions introduced by the programmer using the definitional facilities of the language. For example, a language may provide arrays, with access and update operations, as an abstraction of the underlying memory. A programmer using this language then expects that an array can be changed only by using the update operation on it explicitly—and not, for example, by writing past the end of some other data structure.

We tend to write tests for business-logic but don't always write tests for architecture-logic. Some programming languages have facilities for this built in.
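As an example, here is a minimal sketch of an architecture-logic test, assuming a hypothetical layered Python codebase where modules under src/domain must never import from the app.web layer. The layer names are placeholders; the point is that the boundary is enforced by a script in CI rather than by a reviewer's memory.

```python
# Sketch of an architecture-logic test: fail the build if a domain-layer module
# imports from the web layer. The layer paths and names are hypothetical.
import ast
import sys
from pathlib import Path

DOMAIN = Path("src/domain")   # lower layer: should stay framework-free
FORBIDDEN = "app.web"         # upper layer: must not leak downward

def forbidden_imports(path: Path) -> list[str]:
    """Names imported from the forbidden layer by this module."""
    found = []
    for node in ast.walk(ast.parse(path.read_text())):
        if isinstance(node, ast.Import):
            found += [a.name for a in node.names if a.name.startswith(FORBIDDEN)]
        elif isinstance(node, ast.ImportFrom) and node.module and node.module.startswith(FORBIDDEN):
            found.append(node.module)
    return found

def main() -> int:
    exit_code = 0
    for module in sorted(DOMAIN.rglob("*.py")):
        names = forbidden_imports(module)
        if names:
            print(f"{module} imports {', '.join(names)}")
            exit_code = 1  # protect the abstraction: the build fails, not the reviewer
    return exit_code

if __name__ == "__main__":
    sys.exit(main())
```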

Addressing verification

[Image: A very simplified graphic of the design / implementation / verification cycle]

Design and implementation are only two pieces of a project's lifecycle. Verification, like code review or QA, is necessary for building quality software.

As the volume of work increases, our ability to ship that work becomes constrained by our ability to review it.

Here are some incomplete ideas for addressing the verification bottleneck:

That's it, for now

This was the third part of a series on LLMs in software engineering.

First, we learned what LLMs and genetics have in common (part 1). LLMs don't uniformly improve all facets of engineering; understanding which areas they do improve (part 2) is important for knowing where to focus our investments (part 3).