
Tool use and notation as shaping LLM generalization

Using notation, tools, or code doesn't make the model smarter. It makes the task simpler.

That’s the quiet trick behind “tool-using agents,” DSL prompting, and “just write code.” These prompts rearrange the work the model has to do so that the parts requiring real generalization get pushed into deterministic machinery—databases, runtimes, calculators, typecheckers—and the model is left with what it already does well: mapping intent onto a familiar "linguistic" interface that is well present in the training corpus.

I was thinking about this while reading Cameron Buckner and Raphael Millière’s Philosophy of Large Language Models, where they lean on François Chollet’s taxonomy of task generalization: local generalization (performing well on a known distribution for a task), broad generalization (handling novel instances across a family of related tasks), and extreme generalization (adapting to genuinely new domains and task types). The common assessment is basically: LLMs look strong on local generalization, uneven on broad, and weak on extreme.

The tool-use story slots into that picture in a way that’s easy to miss. Most people talk as if tools increase generalization—like we’ve bolted on intelligence. But what we’re really doing is moving the distribution boundary. We take an out-of-distribution problem and re-represent it until it becomes a sequence of in-distribution interface-mapping moves, with correctness carried by the machinery on the other side.

This is also what humans do. A pencil doesn’t raise your IQ; it turns mental arithmetic into a mechanical procedure. Leibniz notation didn’t “solve” differential equations by itself; it made them legible and manipulable. Notation and tools don’t make cognition deeper—they make the world simple in exactly the places we need it to be.

I want a name for this design move: generalization shaping—choosing representations, tool surfaces, and target languages that transform an out-of-distribution problem into a sequence of local, familiar compilations, while the hard parts are handled by systems that don’t have to generalize at all.

Two axes

There are two things worth tracking separately as we think about this.

The first is actuator expressivity: how powerful is the language or tool the model is targeting? A calculator has low expressivity. SQL is medium – it handles filtering, joining, and aggregation but not control flow or side effects. A full programming language like JS or Python is near-universal.

The second is model burden: how hard is the mapping from intent to action? This doesn't always correlate with expressivity the way you'd expect. SQL is more expressive than a set of bespoke tools, but natural-language-to-SQL (NL→SQL) is so heavily represented in training data that it can actually be easier for the model than chaining three simpler tools together, which requires multi-step planning and carrying state across calls.

The sweet spot – and the whole game of agent design, really – is maximizing actuator expressivity while minimizing model burden. Generalization shaping is the lever: choose representations, tool surfaces, and target languages so that the residual mapping stays local and familiar, while correctness is carried by the deterministic machinery on the other side.

Let's see how this plays out in practice.

A running example

Let's walk through a single question across five different system designs and watch how each one reshapes the generalization burden.

We have a product catalog in Markdown (rich text descriptions, SKUs, categories, keywords – imagine a few hundred products) and a sales log in CSV (date, store, SKU, units sold – imagine tens of thousands of rows). A user asks: "What's the total number of snow shovels sold?"

Here's an excerpt of what the data looks like:

## ArcticGrip 24" Aluminum Snow Shovel
- SKU: SNW-SHV-02
- Category: Snow Shovel
- Notes: Lightweight aluminum blade; ergonomic handle.

## SnowPusher 30" Poly Blade
- SKU: SNW-PUS-01
- Category: Snow Pusher
- Notes: Wide pusher for clearing driveways fast (not a scoop).

And the sales log:

date,store,sku,units_sold
2026-01-01,EAST-01,SNW-SHV-03,10
2026-01-01,EAST-01,SNW-SHV-02,12
2026-01-01,PROV-02,SNW-PUS-01,9
...

There are three snow shovel SKUs (SNW-SHV-01, -02, -03) and a snow pusher (SNW-PUS-01) that should not be counted. The correct answer is 148 units.

Rung 1: Unshaped task

The naive approach: load everything into the context window and ask. The model has to parse the Markdown, decide which products count as "snow shovels" (entity resolution – the pusher says "snow" in its name and keywords), extract the relevant SKUs, scan the CSV, match rows, and sum a column of numbers. That's schema induction, fuzzy entity resolution, and multi-digit arithmetic, all inside attention.

This works on toy data. It falls apart at scale – both because the context window fills up and because each of those sub-tasks is somewhere between broad and extreme generalization. The model is doing the join and the filtering and the computation and the disambiguation all at once, with no external verification of any step.
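The unshaped setup is easy to sketch: concatenate everything into one prompt and let the model carry every sub-task internally. A minimal sketch (the data snippets are abbreviated, and no LLM call is shown – the point is just the shape of the prompt):

```python
# Rung 1 sketch: the entire corpus is stuffed into one prompt. Parsing, entity
# resolution, joining, and arithmetic all happen inside attention.

def build_unshaped_prompt(catalog_md: str, sales_csv: str, question: str) -> str:
    return (
        "You are given a product catalog and a sales log.\n\n"
        f"CATALOG (Markdown):\n{catalog_md}\n\n"
        f"SALES (CSV):\n{sales_csv}\n\n"
        f"Question: {question}\n"
        "Answer with a single number."
    )

prompt = build_unshaped_prompt(
    '## ArcticGrip 24" Aluminum Snow Shovel\n- SKU: SNW-SHV-02 ...',
    "date,store,sku,units_sold\n2026-01-01,EAST-01,SNW-SHV-02,12 ...",
    "What's the total number of snow shovels sold?",
)
# The prompt grows linearly with the data, and no step is externally verified.
```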

Rung 2: Representation shaping

We pre-process: extract SKU and category from the Markdown and merge it into the CSV. Now the model sees:

date,store,sku,product_name,category,units_sold
2026-01-01,EAST-01,SNW-SHV-03,FrostBite 21in Steel Snow Shovel,Snow Shovel,10
2026-01-01,EAST-01,ICE-MLT-01,IceMelt Pro 20lb Bag,Ice Melt,22
2026-01-01,EAST-01,SNW-SHV-02,ArcticGrip 24in Aluminum Snow Shovel,Snow Shovel,12
...

"What counts as a snow shovel" is no longer a fuzzy inference step – it's a column value. The join is gone. The entity resolution is gone. The model still has to scan and sum, but we've already externalized two of the hardest sub-tasks by changing the notation. The context window is smaller too, because we dropped the Markdown.

This is generalization shaping via notation: we changed the representation so the residual task is easier.
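The preprocessing itself is deterministic code, not model work. A sketch with the standard library, on a toy two-product subset of the data above (the parsing regexes are illustrative, tied to this catalog's exact layout):

```python
import csv
import io
import re

# Rung 2 sketch: extract SKU/category from the Markdown catalog and join them
# onto the sales rows, so "what counts as a snow shovel" becomes a column value.

catalog_md = """\
## ArcticGrip 24" Aluminum Snow Shovel
- SKU: SNW-SHV-02
- Category: Snow Shovel

## SnowPusher 30" Poly Blade
- SKU: SNW-PUS-01
- Category: Snow Pusher
"""

sales_csv = """\
date,store,sku,units_sold
2026-01-01,EAST-01,SNW-SHV-02,12
2026-01-01,PROV-02,SNW-PUS-01,9
"""

def parse_catalog(md: str) -> dict:
    """Map SKU -> {name, category} from the Markdown product blocks."""
    products = {}
    for block in md.split("## ")[1:]:
        name = block.splitlines()[0].strip()
        sku = re.search(r"SKU: (\S+)", block).group(1)
        category = re.search(r"Category: (.+)", block).group(1).strip()
        products[sku] = {"name": name, "category": category}
    return products

def merge(md: str, sales: str) -> list:
    products = parse_catalog(md)
    rows = []
    for row in csv.DictReader(io.StringIO(sales)):
        info = products.get(row["sku"], {"name": "?", "category": "?"})
        rows.append({**row, "product_name": info["name"], "category": info["category"]})
    return rows

merged = merge(catalog_md, sales_csv)
# Filtering is now a string comparison on a column, not a fuzzy inference step.
shovel_units = sum(int(r["units_sold"]) for r in merged if r["category"] == "Snow Shovel")
```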

Rung 3: Tool decomposition

Instead of asking the model to be a database, give it three small tools: search_products(query), sales_by_sku(skus), and calculator(expr). A plausible trace:

Step 1 – search_products("snow shovel"):

sku,name,category
SNW-SHV-01,ArcticGrip 18" Poly Snow Shovel,Snow Shovel
SNW-SHV-02,ArcticGrip 24" Aluminum Snow Shovel,Snow Shovel
SNW-SHV-03,FrostBite 21" Steel Snow Shovel,Snow Shovel

Step 2 – sales_by_sku(["SNW-SHV-01","SNW-SHV-02","SNW-SHV-03"]):

SNW-SHV-01,4
SNW-SHV-02,98
SNW-SHV-03,46

Step 3 – calculator("4 + 98 + 46") → 148

Now the model's residual task is a local chain: map a user query to a search call, pass the results to a lookup, pass those results to a calculator. Each individual step is in-distribution. The chaining logic – question → search → filter → compute → answer – is a pattern the model has seen thousands of times. Correctness of the arithmetic is guaranteed by the calculator; correctness of the filtering is guaranteed by the database behind search_products. The model only has to get the mapping right.

But notice the model burden isn't zero: it has to plan three steps, carry state across them (the SKU list from step 1 feeds into step 2), and decide not to include the snow pusher. And importantly, the concept of "sum" is not integrated into the query tool – the model has to know it needs a separate calculator call and correctly transcribe the numbers.
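The tool surface itself can be sketched in a few lines. The three tool names come from the text; the backing data is a toy stand-in consistent with the running example, and a real deployment would of course use a database and a safe expression parser rather than in-memory dicts and `eval`:

```python
# Rung 3 sketch: three small deterministic tools; the model only chains them.

PRODUCTS = {
    "SNW-SHV-01": ('ArcticGrip 18" Poly Snow Shovel', "Snow Shovel"),
    "SNW-SHV-02": ('ArcticGrip 24" Aluminum Snow Shovel', "Snow Shovel"),
    "SNW-SHV-03": ('FrostBite 21" Steel Snow Shovel', "Snow Shovel"),
    "SNW-PUS-01": ('SnowPusher 30" Poly Blade', "Snow Pusher"),
}
SALES_TOTALS = {"SNW-SHV-01": 4, "SNW-SHV-02": 98, "SNW-SHV-03": 46, "SNW-PUS-01": 9}

def search_products(query: str) -> list:
    q = query.lower()
    return [{"sku": sku, "name": name, "category": cat}
            for sku, (name, cat) in PRODUCTS.items()
            if q in cat.lower()]  # category match keeps the snow pusher out

def sales_by_sku(skus: list) -> dict:
    return {sku: SALES_TOTALS.get(sku, 0) for sku in skus}

def calculator(expr: str) -> int:
    # Stand-in only: a production calculator tool would parse, not eval().
    return eval(expr, {"__builtins__": {}})

# The model's residual job is the chain itself, carrying state across calls:
hits = search_products("snow shovel")
totals = sales_by_sku([h["sku"] for h in hits])
answer = calculator(" + ".join(str(v) for v in totals.values()))
# answer == 148
```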

Rung 4: DSL compression

Collapse the three-step chain into a single target language whose semantics already include filtering, joining, and aggregation: SQL.

SELECT SUM(s.units_sold) AS snow_shovels_sold
FROM sales s
JOIN products p ON p.sku = s.sku
WHERE p.category = 'Snow Shovel';

Result: 148.

This is a single tool call. NL→SQL is heavily represented in training data, so the mapping is highly in-distribution – arguably more local than the three-step tool chain, even though SQL is a more expressive language. The model doesn't have to plan multiple steps, carry state, or decide which tool to call next. It compiles intent into one query, and the database handles everything else: the join, the filter, the sum. All deterministic, all verifiable.

SQL also gives the model something the bespoke tools didn't: the ability to create and reuse ad-hoc abstractions. It can define views, functions, and triggers. This is a consequential shift: we are now creating reusable actuators.
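Both moves – the one-shot query and the reusable view – are easy to demonstrate end to end. A sketch using an in-memory SQLite database (my choice for a self-contained demo) seeded with toy rows consistent with the running example:

```python
import sqlite3

# Rung 4 sketch: one SQL call answers the question; the database carries the
# join, the filter, and the sum deterministically.
db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE products (sku TEXT PRIMARY KEY, product_name TEXT, category TEXT);
CREATE TABLE sales (date TEXT, store TEXT, sku TEXT, units_sold INTEGER);
INSERT INTO products VALUES
  ('SNW-SHV-01','ArcticGrip 18in Poly Snow Shovel','Snow Shovel'),
  ('SNW-SHV-02','ArcticGrip 24in Aluminum Snow Shovel','Snow Shovel'),
  ('SNW-SHV-03','FrostBite 21in Steel Snow Shovel','Snow Shovel'),
  ('SNW-PUS-01','SnowPusher 30in Poly Blade','Snow Pusher');
INSERT INTO sales VALUES
  ('2026-01-01','EAST-01','SNW-SHV-01',4),
  ('2026-01-01','EAST-01','SNW-SHV-02',98),
  ('2026-01-02','PROV-02','SNW-SHV-03',46),
  ('2026-01-02','PROV-02','SNW-PUS-01',9);
""")

(total,) = db.execute("""
    SELECT SUM(s.units_sold) FROM sales s
    JOIN products p ON p.sku = s.sku
    WHERE p.category = 'Snow Shovel'
""").fetchone()
# total == 148

# The reusable-actuator move: name the intent once as a view, reuse it forever.
db.execute("""
    CREATE VIEW shovel_sales AS
    SELECT s.* FROM sales s JOIN products p ON p.sku = s.sku
    WHERE p.category = 'Snow Shovel'
""")
(total_again,) = db.execute("SELECT SUM(units_sold) FROM shovel_sales").fetchone()
```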

Rung 5: Programmable actuator

Beyond SQL – which is a DSL scoped to relational querying – models are really good at writing general-purpose code. So instead of a db_query() tool, give the model an eval() tool with DB access. For our simple question:

import pandas as pd

sales = pd.read_sql("SELECT * FROM sales", db)
products = pd.read_sql("SELECT * FROM products", db)
merged = sales.merge(products, on="sku")
total = merged[merged["category"] == "Snow Shovel"]["units_sold"].sum()
# → 148

For this particular question, this isn't obviously better than the SQL version – it's more verbose and the model burden is roughly the same. But the actuator is now qualitatively different, because we've entered the world of writing programs.

Suppose the user asks something slightly different: "How many snow shovels did we sell, and if any SKU is trending up while stock is low, send a heads-up to the ops channel."

This isn't a query anymore; it requires analytics (sales aggregation), operations (inventory lookup), decision logic (trend detection, threshold comparison), and action (sending a notification). SQL can handle the first part. It cannot do the rest.

With a full JS eval and a few host primitives — db.query(), inventory.getCurrentStock(), notify.send(), log.info() — the model can write:

// 1) Sales totals by SKU
const totals = db.query(`
  SELECT s.sku, p.product_name, SUM(s.units_sold) AS units
  FROM sales s JOIN products p ON p.sku = s.sku
  WHERE p.category = 'Snow Shovel'
  GROUP BY s.sku, p.product_name
`);

// 2) Trend: last 2 days vs previous days
const trend = db.query(`
  WITH daily AS (
    SELECT s.sku, s.date, SUM(s.units_sold) AS units
    FROM sales s JOIN products p ON p.sku = s.sku
    WHERE p.category = 'Snow Shovel'
    GROUP BY s.sku, s.date
  )
  SELECT sku,
    SUM(CASE WHEN date >= '2026-01-05' THEN units ELSE 0 END) AS last2,
    SUM(CASE WHEN date <  '2026-01-05' THEN units ELSE 0 END) AS prev
  FROM daily GROUP BY sku
`);

// 3) Current stock from the operational API
const stock = new Map(
  inventory.getCurrentStock().map(x => [x.sku, x.on_hand])
);

// 4) Decide and act
const alerts = [];
for (const row of totals) {
  const t = trend.find(x => x.sku === row.sku) || { last2: 0, prev: 0 };
  const onHand = stock.get(row.sku) ?? 0;
  if (t.prev > 0 && t.last2 / t.prev > 1.3 && onHand <= 12) {
    alerts.push({ sku: row.sku, name: row.product_name,
      onHand, last2: t.last2, prev: t.prev });
  }
}

if (alerts.length) {
  notify.send({
    channel: "ops-inventory",
    subject: "Snow shovel demand spike + low stock",
    body: alerts.map(a =>
      `- ${a.sku} (${a.name}): on_hand=${a.onHand}, `
      + `last2days=${a.last2}, prev=${a.prev}`
    ).join("\n")
  });
}

log.info({ totalSnowShovels: totals.reduce((s, r) => s + +r.units, 0), alerts });

SQL is used inside the program – it's great for the relational parts – but the model is now mixing analytics with operations, encoding business heuristics (what counts as a "spike"? what's "low stock"?), and producing side effects. It has authored a small program that could, for example, be run on a schedule without any LLM calls involved.

And yet the model burden for writing this is still largely local: it's a short, idiomatic Node-style script that glues together APIs. The pattern – query, loop, filter, notify – is ubiquitous in training data.
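What the host has to build for this rung is the constrained environment itself. A Python sketch of the same idea (the JS version above would use a sandboxed runtime; here the primitive names `sales_totals`, `current_stock`, and `notify_send` are invented stand-ins for the host APIs):

```python
# Rung 5 sketch: a programmable actuator with a whitelisted surface. The host
# exposes a few primitives and runs model-authored code against them.

sent = []

def notify_send(message: dict) -> None:
    sent.append(message)  # stand-in: a real host would post to an ops channel

HOST_PRIMITIVES = {
    "sales_totals": lambda: {"SNW-SHV-01": 4, "SNW-SHV-02": 98, "SNW-SHV-03": 46},
    "current_stock": lambda: {"SNW-SHV-01": 40, "SNW-SHV-02": 5, "SNW-SHV-03": 30},
    "notify_send": notify_send,
}

def run_model_program(source: str) -> dict:
    """Execute model-written code with only the whitelisted names visible."""
    env = {"__builtins__": {"sum": sum, "len": len}, **HOST_PRIMITIVES}
    exec(source, env)
    return env

# A program the model might author: total the units, then alert on low stock.
program = """
totals = sales_totals()
stock = current_stock()
total = sum(totals.values())
low = [sku for sku in totals if stock[sku] <= 12]
if low:
    notify_send({"channel": "ops-inventory", "skus": low})
"""
env = run_model_program(program)
# env["total"] == 148, and one alert fires for SNW-SHV-02 (5 on hand)
```

Note that restricting `__builtins__` is a convention, not a security boundary; a production system would run untrusted code in a proper sandbox.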

Design heuristics

If you're designing tool surfaces for LLM agents, the framing above suggests a few principles.

  1. Keep your interfaces at a narrow waist: stable schemas, typed inputs and outputs, small sets of composable primitives. The model's search problem scales with the degrees of freedom at the interface, so minimize them. A tool that takes {category: string, date_range: [start, end]} is easier to target than one that takes {query: any}.

  2. Tool outputs should be structured enough to be reliable but malleable enough that a three-line transform can reshape them – JSON and tables are the sweet spot. You don't need to anticipate every downstream use in advance; you need results that short code can bend.

  3. Let the model use a semantic DSL when one naturally fits. NL→SQL for relational data, NL→regex for pattern matching, NL→jq for JSON transforms. These mappings are heavily in-distribution and they externalize correctness to a well-tested runtime. Don't force the model through bespoke tool chains when a DSL compresses the whole chain into a single, verifiable expression.

  4. Use full eval() when you need control flow or business logic that isn't naturally relational. But constrain the environment: provide libraries, establish conventions, give the model a schema to work against. Unconstrained code generation is where local generalization breaks down.
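The "narrow waist" of heuristic 1 is concrete enough to sketch. A typed tool signature with few degrees of freedom, in Python (field and function names here are illustrative, not from the text):

```python
from typing import TypedDict

# Heuristic 1 sketch: a narrow-waist tool. The typed schema leaves the model
# little room to go wrong compared with a free-form {query: any} surface.

class SalesQuery(TypedDict):
    category: str
    date_range: tuple  # (start, end), ISO date strings

def total_units(q: SalesQuery) -> int:
    # Toy backing data; a real tool would query the sales database.
    rows = [
        ("2026-01-01", "Snow Shovel", 102),
        ("2026-01-02", "Snow Shovel", 46),
        ("2026-01-02", "Snow Pusher", 9),
    ]
    start, end = q["date_range"]
    return sum(units for date, cat, units in rows
               if cat == q["category"] and start <= date <= end)

# The model fills a small, typed structure instead of free-form arguments:
shovels = total_units({"category": "Snow Shovel",
                       "date_range": ("2026-01-01", "2026-01-31")})
# shovels == 148
```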

And finally: the whole point of generalization shaping is that you're not trying to make the model smarter. You're trying to make the task dumber – dumb enough that a very good pattern-matcher can handle it, while the hard parts are carried by machinery that doesn't need to generalize at all.