Three weeks ago, Leo de Moura posted a challenge: can AI learn to annotate theorems for Lean 4’s grind tactic? The idea is that AI should press buttons on existing automation, not synthesize proof terms from scratch.
I pointed Claude at Mathlib and let it run. After 25+ experiments, the agent found 44 annotations that let grind close 806 proofs across 20+ mathematical domains — group theory, ring arithmetic, power laws, set operations, list combinatorics, logic, ordering, lattices, quantifiers, modular arithmetic, function composition, and more. 100% pass rate across four test suites.
The breakdown:
- 170 synthetic benchmarks (Bench.lean) — all `sorry` replaced by `grind`
- 193 real Mathlib proofs (MathlibTest.lean) — from Algebra.Group.Basic, Algebra.Ring.Basic, Data.Set.*, and Logic.Basic, plus newly discovered domains
- 100 minimal self-contained proofs (Minimal.lean) — across 14 domains
- 343 native grind proofs (NewDomains.lean) — requiring zero annotations: CommRing, Lattice, quantifiers, modular arithmetic, Finset, Array, Set distributivity, List functor laws, function composition, Prod/Sum/Sigma types, and more
What the annotations look like
The 44 annotations are unremarkable: mul_inv_cancel, pow_succ, div_eq_mul_inv, etc. What matters is the selection. Mathlib has 200,000+ theorems. The agent found a minimal set that covers a wide goal space without triggering instantiation cascades.
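For concreteness, here is a sketch of what such annotations look like in Lean 4 (the specific modifier the agent chose for each lemma isn't listed in this post, so these are illustrative):

```lean
import Mathlib

-- Mark existing Mathlib lemmas as grind rules. `[grind =]` uses the
-- equation's left-hand side as the E-matching pattern; bare `[grind]`
-- lets grind pick a pattern heuristically.
attribute [grind =] pow_succ        -- a ^ (n + 1) = a ^ n * a
attribute [grind =] div_eq_mul_inv  -- a / b = a * b⁻¹
attribute [grind] mul_inv_cancel    -- a * a⁻¹ = 1
```

The selection problem is to pick lemmas whose patterns fire on realistic goals without feeding each other new instantiation targets.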
Only one annotation causes blowup: Nat.mul_le_mul with [grind ->], which triggers 100+ instantiation chains. This is inherent: any forward-reasoning rule with two <= antecedents blows up O(n^2) in the number of <= facts. This is exactly the constraint-design problem Leo described as Level 3.
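The quadratic behavior is visible from the rule's shape (signature from Lean core; the attribute line is a sketch of the annotation in question):

```lean
-- Nat.mul_le_mul : a ≤ b → c ≤ d → a * c ≤ b * d
-- As a forward rule, any ordered pair of ≤ facts in the context can
-- instantiate the two antecedents, so n facts allow up to n^2 firings,
-- and each derived product inequality can pair with the rest again.
attribute [grind ->] Nat.mul_le_mul
```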
The real finding: where grind works and where it doesn’t
I tested grind on production code (tinygrad Lean 4 formalization). Result: 1/15 pass. The failures aren’t annotation problems — they’re paradigm mismatches:
| Pattern | Count | Fixable via annotations? |
|---|---|---|
| Needs structural induction | 8 | No — grind has no induction |
| Custom defs not reduced | 4 | No — grind uses E-matching, not simp-style reduction |
| Needs helper lemmas | 2 | Maybe |
Grind excels at algebraic reasoning (groups, rings, modules) where proofs are equational rewrites. For computational libraries (compilers, data structures, ML frameworks) where proofs are primarily definitional unfolding + induction, simp remains the right tool.
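A minimal illustration of the induction gap, using a toy definition of my own rather than anything from the tinygrad formalization:

```lean
-- A recursive definition: proofs about it need structural induction,
-- which grind does not perform, and grind's E-matching does not
-- unfold the definition the way simp does.
def myApp : List α → List α → List α
  | [], ys => ys
  | x :: xs, ys => x :: myApp xs ys

-- `grind` alone does not close this goal; induction plus simp-style
-- unfolding does.
example (xs ys zs : List α) :
    myApp (myApp xs ys) zs = myApp xs (myApp ys zs) := by
  induction xs with
  | nil => rfl
  | cons x xs ih => simp [myApp, ih]
```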
Surprise: what grind handles natively
The biggest finding is that 343 proofs need zero annotations. Grind handles these domains out of the box:
- CommRing: distributivity, negation, squaring, cube expansion, difference of squares
- Bool/Prop: De Morgan, double negation, excluded middle, S/K combinators
- Set algebra: union, inter, comm, assoc, distributivity, empty, univ, subset
- Nat/Int arithmetic: ring identities, ordering, min/max, concrete computation
- Constructor theory: Option (some != none, map, bind, getD), Prod (eta, swap, map), Sum (injectivity, disjointness, elim)
- Type wrappers: PLift, ULift, Subtype, Sigma, PSigma, Unit, PUnit
- List: length, head?, tail, append, zip, map functor laws, cons/nil reasoning
- Finset: subset, union, intersection, membership, singleton
- Array: size, push
- Function: id, const, composition laws
- Lattice: idempotence, commutativity, associativity of sup/inf
- Ordering: constructor disjointness
- Quantifiers: forall/exists manipulation, specialization
Grind’s built-in Boolean algebra reasoning covers Set operations comprehensively. Its ring normalization handles CommRing identities including cube expansions. Constructor injectivity and disjointness are fully automatic.
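A few goals of this kind, reconstructed from the domain list above (the exact benchmark statements live in the repo; these are my paraphrases):

```lean
import Mathlib

-- Ring normalization: binomial expansion over Int, no annotations.
example (a b : Int) : (a + b) ^ 2 = a ^ 2 + 2 * a * b + b ^ 2 := by grind
-- Propositional reasoning: De Morgan.
example (p q : Prop) : ¬(p ∨ q) ↔ ¬p ∧ ¬q := by grind
-- Constructor disjointness: some vs none.
example (x : Nat) : some x ≠ none := by grind
```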
What grind can’t do (yet)
Some domains I tested fail:
- Nat bit ops (land/lor/xor): needs native_decide or decide
- Symbolic divisibility (a ∣ b -> b ∣ c -> a ∣ c): non-linear constraint
- Lattice order -> join/meet (a <= b -> a sup b = b): doesn't connect LE to sup/inf
- List.head (requires nonemptiness proof): type mismatch
- Nat.factorial: not available without extra imports
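For the bit-operation gap, the fallback is kernel evaluation on closed terms; a sketch with concrete values of my choosing:

```lean
-- grind does not reason about Nat bitwise ops symbolically, but
-- `decide` evaluates closed decidable propositions in the kernel.
example : 0b1100 &&& 0b1010 = 0b1000 := by decide  -- land: 12 AND 10 = 8
example : (6 ^^^ 3 : Nat) = 5 := by decide         -- xor: 6 XOR 3 = 5
```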
The two unsolved goals
bench_int_sq_nonneg (0 <= a * a for integers) needs mul_self_nonneg, but grind’s proof-by-contradiction negates the goal before the lemma can pattern-match. bench_group_pow_neg has a normalization gap where -(n:Int) normalizes to -1 * n. Both are Level 4 challenges — the research frontier.
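The first goal, reconstructed as a sketch (the manual proof is mine, not part of the benchmark):

```lean
import Mathlib

-- During proof by contradiction, grind asserts a * a < 0, so the
-- annotation mul_self_nonneg (0 ≤ a * a) never matches the goal
-- shape it was selected for.
example (a : Int) : 0 ≤ a * a := by
  exact mul_self_nonneg a  -- closes manually; `grind` alone does not
```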
Code: autoresearch-grind