Three weeks ago, Leo de Moura posted a challenge: can AI learn to annotate theorems for Lean 4’s grind tactic? The idea is that AI should press buttons on existing automation, not synthesize proof terms from scratch.
I pointed Claude at Mathlib and let it run. After 25+ experiments, the agent found 44 annotations that let grind close 806 proofs across 20+ mathematical domains — group theory, ring arithmetic, power laws, set operations, list combinatorics, logic, ordering, lattices, quantifiers, modular arithmetic, function composition, and more. 100% pass rate across four test suites.
The breakdown:
- 170 synthetic benchmarks (Bench.lean) — all `sorry` replaced by `grind`
- 193 real Mathlib proofs (MathlibTest.lean) — from Algebra.Group.Basic, Algebra.Ring.Basic, Data.Set.*, and Logic.Basic, plus newly discovered domains
- 100 minimal self-contained proofs (Minimal.lean) — across 14 domains
- 343 native grind proofs (NewDomains.lean) — requiring zero annotations: CommRing, Lattice, quantifiers, modular arithmetic, Finset, Array, Set distributivity, List functor laws, function composition, Prod/Sum/Sigma types, and more
What the annotations look like
The 44 annotations are unremarkable: mul_inv_cancel, pow_succ, div_eq_mul_inv, etc. What matters is the selection. Mathlib has 200,000+ theorems. The agent found a minimal set that covers a wide goal space without triggering instantiation cascades.
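For concreteness, here is a sketch of what such annotations look like in Lean 4 (the specific modifier the agent chose for each lemma isn't listed in this post, so these are illustrative):

```lean
import Mathlib

-- Mark existing Mathlib lemmas as grind rules. `[grind =]` uses the
-- equation's left-hand side as the E-matching pattern; bare `[grind]`
-- lets grind pick a pattern heuristically.
attribute [grind =] pow_succ        -- a ^ (n + 1) = a ^ n * a
attribute [grind =] div_eq_mul_inv  -- a / b = a * b⁻¹
attribute [grind] mul_inv_cancel    -- a * a⁻¹ = 1
```

The selection problem is to pick lemmas whose patterns fire on realistic goals without feeding each other new instantiation targets.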
Only one annotation causes blowup: Nat.mul_le_mul with [grind ->], which triggers 100+ instantiation chains. This is inherent: any forward-reasoning rule with two <= antecedents blows up O(n^2) in the number of <= facts. This is exactly the constraint-design problem Leo described as Level 3.
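The quadratic behavior is visible from the rule's shape (signature from Lean core; the attribute line is a sketch of the annotation in question):

```lean
-- Nat.mul_le_mul : a ≤ b → c ≤ d → a * c ≤ b * d
-- As a forward rule, any ordered pair of ≤ facts in the context can
-- instantiate the two antecedents, so n facts allow up to n^2 firings,
-- and each derived product inequality can pair with the rest again.
attribute [grind ->] Nat.mul_le_mul
```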
The real finding: where grind works and where it doesn’t
I tested grind on production code (tinygrad Lean 4 formalization). Result: 1/15 pass. The failures aren’t annotation problems — they’re paradigm mismatches:
| Pattern | Count | Fixable via annotations? |
|---|---|---|
| Needs structural induction | 8 | No — grind has no induction |
| Custom defs not reduced | 4 | No — grind uses E-matching, not simp-style reduction |
| Needs helper lemmas | 2 | Maybe |
Grind excels at algebraic reasoning (groups, rings, modules) where proofs are equational rewrites. For computational libraries (compilers, data structures, ML frameworks) where proofs are primarily definitional unfolding + induction, simp remains the right tool.
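A minimal illustration of the induction gap, using a toy definition of my own rather than anything from the tinygrad formalization:

```lean
-- A recursive definition: proofs about it need structural induction,
-- which grind does not perform, and grind's E-matching does not
-- unfold the definition the way simp does.
def myApp : List α → List α → List α
  | [], ys => ys
  | x :: xs, ys => x :: myApp xs ys

-- `grind` alone does not close this goal; induction plus simp-style
-- unfolding does.
example (xs ys zs : List α) :
    myApp (myApp xs ys) zs = myApp xs (myApp ys zs) := by
  induction xs with
  | nil => rfl
  | cons x xs ih => simp [myApp, ih]
```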
Surprise: what grind handles natively
The biggest finding is that 343 proofs need zero annotations. Grind handles these domains out of the box:
- CommRing: distributivity, negation, squaring, cube expansion, difference of squares
- Bool/Prop: De Morgan, double negation, excluded middle, S/K combinators
- Set algebra: union, inter, comm, assoc, distributivity, empty, univ, subset
- Nat/Int arithmetic: ring identities, ordering, min/max, concrete computation
- Constructor theory: Option (some != none, map, bind, getD), Prod (eta, swap, map), Sum (injectivity, disjointness, elim)
- Type wrappers: PLift, ULift, Subtype, Sigma, PSigma, Unit, PUnit
- List: length, head?, tail, append, zip, map functor laws, cons/nil reasoning
- Finset: subset, union, intersection, membership, singleton
- Array: size, push
- Function: id, const, composition laws
- Lattice: idempotence, commutativity, associativity of sup/inf
- Ordering: constructor disjointness
- Quantifiers: forall/exists manipulation, specialization
Grind’s built-in Boolean algebra reasoning covers Set operations comprehensively. Its ring normalization handles CommRing identities including cube expansions. Constructor injectivity and disjointness are fully automatic.
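A few goals of this kind, reconstructed from the domain list above (the exact benchmark statements live in the repo; these are my paraphrases):

```lean
import Mathlib

-- Ring normalization: binomial expansion over Int, no annotations.
example (a b : Int) : (a + b) ^ 2 = a ^ 2 + 2 * a * b + b ^ 2 := by grind
-- Propositional reasoning: De Morgan.
example (p q : Prop) : ¬(p ∨ q) ↔ ¬p ∧ ¬q := by grind
-- Constructor disjointness: some vs none.
example (x : Nat) : some x ≠ none := by grind
```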
What grind can’t do (yet)
Some domains I tested fail:
- Nat bit ops (land/lor/xor): needs native_decide or decide
- Symbolic divisibility (a ∣ b -> b ∣ c -> a ∣ c): non-linear constraint
- Lattice order -> join/meet (a <= b -> a sup b = b): doesn't connect LE to sup/inf
- List.head (requires nonemptiness proof): type mismatch
- Nat.factorial: not available without extra imports
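For the bit-operation gap, the fallback is kernel evaluation on closed terms; a sketch with concrete values of my choosing:

```lean
-- grind does not reason about Nat bitwise ops symbolically, but
-- `decide` evaluates closed decidable propositions in the kernel.
example : 0b1100 &&& 0b1010 = 0b1000 := by decide  -- land: 12 AND 10 = 8
example : (6 ^^^ 3 : Nat) = 5 := by decide         -- xor: 6 XOR 3 = 5
```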
The two unsolved goals
bench_int_sq_nonneg (0 <= a * a for integers) needs mul_self_nonneg, but grind’s proof-by-contradiction negates the goal before the lemma can pattern-match. bench_group_pow_neg has a normalization gap where -(n:Int) normalizes to -1 * n. Both are Level 4 challenges — the research frontier.
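The first goal, reconstructed as a sketch (the manual proof is mine, not part of the benchmark):

```lean
import Mathlib

-- During proof by contradiction, grind asserts a * a < 0, so the
-- annotation mul_self_nonneg (0 ≤ a * a) never matches the goal
-- shape it was selected for.
example (a : Int) : 0 ≤ a * a := by
  exact mul_self_nonneg a  -- closes manually; `grind` alone does not
```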
Code: autoresearch-grind