
Autoresearch for SAT Solvers

```
    _                    _   ____    _  _____
   / \   __ _  ___ _ __ | |_/ ___|  / \|_   _|
  / _ \ / _` |/ _ \ '_ \| __\___ \ / _ \ | |
 / ___ \ (_| |  __/ | | | |_ ___) / ___ \| |
/_/   \_\__, |\___|_| |_|\__|____/_/   \_\_|
        |___/
```

An autonomous AI agent that teaches itself to become the world's top expert on MaxSAT. Given 229 weighted MaxSAT instances from the 2024 MaxSAT Evaluation (main anytime weighted track), it discovers novel strategies, finds better solutions, and iteratively refines its toolbox. No human guidance.

  1. An AI agent (e.g. Claude Code) reads program.md for instructions
  2. It reads expert.md for accumulated knowledge from prior runs
  3. It reads the library for available tools
  4. It runs solvers on instances, discovers what works, updates everything
  5. It commits and pushes to this repo so other agents can build on its findings
```
                              ┌──────────────────┐
                              │    GitHub Repo   │
                              │                  │
                              │  expert.md       │
                              │  library/        │
                              │  best-solutions  │
                              │  experiments.log │
                              └────────┬─────────┘
                           git pull/push │
                 ┌─────────────┬────────┴────────┬─────────────┐
                 │             │                 │             │
          ┌──────▼──────┐ ┌───▼────────┐ ┌──────▼──────┐     ...
          │   VM  1     │ │   VM  2    │ │   VM  3     │
          │             │ │            │ │             │
          │ ┌─────────┐ │ │ ┌────────┐ │ │ ┌─────────┐ │
          │ │ Agent 1 │ │ │ │Agent 3 │ │ │ │ Agent 5 │ │
          │ │ ┌─┬─┬─┐ │ │ │ │┌─┬─┬─┐ │ │ │ │ ┌─┬─┬─┐ │ │
          │ │ │S│S│S│ │ │ │ ││S│S│S│ │ │ │ │ │S│S│S│ │ │
          │ │ └─┴─┴─┘ │ │ │ │└─┴─┴─┘ │ │ │ │ └─┴─┴─┘ │ │
          │ ├─────────┤ │ │ ├────────┤ │ │ ├─────────┤ │
          │ │ Agent 2 │ │ │ │Agent 4 │ │ │ │ Agent 6 │ │
          │ │ ┌─┬─┬─┐ │ │ │ │┌─┬─┬─┐ │ │ │ │ ┌─┬─┬─┐ │ │
          │ │ │S│S│S│ │ │ │ ││S│S│S│ │ │ │ │ │S│S│S│ │ │
          │ │ └─┴─┴─┘ │ │ │ │└─┴─┴─┘ │ │ │ │ └─┴─┴─┘ │ │
          │ └─────────┘ │ │ └────────┘ │ │ └─────────┘ │
          └─────────────┘ └────────────┘ └─────────────┘

          S = solver process (python)
```

```sh
# Launch on EC2 (handles everything: installs deps, clones repo,
# downloads benchmarks from Helsinki, launches agents in tmux)
./run.sh --host ec2-user@<ip> --agents 3
```

Requires a .env file with CLAUDE_CODE_API_KEY and GITHUB_ACCESS_TOKEN. The API key is auto-refreshed from your local Claude Code login on each deploy.

Multiple agents can work on the same repo simultaneously, communicating through git — each agent pulls the latest solutions and expert knowledge, builds on what others found, and pushes its own improvements. No coordination needed beyond git pull and git push.
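When two agents have both improved the shared solution store, reconciling their results comes down to keeping the cheaper solution per instance. A minimal sketch (the record layout `{instance: (cost, assignment)}` and the helper name `merge_solutions` are hypothetical, not the actual `library/solutions.py` API):

```python
def merge_solutions(local, remote):
    """Keep the lower-cost solution per instance.

    Hypothetical record layout: {instance_name: (cost, assignment)}.
    After `git pull`, an agent would merge the remote store into its
    own before pushing, so improvements from other agents survive."""
    merged = dict(remote)
    for name, (cost, assignment) in local.items():
        if name not in merged or cost < merged[name][0]:
            merged[name] = (cost, assignment)
    return merged

local = {"a.wcnf": (10, [1, -2]), "b.wcnf": (7, [1, 2])}
remote = {"a.wcnf": (12, [-1, 2]), "c.wcnf": (3, [2])}
merged = merge_solutions(local, remote)  # a.wcnf keeps the cost-10 solution
```

Because the merge is a per-key minimum, it is commutative and idempotent, which is what makes the "no coordination beyond git pull and git push" scheme safe.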

| Metric | Count |
| --- | --- |
| Instances solved | 220 / 229 |
| Optimal (matching competition best) | 30 |
| Better than competition | 5 |
| Novel solve (no known solution existed) | 1 |
| Within 1.1x of reference | 123 |
| Within 1.5x of reference | 183 |
| Within 2x of reference | 209 |
| Unsolved | 9 |
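The ratio buckets above are straightforward to compute from per-instance costs. A hypothetical scoring helper (not the repo's actual code; assumes cumulative buckets, i.e. an optimal solve also counts as "within 1.1x"):

```python
def ratio_buckets(our, ref):
    """Count cost-ratio buckets against reference costs.

    Hypothetical data layout: {instance_name: cost} for both dicts.
    Instances with no known reference are skipped."""
    counts = {"optimal": 0, "better": 0, "within_1.1x": 0, "within_2x": 0}
    for name, cost in our.items():
        r = ref.get(name)
        if r is None:
            continue  # no reference solution known for this instance
        if cost < r:
            counts["better"] += 1
        elif cost == r:
            counts["optimal"] += 1
        if cost <= 1.1 * r:
            counts["within_1.1x"] += 1
        if cost <= 2 * r:
            counts["within_2x"] += 1
    return counts

our = {"i1": 10, "i2": 16, "i3": 40}
ref = {"i1": 16, "i2": 16, "i3": 20}
print(ratio_buckets(our, ref))
# → {'optimal': 1, 'better': 1, 'within_1.1x': 2, 'within_2x': 3}
```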

Beat the 2024 MaxSAT Competition

| Instance | Our cost | Competition best | Improvement |
| --- | --- | --- | --- |
| switchingactivity_74 | 10 | 16 | 37.5% better |
| synplicate dag_run2_10_size_11 | 374 | 518 | 27.8% better |
| synplicate dag_run2_16_size_9 | 333 | 398 | 16.3% better |
| switchingactivity_68 | 8 | 9 | 11.1% better |
| BTBNSL hailfinder_10000 | 49,986,819,152 | 50,007,681,202 | 0.04% better |
| pseudoBoolean mod010 | 8,081 | no solution | novel solve |

The instances where our cost remains furthest from the reference:

| Instance | Ratio | Why it's hard |
| --- | --- | --- |
| relational-inference pa-1 | 603x | 2.5M vars, 1.1M soft clauses |
| twitter | 9.7x | 51K softs, 9.7M hard clauses |
| causal-discovery Water | 4.4x | 8.3M cost vs 1.9M ref |
| timetabling test4 | 3.3x | 131K vars |
| decision-tree tic-tac-toe | 2.6x | adaboost ensemble |

9 instances remain unsolved, mostly those with more than 16M variables or with no known reference solution.

The agent developed these approaches autonomously, discovering what works through experimentation:

| Technique | Best for | Key insight |
| --- | --- | --- |
| Greedy SAT with selector variables | Few soft clauses (<500) | Heaviest-first greedy with CaDiCaL assumptions |
| Core-guided search | Unit soft clauses | Iterative UNSAT-core relaxation; comp07.lp: 1778x → optimal |
| WPM1 core-guided | Weighted unit softs | Proper relaxation variables + at-most-one constraints |
| Biased-SAT | Breaking local optima | Random assumption subsets produce diverse solutions |
| Clause-weighting LS (SATLike) | Stuck at local optima | Dynamic weight adjustment escapes single-flip traps |
| Tabu search | No-hard / unit-soft instances | With SAT init + restarts; judgment-aggregation: 49x → 1.5x |
| Multi-init | Diverse starting points | Different solvers (CaDiCaL, glucose4, MiniCard) + random assumptions |
| Alternating CWLS + WalkSAT | Deep optimization | Alternating phases for continuous improvement; pa-1: 5445x → 612x |
| RC2 with CaDiCaL | Weighted unit softs | Previously dismissed, but solver='cd19' finds optimals; haplotyping-12: 3.8x → 1.01x |
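The clause-weighting idea behind the SATLike-style technique can be illustrated with a toy sketch (hypothetical, not the agent's `clause_weight_ls.py`): greedily flip the variable that most reduces a *dynamically* weighted cost, and when no single flip helps, bump the weights of the currently falsified clauses so the search can climb out of the trap.

```python
import random

def cw_local_search(n_vars, clauses, weights, steps=200, seed=0):
    """Minimal clause-weighting local search sketch for weighted MaxSAT.

    clauses: list of clauses, each a list of signed literals (DIMACS-style).
    weights: static soft weights; dynamic copies are bumped at local optima.
    Returns the best (lowest) true cost seen."""
    rng = random.Random(seed)
    assign = {v: rng.choice([True, False]) for v in range(1, n_vars + 1)}
    dyn = list(weights)  # dynamic weights start at the soft weights

    def sat(cl):
        return any((lit > 0) == assign[abs(lit)] for lit in cl)

    def cost(ws):
        return sum(w for cl, w in zip(clauses, ws) if not sat(cl))

    best = cost(weights)
    for _ in range(steps):
        base = cost(dyn)
        gains = []
        for v in range(1, n_vars + 1):  # score every single flip
            assign[v] = not assign[v]
            gains.append((base - cost(dyn), v))
            assign[v] = not assign[v]
        gain, v = max(gains)
        if gain > 0:
            assign[v] = not assign[v]  # take the best improving flip
        else:
            # local optimum: bump dynamic weights of falsified clauses
            for i, cl in enumerate(clauses):
                if not sat(cl):
                    dyn[i] += 1
        best = min(best, cost(weights))
    return best

# Tiny weighted instance: conflicting unit softs force a cost-3 optimum
print(cw_local_search(2, [[1], [-1], [2], [-1, -2]], [3, 2, 1, 4]))  # → 3
```

The O(n) rescoring of every flip is deliberate simplification; a real implementation keeps incremental per-variable scores.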

All code the agent writes lives in library/:

| Module | Functions | Purpose |
| --- | --- | --- |
| `solvers.py` | `greedy_sat`, `tabu_search`, `multi_init`, `sat_init`, `walksat_hard`, `walksat_soft`, `sat_solve_with_timeout` | Core solver building blocks |
| `core_guided.py` | `core_guided_solve`, `core_guided_budget`, `wpm1_solve` | UNSAT-core-based optimization for unit soft clauses |
| `clause_weight_ls.py` | `clause_weight_local_search` | SATLike-inspired dynamic clause weighting |
| `solutions.py` | `load_solutions`, `update_solution`, `get_best_costs` | Compressed solution storage (1.7GB → 1.5MB) |
| `wcnf_parser.py` | `parse_wcnf`, `evaluate_cost`, `check_hard_clauses` | Single-pass streaming WCNF parser |
Known limitations:

  • Low parallelism: Claude Code rarely launches more than 6 parallel scripts, and often runs just 1-2 at a time, leaving most cores idle on large machines.
  • Tunnel vision: The agent can fixate on grinding one instance for hours (e.g. pa-1 from 5445x to 612x over many rounds) while ignoring easier wins elsewhere.
  • Session length: Despite "never stop" instructions, the agent tends to wrap up after a few hours, deciding it has reached a natural stopping point.

The agent maintains expert.md as a living knowledge base and improves the library as it learns.