This experiment shows an AI agent autonomously improving a small GPT's training recipe. Using AutoResearch (Karpathy et al.) – which iteratively edits training code, runs experiments, and keeps only changes that lower validation bits-per-byte (BPB) – the agent ran 123 experiments over ~14 hours on a single H100 GPU. Each line traces a system's best BPB as experiments accumulate: Fugu-Ultra is in bold red (solid = mean over three seeds, dashed = best single run), with three frontier-model baselines (Model A, B, and C) faded behind it, and the callouts mark each new improvement the agent found on its own — spanning batch size, model depth, learning rates, and optimizer settings. Fugu-Ultra finishes with the best mean BPB (0.9774 ± 0.0019), ahead of Model C (0.9781), Model B (0.9793), and Model A (0.9822), and its best single run reaches 0.9748, leading every baseline. This suggests that orchestrating multiple strong models can outperform any individual frontier model on agentic ML research.
例1 — AutoResearch / LLM学習
This case study tests whether the reading order of classical Japanese kana letters (仮名消息) can be recovered — letters whose scattered chirashigaki ("scattered-writing") layout makes that genuinely hard even for trained readers of classical Japanese. Each model is given the character bounding boxes together with a rough set of reading-order rules, and writes code that outputs the order the characters should be read in; here it runs on a letter written in 1610 by Hōshun'in (芳春院, 1547–1617), scored by NED (a score based on normalized edit distance from an expert's ground-truth order, where 1.0 is a perfect match). Several frontier models were put through the identical pipeline, but none came close to Fugu-Ultra on this letter: Model A reached only NED 0.24 and Model B scored no better, both far below Fugu-Ultra's 0.80, while Model C produced no predictor at all. The clip shows the two extremes — each panel draws its predicted path in red over the expert's ground truth in green: Fugu-Ultra (top) traces the letter almost exactly, while Model A (bottom) jumps all over the page. (Letter held by the Keio Institute of Oriental Classics.)
例2 — 仮名消息の読み順推定
In this benchmark, each of Fugu-Ultra and 3 frontier models is given a single prompt to write a Rubik's Cube solver from scratch in pure Python — no off-the-shelf solving libraries allowed — and the resulting program is run locally on a held-out set of 300 randomly scrambled cubes. Solution quality is measured by the number of moves a solution uses, where lower is better. Fugu-Ultra and the frontier Model A wrote solvers that ran and solved all 300 cubes, while Model B and Model C each shipped sophisticated-looking code that crashed on execution and returned no valid solution at all (0/300). The clip follows cube #17: from the same scramble, Fugu-Ultra's solver reaches the solved state in 19 moves while Model A needs 21 — and across all 300 cubes Fugu-Ultra averages 19.72 moves versus 19.76 for Model A, both right at the optimal frontier, with Fugu-Ultra never a move longer than Model A on any cube (7 wins, 293 ties, 0 losses).
例3 — ルービックキューブ・ソルバー
Task: Create a mechanical iris in CAD, like a camera aperture, where multiple blades move together to open and close the central hole. For each model, we show both the generated detailed CAD itself and a simplified view that makes the structure easier to see. In the CAD generated by Fugu Ultra, the blades rotate around outer pins and clearly open and close the aperture. In contrast, the CAD generated by the other models shows problems such as gaps appearing, weak linkages, or the aperture not closing fully.
例4 — CAD メカニカルアイリス
Four blindfold chess games, back to back. Every model plays the same way — no board shown — holding the full game in memory. Fugu outplays four strong opponents: three leading frontier models and a 2100-Elo Stockfish engine, staying accurate where they drift and ending each game in checkmate.
例5 — 目隠しチェス
This benchmark uses a single anonymized equity over one historical 50-week window and is intended to compare sequential, no-look-ahead decision-making rather than to establish generalizable trading performance. Past performance does not guarantee future results, and results may not transfer to other assets, time periods, or live markets. Each model makes online trading decisions on anonymized STOCK_X, using only current and past weekly market data: opening, high, low, and closing prices, volume, returns, moving averages, volatility, drawdown, portfolio state, and prior feedback. Starting with $10,000, the agent chooses whether to buy, hold, or sell, and what fraction of cash or shares to trade. After each action, the next week's price is revealed and the portfolio is updated, so the model must adapt from feedback rather than seeing the future. Across five runs of the identical 50-week pipeline, Fugu-Ultra grew the portfolio to $11,943.22 ± $633.86, a +19.43% mean return, while the other frontier models reached their return less than +15%.
例6 — 株式トレーディング