
AIMO Prize 3 · Kaggle · March 2026

Project Ramanujan

The competition had been running for five months. Most teams started in November. I entered in the second week of March with nothing but a model checkpoint and an idea. Three weeks later, I had the highest public score on the leaderboard.

36/50 Public Score
3 Weeks Built
Solo Entry
116.8B Parameters (5.1B Active)
N=8 Parallel Samples
5h Wall Clock / 50 Problems

The Competition

AIMO Prize 3 was one of the most competitive AI challenges on Kaggle in 2026: a $2.2 million prize pool, international teams with months of preparation, and a problem set drawn from olympiad-level mathematics. The goal was to build a system that could solve 50 competition-grade math problems under a strict compute budget on Kaggle's T4x2 GPU environment. The competition opened in November 2025, and most serious contenders had been iterating since day one.

I did not start in November. I was not on a team. I entered in the second week of March 2026, roughly three weeks before the final deadline.

The reasoning was simple. I had been watching the leaderboard and studying the problem distribution. The gap between top scores was not about who had the best model -- nearly everyone was converging on a similar set of open-source reasoning models. The gap was in the infrastructure: how you prompted, how you sampled, how you executed tool calls, how you voted across samples. I realized the bottleneck was not the model. It was everything around the model. So I built it.

Three Weeks, Start to Score

Week 1 -- The Wall

The first 72 hours went entirely to getting vLLM to start on the Kaggle kernel. Not inference. Not prompting. Just getting the server to boot without crashing. FlashInfer threw an fp8 dtype assertion on the T4 hardware. The attention backend enum values had silently changed between vLLM 0.10.x and 0.11.x, so configurations that worked locally produced cryptic failures on Kaggle. KV cache misconfigurations caused silent OOM kills. A startup race condition meant the server would sometimes bind to the port before the model weights finished loading, causing the first request to hang indefinitely. Every one of these took hours to isolate because Kaggle kernels have limited logging and no interactive debugger.
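One of the fixes that ended the startup race was gating requests on a health check: never send the first request until the server actually answers one. A minimal sketch of that gate, assuming a generic probe callable (the function name and signature here are illustrative, not the project's actual code):

```python
import time

def wait_until_ready(probe, timeout_s=600.0, interval_s=2.0,
                     now=time.monotonic, sleep=time.sleep):
    """Poll `probe` until it returns True or the timeout expires.

    `probe` is any zero-argument callable that returns True once the
    server answers a health request (e.g. GET /health on the vLLM port).
    Exceptions from the probe are treated as "not ready yet".
    """
    deadline = now() + timeout_s
    while now() < deadline:
        try:
            if probe():
                return True
        except Exception:
            pass  # connection refused, model weights still loading, etc.
        sleep(interval_s)
    return False
```

The first inference request is sent only after this returns True; a False return aborts the run early instead of hanging indefinitely on a half-started server.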

Week 2 -- First Runs and the Regression Lesson

Once inference was stable, the stack started producing answers. Early reference runs on 10-problem subsets scored 7/10 with a clean prompt and majority voting. Feeling confident, I added a problem classifier and a policy book -- a set of strategy-selection rules based on detected problem type. The score dropped to 5/10. Two days of work, and the system was measurably worse. The classifier was miscategorizing geometry problems as algebra, which triggered the wrong reasoning chain, which cascaded into answer extraction failures.

I learned that additive complexity without controlled ablation is a trap. Every new component has to earn its place against the clean baseline, measured on the same problem set, with the same seeds.
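That rule -- every component earns its place against the clean baseline, on the same problems and seeds -- is simple to mechanize. A sketch of such a gate (a hypothetical helper, not the competition harness itself):

```python
def ablation_verdict(baseline, variant):
    """Compare two runs over the same problem set.

    `baseline` and `variant` map problem_id -> bool (solved or not).
    Returns (delta, regressions, gains), where delta is the net change
    in solved count and regressions lists problems the variant broke.
    """
    assert baseline.keys() == variant.keys(), "must use the same problem set"
    regressions = sorted(p for p in baseline if baseline[p] and not variant[p])
    gains = sorted(p for p in baseline if variant[p] and not baseline[p])
    return len(gains) - len(regressions), regressions, gains
```

Under this discipline, a new component ships only if the delta is positive and the regression list is empty -- the classifier from Week 2 would have failed both checks.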

Week 3 -- Convergence

Stripped the classifier. Went back to the clean Harmony protocol with tool-integrated reasoning. Refined the prompt to give the model maximal freedom while enforcing output structure. Tuned the voting layer. Ran full 50-problem submissions. The final system scored 36/50 on the public leaderboard -- the official highest public score posted during the competition. For context, the previous year's AIMO Prize 2 winning solution scored 34/50.

Architecture

The system was built as a sequential pipeline. Each problem passes through six stages, with N=8 parallel samples generated at the inference layer. The design prioritized robustness over cleverness -- every stage had to be independently testable and recoverable.

Inference Pipeline

Problem (LaTeX)
    |
    v
Prompt Construction (Harmony Protocol)
    |
    v
vLLM  (GPT-OSS-120B, 116.8B params, 5.1B active MoE)
    |
    +---> Sample 1 ---> Tool Exec (Jupyter) ---> Answer Detection
    +---> Sample 2 ---> Tool Exec (Jupyter) ---> Answer Detection
    +---> Sample 3 ---> Tool Exec (Jupyter) ---> Answer Detection
    +---> ...
    +---> Sample 8 ---> Tool Exec (Jupyter) ---> Answer Detection
    |
    v
Entropy-Weighted Majority Voting
    |
    v
Final Answer (integer)
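The stages above compose into a straightforward sequential flow. A skeleton with the stage functions injected (names and signatures are illustrative, not the project's actual interfaces):

```python
def solve(problem, build_prompt, sample_n, run_tools, extract, vote):
    """Top-level flow of the pipeline, with stage functions injected.

    build_prompt: str -> str         (Harmony template)
    sample_n:     str -> list[str]   (N=8 completions from vLLM)
    run_tools:    str -> str         (execute embedded code, splice output)
    extract:      str -> int | None  (answer detection)
    vote:         list[int] -> int   (entropy-weighted majority)
    """
    prompt = build_prompt(problem)
    answers = []
    for completion in sample_n(prompt):
        finished = run_tools(completion)
        ans = extract(finished)
        if ans is not None:
            answers.append(ans)
    return vote(answers) if answers else None
```

Keeping each stage a plain function was what made them independently testable: any stage can be swapped for a stub in the evaluation harness.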

Prompt Construction (Harmony): The Harmony protocol wraps each problem in a structured reasoning template that guides the model through problem comprehension, strategy selection, step-by-step solution, and verification. Crucially, it includes a tool-use preamble that teaches the model to write and execute Python code mid-reasoning when symbolic manipulation or numerical computation is needed.
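The exact Harmony wire format is not reproduced here; the sketch below only illustrates the shape of a structured template with a tool-use preamble and an enforced answer format (the template wording is hypothetical):

```python
# Hypothetical template -- illustrates the structure, not the real
# Harmony protocol's wording or special tokens.
TEMPLATE = """\
You are solving a competition mathematics problem.
You may write Python code blocks during your reasoning; they will be
executed and their output returned to you before you continue.

Problem:
{problem}

Work in order: restate the problem, choose a strategy, solve step by
step, then verify. Report the final integer answer as \\boxed{{N}}.
"""

def build_prompt(problem_latex: str) -> str:
    return TEMPLATE.format(problem=problem_latex)
```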

Tool-Integrated Reasoning (TIR): Each sample gets its own Jupyter kernel. When the model generates a code block during its reasoning chain, the code is intercepted, executed, and the output is injected back into the context. This allows the model to verify intermediate steps, compute large combinatorics, and catch its own arithmetic errors before committing to a final answer.
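The intercept-execute-inject loop can be sketched in a few lines. This toy version uses in-process `exec` purely for illustration -- per the description above, the real system ran each sample in an isolated Jupyter kernel with timeouts:

```python
import contextlib
import io
import re

FENCE = "`" * 3  # triple-backtick fence marker

def run_tool_calls(completion: str) -> str:
    """Execute each fenced python block in a completion and append its
    stdout after the block, mimicking output injection into context."""
    pattern = re.compile(FENCE + r"python\n(.*?)" + FENCE, re.DOTALL)

    def execute(match):
        code = match.group(1)
        buf = io.StringIO()
        try:
            with contextlib.redirect_stdout(buf):
                exec(code, {})  # real system: sandboxed kernel per sample
        except Exception as e:
            buf.write(f"ERROR: {e}")
        return match.group(0) + "\n[output]\n" + buf.getvalue()

    return pattern.sub(execute, completion)
```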

Entropy-Weighted Voting: Rather than simple majority voting, the system weights each sample's answer by an entropy-derived confidence signal. Samples where the model's reasoning chain was internally consistent and converged cleanly carry more weight than samples that oscillated between approaches.
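A minimal sketch of the idea, assuming each sample reports a mean token entropy and using exp(-H) as the confidence mapping (that mapping is an assumption for illustration; the competition's exact formula is not reproduced here):

```python
import math
from collections import defaultdict

def entropy_weighted_vote(samples):
    """samples: list of (answer, mean_token_entropy) pairs.

    Low mean entropy ~ the chain converged cleanly, so it gets more
    weight; exp(-H) is one simple monotone mapping from entropy to
    confidence (an illustrative choice).
    """
    scores = defaultdict(float)
    for answer, mean_entropy in samples:
        scores[answer] += math.exp(-mean_entropy)
    return max(scores, key=scores.get)
```

Under this scheme a confident minority can outvote a hesitant majority, which is exactly the behavior plain majority voting cannot express.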

Model Performance in Context

GPT-OSS-120B is a 116.8 billion parameter Mixture-of-Experts model with 5.1 billion active parameters per forward pass. Running it on Kaggle's two T4 GPUs required careful quantization and KV cache management to fit within the 30 GB combined VRAM budget.

In draft runs against the AIME 2025 problem set, the system with tool-integrated reasoning solved 97.9% of problems -- near-perfect performance on one of the hardest national-level high school competitions in the United States. To put that in perspective, that level of mathematical reasoning is roughly on par with consistently solving 5 out of 6 problems on an IMO paper, which is around the threshold for a gold medal at the International Mathematical Olympiad.

On the actual AIMO Prize 3 competition set, which draws from an even harder distribution, the system scored 36 out of 50. That was the highest public score recorded on the leaderboard.

36/50 -- AIMO 3 (2026) -- Project Ramanujan
34/50 -- AIMO 2 (2025) -- Winning Solution

Where It Breaks -- A Failure Taxonomy

Scoring 36 out of 50 means 14 problems were missed. Understanding why matters more than the score itself. Over the course of hundreds of draft runs, four distinct failure classes emerged.

F1 -- Attractor Traps

The model converges on a wrong answer through a clean, plausible path. The reasoning looks correct. The intermediate steps check out. But there is a subtle error early in the chain -- often a sign error or a miscounted case -- and the rest of the proof builds consistently on that mistake. Because the error is clean, multiple samples independently converge on the same wrong answer, and majority voting locks it in.

F2 -- Long-Horizon Failures

Problems requiring more than 15-20 reasoning steps. The model's accuracy degrades roughly exponentially with chain length. By the time it reaches the final computation, accumulated approximations and dropped constraints make the answer unreliable. Tool execution helps, but cannot compensate for conceptual drift in the reasoning itself.

F3 -- Existential Failures

Problems where the model fails to recognize the nature of what is being asked. Not a computation error -- a comprehension error. The model answers a related but different question, or misidentifies the mathematical domain entirely. These failures are invisible to the voting layer because all samples share the same misunderstanding.

F4 -- Generation-Miss Walls

The model never generates the correct approach in any of the 8 samples. Not because the approach is unknown to the model -- it may surface it in larger sample sizes -- but because the sampling temperature and top-p configuration happen to never explore that region of the solution space within 8 draws. This is a pure coverage failure: the correct answer exists in the model's distribution but is not sampled.
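This coverage failure is quantifiable: if the correct approach has per-sample probability p of being generated, the chance that all 8 independent draws miss it is (1 - p)^8 -- about 43% even at p = 0.1. A one-line check:

```python
def miss_probability(p_correct: float, n_samples: int = 8) -> float:
    """Probability that none of n independent samples hits the right approach."""
    return (1.0 - p_correct) ** n_samples
```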

The P10 Heartbreak

Of all 50 problems, Problem 10 taught me the most. It was a Hamiltonian paths problem -- a combinatorial question about counting specific traversals in a graph. Across 8 parallel samples, the results split: 4 out of 8 converged on 276, and 3 out of 8 converged on 552. The correct answer was 552.

The majority voting system locked in 276. The wrong answer won the popular vote.

I tried adjusting the voting threshold. I tried weighting by chain length, by code execution success rate, by final-step confidence. Each adjustment fixed P10 but broke other problems that had been scoring correctly. The fundamental issue was not the voting mechanism.

The wrong answer is simply more probable per sample. This is not an infrastructure problem. This is a model-level attractor -- the incorrect reasoning path has higher likelihood than the correct one, and no amount of voting-layer engineering can overcome that when the majority of samples converge on the same mistake.
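The P10 split makes the limit concrete. With a 4-3 split, plain majority picks 276; for any per-sample weighting to flip it, the 552 samples would need an average weight more than 4/3 times that of the 276 samples -- and reweightings that aggressive regressed other problems. In numbers:

```python
from collections import Counter

# P10: 4 of 8 samples converged on 276, 3 on 552 (the correct answer);
# the eighth sample's answer is not recorded in the writeup.
votes = [276, 276, 276, 276, 552, 552, 552]
assert Counter(votes).most_common(1)[0][0] == 276  # majority locks in the wrong answer

# Break-even weight ratio for the minority to win: 3 * w = 4 * 1  =>  w = 4/3
break_even = 4 / 3
assert abs(3 * break_even - 4 * 1) < 1e-9
```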

P10 crystallized the distinction between problems I could solve with better engineering and problems that required a fundamentally different model or approach. It was the clearest example of an F1 attractor trap in the entire competition set, and accepting that some problems were beyond the reach of the current architecture was part of knowing where to invest the remaining time.

The Stack

Everything was built from scratch. There was no starter kit, no shared team infrastructure, no prior competition codebase to fork from. The full list of components written over those three weeks:

Harmony Protocol

Structured reasoning prompt with tool-use preamble, chain-of-thought enforcement, and answer extraction

vLLM Integration

Custom server configuration for T4x2 with quantization, KV cache tuning, and health-check gating

Tool Execution Engine

Sandboxed Jupyter kernel pool with per-sample isolation, timeout management, and output injection

Voting Layer

Entropy-weighted majority voting with answer normalization and confidence-based tie-breaking

Answer Detection

Multi-pattern parser for extracting integer answers from natural language, LaTeX, and code output

Evaluation Harness

Offline test runner with reference problem sets, ablation tracking, and regression detection
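As one concrete example from the list above, the multi-pattern answer parser can be sketched as a priority-ordered list of regexes (the patterns shown are an illustrative subset, not the competition parser's full list):

```python
import re

# Patterns tried in priority order: LaTeX \boxed{}, natural-language
# "final answer", then a bare trailing integer from code output.
_PATTERNS = [
    re.compile(r"\\boxed\{(-?\d+)\}"),
    re.compile(r"final answer(?: is)?\s*[:=]?\s*(-?\d+)", re.IGNORECASE),
    re.compile(r"(-?\d+)\s*$"),
]

def extract_answer(text: str):
    """Return the first integer matched by any pattern, or None."""
    for pattern in _PATTERNS:
        m = pattern.search(text)
        if m:
            return int(m.group(1))
    return None
```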

Technologies

GPT-OSS-120B · vLLM · Harmony Protocol · Tool-Integrated Reasoning · Kaggle T4x2 · N=8 Majority Voting · FlashInfer · Jupyter Kernels · Python · MoE Inference

Reflection

Project Ramanujan was not about being fast. It was about understanding what the actual problem was. Most of the competition was not about mathematical reasoning -- the model handled that. The competition was about systems engineering: getting a 116-billion-parameter model to run reliably on consumer GPUs, building a tool execution layer that did not corrupt the reasoning chain, and designing a voting mechanism that surfaced the right answer from a noisy distribution of samples.

The three weeks forced a discipline that I think longer timelines sometimes erode. Every component had to justify itself against the baseline. Every hour had to be allocated to the highest-leverage problem remaining. There was no time for speculative architecture -- only measured, incremental improvement.

The 14 problems that were missed are not a failure. They are a map. They tell me exactly where the model's boundaries are, what types of reasoning still require fundamentally different approaches, and what the next generation of systems needs to solve. That map is, in some ways, more valuable than the score.

The score tells you where you are. The failures tell you where to go next.