Currently Active

Competition · AI Research · Reasoning

ARC AGI 3

Competing on the frontier of machine intelligence

The Abstraction and Reasoning Corpus exists to test what current AI cannot do. Every task is a unique visual puzzle — grids of colored cells where you must infer a transformation rule from two or three examples, then apply it to a test input you have never seen. No task appears twice. Brute force is useless. Memorization is useless. The only thing that works is genuine abstraction.

That is exactly why it is interesting. Most benchmarks reward scale — more parameters, more data, more compute. ARC rewards something else entirely. A child can solve many of these tasks on first sight. The largest language models in the world struggle with them. The gap between those two facts is where the real questions live.

"I'm drawn to this competition because it sits at the boundary of what we understand about intelligence. Every technique that fails tells you something about what's missing."

The Challenge

Novel Tasks, Every Time

Each puzzle in ARC is a one-off. The transformation might involve symmetry, counting, topology, object tracking, or some combination no one has named yet. You cannot train a model on ARC tasks and expect it to generalize — because generalization itself is the test. The evaluation set contains puzzles that share no surface similarity with the training set.

Few-Shot Reasoning

Two or three input-output pairs. That is all you get. The solver must extract the rule, verify it against the examples, and apply it to the held-out test grid. This is program synthesis under extreme data scarcity — closer to how humans solve IQ test questions than how neural networks typically learn.
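That verification step is simple to state in code. A minimal sketch (the `verify` and `solve` helpers are illustrative names, not from any ARC tooling): a candidate rule counts as a solution only if it reproduces every demonstration output exactly, and only then does it earn the right to touch the test input.

```python
# Hypothetical sketch of few-shot verification: grids are lists of rows,
# examples are (input_grid, output_grid) pairs.

def verify(candidate, examples):
    """True iff the candidate reproduces every demonstration exactly."""
    return all(candidate(inp) == out for inp, out in examples)

def solve(candidates, examples, test_input):
    """Apply the first candidate rule consistent with all demonstrations."""
    for rule in candidates:
        if verify(rule, examples):
            return rule(test_input)
    return None  # no candidate explains the examples
```

The exact-match requirement is what makes the data scarcity bearable: with only a few pairs, a rule that is merely *mostly* right is almost certainly the wrong abstraction.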

Mesa-Optimization Connection

There is a deeper thread here that connects to alignment research. If a model can learn to learn from examples in-context — constructing an internal optimization process on the fly — that is arguably a form of mesa-optimization. Understanding how and when this emerges is not just an academic question. It has implications for how we build systems we can trust.

What a Task Looks Like

  Example 1:                    Example 2:
  Input        Output           Input        Output
  . . . . .    . . . . .        . . . .      . . . .
  . R . . .    . R R R .        . B . .      . B B .
  . . . . .    . . . . .        . . . .      . . . .
  . . . . .    . . . . .

  Test Input:       Your answer:
  . . . . . .       ???
  . . G . . .
  . . . . . .
  . . . . . .

  Rule: extend the colored cell horizontally
  to fill its row, leaving the border columns
  blank. Can you see it from just two examples?
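Once you have seen the rule, it is easy to express as a program. A minimal sketch, assuming grids are lists of rows with "." as background (the function name is hypothetical):

```python
# Hypothetical sketch of the example task's rule: extend a lone colored
# cell to fill its row's interior, leaving the border columns blank.

def extend_horizontally(grid):
    out = [row[:] for row in grid]  # copy, don't mutate the input
    for r, row in enumerate(grid):
        colors = [c for c in row if c != "."]
        if len(colors) == 1:  # exactly one colored cell in this row
            # fill every interior column with that color
            out[r] = ["."] + [colors[0]] * (len(row) - 2) + ["."]
    return out
```

Writing the rule down is the easy half. The hard half, and the whole point of ARC, is the inference step that produced it from two examples.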

Approach

The competition is live on Kaggle, and the approach is still evolving. What I can say: pure LLM prompting hits a ceiling fast. The most promising directions combine program synthesis — searching over small programs that could explain the transformation — with neural guidance to prune the search space. The key insight is that the right representation matters more than the right model.
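One way to picture that combination: enumerate compositions of DSL primitives, trying the ones a guidance model scores highest first, and accept the first program consistent with every demonstration. A toy sketch with a hand-written prior standing in for the neural model (primitive names and the `search` function are illustrative, not the actual solver):

```python
from itertools import product

# Toy DSL of grid transformations; a real solver would have far more.
PRIMITIVES = {
    "flip_v": lambda g: g[::-1],                     # reverse row order
    "flip_h": lambda g: [row[::-1] for row in g],    # reverse each row
    "transpose": lambda g: [list(r) for r in zip(*g)],
}

def search(examples, prior, max_depth=2):
    """prior: primitive name -> score; a neural model would supply this."""
    names = sorted(PRIMITIVES, key=prior.get, reverse=True)
    for depth in range(1, max_depth + 1):
        for combo in product(names, repeat=depth):
            def program(g, combo=combo):
                for name in combo:          # apply primitives left to right
                    g = PRIMITIVES[name](g)
                return g
            if all(program(i) == o for i, o in examples):
                return combo  # first consistent program wins
    return None
```

Even this toy makes the representation point concrete: the search space is defined entirely by the primitive vocabulary, so a missing primitive is unrecoverable no matter how good the guidance is.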

Every failed attempt is data. Every technique that breaks on a specific puzzle class reveals a gap in the solver's abstraction vocabulary. The competition is as much about building understanding as it is about building systems.

Stack

Python · PyTorch · Kaggle · Program Synthesis · DSL Design · vLLM