ARC-AGI 3 Benchmark: What the New Test Reveals About the True Gap Between AI and Human Reasoning

The ARC-AGI 3 benchmark has landed — and the numbers are more striking than anything we've seen in AI evaluation this decade. The best AI model in the world scores **0.37%**. Humans score **100%**. That single data point demands a much longer conversation than a headline can hold.

This isn't a story about AI failing. It's a story about what that failure *means* — and why François Chollet engineered ARC-AGI 3 specifically to produce it. Understanding the design philosophy behind this benchmark is how we cut through both the doomer panic and the hype dismissal that inevitably follow these results. It also connects directly to the broader trajectory of the field, where capability gains are real but their nature is frequently mischaracterized.

---

Why Chollet Keeps Moving the Goalposts — And Why He's Right To

ARC-AGI-1 is, for practical purposes, solved. Gemini 3.1 Pro hits **98%** on the private set. That benchmark, introduced in 2019, was designed to test abstract reasoning through visual pattern tasks that couldn't be brute-forced with memorization. Within five years, the AI field found a way through — primarily via test-time compute scaling and test-time training, which pushed scores from near-zero to 53.5% by mid-2024, then to near-perfect by early 2026.

ARC-AGI-2 followed. Gemini 3.1 Pro reached **77.1%** in February 2026. Gemini 3 Deep Think climbed to **84.6%**, brushing the benchmark's ceiling. At that pace, ARC-AGI-2 would have been "solved" within months.

Chollet's response wasn't to patch the existing test. He redesigned the problem from scratch.

---

What Makes ARC-AGI 3 Structurally Different

The new benchmark isn't just harder. It's harder in a *specific and deliberate way* that targets the exact mechanisms AI uses to fake general intelligence.

ARC-AGI 3 contains **1,000+ levels across 150+ environments**. Critically, only **25 environments are public**, a sharp reversal from the roughly 10:1 public-to-private ratio of ARC-AGI-2. Another **110 environments are private**, split between 55 semi-private and 55 fully private. That ratio change is not incidental. It closes the door on the strategy of reverse-engineering evaluation patterns from the public portion of the test.

The scoring system is equally radical. ARC-AGI 3 uses **Relative Human Action Efficiency (RHAE)** — a metric that compares an AI's action count to the second-best human performance on the same level. Crucially, inefficiency is *squared*. If a model requires 10× the human action count to solve a level, it earns just **1%** of available credit for that level. This penalizes brute-force search in a way that raw accuracy scores never could.
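
To make that squaring concrete, here is a minimal sketch of the per-level credit calculation described above. It assumes credit is simply the squared ratio of the reference human action count to the model's action count, capped at full credit; the official ARC Prize formula may include details not covered here.

```python
def rhae_credit(human_actions: int, model_actions: int) -> float:
    """Illustrative per-level RHAE credit: squared efficiency relative to the
    human reference, capped at 1.0. A sketch of the description above, not
    the official ARC Prize scoring code."""
    if model_actions <= 0:
        raise ValueError("model_actions must be positive")
    efficiency = human_actions / model_actions
    return min(1.0, efficiency ** 2)

# A model needing 10x the human action count earns ~1% of the level's credit.
print(rhae_credit(human_actions=5, model_actions=50))  # 0.01
print(rhae_credit(human_actions=5, model_actions=5))   # 1.0
```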

The combined effect: more environments means less overfitting surface. Private-heavy ratios mean less pattern leakage. And RHAE means you can't just "try everything" and claim credit for eventually stumbling through.

It's worth remembering that even the most capable generative models and LLMs today operate fundamentally through pattern recognition over their training distributions — and ARC-AGI 3 is explicitly designed to sit outside those distributions.

---

Reading the Performance Curves: What 0.37% Actually Tells Us

The headline number — Gemini 3.1 Pro at **0.37%** versus humans at **100%** — sounds catastrophic for AI. But interpreting it requires nuance.

First, this isn't a static test. It's an *interactive* reasoning challenge. Models aren't just parsing images and selecting outputs; they're operating across environments that require sequential decision-making, causal inference, and adaptive strategy adjustment across many steps. That's a fundamentally different cognitive demand than standard LLM reasoning benchmarks.
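
To see how different that is from a static benchmark, here is a minimal sketch of an interactive evaluation loop. The environment and agent interfaces (the `reset`, `step`, `solved`, and `choose_action` methods) are hypothetical names invented for illustration, not the actual ARC Prize agent API.

```python
def run_episode(env, agent, max_actions: int = 1000):
    """Play one level interactively. The agent must observe, act, and adapt
    over many steps instead of mapping one static input to one output.
    All interfaces here are hypothetical, for illustration only."""
    observation = env.reset()                        # initial state of the level
    actions_taken = 0
    while not env.solved() and actions_taken < max_actions:
        action = agent.choose_action(observation)    # sequential decision-making
        observation = env.step(action)               # environment responds; strategy must adapt
        actions_taken += 1
    # The action count feeds directly into the RHAE score.
    return env.solved(), actions_taken
```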

Second, the RHAE scoring means partial performance is heavily discounted. A model that solves a level in 50 steps when a human does it in 5 receives almost no credit, even though it "solved" the problem. The gap in the score isn't just about *whether* AI can reason — it's about *how efficiently and generalizably* it reasons.

Third, and most importantly: humans didn't train on these environments. They approached them cold, using cognitive generalization — the ability to transfer reasoning strategies to genuinely novel situations. That's precisely what the Abstraction and Reasoning Corpus framework was built to isolate and test. Current LLMs, despite their extraordinary breadth of knowledge, don't demonstrate this form of transfer at scale.

Nicole Holliday, Associate Professor of Linguistics at UC Berkeley, has argued that "there is no such thing as general intelligence, artificial or natural," and anticipates progress in more realistic models like intrinsically motivated reinforcement learning. That framing is worth sitting with — because it reframes ARC-AGI 3's value. The benchmark isn't proving AI is "dumb." It's isolating *which specific cognitive capabilities* remain genuinely unscaled.

---

The Brute-Force Ceiling: Why Scaling Alone Won't Close the Gap

Every major AI evaluation breakthrough of the past three years has involved some form of compute amplification at inference time. Chain-of-thought prompting, test-time compute scaling, test-time training — these techniques drove ARC-AGI-1 from ~0% to ~98% over roughly six years. The AI community interpreted this as evidence that scaling continued to be the primary lever.

ARC-AGI 3 stress-tests that assumption directly.

The RHAE metric makes brute-force search expensive under the scoring function itself. The private-heavy environment split denies test-time training the benchmark exposure it depends on. And the sheer scale of 150+ environments makes pattern memorization across the full space computationally prohibitive, even at significant inference budgets.

Sam Altman has stated publicly that "in some big sense, ChatGPT is already more powerful than any human who has ever lived," citing its global reach and breadth of utility. That's a meaningful observation about *informational* power. But ARC-AGI 3 measures something different: adaptive efficiency under genuine novelty. On that dimension, the data suggests current systems are closer to zero than to human-level.

Dario Amodei of Anthropic has emphasized that the future of AI is fundamentally about alignment — making systems that are "truly beneficial at every level." That framing quietly acknowledges the gap ARC-AGI 3 measures: a system that can't generalize reliably under novel conditions is also a system whose alignment properties are harder to predict under novel conditions.

These ethical concerns and risks become significantly more acute as systems move from narrow task performance into genuinely agentic roles.

---

What the Prize Structure Signals About Expected Timelines

The ARC Prize 2026 competition offers a total of **$2 million** in prizes. The ARC-AGI 3 Grand Prize — awarded for achieving **100%** on the benchmark — carries **$700,000**. A guaranteed **$75,000** is allocated for top score awards regardless of absolute performance.

The prize structure is itself a signal. The $700K Grand Prize for 100% performance is structured as a genuine challenge, not an expected payout. Given that the current leading score is 0.37%, the organizers are clearly not anticipating near-term solutions. The guaranteed $75K top score award exists precisely because achieving any meaningful score is expected to be noteworthy in 2026.

Compare this to the 2024 ARC Prize competition, which drew **1,430 teams** competing on ARC-AGI-1 — a benchmark that is now effectively solved. The contrast illustrates how quickly the competitive landscape around AI evaluation has moved, and how deliberately ARC Prize has tried to stay ahead of saturation.

One useful reference point is the historical trajectory: ARC-AGI-1 went from ~0% to 98% in roughly six years. If ARC-AGI 3 follows a similar curve — and there's no guarantee it will — meaningful scores might not appear until 2028 or 2029. But that projection depends entirely on whether the next wave of AI architecture improvements targets the specific generalization deficit this benchmark measures.

---

What This Means for AGI Timelines — Cutting Through the Noise

The 0.37% score will be weaponized by two opposing camps. AI skeptics will use it to argue that AGI is decades away, or a philosophical impossibility. AI accelerationists will note that benchmarks always get solved eventually and dismiss the result as temporary.

Both interpretations miss the more valuable insight Chollet is providing.

ARC-AGI 3 is not an arbitrary difficulty increment. It's a principled attempt to isolate one specific capability: **efficient generalization to genuinely novel tasks under interactive conditions**. The benchmark is telling us that this capability does not currently scale with parameter count, context window size, or inference compute — at least not at any level we've seen deployed.

That's a precise diagnostic, not a verdict. It doesn't say AGI is impossible. It says the current dominant paradigm — scale transformers on internet text, then scale inference compute — has a measurable ceiling below human-level on this specific dimension.

Among the research directions most worth watching, according to UC Berkeley AI researchers, is work on child-like AI experimentation with the external world — approaches that build reasoning through interaction rather than through pattern compression over static corpora. Professor Alison Gopnik's work on how children build causal models through play is increasingly cited by AI researchers as a north star for the kind of generalization ARC-AGI 3 demands.

The implication for AGI timelines is this: if you believe AGI requires only more of what current systems do, ARC-AGI 3 suggests that path is stalling. If you believe AGI requires architectural novelty in how systems form and transfer generalizations, then the benchmark is a useful guide to what those architectures need to solve — and expert predictions of a path to AGI by 2030 should be read with this specific capability gap in mind.

---

Conclusion: The Benchmark Is the Argument

ARC-AGI 3 isn't just a new test. It's a public, quantified argument about what general intelligence evaluation should require — and a standing challenge to the assumption that capability gains in AI map smoothly onto cognitive generalization.

The 0.37% vs. 100% gap is not evidence that AI has failed. It's evidence that the question of general intelligence evaluation has been asked more precisely than before. That precision is valuable regardless of how quickly teams close the gap.

The $2 million prize pool will attract serious engineering effort. The private-heavy environment split will make gaming the benchmark genuinely difficult. And the RHAE metric will ensure that solutions require genuine efficiency, not just eventual success.

Watch who makes the first meaningful dent in ARC-AGI 3. The approach they use will tell us more about the trajectory of AGI development than any model release announcement this year.

**Stay ahead of AI — follow TechCircleNow for daily coverage.**

---

FAQ: ARC-AGI 3 Benchmark

**Q1: What is the ARC-AGI 3 benchmark?**
ARC-AGI 3 is the third iteration of François Chollet's Abstraction and Reasoning Corpus (ARC) benchmark, designed to evaluate general intelligence by measuring whether AI can efficiently solve genuinely novel interactive tasks — ones that resist brute-force approaches and require cognitive generalization similar to human reasoning.

**Q2: What score did the best AI model achieve on ARC-AGI 3?**
Gemini 3.1 Pro achieved **0.37%** on ARC-AGI 3 as of launch, compared to humans who scored **100%** on the same environments. That is a dramatic drop from ARC-AGI-2, where the same model reached 77.1%.

**Q3: Why did AI performance drop so sharply from ARC-AGI-2 to ARC-AGI-3?**
The drop is by design. ARC-AGI 3 introduces a private-heavy environment split (only 25 of 150+ environments are public), a new RHAE scoring metric that penalizes inefficient search, and fully interactive tasks — all of which specifically counter the test-time compute scaling and test-time training strategies that drove scores up on earlier versions.

**Q4: What is RHAE scoring and why does it matter?**
RHAE stands for Relative Human Action Efficiency. It scores AI performance relative to the second-best human's action count on the same level, and squares the penalty for inefficiency. A model taking 10× more actions than a human earns just 1% credit per level. This ensures that brute-force or trial-and-error approaches — common in current AI systems — produce near-zero scores.

**Q5: What is the prize for solving ARC-AGI 3?**
The ARC Prize 2026 offers a **$700,000 Grand Prize** for achieving 100% on ARC-AGI 3, plus a guaranteed **$75,000** for top-scoring entries regardless of absolute performance, within a total competition pool of **$2 million** spanning both ARC-AGI-3 and ARC-AGI-2 tracks.