Why the ARC-AGI-2 Test is Breaking AI Models


The Arc Prize Foundation, co-founded by renowned AI researcher François Chollet, has launched a groundbreaking new test designed to push the limits of artificial intelligence. Known as the ARC-AGI-2 test, this advanced benchmark is raising tough questions about just how close current AI models are to achieving true general intelligence.

According to the Arc Prize Foundation’s latest blog post, most top-performing AI models struggled badly on the ARC-AGI-2 test. AI systems that claim reasoning abilities, including OpenAI’s o1-pro and DeepSeek’s R1, only managed scores between 1% and 1.3%. Meanwhile, powerful non-reasoning models like GPT-4.5, Claude 3.7 Sonnet, and Gemini 2.0 Flash hovered around 1% — a clear indication that this new benchmark sets a higher bar for AI intelligence.

The ARC-AGI-2 test presents AI models with visual puzzles that require identifying patterns within grids of multi-colored squares. What makes this test so challenging is that it forces models to solve problems they’ve never encountered before, eliminating the advantage of pre-training or brute force memorization.
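To make the task format concrete, here is a minimal sketch of how an ARC-style puzzle can be represented: grids are 2D arrays of color indices, and each task supplies a few train input/output pairs from which the solver must infer the transformation and apply it to a test input. The grids and the "mirror" rule below are hypothetical illustrations, not actual ARC-AGI-2 tasks.

```python
# Toy ARC-style task: grids are lists of rows, cells are color indices.
# The solver sees the "train" pairs and must infer the hidden rule.

def mirror(grid):
    """Candidate rule: reflect each row left-to-right."""
    return [row[::-1] for row in grid]

task = {
    "train": [
        {"input": [[1, 0], [2, 3]], "output": [[0, 1], [3, 2]]},
        {"input": [[4, 4, 0]], "output": [[0, 4, 4]]},
    ],
    "test": {"input": [[5, 0, 7]]},
}

# A candidate rule is accepted only if it reproduces every train pair exactly.
if all(mirror(p["input"]) == p["output"] for p in task["train"]):
    prediction = mirror(task["test"]["input"])
    print(prediction)  # [[7, 0, 5]]
```

The difficulty in the real benchmark is that the rule is novel for every task, so nothing like `mirror` can simply be memorized in advance.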

To set a human baseline, the foundation had over 400 participants attempt the test. On average, human teams answered 60% of the questions correctly, outperforming every AI model tested. This stark difference highlights the complexity of the ARC-AGI-2 benchmark and how far AI still lags behind human general intelligence in adapting to unfamiliar tasks.

Chollet, sharing his thoughts on X (formerly Twitter), emphasized that the new version significantly improves on its predecessor, ARC-AGI-1, which suffered from loopholes that allowed models to succeed using vast computing resources rather than true reasoning.

The ARC-AGI-2 test now incorporates a crucial new metric — efficiency — to measure not just whether an AI can solve problems, but how effectively it can acquire new skills with limited computing power. Models are required to interpret novel patterns in real time, so memorization and reliance on training data offer no advantage.

As Arc Prize Foundation co-founder Greg Kamradt explained, “True intelligence isn’t just about solving problems or hitting high scores. It’s about how efficiently those abilities are learned and used. The real question is, can AI solve tasks and at what cost?”

The need for such an unsaturated, efficiency-focused AI benchmark has become more pressing as existing evaluations grow outdated. Hugging Face’s co-founder, Thomas Wolf, recently pointed out that the AI industry lacks strong tests to measure core artificial general intelligence traits like creativity and reasoning agility.

ARC-AGI-1, the earlier version of the test, remained undefeated for nearly five years until December 2024. That’s when OpenAI’s o3 model finally matched human-level performance, scoring 75.7% on the test. However, achieving that milestone came at a massive computational cost, raising concerns about scalability and efficiency.

The picture looks even grimmer for the same model on ARC-AGI-2. OpenAI’s o3 (low) model, which previously dominated ARC-AGI-1, scored a mere 4% on the new test, burning through $200 worth of compute per task.

The Arc Prize Foundation also released a performance comparison chart, which paints a clear picture: even frontier AI models falter dramatically when faced with the ARC-AGI-2’s tougher requirements.

To raise the stakes, the foundation has launched the Arc Prize 2025 contest, offering developers a fresh challenge: achieve 85% accuracy on ARC-AGI-2 while keeping the compute cost per task under $0.42. This bold prize is designed to drive breakthroughs not just in accuracy, but in efficient AI reasoning.
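A back-of-envelope calculation, using only the figures cited in this article, shows how far the reported o3 (low) result sits from the Arc Prize 2025 target:

```python
# Figures as reported in this article: o3 (low) scored 4% at ~$200 per
# task; the Arc Prize 2025 threshold is 85% at under $0.42 per task.
o3_score, o3_cost = 0.04, 200.00
target_score, target_cost = 0.85, 0.42

accuracy_gap = target_score - o3_score   # percentage points still needed
cost_gap = o3_cost / target_cost         # required per-task cost reduction

print(f"accuracy gap: {accuracy_gap:.0%} points")
print(f"required cost reduction: ~{cost_gap:.0f}x")
```

In other words, winning the prize would take roughly an 81-point accuracy improvement while cutting per-task compute cost by several hundred times.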

As the AI industry continues racing toward general intelligence, the arrival of ARC-AGI-2 signals a turning point. It serves as a tough reminder that real intelligence isn’t just about brute force or big datasets — it’s about agility, adaptability, and cost-effective reasoning. Whether any model can rise to this challenge remains to be seen.
