Humanity’s Last Exam: The Ultimate Test for AI Knowledge

Humanity’s Last Exam (HLE) is a rigorous new benchmark designed by the Center for AI Safety and Scale AI to push large language models to their limits. Current top-performing models like ChatGPT, Gemini, and DeepSeek score only 3–14%, but researchers expect rapid improvement by the end of 2025. The exam highlights AI’s current limitations, emphasizing the importance of uncertainty measurement and the potential for future breakthroughs in reasoning and academic knowledge.


The AI Maker

8/7/2025 · 2 min read

Humanity’s Last Exam challenges AI like ChatGPT and Gemini with near-impossible academic questions, revealing current limits.

Artificial intelligence has evolved at a pace that makes human progress look glacial. While our brains have taken millions of years to develop, today’s large language models (LLMs) like ChatGPT are improving at breakneck speed. But a new academic challenge, dubbed Humanity’s Last Exam (HLE), is pushing AI to its limits—forcing it to prove that it can master the toughest human knowledge without simply guessing or memorizing.

The HLE is the brainchild of the Center for AI Safety, a nonprofit dedicated to reducing the societal risks of advanced AI, and Scale AI, a for-profit company that supplies training data to some of the world’s most powerful AI systems. Their joint mission: design an exam so difficult that even the most advanced AI models stumble, providing a clearer picture of where these systems truly stand in their intellectual development.

According to the research posted on arXiv, the HLE consists of thousands of multiple-choice and short-answer questions spanning a wide range of disciplines. Roughly 41% of the test is math, with other sections covering biology, medicine, computer science, physics, chemistry, engineering, and even humanities. Imagine asking an AI to translate an ancient Roman inscription or identify the exact number of paired tendons in a hummingbird wing—this is the kind of hyper-specific, unforgiving knowledge HLE demands.

So far, the results have been humbling. State-of-the-art LLMs, including Google’s Gemini and the emerging DeepSeek model, have scored in the 3% to 14% range. Part of this comes down to the nature of the questions: they are carefully designed to be unambiguous, impossible to answer via a quick internet retrieval, and easy to automatically grade. Even with these dismal scores, experts predict a dramatic improvement, with many models expected to cross 50% accuracy by the end of 2025.

The grading itself is handled by another AI, OpenAI’s GPT-4o, which can recognize correct answers even if they are phrased slightly differently. In other words, a response like “T. rex” would still count for “Tyrannosaurus rex.” Researchers are also exploring a next step in AI training: teaching models to admit uncertainty. Instead of confidently producing a wrong answer, future LLMs may provide a confidence rating from 0 to 100%, an ability that could be critical in fields like medicine, law, and autonomous systems.
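To make that grading step concrete, here is a minimal Python sketch of the LLM-as-judge pattern the article describes. The prompt wording, function name, and response parsing are illustrative assumptions rather than the benchmark’s actual code; only the general idea, sending the question, reference answer, and submitted answer to GPT-4o and reading back a verdict, comes from the source.

```python
# Minimal sketch of LLM-as-judge grading, assuming the OpenAI Python SDK
# (openai >= 1.0). Prompt text and parsing are illustrative, not HLE's code.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

JUDGE_PROMPT = """You are grading an exam answer.

Question: {question}
Reference answer: {reference}
Submitted answer: {submission}

Reply with exactly one word: "correct" or "incorrect". Treat paraphrases
and standard abbreviations of the reference (e.g. "T. rex" for
"Tyrannosaurus rex") as correct."""


def judge_answer(question: str, reference: str, submission: str) -> bool:
    """Ask a judge model whether a free-form answer matches the reference."""
    response = client.chat.completions.create(
        model="gpt-4o",
        temperature=0,  # deterministic grading
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(
                question=question, reference=reference, submission=submission
            ),
        }],
    )
    verdict = response.choices[0].message.content.strip().lower()
    return verdict.startswith("correct")


# Example: a paraphrased answer still counts as correct.
# judge_answer("Which dinosaur...?", "Tyrannosaurus rex", "T. rex")  # -> True
```

A companion step would ask the answering model itself for a 0-to-100 confidence score alongside each answer, so its calibration could be measured against these verdicts.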

The bigger picture is clear: benchmarks like HLE are intentionally designed to be out of reach, but history suggests that LLMs catch up fast. As the researchers noted, “recent history shows benchmarks are quickly saturated,” and what seems impossible for AI today may become trivial tomorrow.

For now, AIs are failing HLE spectacularly—but that’s the point. Each wrong answer represents a step toward better understanding AI’s limits and how to train it responsibly. While machines can’t yet feel the sting of failure, the clock is ticking on how long Humanity’s Last Exam will truly remain unpassable.

Cited: https://www.popularmechanics.com/science/a64218773/ai-humanitys-last-exam/