Exploring the Evolution of AI Evaluations: A Software Engineer's Perspective

AI evaluations, or 'evals,' are crucial for understanding the capabilities and limitations of advanced AI systems, which have shown rapid progress in recent years. New, more challenging evals like FrontierMath and Humanity's Last Exam are being developed to keep pace with AI advancements and assess potential risks in areas such as cybersecurity and bioterrorism. Despite the complexity and cost of designing effective evals, they are essential for ensuring the safety and reliability of AI systems as they continue to evolve.

The AI Maker

4/11/2025 · 3 min read

[Image: man working with AI computer]

As a software engineer, I’ve always been fascinated by the rapid advancements in artificial intelligence (AI). One of the most intriguing aspects of AI development is the process of evaluating these systems to understand their capabilities and limitations. Despite the expertise of AI developers, it’s not always clear what their most advanced systems can do right off the bat. This is where evaluations, or ‘evals,’ come into play. These tests are designed to push AI systems to their limits and reveal their true potential.

In recent years, the field of AI has seen unprecedented progress. Today’s AI systems are achieving top scores on many popular tests, such as the SATs and the U.S. bar exam. This rapid improvement has made it challenging to gauge just how quickly these systems are advancing. In response, a new set of more challenging evals has emerged, created by companies, nonprofits, and governments. These advanced evals are designed to keep pace with the astonishing progress of AI systems.

One notable example is the FrontierMath benchmark, developed by the nonprofit research institute Epoch AI in collaboration with leading mathematicians. When it launched, available models scored only 2% on its set of exceptionally challenging math questions. Just one month later, OpenAI’s newly announced o3 model achieved a score of 25.2%, far exceeding expectations. This rapid progress highlights the potential of these new evals to help us understand the capabilities of advanced AI systems and to serve as early warning signs for potential risks in areas like cybersecurity and bioterrorism.

In the early days of AI, capabilities were measured by evaluating a system’s performance on specific tasks, such as classifying images or playing games. The time between a benchmark’s introduction and an AI matching or exceeding human performance was typically measured in years. For example, it took five years for AI systems to surpass humans on the ImageNet Large Scale Visual Recognition Challenge, established by Professor Fei-Fei Li and her team in 2010. Similarly, it was only in 2017 that an AI system (Google DeepMind’s AlphaGo) was able to beat the world’s number one ranked player in Go, an ancient Chinese board game.

However, the gap between a benchmark’s introduction and its saturation has decreased significantly in recent years. For instance, the GLUE benchmark, designed to test an AI’s ability to understand natural language, was considered solved just one year after its debut in 2018. In response, a harder version, SuperGLUE, was created in 2019, and within two years, AIs were able to match human performance across its tasks.

Evals now take many forms, and their complexity has grown alongside model capabilities. Major AI labs systematically test their models before release, assessing their ability to produce harmful outputs, bypass safety measures, or engage in undesirable behavior. In 2023, companies including OpenAI, Anthropic, Meta, and Google made voluntary commitments to the Biden administration to subject their models to both internal and external red-teaming in areas such as misuse, societal risks, and national security concerns.

Other tests assess specific capabilities, such as coding, or probe for potentially dangerous behaviors like persuasion, deception, and assistance with large-scale biological attacks. One popular contemporary benchmark is Measuring Massive Multitask Language Understanding (MMLU), which consists of about 16,000 multiple-choice questions spanning academic domains like philosophy, medicine, and law. OpenAI’s GPT-4o, released in May 2024, achieved 88%, while the company’s o1 model scored 92.3%.
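To make the arithmetic behind those percentage scores concrete, here is a minimal sketch of how a multiple-choice eval in the style of MMLU can be scored. The Question structure and the dummy model below are illustrative assumptions of mine, not OpenAI’s or MMLU’s actual evaluation harness.

```python
# Minimal sketch of scoring an MMLU-style multiple-choice eval.
# The Question structure and the dummy model are illustrative
# assumptions, not the official MMLU harness.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Question:
    prompt: str
    choices: list[str]  # e.g. four options labeled A-D
    answer: int         # index of the correct choice

# A "model" here is anything that maps a question to a choice index.
Model = Callable[[Question], int]

def accuracy(model: Model, questions: list[Question]) -> float:
    """Fraction of questions the model answers correctly."""
    correct = sum(model(q) == q.answer for q in questions)
    return correct / len(questions)

# Tiny example with a dummy model that always guesses the first choice.
questions = [
    Question("2 + 2 = ?", ["4", "5", "6", "7"], answer=0),
    Question("Capital of France?", ["Berlin", "Paris", "Rome", "Madrid"], answer=1),
]
always_first: Model = lambda q: 0
print(accuracy(always_first, questions))  # 0.5

# A reported score like GPT-4o's 88% corresponds to an accuracy of
# roughly 0.88 over MMLU's ~16,000 questions.
```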

Designing evals to measure the capabilities of advanced AI systems is incredibly challenging. The goal is to elicit and measure the system’s actual underlying abilities, for which tasks like multiple-choice questions are only a proxy. Data contamination, where answers to an eval are contained in the AI’s training data, is another challenge. Additionally, evals can be “gamed” when the AI model targets what is measured by the eval rather than what is intended.
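As a toy illustration of the contamination problem, the sketch below flags eval questions whose word n-grams also appear verbatim in a training corpus. The ngrams() and contaminated() helpers are hypothetical names of my own; real contamination audits are considerably more involved.

```python
# Toy illustration of a data-contamination check: flag eval questions whose
# word n-grams also appear verbatim in a training corpus. Real contamination
# audits are far more involved; this only sketches the basic idea.

def ngrams(text: str, n: int = 5) -> set[tuple[str, ...]]:
    """All overlapping lowercased word n-grams in a piece of text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def contaminated(question: str, training_docs: list[str], n: int = 5) -> bool:
    """True if any n-gram of the question also occurs in a training document."""
    q_grams = ngrams(question, n)
    return any(q_grams & ngrams(doc, n) for doc in training_docs)

# The shared 5-gram "the capital of france is" flags this question as
# likely present in the training data.
docs = ["Geography notes: the capital of France is Paris, on the Seine."]
print(contaminated("The capital of France is which city?", docs))  # True
```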

In response to these challenges, new, more sophisticated evals are being built. For example, Epoch AI’s FrontierMath benchmark consists of approximately 300 original math problems, created in collaboration with over 60 leading mathematicians, including Fields Medal winner Terence Tao. These problems vary in difficulty, with some requiring graduate-level training in mathematics to solve.

As AI models rapidly advance, evaluations are racing to keep up. Sophisticated new benchmarks—assessing advanced mathematical reasoning, novel problem-solving, and the automation of AI research—are making progress. However, designing effective evals remains challenging, expensive, and underfunded relative to their importance as early-warning detectors for dangerous capabilities. With leading labs rolling out increasingly capable models every few months, the need for new tests to assess frontier capabilities is greater than ever.

Cited: https://time.com/7203729/ai-evaluations-safety/