AI groups rush to redesign model testing and create new benchmarks

By News Room Last updated Nov 9, 2024

Tech groups are rushing to redesign how they test and evaluate their artificial intelligence models, as the fast-advancing technology outgrows current benchmarks.

OpenAI, Microsoft, Meta and Anthropic have all recently announced plans to build AI agents that can autonomously execute tasks on behalf of humans. To do this effectively, the systems must be able to perform increasingly complex actions, using reasoning and planning.

Companies rely on teams of staff and outside researchers to conduct “evaluations” of AI models. These are standardised tests, known as benchmarks, that assess a model’s abilities and compare its performance with that of other groups’ systems or of older versions.

However, recent advances in AI technology mean many of the newest models now score close to or above 90 per cent accuracy on existing tests, highlighting the need for new benchmarks.

“The pace of the industry is extremely fast. We are now starting to saturate our ability to measure some of these systems [and as an industry] it is becoming more and more difficult to evaluate [them],” said Ahmad Al-Dahle, generative AI lead at Meta.

To deal with this issue, several tech groups including Meta, OpenAI and Microsoft have created their own internal benchmarks and tests for intelligence. But this has raised concerns within the industry over the ability to compare the technology in the absence of public tests.

“Many of these benchmarks let us know how far away we are from automation of tasks and jobs. Without them being made public, it is hard for businesses and wider society to tell,” said Dan Hendrycks, executive director of the Center for AI Safety and an adviser to Elon Musk’s xAI.

Current public benchmarks such as HellaSwag and MMLU use multiple-choice questions to assess common sense and knowledge across various topics. However, researchers argue this method is becoming redundant and that models need more complex problems.
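To make the scoring concrete, here is a minimal sketch of how accuracy on a multiple-choice benchmark is typically computed: each question has one keyed answer, and the score is the share of questions where the model picks it. The item format, the `accuracy` helper, the toy questions and the `always_first` stand-in model are all hypothetical illustrations, not the format or harness of any benchmark named in this article.

```python
from typing import Callable, Sequence, Tuple

# Hypothetical item format: (question, answer choices, index of the correct choice).
Item = Tuple[str, Sequence[str], int]


def accuracy(items: Sequence[Item], model: Callable[[str, Sequence[str]], int]) -> float:
    """Return the fraction of items where the model's chosen index matches the key."""
    if not items:
        raise ValueError("no items to score")
    correct = sum(1 for question, choices, answer in items
                  if model(question, choices) == answer)
    return correct / len(items)


def always_first(question: str, choices: Sequence[str]) -> int:
    """Toy stand-in for a model: always picks the first option."""
    return 0


# Toy questions invented for this sketch.
ITEMS = [
    ("2 + 2 = ?", ["4", "5", "3", "22"], 0),
    ("Capital of France?", ["Rome", "Paris", "Berlin", "Madrid"], 1),
    ("Water freezes at what temperature in Celsius?", ["0", "10", "32", "100"], 0),
    ("Largest planet in the solar system?", ["Mars", "Venus", "Jupiter", "Earth"], 2),
]

score = accuracy(ITEMS, always_first)
print(f"accuracy: {score:.0%}")
```

Once most leading models sit near the top of such a scale, the remaining gap is too small to separate them, which is the saturation problem researchers describe.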

“We are getting to the era where a lot of the human-written tests are no longer sufficient as a good barometer for how capable the models are,” said Mark Chen, SVP of research at OpenAI. “That creates a new challenge for us as a research world.”

One public benchmark, SWE-bench Verified, was updated in August to better evaluate autonomous systems based on feedback from companies, including OpenAI.

It uses real-world software problems sourced from the developer platform GitHub: the AI agent is given a code repository and an engineering issue and asked to fix it. The tasks require reasoning to complete.

On this measure OpenAI’s latest model, GPT-4o preview, solves 41.4 per cent of issues, while Anthropic’s Claude 3.5 Sonnet gets 49 per cent.

“It is a lot more challenging [with agentic systems] because you need to connect those systems to lots of extra tools,” said Jared Kaplan, chief science officer at Anthropic.

“You have to basically create a whole sandbox environment for them to play in. It is not as simple as just providing a prompt, seeing what the completion is and then evaluating that,” he added.

Another important factor when conducting more advanced tests is to make sure the benchmark questions are kept out of the public domain, in order to ensure the models do not effectively “cheat” by generating the answers from training data rather than solving the problem.
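One rough way to check for this kind of leakage is to look for long word sequences that a benchmark question shares with training text. The sketch below is an assumption-laden illustration only: the `ngrams` and `overlap_ratio` helpers, the five-word window and the sample strings are all hypothetical, and nothing here describes how any company named in this article screens its data.

```python
from typing import Iterable, Set, Tuple


def ngrams(text: str, n: int) -> Set[Tuple[str, ...]]:
    """Return the set of lowercase, whitespace-split word n-grams in text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_ratio(question: str, corpus_docs: Iterable[str], n: int = 5) -> float:
    """Share of the question's n-grams that also occur somewhere in the corpus."""
    question_grams = ngrams(question, n)
    if not question_grams:
        return 0.0  # question shorter than n words: nothing to compare
    corpus_grams: Set[Tuple[str, ...]] = set()
    for doc in corpus_docs:
        corpus_grams |= ngrams(doc, n)
    return len(question_grams & corpus_grams) / len(question_grams)


# Sample strings invented for this sketch.
QUESTION = "what is the smallest prime number greater than one hundred"
LEAKED_CORPUS = ["quiz answer key: what is the smallest prime number greater than one hundred (101)"]
CLEAN_CORPUS = ["the weather today is mild with light rain expected in the evening"]

leaked = overlap_ratio(QUESTION, LEAKED_CORPUS)
clean = overlap_ratio(QUESTION, CLEAN_CORPUS)
print(f"leaked corpus overlap: {leaked:.0%}, clean corpus overlap: {clean:.0%}")
```

A high ratio suggests the question may have been seen verbatim during training. A low ratio does not prove the opposite, since paraphrased or translated copies would slip past an exact-match check like this one.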

The ability to reason and plan is critical to unlocking the potential of AI agents that can conduct tasks over multiple steps and applications, and correct themselves.

“We are discovering new ways of measuring these systems and of course one of those is reasoning, which is an important frontier,” said Ece Kamar, VP and lab director of AI Frontiers at Microsoft Research.

As a result, Microsoft is working on its own internal benchmark, incorporating problems that have not previously appeared in training to assess whether its AI models can reason as a human would.

Some, including researchers from Apple, have questioned whether current large language models are “reasoning” or purely “pattern matching” the closest similar data seen in their training.

“In the narrower domains [that] enterprises care about, they do reason,” said Ruchir Puri, chief scientist at IBM Research. “[The debate is around] this broader concept of reasoning at a human level, that would almost put it in the context of artificial general intelligence. Do they really reason, or are they parroting?”

OpenAI measures reasoning primarily through evaluations covering maths, STEM subjects and coding tasks.

“Reasoning is a very grand term. Everyone defines it differently and has their own interpretation . . . this boundary is very fuzzy [and] we try not to get too bogged down with that distinction itself, but look at whether it is driving utility, performance or capabilities,” said OpenAI’s Chen.

The need for new benchmarks has also led to efforts by external organisations.

In September, the start-up Scale AI and Hendrycks announced a project called “Humanity’s Last Exam”, which crowdsourced complex questions from experts across different disciplines that required abstract reasoning to complete.

Another example is FrontierMath, a novel benchmark released this week, created by expert mathematicians. Based on this test, the most advanced models can complete less than 2 per cent of questions.

However, without explicit agreement on measuring such capabilities, experts warn that it can be difficult for companies to assess their competitors or for businesses and consumers to understand the market.

“There is no clear way to say ‘this model is definitively better than this model’ [because] when a measure becomes a target, it ceases to be a good measure” and models are trained to pass the set benchmarks, said Meta’s Al-Dahle.

“It is something that, as a whole industry, we are working our way through.”

Additional reporting by Hannah Murphy in San Francisco
