A test so difficult that no artificial intelligence system can pass it. For now.

If you’re looking for a new reason to be nervous about AI, try this: Some of the world’s smartest humans are struggling to create tests that AI systems can’t pass.

For years, AI systems have been measured by giving new models a series of standardized benchmark tests. Many of these tests consisted of challenging SAT-caliber problems in areas such as math, science, and logic. Comparing model scores over time served as a rough measure of AI progress.

But AI systems eventually became too good at these tests, so new, more difficult tests were created, often with the kinds of questions graduate students might encounter on exams.

But those tests haven’t held up well, either. New models from companies like OpenAI, Google, and Anthropic have scored highly on many doctoral-level challenges, limiting the usefulness of such tests and leading to an unsettling question: Are AI systems getting too smart for us to measure?

This week, researchers at the Center for AI Safety and Scale AI are releasing a possible answer to that question: a new evaluation, called “Humanity’s Last Exam,” which they claim is the toughest test ever administered to artificial intelligence systems.

Humanity’s Last Exam is the brainchild of Dan Hendrycks, a well-known AI safety researcher and director of the Center for AI Safety. (The original name of the test, “Humanity’s Last Stand”, was discarded as overly dramatic.)

Hendrycks worked with Scale AI, an artificial intelligence company for which he consults, to compile the test, which consists of roughly 3,000 multiple-choice and short-answer questions designed to probe the capabilities of AI systems in areas ranging from analytic philosophy to rocket engineering.

The questions were submitted by experts in these fields, including university professors and prize-winning mathematicians, who were asked to come up with extremely difficult questions to which they knew the answers.

Here, try to answer a question about hummingbird anatomy from the test:

Hummingbirds within Apodiformes uniquely have a bilaterally paired oval bone, a sesamoid embedded in the caudolateral portion of the expanded, cruciate aponeurosis of insertion of m. depressor caudae. How many paired tendons are supported by this sesamoid bone? Answer with a number.

Or, if physics is more your speed, try this:

A block is placed on a horizontal rail, along which it can slide without friction. It is attached to the end of a rigid, massless rod of length R. A mass is attached at the other end. Both objects have weight W. The system is initially stationary, with the mass directly above the block. The mass is given an infinitesimal push, parallel to the rail. Assume the system is designed so that the rod can rotate through a full 360 degrees without interruption. When the rod is horizontal, it carries tension T1. When the rod is vertical again, with the mass directly below the block, it carries tension T2. (Both of these quantities could be negative, which would indicate that the rod is in compression.) What is the value of (T1−T2)/W?

(I would print the answers here, but that would ruin the test for any AI systems trained in this column. Besides, I’m too stupid to verify the answers myself.)

Questions for Humanity’s Last Exam went through a two-step filtering process. First, submitted questions were given to leading AI models to solve.

If the models couldn’t answer them (or if, in the case of multiple-choice questions, the models did worse than random guessing), the questions were given to a set of human reviewers, who refined them and verified the correct answers. Experts who wrote top-rated questions were paid between $500 and $5,000 per question, in addition to receiving credit for contributing to the exam.

Kevin Zhou, a postdoctoral researcher in theoretical particle physics at the University of California, Berkeley, submitted a handful of questions for the exam. Three of them were chosen, all of which he told me were “in the upper range of what you might see on a graduate exam.”

Hendrycks, who helped create a widely used AI test known as Massive Multitask Language Understanding, or MMLU, said he was inspired to create harder AI tests by a conversation with Elon Musk. (Mr. Hendrycks is also a safety adviser to Musk’s artificial intelligence company, xAI.) Mr. Musk, he said, raised concerns about the existing tests given to AI models, which he thought were too easy.

“Elon looked at the MMLU questions and said, ‘These are college-level. I want things that a world-class expert could do,’” Hendrycks said.

There are other tests that try to measure the advanced capabilities of artificial intelligence in certain fields, such as FrontierMath, a test developed by Epoch AI, and ARC-AGI, a test developed by artificial intelligence researcher François Chollet.

But Humanity’s Last Exam is intended to determine how good AI systems are at answering complex questions in a wide variety of academic subjects, giving us what might be considered a general intelligence score.

“We’re trying to estimate the extent to which AI can automate a lot of really difficult intellectual work,” Hendrycks said.

Once the list of questions was compiled, the researchers gave Humanity’s Last Exam to six leading AI models, including Google’s Gemini 1.5 Pro and Anthropic’s Claude 3.5 Sonnet. They all failed miserably. OpenAI’s o1 system scored the highest of the group, at 8.3%.

(The New York Times is suing OpenAI and its partner, Microsoft, accusing them of copyright infringement of news content related to artificial intelligence systems. OpenAI and Microsoft have denied those allegations.)

Mr. Hendrycks said he expects these scores to rise rapidly and potentially surpass 50% by the end of the year. At that point, he said, AI systems could be considered “world-class oracles,” capable of answering questions on any topic more accurately than human experts. And we may have to look at other ways to measure AI’s impacts, such as looking at economic data or judging whether it can make new discoveries in areas like math and science.

“You can imagine a better version of this where we can ask questions that we don’t know the answer to yet and we’re able to test whether the model can help us solve it,” said Summer Yue, Scale AI’s director of research and an organizer of the exam.

Part of what’s so confusing about AI progress these days is how jagged it is. We have AI models that can diagnose diseases more effectively than human doctors, win silver medals at the International Mathematical Olympiad, and beat top human programmers in competitive coding challenges.

But these same models sometimes struggle with basic tasks, such as arithmetic or writing metered poetry. This has given them a reputation for being surprisingly brilliant at some things and totally useless at others, and it has created very different impressions of how quickly AI is improving, depending on whether you look at the best or the worst results.

This jaggedness has also made it hard to measure these models. Last year I wrote that we need better evaluations for AI systems. I still believe that. But I also believe we need more creative ways to track AI progress that don’t rely on standardized tests, because most of what humans do (and what we fear AI will do better than us) can’t be captured on a written exam.

Mr. Zhou, the theoretical particle physicist who submitted questions to Humanity’s Last Exam, told me that while AI models were often impressive at answering complex questions, he did not see them as a threat to him and his colleagues, because their work involves much more than spitting out correct answers.

“There is a big gap between what it means to take an exam and what it means to be a practicing physicist and researcher,” he said. “Even an AI that can answer these questions may not be ready to help with research, which is inherently less structured.”
