Wadham lecturer's questions part of Humanity's Last Exam for AI.

Date Published: 12.02.2025

We asked Demosthenes Patramanis what it was like designing exam questions meant to fail every candidate.

Image produced by Adobe Generative AI with prompt "robot taking exam"

AI is getting too good at beating tests. As the New York Times reports, “some of the smartest humans in the world are struggling to create tests that A.I. systems can’t pass.” If the current tests are too easy, it’s hard to track developments in AI and set future goals for progress.

Enter ‘Humanity’s Last Exam’. Rather than pitting AI against, say, undergraduate-level questions, this exam’s questions are pitched at the edge of current human knowledge. If AI were playing a game, this benchmarking tool would be the ‘final boss’. Its questions have been gathered from nearly 1000 experts. Thirteen are from the University of Oxford, including an academic at Wadham.

Demosthenes Patramanis, a lecturer in philosophy, contributed two questions. We asked him what it’s like to set exam questions that, unlike most, are designed to fail every candidate.

“The point was not to trick the AI by using weird phrasing,” he explains. In other words, if an AI only stumbles on your question because of obscure wording, you haven’t really tested its capacity to reason. The questions had to be “clear and have a specific answer.”

The first of Demosthenes’s questions has roots in his DPhil research on Plato. Although no AI model he tested got it right, he concedes that some took “a step in the right direction” — their reasoning wasn’t totally off-course.

However, his second question thoroughly stumped them all. “None of the models even reasoned correctly,” he shared. (Take that, AI!)

Some of the questions in Humanity’s Last Exam are public, but the majority, including Demosthenes’s, remain secret. As with exams for humans, the organisers must walk a fine line. If nothing about the test is known, it’s hard to make progress in passing it. But if too much is known, passing becomes more about gaming the system than demonstrating broad competence. Hence, the Last Exam organisers maintain a “private test set” of questions to “assess model overfitting”.

How does an academic by day become an AI-thwarter by night? Humanity’s Last Exam is not Demosthenes’s first entanglement with the tech sector’s AI boom. He’s been working as an ‘AI trainer’ for the last year, and has enjoyed putting models through philosophical and mathematical gauntlets since the days of GPT-3. (Demosthenes could see “how limited its reasoning was, despite all the hype at the time.”)

Since then, models with far more sophisticated reasoning abilities have been developed, such as OpenAI’s o1. Contributing to Humanity’s Last Exam struck Demosthenes as a fun way to test their limits.

What’s yet to be determined is how long his questions, and others like them, will remain beyond AI’s limits.