Psychometrically derived 60-question benchmarks: Substantial efficiencies and the possibility of human-AI comparisons

Artificial intelligence grew out of computer science with very little input from research on human intelligence. But now, with A.I. becoming increasingly capable of mimicking human responses, the two fields are starting to collaborate more. Gilles Gignac and David Ilić have published a new article showing how test development principles can be used to evaluate the performance of A.I. models.

A.I. benchmarks often consist of thousands of questions that are created without any theoretical rationale. But Gignac and Ilić show that standard question selection procedures can produce benchmarks with psychometric properties comparable to those of well-designed intelligence tests. For example, as the table below shows, the reliability of scores from the shortened benchmarks ranges from .959 to .989. Instead of thousands of questions, models can be evaluated with just 58-60 questions with little or no loss of reliability.
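To make the "little or no loss of reliability" claim concrete, here is a minimal Python sketch of the Spearman-Brown prophecy formula, the standard psychometric tool for predicting how reliability changes when a test is shortened. The function name and the numbers are hypothetical and only illustrate the logic; they are not taken from the paper:

```python
# Spearman-Brown prophecy formula: predicted reliability when a test is
# shortened (or lengthened) by a factor n = new_length / old_length.
# The example numbers below are illustrative assumptions, not values
# from Gignac & Ilic (2025).

def spearman_brown(reliability_old: float, n: float) -> float:
    """Predicted reliability after changing test length by factor n."""
    return (n * reliability_old) / (1 + (n - 1) * reliability_old)

# Example: a hypothetical 2,400-item benchmark with near-ceiling reliability
# of .999, cut down to 60 items (n = 60 / 2400 = 0.025).
print(round(spearman_brown(0.999, 60 / 2400), 3))  # ~0.962
```

A pool with near-ceiling reliability can therefore be cut to a small fraction of its length and still land in the mid-.90s, which is consistent with the .959 to .989 range reported above.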

The questions in A.I. benchmarks vary greatly in quality, as seen below. By using basic item selection procedures (like those used for the RIOT), a mass of thousands of items can be streamlined to ~60 (one generic version of such a procedure is sketched below).
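For readers curious what "basic item selection" can look like in code, here is a hedged sketch on simulated data: rank items by corrected item-total correlation, keep the top 60, and check Cronbach's alpha on the shortened set. This is a generic illustration, not the authors' actual pipeline, and every function name and number below is a made-up assumption:

```python
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """Cronbach's alpha for a (test-takers x items) matrix of 0/1 scores."""
    k = items.shape[1]
    item_variances = items.var(axis=0, ddof=1).sum()
    total_variance = items.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_variances / total_variance)

def select_items(items: np.ndarray, keep: int = 60) -> np.ndarray:
    """Keep the `keep` items with the highest corrected item-total correlations."""
    k = items.shape[1]
    total = items.sum(axis=1)
    # Correlate each item with the total score of all *other* items, so an
    # item is not correlated with itself (the "corrected" part).
    r = np.array([np.corrcoef(items[:, j], total - items[:, j])[0, 1]
                  for j in range(k)])
    top = np.argsort(r)[::-1][:keep]
    return items[:, np.sort(top)]

# Hypothetical data: 200 "test-takers" (models) answering 2,000 items whose
# quality (discrimination) varies widely, as in large A.I. benchmarks.
rng = np.random.default_rng(0)
ability = rng.normal(size=(200, 1))
discrimination = rng.uniform(0.1, 2.0, size=(1, 2000))
difficulty = rng.normal(size=(1, 2000))
p_correct = 1 / (1 + np.exp(-discrimination * (ability - difficulty)))
responses = (rng.random((200, 2000)) < p_correct).astype(float)

short = select_items(responses, keep=60)
print(round(cronbach_alpha(responses), 3), round(cronbach_alpha(short), 3))
```

On data like this, the ~60 retained items typically preserve most of the full pool's reliability, which is the logic the paper exploits at benchmark scale.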

So what? This is an important innovation for a few reasons. First, it brings scientific test creation to the A.I. world, which has used a “kitchen sink” approach so far. Second, it makes measuring A.I. performance MUCH more efficient. Finally, it opens up the possibility of comparing human and A.I. performance more directly than usually occurs.

Reposted from X: https://x.com/RiotIQ/status/1928093471350608233?s=20

Link to full article: https://doi.org/10.1016/j.intell.2025.101922

This is brilliant—applying psychometric principles to AI benchmarks is long overdue. The fact that they can get 0.96-0.99 reliability with just 60 questions versus thousands shows how inefficient current AI evaluation is. Plus, having parallel human-normed versions means we can finally do apples-to-apples comparisons between human and AI cognitive performance instead of just guessing.

Love seeing psychometrics enter the AI world! The “kitchen sink” approach of throwing thousands of random questions at models has always seemed wasteful. If you can measure human intelligence reliably with 60 well-designed items, why wouldn’t the same principles work for AI? This could actually standardize how we benchmark models and make the results way more meaningful and comparable.

@M.Evanta I agree it’s overdue, but let’s not call the comparison “apples-to-apples” just yet. We’re only comparing performance on a test designed through a psychometric lens. A human might score well due to general knowledge and strategy, while an LLM scores well because it’s seen the exact syntax and concepts billions of times. The test is standardized, but the path to the answer is fundamentally different, which complicates that direct cognitive comparison.

@dustin_bruck686 That’s the ultimate philosophical problem here, isn’t it? We can standardize the output but never the process. Still, if the test is truly psychometrically sound, doesn’t it measure the underlying ability? If both get the same score on a test for verbal reasoning, we can at least say they exhibit the same level of that ability, regardless of whether one computed it and the other inferred it.

I appreciate the methodological rigor here, but I think the goal of comparing human and AI performance more directly is misguided. Psychometric theory works for humans because test performance samples from stable cognitive traits shaped by working memory limits, processing speed, attention, and motivation. AI models have none of these constraints. An LLM doesn’t get tired, doesn’t have working memory bottlenecks, and doesn’t experience test anxiety. Comparing raw scores confuses performance similarity with cognitive similarity. It’s like saying a calculator and a human both know math because they produce the same answers.

Agreed. Rather than efficiency, maybe we need more questions that specifically probe the differences between human and AI cognition, and not fewer questions that make the comparison seem cleaner than it actually is.