The gradual increase of IQ scores over time (called the Flynn effect) is one of the most fascinating topics in the area of intelligence research. One of the most common ways to investigate the Flynn effect is to give the same group of people a new test and an old test and calculate the difference in IQs.
The problem with that methodology is that intelligence tests get heavily revised, and there may be major differences between the two versions of a test.
In this article examining the 1989, 1999, and 2009 French versions of the Wechsler Adult Intelligence Scale, the authors compared the item statistics for items that were the same (or very similar) across versions and dropped items that were unique to each version. This made the tests much more comparable.
The authors then examined how the common items’ statistics (e.g., difficulty) changed over time. This change in statistics is called “item drift” and is common. Item drift is relevant because if it happens to many items, then it would change overall IQs and be confounded with the Flynn Effect.
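To make "item drift" concrete, here is a minimal simulated sketch (not the paper's actual analysis) of the basic idea: compare a common item's classical difficulty (proportion of examinees answering correctly) between two standardization cohorts and flag items whose difficulty shifted. The cohort sizes, drift size, and 0.10 threshold are all made-up illustration values; real drift analyses typically use IRT-based methods.

```python
# Hypothetical item-drift check using classical item statistics.
# Two simulated cohorts answer the same 10 common items; responses are
# 0/1 arrays of shape (people, items). All numbers are invented.
import numpy as np

rng = np.random.default_rng(0)

n_people, n_items = 2000, 10
p_old = np.linspace(0.3, 0.9, n_items)   # proportion correct, earlier cohort
p_new = p_old.copy()
p_new[0] -= 0.15                         # simulated drift: item 0 gets harder

resp_old = rng.random((n_people, n_items)) < p_old
resp_new = rng.random((n_people, n_items)) < p_new

# Classical difficulty = proportion of examinees answering correctly.
diff_old = resp_old.mean(axis=0)
diff_new = resp_new.mean(axis=0)
drift = diff_new - diff_old

# Flag items whose difficulty shifted by more than an arbitrary threshold.
flagged = np.where(np.abs(drift) > 0.10)[0]
print("flagged items:", flagged)
```

With a shift this large relative to sampling noise, the drifted item stands out; the paper's point is that many small shifts of this kind, pooled across a subtest, can move the whole score scale.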
The results (shown below) were surprising. Over half of the test items showed changes in their statistics. Most of these changes were small, but in aggregate they had noteworthy effects: verbal subtests tended to get more difficult as time progressed, while two important non-verbal subtests (Block Design and Matrix Reasoning) got easier.
The item drift on these tests masked a Flynn effect that occurred in France from 1989 to 2009 (at least, with these test items).
It’s still not completely clear what causes item drift or the Flynn effect. But it’s important to control for item drift when examining how cognitive performance has changed over time. Otherwise, the traditional method of taking the difference between scores on an old test and a new test will give distorted results.
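A toy calculation (not the paper's procedure, and with made-up numbers) shows how drift can fully mask a real gain: suppose the later cohort truly answers every unchanged item 5 percentage points more often, but two items also drifted harder. Comparing cohorts on all common items understates the gain; restricting the comparison to stable "anchor" items recovers it.

```python
# Toy illustration of drift masking a real ability gain. All values invented.
import numpy as np

true_gain = 0.05  # real improvement in proportion correct on unchanged items

p_1989 = np.array([0.40, 0.50, 0.60, 0.70, 0.80])    # difficulty, old cohort
drift  = np.array([0.00, 0.00, -0.15, 0.00, -0.10])  # two items drift harder
p_2009 = p_1989 + true_gain + drift

stable = drift == 0.0

# Naive comparison over all common items: drift cancels the real gain.
naive_gain = (p_2009 - p_1989).mean()
# Comparison anchored on stable items recovers the true gain.
anchored_gain = (p_2009[stable] - p_1989[stable]).mean()

print(f"naive: {naive_gain:+.3f}, anchored: {anchored_gain:+.3f}")
```

In this contrived setup the naive difference is zero even though ability genuinely rose, which mirrors the article's claim that drift masked a real Flynn effect in the French data.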
This is a really important methodological point. If test items themselves are getting harder or easier over time, independent of actual ability changes, then comparing raw scores across test versions gives you garbage data. The fact that verbal items got harder while non-verbal items got easier suggests cultural shifts: maybe language is evolving faster, or people are more exposed to visual puzzles through technology. The “ersatz negative Flynn effect” from 1999 to 2009 is wild: scores looked like they dropped, but it was just item drift making the test harder, not people getting dumber.
So basically, a lot of Flynn effect research might be measuring test changes rather than actual intelligence changes? That’s kind of a big deal. The 3 IQ point underestimation from item drift adds up when you’re trying to track generational shifts. Makes you wonder how much of the “reverse Flynn effect” people are freaking out about in recent years is just differential item functioning rather than genuine cognitive decline. This study really shows why you need psychometric rigor when making these comparisons: you can’t just slap two test versions together and call it science.
I don’t know if I buy the “item drift” excuse for everything. If verbal subtests are getting harder because people don’t know the words or concepts, isn’t that a valid decline in crystallized intelligence? You can’t just statistically adjust away the fact that people know less vocabulary than they used to and call it a wash.
@CloverL Spot on. The study explicitly showed that the “decline” in France from 1999 to 2009 completely vanished once they corrected for the drift, which suggests the panic about us getting dumber is totally premature. It’s actually scary to think how many headlines have been generated by what is essentially a statistical error in how we calibrate questions.
The directional differences are what grab me. Verbal getting harder, block design and matrix reasoning getting easier. There’s a cultural story embedded in those items. Between 1989 and 2009 in France, what changed? Did people interact with spatial puzzles more (computers, games)? Did language become more fragmented or contextual in ways that made formal verbal reasoning less practiced? The test items are almost like fossils of their time period.
One thing that bothered me was that they only analyzed overlapping items. But test makers choose what to keep and what to drop. If items are systematically removed because they’re becoming too easy or too hard, that’s already a response to population changes. The discarded items might tell us as much as the retained ones. We’re looking at a curated subset of the data.