The gradual increase of IQ scores over time (called the Flynn effect) is one of the most fascinating topics in the area of intelligence research. One of the most common ways to investigate the Flynn effect is to give the same group of people a new test and an old test and calculate the difference in IQs.
The problem with that methodology is that intelligence tests get heavily revised, and there may be major differences between the two versions of a test.
In this article examining the 1989, 1999, and 2009 French versions of the Wechsler Adult Intelligence Scale, the authors compared the item statistics for items that were the same (or very similar) across versions and dropped items that were unique to each version. This made the tests much more comparable.
The authors then examined how the common items’ statistics (e.g., difficulty) changed over time. This change in statistics is called “item drift,” and it is common. Item drift matters because if it happens to many items, it changes overall IQs and becomes confounded with the Flynn effect.
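To make the mechanism concrete, here is a minimal sketch using a Rasch-style item response model with invented difficulty values (these numbers are illustrative, not from the paper). It shows how small per-item drifts in difficulty aggregate: a person of fixed ability earns a lower expected raw score on the drifted version, even though nothing about the person changed.

```python
import math

def p_correct(ability, difficulty):
    # Rasch model: probability of a correct response given
    # person ability and item difficulty (both on a logit scale)
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))

# Hypothetical difficulties for the same 10 common items as calibrated
# on the old and new test versions (invented numbers for illustration).
old_difficulty = [-1.0, -0.5, -0.2, 0.0, 0.1, 0.3, 0.5, 0.8, 1.0, 1.5]
new_difficulty = [-0.8, -0.4, -0.1, 0.1, 0.3, 0.4, 0.7, 0.9, 1.2, 1.6]

ability = 0.0  # one fixed, average examinee

# Expected raw score = sum of per-item success probabilities
expected_old = sum(p_correct(ability, d) for d in old_difficulty)
expected_new = sum(p_correct(ability, d) for d in new_difficulty)

# Each item drifted only slightly harder, but the drifts add up:
# the same person scores lower on the new version purely because
# of item drift, not because ability changed.
print(round(expected_old, 2), round(expected_new, 2))
```

If the drift is ignored and the score gap is read at face value, it looks like a decline in ability, which is exactly the confound the authors are controlling for.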
The results (shown below) were surprising. Over half of the test items showed changes in their statistics. While most of these changes were small, they aggregated into some noteworthy effects. Verbal subtests tended to get more difficult over time, while two important non-verbal subtests (Block Design and Matrix Reasoning) got easier.
The item drift on these tests masked a Flynn effect that occurred in France from 1989 to 2009 (at least, with these test items).
It’s still not completely clear what causes item drift or the Flynn effect. But it’s important to control for item drift when examining how cognitive performance has changed over time. If not, the traditional method of taking the difference between scores on an old test and a new test will give distorted results.
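The distortion can be expressed as a simple decomposition: the observed old-vs-new score gap mixes the true Flynn effect with the drift effect, so recovering the former requires subtracting the latter. A toy sketch (the numbers are illustrative, loosely echoing the roughly 3-point drift effect discussed below, not the paper's exact estimates):

```python
# Decomposition: observed_gap = true_flynn + drift_effect
# Invented numbers for illustration only.
observed_gap = 0.0    # traditional old-vs-new comparison shows no change
drift_effect = -3.0   # IQ points lost because common items drifted harder

# Correcting for drift recovers the gain the raw comparison hid
true_flynn = observed_gap - drift_effect
print(true_flynn)  # 3.0
```

This is the sense in which drift "masked" the Flynn effect: a genuine gain and an opposite-signed drift effect can cancel out to an apparent flat line.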
This is a really important methodological point. If test items themselves are getting harder or easier over time, independent of actual ability changes, then comparing raw scores across test versions gives you garbage data. The fact that verbal items got harder while non-verbal items got easier suggests cultural shifts: maybe language is evolving faster, or people are more exposed to visual puzzles through technology. The “ersatz negative Flynn effect” from 1999 to 2009 is wild: scores looked like they dropped, but it was just item drift making the test harder, not people getting dumber.
So basically, a lot of Flynn effect research might be measuring test changes rather than actual intelligence changes? That’s kind of a big deal. The 3 IQ point underestimation from item drift adds up when you’re trying to track generational shifts. It makes you wonder how much of the “reverse Flynn effect” people have been freaking out about in recent years is just differential item functioning rather than genuine cognitive decline. This study really shows why you need psychometric rigor when making these comparisons: you can’t just slap two test versions together and call it science.