The gradual increase of IQ scores over time (called the Flynn effect) is one of the most fascinating topics in the area of intelligence research. One of the most common ways to investigate the Flynn effect is to give the same group of people a new test and an old test and calculate the difference in IQs.
The problem with that methodology is that intelligence tests get heavily revised, and there may be major differences between the two versions of a test.
In this article examining the 1989, 1999, and 2009 French versions of the Wechsler Adult Intelligence Scale, the authors compared the item statistics for items that were the same (or very similar) across versions and dropped items that were unique to each version. This made the tests much more comparable.
The authors then examined how the common items’ statistics (e.g., difficulty) changed over time. This change in statistics is called “item drift” and is common. Item drift is relevant because if it happens to many items, then it would change overall IQs and be confounded with the Flynn Effect.
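To make "item drift" concrete, here is a minimal simulated sketch (not the paper's actual analysis) of the basic idea: compare a common item's classical difficulty (proportion of examinees answering correctly) between two standardization cohorts and flag items whose difficulty shifted. The cohort sizes, drift size, and 0.10 threshold are all made-up illustration values; real drift analyses typically use IRT-based methods.

```python
# Hypothetical item-drift check using classical item statistics.
# Two simulated cohorts answer the same 10 common items; responses are
# 0/1 arrays of shape (people, items). All numbers are invented.
import numpy as np

rng = np.random.default_rng(0)

n_people, n_items = 2000, 10
p_old = np.linspace(0.3, 0.9, n_items)   # proportion correct, earlier cohort
p_new = p_old.copy()
p_new[0] -= 0.15                         # simulated drift: item 0 gets harder

resp_old = rng.random((n_people, n_items)) < p_old
resp_new = rng.random((n_people, n_items)) < p_new

# Classical difficulty = proportion of examinees answering correctly.
diff_old = resp_old.mean(axis=0)
diff_new = resp_new.mean(axis=0)
drift = diff_new - diff_old

# Flag items whose difficulty shifted by more than an arbitrary threshold.
flagged = np.where(np.abs(drift) > 0.10)[0]
print("flagged items:", flagged)
```

With a shift this large relative to sampling noise, the drifted item stands out; the paper's point is that many small shifts of this kind, pooled across a subtest, can move the whole score scale.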
The results (shown below) were surprising. Over half of the test items showed changes in their statistics. Most of these changes were small, but in aggregate they had noteworthy effects: verbal subtests tended to get more difficult as time progressed, while two important non-verbal subtests (Block Design and Matrix Reasoning) got easier.
The item drift on these tests masked a Flynn effect that occurred in France from 1989 to 2009 (at least, with these test items).
It’s still not completely clear what causes item drift or the Flynn effect. But it’s important to control for item drift when examining how cognitive performance has changed over time. Otherwise, the traditional method of taking the difference between scores on an old test and a new test will give distorted results.
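A toy calculation (not the paper's procedure, and with made-up numbers) shows how drift can fully mask a real gain: suppose the later cohort truly answers every unchanged item 5 percentage points more often, but two items also drifted harder. Comparing cohorts on all common items understates the gain; restricting the comparison to stable "anchor" items recovers it.

```python
# Toy illustration of drift masking a real ability gain. All values invented.
import numpy as np

true_gain = 0.05  # real improvement in proportion correct on unchanged items

p_1989 = np.array([0.40, 0.50, 0.60, 0.70, 0.80])    # difficulty, old cohort
drift  = np.array([0.00, 0.00, -0.15, 0.00, -0.10])  # two items drift harder
p_2009 = p_1989 + true_gain + drift

stable = drift == 0.0

# Naive comparison over all common items: drift cancels the real gain.
naive_gain = (p_2009 - p_1989).mean()
# Comparison anchored on stable items recovers the true gain.
anchored_gain = (p_2009[stable] - p_1989[stable]).mean()

print(f"naive: {naive_gain:+.3f}, anchored: {anchored_gain:+.3f}")
```

In this contrived setup the naive difference is zero even though ability genuinely rose, which mirrors the article's claim that drift masked a real Flynn effect in the French data.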
This is a really important methodological point. If test items themselves are getting harder or easier over time, independent of actual ability changes, then comparing raw scores across test versions gives you garbage data. The fact that verbal items got harder while non-verbal items got easier suggests cultural shifts: maybe language is evolving faster, or people are more exposed to visual puzzles through technology. The “ersatz negative Flynn effect” from 1999 to 2009 is wild: scores looked like they dropped, but it was just item drift making the test harder, not people getting dumber.
So basically, a lot of Flynn effect research might be measuring test changes rather than actual intelligence changes? That’s kind of a big deal. The 3 IQ point underestimation from item drift adds up when you’re trying to track generational shifts. Makes you wonder how much of the “reverse Flynn effect” people are freaking out about in recent years is just differential item functioning rather than genuine cognitive decline. This study really shows why you need psychometric rigor when making these comparisons: you can’t just slap two test versions together and call it science.
I don’t know if I buy the “item drift” excuse for everything. If verbal subtests are getting harder because people don’t know the words or concepts, isn’t that a valid decline in crystallized intelligence? You can’t just statistically adjust away the fact that people know less vocabulary than they used to and call it a wash.
@CloverL Spot on. The study explicitly showed that the “decline” in France from 1999 to 2009 completely vanished once they corrected for the drift, which suggests the panic about us getting dumber is totally premature. It’s actually scary to think how many headlines have been generated by what is essentially a statistical error in how we calibrate questions.
The directional differences are what grab me. Verbal getting harder, block design and matrix reasoning getting easier. There’s a cultural story embedded in those items. Between 1989 and 2009 in France, what changed? Did people interact with spatial puzzles more (computers, games)? Did language become more fragmented or contextual in ways that made formal verbal reasoning less practiced? The test items are almost like fossils of their time period.
One thing that bothered me was that they only analyzed overlapping items. But test makers choose what to keep and what to drop. If items are systematically removed because they’re becoming too easy or too hard, that’s already a response to population changes. The discarded items might tell us as much as the retained ones. We’re looking at a curated subset of the data.