I have taken a few tests where a question felt unclear rather than difficult. It was not that I could not solve it, but that I was not sure what it was actually asking. That made me wonder how test developers catch those issues before a test is finalized.
Do they flag items when people interpret them in different ways? Is ambiguity something that shows up clearly in the data, or does catching it rely on feedback from test takers? I imagine some level of confusion is expected, but at what point does a question stop measuring reasoning and start measuring guesswork?