How do developers evaluate whether an item is too ambiguous?

I have taken a few tests where a question felt unclear rather than difficult. It was not that I could not solve it, but that I was not sure what it was actually asking. That made me wonder how test developers catch those issues before a test is finalized.

Do they flag items when people interpret them in different ways? Is ambiguity something that shows up clearly in the response data, or does catching it rely on feedback from test takers? I imagine some level of confusion is expected, but at what point does a question stop measuring reasoning and start measuring guesswork?

This bugs me so much! I think good test developers collect qualitative feedback during norming and ask pilot takers to explain their reasoning. If multiple smart people interpret a question differently and each reading makes logical sense, that’s a problem with the item, not with them. The tricky part is distinguishing “genuinely ambiguous” from “I just didn’t get it”, though. Some items are meant to have that aha moment where clarity suddenly clicks.
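On the data side, there actually is a classic quantitative check: distractor analysis. If your *strongest* scorers split between the keyed answer and one particular wrong option, that’s the statistical fingerprint of ambiguity rather than difficulty (weak scorers guessing tend to spread across all the distractors). Here’s a rough Python sketch of the idea; note that the `flag_ambiguous_items` name, the median split, and the 0.25 share threshold are all illustrative choices of mine, not any published standard:

```python
from collections import Counter

def flag_ambiguous_items(responses, key, share=0.25):
    """Flag items where a sizable share of high scorers pick the
    same wrong option, a classic signal of an ambiguous item."""
    # Total score per test taker.
    scores = [sum(r == k for r, k in zip(row, key)) for row in responses]
    # Rough median split: keep the top-scoring half of the sample.
    cutoff = sorted(scores)[len(scores) // 2]
    high = [row for row, s in zip(responses, scores) if s >= cutoff]
    flags = []
    for i, correct in enumerate(key):
        picks = Counter(row[i] for row in high)
        total = sum(picks.values()) or 1
        for option, n in picks.items():
            if option != correct and n / total >= share:
                flags.append((i, option, n / total))
    return flags

# Hypothetical 4-item, 4-person pilot: item 3 (index 2) is keyed B,
# but the strong scorers split between B and C.
key = ["A", "C", "B", "D"]
responses = [
    ["A", "C", "B", "D"],
    ["A", "C", "C", "D"],
    ["A", "B", "C", "D"],
    ["B", "A", "B", "A"],
]
print(flag_ambiguous_items(responses, key))  # -> [(2, 'C', 0.5)]
```

Real item analysis does this at scale with point-biserial discrimination (how well each item correlates with total score) over hundreds of pilot takers, and an ambiguous item typically shows low discrimination even when its difficulty looks reasonable. But it’s the qualitative “explain your reasoning” step that tells you which rewrite actually fixes it.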