A pile of labeled data is not a training data program.
Human-in-the-loop data work fails in model teams, in evaluation teams, in moderation programs, in multilingual data operations, in every AI program that has ever had to reconcile "we shipped the data on time" with "the model is not behaving." The reasons have very little to do with the annotation interface and almost everything to do with the operating model around it: the guideline written as a description rather than a specification, annotator disagreement averaged out instead of surfaced as signal, multilingual labels produced by whoever was available rather than by specialists in the language, review scheduled at the end rather than built into the workflow, and ownership split across four vendors with four definitions of "done."