Roughly half of the models’ answers were inaccurate

Team evaluations of AI responses