A study from Stanford HAI examined the safety and accuracy of GPT-4 in medical applications. In a nutshell, it finds that GPT-4 is not yet robust enough for use as a medical co-pilot. Their findings:
- 91% of GPT-3.5 and 93% of GPT-4 responses were deemed safe; the remainder were considered “harmful”, primarily because they included hallucinated citations.
- 21% of GPT-3.5 and 41% of GPT-4 responses agreed with the known answer.
- 27% of GPT-3.5 and 29% of GPT-4 responses were ones the clinicians were “unable to assess” for agreement with the known answer.
Paper: arxiv.org/pdf/2304.13714… hai.stanford.edu/news/how-well-…