ISB Study Highlights AI’s Potential and Pitfalls in Analyzing Health Data
New research highlights strengths of large language models in uncovering social determinants of health while underscoring the need for human oversight and improved de-identification methods.
Institute for Systems Biology (ISB) researchers have gained new insights into the strengths and limitations of using artificial intelligence (AI) to identify social determinants of health from electronic health records. Their peer-reviewed results were published on November 19.
The ISB team, collaborating with Providence, leveraged large language models (LLMs) based on generative pre-trained transformers (GPT). Their research was conducted entirely within Providence's secure internal environment.
The study, which aimed to detect housing instability, analyzed more than 25,000 clinical notes from 795 pregnant women and evaluated two large language models (GPT-4 and GPT-3.5), a named entity recognition model, regular expressions, and human review.
This research goes beyond previous studies in two important ways. First, researchers measured how well AI can find housing challenges, distinguish between current and past housing instability, and provide direct evidence from clinical notes. Second, they measured whether AI performed differently if the notes had been de-identified.
GPT-4 was the most effective of the four automated technologies tested, and it outperformed human reviewers at finding cases of housing instability (recall). Human reviewers, however, were less likely to incorrectly flag patients who did not have housing instability (precision), and they were better at providing correct supporting evidence from a clinical note.
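To make the recall/precision trade-off concrete, the sketch below shows how the two metrics are computed for a binary screening task. The labels and numbers are invented for illustration and do not come from the study; they simply show how a method that casts a wide net can score higher on recall while a more conservative reviewer scores higher on precision.

```python
# Illustrative only: how recall and precision are computed for a binary
# screening task such as flagging housing instability in clinical notes.
# The labels below are made-up examples, not data from the study.

def precision_recall(true_labels, predicted_labels):
    """Return (precision, recall) for binary 0/1 labels."""
    tp = sum(1 for t, p in zip(true_labels, predicted_labels) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(true_labels, predicted_labels) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(true_labels, predicted_labels) if t == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0  # how many flagged cases were real
    recall = tp / (tp + fn) if (tp + fn) else 0.0     # how many real cases were found
    return precision, recall

# Hypothetical annotations (1 = housing instability documented in the note)
truth      = [1, 1, 1, 0, 0, 0, 1, 0]
model_pred = [1, 1, 1, 0, 1, 1, 1, 0]  # wide net: perfect recall, lower precision
human_pred = [1, 1, 0, 0, 0, 0, 1, 0]  # conservative: perfect precision, lower recall

print("model:", precision_recall(truth, model_pred))   # (0.67, 1.0)
print("human:", precision_recall(truth, human_pred))   # (1.0, 0.75)
```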
“These results show that LLMs present a scalable, cost-effective solution for an initial search for patients who may benefit from outreach,” said ISB Associate Professor Dr. Jennifer Hadlock, corresponding author of the paper.
GPT-4 generally quoted the same text that human reviewers had selected to justify their answers. Notably, no hallucinated evidence appeared in the GPT-4 responses that were reviewed, most likely because the researchers designed the LLM instructions to request verbatim evidence from the notes.
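The snippet below is a minimal sketch of this idea, not the study's actual prompt or pipeline: an instruction that asks for a verbatim quote, paired with a simple check that the quoted evidence really appears in the source note. The prompt wording, note text, and helper function are hypothetical.

```python
# Minimal sketch (not the study's actual prompt) of requesting verbatim
# evidence and verifying that the returned quote appears word-for-word
# in the source note, which helps catch hallucinated evidence.

PROMPT_TEMPLATE = (
    "Read the clinical note below and answer:\n"
    "1. Does the patient currently have housing instability? (yes/no/unknown)\n"
    "2. Quote the exact sentence(s) from the note that support your answer, verbatim.\n\n"
    "Clinical note:\n{note}"
)

def evidence_is_verbatim(note_text: str, quoted_evidence: str) -> bool:
    """Return True if the quoted evidence appears verbatim in the note (case-insensitive)."""
    return quoted_evidence.strip().lower() in note_text.lower()

# Hypothetical example
note = "Patient reports she is currently staying in her car after an eviction last month."
model_quote = "currently staying in her car after an eviction"
print(evidence_is_verbatim(note, model_quote))  # True
```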
However, there were cases where the AI interpretation of note text was incorrect in ways that could be misleading. This is especially important because housing status can intersect with many other challenging or risky situations, such as domestic abuse.
“When a healthcare professional decides whether and how to reach out to offer help, they take great care to consider patient safety. Our results illustrate that it would still be essential to have a human read the actual text in the chart, not just the LLM summary,” Hadlock added.
Further, in a novel experiment, the researchers showed that recall was worse when the models were run on de-identified versions of the same clinical notes. These notes had been de-identified with an automated technique called “hide in plain sight,” which replaces potentially sensitive information (such as names, locations and dates) with realistic but fictitious alternatives. This substitution sometimes altered or obscured critical details, skewing the ability to accurately determine housing instability.
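As a toy illustration of the “hide in plain sight” idea (not the de-identification pipeline used in the study), the sketch below substitutes surrogate names, places, and dates into a made-up note. The surrogate dictionary and regular expression are invented for demonstration; the point is that a poorly chosen surrogate can weaken the very signal a model needs.

```python
# Toy illustration of "hide in plain sight"-style de-identification:
# sensitive tokens are replaced with realistic but fictitious surrogates
# rather than removed. All names, places, and dates here are fictional.
import re

SURROGATE_NAMES = {"Maria Lopez": "Janet Price"}
SURROGATE_PLACES = {"Aurora Commons shelter": "Main Street Clinic"}  # a poor surrogate choice

def hide_in_plain_sight(text: str) -> str:
    # Swap names and places for fictitious alternatives.
    for real, fake in {**SURROGATE_NAMES, **SURROGATE_PLACES}.items():
        text = text.replace(real, fake)
    # Replace simple MM/DD/YYYY dates with a fixed fictitious date.
    return re.sub(r"\b\d{2}/\d{2}/\d{4}\b", "01/15/2020", text)

note = "Maria Lopez has been sleeping at Aurora Commons shelter since 03/02/2023."
print(hide_in_plain_sight(note))
# "Janet Price has been sleeping at Main Street Clinic since 01/15/2020."
# Swapping the shelter for a clinic weakens the housing-instability signal.
```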
“This highlights the need to refine de-identification methods to preserve privacy without losing important details about social determinants of health,” said Dr. Alexandra Ralevski, lead author of the study.