
Can You Trust Real-World Data Extracted by a Large Language Model (LLM)?

Published: September 2025

By: Melissa Estevez, Director of Research Sciences at Flatiron Health


As a research scientist at Flatiron Health, I’ve spent nearly nine years driving the use of natural language processing (NLP) and machine learning (ML) methods to “read” the clinical notes and other unstructured documents in the electronic health record (EHR). In my early days, I spent time developing these models. Today my focus is on quality: once we have used advanced ML models to extract diagnosis dates, treatments, and other variables from these unstructured documents, I ask questions such as: How often is the model wrong? What impact do these errors have on the ability to use this data to answer research questions? These questions have guided much of my work and given me a front-row seat to the evolution of these models. Over the past eight years, I’ve witnessed tremendous progress in their performance and capabilities, especially over the past three years.

When OpenAI released ChatGPT three years ago, it quickly captured the public's (and my own) imagination. Within days, millions of users were playfully poking around its capabilities and testing its boundaries. Adoption has since exploded.

Today, journalists use large language models (LLMs) like ChatGPT, Claude, and Gemini to draft headlines and news stories; marketers rely on them to generate ad copy and product descriptions at scale; software developers use them to accelerate coding; financial analysts to summarize earnings calls; lawyers to research case law; clinicians to document patient visits — and yes, students use generative AI to help with their homework. Across nearly every sector, organizations are exploring how these tools can drive growth and boost efficiency, even as they grapple with ethical concerns, their effect on the labor market, and the uneven quality of AI-generated output.

There are areas where an LLM mistake or hallucination isn’t a big deal. If you're using an LLM for meal prep and the instructions are off, the sauce will curdle or the meringue will stay soft. You can make corrections and try again. But if you're using an LLM to extract clinical details, like diagnosis dates, treatment and disease progression information, from patients’ electronic health records (EHRs), as we do at Flatiron, the impact of a faulty model can be devastating for researchers looking to use that data to answer vital questions at every step of the drug development lifecycle.

Most EHR data remains unstructured

Why do we need to use ML models for automated data abstraction in the first place? Well, to answer that question, consider what a typical annual visit to a doctor looks like. Everything discussed with your doctor – symptoms, current diagnoses and medications, lifestyle information – is captured as written visit notes. There may be a perception that moving from paper records to a digitized EHR system makes this data easily accessible for analysis. The reality, however, is that unstructured data cannot be used for traditional analysis until it has been made sense of.

While there’s been considerable progress over the past 10-15 years, our experience working with every type of EHR system is that 80% of EHR data today still resides in unstructured formats like clinical narratives, pathology reports, and images. That’s especially true in oncology, where cases tend to be more complex, involve more external testing, and yield more personalized treatment recommendations.

Automation is indispensable

Extracting accurate information from unstructured data is difficult.

To minimize mistakes, we’ve long relied on a process called technology-enabled expert-human abstraction in which trained clinical experts parse through those unstructured records to identify and label all relevant variables, but that approach doesn’t scale easily. Expert human abstraction may be the gold standard, but it’s costly and time-consuming. Precision oncology requires vast datasets to study small patient cohorts and rare diseases — our network currently includes over 5M patient records from 280+ oncology practices and 800+ unique sites of care in the U.S. — and processing patient data across that kind of scale can only be achieved with some level of automation.

Over five years ago, we started using NLP and other ML algorithms to scale our operations, and LLMs are the natural next step. They’re more efficient at interpreting unstructured text and images than earlier algorithms, and we already had the technical expertise and foundation in place to quickly incorporate the new technology into our data extraction process when it first emerged.

A new model validation framework

When we began experimenting with LLMs for data extraction, we quickly saw their potential to dramatically improve efficiency, especially with human oversight and continuous refinement. However, these models can produce errors and hallucinations, and robust validation approaches are needed. We discovered that there are many ways to assess the quality of the data, and that even data with high variable-level performance can fail to meet standards of face validity or be fit-for-purpose. For instance, while a variable might appear sufficiently correct at the individual level, small errors like a date being off by a matter of days can have a cascading effect, causing events to appear out of sequence and making the dataset messy to use. This experience highlighted the need for a standardized approach to assessing the quality of extracted data across multiple dimensions, so that we can be sure the dataset’s performance meets our rigorous standards.
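To make that failure mode concrete, here is a minimal sketch in Python of the kind of cross-variable consistency check this implies: flagging patients whose extracted treatment start date falls before their extracted diagnosis date. The column and file names are illustrative assumptions, not Flatiron’s actual schema.

```python
# A minimal sketch of a cross-variable consistency check: flag patients whose
# extracted treatment start date falls before their extracted diagnosis date.
# Column and file names are illustrative assumptions, not Flatiron's schema.
import pandas as pd

def flag_out_of_sequence(df: pd.DataFrame) -> pd.DataFrame:
    """Return rows where extracted events appear in an implausible order."""
    out_of_order = df["treatment_start_date"] < df["diagnosis_date"]
    return df.loc[out_of_order, ["patient_id", "diagnosis_date", "treatment_start_date"]]

# Hypothetical usage on an extracted dataset with parsed dates
extracted = pd.read_csv(
    "llm_extracted_variables.csv",  # hypothetical file
    parse_dates=["diagnosis_date", "treatment_start_date"],
)
issues = flag_out_of_sequence(extracted)
print(f"{len(issues)} of {len(extracted)} patients have events out of sequence")
```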

We developed the Validation of Accuracy for LLM/ML-Extracted Information and Data (VALID) Framework to evaluate data quality, improve accuracy, and assure clients that the extracted datasets are fit-for-purpose for their most sensitive use cases. The VALID Framework is, to our knowledge, the first comprehensive validation framework for model-extracted data in the oncology industry, and we no longer release a new ML-extracted dataset without first subjecting it to this scrutiny.

It consists of three distinct pillars, designed to build on one another to provide a holistic assessment of data quality, as highlighted below.

[Figure: The three pillars of the VALID Framework]

Let’s say that for a handful of key variables, such as diagnosis dates and oral treatment, LLM-extracted data is accurate to within a few percentage points of expert human abstraction in the broad cohort (pillar #1), and there are no blatant data inconsistencies across those variables (pillar #2). While these are good signals that data quality is adequate, they do not guarantee that the data will produce the “correct” answer for a specific study objective, such as measuring real-world survival patterns for patients who received a specific oral drug in the first-line setting. Variable metrics are measured at the broad cohort level and across all oral drugs, but it’s possible that performance on the diagnosis and oral treatment variables differs across subpopulations and by oral drug. We also don’t know how small differences in LLM performance relative to expert human abstraction could accumulate across variables and introduce bias in downstream analysis results. Pillar #3 addresses both of these issues: it helps verify that the dataset can replicate established clinical and outcome distributions for the use cases and subcohorts you care about, without introducing new bias.
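As an illustration of what a pillar #1 check can look like in practice, here is a minimal sketch that compares LLM-extracted diagnosis dates against an expert-abstracted reference sample, overall and stratified by oral drug. The column names, file names, and seven-day tolerance are assumptions made for illustration, not Flatiron’s actual variables or thresholds.

```python
# A minimal, hypothetical sketch of a variable-level accuracy check (pillar #1):
# compare LLM-extracted diagnosis dates to expert human abstraction on a labeled
# validation sample, overall and by subcohort. All names and thresholds here are
# illustrative assumptions.
import pandas as pd

DATE_TOLERANCE = pd.Timedelta(days=7)  # assumed acceptable difference

def date_agreement(merged: pd.DataFrame) -> float:
    """Share of patients whose LLM date falls within tolerance of the expert date."""
    delta = (merged["diagnosis_date_llm"] - merged["diagnosis_date_expert"]).abs()
    return (delta <= DATE_TOLERANCE).mean()

# Hypothetical usage: join LLM output to the expert-abstracted reference sample,
# which also carries the oral drug label used for stratification.
llm = pd.read_csv("llm_extracted.csv", parse_dates=["diagnosis_date"])
expert = pd.read_csv("expert_abstracted.csv", parse_dates=["diagnosis_date"])
merged = llm.merge(expert, on="patient_id", suffixes=("_llm", "_expert"))

print(f"Overall agreement: {date_agreement(merged):.1%}")
# Performance can differ by subpopulation, so report it stratified as well.
for drug, group in merged.groupby("oral_drug"):
    print(f"  {drug}: {date_agreement(group):.1%} (n={len(group)})")
```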

Data quality starts with asking the right questions

Have you already used real-world data (RWD) extracted by LLMs and other ML models? What was your experience like? How transparent was the whole process? How confident were you in the quality of that dataset, and what guarantees did you get that it was an accurate reflection of what was in the EHR? How much do you trust the conclusions you reached through that dataset?

We put together a quick checklist to help you ask the right questions, whether you’re getting your RWD from an outside data supplier or from an internal data extraction team. It’s helped us sharpen our models and build confidence in our own curation approach, and I trust that it can do the same for you. If you have questions about how to put this framework into practice or the critical role LLMs are playing in the evolution of EHR data extraction, reach out to us.

Does your RWD pass the VALID test?

Answer the following questions to see if you should trust your LLM/ML-extracted data

Data and dataset inconsistencies

☑️ Does the data provide adequate sample size and representativeness for your use case?
☑️ Does the data contain relevant demographic and clinical characteristics needed for your analysis?
☑️ Does the dataset align with clinical practice (e.g., distribution of 1L treatment regimens, rate of surgery by stage at initial diagnosis)?
☑️ How extensively did clinical experts review the data for discrepancies and nonsensical datapoints?



Variable-level model accuracy

☑️ What specific performance metrics were used to assess the model’s accuracy?
☑️ Was the model’s accuracy compared to manual data abstraction by clinical experts?
☑️ What reference dataset was used to assess performance? Why can it be trusted?
☑️ Was accuracy the same across subcohorts of interest?


Fit-for-purpose

☑️ Has the dataset been used to replicate well-established findings? Which analyses? (One example check is sketched after this list.)
☑️ Are demographic and clinical cohort distributions as expected?
☑️ Are trends in clinical care over time as expected?
☑️ Are outcomes and treatment patterns as expected, including known biases?
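
To illustrate the first question in the list above, here is a minimal, hypothetical sketch of a fit-for-purpose replication check: estimating median real-world overall survival in the extracted cohort with a Kaplan-Meier fit (using the open-source lifelines library) and comparing it to a published benchmark. The column names, file, benchmark value, and tolerance are assumptions for illustration only, not Flatiron’s actual analysis.

```python
# A minimal, hypothetical sketch of a fit-for-purpose check: does the extracted
# cohort reproduce a well-established survival benchmark? All names and numbers
# below are illustrative assumptions.
import pandas as pd
from lifelines import KaplanMeierFitter

def median_rwos_months(df: pd.DataFrame) -> float:
    """Estimate median real-world overall survival (months) via Kaplan-Meier."""
    kmf = KaplanMeierFitter()
    kmf.fit(durations=df["followup_months"], event_observed=df["death_observed"])
    return kmf.median_survival_time_

PUBLISHED_MEDIAN_MONTHS = 24.0  # assumed benchmark from the literature
TOLERANCE_MONTHS = 3.0          # assumed acceptable deviation

cohort = pd.read_csv("llm_extracted_cohort.csv")  # hypothetical file
estimate = median_rwos_months(cohort)
if abs(estimate - PUBLISHED_MEDIAN_MONTHS) > TOLERANCE_MONTHS:
    print(f"Median rwOS {estimate:.1f} mo deviates from the benchmark; investigate before use.")
else:
    print(f"Median rwOS {estimate:.1f} mo is consistent with the benchmark.")
```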
