
High-quality, validated AI-enabled research: Insights from recent prostate cancer research

Published

April 2026

By

Eunice Hankinson, Senior Clinical Director, Flatiron Health

If there’s one question I hear most often from researchers, it’s: How can we trust real-world data extracted by AI?

Flatiron Health has been a pioneer in building AI-powered datasets while answering the critical questions of how they’re built, which applications they should be used in, and whether they can support oncology’s highest-stakes decisions—all with a validated approach supported by research.

Prostate cancer research is evolving rapidly. Approvals of androgen receptor pathway inhibitors (ARPI) in earlier disease settings, PARP inhibitors, and radioligand therapies have meaningfully shifted the treatment landscape. Real-world data plays a role in understanding how these treatments perform outside clinical trials—especially when answering questions about who is getting access and what outcomes look like across diverse patient populations.

But the insights that inform those questions depend entirely on the quality of the underlying data. When applying large language models to extract clinical information from electronic health records at scale, it’s critical to demonstrate the quality of the data. This focus has been a guiding principle as we’ve built our prostate Panoramic dataset, and it is the thread connecting the two publications using that dataset being showcased at ISPOR 2026.

First, building a framework that actually answers the question

Earlier this year, Flatiron published the Validation of Accuracy for LLM/ML-Extracted Information and Data (VALID) Framework in JCO Clinical Cancer Informatics. To our knowledge, it's the first comprehensive framework of its kind for evaluating the quality of LLM/ML-extracted real-world oncology data.

The VALID Framework is built on three pillars: variable-level performance metrics that benchmark LLM output against expert human abstraction; automated verification checks that identify internal inconsistencies and implausible values; and replication and benchmark analyses that compare LLM-derived findings to established clinical results from human-abstracted or external datasets.
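To make the second pillar concrete, an automated verification check can be as simple as a rule that flags internally inconsistent or implausible values in the extracted record. The sketch below is a hypothetical illustration only—the field names and rules are invented and do not reflect Flatiron's actual implementation:

```python
from datetime import date
from typing import Optional

def verify_dates(initial_dx: Optional[date], met_dx: Optional[date]) -> list:
    """Flag internal inconsistencies in extracted diagnosis dates.
    (Hypothetical rule set, for illustration only.)"""
    flags = []
    # A metastatic diagnosis date cannot precede the initial diagnosis date.
    if initial_dx and met_dx and met_dx < initial_dx:
        flags.append("metastatic_dx_before_initial_dx")
    return flags

# An implausible record: metastatic date a full year before initial diagnosis.
print(verify_dates(date(2024, 5, 1), date(2023, 5, 1)))
# → ['metastatic_dx_before_initial_dx']
```

In a real pipeline, checks like this run across every extracted record, and the rate of flagged inconsistencies becomes a monitorable quality signal rather than a one-off spot check.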

What I find meaningful about this framework is that it goes beyond measuring accuracy to building the mechanisms for finding problems and improving. In high-stakes domains like oncology research, the goal isn't getting it right once. It's creating a transparent, repeatable process for continuously evaluating and strengthening your outputs. That's what VALID is designed to do, and it's now the lens we apply to every new dataset we develop.

Applying the VALID Framework in practice: the prostate cancer Panoramic dataset

Following publication of the VALID framework, we’ve demonstrated its application in the prostate Panoramic dataset, with results presented at ISPOR this year.

The first poster—Assessing Quality of a Large Language Model (LLM)-Derived Prostate Cancer (PC) Real-World Dataset: An Application of the VALID Framework—applies each pillar of the framework directly to Flatiron’s LLM-extracted prostate Panoramic database, drawn from the EHR-derived, deidentified Flatiron Health Research Database including nearly 400,000 patients with prostate cancer.

This study focused on a cohort of approximately 374,000 US patients with prostate cancer, and the findings showed:

  • For variable-level performance metrics, we compared LLM-extracted variables to doubly-abstracted test sets of 349–500 patients. F1 scores—the harmonic mean of precision and recall—for initial diagnosis and date were only 2.10 percentage points lower for the LLM than for expert abstractors; for metastatic diagnosis and date, 2.11 percentage points lower; and for castration-resistant versus hormone-sensitive status, just 0.52 percentage points lower. These are clinically meaningful variables, and LLM performance approaching expert human performance across all three speaks to real extraction quality, not just theoretical potential.
  • The verification checks told a similarly reassuring story. When assessing a clinical plausibility signal—the proportion of patients with metastatic hormone-sensitive prostate cancer (mHSPC) who received more than one line of systemic therapy—the LLM-derived dataset was only 3.3 percentage points higher than the abstracted comparator.
  • Perhaps most meaningful for our biopharma partners is what the replication analysis showed. In treatment-selected cohorts, real-world overall survival (rwOS) patterns between LLM-derived and abstracted datasets were strikingly similar. For patients treated with ARPI in the 1L metastatic castration-resistant prostate cancer (mCRPC) setting, median rwOS was 25.3 months (95% CI: 24.9–25.8) in the LLM-derived dataset versus 24.4 months (23.7–25.1) in the abstracted dataset. For patients receiving PARP inhibitors in the 2L mCRPC setting, median rwOS was 15.8 months (13.9–17.2) versus 15.9 months (14.5–17.8), respectively. These results are not only statistically comparable—they are clinically interchangeable. This level of fidelity is critical when evaluating external control arm feasibility or modeling treatment effectiveness.
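For readers less familiar with the variable-level metric above: F1 is the harmonic mean of precision and recall, computed per variable against an expert-abstracted gold standard. A minimal sketch, with invented toy labels (this is not Flatiron's actual evaluation pipeline):

```python
def f1_score(gold: list, pred: list, positive: str) -> float:
    """Harmonic mean of precision and recall for one label, comparing
    LLM-extracted values (pred) to expert abstraction (gold) on the
    same patients. Toy implementation for illustration."""
    tp = sum(1 for g, p in zip(gold, pred) if g == positive and p == positive)
    fp = sum(1 for g, p in zip(gold, pred) if g != positive and p == positive)
    fn = sum(1 for g, p in zip(gold, pred) if g == positive and p != positive)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Toy example: castration-resistant (CRPC) vs hormone-sensitive (HSPC) status
gold = ["CRPC", "HSPC", "CRPC", "CRPC", "HSPC"]
pred = ["CRPC", "HSPC", "HSPC", "CRPC", "HSPC"]
print(round(f1_score(gold, pred, "CRPC"), 3))  # → 0.8
```

Reporting the gap between the LLM's F1 and the expert abstractors' F1 on the same test set, as the poster does, is what lets "near-human performance" be a measured claim rather than a qualitative one.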

This research demonstrates how the VALID framework provides a basis for real-world data quality to be transparent and measurable. When a biopharma team asks whether our prostate cancer data is fit for a regulatory submission, a comparative effectiveness analysis, or an external control arm study, we can answer that question with confidence and clear evidence. The VALID Framework gives us a structured, defensible way to do exactly that.

Innovation and validation

There's a misconception I sometimes encounter: that a focus on data quality is inherently conservative, that it slows innovation. However, our commitment to rigor is precisely what gives us the confidence to push methodologically further.

Our second ISPOR poster—Customization of a Large Language Model Approach to Capture PSA and Imaging Derived Real-World Progression Events in Prostate Cancer—is a clear example. Prostate cancer presents a specific and well-known challenge for real-world progression capture: clinician documentation of worsening disease is valuable, but it doesn't tell the whole story. PSA kinetics are central to how oncologists understand and communicate disease progression in their clinical practice. Ignoring PSA-based signals in a progression framework means leaving meaningful clinical signals on the table.

This study, conducted in the same 374,189-patient Flatiron Health Research Database, adapted our existing LLM-based progression extraction approach to incorporate PSA-based progression events alongside clinician-documented ones—and then evaluated its performance using the VALID framework.

  • The results of this study inform how real-world progression is defined in prostate cancer. In 1L mHSPC, 53% of patients had at least one captured real-world progression (rwP) event. In the mCRPC setting, that figure rose to 73% in 1L and 77% in both 2L and 3L. Looking at the source of those events: 49%–68% of patients had at least one clinician-documented event, while an additional 20%–36% had at least one PSA-only event. The PSA-based layer captures a meaningful number of progression events that the clinician-documentation approach alone would have missed.
  • The validation analysis grounded these findings clinically. First rwP events following line start were associated with downstream clinical events—treatment change or death—at increasing rates in later lines of therapy: 39% in 1L mHSPC, rising to 50% in 3L mCRPC. As expected, clinician-documented events were more strongly associated with downstream clinical action than PSA-only events (41%–57% versus 23%–33%), which reflects how clinicians actually make treatment decisions in practice. Rising PSA trends inform one aspect of patient status, while documented clinical deterioration drives treatment decisions; our LLM approach was able to capture both.
  • The PFS estimates by metastatic site also landed where clinical experience would predict. Patients with liver metastases had substantially shorter PFS than those with bone-only metastases: 3.2 versus 5.9 months in 1L mCRPC, and 6.9 versus 17.8 months in 1L mHSPC.
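One way to picture how the two event sources combine: once clinician-documented and PSA-based events are extracted, each patient's first rwP event is simply the earliest event from either source. The sketch below uses invented records and field names; the actual PSA progression criteria and extraction logic belong to Flatiron's LLM approach and are not reproduced here:

```python
from datetime import date

# Hypothetical extracted events: (patient_id, event_date, source), where
# "clinician" = clinician-documented progression and "psa" = PSA-only event.
events = [
    ("pt1", date(2024, 3, 1), "clinician"),
    ("pt1", date(2024, 1, 15), "psa"),
    ("pt2", date(2024, 6, 10), "psa"),
    ("pt3", date(2024, 2, 20), "clinician"),
]

def first_rwp_event(events):
    """Merge both sources and keep the earliest real-world progression
    (rwP) event per patient, preserving which source produced it."""
    first = {}
    for pid, dt, source in events:
        if pid not in first or dt < first[pid][0]:
            first[pid] = (dt, source)
    return first

first = first_rwp_event(events)
# pt1's earliest event is the PSA-only one, even though a
# clinician-documented event exists later.
print(first["pt1"])
```

This is why the PSA layer matters for time-to-event endpoints: for patients like the hypothetical pt1, a clinician-documentation-only approach would date progression weeks later, biasing PFS estimates upward.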

The broader implication of this work goes beyond prostate cancer. Disease-specific biomarkers—PSA in prostate cancer, CA-125 in ovarian cancer, AFP in hepatocellular carcinoma—carry clinical meaning that generic progression frameworks can miss. Building methodologies that can capture that specificity, validate them rigorously, and do it at scale is difficult but important work. It requires close collaboration between data scientists, machine learning engineers, and clinicians who understand these numbers and their context.

What holds it all together

Real-world evidence in prostate cancer has an enormous amount to offer—but only if we can stand behind the quality of the data powering it. The work we're presenting at ISPOR reflects Flatiron's values: data generation that is both expeditious and scientifically rigorous.

Building ML-based approaches that integrate many sources of clinical data requires more than machine learning capability. It requires clinicians who know the nuances of how progression is assessed and documented in routine patient care, can translate that into a modeling approach, and can assess whether the results are clinically appropriate and meaningful.

If you're attending ISPOR this year, stop by booth #316 or reach out to our team directly to schedule time in advance. We would love to talk through this research and what it means for your work.

 
