Three-pillar framework sets methodological benchmark for quality and transparency in oncology data
Flatiron Health today announced the publication of the Validation of Accuracy for LLM/ML-Extracted Information and Data (VALID) Framework in the Journal of Clinical Oncology Clinical Cancer Informatics. The framework represents the first and most comprehensive peer-reviewed approach to evaluating the quality and reliability of real-world data extracted by large language models (LLMs) and machine learning, setting a methodological benchmark for data integrity in oncology research.
As large language models emerge as a tool for clinical data extraction from sources such as electronic health records, the industry faces a tradeoff—AI can unlock speed and scale, but it requires rigorous validation. Flatiron’s VALID Framework makes real-world data quality transparent and measurable, enabling evidence that meets the bar for high-stakes clinical decisions. Specifically, the framework applies a rigorous, three-pillar approach: variable-level performance metrics that benchmark LLM extraction against expert human abstraction; automated verification checks that systematically identify logical inconsistencies and implausibilities in data; and replication and benchmark analyses that confirm LLM-extracted results replicate established clinical findings.
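As an illustration only, the first two pillars can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not Flatiron's implementation: the `Record` fields, the single binary biomarker variable, and the death-before-diagnosis rule are all hypothetical stand-ins chosen to show what a variable-level comparison against human abstraction and an automated plausibility check might look like. The third pillar, replication of established clinical findings, depends on study-specific analyses and is not sketched here.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class Record:
    # All field names are hypothetical; they are not taken from the VALID publication.
    patient_id: str
    diagnosis_date: date
    death_date: Optional[date]
    biomarker_llm: bool     # value extracted by the model
    biomarker_human: bool   # value from expert human abstraction (reference)


def variable_metrics(records):
    """Pillar 1 (sketch): compare LLM extraction with human abstraction for one variable."""
    tp = sum(r.biomarker_llm and r.biomarker_human for r in records)
    fp = sum(r.biomarker_llm and not r.biomarker_human for r in records)
    fn = sum((not r.biomarker_llm) and r.biomarker_human for r in records)
    precision = tp / (tp + fp) if (tp + fp) else float("nan")
    recall = tp / (tp + fn) if (tp + fn) else float("nan")
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else float("nan")
    return {"precision": precision, "recall": recall, "f1": f1}


def implausible_death_dates(records):
    """Pillar 2 (sketch): flag records whose death date precedes the diagnosis date."""
    return [
        r.patient_id
        for r in records
        if r.death_date is not None and r.death_date < r.diagnosis_date
    ]


# Toy data, invented purely for demonstration.
records = [
    Record("A", date(2020, 1, 1), None, True, True),
    Record("B", date(2020, 2, 1), None, True, False),
    Record("C", date(2020, 3, 1), date(2019, 12, 1), False, True),
    Record("D", date(2020, 4, 1), date(2021, 4, 1), False, False),
    Record("E", date(2020, 5, 1), None, True, True),
]

metrics = variable_metrics(records)
flagged = implausible_death_dates(records)
```

In this toy example the metrics describe agreement with the human reference for one variable, while the flagged list surfaces a record that is internally inconsistent regardless of what the reference says. The two checks are complementary, which is presumably why the framework treats them as separate pillars.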
"By publishing this framework transparently, we hope to contribute to raising the bar across the industry," said Nathan Hubbard, Chief Executive Officer of Flatiron Health. "Our commitment to data quality while applying LLMs responsibly and rigorously has enabled us to work at scale—with longitudinal records across millions of patients and over 1.5 billion data points—without compromising the rigor that has defined Flatiron for decades."
Flatiron's LLM-extracted data builds on the highest-quality, human-abstracted real-world oncology data. By combining AI with expert human abstraction, Flatiron delivers gold-standard data quality at scale without trading off the clinical rigor that makes it fit for use in the highest-stakes decisions in cancer care and drug development. Every LLM-enabled dataset is subject to the VALID Framework, alongside long-term clinical and scientific oversight, to ensure that the data captures complete patient journeys and validated outcomes.
"The VALID Framework, combined with our robust clinical and methodological expertise, gives us—and our customers—a clear basis for evaluating whether efficiency and accuracy go hand in hand, as well as confidence in clinical and strategic decisions made using real-world data," said Jonathan Kish, PhD, MPH, Vice President and Head of Research Sciences at Flatiron Health. "We’re investing deeply in the underlying work: data models, multimodal depth, and resolving complex edge cases to ensure that we're not just extracting more data; we're extracting better data at scale, so every decision is informed by intelligence you can trust."
Read the full publication: Estevez M, Singh N, Dyson L, et al. Ensuring Reliability of Curated EHR-Derived Data: The Validation of Accuracy for LLM/ML-Extracted Information and Data (VALID) Framework. JCO Clin Cancer Inform. 2026. https://ascopubs.org/doi/10.1200/CCI-25-00215
About Flatiron Health
Flatiron Health is a healthtech company expanding the possibilities for point-of-care solutions in oncology and using data for good to power smarter care for every person with cancer. Through machine learning and AI, real-world evidence, and breakthroughs in clinical trials, we continue to transform patients’ real-life experiences into knowledge and create a more modern, connected oncology ecosystem. Flatiron Health is an independent affiliate of the Roche Group.


