Overview
Large language models (LLMs) are increasingly used to extract clinical information from electronic health records, offering a scalable alternative to manual data abstraction. However, ensuring the accuracy and reliability of LLM-derived data is essential before it can be used in research.
In this study, researchers applied the VALID framework—a comprehensive approach to evaluating data quality—to a large LLM-derived prostate cancer dataset. They compared LLM-extracted variables against human-abstracted data, ran internal consistency checks, and replicated key clinical outcomes. LLM extraction performed comparably to manual abstraction, with only small differences in accuracy, and survival estimates were highly consistent across the two datasets.
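The core comparison step—checking how often an LLM-extracted variable agrees with the human-abstracted gold standard—can be sketched as below. This is an illustrative example only; the variable names, values, and `agreement_rate` helper are hypothetical and not taken from the study.

```python
# Hypothetical sketch: per-variable agreement between LLM-extracted and
# human-abstracted values. Names and data are illustrative, not from the study.

def agreement_rate(llm_values, human_values):
    """Fraction of records where the LLM extraction matches the human label."""
    if len(llm_values) != len(human_values):
        raise ValueError("Both datasets must cover the same records")
    matches = sum(a == b for a, b in zip(llm_values, human_values))
    return matches / len(llm_values)

# Example: a categorical clinical variable extracted for five patients
llm_extracted = ["7", "8", "7", "9", "6"]
human_abstracted = ["7", "8", "7", "9", "7"]
print(f"Agreement: {agreement_rate(llm_extracted, human_abstracted):.0%}")
```

In practice, a validation like the one described would compute such agreement metrics per variable, then follow up with consistency checks and outcome replication rather than relying on raw agreement alone.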
Why this matters
This study demonstrates that LLMs can generate high-quality real-world data suitable for research when rigorously validated. By enabling scalable data extraction without sacrificing accuracy, LLMs can help expand the scope and speed of real-world evidence generation in oncology.