Fairness by design: End-to-end bias evaluation for LLM-generated data

Published

July 2025

Citation

Estevez M, Mbah O, Sheikh A, et al. Fairness by design: End-to-end bias evaluation for LLM-generated data. AACR Special Conference in Cancer Research: Artificial Intelligence and Machine Learning. 2025.

Overview

Large language models (LLMs) are powerful tools for extracting clinical information from electronic health records (EHRs), but because they are trained on data that may reflect existing societal biases, it is essential to evaluate their performance for fairness. In this study, researchers developed an LLM to identify initial and metastatic breast cancer diagnoses and dates from EHRs, then rigorously assessed its accuracy and potential bias across different patient subgroups. The evaluation used a three-pronged approach as outlined in the Validation of Accuracy for LLM/ML-Extracted Information and Data (VALID) Framework: comparing LLM and human abstractor performance on a test set, running verification checks for data consistency and plausibility, and replicating key clinical outcomes in large patient cohorts. 
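To make the second prong concrete, the sketch below shows the kind of consistency and plausibility check a verification step might run over LLM-extracted diagnosis dates. It is illustrative only: the column names (initial_dx_date, met_dx_date) and the date window are hypothetical placeholders, not the study's actual schema or rules.

```python
import pandas as pd

# Minimal sketch of a consistency/plausibility verification check.
# Column names (initial_dx_date, met_dx_date) are hypothetical placeholders.
def plausibility_flags(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()
    for col in ("initial_dx_date", "met_dx_date"):
        out[col] = pd.to_datetime(out[col], errors="coerce")

    # Consistency: a metastatic diagnosis should not precede the initial diagnosis.
    out["met_before_initial"] = out["met_dx_date"] < out["initial_dx_date"]

    # Plausibility: extracted dates should fall within a reasonable window.
    lo, hi = pd.Timestamp("1990-01-01"), pd.Timestamp.today()
    out["initial_dx_implausible"] = out["initial_dx_date"].notna() & ~out["initial_dx_date"].between(lo, hi)
    out["met_dx_implausible"] = out["met_dx_date"].notna() & ~out["met_dx_date"].between(lo, hi)

    # A row with any failed check would be routed for review rather than used as-is.
    out["any_flag"] = out[["met_before_initial", "initial_dx_implausible", "met_dx_implausible"]].any(axis=1)
    return out
```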

Results showed that the LLM's accuracy in identifying diagnoses varied slightly by race/ethnicity and age, with higher recall for Black patients and higher precision for Latinx patients compared to White patients, but lower recall for patients aged 75 and older compared to patients aged 50-64 (the majority group). Importantly, results from the verification checks and the overall survival estimates from the replication analysis described in the VALID Framework were consistent between the LLM- and abstraction-generated datasets, with only minor differences observed.
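As an illustration of how subgroup differences in recall and precision of this kind can be quantified, the following sketch compares LLM extractions against abstractor labels within each subgroup. The column names (race_ethnicity, llm_positive, abstractor_positive) are hypothetical stand-ins, not the study's actual test-set fields.

```python
import pandas as pd

# Minimal sketch of subgroup-stratified precision/recall against abstractor labels.
# Assumes boolean columns llm_positive and abstractor_positive (hypothetical names).
def subgroup_precision_recall(df: pd.DataFrame, group_col: str) -> pd.DataFrame:
    def metrics(g: pd.DataFrame) -> pd.Series:
        tp = (g["llm_positive"] & g["abstractor_positive"]).sum()
        fp = (g["llm_positive"] & ~g["abstractor_positive"]).sum()
        fn = (~g["llm_positive"] & g["abstractor_positive"]).sum()
        return pd.Series({
            "n": len(g),
            "precision": tp / (tp + fp) if (tp + fp) else float("nan"),
            "recall": tp / (tp + fn) if (tp + fn) else float("nan"),
        })

    return df.groupby(group_col).apply(metrics)

# Example usage: subgroup_precision_recall(test_set, "race_ethnicity")
```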

Why this matters

As LLMs become more widely used to curate real-world oncology data, ensuring that these tools do not perpetuate or amplify health inequities is critical. This research demonstrates a robust framework for evaluating both the quality and fairness of LLM-extracted data, highlighting the importance of ongoing bias assessment. By identifying small but meaningful subgroup differences, the study provides actionable insights to improve model fairness—such as refining prompts or validation processes—ultimately supporting more equitable and trustworthy use of AI in cancer research and care.
