Fairness by design: End-to-end bias evaluation for LLM-generated data

Published

July 2025

Citation

Estevez M, Mbah O, Sheikh A, et al. Fairness by design: End-to-end bias evaluation for LLM-generated data. AACR Special Conference in Cancer Research: Artificial Intelligence and Machine Learning. 2025.

Overview

Large language models (LLMs) are powerful tools for extracting clinical information from electronic health records (EHRs), but because they are trained on data that may reflect existing societal biases, it is essential to evaluate their performance for fairness. In this study, researchers developed an LLM to identify initial and metastatic breast cancer diagnoses and dates from EHRs, then rigorously assessed its accuracy and potential bias across different patient subgroups. The evaluation used a three-pronged approach as outlined in the Validation of Accuracy for LLM/ML-Extracted Information and Data (VALID) Framework: comparing LLM and human abstractor performance on a test set, running verification checks for data consistency and plausibility, and replicating key clinical outcomes in large patient cohorts. 
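To make the second prong concrete, the sketch below shows the kind of consistency and plausibility check a verification step might run over LLM-extracted diagnosis dates. It is illustrative only: the column names (initial_dx_date, met_dx_date) and the date window are hypothetical placeholders, not the study's actual schema or rules.

```python
import pandas as pd

# Minimal sketch of a consistency/plausibility verification check.
# Column names (initial_dx_date, met_dx_date) are hypothetical placeholders.
def plausibility_flags(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()
    for col in ("initial_dx_date", "met_dx_date"):
        out[col] = pd.to_datetime(out[col], errors="coerce")

    # Consistency: a metastatic diagnosis should not precede the initial diagnosis.
    out["met_before_initial"] = out["met_dx_date"] < out["initial_dx_date"]

    # Plausibility: extracted dates should fall within a reasonable window.
    lo, hi = pd.Timestamp("1990-01-01"), pd.Timestamp.today()
    out["initial_dx_implausible"] = out["initial_dx_date"].notna() & ~out["initial_dx_date"].between(lo, hi)
    out["met_dx_implausible"] = out["met_dx_date"].notna() & ~out["met_dx_date"].between(lo, hi)

    # A row with any failed check would be routed for review rather than used as-is.
    out["any_flag"] = out[["met_before_initial", "initial_dx_implausible", "met_dx_implausible"]].any(axis=1)
    return out
```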

Results showed that the LLM's accuracy in identifying diagnoses varied slightly by race/ethnicity and age, with higher recall for Black patients and higher precision for Latinx patients compared to White patients, but lower recall for patients aged 75 and older compared to patients aged 50-64 (the majority group). Importantly, results from the verification checks and the overall survival estimates from the replication analysis described in the VALID Framework were consistent between the LLM- and abstraction-generated datasets, with only minor differences observed.
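As an illustration of how subgroup differences in recall and precision of this kind can be quantified, the following sketch compares LLM extractions against abstractor labels within each subgroup. The column names (race_ethnicity, llm_positive, abstractor_positive) are hypothetical stand-ins, not the study's actual test-set fields.

```python
import pandas as pd

# Minimal sketch of subgroup-stratified precision/recall against abstractor labels.
# Assumes boolean columns llm_positive and abstractor_positive (hypothetical names).
def subgroup_precision_recall(df: pd.DataFrame, group_col: str) -> pd.DataFrame:
    def metrics(g: pd.DataFrame) -> pd.Series:
        tp = (g["llm_positive"] & g["abstractor_positive"]).sum()
        fp = (g["llm_positive"] & ~g["abstractor_positive"]).sum()
        fn = (~g["llm_positive"] & g["abstractor_positive"]).sum()
        return pd.Series({
            "n": len(g),
            "precision": tp / (tp + fp) if (tp + fp) else float("nan"),
            "recall": tp / (tp + fn) if (tp + fn) else float("nan"),
        })

    return df.groupby(group_col).apply(metrics)

# Example usage: subgroup_precision_recall(test_set, "race_ethnicity")
```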

Why this matters

As LLMs become more widely used to curate real-world oncology data, ensuring that these tools do not perpetuate or amplify health inequities is critical. This research demonstrates a robust framework for evaluating both the quality and fairness of LLM-extracted data, highlighting the importance of ongoing bias assessment. By identifying small but meaningful subgroup differences, the study provides actionable insights to improve model fairness—such as refining prompts or validation processes—ultimately supporting more equitable and trustworthy use of AI in cancer research and care.
