Considerations for the use of machine learning extracted real-world data to support evidence generation: A research-centric evaluation framework

Published

June 2022

Citation

Estevez M, Benedum CM, Jiang C, Cohen AB, Phadke S, Sarkar S, Bozkurt S. Considerations for the Use of Machine Learning Extracted Real-World Data to Support Evidence Generation: A Research-Centric Evaluation Framework. Cancers. 2022; 14(13):3063. https://doi.org/10.3390/cancers14133063

Our summary

When working with real-world data (RWD), key information, such as diagnosis dates, biomarker status, and therapies received, is available only as unstructured text in electronic health records (EHRs). Machine learning (ML) can be used to extract these unstructured data elements, but unique challenges emerge when the ML-extracted data are used for research purposes, specifically how best to assess their validity and their generalizability to different cohorts of interest.

This framework covers the fundamentals of evaluating RWD produced using ML methods to maximize the use of EHR data for research purposes.

Why this matters

Using machine learning to extract unstructured data elements found in EHRs can unlock retrospective research at scale. This framework guides a multi-stakeholder evaluation that is transparent, goes beyond standard machine learning metrics, and focuses on RWD methodologic fundamentals and considerations to help determine whether ML-extracted variables are fit for research use.
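
As a rough illustration of what looking beyond a single aggregate metric can mean, the sketch below compares ML-extracted labels with reference labels (for example, from manual abstraction), first overall and then within subgroups. It is not taken from the paper: the records, the `site` field, and the function names are hypothetical, and sensitivity and positive predictive value (PPV) are used only as familiar examples of standard metrics.

```python
from collections import defaultdict


def sensitivity_ppv(pairs):
    """Return (sensitivity, PPV) for an iterable of (reference, predicted) booleans."""
    pairs = list(pairs)
    tp = sum(1 for ref, pred in pairs if ref and pred)
    fn = sum(1 for ref, pred in pairs if ref and not pred)
    fp = sum(1 for ref, pred in pairs if not ref and pred)
    # Return None when a metric is undefined (no positives to divide by).
    sens = tp / (tp + fn) if (tp + fn) else None
    ppv = tp / (tp + fp) if (tp + fp) else None
    return sens, ppv


def stratified_metrics(records, group_key):
    """Compute (sensitivity, PPV) separately for each value of group_key."""
    groups = defaultdict(list)
    for rec in records:
        groups[rec[group_key]].append((rec["reference"], rec["ml_extracted"]))
    return {group: sensitivity_ppv(pairs) for group, pairs in groups.items()}


# Hypothetical toy data: one binary variable, two sites.
records = [
    {"site": "A", "reference": True, "ml_extracted": True},
    {"site": "A", "reference": True, "ml_extracted": True},
    {"site": "A", "reference": False, "ml_extracted": False},
    {"site": "B", "reference": True, "ml_extracted": False},
    {"site": "B", "reference": True, "ml_extracted": True},
    {"site": "B", "reference": False, "ml_extracted": True},
]

overall = sensitivity_ppv((r["reference"], r["ml_extracted"]) for r in records)
by_site = stratified_metrics(records, "site")

print("overall:", overall)
for site, metrics in sorted(by_site.items()):
    print(site, metrics)
```

In this toy data the overall figures sit between the two sites, so performance that looks acceptable in aggregate is noticeably weaker at one site. That kind of gap is one reason an evaluation may need to examine the specific cohort of interest rather than rely on pooled metrics alone.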

More publications

AACR Special Conference in Cancer Research: Artificial Intelligence and Machine Learning

July 2025

Using large language models for scalable extraction of real-world progression events across multiple cancer types

Cohen A, Krismer K, Magee K, et al.

ISPOR

April 2025

Leveraging machine learning to assess the association of rash and survival in patients with advanced NSCLC

Yuan Q, Dolor A, Qian Y, et al.

ISPOR

April 2025

Performance assessment and validation of real-world response data generated using a deep learning-based natural language processing model across multiple solid tumors

Magee K, Yuan Q, Blarre A, et al.