Approach to machine learning for extraction of real-world data variables from electronic health records

Summary

Limited access to research-ready datasets that offer high quality, recency, clinical depth, provenance, completeness, representativeness, and usability is a significant barrier to generating robust and meaningful real-world evidence (RWE). Studies that use electronic health record (EHR)-derived data require extensive pre-processing and curation to create meaningful variables and outcomes for analysis. Much of the valuable information is trapped within unstructured documents such as clinician notes or scanned lab reports, making relevant data difficult to extract. Traditional methods such as manual chart review by clinical experts are time-consuming and resource-intensive, limiting research opportunities and scale.

Recent advances in artificial intelligence (AI), specifically natural language processing (NLP) and machine learning (ML), offer new opportunities to extract clinical details from patient charts and curate high-quality real-world data (RWD) in oncology. This paper provides an overview of Flatiron Health’s approach to applying NLP and ML methods to efficiently extract data from unstructured documents and visit notes stored in oncology care EHRs, with an emphasis on transparency and explainability.

Figure (Adamson 2023). Sentences of text (fictional examples here) from the EHR are inputs to deep learning models that produce a data variable value for each patient as an output.
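To make the input and output shape in the figure concrete, here is a minimal Python sketch. It is an illustration only, not the method described in the paper: `score_sentence` is a hypothetical keyword rule standing in for a trained deep learning model, and the term "metastatic", the threshold, and the example sentences are all assumptions made up for this sketch.

```python
from collections import defaultdict

# Hypothetical stand-in for a trained model: a real system would return a
# learned probability; here we simply flag sentences containing a term.
def score_sentence(sentence: str, term: str = "metastatic") -> float:
    return 1.0 if term in sentence.lower() else 0.0

def extract_variable(sentences, threshold=0.5):
    """Map (patient_id, sentence) pairs to one boolean value per patient.

    Each patient's value is True if any of their sentences scores at or
    above the threshold (max-pooling over sentence scores).
    """
    best = defaultdict(float)
    for patient_id, text in sentences:
        best[patient_id] = max(best[patient_id], score_sentence(text))
    return {pid: score >= threshold for pid, score in best.items()}

# Fictional example sentences, keyed by fictional patient IDs.
notes = [
    ("p1", "Patient presents with metastatic disease to the liver."),
    ("p1", "Follow up in 3 weeks."),
    ("p2", "No evidence of disease on imaging."),
]
result = extract_variable(notes)
print(result)
```

The point of the sketch is the shape of the pipeline: many sentences per patient go in, and a single variable value per patient comes out. In practice the scoring function would be replaced by a trained model and evaluated against expert-abstracted labels.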

Why this matters

ML helps us learn more from examples of patients with specific characteristics, diseases, and therapies to achieve statistically meaningful results, especially for those who historically have been oppressed or marginalized in oncology clinical trials.

With the rapid rise in personalized treatment and new biomarkers, transparent fit-for-purpose applications of ML will become increasingly important. With high-performance models, we can learn more from every patient's experience, reduce potential bias in evidence, and understand treatment effectiveness and safety in a timely way.

Learn more about how we assess and achieve high-quality RWD with our machine learning performance evaluation framework
