Approach to machine learning for extraction of real-world data variables from electronic health records


March 2023


Adamson B, Waskom M, Blarre A, Kelly J, Krismer K, Nemeth S, Gippetti J, Ritten J, Harrison K, Ho G, Linzmayer R, Bansal T, Wilkinson S, Amster G, Estola E, Benedum CM, Fidyk K, Estevez M, Shapiro W, and Cohen AB (2023). Approach to machine learning for extraction of real-world data variables from electronic health records. Front. Pharmacol. 14:1180962. doi: 10.3389/fphar.2023.1180962


Limited access to research-ready datasets with high quality, recency, clinical depth, provenance, completeness, representativeness, and usability is a significant barrier to generating robust and meaningful real-world evidence (RWE). Studies that use electronic health record (EHR)-derived data require extensive data pre-processing and curation to create meaningful variables and outcomes for analysis. However, valuable information is often trapped within unstructured documents such as clinician notes or scanned lab reports, making it difficult to extract relevant data. Traditional methods such as manual chart review by clinical experts are time-consuming and resource-intensive, limiting research opportunities and scale.

Recent advances in artificial intelligence (AI), specifically natural language processing (NLP) and machine learning (ML), offer new opportunities to extract clinical details from patient charts and curate high-quality real-world data (RWD) in oncology. This paper provides an overview of Flatiron Health’s approach to applying NLP and ML methods to efficiently extract data from unstructured documents and visit notes stored in oncology care EHRs, while offering transparency and explainability.

Figure 5 (Adamson et al., 2023). Sentences of text (fictional examples here) from EHRs are inputs to deep learning models that produce a data variable value for each patient as an output.
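As a rough illustration of the flow the figure describes (sentences in, one variable value per patient out), here is a minimal Python sketch. It is not the method from the paper: the variable name (PD-L1 status), the search term, the `toy_model` scoring function, and the max-score aggregation rule are all assumptions chosen for illustration, and `toy_model` is a simple keyword rule standing in for a trained deep learning model.

```python
import re

# Hypothetical search term for a hypothetical variable of interest.
KEYWORD = re.compile(r"\bPD-L1\b", re.IGNORECASE)


def split_sentences(note):
    """Naive sentence splitter; real clinical text needs something more robust."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", note) if s.strip()]


def toy_model(sentence):
    """Placeholder for a trained model; returns a score in [0, 1].

    Assumption: a keyword rule is used here only so the sketch runs
    without any ML dependencies.
    """
    text = sentence.lower()
    if "positive" in text:
        return 0.9
    if "negative" in text:
        return 0.1
    return 0.5


def extract_variable(notes_by_patient, threshold=0.5):
    """Map each patient's notes to a single variable value."""
    results = {}
    for patient_id, notes in notes_by_patient.items():
        # Keep only sentences that mention the term of interest.
        sentences = [
            s for note in notes for s in split_sentences(note) if KEYWORD.search(s)
        ]
        if not sentences:
            results[patient_id] = "unknown"
            continue
        # Assumed aggregation rule: take the highest sentence-level score.
        score = max(toy_model(s) for s in sentences)
        results[patient_id] = "positive" if score > threshold else "negative"
    return results
```

Called on a dictionary of fictional notes keyed by patient ID, `extract_variable` returns one label per patient. In practice, the scoring step would be replaced by a model trained and evaluated against expert-abstracted labels.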

Why this matters

ML helps us learn more from examples of patients with specific characteristics, diseases, and therapies to achieve statistically meaningful results, especially for those who historically have been oppressed or marginalized in oncology clinical trials. 

With the rapid rise in personalized treatment and new biomarkers, transparent fit-for-purpose applications of ML will become increasingly important. With high-performance models, we can learn more from every patient's experience, reduce potential bias in evidence, and understand treatment effectiveness and safety in a timely way.


Learn more about how we assess and achieve high-quality RWD with our machine learning performance evaluation framework