Summary
Real-world data (RWD) from electronic health records (EHRs) can provide valuable, large-scale information on how clinical characteristics and treatments affect patient outcomes. However, using EHR data poses challenges: much of the relevant information is recorded as unstructured text, which must be extracted into analyzable variables before it can be used.
Two common approaches to overcoming this issue are manual chart review (human abstraction) and machine learning (ML) models that extract variables of interest from unstructured text. ML models enable the analysis of much larger datasets than manual abstraction alone, but the variables they extract may contain model errors.
In this paper, researchers from the Fred Hutchinson Cancer Center and Flatiron Health evaluate different approaches to using ML-extracted variables in downstream statistical analyses. The novel methods they propose combine ML-extracted data with a limited amount of human-abstracted data to improve the estimation of statistical parameters relative to either approach in isolation.
Why this matters
ML models can accelerate the extraction of unstructured clinical information from EHR databases into variables that can be analyzed. However, it is crucial to determine whether these variables can be treated equivalently to structured or abstracted variables in statistical analyses. This study investigates various approaches to analyzing ML-extracted variables, with the goal of obtaining accurate parameter estimates and valid statistical inferences similar to those obtained from expert-abstracted variables.