Summary
Health authorities have highlighted the completeness of real-world data (RWD) from electronic health records (EHRs) as a key component of data integrity, and incompleteness as a shortcoming of observational data. Natural language processing (NLP) has proven beneficial in addressing missingness, particularly in oncology EHRs, where essential information is often recorded only in clinical notes. By automating extraction from those notes, NLP can improve the completeness of these details at scale.
In this study, researchers from Huntsman Cancer Institute, NYU Grossman School of Medicine, and Flatiron Health successfully developed a high-performing NLP algorithm to extract Eastern Cooperative Oncology Group Performance Status (ECOG PS) from unstructured EHR sources for patients starting new treatments across 21 distinct cancer types.
ECOG PS indicates the general health status of a patient with cancer. Access to this variable can enhance oncology research, help determine eligibility criteria in clinical trials, and facilitate decisions by both regulatory and health technology assessment (HTA) bodies.
The study found that NLP can be an important tool for addressing RWD missingness. Implementing NLP increased the availability of ECOG PS in the dataset from 60% to 73%. When compared with ECOG PS values captured in structured EHR fields, NLP-derived ECOG PS showed high accuracy (93%), sensitivity (88%), and positive predictive value (PPV; 88%).
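To make the reported measures concrete, the following is a minimal Python sketch of how availability, accuracy, sensitivity, and PPV could be computed when NLP-derived values are compared against structured values. It is not the study's code. The function names, the toy data, and the choice to score a single ECOG PS category against all others are assumptions made for illustration; this summary does not say how the study aggregated metrics across ECOG PS categories.

```python
from typing import Dict, List, Optional, Sequence

Value = Optional[int]  # an ECOG PS score, or None when it is missing


def coverage(values: Sequence[Value]) -> float:
    """Fraction of records that have a non-missing value."""
    return sum(v is not None for v in values) / len(values)


def fill_missing(structured: Sequence[Value], nlp: Sequence[Value]) -> List[Value]:
    """Keep the structured value when present, otherwise use the NLP value."""
    return [s if s is not None else n for s, n in zip(structured, nlp)]


def one_vs_rest_metrics(
    reference: Sequence[Value], predicted: Sequence[Value], positive: int
) -> Dict[str, float]:
    """Accuracy, sensitivity, and PPV for one category versus all others.

    Only records where both sources have a value are scored (an assumption).
    """
    pairs = [
        (r, p)
        for r, p in zip(reference, predicted)
        if r is not None and p is not None
    ]
    tp = sum(r == positive and p == positive for r, p in pairs)
    fp = sum(r != positive and p == positive for r, p in pairs)
    fn = sum(r == positive and p != positive for r, p in pairs)
    tn = len(pairs) - tp - fp - fn
    return {
        "accuracy": (tp + tn) / len(pairs),
        "sensitivity": tp / (tp + fn) if (tp + fn) else float("nan"),
        "ppv": tp / (tp + fp) if (tp + fp) else float("nan"),
    }


# Hypothetical toy data, invented for illustration only.
structured = [0, 1, None, 2, None, 1, None, 0, 1, None]
nlp = [0, 1, 1, 1, None, 1, 2, 0, 0, None]

combined = fill_missing(structured, nlp)
print(f"availability before NLP: {coverage(structured):.0%}")
print(f"availability after NLP:  {coverage(combined):.0%}")
print(one_vs_rest_metrics(structured, nlp, positive=1))
```

In this sketch, structured values take priority and NLP only fills gaps, so availability can rise but never fall. Agreement is scored only on records where both sources report a value, which mirrors the idea of validating NLP output against structured EHR fields.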
Why this matters
Natural language processing algorithms can help tackle critical challenges associated with RWD, including data missingness. They can also help realize a fundamental benefit of RWD: the ability to aggregate extensive longitudinal clinical information from large patient cohorts, which supports high-quality clinical research. This improves our ability to answer meaningful research questions and benefits healthcare providers, HTA bodies, regulatory stakeholders, and, above all, patients.