Optimization of natural language processing-supported comorbidity classification algorithms in electronic health records

https://www.valueinhealthjournal.com/article/S1098-3015(19)30482-6/abstract

Authors:
Hooley I, Chen R, Long L, Cohen A, Adamson B

Objectives

Diagnosis code-based algorithms for comorbidity identification previously validated in administrative claims data have different sensitivities and positive predictive values (PPV) when applied in electronic health records (EHR). Novel algorithms leveraging natural language processing (NLP) of unstructured EHR data may improve accuracy beyond billing codes. We aimed to 1) define the design process for efficient comorbidity classification algorithms leveraging NLP and ICD codes, and 2) apply it to a case study of HIV status in an oncology EHR.

Methods

We developed a framework to optimize an NLP classification algorithm: pre-specify a minimum PPV threshold, iteratively test combinations of phrases that identify the comorbidity, validate with manual chart abstraction, and assess PPV, with the goal of identifying more potential cases (n_NLP) than ICD codes alone (n_codes). This proof-of-concept study applied the framework to predict HIV status among 2.2 million oncology patients in the Flatiron Health EHR-derived database. Iterations continued until PPV > 70% and n_NLP > n_codes. Internal validation by manual chart abstraction confirmed the status of a random sample with HIV diagnosis codes, and the NLP classification sensitivity was assessed within this sample.
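The stopping rule described above (PPV > 70% and n_NLP > n_codes) can be illustrated with a minimal sketch. The phrase list, patient notes, and function names below are hypothetical and are not the study's actual phrases or implementation; the sketch also checks every flagged patient against chart review, whereas the study validated a random sample.

```python
import re
from typing import Iterable


def classify_by_phrases(notes: dict[str, str], phrases: Iterable[str]) -> set[str]:
    """Return IDs of patients whose note text contains any phrase (case-insensitive)."""
    phrase_list = list(phrases)
    if not phrase_list:
        return set()
    pattern = re.compile("|".join(re.escape(p) for p in phrase_list), re.IGNORECASE)
    return {pid for pid, text in notes.items() if pattern.search(text)}


def ppv(flagged: set[str], chart_confirmed: set[str]) -> float:
    """Positive predictive value: confirmed cases among flagged patients."""
    if not flagged:
        return 0.0
    return len(flagged & chart_confirmed) / len(flagged)


def meets_stopping_rule(n_nlp: int, n_codes: int, ppv_value: float,
                        min_ppv: float = 0.70) -> bool:
    """Stop iterating once PPV exceeds the threshold and NLP finds more cases than codes."""
    return ppv_value > min_ppv and n_nlp > n_codes


# Toy data (entirely made up for illustration).
notes = {
    "p1": "History of HIV infection, on ART.",
    "p2": "HIV positive since 2005.",
    "p3": "No relevant history.",
    "p4": "Patient is hiv positive.",
    "p5": "Family member with HIV infection.",  # a false positive for phrase matching
}
code_ids = {"p1"}                      # patients found by diagnosis codes alone
chart_confirmed = {"p1", "p2", "p4"}   # hypothetical chart abstraction result

candidate_phrases = ["HIV positive", "HIV infection"]  # hypothetical phrases
flagged = classify_by_phrases(notes, candidate_phrases)
current_ppv = ppv(flagged, chart_confirmed)
done = meets_stopping_rule(len(flagged), len(code_ids), current_ppv)
print(sorted(flagged), current_ppv, done)
```

In this toy run the phrase set flags four patients, three of whom are confirmed, so the rule is satisfied; in a real iteration a failing result would prompt revising the phrase combinations and repeating the check.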

Results

Five iteration cycles optimized an NLP algorithm using 9 core phrases with 40 permutations. Overall (n=2.2 million), NLP classified more potential HIV cases (n=11,063) than diagnosis codes (n=4,592), with 3,452 patients classified by both approaches. Internal validation estimated a 77% PPV (69/90) for the NLP algorithm and >99% for the ICD code algorithm (15/15). Applied to the ICD code-identified cohort, the NLP algorithm sensitivity was 75%.
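The reported figures can be checked with simple arithmetic. The sketch below uses only the counts stated above; the variable names are illustrative, and treating the overlap share (3,452 of 4,592) as the basis for the reported 75% sensitivity is an assumption, since the abstract does not state how that figure was computed.

```python
# Counts reported in the Results.
n_nlp = 11_063    # patients classified by NLP
n_codes = 4_592   # patients classified by diagnosis codes
n_both = 3_452    # patients classified by both approaches

# PPV from manual chart abstraction samples.
ppv_nlp = 69 / 90
ppv_codes = 15 / 15

# Share of code-identified patients also found by NLP
# (assumed here to correspond to the reported 75% sensitivity).
overlap_share = n_both / n_codes

# Patients found by either approach, and by NLP only.
n_either = n_nlp + n_codes - n_both
n_nlp_only = n_nlp - n_both

print(f"NLP PPV: {ppv_nlp:.1%}")              # 76.7%
print(f"Code PPV: {ppv_codes:.1%}")           # 100.0%
print(f"Overlap share: {overlap_share:.1%}")  # 75.2%
print(f"Either approach: {n_either:,}")       # 12,203
print(f"NLP only: {n_nlp_only:,}")            # 7,611
```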

Conclusions

These findings suggest that NLP can supplement ICD codes to improve the sensitivity of EHR comorbidity classification algorithms, albeit with lower PPV. This framework needs further internal and external validation to evaluate specificity and sensitivity. These results highlight the potential value of NLP approaches in defining research cohorts in EHR populations.

Sources:
ISPOR Annual Meeting

