Japanese language EHR LLM extraction of longitudinal unstructured ECOG performance status

Overview

Accurately capturing patient performance status is essential for oncology research and clinical decision-making, yet in Japan, the Eastern Cooperative Oncology Group (ECOG) performance status is typically recorded only in unstructured clinical notes, making data collection challenging and resource-intensive.

This study evaluated whether large language models (LLMs) could automatically extract ECOG performance status from Japanese electronic health records (EHRs), potentially unlocking scalable real-world data curation. Compared with manual human abstraction, the LLM achieved 100% sensitivity and 100% precision in identifying performance status values. Notably, among records where human reviewers had missed the information, the model identified a performance status in approximately 50% of cases. The computational cost of the automated approach was estimated at less than 5% of the cost of traditional manual data abstraction.
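To make the reported metrics concrete, here is a minimal sketch of how sensitivity and precision against human abstraction might be computed. The summary does not describe the study's actual evaluation procedure, so the function name `compare_to_abstraction`, the use of `None` for "no value recorded", and the choice to score records the human reviewer missed separately are all assumptions made for illustration. The sample values are invented toy data, not study results.

```python
from typing import Optional, Sequence


def compare_to_abstraction(
    human: Sequence[Optional[int]],
    model: Sequence[Optional[int]],
) -> dict[str, float]:
    """Compare model-extracted ECOG values with human-abstracted ones.

    Hypothetical helper: None means no performance status was recorded
    for that record by the given source.
    """
    if len(human) != len(model):
        raise ValueError("human and model must have the same length")

    pairs = list(zip(human, model))
    # Reference set: records where the human abstractor recorded a value.
    reference = [(h, m) for h, m in pairs if h is not None]
    # Records where the human abstractor recorded nothing.
    human_missing = [(h, m) for h, m in pairs if h is None]

    # Model value matches the human value.
    true_pos = sum(1 for h, m in reference if m == h)
    # Model returned some value (right or wrong) on a reference record.
    model_answered = sum(1 for _, m in reference if m is not None)
    # Model found a value where the human recorded none.
    model_only = sum(1 for _, m in human_missing if m is not None)

    def ratio(numerator: int, denominator: int) -> float:
        return numerator / denominator if denominator else float("nan")

    return {
        # Assumed definition: share of human-recorded values the model reproduced.
        "sensitivity": ratio(true_pos, len(reference)),
        # Assumed definition: share of model answers on reference records that were correct.
        "precision": ratio(true_pos, model_answered),
        # Share of human-missed records where the model still found a value.
        "recovered_when_human_missing": ratio(model_only, len(human_missing)),
    }


# Invented toy data: six records, two of which the human reviewer missed.
toy_human = [1, 0, 2, 1, None, None]
toy_model = [1, 0, 3, None, 1, None]
toy_result = compare_to_abstraction(toy_human, toy_model)
```

On this toy data the model reproduces two of the four human-recorded values (sensitivity 0.5), is correct on two of its three answers for those records (precision about 0.67), and finds a value in one of the two records the human missed (0.5). Keeping the human-missed records out of the precision denominator is a design choice of this sketch: without a separate review there is no reference to say whether those extra values are right or wrong.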

Why this matters

This research demonstrates that LLM-based extraction of ECOG performance status from Japanese clinical notes is both highly accurate and cost-effective. By automating this process, researchers can scale real-world data collection across Japan more efficiently, enabling richer longitudinal datasets for cancer research. Future advances in this approach in Japan will build on Flatiron’s focused efforts to validate LLM-extracted real-world data globally, including publications like the VALID Framework, the industry's first comprehensive approach to evaluating AI-extracted real-world data. Finally, this work supports international research collaboration by creating consistent, harmonized clinical data across countries, ultimately improving our ability to conduct meaningful real-world evidence studies that inform treatment decisions for cancer patients globally.
