Overview
Missing race/ethnicity information in real-world health care data sets makes it difficult to assess and address health inequities. The Bayesian Improved Surname Geocoding (BISG) method is a validated approach for imputing missing race/ethnicity data, but it has seen few applications in large, diverse oncology cohorts. This study evaluated the validity of BISG in a nationwide electronic health record (EHR)-derived cohort of cancer patients, comparing BISG-imputed race/ethnicity with EHR-documented data and assessing its impact on key patient outcomes.
Using data from over 2.25 million patients in the Flatiron Health EHR-derived database, researchers found that BISG significantly increased the proportion of Latinx, non-Latinx (NL-) Asian, NL-Black, and NL-White patients by “unlocking” patients with unknown race/ethnicity. The method demonstrated high classification accuracy and concordance with EHR-documented race/ethnicity, particularly for NL-White patients. Importantly, when assessing associations between race/ethnicity and outcomes such as overall survival, time to treatment initiation, and clinical trial participation, results were consistent between BISG-imputed and EHR-documented race/ethnicity.
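As a rough illustration of the idea behind BISG, the sketch below combines a surname-based probability with a geography-based probability using Bayes' rule, in the general form P(race | surname, geography) ∝ P(race | surname) × P(geography | race). This is a minimal sketch of that general technique, not the study's implementation: the function name, the restriction to the four groups named above, and all probability values are hypothetical and chosen only for demonstration.

```python
def bisg_posterior(p_race_given_surname, p_geo_given_race):
    """Combine surname and geography evidence with Bayes' rule.

    Both arguments are dicts keyed by race/ethnicity group.
    Returns a dict of posterior probabilities that sums to 1.
    """
    # Unnormalized posterior: P(race | surname) * P(geography | race)
    unnormalized = {
        group: p_race_given_surname[group] * p_geo_given_race[group]
        for group in p_race_given_surname
    }
    total = sum(unnormalized.values())
    if total == 0:
        raise ValueError("No group has nonzero probability for this input.")
    # Normalize so the probabilities across groups sum to 1
    return {group: value / total for group, value in unnormalized.items()}


# Hypothetical inputs for one patient (illustrative values, not real data)
p_race_given_surname = {
    "Latinx": 0.70, "NL-Asian": 0.02, "NL-Black": 0.03, "NL-White": 0.25,
}
p_geo_given_race = {
    "Latinx": 0.0004, "NL-Asian": 0.0001, "NL-Black": 0.0002, "NL-White": 0.0001,
}

posterior = bisg_posterior(p_race_given_surname, p_geo_given_race)

# One simple way to assign a single category: take the most probable group
imputed = max(posterior, key=posterior.get)
```

Assigning the most probable group is only one possible design choice; analyses may instead carry the full set of posterior probabilities forward, and the study summary does not specify which approach was used.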
Why this matters
Accurate race and ethnicity data are essential for identifying and addressing disparities in cancer care. This study demonstrates that BISG is a reliable tool for augmenting missing race/ethnicity data in oncology research, enabling more comprehensive health equity analyses. By improving the completeness of demographic data, BISG can help researchers and health care providers better understand disparities in treatment access, clinical trial participation, and patient outcomes. These insights can in turn inform policies and interventions aimed at reducing inequities and ensuring that all patients receive high-quality cancer care.