Overview
While real-world data (RWD) from electronic health records are used to study how cancer treatments perform in routine clinical practice, individual databases often contain limited numbers of patients for certain research questions—especially when studying rare outcomes, uncommon exposures, or specific treatment groups. Combining, or “stacking,” multiple RWD sources can help address this challenge by increasing the size and representativeness of patient cohorts, but doing so requires reliable ways to identify and remove duplicate patients while protecting privacy.
In this study, researchers evaluated a privacy-preserving method to combine two oncology databases—the Flatiron Health Research Database and ConcertAI’s data lake—for patients in the United States with treated metastatic breast cancer (mBC). Using a process called tokenization, personally identifiable information was converted into encrypted tokens that allowed patients present in both databases to be removed from one of them, without sharing sensitive data. After applying treatment-specific eligibility criteria and removing overlapping records, the researchers created a stacked dataset combining curated patients from both sources, increasing the treatment-specific cohort size by approximately 50%.
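The deduplication step described above can be illustrated with a minimal sketch. The text does not specify the tokenization algorithm the study used, so everything here is an assumption for illustration: the keyed hash (HMAC-SHA256), the field names (`first_name`, `last_name`, `birth_date`), and the helper names `tokenize` and `stack_cohorts` are hypothetical. The idea it shows is that each source converts identifiers into tokens with a shared secret key, and only the tokens are compared to find overlapping patients.

```python
import hashlib
import hmac

# Hypothetical demo key; a real deployment would manage keys securely.
SECRET_KEY = b"demo-key-not-for-real-use"


def tokenize(record, secret_key):
    """Return a keyed hash of normalized identifying fields.

    The field names are assumptions made for this sketch only.
    """
    normalized = "|".join(
        str(record[field]).strip().lower()
        for field in ("first_name", "last_name", "birth_date")
    )
    return hmac.new(secret_key, normalized.encode("utf-8"), hashlib.sha256).hexdigest()


def stack_cohorts(primary_tokens, secondary_tokens):
    """Keep every primary token and add only secondary tokens not already present."""
    seen = set(primary_tokens)
    unique_secondary = [t for t in secondary_tokens if t not in seen]
    return list(primary_tokens) + unique_secondary


# Toy records, invented for illustration (not study data).
source_a = [
    {"first_name": "Ana", "last_name": "Lopez", "birth_date": "1960-02-14"},
    {"first_name": "Mei", "last_name": "Chen", "birth_date": "1955-07-30"},
]
source_b = [
    {"first_name": " ana ", "last_name": "LOPEZ", "birth_date": "1960-02-14"},
    {"first_name": "Ruth", "last_name": "Okafor", "birth_date": "1971-11-02"},
]

tokens_a = [tokenize(r, SECRET_KEY) for r in source_a]
tokens_b = [tokenize(r, SECRET_KEY) for r in source_b]
stacked = stack_cohorts(tokens_a, tokens_b)
print(len(stacked))
```

In this toy example the patient who appears in both sources is counted once, and neither source needs to see the other's raw identifiers. Exact matching on normalized fields is a simplification; production tokenization services typically handle spelling variants and missing fields in ways this sketch does not.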
Why this matters
Larger patient cohorts yield more precise estimates when studying rare outcomes, treatment patterns, and specific patient subgroups in oncology. This research demonstrates that tokenization can enable the stacking of real-world datasets from different sources while maintaining strict privacy protections. By expanding the number of patients available for analysis, this approach can strengthen real-world evidence generation and support more robust oncology research, including studies that may not be feasible using a single data source alone.