Using machine learning models to unlock insights into Castleman disease


April 2024


James Gippetti, Jonathan Kelly

At Flatiron Health, our commitment to fostering a culture of growth and development is deeply ingrained in our DNA. Our changemakers embody our core value “learn, teach, grow” through their spirit of relentless curiosity and knowledge-sharing. We don’t gatekeep: this series serves as a collection of diverse learnings, contributions, and perspectives from our changemakers. Ranging from personal development stories to first-time manager tips & tricks to impactful projects, this series showcases our commitment to continuous improvement with one another and within our broader community.

As the landscape of cancer treatment and disease management shifts towards more personalized approaches, the need to understand the experiences of increasingly specific patient groups is becoming critical. Traditionally, the field of oncology has relied on clinical trials to gather insights. Flatiron has been leading the charge for over a decade to complement insights from clinical trials with those derived retrospectively from real-world data (RWD). However, when using either clinical trials or RWD, unlocking insights into extremely specific populations has been challenging due to the difficulty and cost associated with identifying a cohort of patients sizable enough to do statistically significant analyses.

In response to these challenges, Flatiron's Machine Learning (ML) team has embarked on an ambitious journey. We aim to harness advanced ML techniques to extract insights from patient charts efficiently and at a scale previously unattainable. This is no easy task, as it requires a lot of effort upfront to curate a set of examples with answers (labels) that we use to teach each model what to extract.

For extremely specific cohorts like rare disease populations, this upfront cost can make the entire project infeasible. To address this, our team has been working on models capable of accurately generalizing to patient populations that did not exist in the model's training data. For example, a model trained on patients who have been diagnosed with breast or lung cancer can draw on the similar documentation patterns clinicians use across diseases to extract whether a patient has been diagnosed with colon cancer. This new capability has been instrumental in enabling many research projects that would otherwise lack the necessary resources.

At Flatiron, we hold company-wide Hackathons a few times a year. Hackathons are a time when employees have the opportunity to take a step back and work on any project that they like, from knocking out some particularly nasty tech debt to doing a data analysis for a publication or trialing a new external tool. During one Hackathon in 2020, Dave Fajgenbaum, an oncologist, professor, and researcher from the University of Pennsylvania Perelman School of Medicine, shared his story with us. While oncologists are a typical sight at Flatiron, there is nothing typical about Dave’s story. Dave was diagnosed with the rare illness Castleman disease while in medical school. At the time, little was known about Castleman disease, and even less was known about the particularly deadly subtype that Dave had been diagnosed with, idiopathic multicentric Castleman disease (iMCD). Undeterred by the lack of an FDA-approved treatment, Dave took action: he started working tirelessly to better understand his disease and potential cures, ultimately achieving a durable remission and chronicling his story in his autobiography Chasing My Cure.

Since then, Dave has continued his work to better understand Castleman disease and available treatments. Inspired by Dave’s story and recognizing the importance of identifying and collecting data on these rare patients’ medical journeys, a group of us at Flatiron wanted to help. Flatiron is one of the few companies in the world with access to the type of data needed to make a real impact. During the Hackathon, we combined structured disease codes from the patient chart (ICD codes) with human review from our team of internal oncology experts to identify Castleman patients. In just the three-day Hackathon, we were able to identify a cohort of 453 Castleman patients, 100 of whom had iMCD. This was the largest cohort of Castleman patients identified at that time. Working together with Dave, we published an abstract at the American Society of Clinical Oncology (ASCO) conference in 2021, describing the clinical characteristics of these patients.
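To make the Hackathon approach concrete, here is a minimal sketch of a code-based screen followed by a human review step. It is an illustration only, not the code we ran: the record structure, function names, and reviewer decisions are hypothetical, and the use of D47.Z2 (the ICD-10-CM code for Castleman disease) is an assumption, since the exact codes used are not listed in this post.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set

# Assumption: D47.Z2 is the ICD-10-CM code for Castleman disease.
# The post does not say which codes were actually used.
CASTLEMAN_ICD_CODES = {"D47.Z2"}


@dataclass
class PatientRecord:
    """Hypothetical, simplified view of the structured part of a chart."""
    patient_id: str
    icd_codes: Set[str] = field(default_factory=set)


def screen_by_icd(records: List[PatientRecord],
                  target_codes: Set[str] = CASTLEMAN_ICD_CODES) -> List[str]:
    """Return IDs of patients with at least one target code.

    These are only candidates: a code in the chart does not confirm a diagnosis.
    """
    return [r.patient_id for r in records if r.icd_codes & target_codes]


def confirm_cohort(candidate_ids: List[str],
                   reviewer_decisions: Dict[str, bool]) -> List[str]:
    """Keep only the candidates that a human reviewer confirmed."""
    return [pid for pid in candidate_ids if reviewer_decisions.get(pid, False)]


# Toy data for illustration; not real patients.
records = [
    PatientRecord("p1", {"D47.Z2", "C50.911"}),
    PatientRecord("p2", {"C34.90"}),
    PatientRecord("p3", {"D47.Z2"}),
]
candidates = screen_by_icd(records)
decisions = {"p1": True, "p3": False}  # hypothetical expert review outcomes
cohort = confirm_cohort(candidates, decisions)
```

The screen is deliberately broad and cheap, while the review step supplies the precision. The human review is also the part that does not scale, which is the bottleneck ML extraction is meant to remove.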

In the years since, we’ve steadily increased our ML capabilities and suite of extracted data points. In 2023, we developed a model that predicts both the cancer type a patient has and the date of their diagnosis. We also upgraded our ML infrastructure to make it feasible to extract billions of data points from patient charts, allowing us to scale to our full network. By utilizing this model alongside our increased prediction capacity, we were able to identify significantly more Castleman patients than with our previous ICD-focused technique. With that in mind, we reached back out to Dave and formed a small cross-functional team of Research Oncologists, Machine Learning Engineers, and Research Scientists to investigate the potential of using ML to expand the cohort and gather even more insights about Castleman patients. As iMCD is the deadliest form of Castleman disease, we decided to focus our efforts there, building an additional model that identified iMCD patients from the set of Castleman patients our initial diagnosis model identified. Using these models, we looked across our full network of 3.8 million patients for those with Castleman disease, identifying 267 patients with iMCD. To our knowledge, this is the largest iMCD cohort in published literature, and it is almost triple the initial cohort found in 2020!
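The two-step structure described above, a diagnosis model followed by a subtype model that only runs on the first model's positives, can be sketched as a simple cascade. This is a sketch under stated assumptions, not our production pipeline: the chart format, the 0.5 threshold, and the keyword-based scoring functions are all hypothetical stand-ins for trained models.

```python
from typing import Callable, Dict, List, Tuple

Chart = Dict[str, str]            # hypothetical: {"patient_id": ..., "text": ...}
Scorer = Callable[[Chart], float]  # returns a score in [0, 1]


def cascade(charts: List[Chart],
            castleman_scorer: Scorer,
            imcd_scorer: Scorer,
            threshold: float = 0.5) -> Tuple[List[str], List[str]]:
    """Run the subtype scorer only on charts the diagnosis scorer flags."""
    castleman = [c for c in charts if castleman_scorer(c) >= threshold]
    imcd = [c for c in castleman if imcd_scorer(c) >= threshold]
    return ([c["patient_id"] for c in castleman],
            [c["patient_id"] for c in imcd])


# Stand-in scorers: keyword matching is used here only so the example runs.
# Real models would be trained classifiers over the full chart.
def toy_castleman_scorer(chart: Chart) -> float:
    return 1.0 if "castleman" in chart["text"].lower() else 0.0


def toy_imcd_scorer(chart: Chart) -> float:
    return 1.0 if "idiopathic multicentric" in chart["text"].lower() else 0.0


# Toy charts for illustration; not real patient data.
charts = [
    {"patient_id": "a", "text": "Dx: idiopathic multicentric Castleman disease"},
    {"patient_id": "b", "text": "Unicentric Castleman disease, resected"},
    {"patient_id": "c", "text": "Stage II breast cancer"},
]
castleman_ids, imcd_ids = cascade(charts, toy_castleman_scorer, toy_imcd_scorer)
```

One reason to structure the work this way is cost: the first model can be applied to every chart in the network, while the more specialized second model only sees the small set of patients the first one surfaces.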

Once the cohort of patients was identified, we could begin to explore clinical characteristics, treatment patterns, and outcomes for these patients. We used another ML model to predict oral drug use to understand which treatments patients were receiving and in what order they received them. Our Research Science team then generated descriptive clinical characteristics that we could use to observe the similarities and differences between the cohort we identified and previous literature. We shared our results with Dave and his team at the University of Pennsylvania and eventually worked with them to create an abstract that was accepted and presented at the American Society of Hematology (ASH) conference in 2023. This study used our ML techniques to conduct the most extensive analysis of patients with iMCD to date. By evaluating the utilization and effectiveness of various treatments in real-world practice, the research provides valuable insights into the disease, how it is managed, and how care for patients with iMCD can be improved.

Almost as exciting as identifying this cohort is that we were able to complete the entire technical process of identifying and describing it in under two weeks, without any additional human review. This is a wonderful example of the speed and scale afforded to us by the use of ML techniques for data extraction. Thanks to ML and Flatiron data, clinicians, researchers, and patients with iMCD will be able to see the characteristics of a large cohort of iMCD patients, including treatment patterns and real-world survival rates. These findings carry the potential to impact future treatment strategies, leading to better care for individuals with this rare condition.

Flatiron’s mission has always been to improve and extend lives by learning from the experience of every person with cancer. Studying rare patient cohorts has been and continues to be difficult because of the challenge of cohort identification, but is critical for the continued improvement of treatments for current and future patients. Through the use of ML, we have the ability to learn from every person with cancer at the scale of the entire Flatiron network—not just for a few patients with the most common cancer types, but for all patients and for any cancer.

Medicine is becoming increasingly personalized, with therapies changing from disease-level hammers to patient-level needles. This heightens the need for specific rare cohorts, as the question for patients changes from “What do people with my disease look like?” to “What do patients like me look like?” We at Flatiron have been working to unlock answers to those questions for over a decade, and by utilizing ML and our expansive network, we are answering them at a more personal level than ever.
