
Responsible AI in action: Flatiron's approach to AI fairness and bias assessment

Published: July 2025

By: Cleo A. Ryals, PhD, Head of Health Equity

Do the Right Thing

At Flatiron, our mission is to improve patient lives by learning from the experience of every person with cancer. As a health services researcher who has spent nearly two decades leveraging data to better understand and address cancer inequities, I understand the importance of generating real-world evidence (RWE) that accurately reflects the diversity of people living with cancer. Moreover, as a scientist grounded in community-engaged research methods, I recognize the value of engaging community voices in the work of advancing health equity and achieving optimal care for every person with cancer. At the heart of my approach is a deep-seated belief in integrity, which to me means doing the right thing when nobody else is looking, even when it feels inconvenient or uncomfortable.

“Do the Right Thing” is a core value that we embrace daily in our work at Flatiron, and it’s one that has gained even greater importance as artificial intelligence (AI) becomes more prominent in our day-to-day work. While AI has introduced new opportunities to accelerate and scale our work, it also comes with its own set of risks and challenges that must be addressed along the continuum of AI development and deployment. This is especially true in the context of real-world data (RWD) curation and RWE generation, where machine learning (ML) and large language models (LLMs) increasingly play a role alongside human-based approaches to data extraction and interpretation.

At Flatiron, we’ve invested heavily in rigorously evaluating the performance of our AI models used for scaled extraction of clinical details such as biomarkers, tumor response and progression. This evaluation process includes careful assessment and mitigation of potential biases in how these models perform across different patient subgroups. But the truth is that measuring bias isn’t as straightforward as one might think. It’s an ever-evolving process that necessitates input from diverse stakeholders, including those with lived experiences of bias and marginalization.

What is a bias assessment and why is it important?

The National Institute of Standards and Technology (NIST) framework for responsible and trustworthy AI uses the term fairness to refer to the goal of ensuring that AI systems operate without causing harmful bias or discrimination, particularly toward historically marginalized or disenfranchised groups. According to NIST, managing the risk of bias involves actively identifying, mitigating, and monitoring biases that may arise in AI systems during their design, deployment, and operation. As such, bias assessments are critical to determining whether AI systems perpetuate or exacerbate existing societal inequities. For example, in 2019, Obermeyer et al. demonstrated that a widely used AI algorithm for predicting healthcare needs among health plan beneficiaries from claims data disproportionately underestimated the health needs of Black patients, leading to those patients receiving less of the care they needed.

In the context of AI-enabled extraction of clinical details from electronic health records, bias assessments are critical to ensuring that historical and systemic biases, often reflected in various types of health data, are not perpetuated. Moreover, AI algorithms, such as the one assessed in the Obermeyer study, can harbor inherent biases that impact model performance. Without proper bias assessment, these inequities can be embedded and even amplified in RWE generation, leading to bias in its application (e.g., target discovery, trial design, post-marketing studies). These biases can have serious downstream effects by reinforcing exclusionary research practices, skewing regulatory or payer decisions, and ultimately compromising the quality, equity, and effectiveness of care delivered to patients.


What is Flatiron’s approach to ML/LLM bias assessment?

Bias in ML/LLM-derived real-world data isn’t always obvious. It can stem from the underlying data, the model itself, or how that model performs across specific subgroups. So, it's not enough to just ask, “Does the model work?” At Flatiron, we ask, “Who does it work for, where might it fail, and what are the implications if it does fail?”

Take the example of extracting HER2 status in breast cancer. A model might show strong overall performance, but aggregate metrics can mask issues within subgroups. Stratifying by age, for instance, we might find that HER2+ status is under-identified among older patients, perhaps due to differences in documentation patterns or gaps in the training data. This kind of hidden bias wouldn’t show up in overall cohort-level metrics but could meaningfully impact clinical and research insights.
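
To make this concrete, here is a minimal sketch of that kind of stratified check, written in Python with pandas and scikit-learn. The column names (her2_model, her2_abstracted, age_group) are hypothetical placeholders rather than Flatiron’s actual schema, and the code illustrates the general technique, not our production pipeline.

```python
# Sketch of a subgroup performance check: compare model-extracted HER2 status
# against human-abstracted labels, overall and stratified by age group.
# Column names are hypothetical placeholders.
import pandas as pd
from sklearn.metrics import precision_score, recall_score


def her2_metrics(df: pd.DataFrame) -> pd.Series:
    """Precision and recall for the HER2+ label within one slice of the data."""
    y_true = df["her2_abstracted"] == "HER2+"
    y_pred = df["her2_model"] == "HER2+"
    return pd.Series({
        "n": len(df),
        "precision": precision_score(y_true, y_pred, zero_division=0),
        "recall": recall_score(y_true, y_pred, zero_division=0),
    })


def subgroup_performance_check(df: pd.DataFrame, by: str = "age_group") -> pd.DataFrame:
    """Overall metrics plus the same metrics computed within each subgroup."""
    overall = her2_metrics(df).rename("overall")
    per_group = df.groupby(by).apply(her2_metrics)
    return pd.concat([overall.to_frame().T, per_group])
```

In a sketch like this, a noticeably lower recall for older age groups relative to the overall cohort would surface exactly the kind of under-identification described above.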

While we can't always eliminate bias in the underlying data, we design our models to be accurate and transparent enough to detect and measure it. That way, instead of concealing inequities, our models help uncover and quantify them, supporting broader efforts to identify and address inequities in healthcare.

Moreover, as responsible users of AI, Flatiron’s goal is to ensure that our ML/LLM-derived variables support both accurate and fair insights. Since launching our first ML models in 2017 to support more efficient cohort selection, our approach to bias assessment has evolved significantly. Initially, our focus was on comparing baseline characteristics and overall survival among cohorts identified via ML vs. human abstraction. Today, with expanded tools and capabilities, we've developed a more comprehensive bias assessment framework that leverages three complementary approaches for evaluating ML and LLM performance:

  1. Subgroup Performance Checks

    Bias assessment in this context checks whether the model performs equally well across different demographic groups. For example, we could examine HER2 precision and recall stratified by age or race, as in the sketch above.

  2. Verification Checks

    Bias assessment in this context looks at patterns in the data separately by demographic subgroup. For example, we might examine, within each age group, whether patients on HER2-targeted therapy are correctly labeled as HER2+ (see the sketch after this list).

  3. Replication Analyses

    Bias assessment in this context tests whether an analysis leveraging ML/LLM-derived data yields the same results as an identical analysis leveraging human-abstracted data, both overall and stratified by subgroups of interest.
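
To illustrate the second and third approaches, here is a minimal sketch in Python using pandas and lifelines. The column names (on_her2_therapy, her2_model, her2_abstracted, age_group, os_months, death) are hypothetical placeholders, and the endpoint (Kaplan-Meier median overall survival among HER2+ patients) is just one example of an analysis that could be replicated; this illustrates the general idea rather than Flatiron’s actual framework.

```python
# Sketch of a verification check and a replication analysis.
# Column names and the survival endpoint are hypothetical placeholders.
import pandas as pd
from lifelines import KaplanMeierFitter


def verification_check(df: pd.DataFrame, by: str = "age_group") -> pd.Series:
    """Among patients on HER2-targeted therapy, the share the model labels HER2+,
    computed within each subgroup. Large gaps across subgroups can flag
    differential misclassification."""
    on_therapy = df[df["on_her2_therapy"]]
    return on_therapy.groupby(by)["her2_model"].apply(lambda s: (s == "HER2+").mean())


def median_os(cohort: pd.DataFrame) -> float:
    """Kaplan-Meier median overall survival for one cohort."""
    kmf = KaplanMeierFitter()
    kmf.fit(durations=cohort["os_months"], event_observed=cohort["death"])
    return kmf.median_survival_time_


def replication_check(df: pd.DataFrame, by: str = "age_group") -> pd.DataFrame:
    """Run the same analysis (median OS among HER2+ patients) twice, once using
    the ML/LLM-derived label and once using the human-abstracted label,
    within each subgroup."""
    rows = []
    for group, sub in df.groupby(by):
        rows.append({
            by: group,
            "median_os_ml": median_os(sub[sub["her2_model"] == "HER2+"]),
            "median_os_abstracted": median_os(sub[sub["her2_abstracted"] == "HER2+"]),
        })
    return pd.DataFrame(rows)
```

If the verification rates or the two median-OS columns diverge for a particular subgroup, that divergence points to where an ML/LLM-derived variable may be shifting results for that subgroup specifically.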

While this framework reflects our current approach, we recognize that the AI landscape is evolving at an exponential rate. As such, we remain agile in our bias assessment methods, adapting them to keep pace with that change and to ensure fairness in how we develop and deploy ML/LLMs.

Alone we go faster, together we go farther

While AI is a highly technical topic, effectively addressing issues of bias and fairness requires engaging a wide range of stakeholders in the bias assessment process. It’s not enough to rely solely on technical experts to identify and mitigate bias; we must ensure that our work is also grounded in the lived experiences and expertise of patients, survivors, clinicians, and health equity researchers.

This is a key reason why, when I joined Flatiron three years ago as the Head of Health Equity, one of the first things I did was establish a Health Equity Scientific Advisory Board (HE SAB) composed of diverse stakeholders representing patients, survivors, academia, community oncology, and grassroots organizations. Our HE SAB has played a critical role in complementing the expertise of Flatiron’s Health Equity team by guiding and informing our health equity research efforts, including our bias assessment strategy. Last year, we held multiple meetings with our HE SAB, where we transparently explained our approach to deploying and evaluating ML models at Flatiron and solicited feedback on how to improve our bias assessment approach. We received several suggestions, including:

  • Examining bias in the underlying human abstracted data (on which ML/LLM-derived data are trained and evaluated)
  • Integrating health equity use cases into Flatiron’s existing replication analysis-based performance assessment to ensure equity measurement is not obscured by ML/LLMs
  • Broadening our bias assessment approach to include additional factors such as geography and social determinants of health

We’ve already begun integrating these suggestions into our bias assessment strategy at Flatiron, and plan to continuously engage our HE SAB around this topic. Our experience has shown us that by adopting an inclusive, community-engaged approach to bias assessment, we can foster the development and deployment of AI that is not only scientifically sound but also socially and ethically responsible.

Shaping the future of responsible AI use in RWE

As the use of AI in RWE generation continues to evolve at a rapid pace, it brings both exciting opportunities and new responsibilities. With each advancement, we must proactively consider how new forms of bias may emerge, and commit to identifying and addressing them early and often. Successfully designing and executing bias assessments for AI in RWE requires broad, interdisciplinary collaboration that brings together voices from data science, health equity, clinical care, and lived experience to ensure AI models are not only technically sound, but socially responsible as well.

At Flatiron, we recognize that responsible use of AI in RWE is not just a technical challenge; it’s a moral and business imperative that sits at the heart of “Doing the Right Thing” for all people with cancer, no matter who they are or where they live. I am proud to be part of an organization that not only recognizes this imperative, but also backs it up with action. By engaging a diverse range of experts across the evidence and healthcare ecosystem and staying vigilant against the unintended harms that can arise, Flatiron is shaping an AI-powered future that is not only innovative, but also fair, transparent, and truly inclusive. This ultimately moves us closer to realizing our mission: to improve and extend lives by learning from the experience of every person with cancer.

To learn more about Flatiron’s approach to AI fairness and bias assessment, please reach out.
