
From entry to insights: How AI is transforming the clinical data journey

Published: March 2025

By: Aaron Cohen, MD, MSCE, Senior Medical Director; Head of Research Oncology, Clinical Data


As a medical oncologist working at both Bellevue Hospital and Flatiron Health, I see firsthand how the details I document in patient charts—like the date of a cancer diagnosis or the results of genomic testing—are more than just notes for immediate care or billing. These details hold the potential to shape research, guide treatment decisions, and help connect patients with clinical trials that could change their lives.

As advancements in artificial intelligence (AI) continue at a breakneck pace, one of the things I'm most excited about (apart from AI writing my patient notes!) is the ability of large language models (LLMs) to make sense of the complex clinical information buried in medical charts and pull out the critical details that can help enable these promising applications. For example, a patient with stage 4 lung cancer might be a candidate for a clinical trial based on a specific gene mutation, but if that result is tucked away in a scanned PDF or long visit note, it may not be surfaced in time for their oncologist to act. In a fast-moving field like oncology, where trial slots are limited and treatment windows are narrow, a delay in finding this information can mean the difference between accessing a cutting-edge therapy or missing the opportunity altogether.

At Flatiron, we’ve pioneered the use of LLMs to extract clinically meaningful information from the EHR, making de-identified patient data more accessible and actionable. The ultimate goal? To ensure every person’s experience can help inform and improve outcomes for people with cancer.

The enthusiasm around using LLMs for EHR data curation is well deserved—these models are transforming how we identify and utilize clinical insights at scale. AI and machine learning (ML) have the power to process vast amounts of patient data rapidly and extract meaningful insights from hundreds of pages of documents. As a result, we can more nimbly understand evolving standards of care and the efficacy of new treatments in real time. 

But what often goes unnoticed is the meticulous, behind-the-scenes work required to ensure LLMs produce critical data points that are not only generated at scale but also accurate, complete, and useful.

Let’s take a closer look at how a single piece of clinical data travels from a clinician’s keyboard to become key evidence advancing research and informing patient care.

Leveraging LLMs to their fullest potential

Given how powerful LLMs are becoming, it’s tempting to assume they can be deployed as-is without modifications or a deep understanding of their inner workings and limitations. However, the reality is far more complex. At Flatiron, behind every LLM application, there’s a cross-functional team of machine learning engineers, research scientists, and clinicians constantly experimenting and iterating with the latest technologies and models.

Over the past two years of experimentation, we’ve discovered that effectively prompting LLMs requires a strategic approach—a Goldilocks-like sweet spot where the level of input must be just right. Providing too much information and instruction can overwhelm the model, while providing too little fails to yield meaningful results. By using this precise combination of human expertise and the latest technology, our multidisciplinary team pinpoints the right sections of the chart for the LLM to focus on, determines when fine-tuning is beneficial to teach the model new abilities, and effectively breaks down the requested task so the model can reason appropriately to achieve the highest possible accuracy.

For example, LLMs are well suited to support the identification of cancer progression events. These are crucial data points in a patient’s journey, and understanding when they occur can help us better understand treatment efficacy, predict whether a patient will respond to therapy, and more quickly enroll patients in clinical trials. Yet these events are documented inconsistently in the EHR due to variations across clinicians, cancer types, and diagnostic tests. It’s difficult for human abstractors alone to identify these events efficiently in a way that scales across millions of patients.

To solve for this, our engineers and clinical team collaborate to craft precise instructions that guide the LLM in interpreting intricate clinical subtleties. We often prompt the LLM with the framing: "Imagine you are an oncologist," which helps align its reasoning with clinical expertise. We then go further, guiding the model to reason like an oncologist by breaking complex problems into smaller, more manageable steps, an approach called chain-of-thought prompting. These techniques improve model performance and help it distinguish subtle differences in oncological care: "If tumor flare is mentioned, classify it as pseudoprogression rather than traditional progression." Cancer care is complex, and ensuring an LLM comprehends these distinctions requires thoughtful, nuanced guidance from clinical and technical experts.
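
To make this concrete, here is a minimal sketch of what a chain-of-thought prompt for progression events might look like. The wording, the steps, and the build_progression_prompt helper are illustrative assumptions rather than our production prompts; the resulting string would be handed to whatever LLM client is in use.

    # A minimal sketch of chain-of-thought prompting for progression events.
    # The wording and section choices are illustrative, not Flatiron's actual
    # prompts; the finished string goes to whichever LLM client is in use.

    PROGRESSION_PROMPT = """\
    Imagine you are an oncologist reviewing a patient's chart.

    Work through the following steps before answering:
    1. List every imaging or pathology statement that comments on tumor burden.
    2. For each statement, decide whether it describes growth, shrinkage, or stability.
    3. If tumor flare is mentioned, classify it as pseudoprogression rather than
       traditional progression.
    4. Finish with a single line: PROGRESSION: YES or PROGRESSION: NO, plus the
       date of the earliest qualifying event.

    Relevant chart excerpts:
    {chart_excerpts}
    """

    def build_progression_prompt(chart_excerpts: str) -> str:
        """Fill the template with the chart sections selected for the model."""
        return PROGRESSION_PROMPT.format(chart_excerpts=chart_excerpts)

    print(build_progression_prompt(
        "CT chest 2024-03-01: interval increase in right lower lobe mass ..."
    ))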

The importance of validating AI outputs

Extracting data is just the first step in data curation. Next, we need to evaluate the model’s performance to understand the impact of potential errors and continuously optimize our approach.

The key to truly understanding how well the LLM is working is establishing a reliable reference point—what the industry refers to as a gold standard—to compare against. The gold standard can be defined in different ways, but at Flatiron, we hold ourselves to the highest benchmark—striving for 24-karat quality.

Our process is rigorous: two human abstractors, trained experts guided by detailed policies and procedures crafted over the past decade, independently abstract an answer for the same task. If we see disagreements, either between the abstractors or between an abstractor and the LLM, our clinical team adjudicates the differences to reach a consensus.
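
As a rough illustration of that workflow, here is what the dual-abstraction and adjudication logic might look like in a few lines of Python. The Abstraction record and its field names are hypothetical, chosen only to show when a disagreement gets escalated for clinical review.

    # Sketch of the dual-abstraction consensus logic described above.
    # The data structure and field names are assumptions for illustration.
    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class Abstraction:
        patient_id: str
        variable: str          # e.g., "progression_date"
        value: Optional[str]   # the abstracted answer, if any
        source: str            # "abstractor_1", "abstractor_2", or "llm"

    def needs_adjudication(a: Abstraction, b: Abstraction) -> bool:
        """Two answers for the same patient and variable disagree."""
        assert (a.patient_id, a.variable) == (b.patient_id, b.variable)
        return a.value != b.value

    def gold_standard(a: Abstraction, b: Abstraction,
                      adjudicated: Optional[str] = None) -> Optional[str]:
        """Agreement stands as-is; disagreements take the clinically adjudicated value."""
        if not needs_adjudication(a, b):
            return a.value
        if adjudicated is None:
            raise ValueError("Disagreement requires clinical adjudication before use.")
        return adjudicated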

Measuring LLM performance against this extremely high bar ensures we are able to push for the best possible accuracy. We’re also able to evaluate our trained human abstractors using this same rigorous standard, which allows us to contextualize the LLM’s performance relative to human expertise. The results have been striking, with LLMs in some cases exceeding human performance! Given Flatiron’s high standards for human abstraction quality, this is an exciting milestone.
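
In its simplest form, that comparison is just agreement against the adjudicated gold standard, computed the same way for the LLM and for a human abstractor. The values below are invented purely to show the mechanics; they are not our measured results.

    # Score the LLM and a human abstractor against the same gold standard.
    # All values here are made up for illustration only.
    def accuracy(answers: dict, gold: dict) -> float:
        """Share of cases where a source's answer matches the gold-standard value."""
        hits = sum(1 for case_id, truth in gold.items() if answers.get(case_id) == truth)
        return hits / len(gold)

    gold = {"p1": "2024-03-01", "p2": "2023-11-15", "p3": "no progression"}
    llm_answers = {"p1": "2024-03-01", "p2": "2023-11-15", "p3": "no progression"}
    human_answers = {"p1": "2024-03-01", "p2": "2023-12-01", "p3": "no progression"}

    print(f"LLM vs. gold standard:   {accuracy(llm_answers, gold):.2f}")    # 1.00
    print(f"Human vs. gold standard: {accuracy(human_answers, gold):.2f}")  # 0.67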

Incorporating human oversight and iterative refinement

LLM performance is just one piece of the puzzle, and what these models extract is not the final product. At Flatiron, we implement multiple layers of validation to ensure the accuracy and reliability of our clinical data, recognizing how important it is to get it right given the critical role of real-world data in research and clinical decision making. Flatiron’s core principle has been to combine AI and ML technology with expert wisdom and use both to their best ability. It takes people to turn the promise of AI and ML into power.

There’s the saying “human in the loop,” and then there’s what we do at Flatiron (which honestly sounds a bit like the beginning of a joke): A doctor, an engineer, and a statistician walk into a loop...

Anyway…here are some of the additional efforts we employ across teams to refine and validate our data: 

  • Clinicians will propose data checks to flag findings we wouldn’t expect to see (such as a patient with a specific mutation but not the associated targeted therapy) for deeper review; a couple of these checks are sketched in code after this list.
  • Our clinical team then works with our engineers to identify errors and go back into the chart, not only to confirm the right answer but also to understand why errors occurred in the first place and to inform any changes our models may need.
  • We leverage our large and representative network to analyze data not just at the patient level but at the population level as well to ensure findings align with clinical benchmarks, literature, and even our own scaled human-abstracted data sets. (“Hmmm, the proportion of patients with stage IV disease is higher than we would expect—let's examine that”).
  • Our Health Equity team investigates how LLMs are performing in historically marginalized subgroups to assess potential bias and ensure fairness in our models.
  • We top it off by having research scientists perform clinically relevant analyses using the data—e.g., How long do patients receiving a given therapy live?—and compare the answer to what we would have gotten if we used only human abstraction. Getting similar results builds confidence and trust.
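
To give a flavor of the first and third checks in that list, here is a toy sketch using pandas. The column names, the mutation-and-therapy pairing, and the benchmark threshold are all assumptions for illustration, not our actual data model or clinical benchmarks.

    # Toy versions of a patient-level flag and a population-level benchmark check.
    # Column names and thresholds are illustrative assumptions.
    import pandas as pd

    cohort = pd.DataFrame({
        "patient_id": ["p1", "p2", "p3", "p4"],
        "egfr_positive": [True, True, False, False],       # targetable mutation found
        "received_egfr_tki": [True, False, False, False],  # matching targeted therapy given
        "stage": ["IV", "III", "IV", "IV"],
    })

    # Patient-level check: a targetable mutation with no matching targeted therapy
    # is not necessarily an error, but it is unexpected enough to route for review.
    flagged = cohort[cohort["egfr_positive"] & ~cohort["received_egfr_tki"]]
    print("Charts flagged for deeper review:", flagged["patient_id"].tolist())

    # Population-level check: compare the stage IV proportion against an expected
    # benchmark (the 0.40 threshold here is purely illustrative).
    stage_iv_share = (cohort["stage"] == "IV").mean()
    if stage_iv_share > 0.40:
        print(f"Stage IV share is {stage_iv_share:.0%}, above the benchmark; investigate.")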

It’s this rigorous, multi-step approach that ensures only the highest-quality, most reliable data makes it into a finalized dataset.
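
As one last example, the final check in the list above (comparing an analysis run on LLM-curated data with the same analysis run on human-abstracted data) might look something like the toy sketch below. It assumes the open-source lifelines library and uses made-up numbers; similar estimates from the two cohorts are the kind of result that builds confidence.

    # Toy comparison of an analysis run on LLM-curated vs. human-abstracted data.
    # Durations are months from treatment start; event = 1 for death, 0 for censoring.
    # The lifelines library and all values here are illustrative assumptions.
    import pandas as pd
    from lifelines import KaplanMeierFitter

    llm_curated = pd.DataFrame({
        "duration": [6, 14, 22, 9, 30, 18],
        "event":    [1,  1,  0, 1,  0,  1],
    })
    human_abstracted = pd.DataFrame({
        "duration": [7, 13, 22, 9, 28, 18],
        "event":    [1,  1,  0, 1,  0,  1],
    })

    def median_os(cohort: pd.DataFrame, label: str) -> float:
        """Kaplan-Meier median overall survival for one curation method."""
        kmf = KaplanMeierFitter()
        kmf.fit(cohort["duration"], event_observed=cohort["event"], label=label)
        return kmf.median_survival_time_

    print("Median OS, LLM-curated data (months):     ", median_os(llm_curated, "LLM"))
    print("Median OS, human-abstracted data (months):", median_os(human_abstracted, "human"))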

The bigger picture: Harnessing AI with purpose

AI is revolutionizing what we know and understand about patients with cancer, and we believe AI and ML are a critical part of unlocking additional value at scale. Extracting and analyzing clinical data from the EHR enables us to better appreciate each patient’s experience in the real world and learn more about the best ways to support them.

At Flatiron, we are embracing every opportunity to responsibly leverage AI capabilities—digging deeper to unlock more granular clinical details, scaling larger to draw knowledge from each patient’s journey, and continuously refining our approach to stay at the forefront of technological advances. But LLMs alone are not a standalone fix—they are a force multiplier. Their true value comes from how we shape, refine, and integrate them with deep technical and clinical expertise alongside rigorous validation.

This balance of technological innovation and thoughtful oversight is at the core of our mission: to improve and extend lives by learning from the experience of every person with cancer. Because of this human-first approach, I have confidence that the data I enter in the clinic remains true to its original intent by the time it reaches its final destination, supporting patients, clinicians, and researchers in the process. 

 

To learn more about Flatiron Health's Evidence Solutions, reach out to us.
