
A geyser of biomarkers: Extracting data from NGS reports

Published June 2025


James Gippetti, Staff Machine Learning Engineering Manager
George Ho, Senior Machine Learning Engineer

At Flatiron, we pride ourselves on our ability to identify complex challenges and rapidly mobilize our deep bench of talent—including software engineers, data scientists, research oncologists, and clinical experts—to craft innovative solutions. This collaborative, cross-functional approach is part of our secret sauce that makes us pioneers in the healthcare data landscape. It’s why organizations across the healthcare industry (particularly pharma and biotech) turn to us when they need cutting-edge real-world evidence solutions to their most critical problems.

Our mission is to learn from every person with cancer in our network. We’ve made tremendous progress over the years scaling our data curation operation with both LLMs and custom deep learning models: we can pull out a patient’s diagnosis, treatment, endpoints and more. We can tell you a patient’s biomarker status and the dates they were tested… but only one biomarker at a time. What consistently eluded us was NGS extraction—extracting all biomarker information from all of a patient’s genetic testing reports (particularly next-generation sequencing, or NGS reports). As it is, this vast trove of biomarker data is essentially locked away because we don’t have an effective and scalable way to extract and structure it.

Extracting structured data from documents is a common, industry-standard problem—some of us on the team even have backgrounds in finance, where automated reading of financial reports and news articles is a major undertaking. We started digging through example reports to see what we were up against.

The challenge: biomarker data is wily

Here’s a sample report which we pulled off the internet. There’s no protected health information (PHI) on it, but it’s illustrative.

On the first page, you can see four biomarkers. Three of them are positive—you can tell because (a) they’re in bold and (b) it says “Mutated, Pathogenic”—in other words, not only is the gene mutated, it is also mutated in a way that is known to cause cancer.

  • So the model needs to learn a diverse vocabulary that indicates biomarker results: “pathogenic”, “loss”, “deficient”... there are a lot of ways of saying positive without saying “positive”. That’s not so hard for an ML model to do; a toy sketch of the idea is shown below.
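To make that concrete, here's a toy sketch of the vocabulary problem. The term lists and the function below are purely illustrative (hypothetical, not our production logic); our actual models learn these mappings from context rather than from a hand-written lookup.

```python
# Toy normalizer: map free-text result phrases onto a coarse positive/negative/unknown
# status. The term lists are hypothetical examples, not an exhaustive vocabulary.
NEGATIVE_TERMS = {"wild type", "not detected", "negative", "intact", "proficient"}
POSITIVE_TERMS = {"mutated", "pathogenic", "loss", "deficient", "amplified", "detected"}

def normalize_result(raw: str) -> str:
    """Map a raw result phrase to 'positive', 'negative', or 'unknown'."""
    text = raw.lower()
    # Check negative phrases first so that "not detected" doesn't match "detected".
    if any(term in text for term in NEGATIVE_TERMS):
        return "negative"
    if any(term in text for term in POSITIVE_TERMS):
        return "positive"
    return "unknown"

print(normalize_result("Mutated, Pathogenic"))  # -> positive
print(normalize_result("Wild Type"))            # -> negative
```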

There’s also a “Method” column. What’s that?

  • There are lots of different ways to test for biomarkers. The biology of it all is interesting, but a bit beside the point: let’s just say that NGS is the most important method, and it tends to produce codes like “p.R1835*”—those are HGVS codes that notate exactly how the gene has been altered. There are more complications with these codes, but let’s move on.
  • Immunohistochemistry (a.k.a. IHC) is another notable testing method. Instead of a code, it produces a percentage score, which you can see in the “Result” column and which ideally we would capture. Ultimately these percentages are thresholded to give an ordinal score that is clinically actionable—e.g. above 1% and you’re “positive” and your clinician can recommend a different therapy. But these thresholds depend on the disease (and can change over time!). A small sketch of both ideas follows this list.
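To give a flavor of those two kinds of raw results, here is a hypothetical sketch: a loose pattern for HGVS-style protein changes like “p.R1835*”, and a disease-dependent IHC threshold lookup. The regex is deliberately simplistic (real HGVS nomenclature is much richer) and the cutoffs are made-up placeholders, not clinical guidance.

```python
import re

# Illustrative only: a loose pattern for HGVS-style protein changes like "p.R1835*".
HGVS_PROTEIN_RE = re.compile(r"p\.[A-Z][a-z]{0,2}\d+(?:[A-Z][a-z]{0,2}|\*|fs|del|dup)")

# Hypothetical cutoffs keyed by (biomarker, disease). Real thresholds depend on the
# assay, the disease, and the guideline version, and they can change over time.
IHC_CUTOFFS_PERCENT = {
    ("PD-L1", "disease_a"): 1.0,
    ("PD-L1", "disease_b"): 50.0,
}

def classify_ihc(biomarker: str, disease: str, percent_staining: float) -> str:
    """Threshold an IHC percentage into a coarse, clinically actionable call."""
    cutoff = IHC_CUTOFFS_PERCENT.get((biomarker, disease))
    if cutoff is None:
        return "unknown"
    return "positive" if percent_staining >= cutoff else "negative"

print(HGVS_PROTEIN_RE.search("BRCA1 p.R1835* Mutated, Pathogenic").group())  # p.R1835*
print(classify_ihc("PD-L1", "disease_a", 2.0))  # positive under the 1% cutoff above
```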

Already we’re seeing the complexity here: the kind of data we want to extract depends on the data itself! We want percentage scores for IHC results, HGVS codes for NGS results, a high-level clinically-actionable result (positive, negative, etc.) for both, and that’s not even getting started with other kinds of testing methods!

Wait, there are actually two results for ERBB2! While they’re both negative, you could easily imagine it being positive via CISH but negative via IHC, and we want to be able to capture both of those. So we can’t just produce one prediction for each biomarker… the model needs to be able to produce any number of predictions. Yikes.
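Putting those requirements together, the extraction target starts to look something like the sketch below: each gene on a report can carry any number of results, and the method-specific fields (an HGVS code for NGS, a percent staining for IHC) only apply to some testing methods. This is a hypothetical illustration of the shape of the problem, not our actual production schema.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class BiomarkerResult:
    method: str                               # e.g. "NGS", "IHC", "CISH"
    call: str                                 # normalized: "positive", "negative", ...
    hgvs_code: Optional[str] = None           # e.g. "p.R1835*" (NGS results only)
    percent_staining: Optional[float] = None  # e.g. 2.0 (IHC results only)

@dataclass
class BiomarkerFinding:
    gene: str                                 # e.g. "ERBB2"
    results: list[BiomarkerResult] = field(default_factory=list)

# The ERBB2 case above: tested twice on the same report, by IHC and by CISH.
erbb2 = BiomarkerFinding(
    gene="ERBB2",
    results=[
        BiomarkerResult(method="IHC", call="negative"),
        BiomarkerResult(method="CISH", call="negative"),
    ],
)
```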

Anyway, let’s move on to the next page.

At the top we see Microsatellite Instability (MSI) and Tumor Mutational Burden (TMB), which are special but important biomarkers—they’re not genes, they’re an aggregate measure of how mutated a tumor’s DNA is.

  • We still want to capture both of these results, and for TMB the result is a circled number in a sliding-scale cartoon. If we were to naively pull out the text from this document, it might be tricky to get at the “11”; a toy sketch of one workaround is below.
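When the value does survive text extraction, something as simple as a units-anchored pattern can recover it; when it only exists as pixels in the cartoon, you need a layout- or vision-aware approach instead. Here is a hypothetical sketch of the easy case, with a made-up jumble of OCR tokens as input.

```python
import re

# Illustrative only: recover a TMB value from a jumble of OCR tokens by anchoring on
# the units ("mutations per megabase"). The input string below is made up.
TMB_RE = re.compile(
    r"(\d+(?:\.\d+)?)\s*(?:muts?|mutations)\s*(?:per|/)\s*(?:mb|megabase)",
    re.IGNORECASE,
)

ocr_jumble = "Tumor Mutational Burden Low Intermediate High 11 Mutations/Mb"
match = TMB_RE.search(ocr_jumble)
if match:
    print(float(match.group(1)))  # 11.0
```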

Looking at the table below, what’s up with these variants of “unknown significance”? Surely a patient is either positive or negative for a particular genetic variant, right?

  • Well it’s not that simple! (“not that simple” seems to be a suspiciously recurring theme here)
  • A gene can be altered, but if that alteration is not known to be pathogenic (i.e. cause cancer), then it’s not really helpful for a clinician to know that it’s been altered.
  • We still want to capture these though! After all, today’s “unknown significance” might be tomorrow’s “pathogenic”—BRCA wasn’t known to be pathogenic until as recently as 2000. (A small sketch of how these categories fit together follows this list.)
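For what it's worth, the categories themselves are reasonably standardized; a five-tier scheme along the lines sketched below (roughly the common ACMG-style classification) is one way to bucket them. The code is a toy illustration, not how our pipeline represents them.

```python
from enum import Enum

# Roughly the common five-tier variant classification scheme (toy illustration only).
class Significance(Enum):
    PATHOGENIC = "pathogenic"
    LIKELY_PATHOGENIC = "likely pathogenic"
    UNCERTAIN = "variant of unknown significance"
    LIKELY_BENIGN = "likely benign"
    BENIGN = "benign"

def is_actionable_today(sig: Significance) -> bool:
    """Only (likely) pathogenic calls tend to drive treatment decisions today, but we
    still want to store everything, since classifications get revised over time."""
    return sig in {Significance.PATHOGENIC, Significance.LIKELY_PATHOGENIC}
```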

And you can see different HGVS codes in the “Protein Alteration” and “DNA Alteration” columns. We want those, but which one to extract? Let’s leave that can of worms firmly shut for now.

Oh by the way, fun fact: this section wasn’t always used. Back in 2013, the company that produces these biomarker reports, Caris Life Sciences, changed up the PDF structure—we need to be careful not to overfit to the more recent PDFs.

We’re starting to understand why we’ve historically employed humans to abstract all of this data!

If you scroll down a bit more you’ll see, like, a billion genes! This patient was tested for all of them, and all of them are coming up negative. Again, it’s important to capture this: knowing a patient tested negative is more helpful than thinking they were not tested at all.

It isn’t hard for an ML model to extract these, but it is hard for humans—I mean think about it, it’s pretty horrific to ask a human to sit down and type all of these out.

  • So, how are we going to get ground-truth labels to even validate a machine learning approach, let alone train one? We might have to make do with a few-shot approach, but even then we would need validation labels.

At this point, we feel like we’re basically asking for magic—we want to “just extract everything” (which is not, as you might imagine, a helpful technical problem statement).

We tried many things: a chronicle

As you can imagine, it took us a long time to realize the full complexity of the problem and we’ve run headlong into many dead ends along the way—you only know the fastest route from the start once you’ve reached the end, and our current solution has been years in the making. Below is a brief history of our team’s work.

2018 — We first realized that using ML to extract biomarker data from NGS reports was a good idea. We tried to model it, but quickly found that it was a much more challenging project than we had time for, and pivoted away from an ML approach in favor of manually abstracting the data.

2020 — We hosted a summer intern to take a more earnest crack at the problem. By this time, we had a few ML models under our belt at Flatiron, and our intern leveraged existing tools and expertise. The models performed well, but we had to drastically simplify the problem to achieve this: we still couldn’t extract everything that we wanted to.

2021 — At this point, we had projects that would’ve benefited hugely from NGS extraction, but we felt that we couldn’t indulge such a speculative project in the face of our customers’ need for research-quality data. We delivered data using traditional NLP approaches and achieved strong performance for a small number of important biomarkers, but a biomarker-by-biomarker approach simply wasn’t scalable.

2022 — We hosted another summer intern to try an alternative approach. Instead of using ML, we used industry-standard optical character recognition (OCR) services to extract text and tables from NGS reports, and hard-coded rules to parse and structure their output. This approach worked well and significantly increased the data we could extract, but the rules were brittle and specific to each vendor’s report structure. The approach also struggled with poorly-scanned documents.

2023 — This was the year things finally started to align! Emboldened by recent ML successes, we fine-tuned several open-source NLP models and succeeded in extracting a much richer dataset from NGS reports. However, it took us almost half a year to develop this capability, and productionization of these models still proved to be a huge challenge.

2024 — We piloted a proof-of-concept with a third-party ML vendor, Extend. Extend provides an LLM-powered platform to transform complex documents into high-quality data, enabling us to automate biomarker extraction with high accuracy and reliability. The proof-of-concept was a huge success: we were able to replicate 6 months of work (that took place in 2023) in around 2 weeks(!) and with comparable performance.

What’s next?

We’ve signed on with Extend and are scaling this work to unlock the value of NGS reports for all 5 million people with cancer in our network. We believe Extend is an excellent partner—their user-friendly platform has enabled us to achieve state-of-the-art accuracy, and rapidly bring models into production. This partnership helps us achieve our mission of learning from the experience of every person with cancer, truly transforming our work against this disease.

As we scale up our solution with Extend, we’ll be able to access testing results for hundreds of genes for patients all across our network. There might be groups of patients with rare genetic alterations in our dataset who are responding better to specific types of treatment. How incredible it is to think that we’ll be able to identify these patients and learn from them to inform improvements to personalized medicine!

There’s still work to do to link this geyser of biomarker data into our products, but we can’t wait to see what research questions can get answered once it does.
