In 2023, we ran a stratified performance analysis on our chest X-ray model that we hadn't originally planned to publish on its own. We intended to fold it into our FDA submission documentation, use it internally to guide training improvements, and move on. The results were uncomfortable enough that we eventually decided they deserved a standalone discussion, because the problems we found are likely present in other models whose developers haven't run this analysis, and because the field needs more transparency about what demographic bias in medical imaging AI actually looks like in practice.

What we found: our chest X-ray model's sensitivity for pneumonia detection was 97.1% in white patients and 91.4% in Black patients, measured on a balanced case-control subset of our validation data. That 5.7-point gap was statistically significant (95% CI: 3.1-8.3 points), consistent across three independent validation cohorts, and present in both the male and female subgroups of the Black patient population.

We didn't expect it. We should have looked for it sooner.

Where Bias in Medical Imaging AI Comes From

The sources of demographic performance disparities in medical imaging AI are better understood now than they were five years ago, largely because a handful of teams have been willing to investigate and publish their findings rather than suppress them. The main mechanisms fall into three categories.

Training data distribution is the most obvious. If a model is trained on chest X-rays predominantly from academic medical centers in the northeastern United States, it will reflect the patient demographics of those centers — which tend to be less diverse than the population of patients the model will eventually encounter in community hospitals, safety-net institutions, and rural health systems. A model that has processed 200,000 examples of pneumonia in patients of European ancestry and 12,000 in patients of African ancestry will, absent other corrections, learn features that are better calibrated to the former group. This isn't a flaw in the architecture. It's a direct consequence of differential representation in the training data.
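One standard correction, alluded to above, is to reweight examples inversely to their subgroup's share of the training data. A toy sketch with made-up counts, not our data pipeline:

```python
from collections import Counter

# Made-up counts mirroring the imbalance described above.
train_groups = ["european"] * 200_000 + ["african"] * 12_000
counts = Counter(train_groups)
total = sum(counts.values())

# Weight each example inversely to its group's share of the data, so
# every subgroup contributes equally to the expected training loss.
weights = {g: total / (len(counts) * n) for g, n in counts.items()}
print(weights)  # {'european': 0.53, 'african': 8.83} (approximately)
```

Reweighting alone rarely closes a performance gap, since 12,000 upweighted examples still cover less of the presentation variety than 200,000, but it is usually the first knob to turn.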

Imaging acquisition differences compound this. Tissue density and body composition vary across populations in ways that subtly shift how contrast and density artifacts manifest on X-ray, and they also shift the normal reference distributions for organ sizing and density. Equipment calibration practices differ between institutions, and if some equipment types are overrepresented in training data, the model may generalize less reliably to the equipment more common in underserved settings, which often run older acquisition technology.

Ground truth labeling is a third source that receives less attention. If the radiologist annotations used to train a model reflect the same diagnostic biases documented in human clinicians — and there is substantial evidence that they do — then training on those labels propagates existing inequities rather than correcting them. A model that learns from radiologist labels will learn whatever biases those radiologists carried, along with their clinical expertise.

What We Did to Address the 5.7-Point Gap

Our intervention had three prongs. First, we sourced additional training data from institutions serving majority-Black patient populations: four safety-net hospitals in the southeastern United States whose patient demographics were underrepresented in our original training corpus. We added approximately 280,000 studies from these sites, with annotations performed by radiologists recruited from those same institutions to address the labeling-bias concern.

Second, we implemented a fairness-aware training objective that penalized demographic performance disparities during optimization. The loss function included a demographic parity term that weighted gradient updates to shrink the gap between the highest- and lowest-performing demographic subgroups on a held-out stratified validation set. This is a well-established technique in the algorithmic fairness literature, though its application to medical imaging AI has lagged behind other domains.
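For readers who want a concrete picture, here is a minimal sketch of this kind of objective, assuming a PyTorch binary classifier. It is not our production loss; the penalty weight `lambda_fair` and the group-gap formulation are illustrative stand-ins for the demographic parity term described above.

```python
import torch
import torch.nn.functional as F

def fairness_penalized_loss(logits, labels, group_ids, lambda_fair=0.5):
    """BCE loss plus a penalty on the spread of per-group mean losses.

    Illustrative sketch only: `group_ids` is an integer tensor encoding
    each example's demographic subgroup, and `labels` is a float tensor.
    """
    # Standard per-example binary cross-entropy, kept unreduced so we
    # can aggregate it per subgroup below.
    per_example = F.binary_cross_entropy_with_logits(
        logits, labels, reduction="none"
    )
    base_loss = per_example.mean()

    # Mean loss within each demographic subgroup present in the batch.
    group_losses = torch.stack([
        per_example[group_ids == g].mean()
        for g in torch.unique(group_ids)
    ])

    # Penalize the gap between the worst- and best-performing groups,
    # steering gradient updates toward closing that gap.
    disparity = group_losses.max() - group_losses.min()
    return base_loss + lambda_fair * disparity
```

In practice a term like this competes with overall accuracy, so the penalty weight has to be tuned against a stratified validation set rather than fixed up front.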

Third, we conducted a separate analysis of imaging acquisition differences across the demographic groups in our data. Studies from the institutions where Black patient performance was lower had been acquired on a higher proportion of older digital radiography (DR) systems than the newer systems that predominate at the academic centers. We added targeted data augmentation to simulate older acquisition characteristics, making the model's feature extraction more robust to acquisition variation.
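The sketch below shows the flavor of that augmentation, assuming images normalized to [0, 1]. The specific degradations and their magnitudes (contrast compression, noise level, quantization depth) are assumptions for illustration, not our validated parameters.

```python
import numpy as np

def simulate_older_acquisition(image, rng=None):
    """Degrade a normalized [0, 1] X-ray to resemble older detectors."""
    rng = rng if rng is not None else np.random.default_rng()
    # Compress dynamic range to imitate reduced detector contrast.
    degraded = image * 0.9 + 0.05
    # Add mild Gaussian noise to imitate a higher detector noise floor.
    degraded = degraded + rng.normal(0.0, 0.02, size=image.shape)
    # Quantize to 8-bit gray levels, as older pipelines often did.
    degraded = np.round(degraded * 255.0) / 255.0
    return np.clip(degraded, 0.0, 1.0)
```

Applied stochastically during training, transforms like this discourage the model from keying on acquisition artifacts that correlate with institution, and therefore with patient demographics.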

After these interventions, the pneumonia sensitivity gap between white and Black patients narrowed to 1.2 points (96.8% vs. 95.6%), a difference consistent with statistical equivalence at our sample sizes. We re-ran the analysis across every other demographic dimension we had sufficient power to test: sex, age decile, BMI quartile, and insurance status as a proxy for socioeconomic factors. The remaining gaps are small, and we continue to monitor them in our post-market surveillance program.
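For teams who want to run the same kind of check, a minimal version of the subgroup analysis looks like the sketch below: per-group sensitivity with bootstrap confidence intervals. The column names (`label`, `prediction`) and data layout are hypothetical, and a real program also needs pre-specified subgroup definitions and minimum per-group case counts.

```python
import numpy as np
import pandas as pd

def subgroup_sensitivity(df, group_col, n_boot=2000, seed=0):
    """Sensitivity per subgroup with bootstrap 95% CIs.

    Expects one row per study, a binary ground-truth `label` column,
    and a binary model `prediction` column (both hypothetical names).
    """
    rng = np.random.default_rng(seed)
    rows = []
    # Sensitivity is recall on positives, so restrict to true cases.
    for group, sub in df[df["label"] == 1].groupby(group_col):
        hits = (sub["prediction"] == 1).to_numpy()
        # Bootstrap over positive cases within the subgroup.
        boots = [
            rng.choice(hits, size=hits.size, replace=True).mean()
            for _ in range(n_boot)
        ]
        ci_low, ci_high = np.percentile(boots, [2.5, 97.5])
        rows.append({group_col: group, "sensitivity": hits.mean(),
                     "ci_low": ci_low, "ci_high": ci_high,
                     "n_positives": hits.size})
    return pd.DataFrame(rows)
```

The stratification itself is the easy part; the hard part is having enough positive cases per subgroup for the intervals to be narrow enough to act on.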

What We Haven't Fixed

Complete demographic parity across all subgroups and all pathologies isn't a goal we'll achieve with a single round of training improvements. The data limitations are real. Some demographic groups simply don't have large enough representation in any available medical imaging repository to support the same depth of statistical validation as majority populations. Rare disease presentations in specific populations may have training sample sizes too small to characterize model performance with statistical confidence.

We publish the subgroup performance data in our external validation reports. The gaps that remain are documented. Healthcare institutions deploying our technology deserve to know exactly where performance is weaker, so they can make informed decisions about the clinical contexts where AI assistance is most and least reliable for their specific patient populations.

The broader point is that bias analysis in medical imaging AI should not be optional. It should be a standard part of every validation program. The gap we found wasn't there because we were building an unusually flawed product; it surfaced because we looked for it. Many products with equivalent technical specifications have never been through this analysis at all. That's a problem for the field, not just for individual vendors.