Across our deployed hospital sites, we've now processed approximately 50,000 studies that triggered an AI alert — a flagged finding above the threshold for critical or significant pathology. Our clinical informatics team tracks the outcome of every one of those flags: whether the radiologist confirmed the finding, modified the assessment, or dismissed the alert as a false positive. That dataset has taught us more about what makes AI radiology tools useful or frustrating in practice than any controlled validation study we've run.
The headline finding: false positive rates in clinical operation are higher than validation studies suggest; they are not uniformly distributed across alert categories; and how you manage them operationally matters as much as how you reduce them in training.
The Gap Between Validation and Deployment False Positive Rates
Across our five primary alert categories — pneumothorax, intracranial hemorrhage, pulmonary embolism, pleural effusion, and pulmonary nodule — our validation false positive rates ranged from 3.1% to 7.4%. In live deployment, the overall false positive rate across the same categories runs at 8.9%. The gap is consistent and worth understanding rather than dismissing.
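For readers who want the arithmetic pinned down, here is a minimal sketch of how a live false positive rate like that 8.9% figure can be computed from logged alert outcomes, assuming the rate is defined as the share of alerts the radiologist dismissed as false positives. The record fields and outcome labels are illustrative stand-ins, not our production schema.

```python
from collections import Counter
from dataclasses import dataclass

# Illustrative alert-outcome record; field names are hypothetical, not our production schema.
@dataclass
class AlertOutcome:
    category: str   # e.g. "pneumothorax", "pulmonary_nodule"
    outcome: str    # "confirmed", "modified", or "dismissed"

def false_positive_rates(outcomes: list[AlertOutcome]) -> dict[str, float]:
    """Fraction of alerts dismissed as false positives, per alert category."""
    totals, dismissed = Counter(), Counter()
    for o in outcomes:
        totals[o.category] += 1
        if o.outcome == "dismissed":
            dismissed[o.category] += 1
    return {cat: dismissed[cat] / totals[cat] for cat in totals}

# Example: the rate is computed over every alert a category generated in live operation.
sample = [
    AlertOutcome("pneumothorax", "confirmed"),
    AlertOutcome("pneumothorax", "dismissed"),
    AlertOutcome("pulmonary_nodule", "dismissed"),
    AlertOutcome("pulmonary_nodule", "confirmed"),
]
print(false_positive_rates(sample))  # {'pneumothorax': 0.5, 'pulmonary_nodule': 0.5}
```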
The first explanation is case mix. Validation datasets, even well-constructed ones, don't perfectly replicate the case mix of a live clinical environment. In our deployed emergency departments, we see a higher proportion of technically degraded images than in our validation cohorts — portable ED X-rays with motion artifact, suboptimal positioning, or incomplete chest coverage. Technically degraded images generate more model uncertainty, which translates to more borderline predictions, which produce more false positives at a fixed threshold.
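The mechanism is easy to see in a toy simulation. The distributions and numbers below are invented for illustration, not drawn from our model: hold the alert threshold fixed, widen the score spread on studies with no finding, and the number of threshold crossings, which are exactly the false positives, goes up.

```python
import random

random.seed(0)

def simulate_fp_count(n_negatives: int, mean: float, spread: float, threshold: float) -> int:
    """Count no-finding studies whose model score crosses a fixed alert threshold.

    Purely illustrative: scores are drawn from a Gaussian, not from any real model.
    """
    scores = [random.gauss(mean, spread) for _ in range(n_negatives)]
    return sum(s >= threshold for s in scores)

THRESHOLD = 0.8
# Technically clean studies: scores on true negatives sit well below threshold with little spread.
clean_fps = simulate_fp_count(10_000, mean=0.30, spread=0.12, threshold=THRESHOLD)
# Degraded studies: same central tendency, but more uncertainty pushes more scores over the line.
degraded_fps = simulate_fp_count(10_000, mean=0.30, spread=0.25, threshold=THRESHOLD)

print(clean_fps, degraded_fps)  # the wider, more uncertain distribution crosses far more often
```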
The second explanation is population characteristics. Our validation datasets were demographically diverse, but they weren't locally calibrated to any specific institution's patient population. One of our deployment sites serves a large population of patients with chronic obstructive lung disease, where baseline chest X-ray appearance — hyperinflation, flattened diaphragms, bullous changes — creates a background that the model sometimes interprets as pathological. Local case mix shapes local false positive rates in ways that a global validation dataset can't fully predict.
The third explanation is what we call prior-study artifacts. When temporal comparison is active, the model occasionally generates false positives when comparing studies that appear to show interval change but where the apparent change reflects different patient positioning or acquisition technique rather than true clinical change. This is harder to eliminate because the temporal comparison feature is also responsible for a meaningful portion of our true positive yield — turning it off reduces false positives but also reduces true positive sensitivity.
Which Alert Categories Generate the Most False Positives
The false positive rate is not uniform across alert categories, and that variation is clinically significant.
Pneumothorax alerts have the lowest false positive rate in live deployment at 4.1%. Pneumothorax has well-defined radiographic criteria, is present at meaningful clinical rates in our ED population, and the imaging feature — pleural line separation — is distinct enough that the model makes reliable predictions even in degraded images. Radiologists dismiss very few pneumothorax alerts.
Pulmonary nodule alerts have the highest false positive rate at 18.3%. Nodule detection in clinical operation is genuinely hard because the clinical population includes a high proportion of normal variants that overlap with small nodule appearance — end-on vessels, nipple shadows, rib fracture callus. Most of the false positive nodule alerts are being generated by these normal variants rather than by clear imaging errors. We're addressing this with a dedicated normal variant classifier that runs as a post-processing step on positive nodule predictions, but this category remains our most challenging in terms of clinical trust.
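For those curious how a post-processing step like that is wired in, here is a minimal sketch. The candidate structure, classifier interface, and threshold value are assumptions for illustration, not our actual implementation.

```python
from typing import Callable, Sequence

# Hypothetical types: a "candidate" is one positive nodule prediction with its image patch.
Candidate = dict  # e.g. {"patch": ..., "score": 0.91, "location": (x, y)}
VariantClassifier = Callable[[Candidate], float]  # returns P(candidate is a normal variant)

def filter_normal_variants(
    candidates: Sequence[Candidate],
    variant_classifier: VariantClassifier,
    variant_threshold: float = 0.9,   # assumed value; in practice tuned on local data
) -> list[Candidate]:
    """Post-processing step: drop positive nodule predictions that a secondary
    classifier confidently attributes to a normal variant (end-on vessel,
    nipple shadow, rib fracture callus) before any alert is raised."""
    kept = []
    for cand in candidates:
        if variant_classifier(cand) < variant_threshold:
            kept.append(cand)
    return kept
```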
Intracranial hemorrhage alerts show a 6.2% false positive rate, with most false positives clustering around beam hardening artifact on head CT and calcified structures that the model occasionally misclassifies. Radiologists recognize these false positives quickly — the review time for a dismissed hemorrhage alert averages 38 seconds, suggesting the dismissal is based on rapid visual recognition rather than extended deliberation.
Alert Fatigue: What We Measured
Alert fatigue, the tendency to dismiss alerts without careful review when alert volume is high, is the operational risk that undermines the clinical value of any AI triage system. We track alert engagement metrics at all deployed sites: time to open a flagged study, whether the flagged region is viewed, and whether the report content correlates with the AI flag.
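A rough sketch of what those engagement metrics look like when summarized per site is below; the record fields are stand-ins rather than our logging schema.

```python
from dataclasses import dataclass
from statistics import median

# Illustrative per-alert engagement record; field names are assumptions, not our logging schema.
@dataclass
class AlertEngagement:
    seconds_to_open: float         # time from alert to the flagged study being opened
    region_viewed: bool            # did the radiologist view the flagged region?
    report_mentions_finding: bool  # does the report address the flagged finding?

def engagement_summary(events: list[AlertEngagement]) -> dict[str, float]:
    """Site-level engagement metrics of the kind we monitor for alert fatigue."""
    return {
        "median_seconds_to_open": median(e.seconds_to_open for e in events),
        "region_view_rate": sum(e.region_viewed for e in events) / len(events),
        "report_correlation_rate": sum(e.report_mentions_finding for e in events) / len(events),
    }
```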
At the site with the highest alert volume (a Level I trauma center averaging 340 studies per day), we observed alert disengagement beginning to emerge at approximately 14 weeks post-deployment. Dismissal speed for positive alerts increased: radiologists were dismissing flags faster, a behavioral signal of less careful review. The pattern was subtle and wouldn't have been visible without the engagement metrics, but it was real.
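A simplified version of the check that surfaces this kind of drift might look like the following. The window lengths and drift fraction are illustrative stand-ins, and the real monitoring combines several signals rather than dismissal speed alone.

```python
from statistics import median

def dismissal_speed_drift(
    weekly_median_dismissal_seconds: list[float],
    baseline_weeks: int = 8,
    drift_fraction: float = 0.25,   # assumed sensitivity; in practice tuned per site
) -> bool:
    """Flag possible disengagement: recent dismissals markedly faster than the
    baseline established in the first weeks after deployment.

    Illustrative logic only, not our production monitor.
    """
    if len(weekly_median_dismissal_seconds) <= baseline_weeks:
        return False
    baseline = median(weekly_median_dismissal_seconds[:baseline_weeks])
    recent = median(weekly_median_dismissal_seconds[-4:])
    return recent < (1.0 - drift_fraction) * baseline

# Example: dismissal times drifting from ~60s down toward ~40s would trip the check.
weeks = [62, 58, 61, 60, 59, 63, 60, 58, 55, 52, 48, 44, 41, 40]
print(dismissal_speed_drift(weeks))  # True
```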
The intervention we implemented was a threshold recalibration that increased the confidence requirement for alerts — trading a small reduction in sensitivity for a meaningful reduction in alert volume. Alert volume dropped by 22%. Dismissal speed returned to baseline within three weeks. Over the following quarter, confirmed true positive rate on remaining alerts increased by 4.1 percentage points — meaning the higher-confidence alerts were generating more clinically actionable findings per flag.
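Mechanically, that recalibration amounts to raising the alert threshold to a quantile of recent alert scores. Here is a hedged sketch, assuming access to the model scores behind recently fired alerts; the exact procedure we use also re-checks per-category sensitivity before rollout.

```python
import math

def recalibrated_threshold(alert_scores: list[float], volume_reduction: float) -> float:
    """Pick a new, higher alert threshold so that roughly `volume_reduction`
    of recent alerts would no longer fire.

    Sketch only: assumes the scores of recently fired alerts are available and
    raises the threshold to the corresponding quantile of that distribution.
    """
    if not 0.0 < volume_reduction < 1.0:
        raise ValueError("volume_reduction must be a fraction between 0 and 1")
    ranked = sorted(alert_scores)
    cut_index = math.floor(volume_reduction * len(ranked))
    return ranked[min(cut_index, len(ranked) - 1)]

# Example: dropping roughly the lowest-confidence fifth of recent alerts,
# in the spirit of the 22% volume reduction described above.
recent_scores = [0.71, 0.74, 0.78, 0.81, 0.83, 0.86, 0.88, 0.90, 0.93, 0.97]
print(recalibrated_threshold(recent_scores, volume_reduction=0.22))  # 0.78
```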
This is a counterintuitive but important finding: a system generating fewer alerts can deliver better clinical outcomes than one that maximizes sensitivity at the cost of alert fatigue. The optimal operating point is not maximum sensitivity — it's the point where radiologist engagement with alerts is high enough to capture the benefit of the true positives without the dismissal behavior that erodes system value.
What Reduces False Positives Most Effectively
Three interventions have shown measurable impact on our live false positive rates:
Site-specific threshold calibration — setting alert thresholds based on locally collected data rather than global validation thresholds — has been the single most effective intervention. Institutions with access to three to four months of shadow deployment data can calibrate to their local case mix in a way that reduces false positives without sacrificing meaningful true positive yield. This requires ongoing monitoring infrastructure but delivers consistent results.
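As a sketch of what that calibration step can look like, assuming a few months of shadow-deployment scores paired with radiologist confirmations, and with the target rate and candidate grid as illustrative stand-ins:

```python
def calibrate_site_threshold(
    shadow_records: list[tuple[float, bool]],   # (model score, radiologist confirmed?)
    target_fp_rate: float = 0.08,               # assumed local target, not a recommendation
    candidate_thresholds: tuple[float, ...] = tuple(t / 100 for t in range(50, 100)),
) -> float:
    """Choose the lowest threshold whose alerts, replayed on shadow-deployment data,
    would have a dismissal (false positive) rate at or below the local target.

    Sketch only: real calibration also checks the true positive yield retained
    per alert category before the threshold goes live.
    """
    for threshold in candidate_thresholds:
        alerts = [(s, confirmed) for s, confirmed in shadow_records if s >= threshold]
        if not alerts:
            break
        fp_rate = sum(1 for _, confirmed in alerts if not confirmed) / len(alerts)
        if fp_rate <= target_fp_rate:
            return threshold
    return candidate_thresholds[-1]
```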
Image quality filtering — screening out studies that fall below a minimum quality threshold before applying the full detection model — has reduced false positives from degraded images by approximately 35% at the sites where we've implemented it. The tradeoff is that borderline-quality studies that contain real findings are also screened out and returned to the radiologist without an AI overlay. Radiologists at affected sites have been comfortable with this tradeoff.
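A minimal sketch of the gating logic follows, assuming a separate image-quality model exists; the cutoff value and types are illustrative stand-ins.

```python
from typing import Callable, Optional

Study = dict  # placeholder for an imaging study; the structure is illustrative

def run_with_quality_gate(
    study: Study,
    quality_score: Callable[[Study], float],   # assumed 0..1 image-quality model
    detector: Callable[[Study], list[dict]],   # the full detection model
    min_quality: float = 0.4,                  # assumed cutoff; tuned per site
) -> Optional[list[dict]]:
    """Screen out studies below a minimum quality threshold before running detection.

    Returning None means the study goes to the radiologist with no AI overlay,
    which is the tradeoff described above.
    """
    if quality_score(study) < min_quality:
        return None
    return detector(study)
```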
Normal variant expansion in training data has progressively reduced false positives in the pulmonary nodule category. Each major retraining cycle incorporates additional normal variant examples specifically sourced to address the false positive patterns we're seeing in live deployment. The pulmonary nodule false positive rate has decreased from 24.1% at initial deployment to the current 18.3% over three retraining cycles. The trajectory is in the right direction, but the category remains our most active area of ongoing work.
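One way such examples can be sourced is sketched below, under the assumption that dismissal reasons are captured in coded form at the time an alert is dismissed; the codes and fields are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class DismissedAlert:
    study_id: str
    category: str
    dismissal_reason: str   # coded reason entered when the radiologist dismisses the alert

# Illustrative set of coded dismissal reasons that correspond to normal variants; codes are hypothetical.
VARIANT_REASONS = {"end_on_vessel", "nipple_shadow", "rib_callus"}

def variant_examples_for_retraining(dismissed: list[DismissedAlert]) -> list[dict]:
    """Source additional normal-variant training examples from live dismissed nodule alerts."""
    return [
        {"study_id": d.study_id, "label": "normal_variant", "variant": d.dismissal_reason}
        for d in dismissed
        if d.category == "pulmonary_nodule" and d.dismissal_reason in VARIANT_REASONS
    ]
```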
The 50,000-study dataset hasn't given us all the answers. But it's given us a level of operational specificity about where false positives come from and what actually moves the rate that no controlled validation study can replicate. That's the value of sustained post-market monitoring — not just demonstrating compliance, but actually learning what the system does in the world as it is rather than the world as the training data represents it.