One question we get regularly from clinical leadership evaluating AI deployment is how our performance numbers compare across modalities. It's a reasonable question: chest X-ray, CT, and MRI are very different imaging techniques, and a 97% sensitivity figure for chest X-ray tells you little about what to expect from CT or MRI. The answer requires some context before the numbers make sense.
Why Modalities Aren't Directly Comparable
Sensitivity and specificity figures from different modalities are not apples-to-apples comparisons, even when they're reported to the same decimal place. Each modality presents the AI model with a different problem: a different data structure, different resolution, a different pathology detection task, and a different ground truth reference standard.
Chest X-ray is a 2D projection image, so the model's input is comparatively constrained. The model processes a single image (or two projections in a PA/lateral pair) and makes a binary detection decision for each class in a defined set of pathologies. The reference standard is typically a radiologist consensus read over a case-controlled validation set.
CT is volumetric. A single chest CT study contains 300-600 axial slices, and the pathology detection task involves identifying abnormalities in 3D space. The CT reference standard is often established by subspecialty radiologists (a pulmonary subspecialist rather than a general radiologist), which raises both the quality of the reference standard and, often, the difficulty of the detection task. A finding that a general radiologist marks as "present" on a chest X-ray may require a more specific description to be counted as "detected" on CT.
MRI adds another dimension of complexity: acquisition variation. MRI studies are acquired with different pulse sequences, field strengths, and contrast protocols that produce meaningfully different images from the same anatomy. A model trained on 1.5T brain MRIs performs differently on 3T brain MRIs of the same pathology. Managing that variation in training data is one of the harder technical challenges in MRI AI, and it's reflected in the generally lower sensitivity numbers across MRI applications compared to chest X-ray.
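To make that contrast concrete, here is a minimal sketch of how differently structured the inputs and prediction targets are across the three modalities. The array shapes, class names, and output fields below are generic placeholders for illustration, not a description of our production models.

```python
import numpy as np

# Illustrative shapes and values only; these placeholders are not our
# production pipeline. They show how different the inputs and prediction
# targets are from one modality to the next.

# Chest X-ray: one 2D projection, one probability per pathology class.
cxr_input = np.zeros((2048, 2048), dtype=np.float32)
cxr_output = {"pneumonia": 0.93, "pleural_effusion": 0.04, "pneumothorax": 0.01}

# CT: a volumetric stack of hundreds of axial slices, with findings
# localized in 3D rather than scored once per study.
ct_input = np.zeros((450, 512, 512), dtype=np.float32)
ct_output = [{"center_zyx": (212, 310, 145), "diameter_mm": 8.2, "prob": 0.81}]

# MRI: several volumes per study, one per pulse sequence, acquired with
# varying field strengths and protocols that the model must handle.
mri_input = {
    "T1":    np.zeros((176, 256, 256), dtype=np.float32),
    "FLAIR": np.zeros((48, 256, 256), dtype=np.float32),
}
mri_output = {"white_matter_lesions": [{"center_zyx": (30, 122, 98), "prob": 0.77}]}
```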
Our Modality Performance Data
With that context established, here is our current external validation performance across our three primary imaging modalities, with the detection task and reference standard specified for each figure.
Chest X-Ray — Pneumonia Detection
Sensitivity: 97.3% | Specificity: 91.4%
Validation set: 18,400 studies, case-controlled 40% positive, multi-center
Reference standard: 3-radiologist consensus, board-certified general radiologists
Chest X-Ray — Pleural Effusion
Sensitivity: 96.8% | Specificity: 93.1%
Validation set: 9,200 studies, multi-center
Reference standard: 3-radiologist consensus
Chest X-Ray — Pneumothorax
Sensitivity: 97.8% | Specificity: 94.6%
Validation set: 6,800 studies, multi-center
Reference standard: 3-radiologist consensus; CXR confirmed by CT where available
CT — Pulmonary Nodule Detection (6-30 mm)
Sensitivity: 94.1% | Specificity: 88.2%
Validation set: 4,100 CT studies, case-controlled 35% positive
Reference standard: Lung-RADS assessment by subspecialty thoracic radiologist
CT — Intracranial Hemorrhage (head CT)
Sensitivity: 96.8% | Specificity: 95.3%
Validation set: 3,600 head CT studies, case-controlled 30% positive
Reference standard: Neuroradiology subspecialist consensus read
MRI — Brain White Matter Lesion Detection
Sensitivity: 93.6% | Specificity: 89.4%
Validation set: 2,800 brain MRI studies, case-controlled 45% positive
Reference standard: Neuroradiology subspecialist read on FLAIR and T2 sequences
MRI — Spinal Cord Compression
Sensitivity: 91.2% | Specificity: 90.8%
Validation set: 1,900 cervical and lumbar MRI studies
Reference standard: Spine radiologist consensus; surgical correlation available for 480 cases
What the Performance Gap Between Modalities Reflects
The pattern visible in the data above — higher sensitivity in chest X-ray applications, somewhat lower in CT, and lower again in MRI — reflects several factors that aren't random.
Training data volume is one. Chest X-ray is the highest-volume radiological study performed globally, and annotated chest X-ray datasets are orders of magnitude larger than annotated CT or MRI datasets. The chest X-ray models have been trained on more examples of more pathology variations, which generally translates to better generalization performance at external validation.
Pathology definition precision is another. Pneumothorax on a chest X-ray is a well-defined radiographic finding with clear diagnostic criteria. Spinal cord compression on MRI is diagnosed in the context of clinical presentation, degree of canal narrowing, cord signal change, and clinical acuity — a more complex decision that requires the model to integrate more contextual information to make a useful prediction.
The lower numbers in MRI also reflect acquisition variation we haven't fully solved. Brain MRI studies in our validation data came from 1.5T and 3T scanners with a mix of T1, T2, FLAIR, and DWI sequences. The model's performance varies by acquisition type within the validation cohort, in ways we expect further training-data work to narrow rather than ways that indicate a fundamental limitation of the approach.
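One way to see that variation is to stratify validation results by acquisition metadata instead of reporting a single pooled figure. The sketch below uses invented placeholder counts and column names purely to show the bookkeeping; it is not our actual stratified data.

```python
import pandas as pd

# Placeholder counts for illustration only; not our validation results.
results = pd.DataFrame({
    "field_strength": ["1.5T", "1.5T", "3T", "3T"],
    "sequence":       ["FLAIR", "T2", "FLAIR", "T2"],
    "true_positive":  [412, 167, 389, 140],
    "false_negative": [31, 18, 19, 12],
})

# Aggregate detections per acquisition stratum, then compute sensitivity
# within each stratum rather than one pooled number for the whole cohort.
per_stratum = results.groupby(["field_strength", "sequence"]).sum()
per_stratum["sensitivity"] = per_stratum["true_positive"] / (
    per_stratum["true_positive"] + per_stratum["false_negative"]
)
print(per_stratum["sensitivity"].round(3))
```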
How to Use These Numbers
The honest use of these figures in deployment planning is to assess whether the performance level justifies the specific workflow application you're considering. A sensitivity of 97.8% for pneumothorax is sufficient to support emergency triage use — the false negative rate is low enough that clinical reliance on the alert is warranted. A sensitivity of 91.2% for spinal cord compression suggests a useful second-reader application, but you'd want to be cautious about using it as a primary triage filter where missed findings carry high-stakes consequences.
Every clinical use case has a different sensitivity threshold that's clinically adequate — and that threshold depends on the consequences of a false negative in that specific context, not just on the absolute number. We encourage clinical teams evaluating any AI tool, including ours, to ask that question explicitly rather than comparing sensitivity numbers in isolation.
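As a rough illustration of how to reason about that threshold, the sketch below converts a sensitivity/specificity pair into expected missed findings and false alarms per 1,000 studies. The prevalence values are assumptions a deployment site would replace with its own case mix; they are not drawn from our case-controlled validation sets, which are deliberately enriched for positives.

```python
def expected_outcomes(sensitivity, specificity, prevalence, n_studies=1000):
    """Expected counts per n_studies at an assumed pathology prevalence.

    Prevalence is a deployment-site assumption, not a validation-set figure.
    """
    positives = n_studies * prevalence
    negatives = n_studies - positives
    return {
        "missed_findings": round(positives * (1 - sensitivity), 1),  # false negatives
        "caught_findings": round(positives * sensitivity, 1),        # true positives
        "false_alarms":    round(negatives * (1 - specificity), 1),  # false positives
    }

# Pneumothorax triage (97.8% / 94.6%) at an assumed 5% prevalence:
print(expected_outcomes(0.978, 0.946, prevalence=0.05))
# Spinal cord compression second read (91.2% / 90.8%) at an assumed 10% prevalence:
print(expected_outcomes(0.912, 0.908, prevalence=0.10))
```

Under those assumed prevalences, the pneumothorax figure works out to roughly one missed finding per 1,000 studies, while the cord compression figure works out to closer to nine. That is the kind of difference, together with the clinical consequences of each miss, that should drive the workflow decision.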