Our chest X-ray model was accepted for publication in the spring of 2022. The peer review process had been rigorous — two rounds of revision, requests for additional validation cohorts, questions about statistical methodology that pushed us to improve our analysis. The final paper reported 96.4% sensitivity on pneumonia detection, validated on a prospective external cohort. It felt like an endpoint.
It was a starting line. The first clinical deployment happened 18 months later. This is an account of what those 18 months actually contained — not the polished version, but the sequence of problems we had to solve in approximately the order we encountered them.
Months 1-3: The Distribution Shift Problem
The first thing we discovered when we connected the published model to real PACS environments was distribution shift — a performance gap between the controlled research environment and clinical operations. The model had been trained and validated on curated research datasets where images met quality criteria, metadata was complete, and case selection was controlled. Clinical PACS environments don't look like that.
Over a two-month period of shadow deployment at our first partner hospital, we ran the model on every chest X-ray acquired and compared its predictions to radiologist reads. The sensitivity held up at 95.1% on pneumonia detection — close to published performance. But the false positive rate was 14.3%, nearly double what we'd reported in the paper. The additional false positives clustered around three image categories: technically degraded images acquired in portable ED settings, images with prior surgical hardware visible, and studies from pediatric patients where the chest anatomy differed significantly from the adult-dominated training set.
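For anyone replicating this kind of shadow evaluation, the arithmetic is simple. Below is a minimal sketch of the comparison; the record structure and category names are illustrative stand-ins, not our production schema.

```python
# Minimal sketch of the shadow-deployment comparison. Assumes each study has
# already been scored by the model and labeled from the radiologist read; the
# field names and categories are illustrative, not our production schema.
from dataclasses import dataclass

@dataclass
class StudyResult:
    model_positive: bool  # model flagged the finding
    rad_positive: bool    # radiologist read was positive
    category: str         # e.g. "portable_ed", "surgical_hardware", "pediatric"

def sensitivity_and_fpr(results):
    tp = sum(r.model_positive and r.rad_positive for r in results)
    fn = sum(not r.model_positive and r.rad_positive for r in results)
    fp = sum(r.model_positive and not r.rad_positive for r in results)
    tn = sum(not r.model_positive and not r.rad_positive for r in results)
    sensitivity = tp / (tp + fn) if tp + fn else float("nan")
    fpr = fp / (fp + tn) if fp + tn else float("nan")
    return sensitivity, fpr

def metrics_by_category(results):
    # Break the same metrics out per image category to find where
    # the false positives cluster.
    return {c: sensitivity_and_fpr([r for r in results if r.category == c])
            for c in sorted({r.category for r in results})}
```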
None of these three categories was secret or surprising; they just weren't well-represented in the research dataset. We spent months 2 and 3 sourcing additional training cases specifically from each category, retraining, and re-validating. The false positive rate dropped to 8.1%, still higher than published but within a range our clinical partners could work with as we continued improving.
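The retraining itself used standard machinery. If you are doing something similar in PyTorch, inverse-frequency oversampling is one common way to make rare categories show up more often during training; this is a sketch under that assumption, not our exact recipe.

```python
# Sketch of category-weighted oversampling for retraining, assuming a PyTorch
# pipeline. The inverse-frequency weighting is a common default, not a
# description of our exact retraining setup.
from collections import Counter
from torch.utils.data import DataLoader, WeightedRandomSampler

def make_balanced_loader(dataset, categories, batch_size=32):
    """categories[i] is a string label for dataset[i], e.g. 'pediatric'."""
    counts = Counter(categories)
    # Inverse-frequency weights: studies from rare categories are drawn
    # more often, so each batch sees a more even category mix.
    weights = [1.0 / counts[c] for c in categories]
    sampler = WeightedRandomSampler(weights, num_samples=len(categories),
                                    replacement=True)
    return DataLoader(dataset, batch_size=batch_size, sampler=sampler)
```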
Months 4-6: The Workflow Integration Wasn't Real
We had a DICOM integration design on paper before we started the hospital partnership. What we didn't have was a live implementation that had been tested against the specific PACS and RIS versions our partner was running. This turned out to matter more than we expected.
The partner hospital ran a PACS version that handled DICOM metadata tagging differently from what our integration assumed. Studies were arriving in our system with incorrect acquisition timestamps, which broke our temporal comparison logic — the feature that allows the model to compare a new study with a prior study to detect growth or change. Four weeks of engineering time went into compatibility work that wasn't in the original scope.
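To give a flavor of that compatibility work: much of it reduced to never trusting a single DICOM tag. The sketch below, using pydicom, falls back through common date/time tags to recover an acquisition timestamp; the fallback order is illustrative, and images in the wild need more edge-case handling than this.

```python
# Defensive acquisition-timestamp extraction with pydicom. The fallback order
# is illustrative; real-world DICOM needs more edge-case handling than shown.
from datetime import datetime
from typing import Optional
import pydicom

def acquisition_datetime(ds: pydicom.Dataset) -> Optional[datetime]:
    """Best-effort acquisition timestamp, falling back through common tags."""
    dt = getattr(ds, "AcquisitionDateTime", None)
    if dt:
        # DT values look like YYYYMMDDHHMMSS[.FFFFFF][+/-ZZXX];
        # keep and pad the stable 14-character prefix.
        return datetime.strptime(str(dt)[:14].ljust(14, "0"), "%Y%m%d%H%M%S")
    for date_tag, time_tag in (("AcquisitionDate", "AcquisitionTime"),
                               ("ContentDate", "ContentTime"),
                               ("StudyDate", "StudyTime")):
        date = getattr(ds, date_tag, None)
        time = getattr(ds, time_tag, None) or "000000"
        if date:
            return datetime.strptime(str(date) + str(time)[:6].ljust(6, "0"),
                                     "%Y%m%d%H%M%S")
    return None
```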
The more significant integration problem was HL7 — specifically, how the model's output was supposed to flow back into the radiologist's workflow. Our original design sent structured DICOM Secondary Capture images with overlay annotations back to the PACS. The radiology department preferred to see AI flags in the RIS worklist as a text field, not as a separate DICOM series they had to navigate to. Changing that output format required coordination with the RIS vendor, which had its own support queue and its own timeline. Months 5 and 6 were largely consumed by that coordination.
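For concreteness, the worklist flag ended up looking roughly like the fragment below: an HL7 v2 ORU^R01 message carrying the AI finding as plain text in an OBX segment. The field contents and segment layout here are simplified; the real format was dictated by the RIS vendor's interface specification.

```python
# Simplified HL7 v2 ORU^R01 fragment carrying an AI triage flag as text.
# Segment layout and field values are illustrative; the actual format was
# defined by the RIS vendor's interface specification.
def build_ai_flag_message(accession: str, finding: str, confidence: float) -> str:
    msh = ("MSH|^~\\&|AI_TRIAGE|RADIOLOGY|RIS|HOSPITAL|"
           "20230101120000||ORU^R01|MSG0001|P|2.5")
    obr = f"OBR|1|{accession}||CXR^Chest X-ray"
    # OBX carries the flag as plain text (value type TX) so the worklist
    # renders it directly instead of pointing at a separate DICOM series.
    obx = (f"OBX|1|TX|AIFLAG^AI Triage Flag||"
           f"{finding} (model confidence {confidence:.2f})||||||F")
    return "\r".join([msh, obr, obx])  # HL7 v2 segments end with CR

print(build_ai_flag_message("ACC12345", "Suspected pneumothorax", 0.91))
```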
Months 7-9: Clinical Governance
The hospital's clinical informatics committee had approved the shadow deployment as a quality improvement project. Moving to active use — where the AI triage influenced worklist ordering — required a different governance pathway. The committee wanted a policy document defining who was responsible for AI alerts, how false positives and false negatives would be tracked, and what the process was for temporarily suspending the AI system if performance concerns arose.
Writing that policy document with the clinical informatics team took six weeks. Not because anyone was obstructing — everyone wanted to move forward — but because these were genuinely novel questions for the institution and the answers required working through scenarios that hadn't come up before. What happens if the AI flags a critical finding that the radiologist reviewed and assessed as a false positive, but it turns out the radiologist was wrong? Who documents that? What triggers a performance review? These questions have clear answers in hindsight, but building consensus on them in real time, with clinical and legal stakeholders in the room, takes time.
Months 10-14: FDA Submission
The FDA 510(k) submission is covered in more detail elsewhere on this site. The summary for purposes of this chronology: we submitted in month 10, received our first Additional Information request in month 12, responded in month 13, and received clearance in month 14. The clearance covered chest X-ray triage for pneumonia, pleural effusion, and pneumothorax detection. CT applications were filed separately under a second submission that cleared four months later.
The submission process ran in parallel with ongoing site deployment work, but it dominated engineering and clinical staff capacity and slowed the deployment timeline. Reaching clearance in month 14, four months after submission, is a significant operational reality that most research-to-deployment timelines underestimate, especially for a first-time medical AI submission.
Months 15-18: Calibration and Go-Live
Post-clearance, we moved from shadow deployment to active triage use. The transition required recalibrating our confidence thresholds for the specific population and workflow at the partner hospital. The threshold that minimized the false positive rate in our training data wasn't the right threshold for this environment — the case mix, the clinical stakes, and the radiologist alert tolerance were all different. We spent six weeks adjusting thresholds in collaboration with the clinical team and re-validating the calibrated model on a hold-out set from the hospital's own data.
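The selection rule we converged on is easy to state: among operating points on the hospital hold-out set that keep sensitivity at or above the floor agreed with the clinical team, take the one with the lowest false positive rate. A minimal sketch with scikit-learn, using a placeholder floor rather than our clinical target:

```python
# Threshold selection on a hold-out set: among operating points that meet a
# sensitivity floor, pick the one with the lowest false positive rate. The
# 0.95 floor is a placeholder, not our clinical target.
import numpy as np
from sklearn.metrics import roc_curve

def pick_threshold(y_true, y_score, min_sensitivity=0.95):
    fpr, tpr, thresholds = roc_curve(y_true, y_score)
    ok = tpr >= min_sensitivity  # operating points meeting the floor
    if not ok.any():
        raise ValueError("no threshold meets the sensitivity floor")
    best = np.argmin(fpr[ok])    # lowest FPR among qualifying points
    return thresholds[ok][best], tpr[ok][best], fpr[ok][best]
```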
The go-live itself was anticlimactic, which is exactly how it should be. The system was working, the alerts were flowing into the workflow as designed, and the radiologists had been using the shadow outputs long enough that active triage was a continuation rather than a new experience. The first-week performance review showed sensitivity of 96.1% on critical findings and a false positive rate of 5.4%, numbers we were comfortable with and that the radiology department found clinically acceptable.
Eighteen months. That's what the gap between publication and deployment actually looked like. For anyone else on that journey: the timeline is real, the problems are solvable, and none of the 18 months were wasted.