Background: Retrospective studies have shown promising results using artificial intelligence (AI) to improve mammography screening accuracy and reduce screen-reading workload; however, to our knowledge, a randomized trial has not yet been conducted. We aimed to assess the clinical safety of an AI-supported screen-reading protocol (using Transpara, a convolutional neural network-based system) compared with standard screen reading by radiologists following mammography.
Method: In this randomized, controlled, population-based trial, women aged 40–80 years eligible for mammography screening (including general screening with 1.5–2-year intervals and annual screening for those with moderate hereditary risk of breast cancer or a history of breast cancer) at 4 screening sites in Sweden were informed about the study as part of the screening invitation. Those who did not opt out were randomly allocated (1:1) to AI-supported screening (intervention group) or standard double reading without AI (control group). Screening examinations were automatically randomized by the Picture Archiving and Communication System with a pseudo-random number generator after image acquisition. The participants and the radiographers acquiring the screening examinations, but not the radiologists reading the screening examinations, were masked to study group allocation. The AI system (Transpara version 1.7.0) provided an examination-based malignancy risk score on a 10-level scale that was used to triage screening examinations to single reading (score 1–9) or double reading (score 10), with AI risk scores (for all examinations) and computer-aided detection marks (for examinations with risk score 8–10) available to the radiologists doing the screen reading. Here we report the prespecified clinical safety analysis, to be done after 80 000 women were enrolled, to assess the secondary outcome measures of early screening performance (cancer detection rate, recall rate, false positive rate, positive predictive value [PPV] of recall, and type of cancer detected [invasive or in situ]) and screen-reading workload. Analyses were done in the modified intention-to-treat population (ie, all women randomly assigned to a group with one complete screening examination, excluding women recalled due to enlarged lymph nodes diagnosed with lymphoma).
The lowest acceptable limit for safety in the intervention group was a cancer detection rate of more than 3 per 1,000 participants screened. The trial is registered with ClinicalTrials.gov, NCT04838756, and is closed to accrual; follow-up is ongoing to assess the primary endpoint of the trial, interval cancer rate.
Findings: Between April 12, 2021, and July 28, 2022, 80 033 women were randomly assigned to AI-supported screening (n = 40 003) or double reading without AI (n = 40 030). 13 women were excluded from the analysis. The median age was 54.0 years (IQR 46.7–63.9). Race and ethnicity data were not collected.
AI-supported screening among 39 996 participants resulted in 244 screen-detected cancers, 861 recalls, and a total of 46 345 screen readings. Standard screening among 40 024 participants resulted in 203 screen-detected cancers, 817 recalls, and a total of 83 231 screen readings.
Cancer detection rates were 6.1 (95% CI 5.4–6.9) per 1,000 screened participants in the intervention group, above the lowest acceptable limit for safety, and 5.1 (4.4–5.8) per 1,000 in the control group, a cancer detection rate ratio of 1.2 (95% CI 1.0–1.5; p = 0.052). Recall rates were 2.2% (95% CI 2.0–2.3) in the intervention group and 2.0% (1.9–2.2) in the control group. The false positive rate was 1.5% (95% CI 1.4–1.7) in both groups. The PPV of recall was 28.3% (95% CI 25.3–31.5) in the intervention group and 24.8% (21.9–28.0) in the control group. In the intervention group, 184 (75%) of 244 cancers detected were invasive and 60 (25%) were in situ; in the control group, 165 (81%) of 203 cancers were invasive and 38 (19%) were in situ.
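The point estimates above follow directly from the raw counts reported in the Findings; a minimal check (the confidence intervals require a method the abstract does not restate, so only the rates are reproduced here):

```python
# Reproduce the early screening performance point estimates from the
# raw counts reported in the Findings section.
def per_1000(events: int, n: int) -> float:
    return 1000 * events / n

# intervention (AI-supported) vs control (standard double reading)
n_i, cancers_i, recalls_i = 39_996, 244, 861
n_c, cancers_c, recalls_c = 40_024, 203, 817

cdr_i = per_1000(cancers_i, n_i)            # cancer detection rate
cdr_c = per_1000(cancers_c, n_c)
ratio = cdr_i / cdr_c                       # detection rate ratio

recall_i = 100 * recalls_i / n_i            # recall rate, %
recall_c = 100 * recalls_c / n_c
fp_i = 100 * (recalls_i - cancers_i) / n_i  # false positive rate, %
fp_c = 100 * (recalls_c - cancers_c) / n_c
ppv_i = 100 * cancers_i / recalls_i         # PPV of recall, %
ppv_c = 100 * cancers_c / recalls_c

print(f"CDR: {cdr_i:.1f} vs {cdr_c:.1f} per 1000 (ratio {ratio:.1f})")
print(f"Recall rate: {recall_i:.1f}% vs {recall_c:.1f}%")
print(f"False positive rate: {fp_i:.1f}% vs {fp_c:.1f}%")
print(f"PPV of recall: {ppv_i:.1f}% vs {ppv_c:.1f}%")
# → CDR 6.1 vs 5.1 per 1000 (ratio 1.2); recall 2.2% vs 2.0%;
#   false positives 1.5% vs 1.5%; PPV 28.3% vs 24.8%
```

Each figure rounds to the value reported in the text, confirming the internal consistency of the reported counts and rates.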
The screen-reading workload was reduced by 44.3% with AI (46 345 vs 83 231 screen readings).
Interpretation: AI-supported mammography screening resulted in a similar cancer detection rate compared with standard double reading, with a substantially lower screen-reading workload, indicating that the use of AI in mammography screening is safe. The trial was thus not halted, and the primary endpoint of interval cancer rate will be assessed in 100 000 enrolled participants after 2 years of follow-up.
Funding: Swedish Cancer Society, Confederation of Regional Cancer Centres, and the Swedish governmental funding for clinical research (ALF).
…AI-supported screening resulted in 20% more cancers (244 vs 203) being detected than with standard screening. 152 stage T1 invasive cancers were detected in the intervention group compared with 129 in the control group, which might indicate an increase in early detection without the need for supplementary imaging methods. The incremental increase was, however, not as large as that observed with digital breast tomosynthesis screening in a previous study.22 Still, the higher cancer detection with tomosynthesis compared with mammography in screening has not convincingly been shown to translate into a reduction of interval cancers,22 which calls its clinical importance into question, particularly since tomosynthesis is also a more resource-demanding technique. The clinical significance of the additional invasive cancers detected in our study remains to be evaluated. The evolution of AI over time could affect all available tests for breast cancer screening, but the use of AI in tomosynthesis screening has not yet been evaluated in a prospective study.
We also found increased detection of in situ cancers with AI-supported screening compared with standard screening (60 vs 38), which could be concerning in terms of overdiagnosis. The risk of overtreating an in situ cancer is greater with low-grade cancers, since they might never progress into a clinically relevant event during the patient's lifetime.23 Hence, the planned characterisation of detected cancers in the full study population will bring some clarity to possible overdiagnosis with AI-supported screening. Fenton and colleagues showed a 34% increase in the detection of in situ cancers (from 1.17 to 1.57 per 1,000 screening mammograms, p = 0.09) after the implementation of conventional CAD in screening, but without a parallel increase in the detection of invasive cancer.24 Conventional CAD was also shown to increase false positives and related costs, and its use in screening could ultimately not be justified.2, 24, 25, 26, 27 AI thus seems to have improved performance compared with that of conventional CAD, but could still have hypersensitivity to calcifications, a typical presentation of in situ cancers.13 Subsequent screening will show whether the relatively higher detection observed in our trial is a result of screening with a more sensitive technique for the first time (ie, a prevalence effect), causing an initial high incidence that levels out during subsequent screening rounds.28
We found that the benefit of AI-supported screening in terms of screen-reading workload reduction was considerable. The actual time saved was not measured, but, if we assume that a radiologist reads on average 50 screening examinations per hour, it would have taken one radiologist 4.6 months less to read the 46 345 screening examinations in the intervention group compared with the 83 231 in the control group. There was concern about whether AI would lead to an increase in cases referred to consensus meetings, considering the possible need to discuss CAD findings and the reader anxiety that might arise from single reading. Consensus meetings constitute an important step to increase specificity, but are resource demanding.3 Contrary to expectations, the proportion of screenings that led to a consensus meeting was not affected by the use of AI.
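The workload arithmetic above can be verified directly. The 50 readings per hour figure is the assumption stated in the text; converting the saved hours into months additionally requires a working-hours-per-month figure, and the ~160 hours per month used below is our assumption, not the paper's:

```python
# Check the screen-reading workload figures reported in the trial.
readings_ai, readings_std = 46_345, 83_231

saved = readings_std - readings_ai          # screen readings avoided
reduction_pct = 100 * saved / readings_std  # relative workload reduction

READS_PER_HOUR = 50    # assumption stated in the text
HOURS_PER_MONTH = 160  # our assumption: ~full-time working hours/month

hours_saved = saved / READS_PER_HOUR
months_saved = hours_saved / HOURS_PER_MONTH

print(f"Readings saved: {saved} ({reduction_pct:.1f}% reduction)")
print(f"~{hours_saved:.0f} reading hours, ~{months_saved:.1f} months")
# → 36886 readings saved (44.3% reduction), ~738 hours, ~4.6 months
```

Under these assumptions, the result matches both the 44.3% workload reduction and the 4.6 months of reader time cited in the discussion.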