“CT Foundation: Taking Medical Imaging Embeddings 3D”, 2024-10-21:
Announcing the release of a new medical imaging foundation tool for 3D CT volumes: CT Foundation. It builds on our work in chest radiographs, dermatology, and digital pathology, extending it from 2D images to 3D volumes.
…CT Foundation was developed using VideoCoCa, a video-text model designed for efficient transfer learning from 2D Contrastive Captioners (CoCa). CoCa models take text and images as input and encode them into a shared, language-aligned embedding space. They include a multimodal text decoder that can decode these embeddings into text tokens.
CoCa models are trained to minimize two types of loss:
The first is a captioning loss: the discrepancy between the ground-truth captions of the training images and the captions the CoCa model decodes. This term drives the accuracy of the generated captions.
The second is a contrastive loss, which minimizes the distance between CoCa’s encodings of matched image-text pairs while pushing apart mismatched pairs, yielding a richer semantic understanding of the images.

VideoCoCa extends a pretrained CoCa model by pooling the embeddings of multiple frames into a compact representation of the entire frame sequence.
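As a concrete sketch, the two objectives can be written as a symmetric InfoNCE-style contrastive term plus a token-level cross-entropy captioning term. The NumPy snippet below is a minimal illustration under assumed shapes and a hypothetical temperature value, not the actual CoCa training code.

```python
import numpy as np

def softmax_xent(logits, labels):
    """Mean cross-entropy between softmax(logits) and integer labels."""
    z = logits - logits.max(axis=1, keepdims=True)        # numerical stability
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()

def contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Pulls matched image-text pairs (the diagonal of the similarity
    matrix) together and pushes mismatched pairs apart."""
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature      # (batch, batch) cosine similarities
    labels = np.arange(logits.shape[0])     # i-th image matches i-th text
    # Symmetric: image-to-text and text-to-image directions.
    return 0.5 * (softmax_xent(logits, labels) + softmax_xent(logits.T, labels))

def captioning_loss(decoder_logits, caption_tokens):
    """Token-level cross-entropy between the decoder's predicted caption
    tokens and the ground-truth caption."""
    b, t, v = decoder_logits.shape
    return softmax_xent(decoder_logits.reshape(b * t, v),
                        caption_tokens.reshape(b * t))
```

The overall training objective is then a weighted combination of the two terms.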
CT Foundation was trained on over half a million de-identified CT volumes spanning body parts from head to extremities, each paired with its corresponding radiology report.
We first trained a medical image-specific 2D CoCa model and used it as the basis for VideoCoCa. We then trained VideoCoCa on axial CT slices (the sequence of 2D slices that comprises a volume) paired with the corresponding radiology reports.
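The slice-to-volume step can be sketched as follows. The per-slice encoder here is a hypothetical stand-in (a fixed random projection) for the 2D image encoder, and mean pooling stands in for the model's learned pooling; slice counts and dimensions are illustrative.

```python
import numpy as np

def volume_embedding(axial_slices, encode_2d):
    """Encode each axial CT slice with a 2D image encoder, then pool the
    per-slice embeddings into one compact volume-level representation.
    Mean pooling is a simplified stand-in for a learned pooler."""
    per_slice = np.stack([encode_2d(s) for s in axial_slices])  # (num_slices, dim)
    pooled = per_slice.mean(axis=0)                             # (dim,)
    return pooled / np.linalg.norm(pooled)  # unit norm for cosine similarity

# Hypothetical frozen 2D encoder: a fixed random linear projection.
rng = np.random.default_rng(0)
proj = rng.normal(size=(16 * 16, 8))
encode_2d = lambda s: s.reshape(-1) @ proj

volume = [rng.normal(size=(16, 16)) for _ in range(40)]  # 40 axial slices
emb = volume_embedding(volume, encode_2d)
print(emb.shape)  # (8,)
```

The resulting vector plays the role of the volume-level embedding that is aligned with the report text during training.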