Towards Large-Scale 3D Medical Image Pre-training with Geometric Context Priors


This work is an extension of our CVPR 2024 paper. We introduce VoCo, a 3D medical vision foundation model. This paper introduces a new benchmark: (1) We build the largest pre-training dataset in this field, PreCT-160K, consisting of 160,000 3D CT volumes (~42M slices) covering various anatomical regions of the human body. (2) We investigate the scaling laws of medical foundation models, pre-train models with 31M to 1.2B parameters, and propose guidelines for tailoring different models to various downstream tasks. (3) We explore various pre-training strategies, ranging from fully-, self-, and semi- to omni-supervised pre-training. (4) We open-source implementations of over 50 downstream medical tasks, including segmentation, classification, registration, and Visual Language Processing (VLP). (5) We also release a new segmentation dataset, VoComni, comprising 20,000 CT volumes (about 5M slices) with labels for 20 organ and tumor classes.

Code, data, and pre-trained models are available at https://github.com/Luffy03/Large-Scale-Medical.

Paper: Linshan Wu, Jiaxin Zhuang, and Hao Chen. “Large-Scale 3D Medical Image Pre-training with Geometric Context Priors”. CVPR 2024 Extension.

Paper link: https://arxiv.org/abs/2410.09890

Introduction

The scarcity of annotations poses a significant challenge in medical image analysis, as annotation demands extensive effort from radiologists, especially for high-dimensional 3D medical images. Large-scale pre-training has emerged as a promising label-efficient solution, owing to the utilization of large-scale data, large models, and advanced pre-training techniques. However, its development for medical images remains underexplored. The primary challenge lies in harnessing large-scale unlabeled data and learning high-level semantics without annotations. We observe that 3D medical images exhibit consistent geometric context, i.e., consistent geometric relations between different organs, which offers a promising way to learn consistent representations. Motivated by this, we introduce a simple yet effective Volume Contrast (VoCo) framework that leverages geometric context priors for self-supervision. Given an input volume, we extract base crops from different regions to construct positive and negative pairs for contrastive learning. We then predict the contextual position of a random crop by contrasting its similarity to the base crops. In this way, VoCo implicitly encodes the inherent geometric context into model representations, facilitating high-level semantic learning without annotations. Extensive experiments highlight the superiority of VoCo, showcasing promising transferability to unseen modalities and datasets. VoCo notably enhances performance on datasets with limited labeled cases and significantly expedites fine-tuning convergence.

Fig. 1 PreCT-160K dataset. In 3D medical images, the geometric relations between different organs are relatively consistent. We present some examples from PreCT-160K to illustrate these anatomical relationships across different regions. Motivated by this observation, we propose to leverage geometric context priors for learning consistent semantic representations and introduce a novel position prediction pretext task for pre-training.

Fig. 2 VoComni dataset: 20K volumes with labels for 20 organ and tumor classes.

Method

Fig. 3 Generating position labels for supervision. A pair consisting of a random crop k and a base crop q is assigned as positive if they share an overlapping area, and as negative otherwise. We calculate the overlap proportions as the position labels y.

The pivotal step is to generate position labels for self-supervision. We propose to leverage the inherent geometric context priors in 3D medical images. Given an input volume V, we first randomly crop a sub-volume k, with the objective of constructing positive and negative pairs with k for contrastive learning. Specifically, we employ position encoding to generate n non-overlapping base crops q_i, where each base crop represents a distinct region of the input volume.

Within human body anatomy, different organs are situated in distinct regions, which provides a natural way to form positive and negative pairs. As shown in Fig. 3, the random crop k and the positive base crops q_pos share overlapping areas, whereas the negative base crops q_neg, lacking such overlaps, are more likely to encompass different organs (though not always). For example, k and q_pos both contain the stomach, pancreas, vein, aorta, and cava, while k and q_neg contain different organ information. Thus, we can employ the position encoding to construct positive and negative pairs for contrastive learning.

Previous contrastive learning methods mainly employ the InfoNCE loss to maximize the mutual information of positive pairs. In this paper, we propose to generate labels with specific values to supervise the degree of correlation of positive pairs, i.e., labels that reflect how similar k and q_pos are. We observe that the correlation between k and q_pos is associated with their overlap proportion: intuitively, if a positive base crop q_pos shares a larger overlapping area with k, it will be more similar to k. Thus, we assign the overlap proportions as the values of the position labels y, enabling us to measure the similarity between k and q_pos. In contrast, the position labels y of q_neg are set to 0.
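To make the label generation concrete, below is a minimal NumPy sketch of this step, assuming for simplicity a 2D in-plane grid of base crops and illustrative crop sizes (see the paper and repository for the exact 3D crop configuration). It tiles the input into non-overlapping base crops and assigns each one the proportion of the random crop it overlaps, i.e., the position label y described above: positives receive values in (0, 1] and non-overlapping negatives receive 0.

```python
import numpy as np

# Minimal sketch with hypothetical sizes: generate non-overlapping base-crop boxes
# on a 2D in-plane grid and compute overlap-proportion position labels for a
# randomly sampled crop of the same size. Not the official VoCo configuration.

def base_crop_boxes(height, width, grid=4):
    """Split the in-plane area into grid x grid non-overlapping base crops."""
    h, w = height // grid, width // grid
    return [(i * h, j * w, (i + 1) * h, (j + 1) * w)
            for i in range(grid) for j in range(grid)]

def position_labels(rand_box, base_boxes):
    """Overlap proportion between the random crop and each base crop.
    Proportions are normalized by the random-crop area, so positive base crops
    get a value in (0, 1] and non-overlapping (negative) base crops get 0."""
    y0, x0, y1, x1 = rand_box
    area = (y1 - y0) * (x1 - x0)
    labels = []
    for by0, bx0, by1, bx1 in base_boxes:
        ih = max(0, min(y1, by1) - max(y0, by0))
        iw = max(0, min(x1, bx1) - max(x0, bx0))
        labels.append(ih * iw / area)
    return np.asarray(labels, dtype=np.float32)

# Usage: a 256 x 256 in-plane section, 4 x 4 base crops of 64 x 64,
# and a random 64 x 64 crop sampled at an arbitrary offset.
boxes = base_crop_boxes(256, 256, grid=4)
rand_box = (40, 100, 104, 164)        # (y0, x0, y1, x1)
y = position_labels(rand_box, boxes)
print(y.round(2))
```

In this toy setting the positive labels sum to 1 because the random crop is fully covered by the base-crop grid; negatives are exactly 0.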

Overall framework of VoCo. (a) First, we generate base crops q with their corresponding position labels y. We then feed the random crop k and the base crops q into the network for contextual position prediction. Specifically, we employ a student-teacher module to project k and q separately, where the teacher projector is frozen and updated from the student projector via Exponential Moving Average (EMA). Finally, we conduct volume contrast between k and q to predict the similarity s, which is supervised by the position labels y. (b) We use the position labels to supervise the intra-volume contrast among k, q_pos, and q_neg, which all come from the same volume. (c) We extract a random crop k_A and base crops q_B from different volumes V_A and V_B for inter-volume contrast.
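The following PyTorch sketch illustrates the student-teacher volume-contrast step described in the caption. The projection heads, the regression-style loss on the predicted similarity s, and the EMA momentum value are illustrative assumptions, not the exact repository implementation.

```python
import torch
import torch.nn.functional as F

# Minimal sketch: the student projects the random crop k, the frozen teacher
# projects the base crops q, cosine similarities serve as the predicted
# positions s, and s is regressed toward the position labels y.

@torch.no_grad()
def ema_update(teacher, student, momentum=0.99):
    """Exponential-moving-average update of the teacher projector."""
    for t, s in zip(teacher.parameters(), student.parameters()):
        t.mul_(momentum).add_(s, alpha=1.0 - momentum)

def volume_contrast_loss(student_proj, teacher_proj, feat_k, feat_q, labels):
    """feat_k: (B, C) features of the random crop.
    feat_q: (B, n, C) features of the n base crops.
    labels: (B, n) overlap-proportion position labels in [0, 1]."""
    z_k = F.normalize(student_proj(feat_k), dim=-1)          # (B, D)
    with torch.no_grad():                                     # teacher is frozen
        z_q = F.normalize(teacher_proj(feat_q), dim=-1)       # (B, n, D)
    sim = torch.einsum('bd,bnd->bn', z_k, z_q).clamp(min=0)  # predicted positions s
    return F.mse_loss(sim, labels)                            # supervised by y

# Usage with toy shapes: B=2 volumes, n=16 base crops, C=768-dim backbone features.
student = torch.nn.Linear(768, 256)
teacher = torch.nn.Linear(768, 256)
teacher.load_state_dict(student.state_dict())
for p in teacher.parameters():
    p.requires_grad_(False)

feat_k, feat_q = torch.randn(2, 768), torch.randn(2, 16, 768)
labels = torch.rand(2, 16)
loss = volume_contrast_loss(student, teacher, feat_k, feat_q, labels)
loss.backward()
ema_update(teacher, student)
```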

Omni-supervised learning. Both fully- and self-supervised learning have specific merits and drawbacks. (a) Fully-supervised learning can learn discriminative decision boundaries under the guidance of labels, yet it is constrained by the scarcity of labeled data. (b) Self-supervised learning (SSL) can leverage large-scale unlabeled data; however, lacking annotations for supervision, it struggles to learn clear decision boundaries between distinct classes.

Semi-supervised learning is a scalable learner. To effectively leverage both labeled and unlabeled data, we conduct semi-supervised learning to transfer knowledge from labeled data to large-scale unlabeled data. Notably, segmentation emerges as a pivotal technique for supervised training, given that many medical tasks demand a granular, pixel-level understanding for accurate diagnosis. Previous works leveraged only a few hundred cases for semi-supervised segmentation. Moreover, the complex designs of existing semi-supervised segmentation methods significantly increase the training burden, which is not feasible for our large-scale data. In this paper, we adopt a simple semi-supervised learning baseline and scale the data up to 160K volumes. We find that, combined with VoCo, even the simplest semi-supervised baseline already achieves competitive results.

As shown in the Algorithm, we first curate a labeled segmentation dataset (X_L, Y_L) from PreCT-160K and perform supervised segmentation training in the first stage. In the second stage, we generate pseudo labels Y_U for the unlabeled data X_U and perform semi-supervised learning on both X_L and X_U. Note that SSL is collaboratively integrated with semi-supervised training in both stages. In this way, we amalgamate the strengths of self- and semi-supervised learning, advancing towards omni-supervised pre-training.
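The two-stage recipe can be summarized with the short control-flow sketch below. The helpers train_segmentation and generate_pseudo_labels are hypothetical placeholder stubs, the toy data is made up, and the VoCo self-supervised objective that runs alongside both stages is omitted for brevity; only the staging mirrors the description above.

```python
# Control-flow sketch of the two-stage omni-supervised recipe, with placeholder
# stubs so the snippet runs end-to-end. Not the repository's actual API.

def train_segmentation(model, pairs):
    """Placeholder: supervised segmentation training on (volume, label) pairs."""
    for volume, label in pairs:
        pass  # one optimization step per pair in a real implementation
    return model

def generate_pseudo_labels(model, volumes):
    """Placeholder: run inference to obtain pseudo labels Y_U for unlabeled X_U."""
    return [(v, f"pseudo_label_of_{v}") for v in volumes]

def omni_supervised(model, labeled_pairs, unlabeled_volumes):
    # Stage 1: supervised segmentation on the curated labeled subset (X_L, Y_L);
    # the VoCo self-supervised objective runs alongside (omitted here).
    model = train_segmentation(model, labeled_pairs)

    # Stage 2: pseudo-label the unlabeled volumes X_U, then train on the union
    # of labeled and pseudo-labeled data, again alongside self-supervision.
    pseudo_pairs = generate_pseudo_labels(model, unlabeled_volumes)
    model = train_segmentation(model, labeled_pairs + pseudo_pairs)
    return model

# Usage with toy data: two labeled volumes and three unlabeled ones.
model = omni_supervised("init_model",
                        [("ct_01", "mask_01"), ("ct_02", "mask_02")],
                        ["ct_03", "ct_04", "ct_05"])
```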

Experiments

We build the largest benchmark in this field, open-sourcing implementations of more than 50 downstream tasks, including segmentation, classification, registration, and Visual Language Processing (VLP). Extensive experiments highlight the superiority of VoCo. Please refer to https://github.com/Luffy03/Large-Scale-Medical for more details.


PreCT-160K: https://huggingface.co/datasets/Luffy503/PreCT-160K


Downstream tasks: https://github.com/Luffy03/Large-Scale-Medical

Extensive experiments highlight the superiority of VoCo.


Are larger models always better? In medical tasks, the answer appears to be no. For some specific tasks, smaller models can achieve better performance. In this paper, we delve into the factors affecting the model-capacity scaling law, including the number of fine-tuning cases, data diversity, and task difficulty.

(a) TotalSegmentator is a challenging dataset, containing 1.2K cases and 104 classes for segmentation. In this case, the largest model, VoCo-H, yields the best results. (b) BTCV has only 24 cases for fine-tuning, which can lead larger models to overfit on the limited data and thus hurt validation performance. (c) Although CC-CCII encompasses 4.2K cases for training, it is a simple binary classification task (over 90% accuracy), suggesting that excessively large models may not be necessary. (d) OASIS is a brain MRI dataset with only 0.4K cases for registration, and it also lacks significant structural diversity. In this case, the smallest model, VoCo-B, delivers the best results. (e) CT-Rate provides 50K cases for vocabulary classification over 18 classes. Given such large-scale training data, larger models demonstrate higher performance.

Tailor different model sizes to various medical tasks. Drawing from the experimental insights discussed above, we empirically propose simple and reasonable guidelines for tailoring models to various medical tasks: (1) Tasks with extensive labeled data for fine-tuning potentially benefit from larger models. (2) Tasks spanning diverse anatomical regions potentially benefit from larger models. (3) Tasks requiring recognition across a larger number of classes (i.e., more challenging tasks) are better addressed with larger models.

Although these guidelines have been assessed on our comprehensive benchmark, they may not universally apply to all medical tasks given the substantial diversity within the medical domain. Thus, we release pre-trained models of varying sizes to aid researchers in selecting the most appropriate models for their specific needs.