Recently, a computational pathology study jointly conducted by the HKUST Smart Lab and Zhejiang University was published in IEEE Transactions on Medical Imaging (TMI). The study proposes an iteratively coupled multiple instance learning (ICMIL) method to address challenges in feature extraction from high-resolution pathology images.
Whole Slide Image (WSI) classification is a critical task in computational pathology. However, these images can contain billions of pixels, which makes processing them on GPUs computationally challenging. Traditional Multiple Instance Learning (MIL) methods therefore often employ a two-stage strategy: the first stage uses a non-trainable feature extractor to generate feature embeddings for WSI regions, and the second stage uses a trainable feature aggregator to combine these embeddings into a representation of the entire WSI, followed by a classifier for WSI classification. Although this approach reduces GPU memory consumption by using a pre-trained, frozen feature extractor, the lack of information exchange between the two stages can lead to inconsistencies, resulting in suboptimal classification accuracy.
To address this issue, this study observes that the trained WSI classifier can serve as a teacher network to generate pseudo-labels for sub-regions within the WSI. Using these smaller sub-region images and their corresponding pseudo-labels allows for low-cost fine-tuning of the feature extractor. The fine-tuned feature extractor can in turn extract features more accurately, thereby improving the performance of the WSI classifier through iterative optimization. Based on this observation, we designed the Iteratively Coupled Multiple Instance Learning (ICMIL) method to train the feature extractor (embedder) and the WSI classifier at low cost. Over these iterations, the feature extractor and WSI classifier become increasingly coupled, approaching the effect of end-to-end training.
Figure 1. Schematic of the Iteratively Coupled Multiple Instance Learning (ICMIL) method.
Multiple Instance Learning (MIL) is primarily used to handle datasets with incomplete annotations. In traditional supervised learning, each sample (instance) has a clear label. In MIL, data is organized into bags, each containing multiple instances, but only the bag has a label, and the instances themselves do not have explicit labels. This method is particularly suitable for image classification, medical image analysis, and text classification. The standard assumption in MIL is that if at least one instance in a bag is positive, the bag is labeled as positive; if all instances are negative, the bag is labeled as negative. This setting is well-suited to pathology image classification tasks, where only some regions of a positively labeled pathology image may contain tumors, while all regions of a negatively labeled image are normal tissue.
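To make the assumption concrete, here is a minimal illustration (not from the paper) of how a bag label follows from its instance labels:

```python
# A minimal illustration of the standard MIL assumption: a bag is positive
# if and only if it contains at least one positive instance.
def bag_label(instance_labels):
    """instance_labels: 1 for a positive instance (e.g., a tumor patch), 0 otherwise."""
    return int(any(instance_labels))

print(bag_label([0, 0, 1, 0]))  # 1: a single tumor patch makes the slide positive
print(bag_label([0, 0, 0, 0]))  # 0: all-normal tissue gives a negative slide
```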
Some recent works have addressed the drawbacks of using non-trainable feature extractors in the MIL framework, typically via weakly supervised or self-supervised fine-tuning. Weakly supervised methods infer instance-level pseudo-labels from the feature aggregator's weights, but because attention-based aggregators concentrate their weights on a few salient instances, only a small number of positive instances in each bag can be labeled correctly. Self-supervised methods, such as those based on the SimCLR framework, fine-tune the feature extractor directly on the pathology image domain to reduce domain shift during WSI classifier training. However, both approaches still lack information exchange between the feature extractor and the WSI classifier, leaving the two components inconsistent.
Figure 2. The ICMIL Algorithm Framework
As shown in Figure 2, ICMIL consists of two stages: (1) Fix the feature extractor and train the aggregator and WSI classifier; (2) Fix the aggregator and WSI classifier and use them as teachers to train a better feature extractor. These stages can be repeated until convergence. ICMIL can be applied to various MIL backbone networks to improve their performance. In this introduction, we use the classic ABMIL as an example.
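The alternation can be summarized in a short, runnable sketch. The linear layers, mean-pooling head, and learning rates below are illustrative stand-ins for the real networks, not the paper's configuration:

```python
import torch
import torch.nn as nn

# Linear stand-ins for the real networks; all shapes and hyperparameters
# below are illustrative assumptions.
extractor = nn.Linear(256, 128)   # stand-in for the ResNet50 patch encoder
head = nn.Linear(128, 2)          # stand-in for aggregator + WSI classifier

def set_trainable(module, flag):
    for p in module.parameters():
        p.requires_grad_(flag)

bag = torch.randn(8, 256)         # one toy bag of 8 instances
label = torch.tensor([1])         # bag-level (WSI) label

for round_idx in range(3):        # repeat both stages until convergence
    # Stage One: freeze the extractor, train the aggregator/classifier head
    # on the bag label (reduced to crude mean pooling here for brevity).
    set_trainable(extractor, False); set_trainable(head, True)
    opt = torch.optim.Adam(head.parameters(), lr=1e-3)
    logits = head(extractor(bag).mean(dim=0, keepdim=True))
    loss = nn.functional.cross_entropy(logits, label)
    opt.zero_grad(); loss.backward(); opt.step()

    # Stage Two: freeze the head and use it as a teacher that assigns
    # instance-level pseudo-labels, then fine-tune the extractor on them.
    set_trainable(extractor, True); set_trainable(head, False)
    opt = torch.optim.Adam(extractor.parameters(), lr=1e-4)
    with torch.no_grad():
        pseudo = head(extractor(bag)).argmax(dim=1)
    loss = nn.functional.cross_entropy(head(extractor(bag)), pseudo)
    opt.zero_grad(); loss.backward(); opt.step()
```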
Stage One is similar to the traditional MIL training process. We initialize ResNet50 with ImageNet pre-trained weights as the feature extractor, then feed WSI bags with data augmentation into the network to obtain N instance feature embeddings. Next, we train the feature aggregator and WSI classifier: the aggregator uses an attention mechanism to compute a weighted sum of the N instance feature embeddings, forming a single feature embedding that represents the entire WSI, and the WSI classifier is trained on these bag embeddings with the WSI category labels to fit the decision boundary.
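Below is a hedged sketch of this Stage One forward pass with an ABMIL-style attention aggregator. The 2048-dimensional input matches ResNet50 features, but the hidden size and two-class head are assumptions for illustration:

```python
import torch
import torch.nn as nn

# An ABMIL-style attention aggregator: a small network scores each instance,
# and the scores are softmax-normalized into attention weights.
class AttentionAggregator(nn.Module):
    def __init__(self, feat_dim=2048, hidden_dim=256):
        super().__init__()
        self.attention = nn.Sequential(
            nn.Linear(feat_dim, hidden_dim),
            nn.Tanh(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, instance_feats):
        # instance_feats: (N, feat_dim) embeddings from the frozen extractor
        scores = self.attention(instance_feats)       # (N, 1) raw attention scores
        weights = torch.softmax(scores, dim=0)        # normalized over the N instances
        return (weights * instance_feats).sum(dim=0)  # (feat_dim,) bag embedding

aggregator = AttentionAggregator()
classifier = nn.Linear(2048, 2)          # bag-level WSI classifier

instance_feats = torch.randn(500, 2048)  # N = 500 patch embeddings from one toy WSI
bag_logits = classifier(aggregator(instance_feats))  # trained with the WSI label
```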
Stage Two is the main contribution of ICMIL. Since bag-level feature embeddings are linear combinations of instance-level features, the two live in the same latent space, so we assume that the instance-level decision boundary should be similar to the bag-level one. We therefore propose fixing the WSI classifier trained in Stage One (the bag-level classifier) and using its learned decision boundary to generate pseudo-labels for all instances.
To enhance flexibility and robustness, we design a teacher-student network for fine-tuning the feature extractor. The teacher branch consists of the fixed feature extractor and WSI classifier and receives original instance images without augmentation, so that it generates accurate pseudo-labels. The student branch contains a fully trainable feature extractor, receives augmented instance images, and introduces a "hidden" trainable instance-level classifier to account for subtle differences between the instance-level and bag-level decision boundaries.
This stage is optimized with two loss functions: a consistency loss that encourages the student network to make the same predictions as the teacher, and a weight-similarity loss that prevents the "hidden" instance-level classifier from deviating significantly from the WSI classifier in the teacher branch. Additionally, we introduce a confidence mechanism to account for potential errors in the teacher's pseudo-labels: the confidence of each instance's pseudo-label is derived from the aggregator trained in Stage One, since the attention weight the aggregator assigns to an instance can be transformed into a measure of how reliable that instance's pseudo-label is.
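The following sketch shows how these two losses and the confidence weights could fit together, with linear stand-ins for the extractors. Using hard pseudo-labels with cross-entropy and a loss weight of 0.1 are simplifying assumptions rather than the paper's exact formulation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Linear stand-ins for the teacher/student branches; all shapes are assumptions.
teacher_extractor = nn.Linear(256, 128)   # frozen extractor from Stage One
teacher_classifier = nn.Linear(128, 2)    # frozen bag-level (WSI) classifier
student_extractor = nn.Linear(256, 128)   # fully trainable extractor
student_classifier = nn.Linear(128, 2)    # "hidden" instance-level classifier

# Initialize the student branch from the teacher before fine-tuning.
student_extractor.load_state_dict(teacher_extractor.state_dict())
student_classifier.load_state_dict(teacher_classifier.state_dict())

instances = torch.randn(8, 256)                            # original instances (as toy features)
augmented = instances + 0.1 * torch.randn_like(instances)  # stand-in for data augmentation
confidence = torch.rand(8)  # per-instance confidences derived from the Stage One aggregator

with torch.no_grad():  # teacher branch: fixed networks, un-augmented inputs
    pseudo = teacher_classifier(teacher_extractor(instances)).argmax(dim=1)

student_logits = student_classifier(student_extractor(augmented))

# Consistency loss: the student must match the teacher's pseudo-labels, with
# each instance's contribution scaled by its confidence.
per_instance = F.cross_entropy(student_logits, pseudo, reduction="none")
consistency_loss = (confidence * per_instance).mean()

# Weight-similarity loss: keep the hidden instance-level classifier close to
# the bag-level classifier in the teacher branch.
weight_loss = F.mse_loss(student_classifier.weight, teacher_classifier.weight.detach())

loss = consistency_loss + 0.1 * weight_loss   # lambda = 0.1 is an assumption
loss.backward()                               # gradients update only the student branch
```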
Datasets: Camelyon16, TCGA-NSCLC, TCGA-RCC, and 1,140 hepatocellular carcinoma (HCC) pathology WSIs collected from Sir Run Run Shaw Hospital, Hangzhou.
Evaluation Metrics: AUC, F1-score, Accuracy.
This study selected three representative MIL backbone networks: a simple max-pooling-based MIL, the influential ABMIL, and the state-of-the-art DTFD-MIL. ICMIL consistently improved the classification performance of these backbones across all four datasets. In particular, ICMIL with the confidence mechanism increased the classification AUC of DTFD-MIL on the Camelyon16 dataset by 1.8%, setting a new state of the art; similar improvements were observed on the other three datasets.
Figure 3 shows the t-SNE visualization results of bag-level representations from ABMIL before and after ICMIL. The results demonstrate that the proposed ICMIL can notably improve the separability of the bag-level representations.
Figure 3. The t-SNE visualization results of bag-level representations from ABMIL. (a) Results before ICMIL. (b) Results after one iteration of vanilla ICMIL. (c) Results after one iteration of confidence-based ICMIL.
To further demonstrate that a better feature extractor also benefits the aggregator and classifier, we visualized the weights the feature aggregator assigns to instances in the ICMIL-optimized ABMIL. As shown in Figure 4, the red lines outline the ground-truth tumor areas, while the colored patch masks indicate the corresponding attention scores. The baseline ABMIL tends to highlight only a very few instances; vanilla ICMIL highlights the positive instances but struggles to mask the negative instances correctly; the proposed confidence-based ICMIL both highlights the positive instances and correctly masks the negative ones.
Figure 4. Visualization of the attention scores of the instances in a bag before and after ICMIL (obtained with the ABMIL backbone on Camelyon16). The areas surrounded by the red lines are the ground-truth tumor areas, while the colored masks of the patches indicate their corresponding attention scores. (a) The baseline ABMIL method, which tends to highlight only very few instances. (b) The vanilla ICMIL method, which can highlight the positive instances but struggles to mask the negative instances correctly. (c) The proposed confidence-based ICMIL method, which can not only highlight the positive instances but also correctly mask the negative instances.
This paper proposes a novel Iteratively Coupled Multiple Instance Learning (ICMIL) method, which uses the WSI-level classifier to guide the generation of instance-level pseudo-labels and introduces a confidence scoring mechanism based on the feature aggregator's attention weights. This effectively addresses the domain-shift and component-inconsistency issues in existing MIL methods. Experimental results on multiple datasets show that ICMIL significantly improves the performance of different MIL backbone networks on WSI classification tasks.
[1] Hongyi Wang, Luyang Luo, Fang Wang, et al., "Iteratively Coupled Multiple Instance Learning from Instance to Bag Classifier for Whole Slide Image Classification," in Proc. International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI), 2023, pp. 467–476.
[2] Hongyi Wang, Luyang Luo, Fang Wang, et al., "Rethinking Multiple Instance Learning for Whole Slide Image Classification: A Bag-Level Classifier is a Good Instance-Level Teacher," IEEE Transactions on Medical Imaging, 2024, doi: 10.1109/TMI.2024.3404549.