[IEEE-TMI] Shapley Values-enabled Progressive Pseudo Bag Augmentation for Whole Slide Image Classification

Recently, a computational pathology study completed in collaboration between HKUST Smart Lab and Tsinghua University was accepted by the top journal in medical image processing, IEEE Transactions on Medical Imaging. This study proposes a progressive pseudo bag augmentation method based on Shapley values to address the challenges of overly concentrated attention distribution and inaccurate attention score ranking in attention-based multiple instance learning.

Background

In computational pathology, the classification of whole slide images (WSIs) poses significant challenges due to their gigapixel resolution and extremely limited fine annotations. Multiple Instance Learning (MIL) offers a weakly supervised solution, but extracting instance-level information from bag-level labels remains challenging. While most MIL methods use attention scores to estimate Instance Importance Scores (IIS) for predicting slide labels, these methods often lead to:

Extreme Distribution of Attention Scores: A limited number of instances occupy the majority of attention scores (e.g., the sum of attention scores for the top 10 instances in the first three methods exceeds 75%), resulting in insufficient training.
Inaccurate Ranking of Attention Scores in Identifying Positive Instances: Positive instances are not prioritized in descending order of attention scores, leading to the assumption that the top few instances based on attention are positive, which does not hold true, introducing more noise during the training of multiple instance models or fine-tuning of feature encoders.

Figure 1 shows the distribution of attention scores in the CAMELYON-16 dataset and an example slide with the top five important instances. (a)-(d) respectively use ABMIL, CLAM, DTFD, and the proposed PMIL model. In the “Attention Distribution” column, the x-axis of the left subplot is normalized patch indices ranging from 0 to 1. In the “Top 5 Instances” column, positive instances are indicated by green borders, and negative instances by blue borders.

To address the first issue, that of extreme attention distribution, a common strategy is to divide the regular bags into several pseudo bags. This method encourages the MIL model to learn the diversity of bag-level features. Existing pseudo bag-based methods typically employ random partitioning strategies. For instance, DTFD first randomly divides the regular bag into several pseudo bags, which inherit the labels of their parent bags and undergo pseudo bag-level MIL. To address potential label errors in pseudo bags, they continue to aggregate the features extracted from all pseudo bags within the same regular bag into bag-level features for a second layer of MIL.

For the second issue, that of misranking and misidentification of positive instances, AdditiveMIL proposes using Shapley values as an alternative to attention scores during MIL training and inference, viewing them as weighted scores for each feature.

In this work, we propose a new method inspired by cooperative game theory: using Shapley values to assess the contribution of each instance, thereby improving the estimation of IIS. The calculation of Shapley values is accelerated through attention mechanisms, while simultaneously preserving the correct ordering and identification of instances. Subsequently, we further introduce a pseudo bag progressive allocation framework based on the estimated IIS, encouraging a more balanced attention distribution in the MIL model.

Method

Revisiting pseudo bag-augmented multiple instance learning, in this framework, a regular bag is randomly divided into \(M\) pseudo bags:

\[X = \{ X_i^{pse} \}_{i=1}^M.\]

Each pseudo bag inherits the label of its parent bag, resulting in an expanded dataset:

\[D^{pse} = \{ (X_i^{pse}, Y_i) \}_{i=1}^{M \times |D|}.\]

Consistent with the MIL process for regular bags, the computation for pseudo bags includes a feature extraction component and a multi-instance learning aggregation classification component, as follows:

\[\hat{Y}_i^{pse} = (f \circ g) \left( \{ h(x_{(i,j)}^{pse}) \}_{j=1}^{M \times |D|} \right).\]

Thus, we can further derive the objective function for pseudo bag-augmented MIL:

\[J(D^{pse}; \theta) = \sum_{i=1}^{M \times |D|} H(\hat{Y}_i^{pse}, Y_i),\]

where \(\theta\) represents the parameters of the MIL model, and \(H\) denotes the cross-entropy loss function.

However, since the true labels \(Y^{pse}\) of pseudo bags do not always align with the inherited labels \(Y\) from the parent bag, the objective function in the equation above can be further divided into two parts:

\[J(D^{pse}; \theta; \epsilon) = \sum_{i=1}^{M \times |D| - \epsilon} H(\hat{Y}_i^{pse}, Y_i | Y_i = \hat{Y}_i^{pse}) + \sum_{i=1}^{\epsilon} H(\hat{Y}_i^{pse}, Y_i | Y_i \neq \hat{Y}_i^{pse}),\]

where \(\epsilon\) is the number of pseudo bags with incorrect labels in the pseudo bag augmentation.

To improve pseudo bag-augmented MIL, the primary approach is to reduce the number of pseudo bags with incorrect labels \(\epsilon\). The proposed framework for progressive pseudo bag augmentation in MIL is illustrated in Figure 2.

Figure 2 shows the proposed framework for progressive pseudo bag-augmented MIL. (a) A set of patches extracted from WSI is divided into \(M\) pseudo bags based on their estimated IIS (with \(M\) gradually increasing) and then trained in the same manner as regular bags. (b) The weights of the MIL model are frozen to estimate IIS, facilitating pseudo bag partitioning. As the MIL model converges in iteration 0, the number of pseudo bags \(M\) gradually increases during the iterations, using the initial pseudo bags allocated based on the IIS estimated by the previous round’s MIL model. Note: Pseudo bag augmentation is only used during training.

A reasonable method to address the noise introduced by incorrect pseudo bag labels is to use attention scores to estimate the importance of instances, thereby improving the partitioning of pseudo bags. Shapley values in cooperative games provide a solution by quantifying the contribution of each instance to the classification outcome through the interactions between instances. However, considering the exponential complexity of calculating Shapley values, these computations often become resource-intensive, posing significant challenges in practical applications, especially in computational pathology. Therefore, this paper introduces optimized Shapley values as an alternative to attention scores \(\alpha\) for estimating instance importance scores \(IIS\):

\[\phi_{(i,j)}(v_{(i,j)}, V_i \setminus \{v_{(i,j)}\}) \equiv \sum_{S_i \subseteq V_i \setminus \{v_{(i,j)}\}} \frac{|S_i|! (|V_i| - |S_i| - 1)!}{|V_i|!} \left[ (f \circ g)(S_i \cup v_{(i,j)}) - (f \circ g)(S_i) \right],\]

where \(v*{(i,j)}\) is the \(j\)-th feature in the \(i\)-th bag for which we are calculating the Shapley value \(\phi*{(i,j)}\), \(V*i\) is the complete feature set of the \(i\)-th bag, and \(S_i \subseteq V_i \setminus \{v*{(i,j)}\}\) is any subset of available features.

According to the definition of multiple instance learning, a key factor in determining bag-level labels is whether the bag contains positive instances. Under this condition, this paper retains non-critical instances according to the order of attention scores, focusing on calculating critical instances, i.e., instances with high attention scores, and reordering them based on their estimated IIS from Shapley values:

\[IIS(v_{(i,j)}) = \phi_{(i,j)}(v_{(i,j)}, S_i^l), \quad v_{(i,j)} \in S_i^h,\]

where \(S_i^h\) and \(S_i^l = V_i - S_i^h\) represent the subsets of instances with high and low attention scores, respectively, and \(V_i\) is the complete set of instances. For simplicity, the number of instances in \(S_i^h\) is set to \(\mu M\), and the sampling number in \(S_i^l\) is set to \(\tau\).

The instances within each bag are reordered based on their IIS ranking, denoted as:

\[V_i' = \{ v_{(i,j)}' | IIS(v_{(i,1)}') \geq IIS(v_{(i,2)}') \geq \ldots \geq IIS(v_{(i,N_i)}') \}\]

Then, these instances are evenly interleaved into \(M\) pseudo bags using the modulus function, ensuring that \(v*{(i,j)}'\) satisfies \(j \equiv k \, (\text{mod} \, M)\), resulting in each pseudo bag \(X*{(i,k)}^{pse}\) being represented as samples from \(V_i'\). By fixing the parameters \(\theta\) of the MIL model, the optimization of \(\epsilon\) can be approximated as:

\[\epsilon^* = \min \sum ( \hat{Y}_i \neq Y_i^{pse}) = \max \sum ( \hat{Y}_i = Y_i^{pse}),\]

which is equivalent to

\[\min_{(V')} D_{KL} \left( P_{(V^{pse} \sim \Gamma(V'))} \left[ Y^{pse} | V^{pse}; \theta \right] \| P[Y | V; \theta] \right),\]

where \(\Gamma(V')\) is the instance importance distribution of \(V'\) with estimated IIS, and \(D\_{KL}\) is the Kullback-Leibler divergence function. Thus, the optimization of \(\epsilon\) is transformed into the optimization of \(\Gamma(V')\), where the estimation of IIS plays a decisive role.

Considering that the number of pseudo bags and the initialization of progressive strategies are crucial, dividing regular bags into a large number of pseudo bags may introduce excessive noise, especially when the regular bag contains a limited number of positive instances, potentially leading to training instability. To address this issue, once the MIL model converges during training, the number of pseudo bags is gradually increased:

\[M_t = \min \{ M_{(t-1)} + \Delta M, M_{\max} \}, \quad \text{s.t.} \{ g_{(t-1)}, f_{(t-1)} \} \to \{ g_{(t-1)}^*, f_{(t-1)}^* \}\]

where \(t\) represents the number of converging iterations, \(\Delta M\) denotes the increment in the number of pseudo bags, and \(M*0\) and \(M*{\max}\) represent the initial and maximum number of pseudo bags, respectively.

Furthermore, the initial allocation of pseudo bags significantly impacts subsequent training, particularly when dealing with challenging datasets. To address this issue, this paper gradually utilizes well-trained MIL models from the previous round to enhance the initial pseudo bag allocation by computing instance importance scores.

Guided by the IIS estimated based on Shapley values, this paper proposes a progressive pseudo bag-augmented MIL framework called PMIL. To mitigate the issue of incorrect pseudo bag labels, this paper introduces the Expectation Maximization (EM) algorithm to achieve optimal pseudo bag label allocation, alternating between pseudo bag-augmented MIL and Shapley-based pseudo bag partitioning. Each time the MIL model converges, the number of pseudo bags \(M\) increases, allocating initial pseudo bags based on the IIS estimated by the previous round’s MIL model.

Experimental Results

Datasets: CAMELYON-16, BRACS, TCGA-LUNG, TCGA-BRCA.

In the classification results at the bag level (Table 1) and instance level (Table 2), the proposed method demonstrates superior performance, particularly on BRACS, where there is a significant improvement in bag-level classification compared to other methods.

Table 1 shows the classification results of five repeated experiments at the bag level across four datasets.

Table 2 presents the classification results at the instance level from five repeated experiments on CAMELYON16.

An example of pseudo bag allocation in PMIL is illustrated in Figure 3, where red annotations represent cancer regions. Based on Shapley value ranking, even in cases of micrometastasis, only three positive instances can be located and evenly distributed into three pseudo bags. Attention ranking indicates that through accurate pseudo bag augmentation, more positive instances can be noted during training. Even in micrometastatic cases, PMIL successfully identified three critical instances. In this case, the random partitioning method for pseudo bags had only a 2/9 chance of accurately assigning positive instances into three different pseudo bags, otherwise introducing label noise into the training. In contrast, our method evenly placed these critical instances into different pseudo bags, significantly increasing the diversity of positive instances.

Figure 3 illustrates the effects of progressive pseudo bag partitioning.

In the case of macro metastasis from CAMELYON-16 (Figure 4a), all models exhibited decent performance. However, in the case of micrometastasis (Figure 4b), the attention score-based IIS indicated that ABMIL, DTFD, and our model incorrectly focused on some non-cancer areas, making the model’s judgments difficult to interpret. In contrast, using Shapley value-based IIS, our model accurately excluded non-cancer areas and correctly identified cancer regions.

Figure 4 shows the heatmaps generated by different models for two sub-regions of slides. (a) and (b) represent macro and micrometastatic cases from CAMELYON-16, where the red and green annotations in the “Ground Truth” column indicate cancerous and non-cancerous areas, respectively.

Unlike attention scores, the calculation of Shapley values encompasses the entire MIL classifier, integrating diverse categorical information, thus enabling category-based interpretations. As shown in Figure 5, the IIS estimated using attention scores and Shapley values under the malignant tumor category primarily focuses on malignant tumor regions. Meanwhile, the heat map generated using Shapley values under the atypical tumor category emphasizes atypical tumor regions, consistent with the true annotations. Although the heat map may not be entirely accurate for the BRACS dataset, this characteristic highlights the strong interpretability of Shapley value-based IIS in multi-class tasks.

Figure 5 shows heatmaps generated using different IIS estimation methods for a malignant tumor case in the BRACS dataset, where blue and red annotations in the “Ground Truth” column represent malignant tumors and atypical tumors, respectively.

Ablation Studies

In this ablation study, we evaluate the use of attention scores and Shapley values for estimating instance importance scores for subsequent training. In this experiment, the pseudo bag random partitioning strategy serves as the baseline for pseudo bag augmentation. According to the results in Table 3, the Shapley value-based IIS estimation demonstrates outstanding performance on the CAMELYON-16, TCGA-LUNG, and TCGA-BRCA datasets. In contrast, on the BRACS dataset, the attention score-based estimation yielded better results. This difference in performance may be attributed to the fact that attention scores are obtained directly through pooling operations, while calculating Shapley values requires an additional classifier, namely a fully connected layer. When the overall performance of the multiple instance learning model is not particularly strong, this leads to a lack of robustness in Shapley value calculations. Thus, when facing challenging datasets, attention mechanism-based progressive pseudo bag partitioning is more advantageous; conversely, when dealing with easier datasets, Shapley value-based progressive pseudo bag partitioning is more advantageous.

Table 3 presents the performance results of pseudo bag-augmented MIL based on different IIS estimation methods on the CAMELYON-16, BRACS, TCGA-LUNG, and TCGA-BRCA test sets.

Next, we examine the model’s sensitivity to the maximum number of pseudo bags. Due to variations in the magnification levels of WSIs and the sizes of tumor regions, the optimal number of pseudo bags \(M*{\max}\) varies across different datasets. As shown in Figure 6, our model exhibits optimal performance on the CAMELYON-16 dataset when \(M*{\max}\) is approximately 9, with performance sharply declining when exceeding this value. Similarly, the ideal \(M\_{\max}\) for the BRACS, TCGA-LUNG, and TCGA-BRCA datasets are 10, 14, and 6, respectively.

Figure 6 illustrates the results of ablation experiments on different maximum pseudo bag quantities for the CAMELYON-16, BRACS, TCGA-LUNG, and TCGA-BRCA datasets.

Furthermore, we assess the model’s sensitivity to hyperparameters in Shapley approximation. This paper conducts repeat experiments by varying the values of \(\mu\) and \(\tau\) used in the Shapley approximation calculations. As shown in Table 4, the results indicate that the model’s performance is not sensitive to the selection of hyperparameters \(\mu\) and \(\tau\), with fluctuations remaining within an acceptable range.

Table 4 presents the AUC results of the proposed method with different hyperparameters on the CAMELYON-16, BRACS, TCGA-LUNG, and TCGA-BRCA test sets.

To determine the effectiveness of gradually increasing the number of pseudo bags and optimizing initial pseudo bag allocation, this paper conducts a series of experiments. As shown in Table 5, the results indicate that models employing progressive strategies achieve the highest levels of performance.

Table 5 shows the evaluation results of different progressive strategies for pseudo bag augmentation on the CAMELYON-16, BRACS, and TCGA-LUNG test sets.

Conclusion

In this study, we address the challenges associated with attention-based multiple instance learning for whole slide image classification, focusing on the extreme distribution of attention and misidentification of positive instances. To tackle these challenges, we introduce accelerated Shapley values to quantify the contribution of each instance and, for the first time, introduce the concept of IIS. Compared to traditional pseudo bag allocation strategies like random partitioning, our method provides a more rational pseudo bag allocation, effectively mitigating mislabeling issues to some extent. Additionally, we propose a progressive pseudo bag-augmented MIL framework that combines Shapley value-based IIS and utilizes the Expectation Maximization algorithm. This framework systematically enhances pseudo bag augmentation and significantly improves the efficiency of MIL. Extensive experiments on four public datasets demonstrate that our method outperforms existing state-of-the-art techniques. Furthermore, Shapley value-based IIS provides valuable category interpretability. The outstanding performance and interpretability assist pathologists in making more comprehensive and efficient diagnoses in clinical practice.

References

R. Yan et al., “Shapley Values-enabled Progressive Pseudo Bag Augmentation for Whole-Slide Image Classification,” in IEEE Transactions on Medical Imaging, doi: 10.1109/TMI.2024.3453386.

Tweets by SMARTLab_HKUST