One paper was accepted in ICLR 2024

Our paper “Prototypical Information Bottlenecking and Disentangling for Multimodal Cancer Survival Prediction” was accepted on ICLR 2024 as a spotlight paper. Congratulations to the authors!

Background

Survival prediction aims to estimate the survival probability of patients at a future time point based on their characteristics. This is typically defined as an ordinal regression task, predicting the survival risk ranking of patients to distinguish between high-risk and low-risk groups. Survival prediction results help doctors better assess the effects and risks of different treatment plans, providing personalized treatment options for patients. Analyzing both pathology and genetic data is considered the gold standard for survival prediction. Pathology images provide visual information about the tumor microenvironment, such as cell arrangement, while genetic data offer quantitative molecular insights, identifying specific cancer mutations, subtypes, and gene expression patterns.

Challenges

  1. Intra-Modal Redundancy: Capturing distinctive information by eliminating redundancy within a single modality.
    • Whole Slide Images (WSIs) are typically gigapixel-scale images. Due to hardware limitations, these images are usually processed in smaller patches. However, WSIs often provide labels only at the WSI level, leading to weakly supervised learning for survival prediction. Without precise annotations, such as marking cancerous regions within WSIs, both relevant and irrelevant information is mixed in the model input, causing redundancy. For instance, tumor cells highly relevant to risk assessment may occupy only a small portion of the entire WSI. Existing Multiple Instance Learning (MIL) methods do not enforce constraints to eliminate redundant information, making it difficult to obtain distinctive representations.
    • Similar redundancy issues exist in genomic data. Some studies suggest that pathway-based genomic groups, known for their interactions in unique cellular functions, correspond more semantically to pathological features. However, these pathways can generate hundreds to thousands of groups, and only a few specific pathways show strong correlations with patient prognosis, such as immune-related pathways in bladder cancer prognosis.
  2. Inter-Modal Redundancy: Capturing compact and comprehensive knowledge from overlapping multimodal data.
    • Redundancy from overlapping multimodal information complicates knowledge extraction. Decoupling independent factors can enhance feature effectiveness while discarding redundant information. Knowledge can be categorized into modality-specific and modality-common knowledge. The former contains unique information from a single modality, while the latter represents shared information across modalities, showing consistency. Existing work focuses on integrating common information by emphasizing modality alignment. However, common information often dominates the network’s alignment and integration of multimodal information, suppressing modality-specific information and overlooking the richness of unique perspectives.

Method

PIBD Figure 1: The framework of Prototypical Information Bottlenecking and Disentangling (PIBD) for multimodal cancer survival prediction.

To address these challenges, we propose the Prototypical Information Bottlenecking and Disentangling (PIBD) multimodal survival prediction model, composed of Prototypical Information Bottleneck (PIB) to reduce “intra-modal redundancy” and Prototypical Information Disentanglement (PID) to reduce “inter-modal redundancy.” The model framework is shown in Figure 1. PIB models a set of prototypes representing feature distributions of different risk levels, selecting distinctive instances within modalities. These instance features are then disentangled by PID into modality-specific and modality-common features for survival prediction.

Information Bottleneck

The Information Bottleneck introduces a new variable to maximally express information about the target while compressing the original information from the input. The objective function maximizes the mutual information between the new variable and the target, and minimizes the mutual information between the new variable and the input.

Since mutual information is difficult to compute, the Variational Information Bottleneck (VIB) approximates it by maximizing its variational lower bound, resulting in the loss function.

Prototypical Information Bottleneck

The Information Bottleneck offers a promising solution to reduce intra-modal redundancy. However, in our task, modality data is organized into “bags” containing many instances. Directly applying the Information Bottleneck has two drawbacks: (1) Difficulty in deriving the overall distribution at the bag level from many single-instance distributions, leading to high-dimensional computation challenges. (2) Independent learning of each instance’s distribution makes it hard to capture compact information representing the bag.

To address this, we propose Prototypical Information Bottleneck (PIB) to approximate the bag-level distribution directly using a set of prototypes representing parameter distributions. Each prototype should represent the conditional probability distribution of its corresponding risk level. Instances within bags with the same label should be close to the prototype distribution. Therefore, the variational approximation changes.

We maximize the similarity between the parameter prototypes and the latent space distribution of instances obtained through a feature encoder. This optimization only requires modeling the prototype distribution and encoder, without modeling each instance distribution. To eliminate redundant instances irrelevant to risk prediction, we select a subset of instances within a bag with higher similarity scores, discarding instances that do not contribute to the learning process. During training, we group instances more related to positive prototypes together and distance them from negative prototypes, achieving the objective.

Prototypical Information Disentanglement

After eliminating redundancy from single-modal sources, we propose a Prototypical Information Disentanglement (PID) module to decouple shared and specific features, addressing “inter-modal redundancy.” We aim to ideally decompose entangled multimodal features into modality-common features and modality-specific features. Using the joint prototype distribution modeled by PIB, we extract common knowledge. By forcing specific knowledge to be independent of these shared features, common features guide the learning of modality-specific knowledge. We minimize the mutual information between common and specific features to retain modality-specific information.

Disentangling Figure 2: The disentangling Transformer in Prototypical Information Disentanglement (PID).

To achieve this, we design a disentangling Transformer as the disentangling layer, shown in Figure 2. This Transformer models the interaction of input features, obtaining specific features through self-attention and common features through cross-attention. The common information is guided by the joint posterior distribution of prototypes, defined by the Product of Experts (PoE).

Finally, the PIBD loss function combines these objectives.

Experimental Results

Comparative Experiments

We conducted extensive experiments with 5-fold cross-validation on five cancer datasets from TCGA, including single-modal methods, multimodal methods, and other information bottleneck-based methods. The quantitative results on the survival prediction metric C-Index are shown below.

Results Figure 3: The C-Index results of different methods on five cancer datasets.

Results indicate that PIBD achieves the best overall performance, outperforming the second-best method by 1.6% in overall C-Index and achieving the best results in 4 out of 5 datasets. This highlights the importance of addressing both intra-modal and inter-modal redundancy. Compared to other information bottleneck-based methods, our method consistently outperforms across all cancer datasets, with performance improvements ranging from 0.5% to 4.9%. PIBD effectively considers the bag structure characteristics under weak supervision, demonstrating its superiority in multimodal cancer survival prediction.

Statistical Analysis

To demonstrate the model’s performance in distinguishing high-risk and low-risk patients, we visualized the Kaplan-Meier (KM) curves of our proposed method, as shown below.

KM Figure 4: Kaplan-Meier curves of our method on five cancer datasets.

The results show that our method better distinguishes high-risk and low-risk patients. We also conducted log-rank tests, with p-values less than 0.05 indicating statistical significance. The shaded areas represent confidence intervals. Our method consistently achieves p-values below 0.05. Additionally, we reported median survival months as “high-risk: mean (standard deviation)/low-risk: mean (standard deviation).”

Interpretability Analysis

To verify that the prototypes learned by PIB model distinctive latent distributions for different risk levels, we randomly sampled each prototype with a sampling frequency of 2000 times. We then used t-SNE to reduce the high-dimensional vectors to a two-dimensional plane. As shown in the figure below, the distributions exhibit good separability.

Interpretability Figure 5: t-SNE visualization of prototypes learned by PIB.

Furthermore, we conducted interventions during inference, as shown in the table below. The results show significant differences. Removing positive prototypes leads to a sharp decline in C-Index (all below 0.5), indicating a complete loss of predictive ability. Intervening on positive prototypes further leads to incorrect guiding signals for the subsequent disentangling module PID, resulting in worse performance due to incorrect prototype distributions. In contrast, randomly removing a negative prototype results in only a slight decline in C-Index, emphasizing the effectiveness of modeling distinctive risk level distributions in PIB.

Redundancy Analysis

We evaluated the model under different redundancy removal levels, as shown below.

Redundancy Figure 6: Model performance under different redundancy removal levels.

Compared to the setting with 100% information retention (i.e., no redundancy removal), removing 60-75% and 30-35% of instances for the two modalities, respectively, achieves roughly the same performance, indicating that redundant information affects overall model performance. Our model can compress information to reduce redundancy, thereby improving performance.

For more experimental analysis, please refer to our paper: https://openreview.net/forum?id=otHZ8JAIgh

Conclusion

In this work, inspired by information theory, we explored the Information Bottleneck in multimodal cancer survival prediction and proposed a new framework called Prototypical Information Bottlenecking and Disentangling (PIBD) to address the challenges of “intra-modal redundancy” and “inter-modal redundancy.” First, we proposed a Prototypical Information Bottleneck (PIB) to reduce redundancy while preserving task-relevant information. PIB models prototypes of various risk levels, allowing us to select distinctive features from a large number of instances, alleviating “intra-modal redundancy.” Additionally, to address “inter-modal redundancy,” we proposed Prototypical Information Disentanglement (PID), guided by joint prototype distributions, to disentangle independent modality-common and modality-specific features. These compact features provide different perspectives and knowledge, effectively enhancing network performance.

References

Zhang, Yilan, Yingxue Xu, Jianqi Chen, Fengying Xie, and Hao Chen. “Prototypical Information Bottlenecking and Disentangling for Multimodal Cancer Survival Prediction.” In The Twelfth International Conference on Learning Representations. 2024.