Our paper “Prototypical Information Bottlenecking and Disentangling for Multimodal Cancer Survival Prediction” was accepted at ICLR 2024 as a spotlight paper. Congratulations to the authors!
Survival prediction aims to estimate a patient’s survival probability at a future time point based on their characteristics. It is typically formulated as an ordinal regression task that predicts a survival risk ranking over patients, distinguishing high-risk from low-risk groups. Survival prediction results help doctors better assess the benefits and risks of different treatment plans and provide personalized treatment options for patients. Jointly analyzing pathology and genetic data is considered the gold standard for survival prediction: pathology images provide visual information about the tumor microenvironment, such as cell arrangement, while genetic data offer quantitative molecular insights, identifying specific cancer mutations, subtypes, and gene expression patterns. However, effectively fusing the two modalities faces two challenges: “intra-modal redundancy,” since each modality contains many instances that are irrelevant to risk prediction, and “inter-modal redundancy,” since the two modalities carry overlapping information.
Figure 1: The framework of Prototypical Information Bottlenecking and Disentangling (PIBD) for multimodal cancer survival prediction.
To address these challenges, we propose the Prototypical Information Bottlenecking and Disentangling (PIBD) multimodal survival prediction model, composed of a Prototypical Information Bottleneck (PIB) module that reduces “intra-modal redundancy” and a Prototypical Information Disentanglement (PID) module that reduces “inter-modal redundancy.” The model framework is shown in Figure 1. PIB models a set of prototypes representing the feature distributions of different risk levels and uses them to select distinctive instances within each modality. The selected instance features are then disentangled by PID into modality-common and modality-specific features for survival prediction.
Information Bottleneck
The Information Bottleneck (IB) introduces a new variable that expresses as much information about the target as possible while compressing the information carried over from the input. Its objective maximizes the mutual information between the new variable and the target while minimizing the mutual information between the new variable and the input.
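In the standard formulation (generic notation, with input $X$, target $Y$, and bottleneck variable $Z$), this objective reads:

$$
\max \; I(Z; Y) \;-\; \beta \, I(Z; X),
$$

where $I(\cdot;\cdot)$ denotes mutual information and $\beta > 0$ controls the trade-off between prediction and compression.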
Since mutual information is difficult to compute directly, the Variational Information Bottleneck (VIB) approximates this objective by optimizing a variational bound, yielding a tractable loss function.
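For reference, the standard VIB loss takes the following form, with a variational decoder $q(y \mid z)$ and prior $r(z)$; this is the generic textbook form rather than the exact parameterization used in our model:

$$
\mathcal{L}_{\mathrm{VIB}} \;=\; \mathbb{E}_{(x, y)} \Big[ \, \mathbb{E}_{z \sim p(z \mid x)} \big[ -\log q(y \mid z) \big] \;+\; \beta \, \mathrm{KL}\big( p(z \mid x) \,\|\, r(z) \big) \Big].
$$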
Prototypical Information Bottleneck
The Information Bottleneck offers a promising way to reduce intra-modal redundancy. In our task, however, the data of each modality is organized as a “bag” containing many instances (e.g., the thousands of patches cropped from a whole-slide image). Directly applying the Information Bottleneck at the instance level has two drawbacks: (1) it is difficult to derive the overall bag-level distribution from many single-instance distributions, leading to high-dimensional computation; and (2) learning each instance’s distribution independently makes it hard to capture compact information that represents the whole bag.
To address this, we propose the Prototypical Information Bottleneck (PIB), which approximates the bag-level distribution directly with a set of parametric prototypes. Each prototype represents the conditional probability distribution of its corresponding risk level, and instances in bags with the same label should lie close to that prototype’s distribution. The variational approximation is therefore taken with respect to the prototypes rather than to individual instances.
Concretely, we maximize the similarity between the prototype distributions and the latent distributions of instances produced by a feature encoder. This optimization only requires modeling the prototype distributions and the encoder, without modeling each instance’s distribution separately. To eliminate redundant instances that are irrelevant to risk prediction, we select the subset of instances within a bag with the highest similarity scores and discard instances that do not contribute to learning. During training, instances are pulled toward the prototype of their own (positive) risk level and pushed away from the prototypes of the other (negative) risk levels.
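The PyTorch-style sketch below illustrates the idea of prototype-guided instance selection. The class name, tensor shapes, Gaussian prototype parameterization, and contrastive-style loss are illustrative assumptions for exposition, not the implementation released with the paper.

```python
import torch
import torch.nn as nn

class PrototypicalIBSketch(nn.Module):
    """Illustrative sketch: select bag instances by similarity to risk-level prototypes."""

    def __init__(self, feat_dim, latent_dim, num_risk_levels, keep_ratio=0.3):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(feat_dim, latent_dim), nn.ReLU(),
            nn.Linear(latent_dim, latent_dim),
        )
        # One Gaussian prototype (mean, log-variance) per risk level.
        self.proto_mu = nn.Parameter(torch.randn(num_risk_levels, latent_dim))
        self.proto_logvar = nn.Parameter(torch.zeros(num_risk_levels, latent_dim))
        self.keep_ratio = keep_ratio

    def similarity(self, z):
        # Log-likelihood of each instance embedding under each prototype Gaussian.
        var = self.proto_logvar.exp()                        # (R, D)
        diff = z.unsqueeze(1) - self.proto_mu.unsqueeze(0)   # (N, R, D)
        return (-0.5 * (diff ** 2 / var + self.proto_logvar)).sum(-1)  # (N, R)

    def forward(self, bag, risk_label):
        """bag: (N, feat_dim) instance features; risk_label: int risk level of the bag."""
        z = self.encoder(bag)                                # (N, D)
        sim = self.similarity(z)                             # (N, R)
        # Keep only the instances most similar to the bag's own (positive) prototype.
        k = max(1, int(self.keep_ratio * bag.size(0)))
        keep = sim[:, risk_label].topk(k).indices
        # Pull kept instances toward the positive prototype and away from the others.
        pos = sim[keep, risk_label]
        neg = sim[keep].logsumexp(dim=1)
        loss_pib = -(pos - neg).mean()
        return z[keep], loss_pib
```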
Prototypical Information Disentanglement
After eliminating redundancy within each modality, we propose a Prototypical Information Disentanglement (PID) module to decouple shared and specific features, addressing “inter-modal redundancy.” Ideally, the entangled multimodal features are decomposed into modality-common features and modality-specific features. Using the joint prototype distribution modeled by PIB, we extract the common knowledge; by forcing the specific knowledge to be independent of these shared features, the common features in turn guide the learning of modality-specific knowledge. We minimize the mutual information between the common and specific features so that the specific branch retains only modality-specific information.
Figure 2: The disentangling Transformer in Prototypical Information Disentanglement (PID).
To achieve this, we design a disentangling Transformer as the disentangling layer, shown in Figure 2. This Transformer models the interaction between the input features, obtaining modality-specific features through self-attention and modality-common features through cross-attention. The common information is guided by the joint posterior distribution of the prototypes, defined via a Product of Experts (PoE).
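For completeness, when each modality’s prototype distribution is Gaussian, the PoE joint posterior has the standard closed form below (illustrative notation; a standard Gaussian prior can be included as an additional expert):

$$
\Sigma_{\mathrm{joint}} = \Big( \sum_{m} \Sigma_m^{-1} \Big)^{-1}, \qquad \mu_{\mathrm{joint}} = \Sigma_{\mathrm{joint}} \sum_{m} \Sigma_m^{-1} \mu_m,
$$

where $(\mu_m, \Sigma_m)$ are the prototype distribution parameters of modality $m$.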
Finally, the overall PIBD loss combines these objectives for end-to-end training.
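Schematically, the combined objective can be written as follows; $\mathcal{L}_{\mathrm{surv}}$ denotes the survival prediction loss, and the trade-off weights $\lambda_1$ and $\lambda_2$ are illustrative placeholders rather than the paper’s exact hyperparameters:

$$
\mathcal{L}_{\mathrm{PIBD}} \;=\; \mathcal{L}_{\mathrm{surv}} \;+\; \lambda_1 \, \mathcal{L}_{\mathrm{PIB}} \;+\; \lambda_2 \, \mathcal{L}_{\mathrm{PID}}.
$$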
Comparative Experiments
We conducted extensive experiments with 5-fold cross-validation on five cancer datasets from TCGA, comparing PIBD against single-modal methods, multimodal methods, and other information-bottleneck-based methods. The quantitative results on the survival prediction metric, the concordance index (C-Index), are shown below.
Figure 3: The C-Index results of different methods on five cancer datasets.
The results indicate that PIBD achieves the best overall performance, outperforming the second-best method by 1.6% in overall C-Index and achieving the best result on 4 of the 5 datasets. This highlights the importance of addressing both intra-modal and inter-modal redundancy. Compared with other information-bottleneck-based methods, our method consistently performs better across all cancer datasets, with improvements ranging from 0.5% to 4.9%, showing that PIBD effectively accounts for the bag structure under weak supervision and demonstrating its superiority in multimodal cancer survival prediction.
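For readers unfamiliar with the metric: the C-Index is the fraction of comparable patient pairs whose predicted risk ordering agrees with their observed survival ordering (0.5 corresponds to random ranking, 1.0 to perfect ranking). A minimal sketch of computing it with the lifelines package, on made-up data rather than our experimental outputs:

```python
from lifelines.utils import concordance_index

# Made-up example data: higher risk should correspond to shorter survival.
event_times = [34.2, 12.1, 58.0, 7.5, 21.3]   # observed time in months
predicted_risk = [0.2, 0.7, 0.1, 0.9, 0.5]    # model risk scores
event_observed = [1, 1, 0, 1, 0]              # 1 = death observed, 0 = censored

# concordance_index expects higher scores to mean longer survival,
# so risk scores are negated before the call.
c_index = concordance_index(event_times, [-r for r in predicted_risk], event_observed)
print(f"C-Index: {c_index:.3f}")
```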
Statistical Analysis
To demonstrate the model’s ability to distinguish high-risk from low-risk patients, we visualized the Kaplan-Meier (KM) curves of our proposed method, as shown below.
Figure 4: Kaplan-Meier curves of our method on five cancer datasets.
The results show that our method clearly separates high-risk and low-risk patients. We also conducted log-rank tests, where a p-value below 0.05 indicates a statistically significant separation between the two groups; our method consistently achieves p-values below 0.05. The shaded areas represent confidence intervals. In addition, we report the median survival months of each group in the form “high-risk: mean (standard deviation) / low-risk: mean (standard deviation).”
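As an illustration, KM curves with confidence bands and a log-rank test for two risk groups can be produced with the lifelines package roughly as follows; the function and variable names are hypothetical, and this is not the paper’s plotting code:

```python
import matplotlib.pyplot as plt
from lifelines import KaplanMeierFitter
from lifelines.statistics import logrank_test

def plot_risk_groups(times_high, events_high, times_low, events_low):
    """Plot KM curves for two risk groups and report the log-rank p-value."""
    ax = plt.subplot(111)
    kmf = KaplanMeierFitter()
    kmf.fit(times_high, event_observed=events_high, label="high risk")
    kmf.plot_survival_function(ax=ax, ci_show=True)   # shaded area = confidence interval
    kmf.fit(times_low, event_observed=events_low, label="low risk")
    kmf.plot_survival_function(ax=ax, ci_show=True)

    result = logrank_test(times_high, times_low,
                          event_observed_A=events_high, event_observed_B=events_low)
    ax.set_xlabel("Time (months)")
    ax.set_ylabel("Survival probability")
    ax.set_title(f"log-rank p = {result.p_value:.4f}")
    plt.show()
```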
Interpretability Analysis
To verify that the prototypes learned by PIB model distinctive latent distributions for different risk levels, we drew 2,000 random samples from each prototype distribution and used t-SNE to project the high-dimensional samples onto a two-dimensional plane. As shown in the figure below, the distributions of different prototypes exhibit good separability.
Figure 5: t-SNE visualization of prototypes learned by PIB.
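A minimal sketch of this kind of visualization, assuming Gaussian prototypes; the number of risk levels, latent dimension, and prototype parameters below are placeholders rather than the learned values:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
n_samples, latent_dim, num_levels = 2000, 256, 4   # placeholder sizes

samples, labels = [], []
for level in range(num_levels):
    mu = rng.normal(size=latent_dim)               # stand-in for a learned prototype mean
    sigma = 0.5                                     # stand-in for a learned prototype std
    samples.append(rng.normal(mu, sigma, size=(n_samples, latent_dim)))
    labels.append(np.full(n_samples, level))

# Project the 2,000 samples per prototype onto a 2-D plane with t-SNE.
emb = TSNE(n_components=2, init="pca", random_state=0).fit_transform(np.vstack(samples))
plt.scatter(emb[:, 0], emb[:, 1], c=np.concatenate(labels), s=2, cmap="viridis")
plt.title("t-SNE of samples drawn from each risk-level prototype")
plt.show()
```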
Furthermore, we performed interventions on the prototypes during inference, as shown in the table below, and the results differ markedly across settings. Removing the positive prototypes leads to a sharp decline in C-Index (all values below 0.5, i.e., no better than random ranking), indicating a complete loss of predictive ability. Intervening on the positive prototypes further corrupts the guiding signal for the subsequent disentangling module PID, resulting in worse performance due to the incorrect prototype distributions. In contrast, randomly removing a negative prototype causes only a slight decline in C-Index. These observations emphasize the effectiveness of modeling distinctive risk-level distributions in PIB.
Redundancy Analysis
We evaluated the model under different redundancy removal levels, as shown below.
Figure 6: Model performance under different redundancy removal levels.
Compared with the setting that retains 100% of the information (i.e., no redundancy removal), removing 60-75% and 30-35% of the instances of the two modalities, respectively, achieves roughly the same performance, indicating that a large portion of the instances is redundant. By compressing away this redundancy, our model maintains, and can even improve, overall performance.
For more experimental analysis, please refer to our paper: https://openreview.net/forum?id=otHZ8JAIgh
In this work, inspired by information theory, we explored the Information Bottleneck in multimodal cancer survival prediction and proposed a new framework called Prototypical Information Bottlenecking and Disentangling (PIBD) to address the challenges of “intra-modal redundancy” and “inter-modal redundancy.” First, we proposed a Prototypical Information Bottleneck (PIB) to reduce redundancy while preserving task-relevant information. PIB models prototypes of various risk levels, allowing us to select distinctive features from a large number of instances, alleviating “intra-modal redundancy.” Additionally, to address “inter-modal redundancy,” we proposed Prototypical Information Disentanglement (PID), guided by joint prototype distributions, to disentangle independent modality-common and modality-specific features. These compact features provide different perspectives and knowledge, effectively enhancing network performance.
Zhang, Yilan, Yingxue Xu, Jianqi Chen, Fengying Xie, and Hao Chen. “Prototypical Information Bottlenecking and Disentangling for Multimodal Cancer Survival Prediction.” In The Twelfth International Conference on Learning Representations. 2024.