Recently, a study from the HKUST Smart Lab on the fusion of multimodal medical data was accepted by the prestigious medical imaging journal IEEE Transactions on Medical Imaging. The research proposes a Cohort-Individual Cooperative Learning (CCL) approach to address the challenges of modality heterogeneity and data redundancy in multimodal data fusion.
Survival analysis is one of the most critical tasks in cancer prognosis prediction, aiming to estimate the risk of death in cancer patients. Recently, cancer survival analysis models that integrate multimodal pathological images and genomic data have achieved remarkable success. However, the heterogeneity among various modalities poses significant challenges for effectively fusing multimodal knowledge. Additionally, the high dimensionality of the data increases the risk of the model learning task-irrelevant information, leading to decreased generalization performance.
To tackle these issues, this study presents a Cohort-Individual Cooperative Learning (CCL) framework that enhances the performance of cancer survival analysis models by combining multimodal knowledge decomposition with a cohort guidance module. Specifically, the study first introduces a multimodal knowledge decomposition module that explicitly breaks down multimodal knowledge into four distinct components: common knowledge, collaborative knowledge, and the unique knowledge of each of the two modalities. Compared with existing knowledge decomposition frameworks, which consider only common knowledge and modality-specific knowledge, the proposed method offers a more comprehensive and effective decomposition of multimodal knowledge, enabling the model to perceive how different types of information in the multimodal data affect its predictions and facilitating more effective fusion. Secondly, the study introduces a cohort guidance module that leverages cohort-level information to reduce the risk of the model learning task-irrelevant information. This promotes a more comprehensive and robust understanding of the knowledge distribution across multimodal data, alleviating overfitting and enhancing the model's generalization ability. By integrating multimodal knowledge decomposition and the cohort guidance module, the proposed multimodal survival analysis model demonstrates superior discriminative and generalization capabilities. Extensive experimental results on five TCGA datasets validate the model's effectiveness and generalizability in survival analysis with multimodal data.
Figure 1. The proposed Cohort-Individual Cooperative Learning framework, including unimodal feature extraction, multimodal knowledge decomposition, and the cohort guidance module.
Unimodal Survival Analysis
The Cox proportional hazards model is one of the most classic statistical methods in survival analysis, modeling the risk function of cancer prognosis as a product of two components: 1) a baseline hazard function that describes the time-varying risk of death due to cancer at baseline levels; and 2) a personalized hazard coefficient that measures the overall impact of personalized features on the risk function. The Cox model primarily relies on various clinical indicators, and its accuracy largely depends on the correlation between the indicators used and survival risk. However, since the factors influencing survival risk cannot be fully determined and may be numerous, the performance of traditional models often falls short compared to the latest deep learning models. In recent years, with the advancement of deep learning techniques, methods utilizing deep models to analyze cancer patient survival based on genomic or pathological image data have attracted significant attention. Despite achieving good performance, unimodal survival analysis models cannot leverage valuable information from other modalities, limiting their performance ceiling.
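For reference, the standard Cox formulation writes the hazard for a patient with feature vector \(\mathbf{x}\) as \(h(t \mid \mathbf{x}) = h_0(t)\,\exp(\boldsymbol{\beta}^{\top}\mathbf{x})\), where \(h_0(t)\) is the baseline hazard function and \(\exp(\boldsymbol{\beta}^{\top}\mathbf{x})\) is the personalized hazard coefficient determined by the learned weights \(\boldsymbol{\beta}\).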
Multimodal Survival Analysis
Survival analysis models based on multimodal data can derive richer information from different modalities, leading to a more comprehensive understanding of cancer prognosis and more accurate predictions. Existing multimodal models often utilize cross-modal attention to model intermodal relationships, focusing on leveraging common knowledge between modalities to extract more discriminative features. However, an overemphasis on common knowledge may cause the model to overlook other valuable information, such as modality-specific knowledge and new knowledge generated from intermodal collaboration. Additionally, due to the high dimensionality of multimodal data, existing models are at risk of overfitting task-irrelevant information, resulting in decreased generalization performance.
Unimodal Feature Extraction
For pathological image data, foreground regions are first cropped into 256×256 image patches. Features for each patch are then extracted with an ImageNet-pretrained ResNet-50, forming a feature set of image patches. Since foreground regions are typically large, each pathological image usually contains tens of thousands of patches, resulting in a very high-dimensional feature set. To reduce data redundancy, K-means clustering is used to group all patch features into \(k\) classes. Because of the randomness of clustering, the centroids at the same position may not correspond to the same appearance across different samples, producing large apparent differences that hinder the model from extracting targeted features for specific appearances. To address this, the study maintains an anchor whose centroid features are iteratively updated; a bipartite graph matching method aligns the current sample's cluster centroids with the features stored in the anchor. The matched centroids are then used to update the anchor and for subsequent multimodal feature fusion. This enables the model to extract targeted information for specific appearances, enhancing the discriminative power of the pathological features.
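The centroid-anchor alignment can be illustrated with a minimal sketch, assuming per-slide patch features have already been extracted with ResNet-50; the function and variable names (e.g., cluster_and_align, anchor) are hypothetical, and the exponential-moving-average anchor update is an assumption rather than the authors' exact procedure.

```python
import numpy as np
from sklearn.cluster import KMeans
from scipy.optimize import linear_sum_assignment

def cluster_and_align(patch_feats, anchor, momentum=0.9):
    """patch_feats: (N, d) features of one slide; anchor: persistent (k, d) array."""
    k = anchor.shape[0]
    # 1) Compress tens of thousands of patch features into k cluster centroids.
    centroids = KMeans(n_clusters=k, n_init=10).fit(patch_feats).cluster_centers_

    # 2) Bipartite matching: assign each centroid to its most similar anchor slot,
    #    so the same slot corresponds to a similar appearance across slides.
    cost = -centroids @ anchor.T                 # negative similarity as matching cost
    row, col = linear_sum_assignment(cost)
    aligned = np.empty_like(centroids)
    aligned[col] = centroids[row]                # reorder centroids into anchor order

    # 3) Update the anchor with the matched centroids (EMA update assumed here).
    anchor = momentum * anchor + (1.0 - momentum) * aligned
    return aligned, anchor
```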
For genomic data, the study divides the RNA, CNV, and SNV sequences into six subsequences. Features for each subsequence are extracted using self-normalizing networks (SNNs), and the resulting subsequence features are fused into a single gene feature.
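As a rough sketch of the genomic branch (the layer sizes and the concatenation-based fusion are assumptions, not taken from the paper), each subsequence can be encoded by a small self-normalizing network and the six outputs combined into the gene feature:

```python
import torch
import torch.nn as nn

class SNNBlock(nn.Module):
    """Self-normalizing network block for one genomic subsequence."""
    def __init__(self, in_dim, out_dim=256, dropout=0.25):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, out_dim),
            nn.SELU(),                 # self-normalizing activation
            nn.AlphaDropout(dropout),  # dropout variant that keeps activations self-normalized
            nn.Linear(out_dim, out_dim),
            nn.SELU(),
        )

    def forward(self, x):              # x: (batch, in_dim) one genomic subsequence
        return self.net(x)

def encode_genes(encoders, subseqs):
    """encoders: one SNNBlock per subsequence; subseqs: list of six (batch, d_i) tensors.
    Fusion by concatenation is an assumption; other operators (mean, attention) are possible."""
    return torch.cat([enc(x) for enc, x in zip(encoders, subseqs)], dim=-1)
```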
Cohort-Individual Cooperative Learning
Multimodal Knowledge Decomposition: In this study, multimodal knowledge is comprehensively decomposed into common knowledge, collaborative knowledge, and the unique knowledge of each modality. Specifically, common knowledge refers to information shared by all modalities; collaborative knowledge refers to new knowledge that does not exist independently in any single modality and can only emerge from the collaboration of multimodal data; and modality-specific knowledge refers to knowledge that exists only within one modality. Multiple encoders are used to explicitly model these mutually exclusive knowledge components from the multimodal features, as sketched below.
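A minimal sketch of this decomposition step, assuming simple MLP encoders and that the common and collaborative encoders take the concatenated modality features while the specific encoders take a single modality; the class names and dimensions are illustrative, not the authors' implementation.

```python
import torch
import torch.nn as nn

def mlp(in_dim, out_dim):
    return nn.Sequential(nn.Linear(in_dim, out_dim), nn.ReLU(), nn.Linear(out_dim, out_dim))

class KnowledgeDecomposition(nn.Module):
    """Decompose pathology feature p and genomic feature g into four knowledge components."""
    def __init__(self, dim=256):
        super().__init__()
        self.common_enc = mlp(2 * dim, dim)  # knowledge shared by both modalities
        self.collab_enc = mlp(2 * dim, dim)  # knowledge emerging only from their combination
        self.path_enc   = mlp(dim, dim)      # pathology-specific knowledge
        self.gene_enc   = mlp(dim, dim)      # genomics-specific knowledge

    def forward(self, p, g):                 # p, g: (batch, dim)
        pg = torch.cat([p, g], dim=-1)
        return self.common_enc(pg), self.collab_enc(pg), self.path_enc(p), self.gene_enc(g)
```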
However, merely using different encoders does not guarantee that each will learn its intended knowledge component. The study therefore employs additional supervisory signals to guide the feature learning of each encoder. From the knowledge perspective, specialized loss functions, derived from the relationship between each knowledge component and the original modality features, constrain the learned knowledge features, as illustrated in Figure 2. Specifically, common knowledge is shared across all modalities, so the similarity between the common-knowledge feature and each modality feature is maximized. Collaborative knowledge, which does not exist independently in any original modality, is constrained to be orthogonal to the modality features. Modality-specific knowledge exists only in a single modality, so its similarity to the feature of its source modality is maximized while it is kept orthogonal to the features of the other modality. From the sample perspective, patients with similar risk levels should also have similar knowledge components. Samples are therefore grouped into cohorts by risk level using the survival labels, and contrastive learning pulls together the knowledge-component features of patients with similar risk levels while pushing apart those with markedly different risk levels.
Figure 2. Supervisory signals for the cohort guidance module.
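The knowledge-level constraints described above can be written as simple similarity and orthogonality penalties. The sketch below assumes cosine similarity as the similarity measure and a squared-cosine orthogonality penalty, which may differ from the exact losses used in the paper; the cohort-level contrastive term would be added on top of this, operating on the risk-level cohorts.

```python
import torch
import torch.nn.functional as F

def cos(a, b):
    return F.cosine_similarity(a, b, dim=-1)

def decomposition_loss(common, collab, path_spec, gene_spec, p, g):
    # Common knowledge: pull towards both modality features.
    l_common = (1 - cos(common, p)).mean() + (1 - cos(common, g)).mean()
    # Collaborative knowledge: orthogonal to both modality features.
    l_collab = cos(collab, p).pow(2).mean() + cos(collab, g).pow(2).mean()
    # Modality-specific knowledge: close to its own modality, orthogonal to the other.
    l_path = (1 - cos(path_spec, p)).mean() + cos(path_spec, g).pow(2).mean()
    l_gene = (1 - cos(gene_spec, g)).mean() + cos(gene_spec, p).pow(2).mean()
    return l_common + l_collab + l_path + l_gene
```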
Multimodal Fusion and Prediction
Based on the comprehensive decomposition of multimodal data into four mutually exclusive knowledge features, we utilize a Transformer to fuse all knowledge features and obtain the final prediction results. Following previous work, this study constrains survival prediction results using the negative log-likelihood (NLL) loss for censored data.
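For concreteness, the discrete-time NLL loss for censored survival data commonly used in this line of work can be sketched as follows; the exact formulation in the paper (e.g., any up-weighting of uncensored samples) may differ.

```python
import torch

def nll_survival_loss(hazard_logits, label, censor, eps=1e-7):
    """hazard_logits: (batch, T) logits over T discrete time intervals;
    label: (batch, 1) index of the interval containing the event / last follow-up;
    censor: (batch, 1) 1 if the patient is censored, 0 if the death event was observed."""
    hazards = torch.sigmoid(hazard_logits)                # conditional event probability per interval
    surv = torch.cumprod(1 - hazards, dim=1)              # S(t) = prod_{s<=t} (1 - h(s))
    surv_pad = torch.cat([torch.ones_like(surv[:, :1]), surv], dim=1)  # prepend S(0) = 1

    s_prev = surv_pad.gather(1, label).clamp(min=eps)     # S(y - 1)
    s_curr = surv_pad.gather(1, label + 1).clamp(min=eps) # S(y)
    h_curr = hazards.gather(1, label).clamp(min=eps)      # h(y)

    uncensored = (1 - censor) * (torch.log(s_prev) + torch.log(h_curr))
    censored = censor * torch.log(s_curr)
    return -(uncensored + censored).mean()
```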
Datasets: TCGA-BLCA, TCGA-BRCA, TCGA-GBMLGG, TCGA-LUAD, TCGA-UCEC
Comparison with SOTA Methods
This study compares the proposed model with state-of-the-art (SOTA) unimodal and multimodal methods, including genomic-based (SNN), pathological image-based (AttMIL, CLAM, TransMIL, and DTFD), and multimodal data-based (MCAT, M3IF, GPDBN, Porpoise, HFBSurv, SurvPath, MOTCat, and CMTA). The results are shown in Table 1. Overall, the proposed model achieves SOTA performance.
Table 1: Survival Analysis Experiments on TCGA Datasets
Ablation Studies of Proposed Modules
Ablation studies were also conducted to validate the effectiveness of the proposed modules, with results shown in Table 2. CCA refers to the cluster-center matching for pathological features, MKD to multimodal knowledge decomposition, and CGM to the cohort guidance module. The results indicate that each proposed module significantly improves the model's accuracy.
Table 2: Ablation Study of Proposed Modules
Multimodal Data Visualization
The attention regions corresponding to each knowledge component in the original data were visualized, as shown in Figure 3. It can be observed that different knowledge components typically correspond to different parts of the original data, demonstrating their differences and the rationale behind the proposed method.
Figure 3. Visualization results of different knowledge components in the input data.
In this study, we proposed a Cohort-Individual Cooperative Learning (CCL) framework to effectively fuse multimodal genomic and pathological image data for cancer patient survival analysis. Specifically, we first introduced a multimodal knowledge decomposition module that explicitly and comprehensively breaks down multimodal knowledge into four distinct components. Secondly, we presented a cohort guidance module to enhance the generalization and discriminative capabilities of the decomposed knowledge components. By combining multimodal knowledge decomposition with the cohort guidance module, the proposed framework achieves more efficient and robust multimodal knowledge modeling, facilitating more effective multimodal fusion. Experimental results on survival analysis across five TCGA datasets demonstrate that the proposed model achieves state-of-the-art performance.
H. Zhou, F. Zhou, and H. Chen, "Cohort-Individual Cooperative Learning for Multimodal Cancer Survival Analysis," IEEE Transactions on Medical Imaging, doi: 10.1109/TMI.2024.3455931.