[MICCAI 2024 Paper] Multimodal Hybrid Expert Model for Whole Slide Image and Gene Sequence Survival Analysis

Recently, a study on medical multimodality by HKUST Smart Lab has been accepted for presentation at the International Conference on Medical Image Computing and Computer Assisted Intervention (MICCAI) 2024. This paper is among the first batch of early acceptances at MICCAI 2024. The research proposes a multimodal hybrid expert model for whole slide image (WSI) and gene sequence survival analysis. The related code is open-sourced at https://github.com/BearCleverProud/MoME. The paper can be accessed at https://arxiv.org/abs/2406.09696. The details of the paper follow.

Abstract

Survival prediction requires the integration of whole slide images (WSI) and genomics data to support robust clinical decisions. Two main challenges exist: significant heterogeneity between modalities and complex interactions within and between modalities. Previous studies have adopted co-attention mechanisms, performing a one-time feature fusion after encoding both modalities separately. However, due to the high heterogeneity between modalities, this method is insufficient for modeling such a complex task. To address this, we propose a Biased Progressive Encoding (BPE) paradigm, allowing simultaneous encoding and fusion of modalities. This paradigm uses one modality as a reference to assist in encoding the other, with multiple iterations to deeply integrate the two modalities, gradually reducing cross-modal differences and promoting multimodal information fusion. Additionally, survival prediction involves comprehensive analysis of biomarkers from WSI, genomics, and cross-modal interactions, where key biomarkers may vary between individuals and modalities, requiring model flexibility. Therefore, we further propose a Mixture of Multimodal Experts (MoME) layer to dynamically select customized expert models at each stage of the BPE paradigm. Experts incorporate reference information from another modality to varying degrees, allowing balanced or biased attention to different modalities during encoding. Experimental results show that our method outperforms others on multiple datasets, including TCGA-BLCA, TCGA-UCEC, and TCGA-LUAD. The code is available at https://github.com/BearCleverProud/MoME.

Introduction

Survival analysis based on WSI and genomics is crucial for cancer prognosis, as it assesses mortality risk and provides essential references for treatment planning. The key challenge is effectively utilizing information from both modalities, such as identifying genomic biomarkers, exploring tumor microenvironments in histopathological images, and capturing co-expression interactions within the genomic data. Recently, research focus has shifted from single-modality cancer classification to more complex survival analysis using multimodal information.

A major challenge in this task is the significant heterogeneity between histopathological images and genomics data, stemming from their inherent differences and varying preprocessing methods. Furthermore, interactions within and between modalities are complex, as both contain rich information, but only a small portion is interrelated and useful for survival prediction. Previous methods have attempted to address this using co-attention (CA) mechanisms, but feature fusion was performed only once. Given the task’s complexity and significant differences between modalities, these methods may be considered shallow and insufficient.

To address these issues, we propose a Biased Progressive Encoding (BPE) paradigm. Unlike previous methods that encode modalities separately before fusion, our approach performs encoding and fusion simultaneously. In this method, one modality is encoded while using the other as a reference, aiding in extracting more relevant information. Additionally, feature encoding alternates between modalities, gradually reducing their feature space differences. This leads to deeper integration and fosters cross-modal interactions.

In addition to modality heterogeneity, individual differences may cause key features for survival analysis to appear in different modalities. This presents new challenges in model design, requiring selective attention to specific modalities or their interactions. To achieve this, based on our BPE paradigm, we propose a Mixture of Multimodal Experts (MoME) layer. The MoME layer consists of multiple experts capable of modeling complex interactions within and between modalities. These experts incorporate reference information from another modality to varying degrees, allowing our model to selectively focus on different modalities during encoding. Additionally, since the function of reference information may vary, we allow flexible expert selection at each layer. In the network’s shallow layers, the MoME layer can use the reference modality as a filter to eliminate irrelevant features and enhance relevant ones. Conversely, in the network’s deeper layers, it can serve as a guide, seeking cross-modal combined representations as key biomarkers for survival prediction.

Methods

Biased Progressive Encoding


Fig. 1. (a) Our Biased Progressive Encoding paradigm, and (b) our Mixture of Multimodal Experts layer illustration. The left part of (b) represents our gating network, and the right part depicts our proposed three expert components designed to utilize reference modality to varying degrees.

Our BPE paradigm is outlined in Fig. 1(a). For simplicity, we refer to the currently encoded modality as \(M\), and the reference modality as \(R\), where \(M\) can be either \(M_1\) or \(M_2\), and \(N\) is the number of encoding iterations. Our BPE paradigm follows a progressive learning strategy where \(M_1\) is encoded to discover complex interactions with \(R\). This process is then reversed, using the encoded first modality \(M_1\) to encode the other modality \(M_2\). The deep feature extraction and progressive learning strategy in our BPE enable deep integration and reduce inter-modal differences. The complete progressive encoding process involves encoding both modalities twice for deep integration. After encoding, these features, along with a pre-set classification feature, are fed into an attention layer, and the output classification feature is used for final survival prediction.
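
To make the paradigm concrete, below is a minimal PyTorch sketch (not the authors' code) of how the BPE loop could be wired, assuming each modality is a sequence of token features, each stage is a fusion layer that encodes \(M\) with \(R\) as reference (a MoME layer in the paper), and `attn_pool` stands in for the final attention layer; all names and shapes are illustrative.

```python
import torch


def bpe_encode(m1, m2, stages, cls_token, attn_pool):
    """Biased Progressive Encoding sketch: alternately encode one modality
    while using the other as the reference.

    m1, m2    : (n1, d) and (n2, d) token features of the two modalities
    stages    : list of fusion layers; each maps (M, R) -> encoded M
    cls_token : (1, d) pre-set classification feature appended before pooling
    attn_pool : attention layer applied to the concatenated features
    """
    for i, stage in enumerate(stages):
        if i % 2 == 0:
            m1 = stage(m1, m2)  # encode M1 with M2 as the reference
        else:
            m2 = stage(m2, m1)  # encode M2 with the updated M1 as the reference
    tokens = torch.cat([cls_token, m1, m2], dim=0)
    tokens = attn_pool(tokens)
    return tokens[0]            # classification feature used for survival prediction
```

With four stages, both modalities are encoded twice before the classification feature is read out, matching the complete progressive encoding process described above.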

Mixture of Multimodal Experts Model

The structure of our proposed MoME layer is shown in Fig. 1(b). Our MoME layer is derived from the traditional Mixture of Experts (MoE). Traditional MoE consists of a set of parallel feedforward networks (experts) and a gating network that controls expert selection. Unlike the classic MoE that operates at the feature level, our MoME innovates at the sample and layer levels, allowing different samples within the same layer or the same sample across different layers to be sent to different experts. This innovation is crucial for handling the information-rich but sparse nature of WSIs and genomics, where a single feature might be meaningless, emphasizing the need for collective processing.

Our MoME consists of two components: 1) a gating network and 2) a set of expert structures specifically designed for multimodal survival analysis. First, multimodal features are passed to the gating network, which determines the most suitable expert to use. Then, the selected expert uses \(R\) to fuse and encode \(M\).

Gating Network

The gating network is designed to be lightweight yet capable of selecting the appropriate expert. It consists of linear layers, GELU activation functions, and RMSNorm layers. These modules map the features into a new space, and the mapped features are then averaged to obtain a multimodal representation from which the gating logits are computed. The gating network is defined as follows,

\[G(X) = \text{RMSNorm}(\text{GELU}(W_3 \cdot \text{GELU}(W_2 \cdot \text{GELU}(W_1 \cdot X))))\]

where \(W_1, W_2, W_3\) are learnable matrices. The obtained logits are used to select the appropriate expert. Unlike traditional MoE models that use weighted sums of multiple experts, our method enforces the selection of a single expert within each module. By adopting this strategy, the gating network can make careful expert selections while reducing computational costs.
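
A minimal sketch of such a gating network is given below (PyTorch 2.4+ for `nn.RMSNorm`); the layer widths, the concatenation of the two modalities' tokens before gating, and the hard `argmax` selection are illustrative assumptions rather than details taken from the paper.

```python
import torch
import torch.nn as nn


class MoMEGate(nn.Module):
    """Lightweight gating network: Linear+GELU blocks with RMSNorm, token
    averaging, and hard top-1 expert selection."""

    def __init__(self, dim: int, num_experts: int):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(dim, dim), nn.GELU(),          # W1
            nn.Linear(dim, dim), nn.GELU(),          # W2
            nn.Linear(dim, num_experts), nn.GELU(),  # W3
            nn.RMSNorm(num_experts),
        )

    def forward(self, m: torch.Tensor, r: torch.Tensor) -> int:
        x = torch.cat([m, r], dim=0)       # multimodal tokens, (n_m + n_r, d)
        logits = self.proj(x).mean(dim=0)  # average the mapped features
        return int(logits.argmax())        # enforce selection of a single expert
```

Note that a hard top-1 choice by itself passes no gradient to the gating parameters, so a practical implementation would typically weight the selected expert's output by its gate score or use a straight-through estimator during training.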

Multimodal Expert Pool

We designed the following four experts based on two principles: 1) the pool must include experts specializing in WSI, genomics, and their interactions (at least three experts in total), and 2) each expert should be capable of both encoding and feature fusion.

TransFusion

This multimodal expert is based on self-attention. It maximizes the utilization of the reference modality by allowing comprehensive information exchange between the two modalities. Given the input pair \((M, R)\), our proposed TransFusion (TF) expert can be represented as,

\[TF(M, R) = \text{Select}(\text{SelfAttention}(\text{Concat}(M, R)))\]

where \(\text{SelfAttention}\) is the self-attention mechanism, \(\text{Concat}\) is the concatenation of \(M\) and \(R\), and \(\text{Select}\) keeps the first \(m\) rows of the attention output, with \(m\) being the number of features in modality \(M\), so the output matches the shape of the modality being encoded.
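
A hedged PyTorch sketch of a TF-style expert follows; the number of attention heads and the use of `nn.MultiheadAttention` are illustrative choices, not details from the paper.

```python
import torch
import torch.nn as nn


class TransFusion(nn.Module):
    """TF expert sketch: self-attention over the concatenation of M and R,
    keeping only the rows that correspond to M (the Select step)."""

    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, m: torch.Tensor, r: torch.Tensor) -> torch.Tensor:
        x = torch.cat([m, r], dim=0).unsqueeze(0)    # (1, n_m + n_r, d)
        out, _ = self.attn(x, x, x, need_weights=False)
        return out[0, : m.shape[0]]                  # Select: rows of the encoded M
```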

Bottleneck TransFusion

Both WSI and genomics contain a large amount of information, but only a small portion is useful for survival analysis. It is therefore necessary to distill the information, focusing only on the parts relevant to the outcome. To address this, we propose the Bottleneck TransFusion (BTF) expert, which avoids direct interaction between the two modalities and instead uses a small set of bottleneck features as intermediaries to exchange information between them. Compared to TF, BTF makes less use of the reference modality, as it does not allow full intercommunication between the two modalities. Let \(B\) denote the \(k\) bottleneck features; the BTF expert is then defined as follows,

\[\hat{B} = \text{Select}(\text{SelfAttention}(\text{Concat}(B, R))), \qquad BTF(M, R) = \text{Select}(\text{SelfAttention}(\text{Concat}(M, \hat{B})))\]

where \(\hat{B}\) denotes the bottleneck features after gathering information from the reference modality \(R\); in this way, \(M\) and \(R\) never attend to each other directly and interact only through the bottleneck.
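
The sketch below is one possible reading of this two-stage bottleneck scheme, reusing the conventions of the TF sketch; the learnable bottleneck tokens, the separate attention blocks, and the exact wiring are assumptions, so treat it as an interpretation of the formula above rather than the paper's implementation.

```python
import torch
import torch.nn as nn


class BottleneckTransFusion(nn.Module):
    """BTF expert sketch: M and R never attend to each other directly; a small
    set of learnable bottleneck tokens first absorbs information from R and is
    then attended to together with M."""

    def __init__(self, dim: int, num_bottleneck: int = 2, num_heads: int = 4):
        super().__init__()
        self.bottleneck = nn.Parameter(torch.randn(num_bottleneck, dim) * 0.02)
        self.attn_r = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.attn_m = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, m: torch.Tensor, r: torch.Tensor) -> torch.Tensor:
        k = self.bottleneck.shape[0]
        # Stage 1: bottleneck tokens gather information from the reference modality.
        br = torch.cat([self.bottleneck, r], dim=0).unsqueeze(0)
        b_hat = self.attn_r(br, br, br, need_weights=False)[0][0, :k]
        # Stage 2: M attends jointly with the updated bottleneck tokens.
        mb = torch.cat([m, b_hat], dim=0).unsqueeze(0)
        out = self.attn_m(mb, mb, mb, need_weights=False)[0]
        return out[0, : m.shape[0]]  # Select: rows of the encoded M
```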

SNNFusion

The SNNFusion (SF) expert is primarily intended for genomic data, as SNNs perform well when applied solely to genomics. This expert makes only limited use of the reference modality, since the two modalities are encoded separately and never attend to each other. SNNFusion contains two SNNs, and our SF expert is defined as follows,

\[SF(M, R) = \text{SNN}_1(M) + \text{SNN}_2(R)\]

where each SNN consists of a linear layer, an Exponential Linear Unit (ELU) activation function, and an alpha dropout layer.
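
A minimal sketch of an SF-style expert follows; the dropout rate and the mean-pooling of the reference tokens (used here only so the sum broadcasts over \(M\)'s tokens) are assumptions, not details from the paper.

```python
import torch
import torch.nn as nn


class SNNFusion(nn.Module):
    """SF expert sketch: each modality passes through its own SNN block
    (Linear -> ELU -> AlphaDropout) and the two outputs are summed."""

    def __init__(self, dim: int, p_drop: float = 0.25):
        super().__init__()

        def snn():
            return nn.Sequential(nn.Linear(dim, dim), nn.ELU(), nn.AlphaDropout(p_drop))

        self.snn_m, self.snn_r = snn(), snn()

    def forward(self, m: torch.Tensor, r: torch.Tensor) -> torch.Tensor:
        # Mean-pool the reference tokens so the sum broadcasts over M's tokens;
        # this shape handling is an assumption, not taken from the paper.
        return self.snn_m(m) + self.snn_r(r).mean(dim=0, keepdim=True)
```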

DropF2Fusion

This expert discards \(R\) during the fusion process, meaning it does not use the information contained in \(R\) at all. It is particularly useful when a single modality alone is sufficient for accurate prediction. Mathematically, the DropF2Fusion (DF) expert is defined as follows,

\[DF(M, R) = M\]
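
Putting the pieces together, the sketch below shows how a MoME layer could route each sample through the single expert chosen by the gate, reusing the expert and gate sketches above; this wiring is hypothetical and only illustrates the hard-selection idea.

```python
import torch
import torch.nn as nn

# Reuses MoMEGate, TransFusion, BottleneckTransFusion, and SNNFusion
# from the sketches above.


class DropF2Fusion(nn.Module):
    """DF expert: ignore the reference modality entirely."""

    def forward(self, m: torch.Tensor, r: torch.Tensor) -> torch.Tensor:
        return m


class MoMELayer(nn.Module):
    """One MoME layer: the gate picks a single expert per sample, and the
    chosen expert encodes M with R as the reference."""

    def __init__(self, dim: int):
        super().__init__()
        self.experts = nn.ModuleList([
            TransFusion(dim),            # full use of the reference modality
            BottleneckTransFusion(dim),  # reference used only through bottleneck tokens
            SNNFusion(dim),              # modalities encoded separately, weak reference use
            DropF2Fusion(),              # reference ignored
        ])
        self.gate = MoMEGate(dim, num_experts=len(self.experts))

    def forward(self, m: torch.Tensor, r: torch.Tensor) -> torch.Tensor:
        idx = self.gate(m, r)            # hard selection of one expert
        return self.experts[idx](m, r)
```

Instantiating the stages of the earlier BPE sketch as `MoMELayer` objects then lets different samples, or the same sample at different depths, be routed to different experts, as described above.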

Why Self-Attention Outperforms Co-Attention

Mathematically, we can conclude that co-attention is a suboptimal substitute for self-attention. Thus, we design expert models based on self-attention rather than co-attention. The detailed proof can be found in our paper.

Experiments

Comparison with Other Methods

We collected common survival analysis datasets from TCGA, including TCGA-BLCA, TCGA-UCEC, and TCGA-LUAD. We evaluated the performance of different models on these three datasets, as shown in the table below. Detailed information about the datasets can be found in our paper or on the TCGA website.


Table 1. C-index results of different methods on three TCGA datasets. The best results are underlined in red, and the second-best results are italicized in blue. “Geno.” indicates using genomic data, and “Patho.” indicates using WSI data.

We conducted comprehensive comparisons of our method with single-modality and other multimodal methods; the results are presented in Table 1. Our method consistently outperforms all other methods across all datasets, with particularly significant advantages on the UCEC and LUAD datasets. Compared to previous methods, our method improves the C-index by 0.4%, 1.4%, and 1.5% on the three datasets, respectively, and overall performance improves by 1.4%. These results demonstrate that our MoME is suitable for general survival analysis settings. Additionally, multimodal methods consistently outperform single-modality methods, further confirming the effectiveness and necessity of combining multiple forms of data.

Expert Model Selection

We conducted experiments disabling each expert individually to evaluate its contribution. The results are presented in the left part of Table 2. With all experts enabled, our MoME performs best on LUAD and second-best on UCEC, while the variant without the SNNFusion expert performs best on UCEC. These results indicate that the features most important for accurately predicting survival differ between UCEC and LUAD, suggesting that selectively focusing on specific modalities with MoME can be beneficial. This further confirms the value of MoME's ability to dynamically select different experts for different samples. Additionally, the comparison between the full MoME and the TransFusion-only variant again confirms MoME's superiority.

Sensitivity Analysis of BTF

We conducted a sensitivity analysis on the UCEC and LUAD datasets by varying the number of bottleneck features used in BTF from 1 to 16. The results are presented in the right part of Table 2. Our MoME achieves the best overall performance with 2 bottleneck features, balancing performance across both datasets. Specifically, with 8 bottleneck features, MoME performs best on UCEC but worst on LUAD.


Table 2. Ablation study results for our MoME using different experts (left) and sensitivity analysis of the number of bottleneck features (right). “T.” represents TransFusion, “B.” represents Bottleneck TransFusion, “S.” represents SNNFusion, “D.” represents DropF2Fusion, and “#B.” indicates the number of bottleneck features.

Conclusion

In this paper, we introduced a BPE paradigm and a MoME layer for cancer survival analysis based on WSI and genomic data. The BPE paradigm achieves deep integration by simultaneously performing feature encoding and fusion, using one modality as a reference to encode the other. With this, our BPE can address the severe heterogeneity issue between WSI and genomic features. Additionally, our MoME layer dynamically selects the most suitable experts to model complex intra- and inter-modal interactions, addressing the challenges posed by varying key features. Through extensive experiments, we demonstrated that our method outperforms other multimodal learning methods in survival prediction, showing that our approach can be applied to general survival analysis settings.