[IEEE-TMI] Boosting Convolution with Efficient MLP-Permutation for Volumetric Medical Image Segmentation

A study from the HKUST Smart Lab Lab has been accepted by IEEE Transactions on Medical Imaging. The study proposes a permutable hybrid network, named PHNet, for 3D volumetric medical image segmentation. PHNet capitalizes on the strengths of both convolution neural networks (CNNs) and MLP. PHNet outperformed the state-of-the-art methods with lower computational costs on the widely-used yet challenging COVID-19-20, Synapse, LiTS and MSD BraTS benchmarks. The code is publicly available.

Background

Medical image segmentation (MedSeg) is a fundamental yet challenging task in medical image analysis. The primary objective of MedSeg lies in the precise identification and localization of semantic lesions or anatomical structures within medical images obtained through imaging modalities such as Computed Tomography, X-ray, and Magnetic Resonance Imaging, and it already serves as an important auxiliary tool in various clinical applications, such as disease diagnosis, treatment planning, and monitoring. Despite the developments in MedSeg domain via Transformer methods, the computational complexity of selfattention increases quadratically with the size of input images. The heavy computational costs of Transformer methods limit their practical application, particularly for 3D volumetric medical images, especially in high resolution images with dense features across scales that necessitate a substantial number of forward and backward passes.

Method

PHNet embodies an encoder-decoder paradigm. Notably, the encoder utilizes a 2.5D CNN structure that capitalizes on the inherent anisotropy of medical images, while avoiding information loss in shallow layers by capturing the varying information density in different directions of volumetric medical images. A Multi-Layer Permute Perceptron module is further proposed that can maintain the positional information with the axial decomposition operation while integrating global interdependence in a computationally efficient manner. To enhance computational efficiency, token-group operation is introduced, which efficiently aggregates feature maps at a token level, reducing the number of computations required.

PHNet

Fig. 1. Illustration of multi-layer permute perceptron (MLPP) module.

Experiment

Comparisons with State-of-the-Arts Experiments are conducted on four publicly available datasets: COVID-19-20, Synapse, LiTS, and MSD BraTS. PHNet consistently outperforms state-of-the-art methods on four datasets.

Table 1. Result comparisons with the state-of-the-art methods on Synapse. Table 1

Table 2. Result comparisons with the state-of-the-art methods and top-13 solutions on COVID-19-20. Table 2

Table 3. Result comparisons with state-of-the-art methods on LiTS and MSD BraTS. Table 3

Quantitative comparisons

Among the compared methods, PHNet demonstrates the best visual segmentation results when compared to nnUNet, TransUNet, UNext, and Mixer. Upon examination, it becomes evident that our PHNet achieves the most exceptional segmentation results, characterized by both semantic consistency and boundary integrity.

Figure 2

Fig. 2. Qualitative visualizations of different methods on COVID-19-20. The red boxes highlight the main differences.

Figure 3

Fig. 3. Qualitative visualizations of different methods on the Synapse. Regions of evident improvements are enlarged to show better details.

Ablation Study

Table 4 validates the effectiveness of the core components in the PHNet network model. By incrementally adding each basic component, this paper conducted experiments on the Synapse dataset. Compared with the baseline model, each core component significantly improves the performance, validating the effectiveness of the PHNet network model.

Table 4. Result comparisons with different effects of each component on Synapse.

Table 4

Comparisons with conventional architectures

The results show that MLPP augments CNN performance while upholding efficiency. While Transformer showcases competitive performance when configured with stacked blocks, our method shows significant superiority in terms of efficiency.

Comparisons with architecture combinations

We further explored different combinations of CNN, Transformer, and MLP across both shallow and deep layers of the encoder. The best performance was achieved by using CNN in shallow layers and MLP in deep layers. This supports our argument that CNN excels in capturing local features, while MLP is more effective in modeling long-range dependencies. Interestingly, the inverse configuration led to a substantial performance decline, indicating the importance of layer-specific functionality.

Figure 4

Fig. 4. (a) Comparisons between conventional architectures in terms of effectiveness and efficiency; (b) Comparison between architecture combinations.

Conclusion

This paper introduced a novel permutable hybrid network, referred to as PHNet, specifically designed for volumetric medical image segmentation. By integrating 2D CNN, 3D CNN, and MLP, PHNet effectively captures both local and global features. Additionally, a permutable MLP block is proposed to address spatial information loss and alleviate computational burden. Experimental results on four public datasets demonstrate the superiority of PHNet over state-of-the-art approaches. Future research will explore extending the framework to other medical image analysis tasks, such as disease diagnosis and localization, and further examine the interactions and effectiveness of CNN, Transformer, and MLP.