Recently, the HKUST Smart Lab team, in collaboration with CUHK, Southern Medical University, Sun Yat-sen University, Jinfeng Laboratory, and the Guangdong Provincial Key Laboratory of Molecular Tumor Pathology, completed a project on pathology foundation models that leverages multimodal knowledge to enhance these models while achieving whole-slide-level pretraining. This work offers a feasible solution for training whole-slide pathology foundation models.
In the past year, several excellent works on pathology foundation models have emerged [1-4], demonstrating strong generalization across numerous clinical tasks. However, two challenges remain unresolved:
Figure 1: Workflow involving three modalities in clinical practice: pathology slides, reports, and gene expression profiles.
Neglect of Clinical Multimodal Data: Existing works are limited to vision-only [2] or vision-caption [1, 3] data, where captions mostly derive from Twitter [1] or annotations from textbooks or papers [3]. Clinically, as shown in Figure 1, pathology reports and gene expression profiles play significant roles in many tasks. These data, closer to clinical practice, have been overlooked in developing pathology foundation models. Pathology reports provide detailed descriptions, and gene expression profiles offer quantitative molecular information to supplement the morphological information from pathology slides. Combining multimodal data can provide a more comprehensive perspective, enhancing the model’s generalization capability.
Patch/ROI-Level Modeling Limitation: Current works focus on patch/ROI-level modeling, which restricts the context available for whole-slide applications. Typically, each patch image is treated as an independent sample to train a patch extractor, which is then used to extract patch features for whole-slide prediction via multiple instance learning (MIL). Recent attempts to train whole-slide pathology foundation models [4] still rely on a pre-trained patch extractor, so the model's generalization remains bounded by the quality of the patch features.
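For concreteness, below is a minimal PyTorch sketch of this standard patch-then-MIL pipeline, using a simplified (non-gated) variant of attention-based MIL pooling in the spirit of ABMIL [5]. All dimensions, class names, and the bag size are illustrative assumptions, not the configuration of any cited model.

```python
import torch
import torch.nn as nn

class AttentionMILHead(nn.Module):
    """Simplified (non-gated) attention-MIL aggregator over pre-extracted
    patch features. The patch extractor itself stays frozen in this paradigm,
    which is why its feature quality bounds slide-level performance."""

    def __init__(self, feat_dim=1024, hidden_dim=256, num_classes=2):
        super().__init__()
        self.attn = nn.Sequential(
            nn.Linear(feat_dim, hidden_dim),
            nn.Tanh(),
            nn.Linear(hidden_dim, 1),
        )
        self.classifier = nn.Linear(feat_dim, num_classes)

    def forward(self, patch_feats):                            # (num_patches, feat_dim)
        weights = torch.softmax(self.attn(patch_feats), dim=0)  # attention over patches
        slide_feat = (weights * patch_feats).sum(dim=0)         # (feat_dim,)
        return self.classifier(slide_feat), slide_feat

# Usage: one WSI treated as a "bag" of features from a frozen patch encoder.
bag = torch.randn(5000, 1024)   # hypothetical bag of 5,000 patch features
logits, slide_feat = AttentionMILHead()(bag)
```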
Data: To address these challenges, we curated a large multimodal dataset from The Cancer Genome Atlas (TCGA), including pathology slides, reports, and RNA-Seq data, covering 32 cancer types with 26,169 modality pairs from 10,275 patients (Figures 3c-e).
Method: We propose a novel whole-slide pretraining paradigm, Multimodal Self-TAught PRetraining (mSTAR), to inject multimodal knowledge into pathology foundation models, expanding the modeling context to the whole-slide level (Figure 2).
In the first phase, a whole-slide aggregator is trained with contrastive learning to absorb multimodal knowledge: it pools pre-extracted patch features from a whole-slide image (WSI) into a slide-level representation, which is then aligned with the paired pathology report and gene expression profile through contrastive learning. This aggregator serves as a bridge in the second phase, infusing the absorbed knowledge into the patch extractor.
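Conceptually, the alignment in this phase can be pictured as a CLIP-style symmetric InfoNCE objective between slide embeddings and the paired report and gene-expression embeddings. The sketch below is an assumption about the form of the loss, not the exact mSTAR objective; the encoders, batch size, and dimensions are placeholders.

```python
import torch
import torch.nn.functional as F

def symmetric_infonce(slide_emb, other_emb, temperature=0.07):
    """CLIP-style symmetric InfoNCE between a batch of slide embeddings and a
    batch of paired embeddings from another modality (same patient order)."""
    slide_emb = F.normalize(slide_emb, dim=-1)
    other_emb = F.normalize(other_emb, dim=-1)
    logits = slide_emb @ other_emb.t() / temperature              # (batch, batch)
    targets = torch.arange(logits.size(0), device=logits.device)  # matched pairs on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

# Hypothetical batch: aggregated slide features plus report and RNA-Seq
# embeddings produced by separate (placeholder) encoders.
slide, report, gene = (torch.randn(8, 512) for _ in range(3))
loss = symmetric_infonce(slide, report) + symmetric_infonce(slide, gene)
```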
In the second phase, the pre-trained aggregator transfers the learned multimodal knowledge to the patch extractor through self-taught training: the aggregator acts as a teacher and supervises the patch extractor, pushing the features it extracts to be as similar as possible to those re-extracted by the trained aggregator.
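One hedged reading of this self-taught step is a teacher-student feature-matching loss, where the frozen phase-1 aggregator defines the target slide representation and the patch extractor is updated to reproduce it. The exact loss, and which components are frozen, may differ in mSTAR; all modules and sizes below are toy stand-ins.

```python
import torch
import torch.nn.functional as F

def self_taught_loss(patch_extractor, frozen_aggregator, patches, ref_patch_feats):
    """Hypothetical feature-matching objective for the self-taught phase: the
    trainable patch extractor is pushed toward the slide representation that
    the frozen, multimodally pre-trained aggregator produces from reference
    patch features (e.g. those of the original extractor)."""
    with torch.no_grad():
        target = frozen_aggregator(ref_patch_feats)        # teacher slide feature
    student_feats = patch_extractor(patches)               # (num_patches, dim)
    student_slide = frozen_aggregator(student_feats)       # re-aggregate student features
    return 1.0 - F.cosine_similarity(student_slide, target, dim=-1).mean()

# Toy usage with downsized patches and placeholder modules.
extractor = torch.nn.Sequential(torch.nn.Flatten(1), torch.nn.Linear(3 * 32 * 32, 512))
aggregator = lambda feats: feats.mean(dim=0)               # stand-in for attention pooling
patches = torch.randn(64, 3, 32, 32)
ref_feats = torch.randn(64, 512)
self_taught_loss(extractor, aggregator, patches, ref_feats).backward()
```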
Figure 2: mSTAR pretraining framework
Results: Our pretraining method revolutionizes the pretraining paradigm for pathology foundation models, endowing them with strong whole-slide capabilities. To our knowledge, this is the first attempt to inject whole-slide multimodal knowledge into pathology foundation models, extending context modeling from a single modality to multiple modalities and from the patch level to the whole-slide level.
We conducted experiments on seven major task categories, encompassing 43 clinical sub-tasks, achieving the most comprehensive clinical task evaluation to date. The results show that mSTAR excels in both single-modality and multimodal tasks, with statistically significant improvements, particularly in multimodal applications (Figures 3f-g). These findings demonstrate the powerful capability of whole-slide multimodal knowledge, enhancing the model’s generalization in diverse downstream tasks (Figures 4-9).
Figure 3: Pretraining data information (c-d) and mSTAR performance overview (f-g)
Whole-Slide Classification: mSTAR ranks consistently at the top. With ABMIL [5] as the backbone, it outperforms other models in 11 out of 12 sub-tasks (Figure 4c) and ranks first overall (Figure 4a). With TransMIL [6] as the backbone, it leads in 9 out of 12 sub-tasks (Figure 4d), again ranking first overall. When combined with the pre-trained aggregator, mSTAR+ surpasses other SOTA pathology foundation models in 10 out of 12 sub-tasks (Figure 4d).
In terms of Macro-AUC, mSTAR shows an overall improvement of +1.3% (P < 0.001) over the next best model, UNI, with the average Macro-AUC rising from 0.905 to 0.908 (P < 0.001) under ABMIL and from 0.870 to 0.892 (+2.2%, P < 0.005) under TransMIL.
For diagnostic tasks, the improvement is more pronounced with ABMIL, where mSTAR outperforms the next best method on all diagnostic datasets (P < 0.001, Figure 4c). Similar trends are observed with TransMIL (Figure 4d), with significant differences on 4 out of 6 datasets. When combined with the pre-trained aggregator, mSTAR+ improves further across datasets, achieving the best results in 5 out of 6 diagnostic tasks (Figures 4d and e). For instance, on the CAMELYON dataset, mSTAR+ outperforms the second-best foundation model (FM) by +2.01% (P < 0.001).
Additionally, an interesting finding from Figure 4e is that the pre-trained aggregator can be combined with other patch extractors to achieve significant performance gains: in most tasks it consistently outperforms other methods across different FMs. Notably, the pre-trained aggregator yields large gains over the from-scratch aggregator, with improvements of up to +14.94% (P < 0.001), despite slight decreases in a few specific sub-tasks (at most -1.15%).
For molecular prediction with ABMIL, mSTAR achieves the best results in 5 out of 6 tasks (Figure 4c). With TransMIL as the backbone (Figure 4d), mSTAR shows an overall improvement of +3.71%, including significant gains of +4.60% and +4.21% (P < 0.001) on the BRCA-Molecular and CRC-Molecular tasks, respectively. When combined with the pre-trained aggregator (TransMIL+), mSTAR+ improves further (Figures 4d and e), with gains of +6.35%, +2.93%, +1.82%, and +1.5% on CRC-Molecular, BCNB-HER2, GBMLGG-IDH1, and BCNB-ER, respectively.
The results of the pre-trained aggregator with different pathology patch extractors on molecular prediction show a different pattern from the diagnostic tasks. In diagnostic tasks, the pre-trained aggregator did not significantly improve CONCH or UNI. In molecular prediction, however, it significantly enhanced their performance: CONCH gains +13.10% and +7.70% (P < 0.001) on CRC-Molecular and BRCA-Molecular, respectively, and UNI gains +9.00% and +6.29% (P < 0.001) on BRCA-Molecular and BCNB-ER, respectively. This indicates that incorporating RNA-Seq data during pretraining effectively improves molecular prediction. The smaller gains seen when mSTAR itself is combined with the pre-trained aggregator suggest that mSTAR has already acquired gene-related knowledge through self-taught training and therefore performs well on molecular prediction on its own.
Figure 4: Results for 12 whole-slide classification tasks, including 6 diagnostic tasks and 6 molecular prediction tasks.
Survival Analysis: mSTAR improves performance consistently. With ABMIL, it achieves the best performance in 7 out of 9 sub-tasks (Figure 5c); with TransMIL, it leads in 8 out of 9 sub-tasks (Figure 5d) and ranks first overall (Figure 5a). Significant improvements are observed on UCEC, LUAD, LUSC, and SKCM, with gains of +3.87%, +3.34%, +2.89%, and +1.46%, respectively.
When combined with the pre-trained aggregator (Figures 5d and e), mSTAR+ shows significant improvements over the second-best model on SKCM, UCEC, BRCA, KIRC, and CRC, with gains of +5.23%, +4.9%, +3.0%, +2.69%, and +2.33% (P < 0.001), respectively.
The results of the pre-trained aggregator with different extractors (Figure 5e) show that it further improves survival analysis performance across extractors, with gains of up to +7.79% (P < 0.001), indicating that the pathology reports and gene data introduced during pretraining contribute substantially to survival analysis. Similarly, when combined with the pre-trained aggregator, PLIP and CONCH, which lack gene and pathology report knowledge, show larger gains than mSTAR, supporting the hypothesis that this knowledge has already been transferred to mSTAR through self-taught training, which is why it performs well on survival analysis on its own.
Figure 5: Results for 9 survival analysis tasks.
Multimodal Survival Analysis: To verify that mSTAR’s multimodal pretraining helps alleviate modality heterogeneity, we replaced the pathology feature inputs of several SOTA multimodal models (MCAT [7], Porpoise [8], MOTCat [9], and CMTA [10]) with features extracted by the corresponding pathology foundation models for performance comparison.
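Schematically, this evaluation protocol keeps each multimodal survival model fixed and swaps only the source of pathology features. The sketch below uses placeholder names and toy modules, not the actual MCAT/Porpoise/MOTCat/CMTA interfaces.

```python
import torch

def risks_with_extractor(multimodal_model, patch_extractor, slides, omics):
    """Re-score a cohort with a fixed multimodal survival model while swapping
    in patch features from the pathology foundation model under test."""
    risks = []
    for patches, gene_profile in zip(slides, omics):
        with torch.no_grad():
            feats = patch_extractor(patches)   # pathology features from the chosen FM
        risks.append(multimodal_model(feats, gene_profile))
    return torch.stack(risks)                  # later summarized with the concordance index

# Toy usage: a trivial fusion function stands in for a real multimodal model.
extractor = torch.nn.Linear(3 * 32 * 32, 512)
model = lambda feats, genes: (feats.mean(0)[:256] * genes).sum()
slides = [torch.randn(100, 3 * 32 * 32) for _ in range(4)]
omics = [torch.randn(256) for _ in range(4)]
print(risks_with_extractor(model, extractor, slides, omics))
```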
Across the different multimodal models, mSTAR's rank ranges from 1.22 to 1.67 (average 1.47), well ahead of the second-best model (ranks between 2.56 and 3.00, average 2.68). It achieves the top average rank across all 9 sub-tasks and all 4 multimodal models (Figure 6a), with an average performance improvement of +1.5% (P < 0.001).
Figure 6: Results for multimodal survival analysis.
Few-Shot Whole-Slide Classification: Across 6 whole-slide classification sub-tasks, mSTAR ranks first overall (Figure 7b), achieving the best performance in 4 out of 6 sub-tasks. Compared to the second-best model, it improves by +4.1%, +1.1%, and +1.7% on CAMELYON, UBC-OCEAN, and BCNB-PR, respectively.
Figure 7: Results for few-shot whole-slide classification.
Zero-Shot Whole-Slide Classification: Overall, mSTAR achieves the best results in half of the sub-tasks, ranking first overall (Figure 8b), with an average improvement of +3.1% (P < 0.001) (Figure 8c). Compared to vision-only models, mSTAR achieves better results in 4 out of 6 sub-tasks (Figure 8c).
Significant improvements are observed in BCNB-ER, CAMELYON, and BCNB-PR tasks, with enhancements of +9.4%, +5.7%, and +4.0% (P < 0.001), respectively.
Figure 8: Results for zero-shot whole-slide classification.
Pathology Report Generation: We adopted our previous pathology report generation model, HistGen [11], replacing its pathology features with features extracted by various pathology foundation models for downstream task validation.
Quantitatively, mSTAR achieves the best results across all 6 metrics (BLEU@1-4, METEOR, and ROUGE-L) (Figure 9a).
Qualitatively (Figure 9b), mSTAR captures the most diagnostic facts, while CONCH, PLIP, and R50 exhibit numerous factual errors. Compared to the second-best model, UNI, mSTAR additionally generates immunohistochemical information, such as HER2 status, and provides more accurate staging information; this improvement likely stems from the gene-related knowledge incorporated during pretraining. Detailed results are given in the supplementary material.
Figure 9: Results for pathology report generation.
[1] Zhi Huang, Federico Bianchi, Mert Yuksekgonul, Thomas J Montine, and James Zou. A visual–language foundation model for pathology image analysis using medical Twitter. Nature Medicine, 29(9):2307–2316, 2023.
[2] Richard J Chen, Tong Ding, Ming Y Lu, Drew FK Williamson, Guillaume Jaume, Andrew H Song, Bowen Chen, Andrew Zhang, Daniel Shao, Muhammad Shaban, et al. Towards a general-purpose foundation model for computational pathology. Nature Medicine, 30(3):850–862, 2024.
[3] Ming Y Lu, Bowen Chen, Drew FK Williamson, Richard J Chen, Ivy Liang, Tong Ding, Guillaume Jaume, Igor Odintsov, Long Phi Le, Georg Gerber, et al. A visual-language foundation model for computational pathology. Nature Medicine, 30(3):863–874, 2024.
[4] Hanwen Xu, Naoto Usuyama, Jaspreet Bagga, Sheng Zhang, Rajesh Rao, Tristan Naumann, Cliff Wong, Zelalem Gero, Javier González, Yu Gu, et al. A whole-slide foundation model for digital pathology from real-world data. Nature, pages 1–8, 2024.
[5] Maximilian Ilse, Jakub Tomczak, and Max Welling. Attention-based deep multiple instance learning. In International Conference on Machine Learning, pages 2127–2136. PMLR, 2018.
[6] Zhuchen Shao, Hao Bian, Yang Chen, Yifeng Wang, Jian Zhang, Xiangyang Ji, et al. TransMIL: Transformer based correlated multiple instance learning for whole slide image classification. Advances in Neural Information Processing Systems, 34:2136–2147, 2021.
[7] Richard J Chen, Ming Y Lu, Wei-Hung Weng, Tiffany Y Chen, Drew FK Williamson, Trevor Manz, Maha Shady, and Faisal Mahmood. Multimodal co-attention transformer for survival prediction in gigapixel whole slide images. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4015–4025, 2021.
[8] Richard J Chen, Ming Y Lu, Drew FK Williamson, Tiffany Y Chen, Jana Lipkova, Zahra Noor, Muhammad Shaban, Maha Shady, Mane Williams, Bumjin Joo, et al. Pan-cancer integrative histology-genomic analysis via multimodal deep learning. Cancer Cell, 40(8):865–878, 2022.
[9] Yingxue Xu and Hao Chen. Multimodal optimal transport-based co-attention transformer with global structure consistency for survival prediction. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 21241–21251, 2023.
[10] Fengtao Zhou and Hao Chen. Cross-modal translation and alignment for survival analysis. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 21485–21494, 2023.
[11] Zhengrui Guo, Jiabo Ma, Yingxue Xu, Yihui Wang, Liansheng Wang, and Hao Chen. HistGen: Histopathology report generation via local-global feature encoding and cross-modal context interaction. In MICCAI, 2024.
For more results, please refer to our preprint paper.
If you have any questions or suggestions, please feel free to contact us. To cite this work, please use the following BibTeX entry:
@article{mstar,
  title={A Multimodal Knowledge-enhanced Whole-slide Pathology Foundation Model},
  author={Xu, Yingxue and Wang, Yihui and Zhou, Fengtao and Ma, Jiabo and Yang, Shu and Lin, Huangjing and Wang, Xin and Wang, Jiguang and Liang, Li and Han, Anjia and Chan, Ronald Cheong Kin and Chen, Hao},
  journal={arXiv preprint arXiv:2407.15362},
  year={2024}
}