Recently, the HKUST Smart lab team, in collaboration with Sun Yat-sen University, Southern Medical University, CUHK, and Jinfeng Laboratory, has completed a project on pathology foundation models. To fairly compare existing foundation models, we established the most comprehensive benchmark to date, evaluating off-the-shelf foundation models across six distinct clinical task types encompassing a total of 39 specific tasks. Our findings reveal that existing foundation models excel at certain task types but struggle to handle the full breadth of clinical tasks effectively.
Figure 1: Overview of GPFM. GPFM is a state-of-the-art pretrained foundation model that demonstrates exceptional performance across 39 diverse clinical tasks. a. The GPFM dataset comprises a large-scale collection of 86,104 slides spanning 34 major tissue types, enabling comprehensive model training and evaluation. b. Overall performance of different foundation models across 39 tasks. GPFM outperforms other leading foundation models, achieving the best average rank of 1.36 (first place in 29 of the 39 tasks). c. Overview of the unified knowledge distillation used for GPFM. The experts used for Expert Knowledge Distillation are selected based on their average performance across the six clinical task types. The pretraining algorithm includes three key components: 1) Masked Image Modeling (MIM), 2) Self-Distillation, and 3) Expert Knowledge Distillation. The parameters of GPFM are updated through Exponential Moving Average (EMA). d. The versatility of GPFM is showcased through its application to a wide range of downstream clinical tasks, including whole slide image classification, patch-level tissue classification, survival analysis, pathological tissue retrieval, visual question answering, and pathology report generation.
To improve the generalization of pathology foundation models, we propose a unified knowledge distillation framework consisting of both expert and self knowledge distillation: the former allows the model to learn from multiple expert models, while the latter enables image representation learning via local-global alignment. Based on this framework, a Generalizable Pathology Foundation Model (GPFM) is pretrained on a large-scale dataset of 190 million images from around 86,000 public H&E whole slides spanning 34 major tissue types. Evaluated on the established benchmark, GPFM achieves an impressive average rank of 1.36, ranking 1st on 29 tasks, while the second-best model, UNI, attains an average rank of 2.96 and ranks 1st on only 4 tasks. The superior generalization of GPFM demonstrates its exceptional modeling capability across a wide range of clinical tasks, positioning it as a new cornerstone for feature representation in computational pathology (CPath). An overview of GPFM is presented in Fig. 1.
Data: To build the benchmark and pretrain GPFM, we collected 86,104 slides covering 34 major tissue types from 47 sources. An overview of the data used in this project is shown in Fig. 1a. After processing, 190 million patches were extracted for pretraining.
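As a rough illustration of this preprocessing step, the sketch below tiles a slide into fixed-size patches and discards mostly-background tiles. It is a minimal, hypothetical example using OpenSlide; the patch size and tissue threshold are illustrative assumptions, not the project's actual settings.

```python
# Hypothetical WSI tiling sketch (illustrative only, not the exact GPFM pipeline).
# Assumes openslide-python and numpy; patch size and tissue threshold are made up.
import numpy as np
import openslide

def tile_wsi(slide_path, patch_size=256, tissue_thresh=0.5):
    """Yield (x, y, patch) tuples for level-0 tiles that contain enough tissue."""
    slide = openslide.OpenSlide(slide_path)
    width, height = slide.dimensions  # level-0 size
    for y in range(0, height - patch_size + 1, patch_size):
        for x in range(0, width - patch_size + 1, patch_size):
            region = slide.read_region((x, y), 0, (patch_size, patch_size))
            rgb = np.asarray(region.convert("RGB"))
            # Crude background filter: keep tiles with enough non-white pixels.
            if (rgb.mean(axis=2) < 220).mean() >= tissue_thresh:
                yield x, y, rgb
```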
Pretraining Strategy: In the field of CPath, considerable effort has been devoted to pretraining foundation models that learn inherent representations of histopathology images, catering to the diverse array of tasks encountered in clinical pathology practice. However, current foundation models have only been evaluated on a limited range of task types, as shown in Fig. 2a, leaving their overall performance unclear.
To comprehensively evaluate these models, we built the most comprehensive benchmark to date, spanning six major clinical task categories and comprising 39 specific tasks. Our findings reveal that the generalization ability of these models is still limited, and no single model can effectively address all the tasks, as shown in Fig. 1b. This can be attributed to the fact that each foundation model is trained with distinct datasets and pretraining strategies, giving each model specific advantages on particular datasets. These findings highlight the need for further research into more generalizable foundation models that perform consistently well across the diverse types of clinical tasks encountered in CPath. By addressing this challenge, we can unlock the full potential of foundation models in CPath.

To improve the generalization of pathology foundation models and enhance overall performance, an intuitive idea is to leverage the specific strengths of existing models through knowledge distillation. Accordingly, we propose a novel self-supervised learning framework with expert and self knowledge distillation to develop a Generalizable Pathology Foundation Model (GPFM). The framework of our proposed pretraining method is illustrated in Fig. 1c. Similar to DINOv2, we employ teacher-student networks with a masked image modeling (MIM) loss and a DINO (self-distillation) loss to optimize the student network. Specifically, given an input image \(x\), we obtain two augmented global views, \(u\) and \(v\).
Random masking is then applied to both \(u\) and \(v\), resulting in masked views \(\hat{u}\) and \(\hat{v}\). For the MIM objective, the student network takes \(\hat{u}\) and \(\hat{v}\) as inputs and aims to predict the masked tokens. For the DINO objective, we first crop \(n\) additional local views \(w_i\) and extract their class ([CLS]) tokens using the student network. Next, we obtain the [CLS] tokens of the global views (\(u\) and \(v\)) using the teacher network. Finally, we compute the cross-entropy loss between the [CLS] tokens of the local and global views. However, this strategy alone does not leverage the knowledge embedded in existing vision foundation models such as UNI or vision-language foundation models such as CONCH, limiting the resulting model's applicability across tissue types. To transfer knowledge from established pathology foundation models, we introduce an Expert Knowledge Distillation module that distills their knowledge into the student network. Specifically, we adopt three foundation models that excel at classification (UNI), survival analysis (Phikon), and visual question answering (CONCH). For distillation, we use the student network to encode the global views \(u\) and \(v\) and extract their [CLS] and [PATCH] tokens, and we use the off-the-shelf foundation models (UNI, Phikon, CONCH) to obtain their respective [CLS] and [PATCH] tokens. The class tokens are aligned using cosine similarity, while the patch tokens are aligned using both cosine similarity and smooth L1 distance.
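To make the expert alignment concrete, here is a minimal PyTorch-style sketch of the Expert Knowledge Distillation loss and the EMA teacher update, assuming the student and expert tokens have already been projected to the same dimension. It illustrates the idea rather than reproducing the actual GPFM implementation, and the equal loss weighting is a made-up choice.

```python
# Minimal sketch of Expert Knowledge Distillation and the EMA update (illustrative).
# In the full objective this term is computed for each expert (UNI, Phikon, CONCH)
# and added to the MIM and DINO (self-distillation) losses.
import torch
import torch.nn.functional as F

def expert_distill_loss(student_cls, student_patch, expert_cls, expert_patch):
    """Align the student's tokens with a frozen expert's tokens.

    student_cls / expert_cls:     (B, D)    [CLS] tokens of the global views
    student_patch / expert_patch: (B, N, D) [PATCH] tokens of the global views
    """
    # [CLS] alignment: maximize cosine similarity.
    cls_loss = 1.0 - F.cosine_similarity(student_cls, expert_cls, dim=-1).mean()
    # [PATCH] alignment: cosine similarity plus smooth L1 distance.
    patch_cos = 1.0 - F.cosine_similarity(student_patch, expert_patch, dim=-1).mean()
    patch_l1 = F.smooth_l1_loss(student_patch, expert_patch)
    return cls_loss + patch_cos + patch_l1  # equal weights are an assumption

@torch.no_grad()
def ema_update(teacher, student, momentum=0.996):
    """Update the teacher parameters as an exponential moving average of the student."""
    for t, s in zip(teacher.parameters(), student.parameters()):
        t.data.mul_(momentum).add_(s.data, alpha=1.0 - momentum)
```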
Across all tasks, GPFM achieved the best average rank of 1.36 (first place in 29 out of 39 tasks), outperforming the second-best model, UNI, which had an average rank of 2.96 (first place in 4 out of 39 tasks). In addition, we simply averaged the evaluation metrics across all 39 tasks: GPFM achieved an average score of 0.859, while the second-best model on this measure, Phikon, achieved 0.849. The results are presented in Fig. 2.
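For readers who want to reproduce this kind of summary, the average rank and average score are computed directly from a task-by-model results table. The sketch below uses a made-up `results` DataFrame and also shows the paired Wilcoxon signed-rank test referenced in Fig. 2; the numbers are purely illustrative.

```python
# Illustrative computation of average rank / average score across tasks.
# `results` is a hypothetical table: rows = tasks, columns = models,
# values = each task's primary metric (higher is better).
import pandas as pd
from scipy.stats import wilcoxon

results = pd.DataFrame(
    {
        "GPFM":   [0.96, 0.73, 0.95, 0.91, 0.99],
        "UNI":    [0.94, 0.71, 0.93, 0.90, 0.98],
        "Phikon": [0.93, 0.72, 0.94, 0.88, 0.97],
    },
    index=["task1", "task2", "task3", "task4", "task5"],
)

avg_rank = results.rank(axis=1, ascending=False).mean()    # 1 = best, averaged over tasks
avg_score = results.mean()                                 # simple metric average
stat, p_value = wilcoxon(results["GPFM"], results["UNI"])  # paired test across tasks
print(avg_rank, avg_score, p_value, sep="\n")
```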
Figure 2: Comprehensive Comparison of Foundation Models across 39 Diverse Tasks. a. Tasks evaluated by different foundation models. b. The average performance of foundation models across 39 tasks. The Wilcoxon signed-rank test is employed to detect significant differences between the off-the-shelf foundation models and the proposed GPFM. The figure shows that GPFM achieved the highest average performance. c. The average rank of foundation models across 39 downstream tasks. The Nemenyi test is utilized to assess the critical difference (CD) in the ranking scores of the various foundation models; in the CD diagram, models connected by the same black line are not significantly different. d. Overall ranking order of foundation models across 39 tasks. ext indicates external validation.
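For context, the critical difference in the Nemenyi test is typically computed as \(\mathrm{CD} = q_{\alpha}\sqrt{\frac{k(k+1)}{6N}}\), where \(k\) is the number of compared models, \(N\) is the number of tasks, and \(q_{\alpha}\) is the critical value of the studentized range statistic divided by \(\sqrt{2}\); two models whose average ranks differ by less than CD cannot be declared significantly different.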
Across the 12 WSI classification tasks, GPFM achieved an average rank of 1.08, outperforming the second-best model, UNI, which achieved an average rank of 3.04. Furthermore, we evaluated overall performance using average metrics, including AUC, balanced accuracy, and weighted F1 score. GPFM achieved the best average AUC of 0.956, a 1.4% improvement over the second-best model, CONCH, with statistical significance (P < 0.001). Similarly, GPFM achieved the best balanced accuracy of 0.833 (+1.9% over UNI, P < 0.001) and the best weighted F1 score of 0.834 (+1.9% over UNI, P < 0.001). The detailed results for the different foundation models are presented in Fig. 3.
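These three metrics can be computed with scikit-learn; the sketch below is a generic example on hypothetical labels and per-class probabilities. The multi-class averaging choices shown are common defaults, not necessarily the exact settings used in the benchmark.

```python
# Generic computation of AUC, balanced accuracy, and weighted F1 (illustrative).
import numpy as np
from sklearn.metrics import roc_auc_score, balanced_accuracy_score, f1_score

# Hypothetical ground-truth labels and per-class predicted probabilities.
y_true = np.array([0, 1, 2, 1, 0, 2])
y_prob = np.array([
    [0.8, 0.1, 0.1],
    [0.2, 0.7, 0.1],
    [0.1, 0.2, 0.7],
    [0.3, 0.5, 0.2],
    [0.6, 0.3, 0.1],
    [0.2, 0.2, 0.6],
])
y_pred = y_prob.argmax(axis=1)

auc = roc_auc_score(y_true, y_prob, multi_class="ovr", average="macro")
bal_acc = balanced_accuracy_score(y_true, y_pred)
w_f1 = f1_score(y_true, y_pred, average="weighted")
print(f"AUC={auc:.3f}  balanced accuracy={bal_acc:.3f}  weighted F1={w_f1:.3f}")
```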
Figure 3: The results of WSI classification.
Across all 12 survival analysis tasks, GPFM achieved an impressive average rank of 1.42, outperforming the competitors in 7 out of the 12 tasks. In comparison, the second-best model, Phikon, attained an average rank of 2.08, achieving the best performance in 4 out of the 12 tasks. Furthermore, when evaluated using the widely recognized C-Index metric, GPFM once again emerged as the top performer, achieving an average C-Index of 0.726, a statistically significant improvement of 0.6% over Phikon (P < 0.001), further confirming the generalization capability of GPFM for survival analysis tasks. The detailed ranking scores of the various foundation models are presented in Fig. 4.
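The C-Index measures the fraction of comparable patient pairs whose predicted risks are ordered consistently with their observed survival times. The sketch below is a minimal NumPy implementation of Harrell's C-Index for right-censored data, with toy inputs and simplified tie handling, shown only to illustrate the metric.

```python
# Minimal Harrell's C-Index for right-censored survival data (illustrative).
import numpy as np

def concordance_index(times, events, risks):
    """times: observed times; events: 1 = event, 0 = censored;
    risks: predicted risk scores (higher risk should mean shorter survival)."""
    times, events, risks = map(np.asarray, (times, events, risks))
    concordant, comparable = 0.0, 0
    for i in range(len(times)):
        if not events[i]:
            continue  # a pair is comparable only if the earlier time is an event
        for j in range(len(times)):
            if times[j] > times[i]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5  # ties in predicted risk count as half
    return concordant / comparable

# Toy usage with hypothetical values:
print(concordance_index([5, 8, 3, 9], [1, 0, 1, 1], [0.9, 0.3, 0.8, 0.1]))  # 0.8
```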
Figure 4: The results of survival analysis.
Across all 12 ROI classification tasks, GPFM achieved the best average rank of 1.60, while the second-best model, UNI, attained an average rank of 2.67. Furthermore, when considering conventional metrics, GPFM attained the highest average AUC of 0.955 (+0.2% over UNI, P < 0.001), the best weighted F1 score of 0.848 (+0.8% over UNI, P < 0.001), and the top balanced accuracy of 0.853 (+0.8% over UNI, P < 0.001). The comparison of the various foundation models is illustrated in Fig. 5.
Figure 5: The overall results of ROI classification.
In this study, we employ the CRC-100K dataset for the pathological image retrieval task. The experimental results for ROI retrieval are depicted in Fig. 6a. GPFM achieved the second-best Top-1 accuracy of 0.906 (-0.5% vs. UNI), but outperformed the other models in Top-3 and Top-5 accuracy, reaching 0.993 (+1.2% over UNI) and 0.995 (+1.2% over UNI), respectively. Overall, GPFM achieved the best retrieval performance among the compared foundation models.
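ROI retrieval with a frozen foundation model reduces to nearest-neighbour search over patch embeddings. The sketch below shows a generic Top-k accuracy computation using cosine similarity, with hypothetical query/gallery embeddings and tissue labels; it is not tied to the exact retrieval setup used in the benchmark.

```python
# Generic Top-k retrieval accuracy over embeddings (illustrative).
import numpy as np

def topk_accuracy(query_emb, query_labels, gallery_emb, gallery_labels, k=5):
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    g = gallery_emb / np.linalg.norm(gallery_emb, axis=1, keepdims=True)
    sims = q @ g.T                               # cosine similarity matrix (Q, G)
    topk = np.argsort(-sims, axis=1)[:, :k]      # indices of the k most similar items
    hits = (gallery_labels[topk] == query_labels[:, None]).any(axis=1)
    return hits.mean()

# Toy usage with random hypothetical embeddings and labels:
rng = np.random.default_rng(0)
acc = topk_accuracy(rng.normal(size=(10, 32)), rng.integers(0, 9, 10),
                    rng.normal(size=(100, 32)), rng.integers(0, 9, 100), k=5)
print(acc)
```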
To evaluate the effectiveness of foundation models for pathology VQA, we utilize the PathVQA dataset, the largest and most widely used VQA dataset in the pathology domain. The dataset consists of 32,799 image-question-answer triplets, divided into three subsets: a training set (50%, 16,400 triplets) for model training, a validation set (30%, 9,840 triplets) for hyperparameter tuning and overfitting prevention, and a test set (20%, 6,560 triplets) for final performance evaluation. The performance of the different foundation models on open-ended and closed-ended VQA questions is presented in Fig. 6d. Our model achieved the second-best performance, only slightly lower than CONCH. It is worth noting that CONCH is a vision-language foundation model trained on million-scale image-text pairs, which naturally gives it an advantage in VQA tasks. The results demonstrate the significant potential of our approach for VQA compared with other pure vision foundation models. Additionally, we provide visualizations of the query images, questions, and answers generated with different foundation models. As shown in the figure, GPFM and CONCH consistently produced more reliable answers than the other models evaluated. Through unified knowledge distillation, the knowledge CONCH acquired from millions of image-text pairs can be distilled into GPFM without using any of the original image-text data. The performance of GPFM demonstrates the potential of leveraging textual knowledge without directly utilizing text data.
Figure 6: Overview of Pathology Tissue Retrieval and VQA.
In the WSI report generation task, foundation models provide representations of WSIs as inputs to the report generation model. We use HistGen as the baseline, which takes the WSI features extracted by a foundation model as input and generates the corresponding diagnostic report. For training and evaluation, we split the dataset into train-validation-test cohorts with a ratio of 7:1:2 (5,383 : 769 : 1,538 slides). The performance of each foundation model on report generation is presented in Fig. 7. The experimental results show that Phikon achieved the best performance across all six metrics, while GPFM achieved comparable performance and ranked second. It is somewhat surprising that vision foundation models (e.g., Phikon and GPFM) performed much better on this task than vision-language foundation models such as CONCH and PLIP. A potential reason is that PLIP and CONCH were trained only with short descriptions or captions of pathological images, without access to global contextual information, which may make their text-image pairs less effective for this task than for the VQA tasks they were originally designed for. It is also worth noting that Phikon was pretrained exclusively on data from TCGA; while this may enhance its efficacy on tasks derived from TCGA, it also limits its generalization to tasks outside the TCGA domain. To leverage the complementary strengths of existing models, the proposed unified knowledge distillation approach can distill Phikon's report generation capability into GPFM. This synergistic integration allows us to combine the respective strengths of these foundation models, leading to a more generalizable model.
Figure 7: Performance of Report Generation.
In the self-supervised learning framework proposed in this study, we introduced a unified knowledge distillation module to transfer knowledge from off-the-shelf foundation models to GPFM during the pretraining stage. To assess the effectiveness of this module, we conducted an experiment in which we removed the Expert Knowledge Distillation module, reducing the framework to DINOv2. We trained both DINOv2 and GPFM on the same dataset and evaluated their performance on tissue classification tasks. The experimental results, presented in Fig. 8, clearly demonstrate the positive impact of Expert Knowledge Distillation on model performance across the 12 tasks.
Not only did the individual task performances improve significantly, but the average performance also improved, with notable gains on all three metrics: AUC increased by 0.6%, weighted F1 score by 1.8%, and balanced accuracy by 1.8%. These findings provide strong evidence for the effectiveness of transferring knowledge from off-the-shelf pathology foundation models through the proposed knowledge distillation framework.
Figure 8: The Effectiveness of Expert Knowledge Distillation.
For more results, please see our paper, Towards A Generalizable Pathology Foundation Model via Unified Knowledge Distillation, on arXiv.
If you have any questions or suggestions, please feel free to contact us. For BibTeX citation, please use the following format:
@article{GPFM,
  title={Towards A Generalizable Pathology Foundation Model via Unified Knowledge Distillation},
  author={Ma, Jiabo and Guo, Zhengrui and Zhou, Fengtao and Wang, Yihui and Xu, Yingxue and Cai, Yu and Zhu, Zhengjie and Jin, Cheng and Lin, Yi and Jiang, Xinrui and Han, Anjia and others},
  journal={arXiv preprint arXiv:2407.18449},
  year={2024}
}