[IJCAI 2025] A Survey of Pathology Foundation Models: Recent Advances and Future Directions

A collaborative work by The Chinese University of Hong Kong, HKUST Smart Lab, and Nanyang Technological University titled “A Survey of Pathology Foundation Models: Recent Advances and Future Directions” has been accepted at IJCAI 2025, a premier conference in artificial intelligence. This survey offers a comprehensive and systematic overview of Pathology Foundation Models (PFMs), a transformative direction in computational pathology.

This work delivers the first hierarchical classification framework for pathology foundation models (PFMs), establishes a systematic evaluation benchmark, and delineates critical technical challenges and future research priorities for advancing the field.


Background

Computational pathology (CPath) enables AI-powered analysis of whole slide images (WSIs) for disease diagnosis and prognosis. WSIs are gigapixel pathology images, and the dominant analysis framework is Multiple Instance Learning (MIL). In MIL, a WSI is divided into smaller image patches, from which a feature extractor produces embeddings that are subsequently aggregated by an aggregator to yield slide-level predictions.
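For concreteness, the sketch below walks through this two-stage workflow for a single slide, assuming the patches have already been tiled and that the extractor is a frozen, pathology-pretrained encoder; the mean-pooling aggregator, tensor shapes, and names are illustrative rather than taken from any specific PFM.

```python
# Minimal sketch of the MIL workflow for one WSI (illustrative only).
# Assumes the gigapixel slide has already been tiled into patches and that
# `extractor` is any frozen patch-level encoder (e.g. a pathology-pretrained ViT).
import torch
import torch.nn as nn

class MeanPoolAggregator(nn.Module):
    """Deliberately simple aggregator: average patch embeddings, then classify."""
    def __init__(self, dim: int = 768, num_classes: int = 2):
        super().__init__()
        self.classifier = nn.Linear(dim, num_classes)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        return self.classifier(feats.mean(dim=0))   # (num_classes,)

def predict_slide(patches: torch.Tensor, extractor: nn.Module,
                  aggregator: nn.Module) -> torch.Tensor:
    """patches: (N, 3, 224, 224) tensor holding the N tiles of a single slide."""
    with torch.no_grad():                # the extractor is typically kept frozen
        feats = extractor(patches)       # (N, D) patch embeddings
    return aggregator(feats)             # slide-level logits
```

In practice the mean pooling above is replaced by a learned aggregator such as ABMIL, which the later sections discuss in detail.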

Historically, models like ImageNet-pretrained ResNet-50 served as extractors, but these suffer from domain mismatch, failing to fully capture pathology-specific features such as subtle staining patterns or hierarchical tissue structures. Pathology Foundation Models (PFMs)—large-scale pathology-pretrained networks, often using self-supervised learning (SSL)—address this gap, enabling robust morphological representation and better performance on downstream tasks.

Yet, despite promising results, PFMs still face unique challenges in development, scalability, and deployment.


Hierarchical Taxonomy of PFMs

Our proposed taxonomy systematically organizes PFMs along three key dimensions using a top-down analytical framework:
(1) Model scope, which categorizes PFMs by their functional emphasis—focusing primarily on feature extraction, feature aggregation, or joint optimization of both;
(2) Model pretraining, which dissects pretraining methodologies at the patch and slide levels as well as multimodal pretraining;
(3) Model architecture, which classifies PFMs by parameter scale and structural complexity.
This framework enables comprehensive, consistent comparison across PFMs.

Model Scope

The standard MIL workflow for WSIs involves three stages: patch extraction, feature extraction, and feature aggregation. Patch extraction is already mature; as a result, a model’s overall performance depends critically on the quality of its extractor and aggregator. WSIs themselves have hierarchical tissue organization—the extractor captures local morphological details, while the aggregator models global structural patterns. The synergy between these two components largely determines diagnostic accuracy. Based on functional emphasis, PFMs can be divided into three categories:

Extractor-Oriented PFMs form the current research mainstream. Their popularity is driven by two factors:
(1) The critical role of high-quality feature extraction;
(2) The urgent need to address domain adaptation challenges when transferring ImageNet-pretrained CNNs to pathology.
This design philosophy mirrors clinical workflows, where pathologists rely on fine-grained cellular features at the patch level to make diagnoses. For instance, CTransPath pioneered a semantic-based contrastive learning method to train a CNN–Transformer hybrid extractor on 15 million patches. REMEDIS demonstrated cross-domain limitations of ResNet-50 across medical imaging modalities, reinforcing the necessity for pathology-specific extractors—a point further validated by later works such as Virchow and SINAI.

Aggregator-Oriented PFMs focus on the slide-level representation. Since the aggregator is the only MIL component trained with direct supervision from true WSI labels, it plays a pivotal role in slide-level tasks. Yet, research in this direction remains limited. CHIEF was the first to show the value of aggregator pretraining, using anatomical-site supervision to create site-aware aggregation. More recent works like MADELEINE, TITAN, and THREAD employed multimodal data to pretrain aggregators while keeping patch features frozen—improving performance in data-constrained scenarios. This reflects an increasing recognition of the aggregator’s importance, consistent with transfer learning principles: pretraining on large datasets alleviates downstream data scarcity. However, results from CHIEF showed that its pretrained aggregator could sometimes underperform simple linear probes on extractors. Possible reasons include small pretraining model sizes or domain bias conflicting with generalized features—indicating the need for further study on large-scale aggregator pretraining.

Hybrid Optimization PFMs pretrain both extractor and aggregator, aiming to maximize synergy between these components. HIPT pioneered this by hierarchically pretraining the first two levels of its image pyramid (leaving the final slide-level stage unpretrained), yielding significant performance gains. Prov-GigaPath followed a similar strategy, pretraining a ViT extractor alongside a LongNet slide encoder; however, LongNet produced instance-level features rather than a single slide-level embedding, requiring additional pooling (e.g., ABMIL or non-parametric methods) for classification. TANGLE pretrained a ViT extractor and an ABMIL aggregator guided by transcriptomic data. mSTAR introduced a reverse pipeline: pretraining a multimodal aggregator first, then using it to pretrain the extractor, achieving a fully pretrained hybrid design.

From recent trends, two key insights emerge:
First, research focus is shifting from extractor pretraining toward aggregator pretraining—a likely consequence of extractor performance plateauing and the growing appreciation of aggregator importance, especially in limited-data settings.
Second, aggregators are increasingly hierarchically dependent: later models often reuse earlier ones’ capabilities. For example, TITAN builds on features from CONCHv1.5, which itself is based on UNI—forming a dependency chain where TITAN’s performance rests on CONCHv1.5, which in turn depends on UNI.

Model Pretraining

PFM pretraining methods fall into supervised learning and self-supervised learning (SSL). SSL dominates because it extracts rich morphological representations without manual labels; in fact, CHIEF is the only surveyed PFM to use supervised aggregator pretraining. SSL approaches fall into two directions: pure vision methods (contrastive learning, masked image modeling, self-distillation) and cross-modal methods (multimodal alignment).

Contrastive Learning aims to minimize the distance between positive sample pairs and maximize separation between negative pairs. Key developments include:

  • SimCLR: established strong data augmentation and large-batch training regimes;
  • MoCo v3: stabilized ViT performance in SSL;
  • CLIP: extended contrastive alignment to image–text pairs;
  • CoCa: unified contrastive pretraining with text generation.
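To ground the idea, here is a minimal sketch of the NT-Xent objective popularized by SimCLR, written for a batch of paired patch augmentations; the temperature and variable names are illustrative, not taken from any specific PFM.

```python
# Minimal NT-Xent (SimCLR-style) contrastive loss over two augmented views.
import torch
import torch.nn.functional as F

def nt_xent(z1: torch.Tensor, z2: torch.Tensor, temperature: float = 0.1) -> torch.Tensor:
    """z1, z2: (B, D) embeddings of two augmentations of the same B patches."""
    B = z1.size(0)
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)        # (2B, D)
    sim = z @ z.t() / temperature                              # pairwise similarities
    mask = torch.eye(2 * B, dtype=torch.bool, device=z.device)
    sim = sim.masked_fill(mask, float("-inf"))                 # exclude self-pairs
    # The positive for sample i is its other augmented view (index i+B or i-B).
    targets = torch.cat([torch.arange(B, 2 * B), torch.arange(0, B)]).to(z.device)
    return F.cross_entropy(sim, targets)
```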

In pathology, REMEDIS used SimCLR to enhance robustness and data efficiency; TANGLE augmented SimCLR with gene expression reconstruction and intra-slide patch alignment; Pathoduet extended MoCo v3 with cross-scale localization and stain-transfer tasks to tackle stain variability and tissue heterogeneity. CLIP’s multimodal alignment has been widely deployed—not only in extractors (PLIP) but also in aggregators (Prov-GigaPath, mSTAR, MADELEINE, THREAD). KEEP improved CLIP-based extractors by integrating curated knowledge graphs to clean image–text pairs. CoCa-based frameworks have been adopted for both extractors (CONCH, trained on 1.17M image–caption pairs) and aggregators (PRISM, TITAN).
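For the cross-modal case, a CLIP-style symmetric objective over matched image–text (or slide–report) pairs can be sketched as follows; the fixed logit scale and the assumption of pre-projected embeddings are simplifications for illustration.

```python
# Sketch of a CLIP-style symmetric contrastive objective over matched pairs.
import torch
import torch.nn.functional as F

def clip_loss(img_emb: torch.Tensor, txt_emb: torch.Tensor,
              logit_scale: float = 1 / 0.07) -> torch.Tensor:
    """img_emb, txt_emb: (B, D) projected embeddings of B matched image-text pairs."""
    img = F.normalize(img_emb, dim=1)
    txt = F.normalize(txt_emb, dim=1)
    logits = logit_scale * img @ txt.t()                 # (B, B) similarity matrix
    targets = torch.arange(img.size(0), device=img.device)
    # Matched pairs sit on the diagonal; penalize both retrieval directions.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```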

Masked Image Modeling (MIM) predicts masked image regions to learn context-aware features. SimMIM simplified the design with random masking and a lightweight decoder, while MAE employed heavy masking with an asymmetric encoder–decoder architecture. Pathology models like SINAI (which pretrained a ViT on 3.2B patches) have demonstrated MIM's scalability; follow-ups such as MUSK (built on BEiT-3) and BEPH (built on BEiTv2) confirmed its utility. Prov-GigaPath further showed that MIM also benefits aggregator pretraining, applying MAE to its LongNet encoder.
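A rough sketch of the MAE-style asymmetric setup follows: only visible tokens are encoded and a lightweight decoder reconstructs the masked ones. The encoder and decoder interfaces and the 75% mask ratio are assumptions for illustration, not details from any particular PFM.

```python
# Sketch of MAE-style masked image modeling on patchified tokens (illustrative).
# `encoder` and `decoder` are assumed modules; the decoder interface is hypothetical.
import torch
import torch.nn as nn

def mae_step(tokens: torch.Tensor, encoder: nn.Module, decoder: nn.Module,
             mask_ratio: float = 0.75) -> torch.Tensor:
    """tokens: (B, N, D) patch tokens of a batch of images; returns the MIM loss."""
    B, N, D = tokens.shape
    n_keep = int(N * (1 - mask_ratio))
    # Per-image random shuffle; keep only the first n_keep (visible) tokens.
    ids = torch.argsort(torch.rand(B, N, device=tokens.device), dim=1)
    keep_ids = ids[:, :n_keep]
    visible = torch.gather(tokens, 1, keep_ids.unsqueeze(-1).expand(-1, -1, D))
    latent = encoder(visible)                 # encoder sees only the visible tokens
    pred = decoder(latent, keep_ids, N)       # decoder predicts all N tokens, (B, N, D)
    mask = torch.ones(B, N, device=tokens.device)
    mask.scatter_(1, keep_ids, 0.0)           # 1 = masked position, 0 = visible
    # Reconstruction loss is computed on masked positions only, as in MAE.
    return (((pred - tokens) ** 2).mean(dim=-1) * mask).sum() / mask.sum()
```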

Self-Distillation enables models to learn from their own predictions, typically by pairing a “teacher” network with a “student” network (a minimal sketch follows the list below). DINO combined a momentum-updated teacher with multi-crop training; iBOT integrated patch-level masking into self-distillation; DINOv2 improved training stability at scale. In pathology:

  • Phikon applied iBOT to 43M patches from 16 tumor types;
  • Phikon-v2 used DINOv2 across 456M patches;
  • RudolfV merged DINOv2 with pathologist knowledge across 58 tissue types;
  • Hibou scaled DINOv2 to 1.2B patches.

Self-distillation has also been used for aggregators: TITAN applied iBOT for generic aggregator learning. Domain-specific refinements include PLUTO (combining DINOv2, MAE, and a Fourier loss on 195M patches), GPFM (a unified knowledge distillation framework mixing MIM, self-distillation, and expert knowledge), and Virchow2 (pathology-specific augmentations and redundancy reduction for DINOv2).
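The shared teacher–student mechanism behind these methods can be sketched as follows; real systems add multi-crop views, output centering, and momentum/temperature schedules, so this is only a minimal illustration with assumed hyperparameter values.

```python
# Sketch of the teacher-student update behind DINO/iBOT-style self-distillation.
# Real systems add multi-crop views, output centering, and schedule the momentum
# and temperatures; the values below are illustrative defaults.
import torch
import torch.nn.functional as F

def distill_step(student, teacher, view1, view2, optimizer,
                 t_student: float = 0.1, t_teacher: float = 0.04,
                 momentum: float = 0.996) -> float:
    with torch.no_grad():
        target = F.softmax(teacher(view1) / t_teacher, dim=-1)  # sharpened teacher output
    log_pred = F.log_softmax(student(view2) / t_student, dim=-1)
    loss = -(target * log_pred).sum(dim=-1).mean()              # cross-entropy to the teacher
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    # The teacher is an exponential moving average of the student (no gradients).
    with torch.no_grad():
        for p_t, p_s in zip(teacher.parameters(), student.parameters()):
            p_t.mul_(momentum).add_(p_s, alpha=1 - momentum)
    return loss.item()
```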

Model Architecture

PFM architecture design hinges on backbone choice and model scale, with scale determined primarily by parameter count. To standardize scale comparisons, the authors introduce a ViT-referenced size taxonomy with the following tiers: XS (2.78M parameters), S (21.7M), B (86.3M), L (307M), H (632M), g (1.13B), and G (1.9B).
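As a convenience, the tier boundaries can be encoded in a small helper that assigns a model to its nearest reference size; the thresholds simply mirror the ViT reference counts listed above, and the function itself is not part of the survey.

```python
# Helper mapping a parameter count to the ViT-referenced size tiers above.
TIERS = [("XS", 2.78e6), ("S", 21.7e6), ("B", 86.3e6), ("L", 307e6),
         ("H", 632e6), ("g", 1.13e9), ("G", 1.9e9)]

def size_tier(num_params: float) -> str:
    """Return the tier whose reference parameter count is closest to num_params."""
    return min(TIERS, key=lambda t: abs(t[1] - num_params))[0]

print(size_tier(304e6))  # "L", e.g. a ViT-L-scale extractor
print(size_tier(2.5e6))  # "XS", e.g. a small ABMIL-style aggregator
```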

Extractor architectures include CNNs (ResNet) and Transformers (ViT, Swin, BEiT/BEiTv2, FlexiViT, multimodal BEiT-3). Aggregator architectures are dominated by ABMIL variants and Perceiver models.
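Because ABMIL variants dominate the aggregator side, a minimal (ungated) attention-pooling aggregator is sketched below; the hidden width and class count are placeholders, and production models typically use gated attention and deeper heads.

```python
# Minimal (ungated) ABMIL aggregator: attention-weighted pooling of patch embeddings.
import torch
import torch.nn as nn

class ABMILAggregator(nn.Module):
    def __init__(self, dim: int = 768, hidden: int = 256, num_classes: int = 2):
        super().__init__()
        self.attn = nn.Sequential(               # scores each patch embedding
            nn.Linear(dim, hidden),
            nn.Tanh(),
            nn.Linear(hidden, 1),
        )
        self.classifier = nn.Linear(dim, num_classes)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        """feats: (N, D) embeddings of the N patches from one slide."""
        a = torch.softmax(self.attn(feats), dim=0)   # (N, 1) attention weights
        slide_emb = (a * feats).sum(dim=0)           # (D,) weighted slide embedding
        return self.classifier(slide_emb)            # slide-level logits
```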

Analysis reveals notable patterns:

  1. ABMIL is the most widely used aggregator, while extractors are predominantly Transformer-based.
  2. ViT-L is the most common backbone, with smaller variants (ViT-S, ViT-B) used for efficiency.
  3. Extractors often have significantly more parameters than aggregators (e.g., ViT-L extractors vs. ViT-XS aggregators), leading to data–model scale mismatches.
  4. Model sizes are steadily increasing—from ViT-B in early PFMs to ViT-L/H/g/G in recent designs—highlighting an ongoing scale-up trend despite computational costs.

Evaluation Framework

The survey structures PFM evaluation tasks into four broad categories:

  1. Slide-Level: WSI classification, survival prediction, retrieval, segmentation.
  2. Patch-Level: Patch classification, retrieval, segmentation—testing extractors directly.
  3. Multimodal: Cross-modal retrieval, report generation, visual question answering.
  4. Biological: Gene mutation detection, molecular subtype classification.

Leading models such as CONCH, UNI, GPFM, and TITAN are accompanied by comprehensive multi-setting benchmarks, enabling evaluation across zero-shot, few-shot, and fully supervised scenarios.
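A common protocol in the supervised and few-shot settings is a linear probe on frozen PFM features; the sketch below assumes patch embeddings and labels have already been extracted as arrays, and uses scikit-learn purely for illustration.

```python
# Sketch of a linear-probe evaluation on frozen PFM patch features.
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score

def linear_probe(train_feats, train_labels, test_feats, test_labels) -> float:
    """Inputs are numpy arrays of precomputed embeddings and integer labels."""
    clf = LogisticRegression(max_iter=1000)   # only this linear head is trained
    clf.fit(train_feats, train_labels)
    return balanced_accuracy_score(test_labels, clf.predict(test_feats))
```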


Key Challenges & Future Directions

While PFMs have made substantial strides, several open problems must be addressed to unlock their full research and clinical impact. These challenges span from domain-specific learning to deployment infrastructure.

1. Pathology-Specific Pretraining
Current PFMs largely rely on techniques designed for natural image analysis, insufficiently tailored to the unique textures, staining variations, and hierarchical tissue organization in pathology. There is a pressing need for domain-specialized algorithms, designed for both single-modality (WSI) and multi-modality (WSI + reports + omics) contexts, to capture nuanced visual–biological correlations.

2. End-to-End WSI Learning
The two-stage extractor–aggregator paradigm often leads to misalignment between local patch and global WSI representations. Future systems must achieve fully integrated gigapixel-scale training, simultaneously optimizing both components while efficiently handling the massive computational and memory requirements.

3. Data–Model Scalability
Scaling trends for PFMs—considering dataset size, patch count, model capacity, and data diversity—are not yet fully understood. Research must determine optimal scaling strategies, develop adaptive architectures, and improve curation pipelines to ensure both quality and diversity of training data.

4. Efficient Federated Learning
Training with multi-institutional datasets without data sharing is essential but computationally challenging, especially in non-IID settings. Innovations in algorithm design are needed to reduce communication overhead, improve training stability, and preserve patient privacy, enabling large-scale collaborative learning at lower cost.
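As a point of reference, the simplest federated baseline is FedAvg-style weight averaging across institutions; the sketch below ignores the communication and non-IID issues highlighted above and is not drawn from any specific PFM system.

```python
# Minimal FedAvg-style aggregation of client model weights (illustrative only).
# Buffers such as batch-norm counters may need special handling in practice.
import torch

def federated_average(client_state_dicts, client_sizes):
    """Weighted average of client weights, proportional to local dataset size."""
    total = float(sum(client_sizes))
    averaged = {}
    for key in client_state_dicts[0]:
        averaged[key] = sum(sd[key].float() * (n / total)
                            for sd, n in zip(client_state_dicts, client_sizes))
    return averaged
```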

5. Robustness Across Institutions
Performance degradation due to institutional differences—scanner types, staining protocols, imaging resolutions—remains a critical obstacle. Approaches to domain generalization, adaptive normalization, and style-invariant modeling will be necessary to ensure consistent diagnostic performance across diverse clinical environments.

6. Retrieval-Augmented Pathology Language Models
Integrating retrieval-augmented generation (RAG) with PFMs could enrich decision support by combining visual analysis with retrieved pathology-specific knowledge bases, such as atlases or literature. This multimodal integration could yield explainable, context-aware AI tools for clinical pathology.

7. Model Adaptation & Maintenance
Given the high cost of full retraining, PFMs must be maintainable and updatable as new diseases emerge, staining protocols evolve, or institutional practices change. Lightweight continual learning and parameter-efficient tuning strategies will be key to maintaining clinical relevance.
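One widely used parameter-efficient strategy is a LoRA-style low-rank adapter wrapped around frozen pretrained layers; the sketch below is a generic illustration with arbitrarily chosen rank and scaling, not a recipe from the survey.

```python
# Sketch of a LoRA-style low-rank adapter around a frozen pretrained linear layer.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                 # pretrained weights stay frozen
        self.lora_a = nn.Linear(base.in_features, rank, bias=False)
        self.lora_b = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.lora_b.weight)          # adapter starts as a zero update
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * self.lora_b(self.lora_a(x))
```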


Conclusion

This IJCAI 2025 survey delivers a structured taxonomy, a comprehensive evaluation framework, and a forward-looking research roadmap for pathology foundation models. By addressing the outlined challenges, the authors envision PFMs progressing toward greater generalization, clinical reliability, and theoretical advancement.

Links to the paper and the GitHub repository:
  • Read the full paper on arXiv
  • Explore resources on GitHub