[CVPR 2025] FOCUS: Knowledge-Enhanced Adaptive Visual Compression for Few-Shot Whole Slide Image Classification

Recently, HKUST Smart Lab's new work, FOCUS: Knowledge-Enhanced Adaptive Visual Compression for Few-Shot Whole Slide Image Classification, was accepted at CVPR 2025, a top-tier conference in computer vision. The study introduces FOCUS, a novel framework designed to tackle the few-shot weakly supervised learning (FSWL) problem in whole slide image (WSI) classification. By leveraging adaptive visual compression and knowledge enhancement, FOCUS filters out irrelevant regions and prioritizes diagnostically significant areas, addressing key challenges in pathology AI under data-scarce conditions.

Background

Whole slide image (WSI) classification is a crucial task in computational pathology (CPath), enabling automated disease diagnosis from high-resolution pathology images. This is particularly important for cancer detection, where accurate classification can significantly impact clinical decision-making. However, WSIs are extremely high-resolution (often exceeding 100,000 × 100,000 pixels), and diagnostic information is sparsely distributed across the image. Traditional fully supervised learning approaches require extensive expert-annotated training data, which is costly and limited due to privacy constraints and expert availability.

To address this, few-shot weakly supervised learning (FSWL) has emerged as a promising solution, enabling models to learn from only a handful of WSIs that carry weak, slide-level labels. A widely used approach in FSWL is multiple instance learning (MIL), where each WSI is divided into smaller image patches whose features are aggregated to form a slide-level representation. However, when labeled samples are extremely scarce, MIL struggles to distinguish diagnostically relevant patches from irrelevant ones. Additionally, existing methods underutilize pathology foundation models (FMs) and language priors, limiting their effectiveness in few-shot scenarios.

This raises a key question:
How can we efficiently compress visual information and focus on diagnostically relevant regions under few-shot conditions?

FOCUS addresses this by integrating foundation models (FMs) and language knowledge into an adaptive compression mechanism, ensuring that critical diagnostic features are prioritized. Figure 1 illustrates the overall framework.

FOCUS Framework

A. Multiple Instance Learning (MIL)

MIL is the dominant approach for WSI analysis, treating each WSI as a bag and its patches as instances. Patch features are aggregated to form a slide-level representation, making MIL well-suited for weakly supervised learning. However, traditional MIL requires large labeled datasets, making it ineffective in few-shot settings. FOCUS enhances MIL by introducing adaptive visual compression to improve feature selection.
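To make the bag-of-instances formulation concrete, here is a minimal attention-based MIL aggregator in PyTorch. The dimensions and the attention-pooling design are illustrative assumptions, not the paper's implementation:

```python
import torch
import torch.nn as nn

class AttentionMIL(nn.Module):
    """Minimal attention-based MIL: a WSI is a bag of patch features,
    pooled into a single slide-level vector by learned attention."""
    def __init__(self, feat_dim=512, hidden_dim=128, num_classes=2):
        super().__init__()
        self.attn = nn.Sequential(
            nn.Linear(feat_dim, hidden_dim),
            nn.Tanh(),
            nn.Linear(hidden_dim, 1),
        )
        self.classifier = nn.Linear(feat_dim, num_classes)

    def forward(self, patches):                        # patches: (N, feat_dim)
        weights = torch.softmax(self.attn(patches), dim=0)  # (N, 1)
        slide = (weights * patches).sum(dim=0)              # (feat_dim,)
        return self.classifier(slide)                       # (num_classes,)

# Usage: 1,200 patch features from one slide -> slide-level logits
logits = AttentionMIL()(torch.randn(1200, 512))
```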

B. Pathology Foundation Models (FMs)

Recent pathology foundation models (e.g., CONCH, UNI, GPFM, Virchow) have demonstrated strong feature extraction capabilities through large-scale pretraining. However, current methods limit their use to feature extraction, failing to leverage their semantic understanding. FOCUS utilizes FMs for global redundancy removal and semantic relevance assessment, improving feature selection.

C. Few-Shot Weakly Supervised Learning (FSWL)

FSWL aims to perform WSI classification with minimal labeled samples. Existing methods incorporate language guidance or additional visual samples, but they often struggle with multi-resolution requirements or reliance on extra reference samples. FOCUS introduces a three-stage compression strategy, offering a more generalizable and efficient solution.

Methodology

FOCUS employs a three-stage progressive visual compression strategy, consisting of the following steps:

A. Global Redundancy Removal

Each WSI is divided into non-overlapping patches, and features are extracted using pretrained pathology foundation models. To remove redundant information, FOCUS applies cosine similarity-based filtering within a local sliding window. A dynamic threshold is computed based on mean similarity and standard deviation, ensuring that highly redundant patches are eliminated.

\[\hat{\mathbf{b}}_i=\frac{\mathbf{b}_i}{\left\|\mathbf{b}_i\right\|_2}, \quad S_{i j}=\hat{\mathbf{b}}_i \cdot \hat{\mathbf{b}}_j, \quad \tau_g=\mu(S)+\sigma(S)\]
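A minimal sketch of this filtering step, assuming non-overlapping windows and that the later patch of each overly similar pair is dropped (choices the paper may handle differently):

```python
import torch
import torch.nn.functional as F

def global_redundancy_removal(feats, window=16):
    """Drop patches that are near-duplicates of another patch inside a
    local window; tau_g = mean(S) + std(S) is recomputed per window,
    following the formula above. feats: (N, D) patch features in slide order."""
    b = F.normalize(feats, dim=1)                 # b_i / ||b_i||_2
    keep = torch.ones(len(b), dtype=torch.bool)
    for start in range(0, len(b), window):
        blk = b[start:start + window]
        if len(blk) < 2:
            continue
        S = blk @ blk.T                           # S_ij = b_i . b_j
        off = S[~torch.eye(len(blk), dtype=torch.bool)]
        tau_g = off.mean() + off.std()            # dynamic threshold
        # drop the later patch of each pair exceeding the threshold
        redundant = torch.triu(S > tau_g, diagonal=1).any(dim=0)
        keep[start:start + window] &= ~redundant
    return feats[keep]
```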

B. Language-Guided Visual Token Prioritization

After redundancy removal, the remaining patch features are matched with expert-provided textual descriptions. A text encoder converts expert descriptions into text tokens, which are then mapped to visual features. A cross-modal attention mechanism calculates similarity scores, ranking visual tokens by semantic relevance. The top-ranked tokens are selected for further processing, ensuring that diagnostically significant regions are prioritized.

\[\begin{gathered} \mathbf{A}=\operatorname{softmax}\left(\frac{\left(\mathbf{T} W_q\right)\left(\mathbf{B} W_k\right)^{\top}}{\sqrt{d}}\right), \quad r_i=\frac{1}{t_1+t_2} \sum_{j=1}^{t_1+t_2} \mathbf{A}_{j i}, \\ k=\min \left(M_{\max }, \gamma N^{\prime}\right), \quad \mathbf{B}_s=\left\{\mathbf{b}_i \mid \operatorname{rank}\left(r_i\right) \leq k\right\}, \end{gathered}\]
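The prioritization step can be sketched as follows, with W_q and W_k as learned projection matrices and the matrix T standing in for the t₁ + t₂ encoded text tokens; the sizes below are hypothetical:

```python
import torch

def prioritize_tokens(B, T, Wq, Wk, gamma=0.25, m_max=512):
    """Rank visual tokens B (N', D) by the cross-modal attention they
    receive from text tokens T (t, D); keep the top-k tokens."""
    d = Wk.shape[1]
    A = torch.softmax((T @ Wq) @ (B @ Wk).T / d ** 0.5, dim=-1)  # (t, N')
    r = A.mean(dim=0)                        # relevance r_i, averaged over text tokens
    k = min(m_max, int(gamma * B.shape[0]))  # k = min(M_max, gamma * N')
    idx = torch.topk(r, k).indices
    return B[idx]

# Usage with random features and projections (illustrative sizes)
B, T = torch.randn(4000, 512), torch.randn(32, 512)
Wq = Wk = torch.randn(512, 64)
B_s = prioritize_tokens(B, T, Wq, Wk)        # (512, 512) here
```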

C. Sequential Visual Token Compression

To further refine the feature set, FOCUS applies sequential compression to remove local redundancy among selected visual tokens. Cosine similarity is computed between adjacent tokens, and a dynamic thresholding mechanism iteratively filters out redundant tokens. This process preserves spatial coherence while eliminating unnecessary information.

\[\begin{gathered} s_{j, j+1}=\frac{\mathbf{b}_j^{\top} \mathbf{b}_{j+1}}{\left\|\mathbf{b}_j\right\|_2 \cdot\left\|\mathbf{b}_{j+1}\right\|_2}, \quad j \in\{1, \ldots, k-1\}, \\ \operatorname{mask}_j= \begin{cases}1, & \text { if } \min \left(s_{j-1, j}, s_{j, j+1}\right)<\theta_i \\ 0, & \text { otherwise }\end{cases} \end{gathered}\]
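A simplified sketch of one filtering pass, using a fixed threshold in place of the paper's iteratively updated one:

```python
import torch
import torch.nn.functional as F

def sequential_compression(B, theta=0.9):
    """Keep token j only if it is sufficiently distinct from at least
    one neighbor, i.e. min(s_{j-1,j}, s_{j,j+1}) < theta. B: (k, D)."""
    b = F.normalize(B, dim=1)
    s = (b[:-1] * b[1:]).sum(dim=1)              # adjacent cosine sims, (k-1,)
    pad = s.new_tensor([float("inf")])           # boundary tokens use one neighbor
    left, right = torch.cat([pad, s]), torch.cat([s, pad])
    mask = torch.minimum(left, right) < theta    # mask_j from the formula above
    return B[mask]
```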

D. Cross-Modal Aggregation and Final Prediction

The compressed visual tokens are fused with text embeddings using a multi-head attention mechanism. The attention scores are normalized and passed through a fully connected layer to generate the final classification probabilities. The model is trained end-to-end using a cross-entropy loss function, optimizing feature extraction, cross-modal attention, and classification jointly.

\[\begin{gathered} \operatorname{Head}_i=\operatorname{softmax}\left(\frac{\mathbf{Q} W_q^i\left(\mathbf{K} W_k^i\right)^{\top}}{\sqrt{d_k}}\right) \mathbf{V} W_v^i, \\ \mathbf{O}=\operatorname{LayerNorm}\left(\operatorname{Concat}\left(\operatorname{Head}_1, \ldots, \operatorname{Head}_h\right) W_o\right), \\ P\left(Y \mid \mathbf{B}_c, \mathbf{T}\right)=\operatorname{softmax}\left(W_c \mathbf{O}+\beta_c\right), \end{gathered}\]
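A compact version of this fusion head, built on PyTorch's nn.MultiheadAttention; the mean pooling over text positions is our simplification:

```python
import torch
import torch.nn as nn

class CrossModalHead(nn.Module):
    """Fuse compressed visual tokens (keys/values) with text tokens
    (queries) via multi-head attention, then classify the fused output."""
    def __init__(self, dim=512, heads=8, num_classes=2):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.fc = nn.Linear(dim, num_classes)   # W_c, beta_c

    def forward(self, T, B_c):
        # T: (1, t, dim) text tokens; B_c: (1, k, dim) compressed visual tokens
        O, _ = self.attn(T, B_c, B_c)           # cross-modal attention output
        O = self.norm(O).mean(dim=1)            # pool text positions (our choice)
        return self.fc(O)                       # logits for softmax(W_c O + beta_c)

logits = CrossModalHead()(torch.randn(1, 32, 512), torch.randn(1, 512, 512))
```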

Experimental Results

Datasets

FOCUS was evaluated on three benchmark pathology datasets:

  • TCGA-NSCLC: A dataset from The Cancer Genome Atlas, containing lung adenocarcinoma (LUAD) and lung squamous cell carcinoma (LUSC) WSIs (~1,000 slides).
  • CAMELYON: A dataset from the CAMELYON16 and CAMELYON17 challenges, including normal and metastatic lymph node samples (~900 slides).
  • UBC-OCEAN: A dataset from the University of British Columbia, containing five ovarian cancer subtypes (~500 slides).

Experiments were conducted under 4-shot, 8-shot, and 16-shot settings, where only 4, 8, or 16 labeled WSIs per class were used for training.
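Constructing such a split is straightforward; a small sketch (function and variable names are ours, not from the paper's code):

```python
import random
from collections import defaultdict

def sample_k_shot(slide_labels, k, seed=0):
    """Draw k labeled WSIs per class for a k-shot training split.
    slide_labels: dict mapping slide_id -> class label."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for sid, y in slide_labels.items():
        by_class[y].append(sid)
    return {y: rng.sample(sids, k) for y, sids in by_class.items()}

# e.g. a 4-shot split over a toy two-class label map
split = sample_k_shot({f"slide_{i}": i % 2 for i in range(40)}, k=4)
```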

Evaluation Metrics

The model was evaluated using the following metrics (see the computation sketch after this list):

  • Balanced Accuracy (Balanced ACC): Accounts for class imbalance.
  • Area Under the ROC Curve (AUC): Measures classification performance.
  • F1 Score: Reflects overall classification quality.
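A minimal sketch of computing these metrics with scikit-learn for a multi-class task such as UBC-OCEAN; the macro/one-vs-rest averaging is our assumption, not necessarily the paper's convention, and binary tasks would pass the positive-class probability instead:

```python
from sklearn.metrics import balanced_accuracy_score, f1_score, roc_auc_score

def evaluate(y_true, y_pred, y_prob):
    """y_true/y_pred: class indices; y_prob: (n_samples, n_classes)
    predicted probabilities."""
    return {
        "balanced_acc": balanced_accuracy_score(y_true, y_pred),
        "auc": roc_auc_score(y_true, y_prob, multi_class="ovr", average="macro"),
        "f1": f1_score(y_true, y_pred, average="macro"),
    }
```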

Key Findings

FOCUS Performance

The paper's main results table can be summarized as follows:

  • TCGA-NSCLC: Achieves 81.9% Balanced ACC and 91.5% AUC under 4-shot settings, outperforming all baselines.
  • CAMELYON: Maintains high accuracy across low-shot settings, reaching 70.1% ACC in 4-shot experiments.
  • UBC-OCEAN: Demonstrates strong performance in multi-class classification, achieving 70.4%, 77.3%, and 86.4% Balanced ACC in 4-shot, 8-shot, and 16-shot settings, respectively.

Effect of Foundation Model Selection

We evaluated different foundation models (FMs) for feature extraction, including CONCH, UNI, GPFM, and Virchow. CONCH consistently outperformed others, particularly in low-shot settings, highlighting the importance of vision-language pretrained models for pathology tasks.

Figure: Comparison of foundation models.

Impact of Prompt Generation

We also examined the effect of different LLM-generated prompts. Prompts generated by Claude-3.5-Sonnet led to the highest Balanced ACC (86.4%) in 16-shot settings, outperforming GPT-3.5-Turbo (84.9%), demonstrating the impact of high-quality textual guidance.
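For intuition, class-level prompts for TCGA-NSCLC might look like the following; the wording here is our illustration, not the actual LLM output used in the paper:

```python
# Illustrative class-level text prompts for TCGA-NSCLC; in FOCUS such
# descriptions are generated by an LLM and encoded by the text encoder.
PROMPTS = {
    "LUAD": "a whole slide image of lung adenocarcinoma showing gland formation",
    "LUSC": "a whole slide image of lung squamous cell carcinoma with keratinization",
}
```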

Figure: Comparison of LLM-generated prompts.

Conclusion

FOCUS introduces a knowledge-enhanced adaptive visual compression framework for few-shot WSI classification, addressing key challenges in data-scarce pathology AI. By integrating foundation models and language priors, FOCUS significantly improves diagnostic feature selection and classification accuracy.

Future work will explore:

  • Expanding FOCUS to more computational pathology tasks.
  • Enhancing multi-resolution feature modeling.
  • Applying FOCUS to broader medical AI applications.

For more details, check out the full paper:
Guo, Zhengrui, et al. “FOCUS: Knowledge-Enhanced Adaptive Visual Compression for Few-Shot Whole Slide Image Classification.” arXiv preprint arXiv:2411.14743 (2024).