[ICLR 2026] Exploiting Low-Dimensional Manifold of Features for Few-Shot Whole Slide Image Classification

A joint research effort from The Chinese University of Hong Kong (CUHK), HKUST Smart Lab, and Nanyang Technological University (NTU) has been accepted at ICLR 2026, one of the premier machine learning conferences. The study presents MR Block (Manifold Residual Block), a plug-and-play, geometry-aware drop-in replacement for standard linear layers in Multiple Instance Learning (MIL) models, specifically targeting the challenge of few-shot Whole Slide Image (WSI) classification in computational pathology.

MR Block decomposes a linear projection into two parallel paths: a fixed random geometric anchor that preserves the intrinsic manifold structure of pathology foundation model features, and a trainable low-rank residual path (LRP) for task-specific adaptation. This design introduces a structured inductive bias that simplifies learning into a more tractable residual-fitting problem, achieving state-of-the-art performance with significantly fewer trainable parameters.

Background

Histopathology is the gold standard for disease diagnosis, and computational analysis of Whole Slide Images (WSIs) faces two structural constraints. First, WSIs operate at the gigapixel scale, making Multiple Instance Learning (MIL) the de facto paradigm: each slide is represented as a bag of patch features. Second, expert annotations are costly and scarce, and real-world data often involves few labeled slides with only slide-level labels, making models highly susceptible to overfitting.

To understand the root cause of overfitting beyond the learning algorithm itself, this work examines the intrinsic geometric structure of features. Based on the manifold hypothesis, the authors analyze feature representations on the Camelyon16 dataset using different feature extractors (CONCH, UNI, ResNet-50), providing multi-angle evidence that these representations lie on a low-dimensional, nonlinear manifold:

  • Spectral analysis reveals an effective rank of only 29.7 in CONCH’s 512-dimensional feature space, confirming low-dimensionality.
  • t-SNE visualization shows clear clustering topology.
  • Tangent space analysis demonstrates non-flat, distance-dependent geometric drift—quantitative evidence of the manifold’s intrinsic curvature, ruling out a purely linear subspace hypothesis.

HKUST SmartLab

Based on these observations, the authors argue that a key driver of few-shot overfitting is geometric: while pathology foundation models produce features with a fragile low-dimensional manifold structure, existing MIL models fail to preserve it. The primary source of distortion is the most ubiquitous and indispensable component—the linear layer. Linear layers appear in projections, attention computation, and classification heads, yet they are inherently geometry-agnostic. Tangent space analysis provides direct evidence: trained linear layers significantly distort the intrinsic geometry of the manifold, causing models to learn overly complex mappings in few-shot settings that both violate the low-rank nature of features and discard the geometric priors learned during pre-training.

Existing MIL approaches for WSI classification can be broadly grouped by how they handle feature geometry:

A. Standard MIL Backbones

Attention-based approaches such as ABMIL and CATE are effective general frameworks but rely on unconstrained linear layers that are inherently geometry-agnostic. In few-shot settings, this leads to manifold distortion and overfitting.

B. Few-Shot Specialized Methods

Methods such as ViLaMIL and FOCUS are specifically designed for few-shot WSI classification and represent the current state of the art. However, they do not explicitly account for the geometric structure of pre-trained features, leaving the manifold distortion problem unaddressed.

C. Manifold Residual Block (MR Block)

MR Block takes a fundamentally different approach by directly addressing the geometric distortion caused by linear layers. Rather than designing a new MIL backbone, it provides a plug-and-play replacement for standard linear layers, explicitly preserving and leveraging the low-dimensional manifold geometry of foundation model features.

Methodology

Preliminaries

In the bag-level MIL setting, a WSI is divided into non-overlapping patches, each encoded by a pre-trained feature extractor into a patch feature; together these form a bag. An MIL aggregator, typically attention pooling, produces a slide-level feature that is fed into a classifier, and the attention weights simultaneously serve as patch-level importance scores.
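The aggregation step above can be sketched as a minimal ABMIL-style attention pooling module. This is an illustrative sketch of the general mechanism, not the authors' released code; the layer sizes and names are assumptions.

```python
import torch
import torch.nn as nn

class AttentionPool(nn.Module):
    """Minimal attention pooling: collapses a bag of patch features
    into one slide-level feature, exposing per-patch importance scores."""

    def __init__(self, d: int, hidden: int = 128):
        super().__init__()
        # Small scoring MLP producing one scalar logit per patch
        self.score = nn.Sequential(
            nn.Linear(d, hidden), nn.Tanh(), nn.Linear(hidden, 1)
        )

    def forward(self, bag: torch.Tensor):
        # bag: (n_patches, d) features from a frozen extractor
        a = torch.softmax(self.score(bag), dim=0)  # (n_patches, 1), sums to 1
        slide_feat = (a * bag).sum(dim=0)          # (d,) slide-level feature
        return slide_feat, a.squeeze(-1)           # feature + patch scores

pool = AttentionPool(d=512)
bag = torch.randn(200, 512)        # a toy bag of 200 patch features
feat, scores = pool(bag)
```

The same attention weights that pool the bag double as the patch-level importance scores used later for heatmap visualization.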

Spectral analysis estimates the intrinsic dimensionality of representations via the eigenvalue distribution of the Gram matrix, computing the von Neumann entropy and effective rank. Tangent space analysis probes the non-linear structure beyond low-dimensionality by constructing a neighborhood graph on normalized features and estimating local tangent spaces via PCA at each point, quantifying how much local geometry varies with position.
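The spectral diagnostic can be reproduced in a few lines of NumPy: the effective rank is the exponential of the von Neumann (Shannon) entropy of the normalized eigenvalue spectrum. This is a minimal sketch; the paper's exact preprocessing and normalization may differ.

```python
import numpy as np

def effective_rank(features: np.ndarray) -> float:
    """Effective rank = exp of the entropy of the normalized
    Gram-matrix eigenvalue spectrum."""
    X = features - features.mean(axis=0, keepdims=True)  # center features
    s = np.linalg.svd(X, compute_uv=False)  # singular values of X
    eig = s ** 2                            # eigenvalues of the Gram matrix
    p = eig / eig.sum()                     # normalized spectrum
    p = p[p > 0]
    entropy = -(p * np.log(p)).sum()        # von Neumann entropy
    return float(np.exp(entropy))

# Toy check: features confined to a rank-3 subspace of R^64
rng = np.random.default_rng(0)
Z = rng.normal(size=(1000, 3)) @ rng.normal(size=(3, 64))
print(effective_rank(Z))  # close to 3, far below the ambient dimension 64
```

Applied to real foundation-model features, this is the quantity reported as ~29.7 for CONCH's 512-dimensional embeddings.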

MR Block Architecture

To mitigate the geometric degradation introduced by standard linear layers, the authors propose MR Block, a parameter-efficient, plug-and-play, geometry-aware alternative that decomposes the linear mapping into two parallel paths.


Given input x ∈ ℝ^{d_in}, MR Block is defined as:

MR(x) = W_anchor · x + Up(GELU(Down(x)))

where:

  • Geometric Anchor Path: W_anchor ∈ ℝ^{d_out × d_in} is a fixed random matrix (Kaiming uniform initialized, never updated). It serves as a geometric anchor that approximately preserves the original feature topology while acting as a spectral sharpener to enhance spectral discriminability.
  • Low-rank Residual Path (LRP): Down ∈ ℝ^{r × d_in} and Up ∈ ℝ^{d_out × r} are trainable matrices with a bottleneck rank parameter r ≪ d_in. This structural bottleneck explicitly aligns with the low effective rank of features, modeling only task-relevant residuals.

Initialization: Up is initialized to all zeros, so MR Block initially behaves as W_anchor alone, contributing zero residual at the start of training. The LRP then activates only insofar as it improves the training objective, avoiding the geometric distortion that a randomly initialized residual path would otherwise introduce.

Parameter efficiency: When r ≪ d_in, LRP has r(d_in + d_out) parameters—strictly fewer than a standard linear layer’s d_in × d_out, theoretically reducing overfitting risk.
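Putting the pieces together, the block can be sketched as a drop-in replacement for a linear layer in PyTorch. This is a minimal sketch based on the formula above, not the authors' released implementation; class and attribute names are illustrative.

```python
import torch
import torch.nn as nn

class MRBlock(nn.Module):
    """Sketch of MR Block: fixed random geometric anchor
    plus a trainable low-rank residual path (LRP)."""

    def __init__(self, d_in: int, d_out: int, r: int = 32):
        super().__init__()
        # Geometric anchor: Kaiming-uniform init, registered as a
        # buffer so it is never updated by the optimizer
        anchor = torch.empty(d_out, d_in)
        nn.init.kaiming_uniform_(anchor)
        self.register_buffer("w_anchor", anchor)
        # Low-rank residual path with bottleneck rank r << d_in
        self.down = nn.Linear(d_in, r, bias=False)
        self.up = nn.Linear(r, d_out, bias=False)
        nn.init.zeros_(self.up.weight)  # block starts as the anchor alone
        self.act = nn.GELU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # MR(x) = W_anchor · x + Up(GELU(Down(x)))
        return x @ self.w_anchor.T + self.up(self.act(self.down(x)))

# Trainable parameters: r * (d_in + d_out), vs d_in * d_out for nn.Linear
block = MRBlock(d_in=512, d_out=512, r=32)
n_trainable = sum(p.numel() for p in block.parameters() if p.requires_grad)
print(n_trainable)  # 32 * (512 + 512) = 32768, vs 512 * 512 = 262144
```

Because the anchor is a buffer rather than a parameter, only the Down/Up matrices receive gradients, matching the r(d_in + d_out) count above.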

Experimental Results

Comparison with State-of-the-Art Methods

Table 1 summarizes results across multiple datasets. Three key conclusions emerge:


  1. Consistent Improvement: Whether on large cohorts with artificially constructed few-shot settings or on naturally few-shot treatment response datasets, MR-augmented models consistently outperform their respective baselines across different datasets and shot numbers. On Camelyon16, TCGA-NSCLC, and TCGA-RCC, MR versions match or surpass current state-of-the-art methods (ViLaMIL and FOCUS) with significantly fewer trainable parameters.

  2. Parameter Efficiency: Replacing standard linear layers with MR Block produces a smaller model that performs better, and this pattern holds across multiple MIL backbones. This confirms that MR’s gains come from a beneficial low-rank inductive bias, not from increased model capacity.

  3. Stability: As the shot count k increases, all methods improve steadily; MR versions show the largest gains at moderate shot counts and remain competitive at higher ones. MR also exhibits lower variance across runs in several settings, suggesting improved training stability.

Ablation Studies

Component Ablation


The ablation in Table 2 validates the decoupled design. Removing the LRP consistently degrades performance, confirming that the LRP is necessary for task adaptation. Removing the geometric anchor, whether the residual connection is retained or discarded, significantly harms performance, highlighting the anchor's dual role as both a geometric anchor and a spectral sharpener. Most critically, making the anchor trainable causes a catastrophic performance collapse, providing direct empirical evidence that unconstrained linear layers tend to disrupt feature manifolds, while the MR design better preserves them.

Capacity-Matched Analysis

To disentangle “fewer parameters” from “geometric inductive bias,” the authors construct a capacity-matched MR-ABMIL in which MR Block’s trainable parameter count exactly matches that of the original gated attention layer. On Camelyon16 and RCC, capacity-matched MR-ABMIL significantly outperforms ABMIL across all shot settings; results on NSCLC are broadly comparable. Since these gains are achieved at identical parameter counts, they directly demonstrate that MR’s geometry-aware structure, not parameter reduction alone, plays the central role in the few-shot performance gains.

Rank Sensitivity Analysis


The residual rank r controls information flow in the low-rank path. Sensitivity analysis shows that performance saturates at approximately r ≈ 32 across the evaluated datasets and shot settings. This saturation point closely matches the measured effective rank of the features (~29.7), confirming that a low-rank path suffices to capture the principal task information encoded in the manifold structure. The simpler MR-ABMIL shows a pronounced peak at r = 32, while the more expressive MR-CATE shows only marginal gains beyond it, which the authors attribute to CATE's additional modeling capacity capturing finer feature interactions beyond the main manifold.

Interpretability in Extreme Resource-Constrained Settings


Figure 6 shows heatmaps generated by MR-CATE in the extreme 2-shot setting. Standard CATE under 2-shot supervision typically fails to produce meaningful attention maps, whereas MR-CATE is markedly more robust. In the original images (top), blue curves mark approximate tumor boundaries; the corresponding heatmaps appear below them. Notably, MR-CATE also captures finer-grained boundaries beyond those in the original annotations: even within the blue tumor boundaries, the model accurately distinguishes between different morphological patterns, demonstrating sensitivity to heterogeneity within tumors and in the surrounding normal tissue.

Conclusion

This work re-examines overfitting in few-shot WSI classification from a geometric perspective. The authors provide quantitative and visual evidence that pathology foundation model features exhibit a fragile low-dimensional manifold geometry, and identify a common failure mode in MIL models: their indispensable linear layers systematically destroy this manifold structure because they lack geometric awareness.

MR Block addresses this by combining a fixed random geometric anchor that preserves manifold structure with a low-rank residual path for parameter-efficient task-specific adaptation. Extensive experiments not only demonstrate state-of-the-art performance but also empirically support the geometric diagnosis, offering a new geometry-aware paradigm for building more robust models—with implications extending beyond computational pathology.

Future Directions

  • Extend MR Block to other MIL tasks beyond WSI classification.
  • Explore more sophisticated spectral shaping strategies for the geometric anchor.
  • Apply geometry-aware inductive biases to other domains where pre-trained features exhibit manifold structure (e.g., natural image few-shot learning, medical image segmentation).

The code is open-sourced at https://github.com/BearCleverProud/MR-Block.

For more details, check out the full paper:
Conghao Xiong, Zhengrui Guo, Zhe Xu, Yifei Zhang, Raymond Kay-yu Tong, Si Yong Yeo, Hao Chen, Joseph J. Y. Sung, and Irwin King. “Exploiting Low-Dimensional Manifold of Features for Few-Shot Whole Slide Image Classification.” arXiv preprint arXiv:2505.15504, 2026.