[MIA2025] MedIAnomaly: A Comparative Study of Anomaly Detection in Medical Images

Introduction

HKUST SmartLab is pleased to announce that our paper, “MedIAnomaly: A Comparative Study of Anomaly Detection in Medical Images,” has been accepted by Medical Image Analysis. This study establishes a standardized benchmark for anomaly detection (AD) in medical imaging, covering seven diverse datasets across five imaging modalities: chest X-rays, brain MRIs, retinal fundus images, dermatoscopic images, and histopathology whole-slide images.

We conduct a comprehensive evaluation of 30 representative AD methods, assessing their performance on both image-level anomaly classification and pixel-level anomaly segmentation. Furthermore, for the first time, we systematically analyze the impact of key methodological components in existing AD approaches, revealing unresolved challenges and potential future research directions.

Background & Motivation

Anomaly detection (AD) aims to identify abnormal samples that deviate from expected normal patterns. It plays a crucial role in rare disease recognition and health screening in the medical domain. Despite the rapid development of AD methods, the field lacks a fair and comprehensive evaluation, leading to ambiguous conclusions and inconsistent comparisons.

Challenges in Existing Works

  1. Dataset and Partition Variability

    • Many prior studies use different datasets or inconsistent data partitions, making it difficult to fairly compare methods.
    • Some datasets, such as Hyper-Kvasir and OCT2017, are too simple for AD tasks. As shown in Table 1, even a basic autoencoder (AE) achieves nearly perfect performance on these datasets, suggesting they are not sufficiently challenging for benchmarking.

    Table 1: Summary of datasets, including modality, sample size, and anomaly types

  2. Lack of Unified Implementations

    • Even within the same paradigm, different works adopt varied network architectures and training strategies, leading to unfair comparisons.
    • For example, in reconstruction-based methods:
      • f-AnoGAN uses residual blocks with multiple convolutional layers.
      • AE-Flow employs a Wide ResNet-50-2 encoder.
      • AE-U relies on simple convolutional layers without specific designs.
    • These discrepancies make it difficult to isolate the true effectiveness of each method.
  3. Limitations of Existing Benchmarks

    • While some surveys and benchmarks exist, they fail to provide a comprehensive and fair comparison.
    • For example, [1] analyzed various state-of-the-art (SOTA) AD methods but omitted older yet representative methods (e.g., different AE variants and self-supervised approaches using synthetic anomalies).
    • Additionally, their study did not employ unified network architectures, making it difficult to analyze the inherent properties of each method.

Our Contributions

To address these issues, we propose a comprehensive benchmark with the following contributions:

  • A Taxonomy of AD Methods: We categorize existing methods into reconstruction-based, self-supervised learning-based, and feature reference-based methods, providing a structured overview of the field.
  • A Curated Dataset Collection: We compile seven medical datasets spanning five imaging modalities, ensuring a diverse and challenging benchmark.
  • A Fair and Extensive Method Comparison: We evaluate 30 representative AD methods under standardized conditions and systematically analyze the effects of key components.
  • Insights into Unresolved Challenges: Our study highlights critical open problems and future research directions in medical AD.

Taxonomy of Primary Methods

1. Reconstruction-Based Methods

Basic Idea: Train a model to reconstruct only normal images, then use the reconstruction error as an anomaly score. We divide these methods into:

  • Image-Reconstruction Approaches
  • Feature-Reconstruction Approaches
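
As a minimal sketch of the image-reconstruction idea, the snippet below turns a reconstruction error into a pixel-level anomaly map and an image-level score. The function name `anomaly_scores`, the squared-error distance, and the mean aggregation are illustrative assumptions, not the benchmark's implementation; the all-zero `reconstruction` stands in for the output of a model trained only on normal images.

```python
import numpy as np

def anomaly_scores(image, reconstruction):
    """Return (pixel-level error map, image-level score) from squared error."""
    image = np.asarray(image, dtype=np.float64)
    reconstruction = np.asarray(reconstruction, dtype=np.float64)
    error_map = (image - reconstruction) ** 2  # pixel-level anomaly map
    image_score = float(error_map.mean())      # image-level score (assumed: mean)
    return error_map, image_score

# Toy data: a flat "normal" image and a copy with a bright 2x2 region.
normal = np.zeros((8, 8))
abnormal = normal.copy()
abnormal[3:5, 3:5] = 1.0

# Hypothetical model output: it reproduces the normal background only.
reconstruction = np.zeros((8, 8))

_, normal_score = anomaly_scores(normal, reconstruction)
error_map, abnormal_score = anomaly_scores(abnormal, reconstruction)
```

Here the abnormal image receives a higher score than the normal one, and `error_map` is non-zero only where the reconstruction failed, which is what pixel-level segmentation relies on.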

2. Self-Supervised Learning-Based Methods

  • One-Stage Approaches: Train a model to detect synthetic anomalies, then apply it to real anomalies.
  • Two-Stage Approaches: First, learn self-supervised representations on normal data, then train a one-class classifier on the learned features.
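
To illustrate the one-stage idea, here is a hedged sketch of how a synthetic anomaly and its ground-truth mask might be produced from a normal image by copying a patch to another location. The helper `make_synthetic_anomaly` and its parameters are hypothetical and do not correspond to any specific method evaluated in the benchmark.

```python
import numpy as np

def make_synthetic_anomaly(image, patch_size, rng):
    """Copy a random patch of a normal 2-D image to another random location.

    Returns the corrupted image and a binary mask of the pasted region.
    Note: source and destination may coincide, in which case the image is
    unchanged; a real pipeline would likely resample in that case.
    """
    h, w = image.shape
    # Top-left corners of the source (s) and destination (d) patches.
    sy, dy = rng.integers(0, h - patch_size + 1, size=2)
    sx, dx = rng.integers(0, w - patch_size + 1, size=2)
    corrupted = image.copy()
    corrupted[dy:dy + patch_size, dx:dx + patch_size] = (
        image[sy:sy + patch_size, sx:sx + patch_size]
    )
    mask = np.zeros((h, w), dtype=np.uint8)
    mask[dy:dy + patch_size, dx:dx + patch_size] = 1
    return corrupted, mask

rng = np.random.default_rng(0)
normal = rng.random((16, 16))  # stand-in for a normal training image
corrupted, mask = make_synthetic_anomaly(normal, patch_size=4, rng=rng)
```

A detector would then be trained to separate `normal` from `corrupted` (or to predict `mask`), in the hope that this skill transfers to real anomalies.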


Benchmark

1. Datasets

We evaluate methods on seven medical datasets across five imaging modalities.

2. Evaluated Methods

We compare 30 representative AD methods, covering:

  • Reconstruction-based methods
  • Self-supervised learning-based methods


Main Experimental Results


Please refer to the original paper for a detailed analysis.

Conclusion

This paper presents a comprehensive benchmark for medical anomaly detection, incorporating seven datasets and evaluating 30 representative methods. The key findings are summarized as follows:

  1. Reconstruction-Based Methods as Strong Baselines

    • Without pretraining, reconstruction-based methods outperform self-supervised learning (SSL) methods.
    • The simplest AE serves as a good baseline, achieving near-perfect performance on simpler datasets like Hyper-Kvasir and OCT2017.
    • Recommendation: Future works should always include AE as a reference.
  2. Importance of Latent Space Configuration & Reconstruction Loss

    • Latent Space Size:
      • Datasets with local anomalies (near out-of-distribution, or near-OOD) benefit from small latent sizes. Among these datasets, the more complex ones require larger latent sizes for optimal performance.
      • Datasets with global semantic anomalies (far-OOD) perform better with large latent sizes.
      • Self-adaptive latent space restriction remains unexplored.
    • Reconstruction Loss:
      • Perceptual loss significantly outperforms $\ell_2$ loss on most datasets but fails on the ISIC2018 dataset.
      • No existing distance function outperforms the simple $\ell_2$ distance on all datasets, so designing a distance that is reliably better remains a promising research direction.
  3. Effectiveness of Pretrained Models

    • ImageNet pretrained weights are highly effective in medical AD, used for:
      • Distance measurement
      • Input transformation
      • Feature extraction
    • Challenge: Fine-tuning these weights for task-specific datasets remains an open problem.
    • Future Direction: Vision-Language Models (VLMs) offer new possibilities for leveraging pretrained features.
  4. Exploration of Special AD Settings

    • One-Class Semi-Supervised Learning and Zero-/Few-Shot Learning align well with real-world scenarios.
    • VLM advancements facilitate zero-/few-shot AD, making this a promising future research area.
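
For reference, the two reconstruction distances compared in finding 2 are commonly written as below. This is a sketch under assumed notation: $x$ is the input, $\hat{x}$ its reconstruction, and $\phi_l$ denotes the layer-$l$ features of a fixed pretrained network; the specific network and layers are not stated here.

$$
\begin{aligned}
d_{\ell_2}(x, \hat{x}) &= \lVert x - \hat{x} \rVert_2^2, \\
d_{\mathrm{perc}}(x, \hat{x}) &= \sum_{l} \lVert \phi_l(x) - \phi_l(\hat{x}) \rVert_2^2 .
\end{aligned}
$$

The first compares raw pixel intensities, while the second compares the two images in the feature space of the pretrained network, which is one way pretrained weights serve for distance measurement.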

Our benchmark establishes a solid foundation for future research in medical anomaly detection. By providing fair comparisons and insights into key challenges, we hope to guide the development of more effective and robust AD methods, ultimately benefiting rare disease detection and health screening.

For the full original paper, please visit this link.