[MICCAI 2024 Paper] Histopathology Report Generation via Local-global Feature Encoding and Cross-modal Context Interaction

Recently, a study on medical report generation by HKUST Smart Lab was accepted for presentation at the International Conference on Medical Image Computing and Computer Assisted Intervention (MICCAI) 2024. This work proposes a novel framework for histopathology report generation, named HistGen, aimed at facilitating the labor-intensive, time-consuming, and error-prone task of histopathology report writing with a deep learning-enabled automated solution. The related code, model weights, and dataset are open-sourced at https://github.com/dddavid4real/HistGen. The paper is publicly available at https://arxiv.org/abs/2403.05396. The following is a detailed introduction to this work.

Abstract

Histopathology serves as the gold standard in cancer diagnosis, with clinical reports being vital in interpreting and understanding this process, guiding cancer treatment and patient care. The automation of histopathology report generation with deep learning stands to significantly enhance clinical efficiency and lessen the labor-intensive, time-consuming burden on pathologists in report writing. In pursuit of this advancement, we introduce HistGen, a multiple instance learning-empowered framework for histopathology report generation together with the first benchmark dataset for evaluation. Inspired by diagnostic and report-writing workflows, HistGen features two delicately designed modules, aiming to boost report generation by aligning whole slide images (WSIs) and diagnostic reports at both local and global granularities. To achieve this, a local-global hierarchical encoder is developed for efficient visual feature aggregation from a region-to-slide perspective. Meanwhile, a cross-modal context module is proposed to explicitly facilitate alignment and interaction between distinct modalities, effectively bridging the gap between the extensive visual sequences of WSIs and corresponding highly summarized reports. Experimental results on WSI report generation show the proposed model outperforms state-of-the-art (SOTA) models by a large margin. Moreover, the results of fine-tuning our model on cancer subtyping and survival analysis tasks further demonstrate superior performance compared to SOTA methods, showcasing strong transfer learning capability. Dataset and code are available at https://github.com/dddavid4real/HistGen.

Introduction

Histopathology tissue analysis reveals critical details about tumor characteristics and is considered the gold standard in cancer diagnosis and prognosis. In the last decade, the increasing availability of digital WSIs has given rise to computational pathology (CPath), promoting diagnostics with computer-assisted tools. The recent success of deep learning has notably advanced this field, with models outperforming experienced pathologists on specific tasks. However, pathologists are still required to compile findings and craft reports for each WSI, a task that is labor-intensive, time-consuming, and error-prone. Thus, the automated generation of pathology reports holds immense potential for facilitating the report-writing process and alleviating the heavy workload on pathologists. Additionally, these automated reports can underscore abnormalities and offer diagnostic rationales, aiding pathologists in their analyses.

Contrasting with the field of radiology report generation, WSI report generation faces unique challenges. A primary obstacle is the lack of recognized benchmark datasets; existing datasets predominantly focus on patch-level images and captions, neglecting comprehensive, global descriptions of entire WSIs. Additionally, the gigapixel size of WSIs poses unique challenges for the visual encoding stage of report generation and hinders the direct adoption of existing report generation methods. Moreover, the contrast between enormous informative WSI patches of dense visual signals and succinctly summarized diagnostic reports of discrete textual tokens underscores the need for cross-modal interactions. Further, prevalent WSI analysis methods rely on multiple instance learning (MIL), necessitating pre-trained feature extractors, which we identify as a critical bottleneck, impeding optimal performance in WSI report generation.

Building on the success of image captioning and radiology report generation, and addressing the unique challenges of WSI data, we present HistGen, a MIL-based histopathology report generation framework designed to address WSI report generation from a cross-modal interaction perspective by aligning WSIs and diagnostic reports across local and global granularities. Our contributions are summarized as follows:

  1. Dataset Curation: We curate a benchmark WSI-report dataset of around 7,800 pairs by building a report collection and cleaning pipeline based on the TCGA Data Portal.
  2. Model Design: A local-global hierarchical visual encoder is proposed for effective encoding and aggregation of extensive WSI patch features in a region-to-slide manner. Moreover, we develop a cross-modal context module to enable interactions between visual encoding and textual decoding, leveraging cross-modal information.
  3. MIL Feature Extractor: Further, we pre-train a general-purpose MIL feature extractor on over 55,000 WSIs to boost MIL feature encoding.
  4. Performance Evaluation: With extensive experimental validation, our framework outperforms SOTA methods on WSI report generation by a large margin. Moreover, experimental results on cancer subtyping and survival analysis exhibit the superior transfer learning capability of HistGen, achieving SOTA performance on both tasks.

Methodology

WSI-Report Dataset Curation

Differing from the well-established benchmarks in radiology report generation, there is a lack of accessible WSI-report datasets, hindering the advancement of this field. In light of this, our goal is to curate a publicly available WSI-report dataset to serve as an evaluation benchmark. Based on uncurated report data from the TCGA Data Portal, we build a dataset curation pipeline detailed as follows. We begin by downloading diagnostic report PDFs from TCGA and extracting raw text from them. Since the raw text is noisy and cluttered with redundant information, we use GPT-4 for further cleaning and summarization; the prompt used for this preprocessing is shown in Fig. 1. The prompt checks formatting and spelling while preserving each report's diagnostic meaning, ensuring clinical correctness in the final report. Finally, by matching the case IDs from the reports with the corresponding WSIs, we form a WSI-report dataset with 7,753 pairs, encompassing various diseases from different primary sites.

Figure 1. Prompt for report data preprocessing using GPT-4.
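To make the curation pipeline concrete, below is a minimal sketch of the text-extraction and GPT-4 cleaning steps. The use of pdfplumber, the OpenAI Python client, and the shortened instruction string are illustrative assumptions on our part; the actual prompt is the one shown in Fig. 1, and the paper does not prescribe a specific PDF library.

```python
import pdfplumber
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Placeholder instruction; the actual prompt used by the authors is shown in Fig. 1.
CLEANING_PROMPT = (
    "Clean and summarize the following pathology report. Fix formatting and spelling, "
    "remove redundant administrative text, and preserve all diagnostic content."
)

def extract_raw_text(pdf_path: str) -> str:
    """Extract raw text from a downloaded TCGA diagnostic report PDF."""
    with pdfplumber.open(pdf_path) as pdf:
        return "\n".join(page.extract_text() or "" for page in pdf.pages)

def clean_report(raw_text: str) -> str:
    """Ask GPT-4 to clean and summarize the noisy extracted text."""
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": CLEANING_PROMPT},
            {"role": "user", "content": raw_text},
        ],
    )
    return response.choices[0].message.content

# The cleaned reports are then matched to WSIs by case ID to form WSI-report pairs, e.g.
# report = clean_report(extract_raw_text("TCGA-XX-XXXX.pdf"))  # hypothetical file name
```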

HistGen for Automated WSI Report Generation

Fig. 2 shows the overview of our HistGen framework. Fig. 2a denotes the local-global hierarchical visual encoder, which first extracts patch features with our pre-trained backbone and then efficiently encodes them in a local-to-global manner. The visual features are then fed into the decoder module (shown in Fig. 2c) for report generation. Fig. 2b corresponds to the cross-modal context-aware learning module, designed to bridge the two modalities and store knowledge from previous iterations for the model to refer to. Finally, Fig. 2d highlights our strategy for fine-tuning the model on WSI-level prediction tasks.

Figure 2. Overview of the proposed HistGen framework: (a) local-global hierarchical encoder module, (b) cross-modal context module, (c) decoder module, (d) transfer learning strategy for cancer diagnosis and prognosis.

Pre-training Feature Extractor

To boost the MIL feature encoding process, we collect over 55,000 WSIs from public datasets to develop a general-purpose feature extractor that offers informative and robust patch embeddings. After tissue segmentation and patch tiling, we leverage around 200 million patches to pre-train a ViT-L model \(f(\cdot;\theta)\) with the DINOv2 strategy, which is then used for feature extraction, as depicted in Fig. 2a.
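For illustration, the snippet below sketches how patch-level features could be extracted from a tiled WSI with a DINOv2-style ViT-L backbone. It uses the public DINOv2 ViT-L/14 weights from torch.hub as a stand-in; in practice, the authors' pathology-pretrained weights (open-sourced in the HistGen repository) would be loaded instead, and the preprocessing, batch size, and file layout here are assumptions.

```python
import torch
from torch.utils.data import DataLoader, Dataset
from torchvision import transforms
from PIL import Image

# Stand-in backbone: public DINOv2 ViT-L/14 weights. The paper's extractor is a ViT-L
# pre-trained with DINOv2 on ~200M pathology patches; swap in those weights in practice.
backbone = torch.hub.load("facebookresearch/dinov2", "dinov2_vitl14").eval().cuda()

preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=(0.485, 0.456, 0.406), std=(0.229, 0.224, 0.225)),
])

class PatchDataset(Dataset):
    """Patches tiled from one WSI, stored as image files (hypothetical layout)."""
    def __init__(self, patch_paths):
        self.patch_paths = patch_paths
    def __len__(self):
        return len(self.patch_paths)
    def __getitem__(self, idx):
        return preprocess(Image.open(self.patch_paths[idx]).convert("RGB"))

@torch.no_grad()
def extract_wsi_features(patch_paths, batch_size=256):
    """Return a (num_patches, 1024) bag of patch embeddings for one WSI."""
    loader = DataLoader(PatchDataset(patch_paths), batch_size=batch_size, num_workers=8)
    return torch.cat([backbone(batch.cuda()).cpu() for batch in loader])
```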

Local-global Hierarchical Encoder

Efficiently leveraging the information and correlations within a WSI's extensive patch sequence is a key challenge. Given that pathology reports typically contain both region-level details and overall WSI diagnoses, we reason that the visual encoder should process visual features from a local-to-global perspective to effectively capture both levels of granularity.

In this work, we propose the local-global hierarchical encoder (LGH, see Fig. 2a) for region-aware representation learning. Specifically, we first segment the WSI into \(N\) regions \(X_i=\{x_{i,1}, x_{i,2}, \cdots, x_{i,S}\}, i=1,\cdots,N\), each of region size \(S\). In addition, we append a region token \(x_{i, S+1}\) at the end of each region sequence to learn a region-level representation. We then employ \(f(\cdot;\theta)\) to obtain patch embeddings and add positional encoding (PE) to each region sequence. Next, we design a region-level encoder \(E_l(\cdot;\theta_l)\) to enable intra-region interactions, producing the encoded region sequence \(X_i'\):

\[X_i' = \{x_{i,1}', \cdots, x_{i,S+1}'\} = E_l\left(f(x_{i,1};\theta) + e_{i,1}, \cdots, f(x_{i,S+1};\theta) + e_{i,S+1};\theta_l\right).\]

Moreover, to capture inter-region information and long-range dependencies among patches from different regions, we add PE to each region-level representation token \(x_{i, S+1}'\) and process the region tokens through a WSI-level encoder \(E_g(\cdot;\theta_g)\) for global interactions. These global-aware region tokens are then placed back into their original regions, and a second pass of \(E_l(\cdot;\theta_l)\) infuses the global information into each region. Finally, we obtain region-level representations \(X_i''\) by attentive pooling on each region:

\[
\begin{aligned}
X_i'' &= \text{Pooling}\left(\{x_{i,1}'', \cdots, x_{i,S+1}''\}\right) = \text{Pooling}\left(E_l(x_{i,1}', \cdots, \tilde{x}_{i,S+1}';\theta_l)\right) \\
&= \text{Pooling}\left(E_l\left(x_{i,1}', \cdots, x_{i,S}', E_g(x_{1,S+1}', x_{2,S+1}', \cdots, x_{N,S+1}';\theta_g)_{i};\theta_l\right)\right),
\end{aligned}
\]

and then input the region representations \(\{X_1'', X_2'', \cdots, X_N''\}\) into the decoder (see Fig. 2c) for report generation.
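As a concrete reference, below is a minimal PyTorch sketch of this local-to-global encoding. It follows the two-stage scheme (intra-region encoding, global interaction among region tokens, a second intra-region pass, and attentive pooling), but the hyper-parameters, zero-padding of the last region, and parameter initialization are illustrative assumptions rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn

class LocalGlobalEncoder(nn.Module):
    """Minimal sketch of the LGH encoder (Fig. 2a); sizes are illustrative."""
    def __init__(self, dim=1024, region_size=256, nhead=8, num_layers=2):
        super().__init__()
        self.S = region_size
        self.region_token = nn.Parameter(torch.randn(1, 1, dim) * 0.02)   # x_{i,S+1}
        self.pos_emb = nn.Parameter(torch.randn(1, region_size + 1, dim) * 0.02)
        self.E_l = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, nhead, batch_first=True), num_layers)  # region-level
        self.E_g = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, nhead, batch_first=True), num_layers)  # WSI-level
        self.attn_pool = nn.Linear(dim, 1)  # attentive pooling scores

    def forward(self, patches):                        # patches: (num_patches, dim)
        D = patches.size(-1)
        pad = (-patches.size(0)) % self.S               # zero-pad so regions split evenly (assumption)
        patches = torch.cat([patches, patches.new_zeros(pad, D)])
        regions = patches.view(-1, self.S, D)           # (N, S, D)
        N = regions.size(0)
        # append a learnable region token to each region and add positional encoding
        x = torch.cat([regions, self.region_token.expand(N, -1, -1)], dim=1) + self.pos_emb
        x = self.E_l(x)                                 # intra-region interactions
        # global interactions among the N region tokens via the WSI-level encoder
        global_tokens = self.E_g(x[:, -1, :].unsqueeze(0))          # (1, N, D)
        x = torch.cat([x[:, :-1, :], global_tokens.transpose(0, 1)], dim=1)
        x = self.E_l(x)                                 # second pass: infuse global info into regions
        # attentive pooling over each region -> one representation X_i'' per region
        w = torch.softmax(self.attn_pool(x), dim=1)     # (N, S+1, 1)
        return (w * x).sum(dim=1)                       # (N, D)
```

The returned tensor holds one representation per region, which the decoder then attends over for report generation.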

Cross-modal Context Module

Given that report generation is an image-to-text task involving the transformation between distinct modalities, bridging the gap between pixel-based visual inputs and token-based textual outputs is crucial. Meanwhile, leveraging the model’s accumulated knowledge and memory from previous iterations is expected to significantly enhance this task.

The cross-modal context module (CMC, see Fig. 2b) is proposed in light of the above insights, denoted as \(C = \{C_1, C_2, \cdots, C_m\}\), where \(C_i\) is a context vector stored in this module. For visual inputs \(\{X_1',\cdots, X_N'\}\), we select key patches \(\{p_1, \cdots, p_l\}\) via uniform sampling-based prototype learning, which uses representative patches as queries for the CMC module, mitigating the redundancy and computational complexity of the patch sequence. After querying, the responses generated by the CMC module are aggregated back into the original visual features, infusing them with cross-modal information. This step helps the model leverage knowledge learned from the training data that might not be captured by the input embeddings alone. Similar interactions are applied to the textual embeddings \(\{y_1, y_2, \cdots, y_t\}\) from the decoder, except that prototype learning is not applied. Additionally, the CMC module serves as an external memory that the model accesses and updates iteratively, progressively enhancing generation performance.
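To illustrate the mechanism, the following is a hedged sketch of such a context module: a learnable memory of context vectors is queried via cross-attention by uniformly sampled visual prototypes (and by all textual tokens), and the responses are fused back into the input features. The memory size, number of prototypes, and residual fusion rule are assumptions for illustration, not the authors' exact design.

```python
import torch
import torch.nn as nn

class CrossModalContext(nn.Module):
    """Sketch of a CMC-style module (Fig. 2b): a shared memory C = {C_1, ..., C_m}
    queried by both visual and textual features; sizes are illustrative."""
    def __init__(self, dim=512, num_context=2048, num_prototypes=128, nhead=8):
        super().__init__()
        self.context = nn.Parameter(torch.randn(num_context, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, nhead, batch_first=True)
        self.num_prototypes = num_prototypes

    def query(self, queries):
        """Cross-attend queries (B, L, D) to the stored context vectors."""
        mem = self.context.unsqueeze(0).expand(queries.size(0), -1, -1)
        responses, _ = self.attn(queries, mem, mem)
        return responses

    def forward_visual(self, feats):
        """feats: (B, N, D) visual features. Uniformly sample prototype positions as
        queries, then fuse their responses back as a residual (assumed fusion rule)."""
        idx = torch.linspace(0, feats.size(1) - 1, self.num_prototypes, device=feats.device).long()
        responses = self.query(feats[:, idx, :])
        feats = feats.clone()
        feats[:, idx, :] = feats[:, idx, :] + responses   # infuse cross-modal context
        return feats

    def forward_textual(self, tokens):
        """tokens: (B, T, D) word embeddings from the decoder; no prototype sampling."""
        return tokens + self.query(tokens)
```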

Experiments

In this work, our model is first evaluated on the histopathology report generation task with the proposed dataset and then fine-tuned on cancer subtyping and survival analysis tasks to demonstrate its transfer learning capability.

WSI-Report Generation

Pre-trained Feature Extractor Comparison

We compare our pre-trained DINOv2 ViT-L with an ImageNet-pretrained ResNet50 and the WSI-pretrained CTransPath. Most MIL methods utilize an ImageNet-pretrained ResNet50, and the results in Tab. 1 show that this leads to relatively low generation performance. We then turn to CTransPath, which is pre-trained on 30,000 WSIs with a contrastive learning strategy. However, Tab. 1 shows only minor gains in the generation performance of the baseline models. In contrast, as illustrated in Tab. 1, the baseline models improve significantly when adopting our DINOv2 ViT-L, and several models such as R2GenCMN already achieve reasonable generation performance. This indicates that our DINOv2 ViT-L significantly outperforms both out-of-domain and domain-aligned pre-trained feature extractors, boosting downstream tasks such as report generation.

Table 1. Comparisons of the proposed HistGen with other SOTA models on WSI report generation w.r.t. NLG metrics.

WSI-Report Generation Result Analysis

For the SOTA report generation methods in Tab. 1, Show&Tell and UpDownAttn use a CNN for image encoding and an RNN for language decoding, yielding lower performance than Transformer-based methods such as R2Gen. M2Transformer and R2GenCMN further improve performance by aligning the encoding and decoding stages. However, due to the large sequence length of WSIs, visual token selection is required to apply these methods to WSI report generation, which degrades their performance. Our proposed method addresses this problem and outperforms these methods by a large margin by efficiently exploring the local and global information within the patch sequence, interactively aligning the visual and textual context, and effectively utilizing the model's knowledge and memory from previous iterations. Building upon the DINOv2 ViT-L feature extractor, we compare our method to the aforementioned models. Tab. 1 confirms the superiority of our report generation model over SOTA models, with an average improvement of around 3% across all NLG metrics.

Transfer Learning for Cancer Diagnosis and Prognosis

WSI report generation can be regarded as a form of vision-language pre-training. To verify that our model learns diagnosis-related information during this task, we further fine-tune it and evaluate it on cancer subtyping and survival analysis with the strategy illustrated in Fig. 2d.
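As a rough illustration of this transfer strategy, the sketch below reuses the report-generation encoder as a WSI feature aggregator and attaches a task-specific head; the pooling choice and head design are illustrative assumptions, not the paper's exact fine-tuning recipe.

```python
import torch
import torch.nn as nn

class WSIPredictionHead(nn.Module):
    """Hedged sketch of the transfer-learning setup (Fig. 2d)."""
    def __init__(self, pretrained_encoder, dim=1024, num_outputs=2):
        super().__init__()
        self.encoder = pretrained_encoder           # e.g. the LocalGlobalEncoder sketched above
        self.head = nn.Linear(dim, num_outputs)     # subtyping logits, or 1 output as a survival risk score

    def forward(self, patch_feats):                 # patch_feats: (num_patches, dim)
        regions = self.encoder(patch_feats)         # (N, dim) region representations
        slide_emb = regions.mean(dim=0)             # simple mean pooling into a slide embedding (assumption)
        return self.head(slide_emb)
```

Fine-tuning would then use a cross-entropy loss for subtyping, or a survival objective (e.g., a Cox or negative log-likelihood loss) for prognosis.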

Table 2. Transfer Learning Results on Cancer Subtyping.

Tab. 2 summarizes the comparison of our model with SOTA MIL methods on cancer subtyping datasets, showing that our method outperforms all MIL approaches on all three classification tasks by a significant margin.

Table 3. Transfer Learning Results on Survival Analysis.

Results of six survival analysis tasks, summarized in Tab. 3, reveal that our model outperforms others in most tasks and maintains the highest average score across all six datasets, manifesting superior transfer capabilities.

Tabs. 2 and 3 highlight the effectiveness of our proposed model on WSI-level prediction tasks as well as the feasibility of transferring it to these downstream tasks, confirming its ability to capture diagnostic information during WSI report generation.

Conclusion

In this work, we introduce HistGen, a MIL-empowered framework designed to enhance automated histopathology report generation, covering everything from model design to benchmark evaluation. To this end, we develop an advanced WSI report generation model that efficiently leverages local-region and global-WSI information while explicitly aligning the cross-modal encoding and decoding stages. For experimental evaluation, we curate a benchmark WSI-report dataset containing around 7,800 WSI-report pairs with high text quality. Extensive experiments on report generation, cancer subtyping, and survival analysis demonstrate the superiority of the proposed model as well as its strong transfer learning capabilities. While this work lays the groundwork for WSI report generation, its scope is currently limited to histopathology. Future efforts will expand to other fields such as radiology and ophthalmology, treating report generation across different fields as a unified problem within a single framework.