[IEEE-TMI] Reg2RG: Large Language Model with Region-Guided Referring and Grounding for CT Report Generation

A recent study from HKUST Smart Lab, led by Zhi-Xuan Chen, has been accepted by IEEE Transactions on Medical Imaging (Impact Factor: 8.9). The work presents Reg2RG, a novel region-guided large language model (LLM) framework for chest CT report generation. By explicitly linking image regions with diagnostic text, Reg2RG significantly improves the accuracy, interpretability, and clinical value of automated CT reporting.

Introduction

In medical image analysis, automated report generation aims to assist clinicians in interpreting and articulating the visual content of scans. The challenge lies in comprehending complex anatomical structures and expressing findings accurately in natural language. Chest CT report generation is a representative task that requires precise lesion localization, feature interpretation, and the synthesis of clinically relevant textual descriptions. However, most existing methods rely on end-to-end full-image modeling, in which the entire scan is encoded as a single global representation for training the language model. Although these methods achieve a basic level of automation, they commonly exhibit poor localization, diagnostic bias, and limited clinical interpretability.

To address this, we focus on a central question: Can a clear correspondence be established between specific visual regions in medical images and their respective diagnostic paragraphs in textual reports? Inspired by cross-modal tasks such as image-text matching, referring expression, and visual grounding, Reg2RG introduces a core concept: extracting discriminative local image features via region masks and embedding them as tokens into large language models. This allows the model to generate natural language reports that not only follow clinical conventions but also explicitly refer to the relevant anatomical regions in the CT scan, thereby improving the traceability and interpretability of diagnostic conclusions.

Moreover, chest CT scans are high-dimensional, with lesions that may be subtle and dispersed across multiple anatomical regions. Thus, simultaneously modeling both local anomalies and global anatomical context is essential for generating clinically valuable reports. To address this challenge, Reg2RG adopts a dual-branch encoding architecture that processes local region features and global semantics independently, then integrates them during the language generation phase. This design enhances both localized diagnostic precision and global contextual consistency, enabling the language model to produce well-structured, accurate, and region-aware diagnostic paragraphs. The overall framework of Reg2RG is illustrated in Figure 1.


Figure 1: Overview of the Reg2RG framework. The model extracts local region features and global context, then generates region-referred diagnostic paragraphs using a large language model.

Medical Image Report Generation

Medical image report generation is a critical task in multimodal understanding, aiming to automatically generate clinically meaningful descriptions from complex medical images. Early approaches typically adopted encoder-decoder frameworks, using convolutional neural networks (CNNs) to extract global image features, followed by recurrent neural networks (RNNs) or Transformer-based models to produce diagnostic text. These methods focused on capturing overall image semantics but often overlooked fine-grained representation and localization of specific lesions, leading to reports that lacked interpretability and diagnostic specificity. To improve spatial awareness, some studies introduced attention mechanisms to allow models to focus on key regions. However, these approaches still relied on global features, which made it difficult to explicitly align textual fragments with specific image regions. In recent years, the advent of multimodal large language models (LLMs) has significantly improved report generation quality, but region-level semantic description and grounding consistency remain substantial challenges.

Region-Level Referring and Grounding

In general vision-language tasks, various studies have explored integrating LLMs with region-level features to enhance fine-grained perception. However, applications in the medical imaging domain remain scarce. Existing methods often fail to capture semantic relationships between different regions or rely solely on coarse positional information when referring to image areas. This results in imprecise descriptions that do not meet the high standards of detail and accuracy required for clinical diagnosis. To address this, we propose a new region-level understanding framework tailored for report generation. Through a decoupled representation strategy, we preserve complete geometric information while enhancing texture details, satisfying the clinical demands for sensitivity to lesion location and morphology. Additionally, we incorporate global context to model inter-regional relationships, improving overall image understanding. A region-text alignment mechanism is also introduced to explicitly link visual regions with report content, enabling efficient and coherent multi-region expression in a single inference step—thus advancing the field of region-level visual perception in medical imaging.

Method

Overall Framework Overview

Traditional CT report generation methods primarily rely on global features of the entire scan, which often fail to capture subtle abnormalities located in specific anatomical regions, thus limiting the diagnostic accuracy and value of the generated reports. To overcome this limitation, our approach introduces multiple local regional features extracted using a generic segmentation module and combines them with global contextual information to guide report generation. Each local feature corresponds to a specific anatomical region, providing more targeted, fine-grained semantic information. After fusing local and global features, these representations are fed into a large language model for inference. This allows the model to accurately identify and describe multiple anatomical areas, resulting in diagnostic reports that are more comprehensive and clinically interpretable.
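
To make the data flow concrete, the following minimal Python sketch traces one inference pass through this pipeline. Every name in it (`segmenter`, `region_encoder`, `global_encoder`, and the prompt-assembly helper `build_multimodal_inputs`, which is sketched in the global-local section below) is a placeholder rather than the released implementation.

```python
import torch

@torch.no_grad()
def generate_report(ct_volume, segmenter, region_encoder, global_encoder, llm, tokenizer):
    """Hypothetical end-to-end sketch; every module name here is a placeholder."""
    # 1) A generic segmentation module yields one binary mask per anatomical region.
    masks = segmenter(ct_volume)                                  # {region_name: (1, 1, D, H, W)}
    # 2) Each region is encoded into its own token sequence (texture + geometry).
    region_tokens = [region_encoder(ct_volume, m)[0] for m in masks.values()]  # each (N_i, llm_dim)
    # 3) The whole volume is encoded once to provide global context.
    global_tokens = global_encoder(ct_volume)[0]                  # (Ng, llm_dim)
    # 4) Visual tokens are spliced into a structured prompt and decoded by the LLM.
    inputs_embeds = build_multimodal_inputs(tokenizer, llm.get_input_embeddings(),
                                            global_tokens, region_tokens)
    out_ids = llm.generate(inputs_embeds=inputs_embeds, max_new_tokens=512)
    return tokenizer.decode(out_ids[0], skip_special_tokens=True)
```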

Decoupled Local Feature Representation Strategy

To efficiently preserve high-resolution details of local regions, we propose a decoupled representation strategy that separates local features into texture and geometric components, which are processed independently and then merged into a unified regional representation. For texture feature extraction, we first apply region masks to isolate target areas from the original CT images. To reduce redundancy and retain high-quality detail, we crop the extracted region images into smaller, focused patches. These are then encoded using an image encoder and compressed through a feature projection module to produce visual embeddings compatible with the language model. For geometric features, we retain the uncropped region masks to preserve spatial position and scale within the original image. These masks are processed through a lightweight encoding network and mapped into representations that can also be understood by the language model. Finally, the texture and geometric features are concatenated to form a complete local representation, which is subsequently used for region-level language generation tasks.
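
The sketch below illustrates this decoupling under simplifying assumptions: `texture_encoder` and `geometry_encoder` are placeholder modules standing in for the pretrained ViT3D and the lightweight mask encoder, the Perceiver-based compression is replaced by plain linear projections, and the fixed crop size is illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DecoupledRegionEncoder(nn.Module):
    """Sketch: encode one anatomical region into concatenated texture + geometry tokens."""

    def __init__(self, texture_encoder, geometry_encoder, feat_dim, llm_dim):
        super().__init__()
        self.texture_encoder = texture_encoder    # placeholder for a pretrained 3D ViT
        self.geometry_encoder = geometry_encoder  # lightweight encoder for the binary mask
        self.texture_proj = nn.Linear(feat_dim, llm_dim)
        self.geometry_proj = nn.Linear(feat_dim, llm_dim)

    def forward(self, volume, mask, crop_size=(32, 96, 96)):
        # volume, mask: (1, 1, D, H, W); mask is a binary mask for one anatomical region.
        # Texture branch: mask out the region, crop its bounding box, resize, and encode.
        region = volume * mask
        d, h, w = mask[0, 0].nonzero(as_tuple=True)
        crop = region[...,
                      d.min().item():d.max().item() + 1,
                      h.min().item():h.max().item() + 1,
                      w.min().item():w.max().item() + 1]
        crop = F.interpolate(crop, size=crop_size, mode="trilinear", align_corners=False)
        texture_tokens = self.texture_proj(self.texture_encoder(crop))     # (1, Nt, llm_dim)
        # Geometry branch: encode the *uncropped* mask so position and scale are preserved.
        geometry_tokens = self.geometry_proj(self.geometry_encoder(mask))  # (1, Ng, llm_dim)
        # Concatenate both components into a single regional token sequence for the LLM.
        return torch.cat([texture_tokens, geometry_tokens], dim=1)
```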

Global-Local Feature Collaboration Mechanism

Different anatomical regions in medical images often exhibit strong structural and semantic interdependencies. Relying solely on local information is insufficient for comprehensive understanding. To address this, we introduce a global-local feature collaboration mechanism that jointly models both levels of information to enhance contextual consistency and narrative coherence in the generated reports. Specifically, the model receives both independent local regional features and global visual features extracted from the entire CT volume. These features are processed by a unified projection module and embedded into the input prompt of the language model.

We design a structured prompt in which global and regional features are represented by special tokens that are subsequently replaced with their corresponding embeddings. The language model uses these embeddings during inference, allowing it to reason about relationships between different anatomical regions and produce reports that are consistent and explainable.
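
A minimal sketch of this prompt assembly, assuming a Hugging Face tokenizer to which a placeholder token (here named `<img>`; the actual token string is not specified in this article) has been added as a special token:

```python
import torch

def build_multimodal_inputs(tokenizer, embed_layer, global_feats, region_feats, region_token="<img>"):
    """Sketch: splice visual embeddings into the prompt at placeholder-token positions.

    global_feats: (Ng, llm_dim) tensor; region_feats: list of (Nr_i, llm_dim) tensors.
    Assumes `region_token` was added to the tokenizer as a special token.
    """
    prompt = f"Global context: {region_token}\n"
    for i in range(len(region_feats)):
        prompt += f"Region {i + 1}: {region_token}\n"
    prompt += "Describe the findings for each region."

    ids = tokenizer(prompt, return_tensors="pt").input_ids[0]
    placeholder_id = tokenizer.convert_tokens_to_ids(region_token)
    text_embeds = embed_layer(ids)                        # (T, llm_dim)

    visual_chunks = [global_feats] + list(region_feats)   # one chunk per placeholder, in order
    pieces, cursor = [], 0
    for pos in (ids == placeholder_id).nonzero(as_tuple=True)[0].tolist():
        pieces.append(text_embeds[cursor:pos])            # text tokens before this placeholder
        pieces.append(visual_chunks.pop(0))               # visual tokens replace the placeholder
        cursor = pos + 1
    pieces.append(text_embeds[cursor:])                   # trailing text tokens
    return torch.cat(pieces, dim=0).unsqueeze(0)          # (1, T', llm_dim) for `inputs_embeds`
```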

Region-Report Alignment Training Strategy

To enhance the model’s ability to refer accurately to specific regions during report generation, we introduce a region-report alignment training strategy. This approach requires the model to identify the anatomical region associated with each paragraph before generating the corresponding diagnostic content.
During training, we prepend a prefix to each ground-truth regional report indicating the region it describes. For instance, each regional report is prefixed with text such as “Region i corresponds to the heart,” guiding the model to perform anatomical identification before generating the report. To prevent the model from memorizing the input order, we randomly shuffle the sequence of local features in each iteration. This forces the model to recognize region identity based solely on the embedded features rather than positional bias.
The regional reports are still trained using standard language modeling loss, and the generation remains in free-form textual format. This strategy significantly improves the model’s regional awareness, allowing each paragraph of the generated report to not only be accurate in content but also traceable to a specific anatomical location, thus enhancing interpretability and clinical reliability.
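
The sketch below shows one way to construct such a training sample; the region names and the helper signature are illustrative rather than the exact implementation.

```python
import random

REGIONS = ["lung", "heart", "pleura", "trachea", "mediastinum",
           "esophagus", "bone", "abdomen", "thyroid", "breast"]   # illustrative region names

def build_training_sample(region_features, region_reports):
    """Sketch: build one shuffled, region-prefixed training target.

    region_features: {region_name: token tensor}; region_reports: {region_name: report text}.
    """
    order = list(region_reports.keys())
    random.shuffle(order)                       # shuffle so the model cannot rely on position
    shuffled_feats, target_text = [], ""
    for i, name in enumerate(order, start=1):
        shuffled_feats.append(region_features[name])
        # The prefix forces the model to name the region before describing it.
        target_text += f"Region {i} corresponds to the {name}. {region_reports[name]}\n"
    # `target_text` is supervised with the standard language-modeling loss.
    return shuffled_feats, target_text
```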

Experimental Results

Datasets & Metrics

Reg2RG was evaluated on two large-scale chest CT datasets:

  • RadGenome-ChestCT: 25,000+ region-annotated CTs with structured reports.
  • CTRG-Chest-548K: 1,800 CTs with unstructured reports, auto-segmented for region-level evaluation.

Performance was assessed using:

  • NLG metrics: BLEU, METEOR, ROUGE-L (textual similarity)
  • Clinical efficacy metrics: Precision, recall, F1 (diagnostic accuracy, via RadBERT extraction)
  • Region recognition accuracy: Correctly linking text to anatomical regions
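
As a rough illustration of how these scores can be computed, the snippet below uses the Hugging Face `evaluate` package for the NLG metrics and scikit-learn for precision/recall/F1 over extracted finding labels; the example reports and labels are made up, and the RadBERT-based extraction pipeline itself is not reproduced here.

```python
import evaluate
from sklearn.metrics import precision_recall_fscore_support

# Toy generated/reference reports (illustrative only).
preds = ["the heart is normal in size with no pericardial effusion"]
refs  = ["heart size is within normal limits and there is no pericardial effusion"]

bleu   = evaluate.load("bleu").compute(predictions=preds, references=[[r] for r in refs])["bleu"]
meteor = evaluate.load("meteor").compute(predictions=preds, references=refs)["meteor"]
rouge  = evaluate.load("rouge").compute(predictions=preds, references=refs)["rougeL"]
print(f"BLEU={bleu:.3f}  METEOR={meteor:.3f}  ROUGE-L={rouge:.3f}")

# Clinical efficacy: binary finding labels extracted from reports (illustrative values).
y_true = [[1, 0, 1, 0], [0, 1, 1, 0]]   # findings present in the reference reports
y_pred = [[1, 0, 0, 0], [0, 1, 1, 1]]   # findings present in the generated reports
p, r, f1, _ = precision_recall_fscore_support(y_true, y_pred, average="micro")
print(f"Precision={p:.3f}  Recall={r:.3f}  F1={f1:.3f}")
```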

Experimental Setup and Baselines

We use the SAT segmentation model to process CT volumes and resample all images to a spatial resolution of 256×256×64. Local texture features are extracted using a pretrained ViT3D encoder and adapted via a Perceiver-based dimensionality reduction module. The same ViT3D backbone is used for global feature extraction. Geometric features are encoded by a lightweight three-layer ViT3D model and mapped into the language model’s input space through fully connected layers.
For language decoding, we adopt the LLaMA2-7B model with the LoRA fine-tuning strategy for efficiency. The model is optimized using AdamW with a warm-up phase followed by a fixed learning rate. Training is conducted for 6 epochs on the RadGenome dataset and 10 epochs on the CTRG dataset using two NVIDIA RTX 3090 GPUs with a batch size of 16. To reduce memory usage, we apply the ZeRO optimization strategy along with gradient checkpointing.
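
A minimal sketch of this fine-tuning setup with Hugging Face `transformers` and `peft`; the LoRA rank, target modules, learning rate, and warm-up length below are assumptions rather than the paper's exact values, and ZeRO partitioning would be configured through DeepSpeed or Accelerate rather than in this snippet.

```python
import torch
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Load the LLaMA2-7B decoder and enable gradient checkpointing to save memory.
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf", torch_dtype=torch.bfloat16)
model.gradient_checkpointing_enable()

# Attach LoRA adapters; rank, alpha, and target modules here are illustrative choices.
lora_cfg = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05,
                      target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM")
model = get_peft_model(model, lora_cfg)

# AdamW over the trainable (LoRA + projection) parameters, linear warm-up then fixed LR.
optimizer = torch.optim.AdamW((p for p in model.parameters() if p.requires_grad), lr=1e-4)
scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lambda step: min(1.0, step / 500))
```
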
We compare Reg2RG against several state-of-the-art 3D report generation baselines including CT2Rep, RadFM, and M3D, as well as two 2D-based models: R2GenGPT and MedVInT. All models use LLaMA2-7B as the language decoder and receive the same preprocessed input to ensure fairness. For 2D models, we convert 3D volumes into multi-channel 2D representations.

Quantitative Results

As shown in Table 1, Reg2RG outperforms all competing methods on NLG metrics across both datasets. On the RadGenome-ChestCT dataset, Reg2RG surpasses the next-best model, MedVInT, on all language metrics. In particular, it achieves over 9% relative improvement in METEOR, indicating superior vocabulary diversity and semantic coherence. On the CTRG-Chest-548K dataset, Reg2RG also leads in BLEU and METEOR, though its ROUGE-L score is slightly lower, primarily due to the more fragmented structure of region-level reports, which affects the longest common subsequence matching.


Table 1: Reg2RG achieves state-of-the-art results on both language and clinical efficacy metrics compared to strong baselines.

As shown in Table 2, Reg2RG also demonstrates clear superiority in Clinical Efficacy (CE) on the RadGenome dataset. It achieves significant improvements in precision, recall, and F1 score compared to all other models. In particular, the F1 score improves by nearly 20%, validating Reg2RG’s reliability in capturing critical diagnostic information. Notably, Reg2RG maintains a strong balance between precision and recall, whereas many models trade off one for the other, highlighting its robust diagnostic capability.


Table 2: Comparison of Reg2RG with other models on Clinical Efficacy (CE) metrics.

Additionally, we evaluate the model’s performance in region recognition. As reported in Table 3, recognition accuracy exceeds 95% for eight out of ten anatomical regions. Lower performance on the lungs and pleura is attributed to segmentation challenges rather than model limitations. Once accurate masks are available, Reg2RG shows consistently strong recognition performance.


Table 3: Region recognition accuracy of Reg2RG.

Ablation Studies

Ablation experiments confirm the importance of each module:

  • Removing local feature decoupling (texture/geometry separation) degrades region identification.

To further assess the contribution of each module to overall model performance, we conducted a series of ablation experiments. As shown in Table 4, removing the local feature decoupling (LFD) strategy results in performance degradation across several metrics. This highlights the importance of maintaining high-resolution texture and separate geometric information for accurately identifying abnormal regions.


Table 4: Effectiveness of Local Feature Decoupling (LFD) strategy.

  • Using only texture or only geometry reduces performance; combining both with global context yields best results.

In Table 5, we examine the effects of different feature combinations. Using only texture features without spatial location information limits model performance. The inclusion of geometric features improves results significantly, and further addition of global context features enhances the model’s ability to capture semantic relationships across regions.


Table 5: Effectiveness of texture (TXT), geometric (GEO), and global (GLB) features.

  • Disabling region-report alignment leads to less accurate and less interpretable reports.

In Table 6, we explore the impact of the Region-Report Alignment (RRA) training strategy. Without region-guided prompts, the model can still generate coherent text, but the accuracy and consistency of region-specific content degrade notably. This confirms that regional prompting is crucial for reinforcing semantic alignment between visual input and generated text.


Table 6: Effectiveness of Region-Report Alignment (RRA) strategy.

  • Larger LLMs (LLaMA2-7B) provide significant gains over smaller models.

Furthermore, we evaluate how model performance varies with different language model sizes. As shown in Table 7, replacing LLaMA2-7B with a smaller model such as GPT-2 leads to substantial drops in performance across all metrics. This suggests that large models possess superior capabilities for region-level referring, cross-region reasoning, and complex report generation.


Table 7: Effectiveness of language model size on Reg2RG performance.

Report Quality Analysis

To gain deeper insights into the quality of generated reports, we conducted statistical analysis on report length distributions and compared them with real-world references.
As shown in Figure 2, the length distribution of reports generated by Reg2RG closely mirrors that of real clinical reports, exhibiting less deviation than the MedVInT baseline. This indicates that Reg2RG produces more complete and information-rich outputs, reducing the likelihood of missing findings.
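
The underlying comparison is simple to reproduce; a sketch, assuming `reference_reports` and `generated_reports` are lists of report strings (placeholder data below):

```python
import matplotlib.pyplot as plt

# Placeholder report lists; in practice these come from the test split and the model outputs.
reference_reports = ["the heart is normal in size ...", "there is a small right pleural effusion ..."]
generated_reports = ["heart size is within normal limits ...", "a small right pleural effusion is seen ..."]

def word_counts(reports):
    """Length of each report in words."""
    return [len(r.split()) for r in reports]

# Overlay the two length distributions; closer overlap suggests fewer omitted findings.
plt.hist(word_counts(reference_reports), bins=30, alpha=0.5, label="Ground-truth reports")
plt.hist(word_counts(generated_reports), bins=30, alpha=0.5, label="Reg2RG reports")
plt.xlabel("Report length (words)")
plt.ylabel("Number of reports")
plt.legend()
plt.savefig("report_length_distribution.png")
```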


Figure 2: Report length distribution comparison between Reg2RG and other methods.

A case study in Figure 3 demonstrates that Reg2RG accurately identifies and describes abnormalities across multiple anatomical regions, whereas competing methods frequently miss or misidentify findings. The output shown in Figure 4 contains clearly delineated report sections, each explicitly referring to a specific region. This structure greatly enhances the clinical interpretability and practical utility of the model output. The region-aware alignment mechanism not only improves the precision of the generated reports but also provides clinicians with precise spatial references for validation.


Figure 3: Report comparison between Reg2RG and the second-best model.


Figure 4: Example of a region-aligned diagnostic report generated by Reg2RG.

Conclusion

This paper presents Reg2RG, a region-guided framework for chest CT report generation, which introduces region-aware mechanisms to enhance diagnostic accuracy, fine-grained expression, and interpretability. Unlike traditional approaches that rely solely on global image features, Reg2RG integrates anatomical region features with global contextual information to produce precise multi-region diagnostic reports.
We propose a decoupled local feature strategy to model both texture and geometric information and further enhance inter-region semantic relationships through a global-local collaboration mechanism. Additionally, a region-report alignment training strategy reinforces the alignment between visual and linguistic modalities, improving the model’s region-specific reasoning and reporting capability.
Experimental results demonstrate that Reg2RG surpasses state-of-the-art methods in both language generation quality and clinical accuracy. The framework shows strong generalizability and practical potential. In future work, we aim to extend the approach to additional imaging modalities and diagnostic tasks, further advancing the practical deployment of automated medical report generation.

For more details, the paper, code, and models are available at:
Paper (arXiv) | GitHub | HuggingFace

References

Z. Chen, Y. Bie, H. Jin, and H. Chen, “Large Language Model with Region-guided Referring and Grounding for CT Report Generation,” IEEE Transactions on Medical Imaging, doi: 10.1109/TMI.2025.3559923.