[Nature Biomedical Engineering] Towards generalizable AI in medicine via Generalist–Specialist Collaboration

Recently, the SmartX Lab team, in collaboration with several leading institutions, has developed a cooperative framework that combines a powerful medical generalist foundation model with lightweight specialist models. Published in Nature Biomedical Engineering, the work introduces Generalist–Specialist Collaboration (GSCo), a paradigm that elevates the generalist foundation model (GFM) from a task “orchestrator” to the central “decision-maker” that arbitrates final diagnoses by fusing specialist guidance with its own intrinsic medical knowledge.

GSCo framework overview (Fig. 1a)

Introduction

Generalist foundation models (GFMs) are renowned for their flexibility across diverse tasks, while specialist models excel in precision because of their domain-specific training. Bridging these complementary strengths is a central challenge for clinical AI. To this end, a collaborative research team led by Professor Hao Chen at HKUST’s SmartX Lab, together with Harvard Medical School, Weill Cornell Medicine, NYU Langone Health, the Chinese University of Hong Kong, the University of Hong Kong, and several other institutions, has introduced GSCo, a unified framework that combines an open-source medical GFM (MedDr) with a suite of lightweight specialist models.

Comprehensive experiments were conducted on a benchmark of 32 datasets comprising approximately 260,000 test images across diverse medical modalities, including radiology, pathology, dermatology, ophthalmology, and gastroenterology. MedDr surpasses prior medical GFMs (RadFM, LLaVA-Med, Med-Flamingo, InternVL) on downstream datasets, and GSCo further exceeds both GFMs and specialist models on medical image diagnosis (MID) and medical report generation (MRG).

GSCo-related figure (Fig. 1)

Background

Existing medical AI systems face several persistent gaps that limit clinical adoption:

  • Specialists are precise but narrow — task-tuned specialist models achieve high accuracy on the tasks they were trained for, yet they cannot generalize across tasks or modalities.
  • GFMs are flexible but imprecise — medical GFMs can follow instructions and handle a variety of tasks, but they trail specialists on focused diagnostic benchmarks.
  • Agent-based “delegation” leaves the GFM idle — recent agent systems (e.g., MMedAgent, VILA-M3) cast the GFM as an “orchestrator” that hands tasks off to specialist tools, underutilizing the GFM’s own medical knowledge.
  • Privacy and cost barriers — fine-tuning a GFM on every new clinical site is computationally expensive and often requires sharing protected health information across institutions.

To address these limitations, this work re-positions the GFM as the final decision-maker that integrates specialist signals with its own intrinsic knowledge, and validates this paradigm on a unified, large-scale benchmark.

Method

GSCo consists of two stages — model construction and collaborative inference — and three core components:

  • MedDr, an open-source medical GFM. MedDr is built upon InternVL and instruction-tuned on a corpus of more than 2 million samples spanning five data types: medical image diagnosis (MID), medical report generation (MRG), visual question answering (VQA), plus two newly curated datasets, Diagnosis-Guided Bootstrapping (DGB) and Medical Image Description (DES). DGB is a key innovation: rather than relying on text-only signals from image–text pairs, it jointly uses image, modality, and disease information to generate high-quality medical instruction data, substantially improving MedDr’s intrinsic diagnostic ability (a minimal DGB data-construction sketch follows this list).

  • A suite of lightweight specialist models. For each downstream dataset, ten vision backbones (VGG16, AlexNet, ResNet-18, DenseNet-121, EfficientNet-B4, ViT-B/16, CLIP ViT-B/16, EVA-02 ViT-B/16, DINO ViT-B/16, and SAM ViT-B/16) are independently fine-tuned. These models are small enough to train on a single consumer-grade GPU (a fine-tuning sketch follows this list).

  • Two collaborative inference mechanisms. Mixture-of-Expert Diagnosis (MoED) uses each specialist’s prediction as a contextual reference for MedDr, which then arbitrates a final diagnosis. Retrieval-Augmented Diagnosis (RAD) turns each specialist into a retriever that fetches the most similar cases from a medical image–text database; the retrieved metadata is supplied to MedDr as additional context. MedDr integrates both signals together with its own intrinsic knowledge to render the final decision (see the prompt-assembly sketch after this list).
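
To make DGB concrete, here is a minimal sketch of how a labelled image, together with its modality and disease metadata, could be turned into one instruction-tuning sample. The record fields, prompt wording, and build_dgb_sample helper are illustrative assumptions, not the paper’s actual data schema.

```python
# Hypothetical sketch of Diagnosis-Guided Bootstrapping (DGB)-style data:
# each sample pairs an image with a prompt built from modality and disease
# metadata rather than from free-form report text alone.
from dataclasses import dataclass

@dataclass
class ImageRecord:
    image_path: str   # path to the medical image
    modality: str     # e.g. "dermoscopy", "chest X-ray"
    disease: str      # ground-truth diagnosis label

def build_dgb_sample(rec: ImageRecord, label_set: list[str]) -> dict:
    """Turn one labelled image into an instruction-tuning sample."""
    options = ", ".join(label_set)
    question = (
        f"This is a {rec.modality} image. "
        f"Which of the following diagnoses best fits: {options}?"
    )
    return {
        "image": rec.image_path,
        "instruction": question,
        "response": rec.disease,  # supervised target for the GFM
    }

sample = build_dgb_sample(
    ImageRecord("ham10000/0001.jpg", "dermoscopy", "melanoma"),
    ["melanoma", "nevus", "basal cell carcinoma"],
)
```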
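
Fine-tuning one specialist is standard supervised training. The sketch below assumes timm and a PyTorch dataset yielding (image, label) batches; the backbone choice and hyperparameters are illustrative, not the paper’s exact recipe.

```python
# Minimal fine-tuning loop for one lightweight specialist (sketch).
import timm
import torch
from torch.utils.data import DataLoader

def finetune_specialist(train_ds, num_classes: int, epochs: int = 10):
    # Load an ImageNet-pretrained backbone with a fresh classification head.
    model = timm.create_model("densenet121", pretrained=True,
                              num_classes=num_classes)
    opt = torch.optim.AdamW(model.parameters(), lr=1e-4)
    loss_fn = torch.nn.CrossEntropyLoss()
    loader = DataLoader(train_ds, batch_size=32, shuffle=True)
    model.train()
    for _ in range(epochs):
        for images, labels in loader:
            opt.zero_grad()
            loss = loss_fn(model(images), labels)
            loss.backward()
            opt.step()
    return model
```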
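
The sketch below shows how MoED predictions and RAD retrievals might be assembled into a single arbitration prompt for the GFM. The prompt wording and the commented query_gfm call are hypothetical, not the paper’s actual template.

```python
# Hypothetical assembly of a GSCo arbitration prompt from MoED and RAD signals.

def build_collaborative_prompt(
    question: str,
    specialist_preds: dict[str, str],        # MoED: specialist name -> label
    retrieved_cases: list[tuple[str, str]],  # RAD: (case description, diagnosis)
) -> str:
    moed = "\n".join(
        f"- {name} predicts: {label}" for name, label in specialist_preds.items()
    )
    rad = "\n".join(
        f"- Similar case: {desc} (diagnosis: {dx})" for desc, dx in retrieved_cases
    )
    return (
        f"{question}\n\n"
        f"Specialist opinions (for reference only):\n{moed}\n\n"
        f"Retrieved similar cases:\n{rad}\n\n"
        "Considering this guidance and your own medical knowledge, "
        "give the final diagnosis."
    )

prompt = build_collaborative_prompt(
    "Does this chest X-ray show pneumonia?",
    {"DenseNet-121": "pneumonia", "ViT-B/16": "normal"},
    [("adult with lobar consolidation", "pneumonia")],
)
# final_answer = query_gfm(meddr_model, image, prompt)  # hypothetical call
```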

MoED and RAD collaborative inference (Fig. 8)

Clinical Validation

GSCo was validated against the ten fine-tuned specialist models across internal, external, and cross-domain splits.

In medical image diagnosis, GSCo ranks first on both the internal (8 datasets) and external (16 datasets) benchmarks, surpassing every individual specialist model, including the best-performing DenseNet-121. In cross-domain validation, where specialists are trained on a source dataset and then deployed on an unseen target dataset, GSCo achieves an F1 of 0.8420 on HAM10000→BCN20000 (significantly above MedDr’s zero-shot 0.7545 and the best specialist’s 0.8292; P < 0.0001) and 0.7337 on PneumoniaMNIST→RSNA Pneumonia (P < 0.0001).

Medical image diagnosis benchmark results (Fig. 4)

In visual question answering, MedDr reaches a tokenized F1 of 61.10% on VQA-RAD and 90.21% closed-question accuracy on Path-VQA, significantly outperforming all baselines (P < 0.0001). In medical report generation on MIMIC-CXR, GSCo attains the best clinical-efficacy F1 of 36.38%, ahead of MedDr (30.80%) and the SOTA specialist R2GenGPT (33.04%).
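
For reference, tokenized F1 scores an open-ended answer by token overlap with the ground truth. The sketch below assumes simple lowercase whitespace tokenization; the paper’s evaluation pipeline may normalize text differently.

```python
# Token-overlap F1 for open-ended VQA answers (sketch).
from collections import Counter

def tokenized_f1(prediction: str, reference: str) -> float:
    pred = prediction.lower().split()
    ref = reference.lower().split()
    if not pred or not ref:
        return float(pred == ref)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

# 4 of 4 predicted tokens appear in the 6-token reference -> F1 = 0.8.
print(tokenized_f1("left lower lobe pneumonia",
                   "pneumonia in the left lower lobe"))
```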

In a blinded evaluation by seven board-certified radiologists on 50 randomly selected chest X-ray cases, GSCo achieved the highest overall ranking (Kendall’s W = 0.8688, indicating strong inter-rater agreement). In pairwise comparisons, all seven radiologists preferred MedDr over RadFM, and six of seven preferred GSCo over R2GenGPT.
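
Kendall’s W measures how consistently m raters rank n systems, from 0 (no agreement) to 1 (perfect concordance). A minimal sketch of the statistic, assuming complete rankings without ties (the tie-corrected variant differs):

```python
# Kendall's coefficient of concordance W (sketch, no tie correction).
import numpy as np

def kendalls_w(ranks: np.ndarray) -> float:
    """ranks: (m raters x n items) matrix, each row a permutation of 1..n."""
    m, n = ranks.shape
    rank_sums = ranks.sum(axis=0)
    s = ((rank_sums - rank_sums.mean()) ** 2).sum()
    return 12 * s / (m ** 2 * (n ** 3 - n))

# Toy example: 3 raters rank 4 systems with near-perfect agreement.
ranks = np.array([
    [1, 2, 3, 4],
    [1, 2, 3, 4],
    [1, 3, 2, 4],
])
print(round(kendalls_w(ranks), 4))  # ~0.9111
```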

Medical report generation and radiologist evaluation (Fig. 6)

Robust Arbitration

A distinctive property of GSCo is MedDr’s arbitration capability — it neither rigidly defends its own initial answer nor blindly follows external guidance. When all specialists were systematically biased to “Normal” on PCam200, MedDr still correctly identified 67.6% of “Tumor” images. When MedDr’s initial diagnosis was correct, it upheld that decision in 94.5% of conflict cases on PCam200; when the initial diagnosis was wrong, contradictory guidance helped it self-correct 81.0% of the time on PneumoniaMNIST. This dynamic integration of intrinsic knowledge with task-specific context is essential for safe clinical deployment.

Translational Potential

GSCo offers three practical advantages for clinical translation:

  • Effectiveness — GSCo outperforms both standalone GFMs and specialists across internal, external, and cross-domain settings on 24 medical image diagnosis datasets, and achieves the top human-evaluation rank for chest X-ray report generation.
  • Efficiency — adapting to a new task requires only training a small specialist model, not re-fine-tuning the GFM. The total compute cost for the entire specialist suite is roughly two orders of magnitude lower than the cost of adapting the GFM itself.
  • Privacy-friendliness — specialist models can be trained on protected health information within their own institution; only the anonymized model weights need to be plugged into GSCo, enabling cross-institutional knowledge sharing without exchanging patient data.

By repositioning the generalist model from a task “orchestrator” to the final “decision-maker,” GSCo provides a scalable, deployable path for medical AI that combines the flexibility of foundation models with the precision of domain experts.


Resources

For more details, please see our paper “Towards generalizable AI in medicine via Generalist–Specialist Collaboration” in Nature Biomedical Engineering.

Code and models are available at https://github.com/sunanhe/MedDr.