Authors: Daniel P. Jeong, Pranav Mani, Saurabh Garg, Zachary C. Lipton, Michael Oberst
Abstract: Several recent works seek to develop foundation models specifically for
medical applications, adapting general-purpose large language models (LLMs) and
vision-language models (VLMs) via continued pretraining on publicly available
biomedical corpora. These works typically claim that such domain-adaptive
pretraining (DAPT) improves performance on downstream medical tasks, such as
answering medical licensing exam questions. In this paper, we compare ten
public “medical” LLMs and two VLMs against their corresponding base models,
arriving at a different conclusion: all medical VLMs and nearly all medical
LLMs fail to consistently improve over their base models in the zero-/few-shot
prompting and supervised fine-tuning regimes for medical question-answering
(QA). For instance, across all tasks and model pairs we consider in the 3-shot
setting, medical LLMs outperform their base models in only 22.7% of cases,
reach a (statistical) tie in 36.8% of cases, and are significantly worse than
their base models in the remaining 40.5% of cases. Our conclusions are based on
(i) comparing each medical model head-to-head, directly against the
corresponding base model; (ii) optimizing the prompts for each model separately
in zero-/few-shot prompting; and (iii) accounting for statistical uncertainty
in comparisons. While these basic practices are not consistently adopted in the
literature, our ablations show that they substantially impact conclusions.
Meanwhile, we find that after fine-tuning on specific QA tasks, medical LLMs
can show performance improvements, but the benefits do not carry over to tasks
based on clinical notes. Our findings suggest that state-of-the-art
general-domain models may already exhibit strong medical knowledge and
reasoning capabilities, and we offer recommendations to strengthen the conclusions
of future studies.
Source: http://arxiv.org/abs/2411.08870v1
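To make point (iii) of the abstract concrete, the sketch below illustrates one common way to account for statistical uncertainty when comparing a medical model head-to-head against its base model on the same QA items: a paired bootstrap over per-question correctness, where a confidence interval on the accuracy difference that contains zero is reported as a statistical tie. This is an illustrative assumption; the abstract does not specify the exact test the authors use, and the function name and parameters here are hypothetical.

```python
import numpy as np

def paired_bootstrap_comparison(correct_medical, correct_base,
                                n_boot=10_000, alpha=0.05, seed=0):
    """Compare two models evaluated on the same questions.

    correct_medical, correct_base: arrays of 0/1 per-question correctness,
    aligned by question. Returns the observed accuracy difference
    (medical - base), a (1 - alpha) bootstrap confidence interval on that
    difference, and a verdict ("better" / "worse" / "statistical tie").
    """
    rng = np.random.default_rng(seed)
    correct_medical = np.asarray(correct_medical, dtype=float)
    correct_base = np.asarray(correct_base, dtype=float)
    n = len(correct_medical)

    observed_diff = correct_medical.mean() - correct_base.mean()

    # Resample questions with replacement, keeping the pairing intact,
    # and recompute the accuracy difference on each bootstrap sample.
    diffs = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)
        diffs[b] = correct_medical[idx].mean() - correct_base[idx].mean()

    lo, hi = np.quantile(diffs, [alpha / 2, 1 - alpha / 2])
    if lo > 0:
        verdict = "medical model significantly better"
    elif hi < 0:
        verdict = "medical model significantly worse"
    else:
        verdict = "statistical tie"
    return observed_diff, (lo, hi), verdict
```

Under this kind of procedure, each (task, model pair) comparison falls into one of the three buckets reported in the abstract (better, tie, worse), which is how percentages such as 22.7% / 36.8% / 40.5% can be tallied across comparisons.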