Medical Adaptation of Large Language and Vision-Language Models: Are We Making Progress?

Authors: Daniel P. Jeong, Saurabh Garg, Zachary C. Lipton, Michael Oberst

Abstract: Several recent works seek to develop foundation models specifically for
medical applications, adapting general-purpose large language models (LLMs) and
vision-language models (VLMs) via continued pretraining on publicly available
biomedical corpora. These works typically claim that such domain-adaptive
pretraining (DAPT) improves performance on downstream medical tasks, such as
answering medical licensing exam questions. In this paper, we compare seven
public “medical” LLMs and two VLMs against their corresponding base models,
arriving at a different conclusion: all medical VLMs and nearly all medical
LLMs fail to consistently improve over their base models in the zero-/few-shot
prompting regime for medical question-answering (QA) tasks. For instance,
across the tasks and model pairs we consider in the 3-shot setting, medical
LLMs only outperform their base models in 12.1% of cases, reach a (statistical)
tie in 49.8% of cases, and are significantly worse than their base models in
the remaining 38.2% of cases. Our conclusions are based on (i) comparing each
medical model head-to-head, directly against the corresponding base model; (ii)
optimizing the prompts for each model separately; and (iii) accounting for
statistical uncertainty in comparisons. While these basic practices are not
consistently adopted in the literature, our ablations show that they
substantially impact conclusions. Our findings suggest that state-of-the-art
general-domain models may already exhibit strong medical knowledge and
reasoning capabilities, and offer recommendations to strengthen the conclusions
of future studies.

Source: http://arxiv.org/abs/2411.04118v1
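
The abstract highlights accounting for statistical uncertainty in head-to-head comparisons (practice (iii)) but does not spell out a procedure. The sketch below is a hedged illustration of one common way to do this, a paired bootstrap over matched QA questions; the function name, parameters, and decision rule are illustrative assumptions, not the authors' released code or exact method.

    # Illustrative sketch (not the authors' code): paired bootstrap comparison of a
    # "medical" model against its base model on the same QA items, used to decide
    # whether an accuracy difference is a win, a statistical tie, or a loss.
    import numpy as np

    def paired_bootstrap_ci(base_correct, medical_correct, n_boot=10_000, alpha=0.05, seed=0):
        """Bootstrap a confidence interval for the accuracy difference
        (medical - base) over matched questions.

        base_correct, medical_correct: arrays of 0/1 per-question correctness,
        aligned so index i refers to the same question for both models.
        """
        base_correct = np.asarray(base_correct, dtype=float)
        medical_correct = np.asarray(medical_correct, dtype=float)
        assert base_correct.shape == medical_correct.shape
        rng = np.random.default_rng(seed)
        n = len(base_correct)
        diffs = np.empty(n_boot)
        for b in range(n_boot):
            idx = rng.integers(0, n, size=n)  # resample questions with replacement
            diffs[b] = medical_correct[idx].mean() - base_correct[idx].mean()
        lo, hi = np.percentile(diffs, [100 * alpha / 2, 100 * (1 - alpha / 2)])
        point = medical_correct.mean() - base_correct.mean()
        return point, (lo, hi)

    # Usage with hypothetical per-question outcomes:
    # point, (lo, hi) = paired_bootstrap_ci(base_correct, medical_correct)
    # If the interval (lo, hi) contains 0, count the pair as a statistical tie;
    # otherwise count it as a win or a loss for the medical model.

Under this kind of rule, each (task, model pair) comparison yields one of the three outcomes tallied in the abstract (win, tie, loss), which is why accounting for uncertainty, rather than comparing raw point estimates, can change the overall conclusion.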
