Authors: Ahmed Akib Jawad Karim, Muhammad Zawad Mahmud, Samiha Islam, Aznur Azam
Abstract: In this research, we investigated multi-class disease classification with pre-trained language models on the Medical-Abstracts-TC-Corpus, which spans five medical conditions. We excluded non-cancer conditions and examined four specific diseases. We assessed four LLMs: BioBERT, XLNet, BERT, and a novel base model, LastBERT. BioBERT, which was pre-trained
on medical data, demonstrated superior performance in medical text
classification (97% accuracy). Surprisingly, XLNet followed closely (96%
accuracy), demonstrating its generalizability across domains even though it was
not pre-trained on medical data. LastBERT, a custom model based on a lighter version of BERT, also proved competitive with 87.10% accuracy (just below BERT’s 89.33%). Our findings confirm the value of domain-specialized models such as BioBERT and also show that more general models like XLNet, as well as well-tuned transformer architectures with fewer parameters (in this case, LastBERT), can perform strongly on medical-domain tasks.
Source: http://arxiv.org/abs/2411.12712v1
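Below is a minimal, illustrative sketch of the kind of fine-tuning setup the abstract describes, using Hugging Face Transformers to adapt a BERT-style encoder (e.g., BioBERT) for four-class classification of medical abstracts. The checkpoint name, file paths, column names, and hyperparameters are assumptions for illustration, not the authors' configuration.

```python
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

MODEL_NAME = "dmis-lab/biobert-base-cased-v1.1"  # or "bert-base-uncased" / "xlnet-base-cased"
NUM_LABELS = 4  # four disease classes after filtering the corpus (assumed)

# Assumed CSV layout: an "abstract" text column and an integer "label" column (0-3).
data = load_dataset("csv", data_files={"train": "train.csv", "test": "test.csv"})

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)

def tokenize(batch):
    # Truncate/pad each abstract to a fixed length for the encoder.
    return tokenizer(batch["abstract"], truncation=True,
                     padding="max_length", max_length=256)

data = data.map(tokenize, batched=True)

# Encoder with a freshly initialized classification head for NUM_LABELS classes.
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME,
                                                           num_labels=NUM_LABELS)

args = TrainingArguments(
    output_dir="medical-abstract-classifier",
    num_train_epochs=3,
    per_device_train_batch_size=16,
    learning_rate=2e-5,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=data["train"],
    eval_dataset=data["test"],
)

trainer.train()
print(trainer.evaluate())  # reports eval loss; accuracy needs a compute_metrics hook
```

Swapping MODEL_NAME between a domain-specific checkpoint (BioBERT) and general-purpose ones (BERT, XLNet) is enough to reproduce the style of comparison the abstract reports, since AutoModelForSequenceClassification handles the different architectures behind a common interface.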