Authors: Ricardo Dominguez-Olmedo, Vedant Nanda, Rediet Abebe, Stefan Bechtold, Christoph Engel, Jens Frankenreiter, Krishna Gummadi, Moritz Hardt, Michael Livermore
Abstract: Annotation and classification of legal text are central components of
empirical legal research. Traditionally, these tasks are often delegated to
trained research assistants. Motivated by advances in language modeling,
empirical legal scholars are increasingly turning to prompting commercial
models, hoping that doing so will alleviate the significant cost of human
annotation.
Despite this growing use, our understanding of how best to use large language
models for legal tasks remains limited. We conduct a comprehensive study of 260
legal text classification tasks, nearly all new to the machine learning
community. Starting from GPT-4 as a baseline, we show that its zero-shot
accuracy is non-trivial but highly variable, with performance that may be
insufficient for legal work. We then demonstrate that a lightly fine-tuned
Llama 3 model vastly outperforms GPT-4 on almost all tasks, typically by
double-digit percentage points. We find that larger models respond better to
fine-tuning than smaller models. A few tens to hundreds of examples suffice to
achieve high classification accuracy. Notably, we can fine-tune a single model
on all 260 tasks simultaneously at a small loss in accuracy relative to having
a separate model for each task. Our work points to a viable alternative to the
predominant practice of prompting commercial models. For concrete legal tasks
with some available labeled data, researchers are better off using a fine-tuned
open-source model.
Source: http://arxiv.org/abs/2407.16615v1
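The abstract does not spell out the fine-tuning pipeline behind these results, but the general workflow it advocates, adapting an open-source Llama model to a legal classification task using a small labeled set, can be sketched roughly as follows. This is a minimal illustration in Python using Hugging Face transformers, datasets, and peft; the base model name, LoRA hyperparameters, example labels, and training settings are illustrative assumptions, not the authors' actual configuration.

```python
# Hypothetical sketch: parameter-efficient fine-tuning of an open LLM for a
# legal text classification task. All specifics (model, labels, hyperparameters)
# are assumptions for illustration, not the paper's reported setup.
from datasets import Dataset
from peft import LoraConfig, get_peft_model
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

MODEL_NAME = "meta-llama/Meta-Llama-3-8B"  # assumed base model
LABELS = ["affirmed", "reversed"]          # example binary legal task

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
tokenizer.pad_token = tokenizer.eos_token  # Llama models ship without a pad token

model = AutoModelForSequenceClassification.from_pretrained(
    MODEL_NAME, num_labels=len(LABELS)
)
model.config.pad_token_id = tokenizer.pad_token_id

# LoRA adapters keep the number of trainable parameters small, which matters
# when only a few tens to hundreds of labeled examples are available.
model = get_peft_model(
    model,
    LoraConfig(task_type="SEQ_CLS", r=16, lora_alpha=32, lora_dropout=0.05),
)

# Toy labeled examples; in practice these come from annotated legal text.
train = Dataset.from_dict(
    {
        "text": [
            "The judgment of the district court is affirmed.",
            "We reverse and remand for further proceedings.",
        ],
        "label": [0, 1],
    }
)

def tokenize(batch):
    return tokenizer(
        batch["text"], truncation=True, padding="max_length", max_length=512
    )

train = train.map(tokenize, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="legal-clf",
        per_device_train_batch_size=2,
        num_train_epochs=3,
        learning_rate=2e-4,
    ),
    train_dataset=train,
)
trainer.train()
```

A multi-task variant, as described in the abstract, would pool labeled examples from many tasks into a single training set for one shared model, at a small cost in per-task accuracy.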