Authors: Nikolhaus Howe, Michał Zajac, Ian McKenzie, Oskar Hollinsworth, Tom Tseng, Pierre-Luc Bacon, Adam Gleave
Abstract: Language model capabilities predictably improve from scaling a model’s size
and training data. Motivated by this, increasingly large language models have
been trained, yielding an array of impressive capabilities. Yet these models
are vulnerable to adversarial prompts, such as “jailbreaks” that hijack models
to perform undesired behaviors, posing a significant risk of misuse. Prior work
indicates that computer vision models become more robust with model and data
scaling, raising the question: does language model robustness also improve with
scale? We study this question empirically, finding that larger models respond
substantially better to adversarial training, but there is little to no benefit
from model scale in the absence of explicit defenses.
Source: http://arxiv.org/abs/2407.18213v1