Authors: Franck Signe Talla, Herve Jegou, Edouard Grave
Abstract: We address the problem of extending a pretrained large language model to a
new domain that was not seen at training time, such as adding a language for which
the original model has seen little or no training data. Popular solutions such as
fine-tuning or low-rank adaptation succeed at domain adaptation, but formally they
add no extra capacity and they degrade performance in the original domain.
Our paper analyzes this extension problem from three angles: data,
architecture and training procedure, which are best considered
jointly. In particular, we improve adapters and make it possible to learn an
entire new language while ensuring that the output of the neural network
remains almost unchanged in the original domain. To this end, we modify the new
residual blocks so that each one outputs near-zero values in the original domain.
This solution of neutral residues, which borrows architectural components
from mixtures of experts, is effective: with only 20% extra learnable weights
compared to the original model trained on English, we obtain results that are
significantly better than competing approaches (fine-tuning, low-rank adaptation,
or vanilla adapters) in terms of the trade-off between learning a new language and
not forgetting English.
Source: http://arxiv.org/abs/2410.02744v1
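The abstract does not spell out the exact mechanism, but the core idea (an added residual block that stays near-zero on original-domain inputs) can be illustrated with a minimal sketch. The PyTorch code below is a hypothetical reading, not the paper's implementation: it assumes a bottleneck adapter with a scalar sigmoid gate (in the spirit of mixture-of-experts routing), a zero-initialized up-projection, and an L1 "neutrality" penalty on original-domain activations. All names and specifics beyond the high-level goal are assumptions.

```python
# Hypothetical sketch, not the paper's released code: a gated bottleneck adapter
# added to a frozen residual block. The gate is intended to be trained so that
# the adapter's contribution is pushed toward zero on original-domain inputs,
# leaving the pretrained model's output almost unchanged there.

import torch
import torch.nn as nn


class NeutralResidueAdapter(nn.Module):
    """Gated bottleneck adapter meant to stay 'neutral' (near-zero output)
    on original-domain inputs while adding capacity for the new domain."""

    def __init__(self, d_model: int, d_bottleneck: int):
        super().__init__()
        self.down = nn.Linear(d_model, d_bottleneck)
        self.act = nn.GELU()
        self.up = nn.Linear(d_bottleneck, d_model)
        # Zero-init the up-projection so the adapter starts as an exact no-op.
        nn.init.zeros_(self.up.weight)
        nn.init.zeros_(self.up.bias)
        # Scalar per-token gate, in the spirit of mixture-of-experts routing.
        self.gate = nn.Linear(d_model, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        g = torch.sigmoid(self.gate(x))          # (batch, seq, 1)
        delta = self.up(self.act(self.down(x)))  # (batch, seq, d_model)
        return g * delta                         # added to the frozen block's output


def neutrality_penalty(adapter: NeutralResidueAdapter,
                       original_domain_hidden: torch.Tensor) -> torch.Tensor:
    """L1 penalty pushing the adapter output toward zero on original-domain
    activations (a stand-in for whatever objective the paper actually uses)."""
    return adapter(original_domain_hidden).abs().mean()


# Usage sketch with hidden states of shape (batch, seq_len, d_model).
if __name__ == "__main__":
    d_model, d_bottleneck = 512, 128
    adapter = NeutralResidueAdapter(d_model, d_bottleneck)
    frozen_block_out = torch.randn(2, 16, d_model)   # output of a frozen layer
    new_hidden = frozen_block_out + adapter(frozen_block_out)
    loss = neutrality_penalty(adapter, torch.randn(2, 16, d_model))
    print(new_hidden.shape, loss.item())
```

Zero-initializing the up-projection makes the adapter an exact no-op at the start of training, so original-domain behavior is preserved before the gate has learned anything; the penalty then keeps the block near-zero on that domain as training proceeds.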