Authors: Shangyi Geng, Wenting Zhao, Alexander M Rush
Abstract: $K$-nearest neighbor language models ($k$NN-LMs), which integrate retrieval
with next-word prediction, have demonstrated strong performance in language
modeling as well as downstream NLP benchmarks. These results have led
researchers to argue that models trained on poor-quality or outdated data could
perform well by employing a $k$NN extension that has access to a higher-quality
datastore. In this work, we ask whether this improved ability to recall
information actually translates into gains on downstream tasks. We extensively
evaluate $k$NN-LMs on a diverse set of tasks, ranging from sentiment
classification and commonsense reasoning to multi-hop reasoning. Results show
that $k$NN-LMs excel at memory-intensive tasks, where utilizing the patterns in
the input is sufficient for determining the output, but struggle with reasoning
tasks that require integrating multiple pieces of information to derive new
knowledge. We further demonstrate through oracle experiments and qualitative
analysis that even with perfect retrieval, $k$NN-LMs still fail to determine
the correct answers, placing an upper bound on their reasoning performance.
Code and datastores are released at https://github.com/GSYfate/knnlm-limits/.
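For context, a minimal sketch of the standard $k$NN-LM interpolation is given below: the base LM's next-token distribution is mixed with a distribution built from retrieved nearest-neighbor contexts. The toy vocabulary, datastore, and interpolation weight are illustrative assumptions, not values from this paper or its released code.

```python
# Minimal sketch of kNN-LM interpolation (illustrative, not the authors' code).
import numpy as np

vocab = ["the", "cat", "sat", "mat"]

# Base LM next-token distribution for the current context (toy values).
p_lm = np.array([0.4, 0.3, 0.2, 0.1])

# Datastore: context key vectors and the token that followed each context.
keys = np.array([[0.9, 0.1], [0.8, 0.2], [0.1, 0.9]])
values = np.array([1, 1, 3])  # indices into vocab

query = np.array([0.85, 0.15])  # representation of the current context

# Retrieve k nearest neighbors by L2 distance; softmax over negative distance.
k, lam = 2, 0.25
dists = np.linalg.norm(keys - query, axis=1)
nearest = np.argsort(dists)[:k]
weights = np.exp(-dists[nearest])
weights /= weights.sum()

# Aggregate neighbor weights onto their recorded next tokens.
p_knn = np.zeros(len(vocab))
np.add.at(p_knn, values[nearest], weights)

# Interpolate: p(y|x) = lam * p_kNN(y|x) + (1 - lam) * p_LM(y|x)
p_final = lam * p_knn + (1 - lam) * p_lm
print(dict(zip(vocab, p_final.round(3))))
```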
Source: http://arxiv.org/abs/2408.11815v1