Authors: Álvaro Rodríguez Abella, João Pedro Silvestre, Paulo Tabuada
Abstract: A key component of transformers is the attention mechanism, which
orchestrates how each token influences the propagation of every other token
through the network. In this paper we provide a rigorous mathematical analysis
of the asymptotic properties of attention in transformers. Although we present
several results based on different assumptions, all of them point to the same
conclusion: all tokens asymptotically converge to each other, a phenomenon that
has been reported empirically in the literature. Our findings are carefully
compared with existing theoretical results and illustrated by simulations and
experimental studies using the GPT-2 model.
Source: http://arxiv.org/abs/2412.02682v1
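The convergence phenomenon the abstract describes can be illustrated with a minimal sketch: since each softmax attention row is a convex combination of the tokens, iterating pure attention (identity value matrix, no residual connections or MLP, a simplification not taken from the paper) cannot increase, and in practice shrinks, the diameter of the token set. The sizes, matrices, and step count below are illustrative assumptions, not the paper's setup.

```python
import numpy as np

def attention_step(X, Q, K):
    """One round of softmax self-attention with identity values.

    Each output token is a convex combination of the input tokens,
    so the token diameter is non-increasing; with strictly positive
    attention weights it contracts, which is the clustering effect
    under study.
    """
    scores = (X @ Q) @ (X @ K).T / np.sqrt(X.shape[1])
    scores -= scores.max(axis=1, keepdims=True)   # numerical stability
    A = np.exp(scores)
    A /= A.sum(axis=1, keepdims=True)             # row-stochastic attention matrix
    return A @ X

def diameter(X):
    """Largest pairwise distance between tokens."""
    diffs = X[:, None, :] - X[None, :, :]
    return np.linalg.norm(diffs, axis=-1).max()

rng = np.random.default_rng(0)
n, d = 16, 8                         # 16 tokens in R^8 (illustrative sizes)
X = rng.standard_normal((n, d))
Q = rng.standard_normal((d, d)) / np.sqrt(d)
K = rng.standard_normal((d, d)) / np.sqrt(d)

for layer in range(51):
    if layer % 10 == 0:
        print(f"layer {layer:3d}  token diameter = {diameter(X):.3e}")
    X = attention_step(X, Q, K)
```

Running this prints a token diameter that decays toward zero across layers, a toy version of the asymptotic token convergence the paper establishes rigorously under its stated assumptions.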