Authors: Charles Jin
Abstract: As language models (LMs) deliver increasing performance on a range of NLP
tasks, probing classifiers have become an indispensable technique in the effort
to better understand their inner workings. A typical setup involves (1)
defining an auxiliary task consisting of a dataset of text annotated with
labels, then (2) supervising small classifiers to predict the labels from the
representations of a pretrained LM as it processes the dataset. A high probing
accuracy is interpreted as evidence that the LM has learned to perform the
auxiliary task as an unsupervised byproduct of its original pretraining
objective. Despite the widespread use of probes, however, the robust design
and analysis of probing experiments remain a challenge. We develop a formal
perspective on probing using structural causal models (SCMs). Specifically,
given an SCM which explains the distribution of tokens observed during
training, we frame the central hypothesis as whether the LM has learned to
represent the latent variables of the SCM. Empirically, we extend a recent
study of LMs in the context of a synthetic grid-world navigation task, where
having an exact model of the underlying causal structure allows us to draw
strong inferences from the result of probing experiments. Our techniques
provide robust empirical evidence for the ability of LMs to learn the latent
causal concepts underlying text.
Source: http://arxiv.org/abs/2407.13765v1
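As a rough illustration of the two-step probing setup described in the abstract (not the authors' implementation), the sketch below trains a small linear probe on stand-in data: the arrays `hidden_states` and `labels` are hypothetical placeholders for frozen LM activations and auxiliary-task annotations, respectively.

```python
# Minimal sketch of the probing setup (illustrative only;
# random arrays stand in for real LM representations and labels).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Step 1: an auxiliary task -- text annotated with labels
# (e.g., a latent variable of an assumed SCM), paired with the
# pretrained LM's representations of that text.
n_examples, hidden_dim, n_classes = 2000, 256, 4
hidden_states = rng.normal(size=(n_examples, hidden_dim))  # stand-in for LM activations
labels = rng.integers(0, n_classes, size=n_examples)       # stand-in for auxiliary labels

# Step 2: supervise a small (linear) classifier to predict the labels
# from the representations; the LM itself is never updated.
X_train, X_test, y_train, y_test = train_test_split(
    hidden_states, labels, test_size=0.2, random_state=0
)
probe = LogisticRegression(max_iter=1000)
probe.fit(X_train, y_train)

# High held-out probing accuracy is interpreted as evidence that the
# representations encode the auxiliary-task labels.
print(f"probing accuracy: {probe.score(X_test, y_test):.3f}")
```

With real data, high accuracy on the held-out split is what the abstract refers to as evidence that the LM has learned the auxiliary concept as a byproduct of pretraining.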