The Quest for the Right Mediator: A History, Survey, and Theoretical Grounding of Causal Interpretability

Authors: Aaron Mueller, Jannik Brinkmann, Millicent Li, Samuel Marks, Koyena Pal, Nikhil Prakash, Can Rager, Aruna Sankaranarayanan, Arnab Sen Sharma, Jiuding Sun, Eric Todd, David Bau, Yonatan Belinkov

Abstract: Interpretability provides a toolset for understanding how and why neural
networks behave in certain ways. However, there is little unity in the field:
most studies employ ad-hoc evaluations and do not share theoretical
foundations, making it difficult to measure progress and compare the pros and
cons of different techniques. Furthermore, while mechanistic understanding is
frequently discussed, the basic causal units underlying these mechanisms are
often not explicitly defined. In this paper, we propose a perspective on
interpretability research grounded in causal mediation analysis. Specifically,
we describe the history and current state of interpretability taxonomized
according to the types of causal units (mediators) employed, as well as methods
used to search over mediators. We discuss the pros and cons of each mediator,
providing insights as to when particular kinds of mediators and search methods
are most appropriate depending on the goals of a given study. We argue that
this framing yields a more cohesive narrative of the field, as well as
actionable insights for future work. Specifically, we recommend a focus on
discovering new mediators with better trade-offs between human-interpretability
and compute-efficiency, and which can uncover more sophisticated abstractions
from neural networks than the primarily linear mediators employed in current
work. We also argue for more standardized evaluations that enable principled
comparisons across mediator types, such that we can better understand when
particular causal units are better suited to particular use cases.

Source: http://arxiv.org/abs/2408.01416v1
