Discover-then-Name: Task-Agnostic Concept Bottlenecks via Automated Concept Discovery

Authors: Sukrut Rao, Sweta Mahajan, Moritz Böhle, Bernt Schiele

Abstract: Concept Bottleneck Models (CBMs) have recently been proposed to address the
‘black-box’ problem of deep neural networks, by first mapping images to a
human-understandable concept space and then linearly combining concepts for
classification. Such models typically require first coming up with a set of
concepts relevant to the task and then aligning the representations of a
feature extractor to map to these concepts. However, even with powerful
foundational feature extractors like CLIP, there are no guarantees that the
specified concepts are detectable. In this work, we leverage recent advances in
mechanistic interpretability and propose a novel CBM approach — called
Discover-then-Name-CBM (DN-CBM) — that inverts the typical paradigm: instead
of pre-selecting concepts based on the downstream classification task, we use
sparse autoencoders to first discover concepts learnt by the model, and then
name them and train linear probes for classification. Our concept extraction
strategy is efficient, since it is agnostic to the downstream task, and uses
concepts already known to the model. We perform a comprehensive evaluation
across multiple datasets and CLIP architectures and show that our method yields
semantically meaningful concepts, assigns appropriate names to them that make
them easy to interpret, and yields performant and interpretable CBMs. Code
available at https://github.com/neuroexplicit-saar/discover-then-name.

Source: http://arxiv.org/abs/2407.14499v1
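
The core idea of the discovery step can be illustrated with a small sketch: a sparse autoencoder is fit to precomputed CLIP image features so that its sparse hidden activations act as a dictionary of concepts. The snippet below is an illustrative PyTorch sketch, not the authors' released code (see the repository linked above); the hidden width, ReLU activation, L1 sparsity weight, and training hyperparameters are assumptions chosen for illustration.

```python
# Sketch only: a sparse autoencoder over precomputed CLIP image features.
# Architecture and hyperparameters are illustrative assumptions, not the
# authors' implementation.
import torch
import torch.nn as nn


class SparseAutoencoder(nn.Module):
    def __init__(self, feat_dim: int, n_concepts: int):
        super().__init__()
        self.encoder = nn.Linear(feat_dim, n_concepts)  # concept activations
        self.decoder = nn.Linear(n_concepts, feat_dim)  # reconstruct the CLIP feature

    def forward(self, x):
        z = torch.relu(self.encoder(x))  # non-negative, encouraged to be sparse
        return self.decoder(z), z


def train_sae(features: torch.Tensor, n_concepts: int = 4096,
              l1_coeff: float = 1e-3, epochs: int = 10, lr: float = 1e-3):
    """Fit the autoencoder to a bank of precomputed CLIP image features."""
    sae = SparseAutoencoder(features.shape[1], n_concepts)
    opt = torch.optim.Adam(sae.parameters(), lr=lr)
    loader = torch.utils.data.DataLoader(features, batch_size=256, shuffle=True)
    for _ in range(epochs):
        for x in loader:
            recon, z = sae(x)
            # Reconstruction loss plus an L1 penalty on the codes for sparsity.
            loss = ((recon - x) ** 2).mean() + l1_coeff * z.abs().mean()
            opt.zero_grad()
            loss.backward()
            opt.step()
    return sae
```

Per the abstract, the discovered concepts are then named and a linear probe is trained on the concept activations for classification; one plausible naming scheme (an assumption here, not a claim about the paper's exact procedure) is to match each learned dictionary direction to the closest CLIP text embedding from a word vocabulary.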
