Breaking Down Bias: On The Limits of Generalizable Pruning Strategies

Authors: Sibo Ma, Alejandro Salinas, Peter Henderson, Julian Nyarko

Abstract: We employ model pruning to examine how LLMs conceptualize racial biases, and
whether a generalizable mitigation strategy for such biases appears feasible.
Our analysis yields several novel insights. We find that pruning can be an
effective method to reduce bias without significantly increasing anomalous
model behavior. Neuron-based pruning strategies generally yield better results
than approaches pruning entire attention heads. However, our results also show
that the effectiveness of either approach quickly deteriorates as pruning
strategies become more generalized. For instance, a model that is trained to
remove racial biases in the context of financial decision-making generalizes
poorly to biases in commercial transactions. Overall, our analysis
suggests that racial biases are only partially represented as a general concept
within language models. The other part of these biases is highly
context-specific, suggesting that generalizable mitigation strategies may be of
limited effectiveness. Our findings have important implications for legal
frameworks surrounding AI. In particular, they suggest that an effective
mitigation strategy should include allocating legal responsibility to those
who deploy models in a specific use case.

Source: http://arxiv.org/abs/2502.07771v1
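To make the two pruning granularities in the abstract concrete, here is a minimal sketch in PyTorch with Hugging Face transformers, using GPT-2 as a stand-in model. It contrasts removing entire attention heads with zeroing out individual MLP neurons. The layer, head, and neuron indices are illustrative placeholders, and this is not the paper's actual mask-selection procedure, only the mechanics the two strategies operate on.

```python
# Sketch of the two pruning granularities: whole attention heads vs.
# individual MLP neurons. GPT-2 is a stand-in model; the layer/head/neuron
# indices below are placeholders, not the ones identified in the paper.
import torch
from transformers import GPT2Model

model = GPT2Model.from_pretrained("gpt2")

# --- Head-level pruning: drop whole attention heads. ---
# transformers' built-in prune_heads removes the rows/columns of the
# Q, K, V, and output projections belonging to the listed heads.
model.prune_heads({0: [2, 7], 5: [0]})  # {layer_index: [head_indices]}

# --- Neuron-level pruning: zero out individual MLP neurons. ---
# GPT-2's Conv1D layers store weights as (in_features, out_features),
# so hidden neuron j of a block's MLP corresponds to column j of c_fc
# and row j of c_proj.
def prune_mlp_neurons(model, layer_idx, neuron_indices):
    block = model.h[layer_idx]
    with torch.no_grad():
        for j in neuron_indices:
            block.mlp.c_fc.weight[:, j] = 0.0   # input -> hidden column
            block.mlp.c_fc.bias[j] = 0.0
            block.mlp.c_proj.weight[j, :] = 0.0  # hidden -> output row

prune_mlp_neurons(model, layer_idx=3, neuron_indices=[10, 42, 1999])

# Sanity check: the pruned model still runs a forward pass.
out = model(torch.randint(0, 50257, (1, 8)))
print(out.last_hidden_state.shape)  # torch.Size([1, 8, 768])
```

Neuron-level masks like the one above give a much finer-grained intervention than head removal, which is consistent with the abstract's observation that neuron-based strategies generally perform better.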
