Contextualizing biological perturbation experiments through language

Authors: Menghua Wu, Russell Littman, Jacob Levine, Lin Qiu, Tommaso Biancalani, David Richmond, Jan-Christian Huetter

Abstract: High-content perturbation experiments allow scientists to probe biomolecular
systems at unprecedented resolution, but experimental and analysis costs pose
significant barriers to widespread adoption. Machine learning has the potential
to guide efficient exploration of the perturbation space and extract novel
insights from these data. However, current approaches neglect the semantic
richness of the relevant biology, and their objectives are misaligned with
downstream biological analyses. In this paper, we hypothesize that large
language models (LLMs) present a natural medium for representing complex
biological relationships and rationalizing experimental outcomes. We propose
PerturbQA, a benchmark for structured reasoning over perturbation experiments.
Unlike current benchmarks that primarily interrogate existing knowledge,
PerturbQA is inspired by open problems in perturbation modeling: prediction of
differential expression and change of direction for unseen perturbations, and
gene set enrichment. We evaluate state-of-the-art machine learning and
statistical approaches for modeling perturbations, as well as standard LLM
reasoning strategies, and we find that current methods perform poorly on
PerturbQA. As a proof of feasibility, we introduce Summer (SUMMarize, retrievE,
and answeR), a simple, domain-informed LLM framework that matches or exceeds the
current state-of-the-art. Our code and data are publicly available at
https://github.com/genentech/PerturbQA.

Source: http://arxiv.org/abs/2502.21290v1
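To make the summarize/retrieve/answer recipe concrete, below is a minimal Python sketch of such a loop. This is an illustration under assumptions, not the authors' implementation: the `llm` callable, the `knowledge_base.top_k` retriever, and the `PerturbationQuery` fields are hypothetical stand-ins; the actual code and data live in the repository linked above.

```python
# Minimal sketch of a SUMMarize / retrievE / answeR-style pipeline.
# NOT the PerturbQA authors' code: `llm`, `knowledge_base`, and the
# query fields below are hypothetical placeholders for illustration.

from dataclasses import dataclass


@dataclass
class PerturbationQuery:
    perturbed_gene: str  # gene knocked down or knocked out
    readout_gene: str    # gene whose expression change we predict
    cell_line: str       # cellular context of the experiment


def summarize(llm, gene: str) -> str:
    """SUMMarize: condense raw annotations for a gene into a short description."""
    return llm(f"Summarize the known function of gene {gene} in one paragraph.")


def retrieve(knowledge_base, query: PerturbationQuery, k: int = 5) -> list[str]:
    """retrievE: pull the k most relevant facts (e.g., pathway or
    interaction records) for the perturbed/readout gene pair."""
    return knowledge_base.top_k(f"{query.perturbed_gene} {query.readout_gene}", k)


def answer(llm, query: PerturbationQuery, summaries: list[str], facts: list[str]) -> str:
    """answeR: ask the LLM for a structured prediction, e.g., whether the
    readout gene is differentially expressed and in which direction."""
    prompt = (
        f"Perturbation: {query.perturbed_gene} in {query.cell_line}.\n"
        "Gene summaries:\n" + "\n".join(summaries) + "\n"
        "Relevant facts:\n" + "\n".join(facts) + "\n"
        f"Is {query.readout_gene} differentially expressed, and if so, up or down? "
        "Answer with one of: up / down / no change, then a brief rationale."
    )
    return llm(prompt)
```

The design point the sketch captures is the one the abstract argues for: rather than training a bespoke predictor, the framework grounds each prediction in retrieved, human-readable biological context, so the LLM's answer comes with a rationale that downstream analyses can inspect.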
