Authors: Krzysztof Maziarz, Guoqing Liu, Hubert Misztela, Aleksei Kornev, Piotr Gaiński, Holger Hoefling, Mike Fortunato, Rishi Gupta, Marwin Segler
Abstract: Planning and conducting chemical syntheses remains a major bottleneck in the
discovery of functional small molecules, and prevents fully leveraging
generative AI for molecular inverse design. While early work has shown that
ML-based retrosynthesis models can predict reasonable routes, their low
accuracy on less frequent yet important reactions has also been noted. As
multi-step search algorithms are limited to reactions suggested by the
underlying model, the applicability of those tools is inherently constrained by
the accuracy of retrosynthesis prediction. Inspired by how chemists use
different strategies to ideate reactions, we propose Chimera: a framework for
building highly accurate reaction models that combine predictions from diverse
sources with complementary inductive biases using a learning-based ensembling
strategy. We instantiate the framework with two newly developed models, each of
which already achieves state-of-the-art performance in its category. Through
experiments across several orders of magnitude in data scale and time-splits,
we show that Chimera outperforms all major models by a large margin, owing both
to the strong individual performance of its constituents and to the
scalability of our ensembling strategy. Moreover, we find that PhD-level
organic chemists prefer predictions from Chimera over baselines in terms of
quality. Finally, we transfer the largest-scale checkpoint to an internal
dataset from a major pharmaceutical company, showing robust generalization
under distribution shift. With the new dimension that our framework unlocks, we
anticipate further acceleration in the development of even more accurate
models.
Source: http://arxiv.org/abs/2412.05269v1