Theorem Prover as a Judge for Synthetic Data Generation

Authors: Joshua Ong Jun Leang, Giwon Hong, Wenda Li, Shay B. Cohen

Abstract: The demand for synthetic data in mathematical reasoning has increased due to
its potential to enhance the mathematical capabilities of large language models
(LLMs). However, ensuring the validity of intermediate reasoning steps remains
a significant challenge, affecting data quality. While formal verification via
theorem provers effectively validates LLM reasoning, the autoformalisation of
mathematical proofs remains error-prone. In response, we introduce iterative
autoformalisation, an approach that iteratively refines theorem prover
formalisation to mitigate errors, thereby increasing the execution rate on the
Lean prover from 60% to 87%. Building upon that, we introduce Theorem Prover as
a Judge (TP-as-a-Judge), a method that employs theorem prover formalisation to
rigorously assess LLM intermediate reasoning, effectively integrating
autoformalisation with synthetic data generation. Finally, we present
Reinforcement Learning from Theorem Prover Feedback (RLTPF), a framework that
replaces human annotation with theorem prover feedback in Reinforcement
Learning from Human Feedback (RLHF). Across multiple LLMs, applying
TP-as-a-Judge and RLTPF improves benchmarks with only 3,508 samples, achieving
5.56% accuracy gain on Mistral-7B for MultiArith, 6.00% on Llama-2-7B for
SVAMP, and 3.55% on Llama-3.1-8B for AQUA.