Authors: Bo Yang, Qingping Yang, Runtao Liu
Abstract: The evaluation of mathematical reasoning capabilities is essential for
advancing Artificial General Intelligence (AGI). While Large Language Models
(LLMs) have shown impressive performance in solving mathematical problems,
existing benchmarks such as GSM8K and MATH present limitations, including
narrow problem definitions with specific numbers and reliance on predetermined
rules that hinder accurate assessments of reasoning and adaptability. This
paper introduces the UTMath Benchmark, which robustly evaluates models
through extensive unit tests. It consists of 1,053 problems across 9
mathematical domains, with over 68 test cases per problem. We propose an
innovative evaluation framework inspired by unit testing in software
development, focusing on both accuracy and reliability of results. Furthermore,
we introduce the Reasoning-to-Coding of Thoughts (RCoT) approach, which
encourages LLMs to perform explicit reasoning before generating code, leading
to more advanced solutions and improved performance. In addition, we
are releasing not only the UTMath benchmark but also the UTMath-Train training
dataset (more than 70k samples), to support the community in further exploring
mathematical reasoning.
Source: http://arxiv.org/abs/2411.07240v1
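
As a rough illustration of the unit-test-style evaluation the abstract describes (not the authors' actual harness), the sketch below checks a candidate solution function against many test cases and reports a pass rate; the function name `solve`, the test-case format, and the all-cases-pass criterion are assumptions made for illustration.

```python
# Minimal sketch of unit-test-style evaluation, assuming a model-generated
# solution is exposed as a callable `solve(n)` and each problem ships with
# many (inputs, expected_output) pairs. Names and formats are illustrative,
# not the UTMath harness itself.

def solve(n: int) -> int:
    # Placeholder "model-generated" solution: the n-th triangular number.
    return n * (n + 1) // 2

def evaluate(solution, test_cases):
    """Return the fraction of test cases the candidate solution passes."""
    passed = 0
    for inputs, expected in test_cases:
        try:
            if solution(*inputs) == expected:
                passed += 1
        except Exception:
            # Runtime errors count as failures rather than aborting the run.
            pass
    return passed / len(test_cases)

if __name__ == "__main__":
    # Requiring every case to pass probes reliability across many inputs,
    # rather than correctness on a single fixed instance.
    cases = [((n,), n * (n + 1) // 2) for n in range(1, 69)]  # e.g. 68 cases
    rate = evaluate(solve, cases)
    print(f"pass rate: {rate:.2%}, solved: {rate == 1.0}")
```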