Authors: Arnav Kumar Jain, Gonzalo Gonzalez-Pumariega, Wayne Chen, Alexander M Rush, Wenting Zhao, Sanjiban Choudhury
Abstract: We address the problem of code generation from multi-turn execution feedback.
Existing methods either generate code without feedback or use complex,
hierarchical reinforcement learning to optimize multi-turn rewards. We propose
a simple yet scalable approach, $\mu$Code, that solves multi-turn code
generation using only single-step rewards. Our key insight is that code
generation is a one-step recoverable MDP, where the correct code can be
recovered from any intermediate code state in a single turn. $\mu$Code
iteratively trains both a generator to provide code solutions conditioned on
multi-turn execution feedback and a verifier to score the newly generated code.
Experimental evaluations show that our approach achieves significant
improvements over state-of-the-art baselines. We analyze the design choices
of the reward models and policy, and show the efficacy of $\mu$Code at
utilizing execution feedback. Our code is available at
https://github.com/portal-cornell/muCode.
Source: http://arxiv.org/abs/2502.20380v1
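The abstract describes the generator-verifier training loop only at a high level; below is a minimal Python sketch of one plausible instantiation, inferred from the text above. All names here (mu_code_iteration, single_step_reward, the pass/fail reward, the best-of-n selection, and the feedback format) are illustrative assumptions, not the paper's actual implementation; see the linked repository for the real code.

    # Minimal sketch of a muCode-style iteration (assumptions throughout).
    from typing import Callable, List, Tuple

    def single_step_reward(code: str, tests: Callable[[str], bool]) -> float:
        # One-step recoverability: each turn's code is scored on its own
        # pass/fail outcome, with no multi-turn credit assignment.
        return 1.0 if tests(code) else 0.0

    def mu_code_iteration(
        generator: Callable[[str, str], List[str]],  # (problem, feedback) -> candidates
        verifier: Callable[[str, str], float],       # (problem, code) -> score
        problems: List[dict],
        max_turns: int = 3,
    ) -> Tuple[list, list]:
        gen_data, ver_data = [], []  # training data for the next iteration
        for p in problems:
            feedback = ""  # execution feedback accumulated across turns
            for _ in range(max_turns):
                candidates = generator(p["prompt"], feedback)
                rewards = [single_step_reward(c, p["tests"]) for c in candidates]
                # The verifier learns from single-step (code, reward) labels.
                ver_data += [(p["prompt"], c, r) for c, r in zip(candidates, rewards)]
                # The generator learns from the verifier's best-of-n selection.
                best = max(candidates, key=lambda c: verifier(p["prompt"], c))
                gen_data.append((p["prompt"], feedback, best))
                if single_step_reward(best, p["tests"]) == 1.0:
                    break  # correct code recovered; stop this problem
                feedback += f"\n[tests failed for]\n{best}"  # hypothetical format
        return gen_data, ver_data

    if __name__ == "__main__":
        # Toy demo with hand-rolled stand-ins for the LLM generator and verifier.
        probs = [{"prompt": "increment x", "tests": lambda c: "x + 1" in c}]
        gen = lambda prompt, fb: ["def f(x): return x", "def f(x): return x + 1"]
        ver = lambda prompt, code: float("x + 1" in code)
        print(mu_code_iteration(gen, ver, probs))

The single-step reward in this sketch reflects the one-step recoverability insight: because a correct solution is reachable from any intermediate code state in a single turn, each turn can be scored on its own pass/fail outcome rather than on a multi-turn return.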