Authors: Jayesh Singla, Ananye Agarwal, Deepak Pathak
Abstract: Despite its extreme sample inefficiency, on-policy reinforcement learning, i.e.,
policy gradient methods, has become a fundamental tool in decision-making problems.
With the recent advances in GPU-driven simulation, the ability to collect large
amounts of data for RL training has scaled exponentially. However, we show that
current RL methods, e.g., PPO, fail to reap the benefits of parallelized
environments beyond a certain point, and their performance saturates. To address
this, we propose a new on-policy RL algorithm that can effectively leverage
large-scale environments by splitting them into chunks and fusing them back
together via importance sampling. Our algorithm, termed SAPG, achieves
significantly higher performance across a variety of challenging environments
where vanilla PPO and other strong baselines fail to perform well.
Website at https://sapg-rl.github.io/
Source: http://arxiv.org/abs/2407.20230v1
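
The abstract only describes the method at a high level. As a rough illustration of how data from many parallel environment chunks might be fused via importance sampling (this is a minimal sketch, not the authors' SAPG implementation; the function and variable names such as fused_policy_loss and logp_behavior are hypothetical), one can apply a PPO-style clipped surrogate to a fused batch where each transition's importance ratio is taken against the policy that actually collected it:

```python
import torch

def fused_policy_loss(logp_new, logp_behavior, advantages, clip_eps=0.2):
    """Clipped, importance-weighted surrogate over a fused batch of chunks.

    logp_new:      log-prob of each action under the current policy.
    logp_behavior: log-prob under whichever (possibly stale) policy collected
                   that transition; for the on-policy chunk this reduces to
                   the usual PPO old-policy log-prob.
    """
    # Importance ratio corrects for data gathered by other chunks' policies.
    ratio = torch.exp(logp_new - logp_behavior)
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    # Pessimistic (min) PPO-style objective applied uniformly to the fused data.
    return -torch.min(ratio * advantages, clipped * advantages).mean()

# Toy usage: concatenate transitions from an on-policy chunk and an
# off-policy chunk, then take one gradient step on the fused batch.
logp_new = torch.randn(256, requires_grad=True)
logp_behavior = torch.randn(256)
advantages = torch.randn(256)
loss = fused_policy_loss(logp_new, logp_behavior, advantages)
loss.backward()
```

The clipping keeps large importance ratios from stale chunks from dominating the gradient; the actual weighting and chunking scheme used by SAPG is detailed in the paper itself.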