UP-GRPO: Unbounded Positive Group Relative Policy Optimization
🚧 Work in Progress
Abstract
Reinforcement Learning (RL) has become the cornerstone for unlocking the complex reasoning capabilities of Large Language Models (LLMs). Mainstream alignment algorithms, particularly GRPO (Group Relative Policy Optimization), rely heavily on Importance Sampling and Symmetric Clipping to constrain policy updates and...