
UP-GRPO: Unbounded Positive Group Relative Policy Optimization

🚧 Work in Progress

Abstract

Reinforcement Learning (RL) has become the cornerstone for unlocking the complex reasoning capabilities of Large Language Models (LLMs). Mainstream alignment algorithms, particularly Group Relative Policy Optimization (GRPO), rely heavily on importance sampling and symmetric clipping to constrain policy updates and...
