Group-relative policy optimisation
Consider the problem of optimising the parameters $\theta$ of a policy $\pi_\theta$. We know from the policy gradient theorem (Williams, 1992) that

$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\!\left[\sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, R(\tau)\right].$$
Standard policy gradient estimators are unbiased but suffer from high variance. We would therefore like to apply variance reduction methods, which can be encapsulated in the generalised advantage estimation framework. However, these have seen success mainly with learned baselines such as value functions or Q-functions. What if we could skip learning a baseline altogether and still have low variance?
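To make the variance argument concrete, here is a toy sketch (the two-action policy and the rewards are hypothetical, chosen only for illustration): a score-function estimator with and without a constant baseline has the same mean, but the baseline shrinks the variance.

```python
import numpy as np

rng = np.random.default_rng(0)
theta = 0.3
p = 1.0 / (1.0 + np.exp(-theta))      # pi(a=1) under a sigmoid policy
rewards = {0: 2.0, 1: 3.0}            # hypothetical toy rewards

def grad_estimates(baseline, n=100_000):
    # Score-function (REINFORCE) estimator: grad log pi(a) * (r - b)
    a = rng.random(n) < p             # sample actions from the policy
    score = np.where(a, 1.0 - p, -p)  # d/dtheta of log pi(a)
    r = np.where(a, rewards[1], rewards[0])
    return score * (r - baseline)

g_no_base = grad_estimates(baseline=0.0)
g_base = grad_estimates(baseline=2.5)  # baseline = mean reward
# Both estimators have the same expectation (the baseline term has
# zero mean), but the baselined one has far lower variance.
```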
Enter GRPO, in which we sample a group of $G$ rollouts, observe a reward $r_i$ for each, and estimate the advantage of each rollout relative to the group average:

$$\hat{A}_i = \frac{r_i - \operatorname{mean}(r_1, \dots, r_G)}{\operatorname{std}(r_1, \dots, r_G)},$$

where $r_i$ is the reward of the $i$-th rollout and $G$ is the group size. Notice that the advantage is undefined for $\operatorname{std}(r_1, \dots, r_G) = 0$, which happens when either all $r_i$ coincide or our group size is too small ($G = 1$). Note that the choice of normalisation is somewhat arbitrary, but dividing by the group standard deviation works well in practice.
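The group-relative normalisation above can be sketched in a few lines; this is a minimal illustration, not a full GRPO training loop, and the epsilon guard for the degenerate zero-deviation case is an implementation choice, not part of the definition.

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """Advantage of each rollout relative to its group: subtract the
    group mean reward and divide by the group standard deviation."""
    r = np.asarray(rewards, dtype=np.float64)
    mean, std = r.mean(), r.std()
    # Degenerate case: all rewards coincide (or G == 1), so the
    # normalised advantage is undefined; return zeros by convention.
    if std < eps:
        return np.zeros_like(r)
    return (r - mean) / std

# Example: a group of G = 4 rollouts with scalar rewards.
advs = group_relative_advantages([1.0, 2.0, 3.0, 2.0])
```

By construction the resulting advantages have zero mean and unit standard deviation within the group, which is what keeps the estimator's scale stable without a learned baseline.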