releasing my grpo v2 repo: nano-grpo-reasoning-gym
two big changes: (1) it still implements the entire grpo training stack in plain pytorch/very simple python code, but is now extended with vLLM, the Liger kernel, and other optimizations that make training much faster (rough sketch of the core update below)
(2) it's built on top of the reasoning-gym repo, and is designed specifically to train and evaluate on these reasoning environments
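for a sense of what "grpo from scratch" means here, this is roughly the core update, a minimal sketch assuming per-token log-probs and one scalar reward per completion are already computed - tensor names and shapes are illustrative, not the repo's actual api:

```python
import torch

def grpo_loss(logprobs, old_logprobs, rewards, mask, clip_eps=0.2):
    """logprobs/old_logprobs/mask: (num_prompts, G, T); rewards: (num_prompts, G)
    mask is 1 on completion tokens, 0 on padding"""
    # group-relative advantage: normalize rewards within each group of G samples
    adv = (rewards - rewards.mean(dim=1, keepdim=True)) / (rewards.std(dim=1, keepdim=True) + 1e-4)
    adv = adv.unsqueeze(-1)  # broadcast over tokens
    # ppo-style clipped ratio against the policy that sampled the completions
    ratio = torch.exp(logprobs - old_logprobs)
    unclipped = ratio * adv
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * adv
    per_token = -torch.min(unclipped, clipped)
    # average over completion tokens only
    return (per_token * mask).sum() / mask.sum()

# e.g. 2 prompts, G=4 completions, 16 tokens each
lp = torch.randn(2, 4, 16)
old = lp.detach() + 0.01 * torch.randn_like(lp)
loss = grpo_loss(lp, old, torch.rand(2, 4), torch.ones(2, 4, 16))
```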
i really like writing things from scratch to build intuition for how they work, and a lot of my research interests involve doing weird little things to the training process, which i find much easier on simpler code
my previous repo was built with the same intention, but to keep it maximally simple it had essentially no optimizations - so while it was extremely easy to change things around, it was too slow and impractical for more serious training runs
like a lot of people i've become more interested in how models can learn across multiple environments - reasoning gym provides a nice standardized set of tasks to experiment with this. the repo makes it easy to mix different reasoning tasks, train on some, and eval on others (sketch below)
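a minimal sketch of what mixing tasks looks like - `create_dataset` and `score_answer` are real reasoning-gym calls, but the sampling loop and names are illustrative, not the repo's actual training code:

```python
import random
import reasoning_gym

train_tasks = ["leg_counting", "family_relationships"]
eval_tasks = ["leg_counting", "family_relationships", "coin_flip"]

# one procedurally generated, seeded dataset per training task
train_sets = {t: reasoning_gym.create_dataset(t, size=500, seed=42) for t in train_tasks}

def sample_mixed_batch(n):
    """draw n (task, entry) pairs uniformly across the training tasks;
    entries are dicts with 'question', 'answer', 'metadata'"""
    batch = []
    for _ in range(n):
        task = random.choice(train_tasks)
        ds = train_sets[task]
        batch.append((task, ds[random.randrange(len(ds))]))
    return batch

# each dataset ships its own verifier, so grading a model answer is just:
# reward = train_sets[task].score_answer(answer=model_answer, entry=entry)
```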
for me this is about having a fast but simple sandbox to test ideas. for others it might be useful for understanding how grpo/vLLM/Liger work in practice, or as a starting point for your own experiments
here's a first run - training on leg_counting + family_relationships, eval on those + coin_flip
all evals use probabilistic pass@1 with 5 completions per problem (the mean 0/1 correctness across samples - sketch below), so still noisy of course
leg_counting improves by +20%, family_relationships by +35%, and coin_flip by +8% (maybe just noise?)
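for clarity, the probabilistic pass@1 estimate is just the per-problem fraction of correct completions, averaged over problems - names here are illustrative:

```python
def pass_at_1(scores_per_problem):
    """scores_per_problem: one list of 0/1 scores per problem
    (k=5 completions each in the runs above)"""
    per_problem = [sum(s) / len(s) for s in scores_per_problem]
    return sum(per_problem) / len(per_problem)

# e.g. three problems, 5 completions each -> 0.6
print(pass_at_1([[1, 0, 1, 1, 0], [0, 0, 0, 1, 0], [1, 1, 1, 1, 1]]))
```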
GitHub link below