Looks like others are converging on using vLLM v1 logprobs for the importance ratio to fix the stability problem.
I think I have PTSD from this kind of RL collapse.

Aug 22, 2025
With just a few lines of code, Feng’s (@fengyao1909) suggested fix—applying importance sampling on the behavior policy—resolved the training instability in my case (oat). I believe the result can generalize to other RL frameworks as well. Great work, Feng!
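The fix both posts describe can be sketched in a few lines. This is a minimal sketch, assuming token-level truncated importance sampling: each sampled token gets a weight equal to the ratio between the training engine's probability and the behavior (sampler, e.g. vLLM) probability, capped to bound variance. The function names and the `cap` value are hypothetical illustrations, not oat's or vLLM's API, and not necessarily the exact formulation in Feng's fix.

```python
import math
from typing import Sequence


def truncated_is_weights(
    trainer_logprobs: Sequence[float],
    behavior_logprobs: Sequence[float],
    cap: float = 2.0,  # hypothetical truncation threshold
) -> list[float]:
    """Per-token weights pi_train(a|s) / pi_behavior(a|s), truncated at `cap`.

    trainer_logprobs: log-probs of the sampled tokens recomputed by the trainer.
    behavior_logprobs: log-probs reported by the sampler when it generated them.
    """
    if len(trainer_logprobs) != len(behavior_logprobs):
        raise ValueError("logprob sequences must have the same length")
    # exp(log p_train - log p_behavior) is the probability ratio; min() truncates it.
    return [
        min(math.exp(lt - lb), cap)
        for lt, lb in zip(trainer_logprobs, behavior_logprobs)
    ]


def weighted_pg_loss(
    logprobs: Sequence[float],
    advantages: Sequence[float],
    weights: Sequence[float],
) -> float:
    """REINFORCE-style surrogate value: -mean(w * A * log p).

    In an autograd framework the weights would be detached (treated as
    constants) and only `logprobs` would carry gradients.
    """
    n = len(logprobs)
    return -sum(w * a * lp for w, a, lp in zip(weights, advantages, logprobs)) / n


if __name__ == "__main__":
    trainer = [-0.1, -2.0, -0.5]
    behavior = [-0.1, -3.0, -0.4]
    w = truncated_is_weights(trainer, behavior)
    print(w)  # second token's ratio e^1 exceeds the cap, so it is clipped to 2.0
    print(weighted_pg_loss(trainer, [1.0, 1.0, -1.0], w))
```

Truncating rather than using the raw ratio is a common design choice: when the sampler and trainer disagree sharply on a token, the untruncated weight can explode and destabilize the very training it is meant to correct.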