It looks like other people are converging on using the vLLM v1 logprobs for the importance ratio to fix the stability problem.
I think I have PTSD from this kind of RL crash.

22/08/2025
With just a few lines of code, Feng’s (@fengyao1909) suggested fix—applying importance sampling on the behavior policy—resolved the training instability in my case (oat). I believe the result can generalize to other RL frameworks as well. Great work, Feng!
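
For readers wondering what "a few lines of code" might look like, here is a minimal sketch of one way to apply an importance-sampling correction against the behavior (sampler) policy. It is not taken from oat or from Feng's post. The function names, the truncation step, and the cap value of 2.0 are all assumptions made for illustration. The idea is that the inference engine (e.g. vLLM) reports the logprob it sampled each token with, the trainer recomputes the logprob for the same token, and the per-token ratio between the two reweights the policy-gradient term.

```python
import math

def truncated_is_weights(trainer_logprobs, sampler_logprobs, cap=2.0):
    """Per-token importance ratio pi_trainer / pi_sampler, truncated at `cap`.

    `cap` is a hypothetical hyperparameter; truncation is an assumed choice
    to keep a few large ratios from dominating the update.
    """
    return [
        min(math.exp(lp_train - lp_samp), cap)
        for lp_train, lp_samp in zip(trainer_logprobs, sampler_logprobs)
    ]

def weighted_pg_loss(trainer_logprobs, sampler_logprobs, advantages, cap=2.0):
    """Mean policy-gradient loss with each token reweighted by its ratio.

    In a real trainer the weights would be treated as constants (no gradient)
    and `trainer_logprobs` would be differentiable tensors; plain floats are
    used here only to show the arithmetic.
    """
    weights = truncated_is_weights(trainer_logprobs, sampler_logprobs, cap)
    terms = [w * a * lp for w, a, lp in zip(weights, advantages, trainer_logprobs)]
    return -sum(terms) / len(terms)

# Toy usage: the second token is more likely under the trainer than under the
# sampler, so its ratio exp(1.5) is truncated to the cap.
weights = truncated_is_weights([-1.0, -0.5], [-1.0, -2.0], cap=2.0)
loss = weighted_pg_loss([-1.0, -0.5], [-1.0, -2.0], [1.0, 1.0], cap=2.0)
```

When the sampler and trainer agree exactly, every ratio is 1 and this reduces to the usual unweighted loss, so the correction only matters where the two implementations disagree.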
