Bibliography (10):
https://x.com/rm_rafailov/status/1781145338759533016
V-STaR: Training Verifiers for Self-Taught Reasoners
Diffusion Model Alignment Using Direct Preference Optimization
Wikipedia Bibliography:
Reinforcement learning
Markov decision process
Q-learning
Bellman equation (https://en.wikipedia.org/wiki/Bellman_equation)
Monte Carlo tree search
Beam search
End-to-end principle