I implemented RLHF for diffusion models two weeks ago, replicating the DDPO paper.
Here's a before/after comparison from training with an aesthetic reward.
This is actually the 1st RL algorithm I've coded from scratch!
Code linked in next tweet.
Explanatory blog post coming soon!
Jul 6, 2023 · 9:49 AM UTC
Hopefully the code is fairly readable. I learned a lot from this process. I'm running more experiments with different reward functions and will share what I learn soon.
github.com/tmabraham/ddpo-py…