Tanishq Mathew Abraham, PhD (@iScienceLuvr)

Had implemented RLHF for diffusion models 2 weeks back, replicating the DDPO paper. Here's a before/after comparison training with aesthetics reward. This is actually the 1st RL algorithm I've coded from scratch! Code linked in next tweet. Explanatory blog post coming soon!

Jul 6, 2023 · 9:49 AM UTC

Tanishq Mathew Abraham, PhD

Hopefully it's fairly readable code. Learned a lot from this process. I am doing some more experiments with different reward functions, will share any learnings soon. github.com/tmabraham/ddpo-py…

GitHub - tmabraham/ddpo-pytorch: Reproduction of DDPO paper (RLHF for diffusion)

Tanishq Mathew Abraham, PhD

Oh and here is the reward curve during training. As you can see, it goes up, which is good! 😄

SeaBerg @sbergman

Replying to @iScienceLuvr

Looking forward to seeing the process you used. Seems like the tuned model eliminates legs, for the most part.

Tanishq Mathew Abraham, PhD

I guess legs are ugly/not very aesthetic? 😄

@untitled01ipynb

Replying to @iScienceLuvr

Your RLHF in a gif

Tanishq Mathew Abraham, PhD

😂😂😂

Sebastian Raschka

Replying to @iScienceLuvr

Awesome stuff! And looking forward to the post!

Tanishq Mathew Abraham, PhD

Thank you!

Janek Mann @janekm

Replying to @iScienceLuvr

Is this with laion aesthetics as the reward? The after looks a lot like their dataset 🤔

Tanishq Mathew Abraham, PhD

Yes, the after on the right is Stable Diffusion v1.4 RL fine-tuned with the LAION aesthetics classifier as a reward function
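
For readers wondering what "RL fine-tuned with a reward function" means here, below is a minimal sketch of the kind of objective the DDPO paper describes: each denoising step is treated as an action with a Gaussian log-probability, per-image rewards (such as an aesthetics score) are normalized into advantages, and a PPO-style clipped importance-weighted loss is minimized. This is not code from the linked repository. The function names (`gaussian_log_prob`, `normalize_rewards`, `clipped_pg_loss`) and the `clip_range` default are assumptions made for illustration, and it uses plain Python lists instead of tensors so it stays self-contained.

```python
import math


def gaussian_log_prob(x, mean, std):
    """Log-density of x under an isotropic Gaussian, summed over dimensions.

    In a DDPO-style setup this would score one denoising step
    x_{t-1} ~ N(mean, std^2 I), where mean comes from the diffusion model.
    """
    return sum(
        -0.5 * ((xi - mi) / std) ** 2 - math.log(std) - 0.5 * math.log(2 * math.pi)
        for xi, mi in zip(x, mean)
    )


def normalize_rewards(rewards, eps=1e-8):
    """Turn raw rewards (e.g. aesthetics scores) into zero-mean, unit-scale advantages."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    return [(r - mean) / (math.sqrt(var) + eps) for r in rewards]


def clipped_pg_loss(new_log_probs, old_log_probs, advantages, clip_range=0.2):
    """PPO-style clipped surrogate loss, averaged over samples.

    new_log_probs: log-probs of the sampled denoising steps under the current model
    old_log_probs: log-probs of the same steps under the model that generated them
    advantages:    one normalized reward per sample
    clip_range:    hypothetical default; the right value is a tuning choice
    """
    losses = []
    for new_lp, old_lp, adv in zip(new_log_probs, old_log_probs, advantages):
        ratio = math.exp(new_lp - old_lp)  # importance weight
        clipped_ratio = min(max(ratio, 1.0 - clip_range), 1.0 + clip_range)
        # Take the pessimistic (larger) of the two losses, as in PPO.
        losses.append(max(-adv * ratio, -adv * clipped_ratio))
    return sum(losses) / len(losses)


if __name__ == "__main__":
    advantages = normalize_rewards([5.1, 6.3, 5.7])  # made-up example scores
    old_lps = [gaussian_log_prob([0.1, -0.2], [0.0, 0.0], 1.0)] * 3
    new_lps = [lp + 0.05 for lp in old_lps]  # pretend the model shifted slightly
    print(clipped_pg_loss(new_lps, old_lps, advantages))
```

In a real training loop the log-probabilities would be differentiable tensors and the loss would be backpropagated through the denoising network for every sampled timestep. The clipping keeps the updated model from drifting too far from the one that generated the samples.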