Will this book talk about RLHF? #226
-
Great book! I read all the notebooks in this repo and have a question. I heard RLHF (Reinforcement Learning from Human Feedback) is the core technique behind ChatGPT. Does this book cover it? I see there will be extra material on DPO for preference fine-tuning. Is that equivalent to RLHF? And what's the common practice in industry today after instruction fine-tuning? Thanks!
-
Thanks! Regarding DPO, I had actually implemented it for Chapter 7, but then removed it for a couple of reasons:
DPO is a nice and relatively simple technique for preference finetuning, but it didn't quite meet the bar of fundamental, established techniques that work well. I'll be busy finishing up the book itself over the next few weeks, but then I plan to polish up the DPO part and share it either here or on my blog. After that, I plan to do the same for RLHF with dedicated reward models. In the meantime, you might like my two articles here:
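To make the DPO idea a bit more concrete, here is a minimal sketch of the DPO loss in PyTorch. The function and argument names (`dpo_loss`, `policy_chosen_logps`, etc.) are illustrative placeholders, not code from the book:

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    # Implicit rewards: scaled log-prob ratios of the policy vs. a frozen reference model
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Logistic loss on the margin between preferred and dispreferred responses
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```

Each `*_logps` argument is the summed token log-probability of a response under the corresponding model, and `beta` controls how far the policy is allowed to drift from the reference.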
-
Sorry to hear that, and I hope he has a speedy recovery. Yes, I am very excited for the new book on reasoning/thinking models!
…On Thu, Jun 12, 2025 at 5:13 AM casinca wrote:
If it can shed some light, @henrythe9th @d-kleine: in his latest blog post, Sebastian mentioned working on chapters about "reasoning/thinking" models. (Yes, purists will say it's not "RLHF", it's RLVR.) But honestly, having tried implementing GRPO for preference tuning (and an unfinished attempt at RLVR), it's approximately the same pipeline and uses the same RL algorithms (hybrid ones like PPO, or pure policy-gradient ones like GRPO and its variants...), with reward shaping being one of the main differences.
What I mean is, I'm pretty sure everyone will learn a lot from these new chapters/book for pure RLHF alignment too.
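(To illustrate the point about reward shaping being the main difference, here is a rough, hypothetical sketch, not code from the book or from anyone's experiments: an RLHF-style reward comes from a learned reward model, an RLVR-style reward from a programmatic verifier, and both can feed the same GRPO-style group-normalized advantage. The `reward_model.score` interface and `extract_answer` helper are assumed placeholders.)

```python
import torch

def rlhf_reward(reward_model, prompt, response):
    # RLHF: score from a learned preference/reward model (assumed .score() interface)
    return reward_model.score(prompt, response)

def rlvr_reward(response, ground_truth, extract_answer):
    # RLVR: verifiable reward, e.g. 1.0 if the parsed answer matches the label, else 0.0
    return 1.0 if extract_answer(response) == ground_truth else 0.0

def grpo_advantages(rewards):
    # GRPO-style shaping: normalize rewards within a group of samples
    # drawn for the same prompt (zero mean, unit variance)
    r = torch.tensor(rewards, dtype=torch.float32)
    return (r - r.mean()) / (r.std() + 1e-8)
```

Either reward then goes through the same group-normalized advantage and the same policy-gradient update; only the reward source changes.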