ash80/RLHF_in_notebooks: RLHF (Supervised fine-tuning, reward model, and PPO) step-by-step in 3 Jupyter notebooks


This repository provides a reference implementation of the Reinforcement Learning from Human Feedback (RLHF) [Paper] framework, as presented in the "RLHF from scratch, step-by-step, in code" YouTube video.

RLHF is a method for aligning large language models (LLMs), such as GPT-3 or GPT-2, to better meet users' intents. It is essentially a reinforcement learning approach in which, rather than receiving the reward directly from an environment or a human, the policy is optimised against a reward model trained to mimic that human feedback. The trained reward model is then used to score the LLM's generations during the reinforcement learning step. The RLHF process consists of three steps:

  1. Supervised Fine-Tuning (SFT)
  2. Reward Model Training
  3. Reinforcement Learning via Proximal Policy Optimisation (PPO).

To build a chatbot from a pretrained LLM, we might:

  • Collect a dataset of question-answer pairs (either human-written or generated by the pretrained model).
  • Have human annotators rank these answers by quality.
  • Follow the three RLHF steps mentioned above:
    1. SFT: Fine-tune the LLM to predict the next tokens given question-answer pairs.
    2. Reward Model: Train another instance of the LLM with an added reward head to mimic the human rankings (a sketch of a typical pairwise ranking loss follows this list).
    3. PPO: Further optimize the fine-tuned model using PPO to produce answers that the reward model evaluates positively.
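
For the chatbot setting, the reward model in step 2 is typically trained on the human rankings with a pairwise loss: for each pair of answers to the same question, the model should assign a higher score to the answer humans preferred. A minimal sketch of that loss (the names below are illustrative and not taken from this repository):

import torch
import torch.nn.functional as F

def pairwise_ranking_loss(reward_chosen: torch.Tensor,
                          reward_rejected: torch.Tensor) -> torch.Tensor:
    # Bradley-Terry style objective: push the score of the human-preferred
    # answer above the score of the rejected one.
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

# reward_chosen / reward_rejected are the scalar scores the reward model
# assigns to the preferred and rejected answer for the same question.
loss = pairwise_ranking_loss(torch.tensor([1.2]), torch.tensor([0.3]))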

Implementation in this Repository

Instead of building a chatbot, which would require a dataset of ranked questions and answers, we adapt the RLHF method to fine-tune GPT-2 to generate sentences expressing positive sentiment. For this task we use the stanfordnlp/sst2 dataset, a collection of movie-review sentences labeled as expressing positive or negative sentiment. Our goal is to leverage RLHF to optimise the pretrained GPT-2 so that it only generates sentences that are likely to express positive sentiment.
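
The dataset can be loaded with the Hugging Face datasets library; each example is a short movie-review sentence with a binary label (0 = negative, 1 = positive). A quick look (variable names here are just for illustration):

from datasets import load_dataset

# Stanford Sentiment Treebank (binary), as used throughout the notebooks
sst2 = load_dataset("stanfordnlp/sst2")

example = sst2["train"][0]
print(example["sentence"])  # a movie-review sentence fragment
print(example["label"])     # 0 = negative sentiment, 1 = positive sentiment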

We achieve this goal by implementing the following three notebooks, each corresponding to one step of the RLHF process:

  1. 1-SFT.ipynb: Fine-tunes GPT-2 via supervised learning on the stanfordnlp/sst2 dataset, training it to generate sentences resembling the sentences in this dataset. After fine-tuning, the model is saved as the SFT model.
  2. 2-RM Training.ipynb: Creates a Reward Model by attaching a reward head to the pretrained GPT-2. This model is trained to predict the sentiment labels (positive/negative) of sentences in the stanfordnlp/sst2 dataset. After training, the reward model (GPT-2 + Reward Head) is saved. A minimal sketch of such a reward head follows this list.
  3. 3-RLHF.ipynb: Implements the final reinforcement learning step using PPO:
    • Sampling stage: Generates sentences from the policy model (initialized from the SFT model), conditioned on a few initial prompt tokens, and scores them using the trained reward model.
    • Optimization stage: Optimizes the policy model's parameters with PPO's clipped surrogate objective (sketched below) so that generated sentences are more likely to receive high rewards (positive sentiment scores).
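
One common way to realise "GPT-2 + Reward Head" is a small linear layer on top of the final token's hidden state that outputs a single scalar score. The sketch below is an illustrative approximation of notebook 2's reward model, not a copy of its exact code (the notebook trains against the sst2 sentiment labels, and its head and pooling may differ):

import torch
import torch.nn as nn
from transformers import GPT2Model, GPT2Tokenizer

class GPT2RewardModel(nn.Module):
    """GPT-2 backbone plus a scalar reward head (illustrative sketch)."""
    def __init__(self, model_name: str = "gpt2"):
        super().__init__()
        self.backbone = GPT2Model.from_pretrained(model_name)
        self.reward_head = nn.Linear(self.backbone.config.n_embd, 1)

    def forward(self, input_ids, attention_mask=None):
        hidden = self.backbone(input_ids, attention_mask=attention_mask).last_hidden_state
        last_hidden = hidden[:, -1, :]                     # hidden state of the final token
        return self.reward_head(last_hidden).squeeze(-1)   # one scalar score per sequence

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
reward_model = GPT2RewardModel()
inputs = tokenizer("a gripping , beautifully shot film", return_tensors="pt")
score = reward_model(**inputs)  # after training: higher means more positive sentiment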

After completing these steps, GPT-2 will generate sentences aligned specifically to convey positive sentiments.
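
The optimisation stage rests on PPO's clipped surrogate objective, usually combined with a penalty that keeps the policy close to the SFT model so generations stay fluent. A minimal sketch of that objective (tensor names are illustrative; the notebook's exact formulation may differ):

import torch

def ppo_clipped_loss(logprobs_new: torch.Tensor,
                     logprobs_old: torch.Tensor,
                     advantages: torch.Tensor,
                     clip_eps: float = 0.2) -> torch.Tensor:
    # Probability ratio between the current policy and the policy that
    # generated the samples, per token.
    ratio = torch.exp(logprobs_new - logprobs_old)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # Take the pessimistic (smaller) surrogate and negate it for gradient descent.
    return -torch.min(unclipped, clipped).mean()

# Advantages are derived from the reward model's sentiment score for each
# generated sentence (optionally reduced by a KL penalty towards the SFT model).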

Hugging Face Access Token: You will need a Hugging Face access token to download the pretrained GPT-2 model. Obtain one by following the instructions in the Hugging Face Quickstart Guide.

Local Setup:

Set your Hugging Face token as an environment variable named HF_TOKEN:

export HF_TOKEN='your_huggingface_token_here'

Google Colab:

Set your Hugging Face token in Colab Secrets, or set it as an environment variable by running the following code in a notebook cell:

import os
os.environ['HF_TOKEN'] = 'your_huggingface_token_here'

Open and run the notebooks sequentially (1-SFT.ipynb, 2-RM Training.ipynb, then 3-RLHF.ipynb), following the step-by-step instructions provided within each notebook or in the YouTube video.
