ash80/RLHF_in_notebooks: RLHF (Supervised fine-tuning, reward model, and PPO) step-by-step in 3 Jupyter notebooks


This repository provides a reference implementation of the Reinforcement Learning from Human Feedback (RLHF) [Paper] framework, as presented in the "RLHF from scratch, step-by-step, in code" YouTube video.

RLHF is a method for aligning large language models (LLMs), such as GPT-3 or GPT-2, to better meet users' intents. It is essentially a reinforcement learning approach in which, rather than receiving the reward directly from an environment or a human, the policy is optimised against a reward model trained to mimic that human feedback. The trained reward model is then used to score the LLM's generations during the reinforcement learning step. The RLHF process consists of three steps:

  1. Supervised Fine-Tuning (SFT)
  2. Reward Model Training
  3. Reinforcement Learning via Proximal Policy Optimisation (PPO).

To build a chatbot from a pretrained LLM, we might:

  • Collect a dataset of question-answer pairs (either human-written or generated by the pretrained model).
  • Have human annotators rank these answers by quality.
  • Follow the three RLHF steps mentioned above:
    1. SFT: Fine-tune the LLM to predict the next tokens given question-answer pairs.
    2. Reward Model: Train another instance of the LLM with an added reward head to mimic the human rankings (a sketch of a typical pairwise ranking loss follows this list).
    3. PPO: Further optimize the fine-tuned model using PPO to produce answers that the reward model evaluates positively.
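
For the chatbot setting, the reward model in step 2 is typically trained on the human rankings with a pairwise loss: for each pair of answers to the same question, the model should assign a higher score to the answer humans preferred. A minimal sketch of that loss (the names below are illustrative and not taken from this repository):

import torch
import torch.nn.functional as F

def pairwise_ranking_loss(reward_chosen: torch.Tensor,
                          reward_rejected: torch.Tensor) -> torch.Tensor:
    # Bradley-Terry style objective: push the score of the human-preferred
    # answer above the score of the rejected one.
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

# reward_chosen / reward_rejected are the scalar scores the reward model
# assigns to the preferred and rejected answer for the same question.
loss = pairwise_ranking_loss(torch.tensor([1.2]), torch.tensor([0.3]))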

Implementation in this Repository

Instead of building a chatbot, which would require a dataset of ranked questions and answers, we adapt the RLHF method to fine-tune GPT-2 to generate sentences expressing positive sentiment. For this task we use the stanfordnlp/sst2 dataset, a collection of movie-review sentences labeled as expressing positive or negative sentiment. Our goal is to leverage RLHF to optimise the pretrained GPT-2 so that it only generates sentences that are likely to express positive sentiment.
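
The dataset can be loaded with the Hugging Face datasets library; each example is a short movie-review sentence with a binary label (0 = negative, 1 = positive). A quick look (variable names here are just for illustration):

from datasets import load_dataset

# Stanford Sentiment Treebank (binary), as used throughout the notebooks
sst2 = load_dataset("stanfordnlp/sst2")

example = sst2["train"][0]
print(example["sentence"])  # a movie-review sentence fragment
print(example["label"])     # 0 = negative sentiment, 1 = positive sentiment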

We achieve this goal by implementing the following three notebooks, each corresponding to one step of the RLHF process:

  1. 1-SFT.ipynb: Fine-tunes GPT-2 via supervised learning on the stanfordnlp/sst2 dataset, training it to generate sentences resembling the sentences in this dataset. After fine-tuning, the model is saved as the SFT model.
  2. 2-RM Training.ipynb: Creates a Reward Model by attaching a reward head to the pretrained GPT-2. This model is trained to predict the sentiment labels (positive/negative) of sentences in the stanfordnlp/sst2 dataset. After training, the reward model (GPT-2 + Reward Head) is saved. A minimal sketch of such a reward head follows this list.
  3. 3-RLHF.ipynb: Implements the final reinforcement learning step using PPO:
    • Sampling stage: Generates sentences from the policy model (initialized from the SFT model), conditioned on a few initial prompt tokens, and scores them using the trained reward model.
    • Optimization stage: Optimizes the policy model's parameters with PPO's clipped surrogate objective (sketched below) so that generated sentences are more likely to receive high rewards (positive sentiment scores).
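
One common way to realise "GPT-2 + Reward Head" is a small linear layer on top of the final token's hidden state that outputs a single scalar score. The sketch below is an illustrative approximation of notebook 2's reward model, not a copy of its exact code (the notebook trains against the sst2 sentiment labels, and its head and pooling may differ):

import torch
import torch.nn as nn
from transformers import GPT2Model, GPT2Tokenizer

class GPT2RewardModel(nn.Module):
    """GPT-2 backbone plus a scalar reward head (illustrative sketch)."""
    def __init__(self, model_name: str = "gpt2"):
        super().__init__()
        self.backbone = GPT2Model.from_pretrained(model_name)
        self.reward_head = nn.Linear(self.backbone.config.n_embd, 1)

    def forward(self, input_ids, attention_mask=None):
        hidden = self.backbone(input_ids, attention_mask=attention_mask).last_hidden_state
        last_hidden = hidden[:, -1, :]                     # hidden state of the final token
        return self.reward_head(last_hidden).squeeze(-1)   # one scalar score per sequence

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
reward_model = GPT2RewardModel()
inputs = tokenizer("a gripping , beautifully shot film", return_tensors="pt")
score = reward_model(**inputs)  # after training: higher means more positive sentiment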

After completing these steps, GPT-2 will generate sentences aligned specifically to convey positive sentiments.
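
The optimisation stage rests on PPO's clipped surrogate objective, usually combined with a penalty that keeps the policy close to the SFT model so generations stay fluent. A minimal sketch of that objective (tensor names are illustrative; the notebook's exact formulation may differ):

import torch

def ppo_clipped_loss(logprobs_new: torch.Tensor,
                     logprobs_old: torch.Tensor,
                     advantages: torch.Tensor,
                     clip_eps: float = 0.2) -> torch.Tensor:
    # Probability ratio between the current policy and the policy that
    # generated the samples, per token.
    ratio = torch.exp(logprobs_new - logprobs_old)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # Take the pessimistic (smaller) surrogate and negate it for gradient descent.
    return -torch.min(unclipped, clipped).mean()

# Advantages are derived from the reward model's sentiment score for each
# generated sentence (optionally reduced by a KL penalty towards the SFT model).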

Hugging Face Access Token: You will need a Hugging Face access token to download the pretrained GPT-2 model. Obtain one by following the instructions in the Hugging Face Quickstart Guide.

Local Setup:

Set your Hugging Face token as an environment variable named HF_TOKEN:

export HF_TOKEN='your_huggingface_token_here'

Google Colab:

Set your Hugging Face token in Colab Secrets, or set it as an environment variable by running the following code in a notebook cell:

import os
os.environ['HF_TOKEN'] = 'your_huggingface_token_here'

Open and run the notebooks sequentially (1-SFT.ipynb, 2-RM Training.ipynb, then 3-RLHF.ipynb), following the step-by-step instructions provided within each notebook or in the YouTube video.
