(InstructGPT) Training language models to follow instructions with human feedback


  • The foundational paper on the way to ChatGPT
  • paper: Training language models to follow instructions with human feedback.pdf
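  • Key takeaway: real-world prompts matter enormously. Labelers have to write good demonstrations for them, the RM has to be trained well from output rankings, and PPO has to be trained well while keeping the pretraining LM objective mixed in at a fixed ratio.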



  • Making language models bigger does not inherently make them better at following a user’s intent
    • Simply scaling up the model does not make it better at following the user's intent
  • large language models can generate outputs that are untruthful, toxic, or simply not helpful to the user
    • These models also generate text that is untruthful or toxic
  • Such models are not aligned with their users!
  • show an avenue for aligning language models with user intent on a wide range of tasks by fine-tuning with human feedback.
    • The paper shows the effect of aligning an LM with user intent by fine-tuning it with human feedback
  • Starting with a set of labeler-written prompts and prompts submitted through the OpenAI API, we collect a dataset of labeler demonstrations of the desired model behavior, which we use to fine-tune GPT-3 using supervised learning.
    • A dataset of labeler demonstrations is collected in order to fine-tune GPT-3
  • collect a dataset of rankings of model outputs
    • A second dataset is collected as rankings of model outputs
  • which we use to further fine-tune this supervised model using reinforcement learning from human feedback
    • The rankings are used for fine-tuning with RLHF
  • call the resulting models InstructGPT
    • The trained models are called InstructGPT
  • In human evaluations on our prompt distribution, outputs from the 1.3B parameter InstructGPT model are preferred to outputs from the 175B GPT-3, despite having 100x fewer parameters.
    • In human evaluation, the 1.3B InstructGPT gives better results than the 175B GPT-3
  • InstructGPT models show improvements in truthfulness and reductions in toxic output generation while having minimal performance regressions on public NLP datasets
    • Truthfulness and toxicity also improve noticeably
  • Shows that RLHF-based fine-tuning is a promising direction for aligning LMs with human intent


  • Existing LMs have been prompted to perform a variety of NLP tasks
    • But there are drawbacks
      • However, these models often express unintended behaviors such as making up facts, generating biased or toxic text or simply not following user instructions
  • The fundamental problem with the objective of recent LLMs
    • the language modeling objective (predicting the next token on a webpage from the internet) is different from the objective “follow the user’s instructions helpfully and safely”
    • language modeling objective is misaligned
  • Aligning the LM with the user's intent is important
    • make progress on aligning language models by training them to act in accordance with the user’s intention
  • use reinforcement learning from human feedback (RLHF; Christiano et al., 2017; Stiennon et al., 2020) to fine-tune GPT-3 to follow a broad class of written instructions
  • This technique uses human preferences as a reward signal during fine-tuning
  • Procedure
    • first hire a team of 40 contractors to label our data, based on their performance on a screening test (see Section 3.4 and Appendix B.1 for more details)
      • Prompts already submitted to OpenAI are used together with prompts written by newly hired labelers; labelers then write demonstrations for the collected prompts to build the dataset
      • collect a dataset of human-written demonstrations of the desired output behavior on (mostly English) prompts submitted to the OpenAI API and some labeler-written prompts, and use this to train our supervised learning baselines
    • Next, we collect a dataset of human-labeled comparisons between outputs from our models on a larger set of API prompts
      • Builds a dataset in which labelers compare outputs from the models on the same prompt
    • then train a reward model (RM) on this dataset to predict which model output our labelers would prefer
    • Finally, we use this RM as a reward function and fine-tune our supervised learning baseline to maximize this reward using the PPO algorithm
  • We call the resulting models InstructGPT.
  • Evaluation
    • mainly evaluate our models by having our labelers rate the quality of model outputs on our test set, consisting of prompts from held-out customers
    • also conduct automatic evaluations on a range of public NLP datasets
  • Three model sizes are trained (1.3B, 6B, and 175B parameters)
  • Main Findings:
    • Labelers significantly prefer InstructGPT outputs over outputs from GPT-3
      • Outputs from 175B InstructGPT are preferred to 175B GPT-3 outputs 85 ± 3% of the time, and preferred 71 ± 4% of the time to few-shot 175B GPT-3
    • InstructGPT models show improvements in truthfulness over GPT-3
      • On the TruthfulQA benchmark, InstructGPT generates truthful and informative answers about twice as often as GPT-3
    • InstructGPT shows small improvements in toxicity over GPT-3, but not bias
      • InstructGPT models generate about 25% fewer toxic outputs than GPT-3
    • We can minimize performance regressions on public NLP datasets by modifying our RLHF fine-tuning procedure.
      • observe performance regressions compared to GPT-3 on certain public NLP datasets, notably SQuAD (Rajpurkar et al., 2018), DROP (Dua et al., 2019), HellaSwag (Zellers et al., 2019), and WMT 2015 French to English translation (Bojar et al., 2015)
        • This is an example of an “alignment tax”
      • We can greatly reduce the performance regressions on these datasets by mixing PPO updates with updates that increase the log likelihood of the pretraining distribution (PPO-ptx), without compromising labeler preference scores.
        • What does it mean to mix PPO updates with updates that preserve the log likelihood of the pretraining distribution? Interesting.
    • Our models generalize to the preferences of “held-out” labelers that did not produce any training data.
    • Public NLP datasets are not reflective of how our language models are used
    • InstructGPT models show promising generalization to instructions outside of the RLHF fine-tuning distribution.
    • InstructGPT still makes simple mistakes.

Methods and experimental details

  • Step 1: Collect demonstration data, and train a supervised policy
    • the desired behavior on the input prompt distribution
    • fine-tune a pretrained GPT-3 model on this data using supervised learning.
  • Step 2: Collect comparison data, and train a reward model.
    • collect a dataset of comparisons between model outputs, where labelers indicate which output they prefer for a given input (several outputs are collected per input)
    • train a reward model to predict the human-preferred output.
  • Step 3: Optimize a policy against the reward model using PPO.
    • use the output of the RM as a scalar reward
    • fine-tune the supervised policy to optimize this reward using the PPO algorithm (Schulman et al., 2017).

High-level methodology

  • Steps 2 and 3 can be iterated continuously; more comparison data is collected on the current best policy, which is used to train a new RM and then a new policy
  • In practice, most of our comparison data comes from our supervised policies (Step 1), with some coming from our PPO policies.


  • Collection
    • consists primarily of text prompts submitted to the OpenAI API, on the Playground interface (uses data entered in the Playground and not data from production API usage; clever)
  • Cleaning
    • heuristically deduplicate prompts by checking for prompts that share a long common prefix, and we limit the number of prompts to 200 per user ID
    • also create our train, validation, and test splits based on user ID
      • validation and test sets contain no data from users whose data is in the training set
      • The train, validation, and test sets are split by user
    • filter all prompts in the training split for personally identifiable information (PII).
  • Prompts are also collected separately from labelers; three kinds of prompts:
    • Plain: come up with an arbitrary task
    • Few-shot: come up with an instruction, and multiple query/response pairs for that instruction.
    • User-based: come up with prompts corresponding to these use cases (OpenAI API) (I do not fully understand this one)
  • produce three different datasets used in our fine-tuning procedure:
    • (1) our SFT dataset, with labeler demonstrations used to train our SFT models
      • SFT dataset contains about 13k training prompts (from the API and labeler-written)
    • (2) our RM dataset, with labeler rankings of model outputs used to train our RMs
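      • RM dataset has 33k training prompts (from the API and labeler-written) (How many examples is that once responses are included? I recalled that the ranking is over several responses to a single prompt rather than across prompts, so I checked: that is correct. K responses (4 to 9) are sampled per prompt and turned into (K choose 2) comparison pairs)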
    • (3) our PPO dataset, without any human labels, which are used as inputs for RLHF fine-tuning
      • PPO dataset has 31k training prompts (only from the API)
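The cleaning steps above (prefix-based deduplication, a cap of 200 prompts per user ID, and splits by user ID) could look roughly like the following. This is a minimal sketch, not the paper's pipeline: the function names, the 50-character prefix length, and the split fractions are all assumptions made for illustration, since the notes do not say how long a "long common prefix" is or how large the splits are.

```python
import random
from collections import defaultdict

PREFIX_LEN = 50      # assumption: how long a shared prefix must be to count as a duplicate
MAX_PER_USER = 200   # cap per user ID, as stated in the notes


def dedup_and_cap(records, prefix_len=PREFIX_LEN, max_per_user=MAX_PER_USER):
    """records: iterable of (user_id, prompt). Keeps the first prompt seen for
    each prefix and at most max_per_user prompts per user."""
    seen_prefixes = set()
    per_user = defaultdict(int)
    kept = []
    for user_id, prompt in records:
        prefix = prompt[:prefix_len]
        if prefix in seen_prefixes or per_user[user_id] >= max_per_user:
            continue
        seen_prefixes.add(prefix)
        per_user[user_id] += 1
        kept.append((user_id, prompt))
    return kept


def split_by_user(records, val_frac=0.1, test_frac=0.1, seed=0):
    """Assigns whole users to train/val/test so no user appears in two splits.
    The fractions are illustrative assumptions."""
    users = sorted({user_id for user_id, _ in records})
    random.Random(seed).shuffle(users)
    n_val = int(len(users) * val_frac)
    n_test = int(len(users) * test_frac)
    val_users = set(users[:n_val])
    test_users = set(users[n_val:n_val + n_test])
    splits = {"train": [], "val": [], "test": []}
    for user_id, prompt in records:
        if user_id in val_users:
            splits["val"].append((user_id, prompt))
        elif user_id in test_users:
            splits["test"].append((user_id, prompt))
        else:
            splits["train"].append((user_id, prompt))
    return splits
```

Splitting on user ID rather than on individual prompts is what keeps near-identical prompts from the same user out of both the training and evaluation sets.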


  • The use-case categories of the API prompt dataset are summarized separately
    • Most prompts are for generative use cases


  • Training tasks are from two sources:
    • (1) a dataset of prompts written by our labelers
    • (2) a dataset of prompts submitted to early InstructGPT models on our API (see Table 6)
  • These prompts are very diverse and include generation, question answering, dialog, summarization, extractions, and other natural language tasks
  • Our dataset is over 96% English

Human data collection

  • hired a team of about 40 contractors on Upwork and through ScaleAI
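    • Upwork and ScaleAI are apparently vendors for this kind of work
  • Earlier work mostly collected summarization data, but this time a wide range of tasks and occasionally controversial topics are included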
    • our inputs span a much broader range of tasks, and can occasionally include controversial and sensitive topics
    • It seems a group of labelers who can judge sensitive topics was selected specifically
      • Our aim was to select a group of labelers who were sensitive to the preferences of different demographic groups, and who were good at identifying outputs that were potentially harmful.
    • A screening test was run on candidate labelers, and those who met the criteria were selected (measured agreement between us and labelers)
      • conducted a screening test designed to measure labeler performance on these axes. We selected labelers who performed well on this test; for more information about our selection procedure and labeler demographics
  • To study whether the model's results generalize to other labelers, a separate group was hired in addition
    • hire a separate set of labelers who do not produce any of the training data. These labelers are sourced from the same vendors, but do not undergo a screening test.
  • inter-annotator agreement rates are quite high: training labelers agree with each-other 72.6 ± 1.5% of the time, while for held-out labelers this number is 77.3 ± 1.3%. For comparison, in the summarization work of Stiennon et al. (2020) researcher-researcher agreement was 73 ± 4%.
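To make the agreement numbers above concrete, here is one simple way such a rate could be computed: the fraction of shared comparisons on which two labelers pick the same preferred output, pooled over all labeler pairs. This is only an illustrative sketch; the function name and data layout are hypothetical, and the notes do not say exactly how the paper computes its agreement rate.

```python
from itertools import combinations


def pairwise_agreement(labels_by_labeler):
    """labels_by_labeler: dict mapping labeler -> {comparison_id: preferred output}.
    Returns the fraction of (labeler pair, shared comparison) cases that agree."""
    agree = 0
    total = 0
    for a, b in combinations(sorted(labels_by_labeler), 2):
        shared = labels_by_labeler[a].keys() & labels_by_labeler[b].keys()
        for comparison_id in shared:
            total += 1
            agree += labels_by_labeler[a][comparison_id] == labels_by_labeler[b][comparison_id]
    return agree / total if total else float("nan")
```

Only comparisons labeled by both members of a pair contribute, which matters here because most comparisons are labeled by a single contractor.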


  • Training starts from GPT-3
  • The models are then trained with three different techniques
    • Supervised fine-tuning (SFT)
      • fine-tune GPT-3 on our labeler demonstrations using supervised learning
      • hparams
        • trained for 16 epochs,
        • using a cosine learning rate decay, and
        • residual dropout of 0.2
      • do our final SFT model selection based on the RM score on the validation set
      • The model overfits, but the RM score and human preference ratings are still good
        • find that our SFT models overfit on validation loss after 1 epoch; however, we find that training for more epochs helps both the RM score and human preference ratings, despite this overfitting.
    • Reward modeling (RM)
      • trained a model to take in a prompt and response, and output a scalar reward.
      • use 6B RMs, as this saves a lot of compute,
      • we found that 175B RM training could be unstable and thus was less suitable to be used as the value function during RL (see Appendix C for more details).
      • RM is trained on a dataset of comparisons between two model outputs on the same input.
      • use a cross-entropy loss, with the comparisons as labels—the difference in rewards represents the log odds that one response will be preferred to the other by a human labeler
      • present labelers with anywhere between K = 4 and K = 9 responses to rank
        • This produces (K choose 2) comparisons for each prompt shown to a labeler
      • if we simply shuffle the comparisons into one dataset, a single pass over the dataset caused the reward model to overfit.
        • Instead, we train on all (K choose 2) comparisons from each prompt as a single batch element (putting every pair from a prompt into one batch element is the key point for preventing overfitting)
        • This is much more computationally efficient because it only requires a single forward pass of the RM for each completion (rather than (K choose 2) forward passes for K completions) -> presumably because everything is handled inside one batch element, whereas scattered comparisons would need as many forward passes as there are pairs? (see footnote 5)
      • Finally, since the RM loss is invariant to shifts in reward, we normalize the reward model using a bias so that the labeler demonstrations achieve a mean score of 0 before doing RL.
    • Reinforcement learning (RL)
      • fine-tuned the SFT model on our environment using PPO (Schulman et al., 2017)
      • environment is a bandit (slot-machine style) environment which presents a random customer prompt and expects a response to the prompt
      • Given the prompt and response, it produces a reward determined by the reward model and ends the episode
      • add a per-token KL penalty from the SFT model at each token to mitigate over-optimization of the reward model
        • Understanding exactly what the per-token KL penalty is seems important
      • value function is initialized from the RM. We call these models “PPO.”
      • experiment with mixing the pretraining gradients into the PPO gradients, in order to fix the performance regressions on public NLP datasets -> “PPO-ptx.”
    • Baselines
      • compare the performance of our PPO models to our SFT models and GPT-3
      • also compare to GPT-3 when it is provided a few-shot prefix to ‘prompt’ it into an instruction-following mode (GPT-3-prompted). This prefix is prepended to the user-specified instruction
      • compare InstructGPT to fine-tuning 175B GPT-3 on the FLAN (Wei et al., 2021) and T0 (Sanh et al., 2021) datasets, which both consist of a variety of NLP tasks
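The reward-model training described above (a cross-entropy loss over all pairs from one prompt, computed from one scalar reward per completion) can be sketched as follows. This is a minimal illustration under stated assumptions: the function names are hypothetical, the rewards are assumed to have already been produced by a single RM forward pass per completion, and the labeler ranking is assumed to be strict (no ties).

```python
import math
from itertools import combinations


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def rm_loss_for_prompt(rewards_ranked):
    """rewards_ranked: scalar rewards for the K completions of one prompt,
    ordered from most to least preferred by the labeler.
    Returns the mean pairwise loss over all (K choose 2) pairs."""
    pairs = list(combinations(range(len(rewards_ranked)), 2))
    total = 0.0
    for winner, loser in pairs:  # winner is ranked above loser
        total += log_sigmoid(rewards_ranked[winner] - rewards_ranked[loser])
    return -total / len(pairs)
```

Because the loss depends only on reward differences, adding a constant to every reward leaves it unchanged, which is why a bias can be used afterwards to normalize the rewards. The K rewards are computed once and reused across all pairs, which is the source of the compute saving noted above.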


  • To evaluate how “aligned” our models are, we first need to clarify what alignment means in this context
    • The definition of alignment has historically been a vague and confusing topic
  • To summarize, we can divide our quantitative evaluations into two separate parts:
    • Evaluations on API distribution
      • Our main metric is human preference ratings on a held out set of prompts from the same source as our training distribution
      • Among prompts not used in training, both prompts submitted to InstructGPT models and prompts submitted to GPT-3 models are evaluated
    • Evaluations on public NLP datasets
      • evaluate on two types of public datasets:
        • capture an aspect of language model safety, particularly truthfulness, toxicity, and bias (evaluates toxicity, bias, truthfulness, and so on)
        • capture zero-shot performance on traditional NLP tasks like question answering, reading comprehension, and summarization (evaluates whether existing downstream tasks are still handled well)
      • Samples for the related NLP tasks are released in the repo


  • sorted into three parts:
    • results on the API prompt distribution,
    • results on public NLP datasets, and
    • qualitative results

Results on the API distribution

  • Labelers significantly prefer InstructGPT outputs over outputs from GPT-3
  • found that our results do not change significantly when evaluated on prompts submitted to GPT-3 models on the API (see Figure 3)
  • In Figure 4 we show that labelers also rate InstructGPT outputs favorably along several more concrete axes
  • Our models generalize to the preferences of “held-out” labelers that did not produce any training data.
    • Held-out labelers have similar ranking preferences as workers who we used to produce training data (see Figure 3)
    • ran an experiment where we split our labelers into 5 groups, and train 5 RMs (with 3 different seeds) using 5-fold cross validation (training on 4 of the groups, and evaluating on the held-out group)
  • Public NLP datasets are not reflective of how our language models are used.
    • these datasets are not sufficiently diverse to improve performance on our API prompt distribution. (that is, even fine-tuning on FLAN or T0 is not enough)
    • We believe our InstructGPT model outperforms FLAN and T0 for two reasons.
      • First, public NLP datasets are designed to capture tasks that are easy to evaluate with automatic metrics, such as classification, question answering, and to a certain extent summarization and translation. However, classification and QA are only a small part (about 18%) whereas open-ended generation and brainstorming consist of about 57% of our prompt dataset according to labelers (see Table 1)
      • Second, it can be difficult for public NLP datasets to obtain a very high diversity of inputs (at least, on the kinds of inputs that real-world users would be interested in using)
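        • In the end, high-quality prompts written by real-world users are remarkably powerful -> SFT was done on them, so naturally the data is better -> HyperCLOVA presumably has quite a lot of such data? -> perhaps at least an internal open beta is needed (collecting high-quality prompts turns out to be extremely important)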

Results on public NLP datasets

  • InstructGPT models show improvements in truthfulness over GPT-3
    • on the TruthfulQA dataset, our PPO models show small but significant improvements in generating truthful and informative outputs compared to GPT-3 (see Figure 6)
    • we also give a helpful “Instruction+QA” prompt that instructs the model to respond with “I have no comment” when it is not certain of the correct answer. In this case, our PPO models err on the side of being truthful and uninformative rather than confidently saying a falsehood; the baseline GPT-3 model is not as good at this
      • With a little guidance, the PPO family does well
  • InstructGPT shows small improvements in toxicity over GPT-3, but not bias
    • Our results are in Figure 7. We find that, when instructed to produce a safe and respectful output (“respectful prompt”), InstructGPT models generate less toxic outputs than those from GPT-3 according to the Perspective API
    • This advantage disappears when the respectful prompt is removed (“no prompt”).
    • Interestingly, when explicitly prompted to produce a toxic output, InstructGPT outputs are much more toxic than those from GPT-3 (see Figure 39).
      • But when the prompt asks for toxic output, the result is even more toxic
  • We can minimize performance regressions on public NLP datasets by modifying our RLHF fine-tuning procedure.
    • By default, when we train a PPO model on our API distribution, it suffers from an “alignment tax”, as its performance on several public NLP datasets decreases
    • We want an alignment procedure that avoids an alignment tax, because it incentivizes the use of models that are unaligned but more capable on these tasks.
    • In Figure 29 we show that adding pretraining updates to our PPO fine-tuning (PPO-ptx) mitigates these performance regressions on all datasets, and even surpasses GPT-3 on HellaSwag
    • Mixing in pretraining updates performs better than the simpler solution of increasing the KL coefficient.
    • [Figures: effect of the pretraining loss coefficient; effect of the KL reward coefficient]
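For reference, the two coefficients discussed here appear in the combined RL objective. The form below is written from my recollection of the paper's PPO-ptx objective (Equation 2 there) and should be checked against the paper: β is the KL reward coefficient, γ is the pretraining loss coefficient, and setting γ = 0 gives the plain PPO models.

```latex
\text{objective}(\phi) =
  \mathbb{E}_{(x,y)\sim D_{\pi_\phi^{\mathrm{RL}}}}
  \left[ r_\theta(x,y) - \beta \log \frac{\pi_\phi^{\mathrm{RL}}(y \mid x)}{\pi^{\mathrm{SFT}}(y \mid x)} \right]
  + \gamma \, \mathbb{E}_{x \sim D_{\mathrm{pretrain}}}
  \left[ \log \pi_\phi^{\mathrm{RL}}(x) \right]
```

Here r_θ is the reward model, π^SFT is the supervised policy, and π_φ^RL is the policy being trained. The first term is the RM reward with the per-token KL penalty against the SFT model; the second term is the pretraining log likelihood that is mixed in to reduce the regressions on public NLP datasets.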

Qualitative results

  • InstructGPT models show promising generalization to instructions outside of the RLHF fine-tuning distribution.
    • find that InstructGPT shows ability to follow instructions in non-English languages, and perform summarization and question-answering for code
      • interesting because non-English languages and code form a tiny minority of our fine-tuning data
        • it suggests that, in some cases, alignment methods could generalize to producing the desired behavior on inputs that humans did not directly supervise (in other words, it generalizes!)
  • InstructGPT still makes simple mistakes.
  • To give a few examples:
    • (1) when given an instruction with a false premise, the model sometimes incorrectly assumes the premise is true,
    • (2) the model can overly hedge; when given a simple question, it can sometimes say that there is no one answer to the question and give multiple possible answers, even when there is one fairly clear answer from the context, and
    • (3) the model’s performance degrades when instructions contain multiple explicit constraints (e.g. “list 10 movies made in the 1930’s set in France”) or when constraints can be challenging for language models (e.g. writing a summary in a specified number of sentences).


Implications for alignment research

  • Our approach to alignment research in this work is iterative
    • The pretrained LM matters less than one might expect

Who are we aligning to?

  • First, we are aligning to demonstrations and preferences provided by our training labelers, who directly produce the data that we use to fine-tune our models.
  • Second, we are aligning to our preferences, as the researchers designing this study (and thus by proxy to our broader research organization, OpenAI): we write the labeling instructions that labelers use as a guide when writing demonstrations and choosing their preferred output, and we answer their questions about edge cases in a shared chat room
  • Third, our training data is determined by prompts sent by OpenAI customers to models on the OpenAI API Playground, and thus we are implicitly aligning to what customers think is valuable and, in some cases, what their end-users think is valuable to currently use the API for
  • Fourth, OpenAI’s customers are not representative of all potential or current users of language models—let alone of all individuals and groups impacted by language model use



  • The behavior of our InstructGPT models is determined in part by the human feedback obtained from our contractors
  • However, this group is clearly not representative of the full spectrum of people who will use and be affected by our deployed models. As a simple example, our labelers are primarily English-speaking and our data consists almost entirely of English instructions.
  • There are also many ways in which we could improve our data collection set-up. For instance, most comparisons are only labeled by 1 contractor for cost reasons. Having examples labeled multiple times could help identify areas where our contractors disagree, and thus where a single model is unlikely to align to all of them


  • Our models are neither fully aligned nor fully safe; they still generate toxic or biased outputs, make up facts, and generate sexual and violent content without explicit prompting.
  • When explicitly instructed to produce toxic or biased output, InstructGPT generates more toxic outputs than equivalently-sized GPT-3 models. We discuss potential mitigations in the following sections.

Open questions

  • While we mainly focus on RLHF, there are many other algorithms that could be used to train policies on our demonstration and comparison data to get even better results. For example, one could explore expert iteration (Anthony et al., 2017; Silver et al., 2017), or simpler behavior cloning methods that use a subset of the comparison data. One could also try constrained optimization approaches (Achiam et al., 2017) that maximize the score from a reward model conditioned on generating a small number of harmful behaviors.

Broader impacts

  • This work is motivated by our aim to increase the positive impact of large language models by training them to do what a given set of humans want them to do.
  • By default, language models optimize the next word prediction objective, which is only a proxy for what we want these models to do.
  • Our results indicate that our techniques hold promise for making language models more helpful, truthful, and harmless.
  • In the longer term, alignment failures could lead to more severe consequences, particularly if these models are deployed in safety-critical situations.
  • We expect that as model scaling continues, greater care has to be taken to ensure that they are aligned with human intentions (Bostrom, 2014).


  • Web interface
    • In Figure 12, we show screenshots of our labeling interface, that all of our labelers (and researchers) use to label data.

C Additional model details


Details of RM training


Details of the initialization models for RLHF and RLHF training


FLAN and T0 models


F Model Samples





Joosung Yoon
