(FLAN) Finetuned Language Models Are Zero-Shot Learners



  • Jason Wei∗, Maarten Bosma∗, Vincent Y. Zhao∗, Kelvin Guu∗, Adams Wei Yu, Brian Lester, Nan Du, Andrew M. Dai, and Quoc V. Le
    • Google Research


  • explores a simple method for improving the zero-shot learning abilities of language models.
  • instruction tuning—finetuning language models on a collection of datasets described via instructions—substantially improves zero-shot performance on unseen tasks.
  • We take a 137B parameter pretrained language model and instruction tune it on over 60 NLP datasets verbalized via natural language instruction templates. We evaluate this instruction-tuned model, which we call FLAN, on unseen task types.


  • We take a pretrained language model of 137B parameters and perform instruction tuning—finetuning the model on a mixture of more than 60 NLP datasets expressed via natural language instructions. We refer to this resulting model as FLAN, for Finetuned Language Net.
  • FLAN’s zero-shot also outperforms 175B-parameter GPT-3’s zero-shot on 20 of 25 datasets that we evaluate, and even outperforms GPT-3’s few-shot by a large margin on ANLI, RTE, BoolQ, AI2-ARC, OpenbookQA, and StoryCloze.



  • The motivation of instruction tuning is to improve the ability of language models to respond to NLP instructions. The idea is that by using supervision to teach an LM to perform tasks described via instructions, the LM will learn to follow instructions and do so even for unseen tasks.
  • To evaluate performance on unseen tasks, we group datasets into clusters by task type and hold out each task cluster for evaluation while instruction tuning on all remaining clusters.


  • We aggregate 62 text datasets that are publicly available on TensorFlow Datasets, including both language understanding and language generation tasks, into a single mixture.


  • For each dataset, ten unique templates are manually composed that describe the task in natural language instructions. While most of the ten templates describe the original task, to increase diversity, for each dataset we also include up to three templates that “turned the task around” (e.g., for sentiment classification we include templates asking to generate a movie review). We then instruction tune a pretrained language model on the mixture of all datasets, with examples in each dataset formatted via a randomly selected instruction template for that dataset.
    • So the template is chosen at random.
    • Is there any finding on which templates work better?
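The random template selection described above can be sketched roughly as follows. This is a minimal illustration, not the paper's code: the three templates, the `format_example` helper, and the example record are all made up for this note, and the last template is a "turned around" one that asks for the review given the label.

```python
import random

# Hypothetical (input_template, target_template) pairs for a sentiment dataset.
# The paper's actual templates differ; these only illustrate the idea.
TEMPLATES = [
    ("Review: {text}\nIs this review positive or negative?", "{label}"),
    ("What is the sentiment of the following review?\n{text}", "{label}"),
    # "Turned around" template: generate the review from the label.
    ("Write a {label} movie review.", "{text}"),
]


def format_example(example, templates, rng):
    """Render one example with a template drawn uniformly at random."""
    input_tmpl, target_tmpl = rng.choice(templates)
    return {
        "input": input_tmpl.format(**example),
        "target": target_tmpl.format(**example),
    }


rng = random.Random(0)
example = {"text": "A moving, beautifully shot film.", "label": "positive"}
formatted = format_example(example, TEMPLATES, rng)
print(formatted["input"])
print(formatted["target"])
```

Storing input and target templates as a pair keeps the turned-around case in the same code path as the ordinary classification templates.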



  • to evaluate zero-shot FLAN on c task clusters, we instruction tune c models, where each model holds out a different task cluster for evaluation.
    • In other words, to evaluate on c task clusters, each model is trained on every cluster except the one it will be evaluated on.
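The hold-out scheme can be sketched as a leave-one-cluster-out loop. The cluster map below is a small illustrative subset chosen for this note (the paper's clusters contain more datasets), and `leave_one_cluster_out` is a hypothetical helper name.

```python
# Illustrative cluster map; the real clusters are larger.
CLUSTERS = {
    "nli": ["anli", "rte", "cb"],
    "closed_book_qa": ["arc", "nq", "triviaqa"],
    "sentiment": ["imdb", "sst2"],
    "translation": ["wmt14_enfr", "wmt16_ende"],
}


def leave_one_cluster_out(clusters):
    """Yield (held_out_cluster, train_datasets, eval_datasets) for each cluster."""
    for held_out, eval_datasets in clusters.items():
        train = [
            dataset
            for name, datasets in clusters.items()
            if name != held_out
            for dataset in datasets
        ]
        yield held_out, train, list(eval_datasets)


splits = list(leave_one_cluster_out(CLUSTERS))
for held_out, train, evaluation in splits:
    print(held_out, len(train), evaluation)
```

Each split corresponds to one instruction-tuned model, so evaluating on c clusters means c separate finetuning runs.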


  • For classification tasks, prior work (Brown et al., 2020) used a rank classification approach where, for example, only two outputs (“yes” and “no”) are considered and the higher probability one is taken as the model’s prediction.
    • The issue is the distribution over answers: the approach looks logically sound but is imperfect.
      • Though this procedure is logically sound, it is imperfect in that the probability mass for answers may have an undesired distribution among ways of saying each answer.
    • Adding an OPTIONS suffix
      • Therefore, we include an options suffix, in which we append the token OPTIONS to the end of a classification task along with a list of the output classes for that task.
        • This makes the model aware of which choices are desired when responding to classification tasks.
        • So it is a trick to keep the distribution problem from producing odd answers. Plausible.
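A rough sketch of the two ideas above: rank classification picks the highest-scoring class, and the OPTIONS suffix lists the allowed classes in the prompt. The exact suffix layout, the helper names, and the `toy_log_prob` scorer are assumptions made for illustration; a real setup would score continuations with the language model.

```python
def add_options_suffix(prompt, classes):
    """Append an OPTIONS list naming the allowed output classes (layout assumed)."""
    options = "\n".join(f"- {c}" for c in classes)
    return f"{prompt}\nOPTIONS:\n{options}"


def rank_classify(prompt, classes, log_prob):
    """Return the class whose text the scorer rates as most likely given the prompt."""
    return max(classes, key=lambda c: log_prob(prompt, c))


def toy_log_prob(prompt, continuation):
    """Toy stand-in for a language model scorer; always prefers "yes"."""
    return 0.0 if continuation == "yes" else -1.0


classes = ["yes", "no"]
prompt = add_options_suffix("Does the premise entail the hypothesis?", classes)
prediction = rank_classify(prompt, classes, toy_log_prob)
print(prompt)
print(prediction)
```

Only the listed classes are ever compared, so probability mass spent on alternative phrasings of an answer does not change which class wins, although it can still distort the scores themselves.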


  • Model architecture and pretraining
    • use LaMDA-PT, a dense left-to-right, decoder-only transformer language model of 137B parameters
    • pretrained on a collection of web documents (including those with computer code), dialog data, and Wikipedia, tokenized into 2.49T BPE tokens with a 32k vocabulary using the SentencePiece library
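      • Is the explicit mention of computer code due to Codex's influence? Either way, it seems like a good choice.
      • About 2.5T tokens? A 175B model would need 3.7T and a 67B model 1.5T, so this looks slightly short.
    • Around 10% of the pretraining data was non-English. Note that LaMDA-PT only has language model pretraining (cf. LaMDA, which was finetuned for dialog).
      • LaMDA-PT is just the LM, and LaMDA is the dialog model. If about 10% is in other languages, maybe we should include some too.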
  • Instruction tuning procedure
    • FLAN is the instruction-tuned version of LaMDA-PT, so FLAN itself was already built on a very strong backbone model.
    • Our instruction tuning pipeline mixes all datasets and randomly samples from each dataset.
    • To balance the different sizes of datasets, we limit the number of training examples per dataset to 30k and follow the examples-proportional mixing scheme (Raffel et al., 2020) with a mixing rate maximum of 3k.
      • Q) I was curious about this: it is not only capping each dataset at 30k examples, it also caps the sampling weight. Let's look.
        • In this mixing scheme, a mixing rate maximum of 3,000 means that a dataset does not receive additional sampling weight for examples in excess of 3,000.
          • I don't fully understand this. Does it involve duplicates? Or does it simply mean the weights are adjusted appropriately?
    • hparams
      • finetune all models for 30k gradient steps with a batch size of 8,192 tokens using the Adafactor Optimizer (Shazeer & Stern, 2018) with a learning rate of 3e-5
        • Q) The batch size is given in tokens. Is the instruction-reformatted dataset tuned the same way as plain LM training?
          • The input and target sequence lengths used in finetuning are 1024 and 256, respectively.
          • Since examples are split into input and target, not exactly. It seems to be generation-based, but with the two parts kept separate.
      • We use packing (Raffel et al., 2020) to combine multiple training examples into a single sequence, separating inputs from targets using a special EOS token.
        • This is what separates input from target. Worth remembering.
      • This instruction tuning takes around 60 hours on a TPUv3 with 128 cores.
        • 60 hours is quite a while.
      • Results are reported on the final checkpoint, trained for 30k steps.
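On the mixing question raised above: in the examples-proportional scheme of Raffel et al. (2020), dataset m is sampled with rate r_m = min(e_m, K) / Σ_n min(e_n, K), where e_m is the dataset size and K is the mixing rate maximum. The sketch below applies that formula with the 30k example cap and K = 3,000 from the notes. The function name and the toy dataset sizes are made up for illustration.

```python
def mixing_rates(dataset_sizes, max_examples=30_000, rate_cap=3_000):
    """Examples-proportional sampling rates with a cap on per-dataset weight."""
    # First limit how many examples each dataset contributes at all.
    capped_sizes = {name: min(size, max_examples) for name, size in dataset_sizes.items()}
    # Then cap the sampling weight: examples beyond rate_cap add no extra weight.
    weights = {name: min(size, rate_cap) for name, size in capped_sizes.items()}
    total = sum(weights.values())
    return {name: weight / total for name, weight in weights.items()}


# Toy dataset sizes (hypothetical).
sizes = {"small": 1_000, "medium": 3_000, "large": 500_000}
rates = mixing_rates(sizes)
for name, rate in rates.items():
    print(f"{name}: {rate:.3f}")
```

Under this reading, the 30k limit bounds how many distinct examples a dataset contributes, while the 3k cap only bounds how often the dataset is sampled. A 3,000-example dataset and a 500,000-example dataset therefore get the same sampling rate.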


  • Overall, we observe that instruction tuning is very effective on tasks naturally verbalized as instructions (e.g., NLI, QA, translation, struct-to-text) and is less effective on tasks directly formulated as language modeling, where instructions would be largely redundant (e.g., commonsense reasoning and coreference resolution tasks that are formatted as finishing an incomplete sentence or paragraph).
    • So instruction tuning does not help much where the instructions are redundant.
  • Looking at the results, models like BERT and T5 still perform well.



  • This analysis is good.
  • we examine how performance is affected by the number of clusters and tasks used in instruction tuning
  • Evaluation
    • we hold out NLI, closed-book QA, and commonsense reasoning as evaluation clusters
  • Results
    • We show results for one to seven instruction tuning clusters, where clusters are added in decreasing order of number of tasks per cluster.
    • implying that performance may further improve with even more clusters added to instruction tuning
      • This seems somewhat different from the T0 results. Need to check.


  • The interpretation given: small models have low capacity, so instruction tuning uses up all of it on learning the tuning tasks, and performance on unseen tasks ends up lower.



  • Checking whether the model's ability comes from the instructions or was already there
    • we explore the role of instructions during finetuning, as one possibility is that performance gains come entirely from multi-task finetuning and the model could perform just as well without instructions
    • We hence consider two finetuning setups without instructions.
      • [1] In a no template setup, only inputs and outputs were given to the model (e.g., for translation the input would be “The dog runs.” and the output would be “Le chien court.”).
      • [2] In a dataset name setup, each input is prepended with the name of the task and dataset (e.g., for translation to French, the input would be “[Translation: WMT’14 to French] The dog runs.”).
    • compare these two ablations to FLAN’s finetuning procedure, which used natural instructions (e.g., “Please translate this sentence to French: ‘The dog runs.’”)
  • FLAN performs best, but prepending the dataset name also does better than I expected.
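The three finetuning input formats compared in this ablation can be written out as small formatting functions. The function names are made up for this note, and the strings follow the translation examples quoted above (with straight quotes for simplicity).

```python
def no_template(text):
    """No template setup: the raw input only."""
    return text


def dataset_name(text, task, dataset):
    """Dataset name setup: prepend the task and dataset name."""
    return f"[{task}: {dataset}] {text}"


def natural_instruction(text):
    """FLAN-style setup: a natural language instruction (one example phrasing)."""
    return f"Please translate this sentence to French: '{text}'"


sentence = "The dog runs."
print(no_template(sentence))
print(dataset_name(sentence, "Translation", "WMT'14 to French"))
print(natural_instruction(sentence))
```

In all three setups the target stays the same ("Le chien court."), so the ablation isolates how the task is described to the model.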



  • Adding few-shot exemplars does improve results a bit further.


  • Prompt tuning results are also better. Is that because the model has already seen the task once? Reading on, no, so it does seem to be genuinely good.



  • Our instruction-tuned model, FLAN, improves performance against an untuned model and surpasses zero-shot GPT-3 on the majority of tasks that we evaluate on.
  • Ablation studies reveal that performance on unseen tasks improves with the number of instruction tuning task clusters, and, interestingly, that performance improvements from instruction tuning emerge only with sufficient model scale.
  • Moreover, instruction tuning can be combined with other prompting methods such as few-shot prompting and prompt tuning.


  • This paper has explored a simple method for improving the ability of language models at scale to perform zero-shot tasks based purely on instructions. Our instruction-tuned model, FLAN, compares favorably against GPT-3 and signals the potential ability for language models at scale to follow instructions.



  • Sampling hparams for multiple FLAN outputs
    • Multiple FLAN outputs are generated via random sampling with a temperature of 0.9 and top k of 40.
  • examples
    • The prompts typically end with one of several commands(?) such as change, recommend, generate, suggest, make up, or answer in Language, so the task can be solved in different ways.
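The sampling setup mentioned above (temperature 0.9, top k of 40) can be illustrated with a small pure-Python sketch. This is not the paper's decoding code: `sample_top_k` and the toy logits are assumptions made for this note, and a real decoder would apply this step per generated token.

```python
import math
import random


def sample_top_k(logits, k=40, temperature=0.9, rng=random):
    """Sample a token id from the k highest logits after temperature scaling."""
    # Keep only the indices of the k largest logits.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    scaled = [logits[i] / temperature for i in top]
    peak = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(s - peak) for s in scaled]
    return rng.choices(top, weights=weights, k=1)[0]


rng = random.Random(0)
logits = [2.0, 1.0, 0.5, -1.0, -3.0]  # toy vocabulary of five tokens
samples = [sample_top_k(logits, k=2, temperature=0.9, rng=rng) for _ in range(100)]
print(sorted(set(samples)))
```

With `k=2` only the two highest-scoring tokens can ever be drawn. A temperature below 1 sharpens the distribution toward the top token.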

Joosung Yoon
