LLaMA (Open and Efficient Foundation Language Models)

Note

  • LLaMA, the "small ball" Meta launched that made a big splash
  • Quite well-made models, unlike the earlier OPT; rumor has it a lot of experimentation went into them!
  • Released publicly

Author

  • Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M. A., Lacroix, T., … & Lample, G. (2023). Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971.
    • Meta AI

Summary

Abstract

  • LLaMA-13B outperforms GPT-3 (175B) on most benchmarks, and LLaMA-65B is competitive with the best models, Chinchilla-70B and PaLM-540B

Introduction

  • recent work from Hoffmann et al. (2022) (the Chinchilla paper, DeepMind) shows that, for a given compute budget, the best performances are not achieved by the largest models, but by smaller models trained on more data.
  • For instance, although Hoffmann et al. (2022) recommends training a 10B model on 200B tokens, we find that the performance of a 7B model continues to improve even after 1T tokens.
    • Performance tends to keep improving as the number of training tokens grows.
  • Unlike Chinchilla, PaLM, or GPT-3, we only use publicly available data, making our work compatible with open-sourcing, while most existing models rely on data which is either not publicly available or undocumented (e.g. “Books – 2TB” or “Social media conversations”).
  • In the rest of this paper, we present an overview of the modifications we made to the transformer architecture (Vaswani et al., 2017), as well as our training method.

Approach

Pre-training Data

  • Our training dataset is a mixture of several sources, reported in Table 1
    image

English CommonCrawl [67%]

  • preprocess five CommonCrawl dumps, ranging from 2017 to 2020, with the CCNet pipeline
  • This process deduplicates the data at the line level, performs language identification with a fastText linear classifier to remove non-English pages and filters low quality content with an n-gram language model
  • we trained a linear model to classify pages used as references in Wikipedia vs. randomly sampled pages
    • discarded pages not classified as references.
      • Does this mean they kept pages with a Wikipedia-reference feel? Not entirely clear. (A rough sketch of this kind of classifier follows below.)
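A rough illustration of this kind of quality classifier (not the authors' actual pipeline; the file name, labels, and threshold are assumptions): train a fastText model on Wikipedia-referenced pages vs. randomly sampled CommonCrawl pages, then discard pages that don't look reference-like.

```python
import fasttext  # pip install fasttext

# train.txt holds lines such as "__label__reference <page text>" and
# "__label__random <page text>", built from Wikipedia-referenced pages
# and randomly sampled CommonCrawl pages.
model = fasttext.train_supervised("train.txt", epoch=5, wordNgrams=2)

def keep_page(text: str, threshold: float = 0.5) -> bool:
    # Keep only pages classified as "reference-like" with enough confidence.
    labels, probs = model.predict(text.replace("\n", " "))
    return labels[0] == "__label__reference" and probs[0] >= threshold
```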

C4 [15%]

  • During exploratory experiments, we observed that using diverse pre-processed CommonCrawl datasets improves performance.
    • In their experiments, adding C4 improved performance.
  • The preprocessing of C4 also contains deduplication and language identification steps:
    • the main difference with CCNet is the quality filtering, which mostly relies on heuristics such as presence of punctuation marks or the number of words and sentences in a webpage.
      • Quality is judged heuristically, based on things like word count, sentence count, and the presence of punctuation marks.
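A minimal sketch of what heuristic quality filtering of this kind can look like; the specific thresholds and rules below are illustrative, not necessarily C4's exact ones.

```python
import re

def passes_heuristics(page: str) -> bool:
    # Keep only lines that end in terminal punctuation and have enough words.
    lines = [l for l in page.splitlines()
             if l.strip().endswith((".", "!", "?", '"')) and len(l.split()) >= 5]
    # Require a minimum amount of surviving text and no obvious junk markers.
    if len(lines) < 3:
        return False
    text = " ".join(lines)
    if re.search(r"lorem ipsum|\{", text, flags=re.IGNORECASE):
        return False
    return True
```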

Github [4.5%]

  • We use the public GitHub dataset available on Google BigQuery
  • We only kept projects that are distributed under the Apache, BSD and MIT licenses
    • Filtered by license.
  • Additionally, we filtered low quality files with heuristics based on the line length or proportion of alphanumeric characters, and removed boilerplate, such as headers, with regular expressions.
    • i.e., files with odd character ratios or overly long lines are dropped, and boilerplate such as headers is identified with regular expressions and stripped out.
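A hedged sketch of this kind of file-level filtering; the thresholds and the header pattern are illustrative assumptions, not the exact rules used.

```python
import re

# Example pattern for comment-style license/header boilerplate at the top of a file.
HEADER_RE = re.compile(r"\A(?:\s*(?:#|//).*\n)+")

def keep_code_file(text: str, max_line_len: int = 1000, min_alnum_frac: float = 0.25) -> bool:
    # Drop files with extremely long lines or too few alphanumeric characters.
    if any(len(line) > max_line_len for line in text.splitlines()):
        return False
    alnum_frac = sum(c.isalnum() for c in text) / max(1, len(text))
    return alnum_frac >= min_alnum_frac

def strip_header(text: str) -> str:
    # Remove the boilerplate header with a regular expression.
    return HEADER_RE.sub("", text, count=1)
```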

Wikipedia [4.5%]

  • We add Wikipedia dumps from the June-August 2022 period, covering 20 languages, which use either the Latin or Cyrillic scripts: bg, ca, cs, da, de, en, es, fr, hr, hu, it, nl, pl, pt, ro, ru, sl, sr, sv, uk.
  • We process the data to remove hyperlinks, comments and other formatting boilerplate. (links and comments removed)

Gutenberg and Books3 [4.5%]

  • We perform deduplication at the book level, removing books with more than 90% content overlap.
    • Books with more than 90% content overlap are removed (how did they measure this at the book level, given the volume?)

ArXiv [2.5%]

  • We process arXiv LaTeX files to add scientific data to our dataset.
  • Following Lewkowycz et al. (2022), we removed everything before the first section, as well as the bibliography. We also removed the comments from the .tex files, and inline-expanded definitions and macros written by users to increase consistency across papers.

Stack Exchange [2%]

  • a website of high quality questions and answers that covers a diverse set of domains, ranging from computer science to chemistry.
  • We kept the data from the 28 largest websites, removed the HTML tags from text and sorted the answers by score

Tokenizer

  • byte-pair encoding (BPE) algorithm
    • using the implementation from SentencePiece (Kudo and Richardson, 2018)
  • we split all numbers into individual digits, and fallback to bytes to decompose unknown UTF-8 characters
  • Overall, our entire training dataset contains roughly 1.4T tokens after tokenization.
  • For most of our training data, each token is used only once during training, with the exception of the Wikipedia and Books domains, over which we perform approximately two epochs.
    • Most data is seen roughly once; Wikipedia and the book corpora are seen about twice.
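A minimal sketch of how such a tokenizer could be trained with SentencePiece; the corpus path is a placeholder, and the 32K vocabulary matches LLaMA's released tokenizer.

```python
import sentencepiece as spm

# Train a BPE tokenizer that splits numbers into digits and falls back to
# bytes for unknown UTF-8 characters, as described above.
spm.SentencePieceTrainer.train(
    input="corpus.txt",          # placeholder path
    model_prefix="llama_like",
    model_type="bpe",
    vocab_size=32000,            # LLaMA uses a 32K vocabulary
    split_digits=True,           # "123" -> "1", "2", "3"
    byte_fallback=True,          # decompose unknown characters into bytes
)

sp = spm.SentencePieceProcessor(model_file="llama_like.model")
print(sp.encode("LLaMA was trained on 1.4T tokens", out_type=str))
```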

Architecture

  • We leverage various improvements that were subsequently proposed, and used in different models such as PaLM. Here are the main differences from the original architecture, and where the inspiration for each change came from (in brackets):

Pre-normalization [GPT3]

  • To improve the training stability, we normalize the input of each transformer sub-layer, instead of normalizing the output
  • We use the RMSNorm normalizing function, introduced by Zhang and Sennrich (2019)
    • RMSNorm was also used on the T5 side when training with apex.
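For reference, a minimal PyTorch sketch of RMSNorm (the eps value is an illustrative choice):

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Unlike LayerNorm, only rescale by the root mean square:
        # no mean subtraction and no bias term.
        rms = torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps)
        return self.weight * (x * rms)
```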

SwiGLU activation function [PaLM]

  • We replace the ReLU non-linearity by the SwiGLU activation function, introduced by Shazeer (2020)
  • We use a dimension of 2/3 * 4d instead of 4d as in PaLM (worth checking the code for this later; see the sketch below)
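A sketch of what the SwiGLU feed-forward block looks like with the 2/3 · 4d hidden dimension mentioned above; the layer names are illustrative, not necessarily the ones in the released code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUFeedForward(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        hidden_dim = int(2 * (4 * dim) / 3)  # 2/3 * 4d instead of 4d
        self.gate_proj = nn.Linear(dim, hidden_dim, bias=False)
        self.up_proj = nn.Linear(dim, hidden_dim, bias=False)
        self.down_proj = nn.Linear(hidden_dim, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # SwiGLU: silu(x W_gate) elementwise-multiplied by x W_up, projected back down.
        return self.down_proj(F.silu(self.gate_proj(x)) * self.up_proj(x))
```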

Rotary Embeddings [GPTNeo]

  • We remove the absolute positional embeddings, and instead, add rotary positional embeddings (RoPE)
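A minimal sketch of rotary position embeddings applied to a query or key tensor; this is the interleaved-pair formulation, written for clarity rather than to mirror the released implementation.

```python
import torch

def apply_rope(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    # x: (..., seq_len, head_dim). Rotate each (even, odd) pair of features
    # by an angle that depends on the position and the feature index.
    seq_len, dim = x.shape[-2], x.shape[-1]
    inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2, dtype=torch.float32) / dim))
    angles = torch.arange(seq_len, dtype=torch.float32)[:, None] * inv_freq[None, :]
    cos, sin = angles.cos().to(x.dtype), angles.sin().to(x.dtype)
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out
```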

Optimizer

  • Our models are trained using the AdamW optimizer (Loshchilov and Hutter, 2017), with the following hyper-parameters: β1 = 0.9, β2 = 0.95.
  • We use a cosine learning rate schedule, such that the final learning rate is equal to 10% of the maximal learning rate
  • weight decay of 0.1 and gradient clipping of 1.0
    • The clipping value seems fairly high?! 0.3 seems to be more common these days.
  • use 2,000 warmup steps, and vary the learning rate and batch size with the size of the model (see Table 2 for details).
    image
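A minimal PyTorch sketch of this setup (AdamW with β1 = 0.9, β2 = 0.95, weight decay 0.1, 2,000 warmup steps, cosine decay to 10% of the peak LR); the peak learning rate and total step count below are placeholders, not the per-model values from Table 2.

```python
import math
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import LambdaLR

def build_optimizer(model, max_lr=1.5e-4, total_steps=350_000, warmup_steps=2_000):
    opt = AdamW(model.parameters(), lr=max_lr, betas=(0.9, 0.95), weight_decay=0.1)

    def lr_lambda(step):
        if step < warmup_steps:
            return step / max(1, warmup_steps)                     # linear warmup
        progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
        return 0.1 + 0.45 * (1 + math.cos(math.pi * progress))     # cosine decay to 10% of max

    return opt, LambdaLR(opt, lr_lambda)

# In the training loop, gradients are clipped to 1.0 before each step:
# torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
```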

Efficient Implementation

  • First, we use an efficient implementation of the causal multi-head attention to reduce memory usage and runtime. This implementation, available in the xformers library
    • xformers library is inspired by Rabe and Staats (2021, "Self-attention Does Not Need O(n²) Memory") and uses the backward pass from Dao et al. (2022, "FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness")
    • This is achieved by not storing the attention weights and not computing the key/query scores that are masked due to the causal nature of the language modeling task.
      • In other words, scores that would be masked out are simply never computed.
        • image
  • To further improve training efficiency, we reduced the amount of activations that are recomputed during the backward pass with checkpointing
    • we save the activations that are expensive to compute, such as the outputs of linear layers
    • This is achieved by manually implementing the backward function for the transformer layers, instead of relying on the PyTorch autograd. (done by hand!)
    • reduce the memory usage of the model by using model and sequence parallelism, as described by Korthikanti et al. (2022)
    • we also overlap the computation of activations and the communication between GPUs over the network (due to all_reduce operations) as much as possible. (i.e., GPU communication is overlapped with computation wherever possible)
  • When training a 65B-parameter model, our code processes around 380 tokens/sec/GPU on 2048 A100 GPUs with 80GB of RAM. This means that training over our dataset containing 1.4T tokens takes approximately 21 days.
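A quick back-of-the-envelope check of the 21-day figure (my own arithmetic, not from the paper):

$$
\frac{1.4\times10^{12}\ \text{tokens}}{380\ \text{tokens/s/GPU} \times 2048\ \text{GPUs} \times 86400\ \text{s/day}} \approx \frac{1.4\times10^{12}}{6.7\times10^{10}\ \text{tokens/day}} \approx 21\ \text{days}
$$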

Main results

  • zero-shot and few-shot (1 to 64 examples) tasks, reporting results on a total of 20 benchmarks:
    • Zero-shot: provides an answer using open-ended generation, or ranks the proposed answers.
    • Few-shot: a few examples of the task (between 1 and 64) plus a test example; the model takes this text as input and generates the answer or ranks different options.
  • We evaluate LLaMA on free-form generation tasks and multiple choice tasks
  • We follow Gao et al. (2021) and use the likelihood normalized by the number of characters in the completion, except for certain datasets (OpenBookQA, BoolQ), for which we follow Brown et al. (2020), and select a completion based on the likelihood normalized by the likelihood of the completion given “Answer:” as context: P(completion | context) / P(completion | “Answer:”)
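A hedged sketch of this scoring scheme; `logprob(context, completion)` below is a hypothetical helper returning the model's log-probability of the completion given the context.

```python
def pick_by_char_normalized_likelihood(context, completions, logprob):
    # Default scoring: log-likelihood divided by the completion's character count.
    return max(completions, key=lambda c: logprob(context, c) / len(c))

def pick_by_answer_normalized_likelihood(context, completions, logprob):
    # OpenBookQA/BoolQ-style scoring: P(completion|context) / P(completion|"Answer:"),
    # which in log space is a difference of log-probabilities.
    return max(completions, key=lambda c: logprob(context, c) - logprob("Answer:", c))
```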

Common Sense Reasoning

  • consider eight standard common sense reasoning benchmarks: BoolQ (Clark et al., 2019), PIQA (Bisk et al., 2020), SIQA (Sap et al., 2019), HellaSwag (Zellers et al., 2019), WinoGrande (Sakaguchi et al., 2021), ARC easy and challenge (Clark et al., 2018) and OpenBookQA (Mihaylov et al., 2018).
  • These datasets include Cloze and Winograd style tasks, as well as multiple choice question answering
  • We evaluate in the zero-shot setting as done in the language modeling community
    image
  • First, LLaMA-65B outperforms Chinchilla-70B on all reported benchmarks but BoolQ.
  • Similarly, this model surpasses PaLM-540B everywhere but on BoolQ and WinoGrande
  • LLaMA-13B model also outperforms GPT-3 on most benchmarks despite being 10× smaller.

Closed-book Question Answering

  • compare LLaMA to existing large language models on two closed-book question answering benchmarks: Natural Questions (Kwiatkowski et al., 2019) and TriviaQA (Joshi et al., 2017)
  • LLaMA-65B achieves state-of-the-art performance in the zero-shot and few-shot settings
  • More importantly, the LLaMA-13B is also competitive on these benchmarks with GPT-3 and Chinchilla, despite being 5-10× smaller. This model runs on a single V100 GPU during inference
Table 4 Table 5
image image

Reading Comprehension

  • RACE reading comprehension benchmark
    • English reading comprehension exams designed for middle and high school Chinese students.

image

Mathematical reasoning

  • MATH is a dataset of 12K middle school and high school mathematics problems written in LaTeX
  • GSM8k is a set of middle school mathematical problems
  • Minerva is a series of PaLM models finetuned on 38.5B tokens extracted from ArXiv and Math Web Pages, while neither PaLM nor LLaMA is finetuned on mathematical data.
  • we compare with and without maj1@k. maj1@k denotes evaluations where we generate k samples for each problem and perform a majority voting (Wang et al., 2022)
  • LLaMA-65B outperforms Minerva-62B, although it has not been fine-tuned on mathematical data.
  • This level of performance is genuinely quite good.
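A minimal sketch of maj1@k-style majority voting; `generate_answer` below is a hypothetical sampling function that returns a final answer string.

```python
from collections import Counter

def maj1_at_k(problem: str, generate_answer, k: int = 16) -> str:
    # Sample k candidate solutions and return the most common final answer.
    samples = [generate_answer(problem) for _ in range(k)]
    return Counter(samples).most_common(1)[0][0]
```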

image

Code generation

  • model receives a description of the program in a few sentences, as well as a few input-output examples
  • In HumanEval, it also receives a function signature, and the prompt is formatted as natural code with the textual description and tests in a docstring
    • The model needs to generate a Python program that fits the description and satisfies the test cases.
  • we compare the pass@1 scores of our models with existing language models that have not been finetuned on code, namely PaLM and LaMDA (Thoppilan et al., 2022). PaLM and LLaMA were trained on datasets that contain a similar number of code tokens.
  • LLaMA outperforms other general models such as LaMDA and PaLM, which are not trained or finetuned specifically for code.
  • Further improvement is possible with additional finetuning:
    • PaLM-Coder (Chowdhery et al., 2022) increases the pass@1 score of PaLM on HumanEval from 26.2% for PaLM to 36%
  • For the metric, see the paper (Evaluating Large Language Models Trained on Code)
    • k refers to the number of code samples generated per problem (pass@k metric, where k code samples are generated per problem); see the sketch below
    • image
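For reference, the unbiased pass@k estimator defined in that paper (Chen et al., 2021), where n samples are generated per problem and c of them pass the unit tests:

```python
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    # Probability that at least one of k randomly chosen samples (out of n,
    # c of which are correct) passes: 1 - C(n-c, k) / C(n, k).
    if n - c < k:
        return 1.0
    return 1.0 - math.prod((n - c - i) / (n - i) for i in range(k))
```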

image

Massive Multitask Language Understanding (MMLU)

  • The massive multitask language understanding benchmark, or MMLU
    • consists of multiple choice questions covering various domains of knowledge, including humanities, STEM and social sciences (exam-style questions, like school tests)
  • evaluate our models in the 5-shot setting, using the examples provided by the benchmark
  • LLaMA-65B lags slightly behind Chinchilla-70B and PaLM-540B on average on MMLU. A potential explanation is that we have used a limited amount of books and academic papers in our pre-training data, i.e., ArXiv, Gutenberg and Books3, which sums up to only 177GB, while these models (Chinchilla-70B and PaLM-540B) were trained on up to 2TB of books

image

Evolution of performance during training

  • During training, we tracked the performance of our models on a few question answering and common sense benchmarks
    image
  • the performance improves steadily, and correlates with the training perplexity of the model (see Figure 1).
  • Interpretation
    • on SIQA, we observe a lot of variance in performance, that may indicate that this benchmark is not reliable
    • On WinoGrande, the performance does not correlate as well with training perplexity: the LLaMA-33B and LLaMA-65B have similar performance during the training.

image

Instruction Finetuning

  • show that briefly finetuning on instruction data rapidly leads to improvements on MMLU
  • Since this is not the focus of this paper, we only conducted a single experiment following the same protocol as Chung et al. (2022) to train an instruct model, LLaMA-I (instruction tuning is not the goal of the paper, so they only look at it briefly)
    image

image

Bias, Toxicity and Misinformation

  • Large language models have been shown to reproduce and amplify biases that exist in the training data and to generate toxic or offensive content
  • we evaluate on different benchmarks that measure toxic content production and stereotype detection

RealToxicityPrompts

  • RealToxicityPrompts consists of about 100k prompts that the model must complete
  • toxicity score is automatically evaluated by making a request to PerspectiveAPI
  • The score per prompt ranges from 0 (non-toxic) to 1 (toxic)
  • These scores are “comparable” with what we observe in the literature (e.g., 0.087 for Chinchilla), but the methodologies differ between those works and ours
    image

CrowS-Pairs

  • This dataset allows us to measure biases in 9 categories: gender, religion, race/color, sexual orientation, age, nationality, disability, physical appearance and socioeconomic status.
  • Each example is composed of a stereotype and an anti-stereotype, we measure the model preference for the stereotypical sentence using the perplexity of both sentences in a zero-shot setting
  • LLaMA's average score looks favorable, but in terms of the number of categories where it leads, GPT-3 arguably looks best
  • This benchmark set was also used in Pythia
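A hedged sketch of how this perplexity comparison can be computed; `perplexity` below is a hypothetical helper around the language model.

```python
def prefers_stereotype(stereo: str, anti: str, perplexity) -> bool:
    # The model "prefers" the stereotypical sentence if it assigns it lower perplexity.
    return perplexity(stereo) < perplexity(anti)

def bias_score(pairs, perplexity) -> float:
    # Percentage of pairs where the stereotype is preferred; 50% means no systematic preference.
    return 100.0 * sum(prefers_stereotype(s, a, perplexity) for s, a in pairs) / len(pairs)
```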

image

WinoGender

  • WinoGender benchmark (Rudinger et al., 2018), a co-reference resolution dataset. WinoGender is made of Winograd schema, and biases are evaluated by determining if a model co-reference resolution performance is impacted by the gender of the pronoun.
  • Gender bias toward pronouns is measured as a coreference problem; this benchmark set was also used in Pythia
  • More precisely, each sentence has three mentions: an “occupation”, a “participant”, and a “pronoun” where the pronoun is co-referencing either the occupation or participant
  • For example: “The nurse notified the patient that his shift would be ending in an hour.”, followed by “‘His’ refers to”. We then compare the perplexity of the continuations “the nurse” and “the patient” to perform co-reference resolution with the model. We evaluate the performance when using 3 pronouns: “her/her/she”, “his/him/he” and “their/them/someone” (the different choices corresponding to the grammatical function of the pronoun).
    image

TruthfulQA

  • measure the truthfulness of a model, i.e., its ability to identify when a claim is true.
  • the definition of “true” in the sense of “literal truth about the real world”
  • This benchmark can evaluate the risks of a model to generate misinformation or false claims. The questions are written in diverse style, cover 38 categories and are designed to be adversarial.
  • Compared to GPT-3, our model scores higher in both categories, but the rate of correct answers is still low, showing that our model is likely to hallucinate incorrect answers.
  • A benchmark measuring whether the model can answer common-misconception questions well (e.g., the classic “what happens if you swallow watermelon seeds?”)
    image

Carbon footprint

  • The training of our models has consumed a massive quantity of energy, responsible for the emission of carbon dioxide
  • Wh = GPU-h × (GPU power consumption) × PUE, where PUE = 1.1
  • tCO2eq = MWh × 0.385
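Plugging in the 65B run described earlier (2048 GPUs for roughly 21 days) with an assumed 400W per-GPU draw gives a rough estimate (my own arithmetic, not the paper's reported numbers):

$$
\text{MWh} \approx \frac{2048 \times 21 \times 24\ \text{GPU-h} \times 400\ \text{W} \times 1.1}{10^{6}} \approx 454,
\qquad
\text{tCO}_2\text{eq} \approx 454 \times 0.385 \approx 175
$$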

image

Conclusion

  • presented a series of language models that are released openly, and competitive with state-of-the-art foundation models
  • LLaMA-13B outperforms GPT-3 while being more than 10× smaller, and LLaMA-65B is competitive with Chinchilla-70B and PaLM-540B
  • Unlike previous studies, we show that it is possible to achieve state-of-the-art performance by training exclusively on publicly available data (this is the key point!)
  • Finally, we plan to release larger models trained on larger pretraining corpora in the future

Appendix

image
image
image
