LLaMA (Open and Efficient Foundation Language Models)

Note

  • LLaMA, the "small ball" Meta launched that made a big splash
  • Quite well-made models, unlike the earlier OPT; rumor has it a lot of experimentation went into them!
  • Released publicly

Author

  • Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M. A., Lacroix, T., … & Lample, G. (2023). Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971.
    • Meta AI

Summary

Abstract

  • LLaMA-13B outperforms GPT-3 (175B) on most benchmarks, and LLaMA-65B is competitive with the best models, Chinchilla-70B and PaLM-540B

Introduction

  • recent work from Hoffmann et al. (2022) (the Chinchilla paper, DeepMind) shows that, for a given compute budget, the best performances are not achieved by the largest models, but by smaller models trained on more data.
  • For instance, although Hoffmann et al. (2022) recommends training a 10B model on 200B tokens, we find that the performance of a 7B model continues to improve even after 1T tokens.
    • Performance tends to keep improving as the number of training tokens grows.
  • Unlike Chinchilla, PaLM, or GPT-3, we only use publicly available data, making our work compatible with open-sourcing, while most existing models rely on data which is either not publicly available or undocumented (e.g. “Books – 2TB” or “Social media conversations”).
  • In the rest of this paper, we present an overview of the modifications we made to the transformer architecture (Vaswani et al., 2017), as well as our training method.

Approach

Pre-training Data

  • Our training dataset is a mixture of several sources, reported in Table 1
    image

English CommonCrawl [67%]

  • preprocess five CommonCrawl dumps, ranging from 2017 to 2020, with the CCNet pipeline
  • This process deduplicates the data at the line level, performs language identification with a fastText linear classifier to remove non-English pages and filters low quality content with an n-gram language model
  • we trained a linear model to classify pages used as references in Wikipedia vs. randomly sampled pages
    • discarded pages not classified as references.
      • Does this mean they kept pages with a Wikipedia-reference feel? Not entirely clear. (A rough sketch of this kind of classifier follows below.)
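A rough illustration of this kind of quality classifier (not the authors' actual pipeline; the file name, labels, and threshold are assumptions): train a fastText model on Wikipedia-referenced pages vs. randomly sampled CommonCrawl pages, then discard pages that don't look reference-like.

```python
import fasttext  # pip install fasttext

# train.txt holds lines such as "__label__reference <page text>" and
# "__label__random <page text>", built from Wikipedia-referenced pages
# and randomly sampled CommonCrawl pages.
model = fasttext.train_supervised("train.txt", epoch=5, wordNgrams=2)

def keep_page(text: str, threshold: float = 0.5) -> bool:
    # Keep only pages classified as "reference-like" with enough confidence.
    labels, probs = model.predict(text.replace("\n", " "))
    return labels[0] == "__label__reference" and probs[0] >= threshold
```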

C4 [15%]

  • During exploratory experiments, we observed that using diverse pre-processed CommonCrawl datasets improves performance.
    • In their experiments, adding C4 improved performance.
  • The preprocessing of C4 also contains deduplication and language identification steps:
    • the main difference with CCNet is the quality filtering, which mostly relies on heuristics such as presence of punctuation marks or the number of words and sentences in a webpage.
      • Quality is judged heuristically, based on things like word count, sentence count, and the presence of punctuation marks.
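A minimal sketch of what heuristic quality filtering of this kind can look like; the specific thresholds and rules below are illustrative, not necessarily C4's exact ones.

```python
import re

def passes_heuristics(page: str) -> bool:
    # Keep only lines that end in terminal punctuation and have enough words.
    lines = [l for l in page.splitlines()
             if l.strip().endswith((".", "!", "?", '"')) and len(l.split()) >= 5]
    # Require a minimum amount of surviving text and no obvious junk markers.
    if len(lines) < 3:
        return False
    text = " ".join(lines)
    if re.search(r"lorem ipsum|\{", text, flags=re.IGNORECASE):
        return False
    return True
```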

Github [4.5%]

  • We use the public GitHub dataset available on Google BigQuery
  • We only kept projects that are distributed under the Apache, BSD and MIT licenses
    • Filtered by license.
  • Additionally, we filtered low quality files with heuristics based on the line length or proportion of alphanumeric characters, and removed boilerplate, such as headers, with regular expressions.
    • i.e., files with odd character ratios or overly long lines are dropped, and boilerplate such as headers is identified with regular expressions and stripped out.
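A hedged sketch of this kind of file-level filtering; the thresholds and the header pattern are illustrative assumptions, not the exact rules used.

```python
import re

# Example pattern for comment-style license/header boilerplate at the top of a file.
HEADER_RE = re.compile(r"\A(?:\s*(?:#|//).*\n)+")

def keep_code_file(text: str, max_line_len: int = 1000, min_alnum_frac: float = 0.25) -> bool:
    # Drop files with extremely long lines or too few alphanumeric characters.
    if any(len(line) > max_line_len for line in text.splitlines()):
        return False
    alnum_frac = sum(c.isalnum() for c in text) / max(1, len(text))
    return alnum_frac >= min_alnum_frac

def strip_header(text: str) -> str:
    # Remove the boilerplate header with a regular expression.
    return HEADER_RE.sub("", text, count=1)
```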

Wikipedia [4.5%]

  • We add Wikipedia dumps from the June-August 2022 period, covering 20 languages, which use either the Latin or Cyrillic scripts: bg, ca, cs, da, de, en, es, fr, hr, hu, it, nl, pl, pt, ro, ru, sl, sr, sv, uk.
  • We process the data to remove hyperlinks, comments and other formatting boilerplate. (links and comments removed)

Gutenberg and Books3 [4.5%]

  • We perform deduplication at the book level, removing books with more than 90% content overlap.
    • Books with more than 90% content overlap are removed (how did they measure this at the book level, given the volume?)

ArXiv [2.5%]

  • We process arXiv LaTeX files to add scientific data to our dataset.
  • Following Lewkowycz et al. (2022), we removed everything before the first section, as well as the bibliography. We also removed the comments from the .tex files, and inline-expanded definitions and macros written by users to increase consistency across papers.

Stack Exchange [2%]

  • a website of high quality questions and answers that covers a diverse set of domains, ranging from computer science to chemistry.
  • We kept the data from the 28 largest websites, removed the HTML tags from text and sorted the answers by score

Tokenizer

  • byte-pair encoding (BPE) algorithm
    • using the implementation from SentencePiece (Kudo and Richardson, 2018)
  • we split all numbers into individual digits, and fallback to bytes to decompose unknown UTF-8 characters
  • Overall, our entire training dataset contains roughly 1.4T tokens after tokenization.
  • For most of our training data, each token is used only once during training, with the exception of the Wikipedia and Books domains, over which we perform approximately two epochs.
    • Most data is seen roughly once; Wikipedia and the book corpora are seen about twice.
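A minimal sketch of how such a tokenizer could be trained with SentencePiece; the corpus path is a placeholder, and the 32K vocabulary matches LLaMA's released tokenizer.

```python
import sentencepiece as spm

# Train a BPE tokenizer that splits numbers into digits and falls back to
# bytes for unknown UTF-8 characters, as described above.
spm.SentencePieceTrainer.train(
    input="corpus.txt",          # placeholder path
    model_prefix="llama_like",
    model_type="bpe",
    vocab_size=32000,            # LLaMA uses a 32K vocabulary
    split_digits=True,           # "123" -> "1", "2", "3"
    byte_fallback=True,          # decompose unknown characters into bytes
)

sp = spm.SentencePieceProcessor(model_file="llama_like.model")
print(sp.encode("LLaMA was trained on 1.4T tokens", out_type=str))
```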

Architecture

  • We leverage various improvements that were subsequently proposed, and used in different models such as PaLM. Here are the main differences from the original architecture, and where the inspiration for each change came from (in brackets):

Pre-normalization [GPT3]

  • To improve the training stability, we normalize the input of each transformer sub-layer, instead of normalizing the output
  • We use the RMSNorm normalizing function, introduced by Zhang and Sennrich (2019)
    • RMSNorm was also used on the T5 side when training with apex.
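For reference, a minimal PyTorch sketch of RMSNorm (the eps value is an illustrative choice):

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Unlike LayerNorm, only rescale by the root mean square:
        # no mean subtraction and no bias term.
        rms = torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps)
        return self.weight * (x * rms)
```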

SwiGLU activation function [PaLM]

  • We replace the ReLU non-linearity by the SwiGLU activation function, introduced by Shazeer (2020)
  • We use a dimension of 2/3 * 4d instead of 4d as in PaLM (worth checking the code for this later; see the sketch below)
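A sketch of what the SwiGLU feed-forward block looks like with the 2/3 · 4d hidden dimension mentioned above; the layer names are illustrative, not necessarily the ones in the released code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUFeedForward(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        hidden_dim = int(2 * (4 * dim) / 3)  # 2/3 * 4d instead of 4d
        self.gate_proj = nn.Linear(dim, hidden_dim, bias=False)
        self.up_proj = nn.Linear(dim, hidden_dim, bias=False)
        self.down_proj = nn.Linear(hidden_dim, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # SwiGLU: silu(x W_gate) elementwise-multiplied by x W_up, projected back down.
        return self.down_proj(F.silu(self.gate_proj(x)) * self.up_proj(x))
```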

Rotary Embeddings [GPTNeo]

  • We remove the absolute positional embeddings, and instead, add rotary positional embeddings (RoPE)
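A minimal sketch of rotary position embeddings applied to a query or key tensor; this is the interleaved-pair formulation, written for clarity rather than to mirror the released implementation.

```python
import torch

def apply_rope(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    # x: (..., seq_len, head_dim). Rotate each (even, odd) pair of features
    # by an angle that depends on the position and the feature index.
    seq_len, dim = x.shape[-2], x.shape[-1]
    inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2, dtype=torch.float32) / dim))
    angles = torch.arange(seq_len, dtype=torch.float32)[:, None] * inv_freq[None, :]
    cos, sin = angles.cos().to(x.dtype), angles.sin().to(x.dtype)
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out
```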

Optimizer

  • Our models are trained using the AdamW optimizer (Loshchilov and Hutter, 2017), with the following hyper-parameters: β1 = 0.9, β2 = 0.95.
  • We use a cosine learning rate schedule, such that the final learning rate is equal to 10% of the maximal learning rate
  • weight decay of 0.1 and gradient clipping of 1.0
    • The clipping value seems fairly high?! 0.3 seems to be more common these days.
  • use 2,000 warmup steps, and vary the learning rate and batch size with the size of the model (see Table 2 for details).
    image
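A minimal PyTorch sketch of this setup (AdamW with β1 = 0.9, β2 = 0.95, weight decay 0.1, 2,000 warmup steps, cosine decay to 10% of the peak LR); the peak learning rate and total step count below are placeholders, not the per-model values from Table 2.

```python
import math
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import LambdaLR

def build_optimizer(model, max_lr=1.5e-4, total_steps=350_000, warmup_steps=2_000):
    opt = AdamW(model.parameters(), lr=max_lr, betas=(0.9, 0.95), weight_decay=0.1)

    def lr_lambda(step):
        if step < warmup_steps:
            return step / max(1, warmup_steps)                     # linear warmup
        progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
        return 0.1 + 0.45 * (1 + math.cos(math.pi * progress))     # cosine decay to 10% of max

    return opt, LambdaLR(opt, lr_lambda)

# In the training loop, gradients are clipped to 1.0 before each step:
# torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
```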

Efficient Implementation

  • First, we use an efficient implementation of the causal multi-head attention to reduce memory usage and runtime. This implementation, available in the xformers library
    • xformers library is inspired by Rabe and Staats (2021, "Self-attention Does Not Need O(n²) Memory") and uses the backward pass from Dao et al. (2022, "FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness")
    • This is achieved by not storing the attention weights and not computing the key/query scores that are masked due to the causal nature of the language modeling task.
      • In other words, scores that would be masked out are simply never computed.
        • image
  • To further improve training efficiency, we reduced the amount of activations that are recomputed during the backward pass with checkpointing
    • we save the activations that are expensive to compute, such as the outputs of linear layers
    • This is achieved by manually implementing the backward function for the transformer layers, instead of relying on the PyTorch autograd. (done by hand!)
    • reduce the memory usage of the model by using model and sequence parallelism, as described by Korthikanti et al. (2022)
    • we also overlap the computation of activations and the communication between GPUs over the network (due to all_reduce operations) as much as possible. (i.e., GPU communication is overlapped with computation wherever possible)
  • When training a 65B-parameter model, our code processes around 380 tokens/sec/GPU on 2048 A100 GPUs with 80GB of RAM. This means that training over our dataset containing 1.4T tokens takes approximately 21 days.
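A quick back-of-the-envelope check of the 21-day figure (my own arithmetic, not from the paper):

$$
\frac{1.4\times10^{12}\ \text{tokens}}{380\ \text{tokens/s/GPU} \times 2048\ \text{GPUs} \times 86400\ \text{s/day}} \approx \frac{1.4\times10^{12}}{6.7\times10^{10}\ \text{tokens/day}} \approx 21\ \text{days}
$$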

Main results

  • zero-shot and few-shot (1 to 64 examples) tasks, reporting results on a total of 20 benchmarks:
    • Zero-shot: provides an answer using open-ended generation, or ranks the proposed answers.
    • Few-shot: a few examples of the task (between 1 and 64) plus a test example; the model takes this text as input and generates the answer or ranks different options.
  • We evaluate LLaMA on free-form generation tasks and multiple choice tasks
  • We follow Gao et al. (2021) and use the likelihood normalized by the number of characters in the completion, except for certain datasets (OpenBookQA, BoolQ), for which we follow Brown et al. (2020), and select a completion based on the likelihood normalized by the likelihood of the completion given “Answer:” as context: P(completion | context) / P(completion | “Answer:”)
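A hedged sketch of this scoring scheme; `logprob(context, completion)` below is a hypothetical helper returning the model's log-probability of the completion given the context.

```python
def pick_by_char_normalized_likelihood(context, completions, logprob):
    # Default scoring: log-likelihood divided by the completion's character count.
    return max(completions, key=lambda c: logprob(context, c) / len(c))

def pick_by_answer_normalized_likelihood(context, completions, logprob):
    # OpenBookQA/BoolQ-style scoring: P(completion|context) / P(completion|"Answer:"),
    # which in log space is a difference of log-probabilities.
    return max(completions, key=lambda c: logprob(context, c) - logprob("Answer:", c))
```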

Common Sense Reasoning

  • consider eight standard common sense reasoning benchmarks: BoolQ (Clark et al., 2019), PIQA (Bisk et al., 2020), SIQA (Sap et al., 2019), HellaSwag (Zellers et al., 2019), WinoGrande (Sakaguchi et al., 2021), ARC easy and challenge (Clark et al., 2018) and OpenBookQA (Mihaylov et al., 2018).
  • These datasets include Cloze and Winograd style tasks, as well as multiple choice question answering
  • We evaluate in the zero-shot setting as done in the language modeling community
    image
  • First, LLaMA-65B outperforms Chinchilla-70B on all reported benchmarks but BoolQ.
  • Similarly, this model surpasses PaLM-540B everywhere but on BoolQ and WinoGrande
  • LLaMA-13B model also outperforms GPT-3 on most benchmarks despite being 10× smaller.

Closed-book Question Answering

  • compare LLaMA to existing large language models on two closed-book question answering benchmarks: Natural Questions (Kwiatkowski et al., 2019) and TriviaQA (Joshi et al., 2017)
  • LLaMA-65B achieves state-of-the-art performance in the zero-shot and few-shot settings
  • More importantly, the LLaMA-13B is also competitive on these benchmarks with GPT-3 and Chinchilla, despite being 5-10× smaller. This model runs on a single V100 GPU during inference
Table 4 Table 5
image image

Reading Comprehension

  • RACE reading comprehension benchmark
    • English reading comprehension exams designed for middle and high school Chinese students.

image

Mathematical reasoning

  • MATH is a dataset of 12K middle school and high school mathematics problems written in LaTeX
  • GSM8k is a set of middle school mathematical problems
  • Minerva is a series of PaLM models finetuned on 38.5B tokens extracted from ArXiv and Math Web Pages, while neither PaLM nor LLaMA is finetuned on mathematical data.
  • we compare with and without maj1@k. maj1@k denotes evaluations where we generate k samples for each problem and perform a majority voting (Wang et al., 2022)
  • LLaMA-65B outperforms Minerva-62B, although it has not been fine-tuned on mathematical data.
  • This level of performance is genuinely quite good.
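A minimal sketch of maj1@k-style majority voting; `generate_answer` below is a hypothetical sampling function that returns a final answer string.

```python
from collections import Counter

def maj1_at_k(problem: str, generate_answer, k: int = 16) -> str:
    # Sample k candidate solutions and return the most common final answer.
    samples = [generate_answer(problem) for _ in range(k)]
    return Counter(samples).most_common(1)[0][0]
```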

image

Code generation

  • model receives a description of the program in a few sentences, as well as a few input-output examples
  • In HumanEval, it also receives a function signature, and the prompt is formatted as natural code with the textual description and tests in a docstring
    • The model needs to generate a Python program that fits the description and satisfies the test cases.
  • we compare the pass@1 scores of our models with existing language models that have not been finetuned on code, namely PaLM and LaMDA (Thoppilan et al., 2022). PaLM and LLaMA were trained on datasets that contain a similar number of code tokens.
  • LLaMA outperforms other general models such as LaMDA and PaLM, which are not trained or finetuned specifically for code.
  • Further improvement is possible with additional finetuning:
    • PaLM-Coder (Chowdhery et al., 2022) increases the pass@1 score of PaLM on HumanEval from 26.2% for PaLM to 36%
  • For the metric, see the paper (Evaluating Large Language Models Trained on Code)
    • k refers to the number of code samples generated per problem (pass@k metric, where k code samples are generated per problem); see the sketch below
    • image
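For reference, the unbiased pass@k estimator defined in that paper (Chen et al., 2021), where n samples are generated per problem and c of them pass the unit tests:

```python
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    # Probability that at least one of k randomly chosen samples (out of n,
    # c of which are correct) passes: 1 - C(n-c, k) / C(n, k).
    if n - c < k:
        return 1.0
    return 1.0 - math.prod((n - c - i) / (n - i) for i in range(k))
```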

image

Massive Multitask Language Understanding (MMLU)

  • The massive multitask language understanding benchmark, or MMLU
    • consists of multiple choice questions covering various domains of knowledge, including humanities, STEM and social sciences (exam-style questions, like school tests)
  • evaluate our models in the 5-shot setting, using the examples provided by the benchmark
  • LLaMA-65B lags slightly behind Chinchilla-70B and PaLM-540B on average on MMLU. A potential explanation is that we have used a limited amount of books and academic papers in our pre-training data, i.e., ArXiv, Gutenberg and Books3, which sums up to only 177GB, while these models (Chinchilla-70B and PaLM-540B) were trained on up to 2TB of books

image

Evolution of performance during training

  • During training, we tracked the performance of our models on a few question answering and common sense benchmarks
    image
  • the performance improves steadily, and correlates with the training perplexity of the model (see Figure 1).
  • Interpretation
    • on SIQA, we observe a lot of variance in performance, that may indicate that this benchmark is not reliable
    • On WinoGrande, the performance does not correlate as well with training perplexity: the LLaMA-33B and LLaMA-65B have similar performance during the training.

image

Instruction Finetuning

  • show that briefly finetuning on instruction data rapidly leads to improvements on MMLU
  • Since this is not the focus of this paper, we only conducted a single experiment following the same protocol as Chung et al. (2022) to train an instruct model, LLaMA-I (instruction tuning is not the goal of the paper, so they only look at it briefly)
    image

image

Bias, Toxicity and Misinformation

  • Large language models have been shown to reproduce and amplify biases that exist in the training data and to generate toxic or offensive content
  • we evaluate on different benchmarks that measure toxic content production and stereotype detection

RealToxicityPrompts

  • RealToxicityPrompts consists of about 100k prompts that the model must complete
  • toxicity score is automatically evaluated by making a request to PerspectiveAPI
  • The score per prompt ranges from 0 (non-toxic) to 1 (toxic)
  • These scores are “comparable” with what we observe in the literature (e.g., 0.087 for Chinchilla), but the methodologies differ between those works and ours
    image

CrowS-Pairs

  • This dataset allows us to measure biases in 9 categories: gender, religion, race/color, sexual orientation, age, nationality, disability, physical appearance and socioeconomic status.
  • Each example is composed of a stereotype and an anti-stereotype, we measure the model preference for the stereotypical sentence using the perplexity of both sentences in a zero-shot setting
  • LLaMA's average score looks favorable, but in terms of the number of categories where it leads, GPT-3 arguably looks best
  • This benchmark set was also used in Pythia
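A hedged sketch of how this perplexity comparison can be computed; `perplexity` below is a hypothetical helper around the language model.

```python
def prefers_stereotype(stereo: str, anti: str, perplexity) -> bool:
    # The model "prefers" the stereotypical sentence if it assigns it lower perplexity.
    return perplexity(stereo) < perplexity(anti)

def bias_score(pairs, perplexity) -> float:
    # Percentage of pairs where the stereotype is preferred; 50% means no systematic preference.
    return 100.0 * sum(prefers_stereotype(s, a, perplexity) for s, a in pairs) / len(pairs)
```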

image

WinoGender

  • WinoGender benchmark (Rudinger et al., 2018), a co-reference resolution dataset. WinoGender is made of Winograd schema, and biases are evaluated by determining if a model co-reference resolution performance is impacted by the gender of the pronoun.
  • Gender bias toward pronouns is measured as a coreference problem; this benchmark set was also used in Pythia
  • More precisely, each sentence has three mentions: an “occupation”, a “participant”, and a “pronoun” where the pronoun is co-referencing either the occupation or participant
  • For example: “The nurse notified the patient that his shift would be ending in an hour.”, followed by “‘His’ refers to”. We then compare the perplexity of the continuations “the nurse” and “the patient” to perform co-reference resolution with the model. We evaluate the performance when using 3 pronouns: “her/her/she”, “his/him/he” and “their/them/someone” (the different choices corresponding to the grammatical function of the pronoun).
    image

TruthfulQA

  • measure the truthfulness of a model, i.e., its ability to identify when a claim is true.
  • the definition of “true” in the sense of “literal truth about the real world”
  • This benchmark can evaluate the risks of a model to generate misinformation or false claims. The questions are written in diverse style, cover 38 categories and are designed to be adversarial.
  • Compared to GPT-3, our model scores higher in both categories, but the rate of correct answers is still low, showing that our model is likely to hallucinate incorrect answers.
  • A benchmark measuring whether the model can answer common-misconception questions well (e.g., the classic “what happens if you swallow watermelon seeds?”)
    image

Carbon footprint

  • The training of our models has consumed a massive quantity of energy, responsible for the emission of carbon dioxide
  • Wh = GPU-h × (GPU power consumption) × PUE, where PUE = 1.1
  • tCO2eq = MWh × 0.385
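Plugging in the 65B run described earlier (2048 GPUs for roughly 21 days) with an assumed 400W per-GPU draw gives a rough estimate (my own arithmetic, not the paper's reported numbers):

$$
\text{MWh} \approx \frac{2048 \times 21 \times 24\ \text{GPU-h} \times 400\ \text{W} \times 1.1}{10^{6}} \approx 454,
\qquad
\text{tCO}_2\text{eq} \approx 454 \times 0.385 \approx 175
$$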

image

Conclusion

  • presented a series of language models that are released openly, and competitive with state-of-the-art foundation models
  • LLaMA-13B outperforms GPT-3 while being more than 10× smaller, and LLaMA-65B is competitive with Chinchilla-70B and PaLM-540B
  • Unlike previous studies, we show that it is possible to achieve state-of-the-art performance by training exclusively on publicly available data (this is the key point!)
  • Finally, we plan to release larger models trained on larger pretraining corpora in the future

Appendix

image
image
image
