LLaMA (Open and Efficient Foundation Language Models)
Note
- LLaMA, the little ball that Meta launched
- Quite well-made models, unlike the earlier OPT; rumor has it a lot of experimentation went into them!
- Released publicly
Author
- Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M. A., Lacroix, T., … & Lample, G. (2023). Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971.
- Meta AI
Summary
Abstract
- LLaMA-13B outperforms GPT-3 (175B) on most benchmarks, and LLaMA-65B is competitive with the best models, Chinchilla-70B and PaLM-540B
Introduction
- recent work from Hoffmann et al. (2022) (the Chinchilla paper, DeepMind) shows that, for a given compute budget, the best performances are not achieved by the largest models, but by smaller models trained on more data.
- For instance, although Hoffmann et al. (2022) recommends training a 10B model on 200B tokens, we find that the performance of a 7B model continues to improve even after 1T tokens.
- Performance tends to keep improving as the token count grows
- Unlike Chinchilla, PaLM, or GPT-3, we only use publicly available data, making our work compatible with open-sourcing, while most existing models rely on data which is either not publicly available or undocumented (e.g. “Books – 2TB” or “Social media conversations”).
- In the rest of this paper, we present an overview of the modifications we made to the transformer architecture (Vaswani et al., 2017), as well as our training method.
Approach
Pre-training Data
- Our training dataset is a mixture of several sources, reported in Table 1
English CommonCrawl [67%]
- preprocess five CommonCrawl dumps, ranging from 2017 to 2020, with the CCNet pipeline
- This process deduplicates the data at the line level, performs language identification with a fastText linear classifier to remove non-English pages, and filters low quality content with an n-gram language model
- we trained a linear model to classify pages used as references in Wikipedia vs. randomly sampled pages, and discarded pages not classified as references
- Does this mean they favored Wikipedia-style pages? Not sure
C4 [15%]
- During exploratory experiments, we observed that using diverse pre-processed CommonCrawl datasets improves performance.
- In their experiments, adding C4 improved performance
- The preprocessing of C4 also contains deduplication and language identification steps:
- the main difference with CCNet is the quality filtering, which mostly relies on heuristics such as the presence of punctuation marks or the number of words and sentences in a webpage
- quality is judged heuristically using criteria such as word count, sentence count, and the presence of punctuation
GitHub [4.5%]
- We use the public GitHub dataset available on Google BigQuery
- We only kept projects that are distributed under the Apache, BSD and MIT licenses
- Filtered by license
- Additionally, we filtered low quality files with heuristics based on the line length or proportion of alphanumeric characters, and removed boilerplate, such as headers, with regular expressions.
- Files are excluded based on heuristics such as line length and the proportion of alphanumeric characters, and boilerplate such as headers is removed with regular expressions
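A rough sketch of this kind of filtering; the thresholds, the regex, and the function names below are illustrative guesses, not the paper's actual pipeline:

```python
import re

# a leading block of comment-only lines, treated here as "header" boilerplate
HEADER_RE = re.compile(r"\A(?:\s*(?:#|//).*\n)+")

def keep_file(text: str) -> bool:
    """Heuristic quality filter based on line length and alphanumeric ratio."""
    lines = text.splitlines() or [""]
    max_line_len = max(len(line) for line in lines)
    alnum_ratio = sum(ch.isalnum() for ch in text) / max(1, len(text))
    return max_line_len < 1000 and alnum_ratio > 0.25   # thresholds are made up

def strip_header(text: str) -> str:
    """Remove a leading comment header (e.g. a license banner) with a regex."""
    return HEADER_RE.sub("", text, count=1)
```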
Wikipedia [4.5%]
- Wikipedia [4.5%]. We add Wikipedia dumps from the June-August 2022 period, covering 20 languages, which use either the Latin or Cyrillic scripts:
bg, ca, cs, da, de, en, es, fr, hr, hu, it, nl, pl, pt, ro, ru, sl, sr, sv, uk.
- We process the data to remove hyperlinks, comments and other formatting boilerplate. (hyperlinks and comments removed)
Gutenberg and Books3 [4.5%]
- We perform deduplication at the book level, removing books with more than 90% content overlap.
- Books that overlap by more than 90% are removed (how did they manage this at the book level, given the volume?)
ArXiv [2.5%]
- We process arXiv LaTeX files to add scientific data to our dataset.
- Following Lewkowycz et al. (2022), we removed everything before the first section, as well as the bibliography. We also removed the comments from the .tex files, and inline-expanded definitions and macros written by users to increase consistency across papers.
Stack Exchange [2%]
- a website of high quality questions and answers that covers a diverse set of domains, ranging from computer science to chemistry.
- We kept the data from the 28 largest websites, removed the HTML tags from the text, and sorted the answers by score
Tokenizer
- byte-pair encoding (BPE) algorithm
- using the implementation from SentencePiece (Kudo and Richardson, 2018)
- we split all numbers into individual digits, and fall back to bytes to decompose unknown UTF-8 characters
- Overall, our entire training dataset contains roughly 1.4T tokens after tokenization.
- For most of our training data, each token is used only once during training, with the exception of the Wikipedia and Books domains, over which we perform approximately two epochs.
- Most of the data is seen only once; Wikipedia and books are seen about twice
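A minimal sketch of training a tokenizer with these properties using SentencePiece; `corpus.txt` is a placeholder input file, and the 32k vocabulary size is an assumption rather than something spelled out in this section:

```python
import sentencepiece as spm

# BPE tokenizer with digit splitting and byte-level fallback for unknown UTF-8
spm.SentencePieceTrainer.train(
    input="corpus.txt",          # placeholder corpus file
    model_prefix="llama_like",
    model_type="bpe",
    vocab_size=32000,            # assumed vocabulary size
    split_digits=True,           # numbers are split into individual digit tokens
    byte_fallback=True,          # unknown UTF-8 characters decompose into bytes
)

sp = spm.SentencePieceProcessor(model_file="llama_like.model")
print(sp.encode("trained on 1.4T tokens", out_type=str))
```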
Architecture
- We leverage various improvements that were subsequently proposed, and used in different models such as PaLM. Here are the main differences with the original architecture, and where we found the inspiration for each change (in brackets):
Pre-normalization [GPT3]
- To improve the training stability, we normalize the input of each transformer sub-layer, instead of normalizing the output
- We use the RMSNorm normalizing function, introduced by Zhang and Sennrich (2019)
- RMSNorm was already in use on the T5 side when using apex
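A minimal PyTorch sketch of RMSNorm as described by Zhang and Sennrich (2019); the epsilon value is a common default, not taken from the paper:

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """Root-mean-square layer norm: rescale by the RMS of the features,
    with a learned gain and no mean subtraction or bias."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return x * rms * self.weight
```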
SwiGLU activation function [PaLM]
- We replace the ReLU non-linearity by the SwiGLU activation function, introduced by Shazeer (2020)
- We use a dimension of 2/3 * 4d instead of 4d as in PaLM
- (worth checking the code later for this)
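A minimal sketch of a SwiGLU feed-forward block with the 2/3 * 4d hidden size; the rounding to a multiple of 256 mirrors common LLaMA implementations and is an assumption here:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUFFN(nn.Module):
    """Feed-forward block with SwiGLU: down(silu(gate(x)) * up(x))."""
    def __init__(self, d_model: int, multiple_of: int = 256):
        super().__init__()
        hidden = int(2 * (4 * d_model) / 3)                        # 2/3 * 4d
        hidden = multiple_of * ((hidden + multiple_of - 1) // multiple_of)
        self.w_gate = nn.Linear(d_model, hidden, bias=False)
        self.w_up = nn.Linear(d_model, hidden, bias=False)
        self.w_down = nn.Linear(hidden, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))
```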
Rotary Embeddings [GPTNeo]
- We remove the absolute positional embeddings, and instead add rotary positional embeddings (RoPE)
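A compact sketch of RoPE in the "rotate-half" convention (the reference LLaMA code uses an equivalent complex-number formulation); the shapes and the 10000 base follow the usual convention and are assumptions here:

```python
import torch

def apply_rope(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Rotary position embeddings for x of shape (batch, seq, heads, head_dim).
    Channel i is paired with channel i + head_dim/2 ("rotate half" convention)."""
    _, seq_len, _, head_dim = x.shape
    half = head_dim // 2
    inv_freq = base ** (-torch.arange(half, dtype=torch.float32) / half)      # (half,)
    angles = torch.arange(seq_len, dtype=torch.float32)[:, None] * inv_freq   # (seq, half)
    cos = angles.cos()[None, :, None, :]
    sin = angles.sin()[None, :, None, :]
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

# applied to queries and keys before computing attention scores
```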
Optimizer
- Our models are trained using the AdamW optimizer (Loshchilov and Hutter, 2017), with the following hyper-parameters: β1 = 0.9, β2 = 0.95.
- We use a cosine learning rate schedule, such that the final learning rate is equal to 10% of the maximal learning rate. We use a weight decay of 0.1 and gradient clipping of 1.0
- The clipping value seems fairly high?! 0.3 seems to be more common these days
- use 2,000 warmup steps, and vary the learning rate and batch size with the size of the model (see Table 2 for details).
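A minimal sketch of this optimizer setup in PyTorch; the peak learning rate below is only illustrative, since the paper varies learning rate and batch size with model size (Table 2):

```python
import math
import torch

def build_optimizer(model, total_steps: int, max_lr: float = 3e-4, warmup_steps: int = 2000):
    # AdamW with beta1=0.9, beta2=0.95 and weight decay 0.1, as in the paper
    opt = torch.optim.AdamW(model.parameters(), lr=max_lr,
                            betas=(0.9, 0.95), weight_decay=0.1)

    def lr_lambda(step: int) -> float:
        if step < warmup_steps:                                    # linear warmup
            return step / max(1, warmup_steps)
        progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
        return 0.1 + 0.45 * (1 + math.cos(math.pi * progress))     # cosine decay to 10% of max

    sched = torch.optim.lr_scheduler.LambdaLR(opt, lr_lambda)
    return opt, sched

# each step: loss.backward();
# torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0); opt.step(); sched.step()
```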
Efficient Implementation
- First, we use an efficient implementation of the causal multi-head attention to reduce memory usage and runtime. This implementation is available in the xformers library (a usage sketch follows at the end of this section).
- The xformers implementation is inspired by Rabe and Staats (2021, “Self-attention does not need O(n^2) memory”) and uses the backward pass from Dao et al. (2022, “FlashAttention: Fast and memory-efficient exact attention with IO-awareness”)
- This is achieved by not storing the attention weights and not computing the key/query scores that are masked due to the causal nature of the language modeling task.
- So it is implemented by simply not computing the masked positions?!
- To further improve training efficiency, we reduced the amount of activations that are recomputed during the backward pass with checkpointing
- we save the activations that are expensive to compute, such as the outputs of linear layers
- This is achieved by manually implementing the backward function for the transformer layers, instead of relying on the PyTorch autograd. (done by hand!!)
- reduce the memory usage of the model by using model and sequence parallelism, as described by Korthikanti et al. (2022)
- we also overlap the computation of activations and the communication between GPUs over the network (due to all_reduce operations) as much as possible. (so GPU communication is overlapped with compute as much as possible?)
- When training a 65B-parameter model, our code processes around 380 tokens/sec/GPU on 2048 A100 GPUs with 80GB of RAM. This means that training over our dataset containing 1.4T tokens takes approximately 21 days. (sanity check: 380 × 2048 ≈ 778k tokens/sec overall, and 1.4e12 / 778k ≈ 1.8M seconds ≈ 21 days)
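A usage sketch of the memory-efficient causal attention referenced in the first bullets of this section; it assumes a recent xformers release where `memory_efficient_attention` and `LowerTriangularMask` are exposed under `xformers.ops`, and it requires a CUDA device:

```python
import torch
import xformers.ops as xops

batch, seq, heads, head_dim = 2, 2048, 16, 64
q = torch.randn(batch, seq, heads, head_dim, device="cuda", dtype=torch.float16)
k = torch.randn_like(q)
v = torch.randn_like(q)

# Causal memory-efficient attention: the full (seq x seq) attention matrix is
# never materialized, and scores masked by causality are not computed.
out = xops.memory_efficient_attention(q, k, v, attn_bias=xops.LowerTriangularMask())
print(out.shape)  # (batch, seq, heads, head_dim)
```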
Main results
- zero-shot and few-shot (1 to 64 examples) tasks, and report results on a total of 20 benchmarks:
- Zero-shot: provides an answer using open-ended generation, or ranks the proposed answers.
- Few-shot: a few examples of the task (between 1 and 64) and a test example. The model takes this text as input and generates the answer or ranks different options
- We evaluate LLaMA on free-form generation tasks and multiple choice tasks
- We follow Gao et al. (2021) and use the likelihood normalized by the number of characters in the completion, except for certain datasets (OpenBookQA, BoolQ), for which we follow Brown et al. (2020), and select a completion based on the likelihood normalized by the likelihood of the completion given “Answer:” as context: P(completion|context) / P(completion|“Answer:”)
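A sketch of the two selection rules, assuming a hypothetical helper `logprob(context, completion)` that returns the model's total log-likelihood of the completion given the context (the helper is not from the paper):

```python
def char_normalized(logprob, context, completion):
    # default rule: log-likelihood divided by the completion length in characters
    return logprob(context, completion) / len(completion)

def answer_normalized(logprob, context, completion):
    # OpenBookQA / BoolQ rule: P(completion|context) / P(completion|"Answer:"),
    # which becomes a difference in log space
    return logprob(context, completion) - logprob("Answer:", completion)

def pick_answer(logprob, context, choices, score=char_normalized):
    return max(choices, key=lambda c: score(logprob, context, c))
```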
Common Sense Reasoning
- consider eight standard common sense reasoning benchmarks: BoolQ (Clark et al., 2019), PIQA (Bisk et al., 2020), SIQA (Sap et al., 2019), HellaSwag (Zellers et al., 2019), WinoGrande (Sakaguchi et al., 2021), ARC easy and challenge (Clark et al., 2018) and OpenBookQA (Mihaylov et al., 2018)
- These datasets include Cloze and Winograd style tasks, as well as multiple choice question answering
- We evaluate in the zero-shot setting as done in the language modeling community
- First, LLaMA-65B outperforms Chinchilla-70B on all reported benchmarks but BoolQ.
- Similarly, this model surpasses PaLM-540B everywhere but on BoolQ and WinoGrande
- LLaMA-13B model also outperforms GPT-3 on most benchmarks despite being 10× smaller.
Closed-book Question Answering
- compare LLaMA to existing large language models on two closed-book question answering benchmarks: Natural Questions (Kwiatkowski et al., 2019) and TriviaQA (Joshi et al., 2017)
- LLaMA-65B achieves state-of-the-art performance in the zero-shot and few-shot settings
- More importantly, the LLaMA-13B is also competitive on these benchmarks with GPT-3 and Chinchilla, despite being 5-10× smaller. This model runs on a single V100 GPU during inference
- (results in Table 4 and Table 5 of the paper)
Reading Comprehension
- RACE reading comprehension benchmark
- English reading comprehension exams designed for Chinese middle and high school students.
Mathematical reasoning
- MATH is a dataset of 12K middle school and high school mathematics problems written in LaTeX
- GSM8k is a set of middle school mathematical problems
- Minerva is a series of PaLM models finetuned on 38.5B tokens extracted from ArXiv and Math Web Pages, while neither PaLM nor LLaMA is finetuned on mathematical data
- we compare with and without maj1@k; maj1@k denotes evaluations where we generate k samples for each problem and perform a majority voting (Wang et al., 2022)
- LLaMA-65B outperforms Minerva-62B, although it has not been fine-tuned on mathematical data
- That level of performance is genuinely quite good
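A sketch of maj1@k voting; `extract_final_answer` is a hypothetical parser that pulls the final numeric answer out of a generated solution:

```python
from collections import Counter

def maj1_at_k(samples, extract_final_answer):
    """Given k generated solutions for one problem, keep the most frequent
    final answer (majority voting, Wang et al., 2022)."""
    answers = [extract_final_answer(s) for s in samples]
    return Counter(answers).most_common(1)[0][0]
```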
Code generation
- model receives a description of the program in a few sentences, as well as a few input-output examples
- In HumanEval, it also receives a function signature, and the prompt is formatted as natural code with the textual description and tests in a docstring
- The model needs to generate a Python program that fits the description and satisfies the test cases.
- we compare the pass@1 scores of our models with existing language models that have not been finetuned on code, namely PaLM and LaMDA (Thoppilan et al., 2022). PaLM and LLaMA were trained on datasets that contain a similar number of code tokens.
- LLaMA outperforms other general models such as LaMDA and PaLM, which are not trained or finetuned specifically for code.
- Further gains are possible with additional finetuning
- PaLM-Coder (Chowdhery et al., 2022) increases the pass@1 score of PaLM on HumanEval from 26.2% for PaLM to 36%
- For the metric, see “Evaluating Large Language Models Trained on Code” (Chen et al., 2021)
- k is the number of code samples generated per problem (pass@k metric, where k code samples are generated per problem and a problem counts as solved if any sample passes)
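For reference, a sketch of the unbiased pass@k estimator from that paper (Chen et al., 2021): n samples are generated per problem and c of them pass the unit tests; with a single sample per problem, pass@1 reduces to the plain solve rate.

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: 1 - C(n - c, k) / C(n, k), computed in a stable way."""
    if n - c < k:
        return 1.0
    return float(1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))
```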
Massive Multitask Language Understanding (MMLU)
- The massive multitask language understanding benchmark, or MMLU
- consists of multiple choice questions covering various domains of knowledge, including humanities, STEM and social sciences (questions like middle/high school exam problems)
- evaluate our models in the 5-shot setting, using the examples provided by the benchmark
- On MMLU, LLaMA-65B is behind Chinchilla-70B and PaLM-540B by a few percent on average. A potential explanation is that we have used a limited amount of books and academic papers in our pre-training data, i.e., ArXiv, Gutenberg and Books3, that sums up to only 177GB, while these models (Chinchilla-70B and PaLM-540B) were trained on up to 2TB of books
Evolution of performance during training
- During training, we tracked the performance of our models on a few question answering and common sense benchmarks
- the performance improves steadily, and correlates with the training perplexity of the model (see Figure 1).
- Interpretation
- on SIQA, we observe a lot of variance in performance, which may indicate that this benchmark is not reliable
- On WinoGrande, the performance does not correlate as well with training perplexity: the LLaMA-33B and LLaMA-65B have similar performance during the training.
Instruction Finetuning
- show that briefly finetuning on instructions data rapidly leads to improvements on MMLU
- Since this is not the focus of this paper, we only conducted a single experiment following the same protocol as Chung et al. (2022) to train an instruct model, LLaMA-I
- (Instruction tuning is not the goal of the paper, so they only look at it briefly)
Bias, Toxicity and Misinformation
- Large language models have been shown to reproduce and amplify biases that exist in the training data and to generate toxic or offensive content
- we evaluate on different benchmarks that measure toxic content production and stereotypes detection
RealToxicityPrompts
- RealToxicityPrompts consists of about 100k prompts that the model must complete
- toxicity score is automatically evaluated by making a request to PerspectiveAPI
- The score per prompt ranges from 0 (non-toxic) to 1 (toxic)
- These scores are “comparable” with what we observe in the literature (e.g., 0.087 for Chinchilla), but the methodologies differ between these works and ours
CrowS-Pairs
- This dataset allows measuring biases in 9 categories: gender, religion, race/color, sexual orientation, age, nationality, disability, physical appearance and socioeconomic status.
- Each example is composed of a stereotype and an anti-stereotype; we measure the model preference for the stereotypical sentence using the perplexity of both sentences in a zero-shot setting
setting - 평균값 자체는 우위에 있는데 앞서는 영역 개수는 GPT3가 제일 나아보이기도함
- Pythia에서도 사용한 벤치마크셋
WinoGender
- WinoGender benchmark (Rudinger et al., 2018), a co-reference resolution dataset. WinoGender is made of Winograd schema, and biases are evaluated by determining if a model's co-reference resolution performance is impacted by the gender of the pronoun.
- Gender bias with respect to pronouns is measured as a coreference resolution problem. A benchmark set also used in Pythia
- More precisely, each sentence has three mentions: an “occupation”, a “participant”, and a “pronoun” where the pronoun is co-referencing either the occupation or participant
- For instance, a sentence is “The nurse notified the patient that his shift would be ending in an hour.”, followed by “‘His’ refers to”. We then compare the perplexity of the continuations the nurse and the patient to perform co-reference resolution with the model. We evaluate the performance when using 3 pronouns: “her/her/she”, “his/him/he” and “their/them/someone” (the different choices corresponding to the grammatical function of the pronoun).
TruthfulQA
- measure the truthfulness of a model, i.e., its ability to identify when a claim is true.
- the definition of “true” in the sense of “literal truth about the real world”
- This benchmark can evaluate the risks of a model to generate misinformation or false claims. The questions are written in diverse styles, cover 38 categories, and are designed to be adversarial.
- Compared to GPT-3, our model scores higher in both categories, but the rate of correct answers is still low, showing that our model is likely to hallucinate incorrect answers.
- A benchmark measuring whether the model can answer trick questions (like the watermelon-seed one) correctly
Carbon footprint
- The training of our models has consumed a massive quantity of energy, responsible for the emission of carbon dioxide
Wh = GPU-h × (GPU power consumption) × PUE, with PUE = 1.1
tCO2eq = MWh × 0.385
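Plugging in the 65B training run as a rough check (2048 GPUs for ~21 days; the ~400W average per-GPU draw is an assumption, and the paper's own table reports slightly different totals):

```python
gpu_hours = 2048 * 21 * 24          # ≈ 1.03M GPU-hours
power_kw = 0.4                      # assumed average draw per A100-80GB
pue = 1.1
mwh = gpu_hours * power_kw * pue / 1000
tco2eq = mwh * 0.385
print(round(mwh), round(tco2eq))    # ≈ 454 MWh, ≈ 175 tCO2eq
```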
Conclusion
- presented a series of language models that are released openly, and competitive with state-of-the-art foundation models
- LLaMA-13B outperforms GPT-3 while being more than 10× smaller, and LLaMA-65B is competitive with Chinchilla-70B and PaLM-540B
- Unlike previous studies, we show that it is possible to achieve state-of-the-art performance by training exclusively on publicly available data (this is the key point!)
- Finally, we plan to release larger models trained on larger pretraining corpora in the future