(Chinchilla) Training Compute-Optimal Large Language Models
Note
- paper file: Training Compute-Optimal Large Language Models.pdf
- A paper that estimates the functions relating optimal model size, data size, and FLOPs
- Data scaling matters just as much as model scaling!
Author
- Jordan Hoffmann★, Sebastian Borgeaud★, Arthur Mensch★, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, Tom Hennigan, Eric Noland, Katie Millican, George van den Driessche, Bogdan Damoc, Aurelia Guy, Simon Osindero, Karen Simonyan, Erich Elsen, Jack W. Rae, Oriol Vinyals and Laurent Sifre★ (★Equal contributions)
- DeepMind
Abstract
- investigate the optimal model size and number of tokens for training a transformer language model
- By training over 400 language models ranging from 70 million to over 16 billion parameters on 5 to 500 billion tokens, find that for compute-optimal training, the model size and the number of training tokens should be scaled equally
- Model size and the number of training tokens should be scaled in equal proportion
- for every doubling of model size the number of training tokens should also be doubled
- test this hypothesis by training a predicted compute-optimal model, Chinchilla, that uses the same compute budget as Gopher but with 70B parameters and 4× more data
- Chinchilla uniformly and significantly outperforms Gopher (280B), GPT-3 (175B), Jurassic-1 (178B), and Megatron-Turing NLG (530B) on a large range of downstream evaluation tasks.
- So Chinchilla beat all of them on downstream tasks?
- Chinchilla reaches a state-of-the-art average accuracy of 67.5% on the MMLU benchmark, greater than a 7% improvement over Gopher
Introduction
- Issues that come up when training LLMs
- accurately estimating the best model hyperparameters for a given compute budget is critical
- Based on these findings, Chinchilla uses a model 4× smaller than Gopher and 4× more training tokens!
- revisit the question: Given a fixed FLOPs budget,1 how should one trade-off model size and the number of training tokens?
- Over 400 models are trained with varying parameter counts and data sizes, and the authors estimate the functions giving the model parameters N and training tokens D that minimise L(N, D) under the constraint FLOPs(N, D) = C
- we predict that for the compute budget used to train Gopher, an optimal model should be 4 times smaller, while being trained on 4 times more tokens
- verify this by training a more compute-optimal 70B model, called Chinchilla, on 1.4 trillion tokens
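A quick back-of-the-envelope check of that claim, using the common C ≈ 6·N·D approximation for training FLOPs (this approximation and Gopher's ~300B training tokens are my assumptions, not figures from this note):

```python
# Rough sanity check that 70B params x 1.4T tokens lands near Gopher's budget,
# using the standard C ≈ 6 * N * D estimate of training FLOPs.
gopher_flops     = 6 * 280e9 * 300e9    # Gopher: 280B params, ~300B tokens   -> ~5.0e23
chinchilla_flops = 6 * 70e9  * 1.4e12   # Chinchilla: 70B params, 1.4T tokens -> ~5.9e23
print(f"Gopher ≈ {gopher_flops:.2e} FLOPs, Chinchilla ≈ {chinchilla_flops:.2e} FLOPs")
```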
(Figure 1 | Figure A3)
Estimating the optimal parameter/training tokens allocation
- Research Question) Given a fixed FLOPs budget, how should one trade off model size and the number of training tokens?
- Honestly, this is the question I was more curious about
- Model parameters and training tokens should go up in equal proportion
- parameter count and number of training tokens should be increased equally with more compute, with proportions reported in Table 2
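Restating the objective the three approaches estimate (the constrained minimisation described above, written out):

```latex
N_{\mathrm{opt}}(C),\; D_{\mathrm{opt}}(C) \;=\; \operatorname*{argmin}_{N,\,D \ \text{s.t.}\ \mathrm{FLOPs}(N,D)=C} \; L(N,D)
```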
3.1. Approach 1: Fix model sizes and vary number of training tokens
- Did they settle on FLOPs because wall-clock time is awkward to use as the budget variable? I'm still not sure why it has to be FLOPs
- Fix a set of model sizes (ranging from 70M to over 10B parameters), vary the number of training tokens, and measure FLOPs(N, D) = C
- At 1500 logarithmically spaced FLOP values, we find which model size achieves the lowest loss of all models, along with the required number of training tokens
- fit power laws to estimate the optimal model size and number of training tokens for any given amount of compute (see the center and right panels of Figure 2), obtaining the relationships N_opt ∝ C^a and D_opt ∝ C^b
- We find that a = 0.50 and b = 0.50, as summarized in Table 2
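A minimal sketch of that power-law fit; the data points below are made up, only the log-log regression step mirrors what the paper describes:

```python
import numpy as np

# Fit N_opt ∝ C^a by linear regression in log-log space (hypothetical example points).
C     = np.array([1e18, 1e19, 1e20, 1e21, 1e22])    # FLOP budgets (made up)
N_opt = np.array([4e7, 1.3e8, 4e8, 1.3e9, 4e9])     # loss-optimal model size per budget (made up)

a, log_k = np.polyfit(np.log10(C), np.log10(N_opt), deg=1)   # slope = exponent a
print(f"fitted exponent a ≈ {a:.2f}")                        # ≈ 0.5 for these made-up points
```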
3.2. Approach 2: IsoFLOP profiles
- vary the model size for a fixed set of 9 different training FLOP counts (ranging from 6 × 10^18 to 3 × 10^21 FLOPs), and consider the final training loss for each point, in contrast with Approach 1, which considered points (N, D, L) along the entire training runs. This allows us to directly answer the question: for a given FLOP budget, what is the optimal parameter count?
- More training tokens (i.e. a smaller model) can give lower loss even at the same FLOPs, up to the loss-optimal point!
- fit a parabola to each IsoFLOPs curve to directly estimate at what model size the minimum loss is achieved (Figure 3 (left))
- fit a power law between FLOPs and the loss-optimal model size and number of training tokens, shown in Figure 3 (center, right). Again, we fit exponents of the form N_opt ∝ C^a and D_opt ∝ C^b, finding a = 0.49 and b = 0.51, as summarized in Table 2
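A sketch of the IsoFLOP parabola step on made-up points: for one FLOP budget, fit final loss against log(model size) with a quadratic and read off the vertex.

```python
import numpy as np

# One IsoFLOP curve: fit loss vs. log10(model size) with a parabola (made-up numbers).
log_N = np.log10(np.array([3e8, 1e9, 3e9, 1e10]))   # model sizes tried at this budget
loss  = np.array([2.45, 2.30, 2.28, 2.36])          # final training loss at each size

c2, c1, c0 = np.polyfit(log_N, loss, deg=2)         # loss ≈ c2*x^2 + c1*x + c0
log_N_opt = -c1 / (2 * c2)                          # vertex of the parabola
print(f"loss-optimal size at this budget ≈ {10**log_N_opt:.2e} parameters")
```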
3.3. Approach 3: Fitting a parametric loss function
- model all final losses from the experiments in Approaches 1 & 2 as a parametric function of model parameter count and the number of seen tokens: L(N, D) = E + A/N^α + B/D^β
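A sketch of how the fitted function can be used to pick an allocation, assuming the constants reported in the paper (E ≈ 1.69, A ≈ 406.4, B ≈ 410.7, α ≈ 0.34, β ≈ 0.28) and the C ≈ 6·N·D approximation; the actual fitting procedure (Huber loss, L-BFGS, many initialisations) is not reproduced here.

```python
import numpy as np

# Parametric loss from Approach 3: L(N, D) = E + A / N**alpha + B / D**beta
E, A, B, alpha, beta = 1.69, 406.4, 410.7, 0.34, 0.28   # fitted constants reported in the paper

def predicted_loss(N, D):
    return E + A / N**alpha + B / D**beta

def optimal_allocation(C_flops):
    """Minimise predicted_loss(N, D) subject to 6 * N * D = C_flops (grid search over N)."""
    N = np.logspace(6, 13, 20_000)        # candidate model sizes
    D = C_flops / (6 * N)                 # tokens implied by the compute budget
    i = np.argmin(predicted_loss(N, D))
    return N[i], D[i]

N_star, D_star = optimal_allocation(5.76e23)   # a roughly Gopher-scale budget (assumed figure)
print(f"N_opt ≈ {N_star:.2e} params, D_opt ≈ {D_star:.2e} tokens")
```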
3.4. Optimal model scaling
Unlike the earlier scaling-law work (Kaplan et al. (2020)), parameters and data here scale almost equally
Honestly, the table below was the most important part of this paper for me
4. Chinchilla
4.1. Model and training details
- train Chinchilla on MassiveText (the same dataset as Gopher) but use a slightly different subset distribution (shown in Table A1) to account for the increased number of training tokens
- use AdamW (Loshchilov and Hutter, 2019) for Chinchilla, rather than the Adam optimiser used for Gopher
- train Chinchilla with a slightly modified SentencePiece (Kudo and Richardson, 2018) tokenizer that does not apply NFKC normalisation (why did they skip this?)
- find that this particularly helps with the representation of mathematics and chemistry (could this help on benchmarks like MMLU?!)
- the vocabulary is very similar: 94.15% of tokens are the same as those used for training Gopher
- the forward and backward pass are computed in bfloat16, while we store a float32 copy of the weights
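A minimal PyTorch sketch of that scheme on a toy linear model (my own illustration, not the paper's training code): run the forward/backward pass in bfloat16 while keeping and updating a float32 master copy of the weights.

```python
import torch

master_w = torch.randn(512, 512)                      # float32 master copy of the weights

def train_step(x: torch.Tensor, lr: float = 1e-3) -> float:
    w = master_w.to(torch.bfloat16).requires_grad_()  # bfloat16 copy used for compute
    y = x.to(torch.bfloat16) @ w                      # forward pass in bfloat16
    loss = y.float().pow(2).mean()                    # toy loss
    loss.backward()                                   # backward pass in bfloat16
    with torch.no_grad():
        master_w -= lr * w.grad.float()               # optimiser update on the float32 copy
    return loss.item()

print(train_step(torch.randn(8, 512)))
```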
4.2. Results
4.2.1. Language modeling
- What is bits-per-byte (bpb)? It is the language-modelling loss expressed in bits and normalised by the number of raw-text bytes, so it is comparable across tokenizers (a quick conversion sketch is below)
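For reference, a small sketch of the standard bits-per-byte conversion (the usual definition, not something specific to this paper): total cross-entropy in nats, converted to bits and divided by the document length in bytes.

```python
import math

def bits_per_byte(total_nll_nats: float, num_bytes: int) -> float:
    """Cross-entropy in nats over a document, converted to bits per raw byte."""
    return total_nll_nats / (num_bytes * math.log(2))

# e.g. 1000 tokens at 2.0 nats/token over a 4000-byte document (made-up numbers)
print(bits_per_byte(1000 * 2.0, 4000))   # ≈ 0.72 bpb
```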
4.2.2. MMLU
- (remaining results omitted)
Discussion & Conclusion
- The trend so far in large language model training has been to increase the model size, often without increasing the number of training tokens
- The trend was to scale up only the model size without scaling the tokens, and that turns out to be the wrong call
- propose three predictive approaches towards optimally setting model size and training duration, based on the outcome of over 400 training runs
- They ran a lot of experiments
- Though there has been significant recent work allowing larger and larger models to be trained, our analysis suggests an increased focus on dataset scaling is needed
- Data scaling is needed too! (backed by data quality, of course)
- Larger datasets will require extra care to ensure train-test set overlap is properly accounted for, both in the language modelling loss but also with downstream tasks
- To get both the LM loss and the downstream-task results right, we need to pay attention to the data (especially train-test overlap)
- Chinchilla does suffer from bias and toxicity but interestingly it seems less affected than Gopher
- Chinchilla also has bias and toxicity issues, but less so than Gopher.
Appendix
Training dataset
D.3. Predicted compute optimal frontier for all three methods