(Chinchilla) Training Compute-Optimal Large Language Models



  • Jordan Hoffmann★, Sebastian Borgeaud★, Arthur Mensch★, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, Tom Hennigan, Eric Noland, Katie Millican, George van den Driessche, Bogdan Damoc, Aurelia Guy, Simon Osindero, Karen Simonyan, Erich Elsen, Jack W. Rae, Oriol Vinyals and Laurent Sifre★ (★Equal contributions)
    • DeepMind


  • investigate the optimal model size and number of tokens for training a transformer language model
  • By training over 400 language models ranging from 70 million to over 16 billion parameters on 5 to 500 billion tokens, find that for compute-optimal training, the model size and the number of training tokens should be scaled equally
    • Model size and the number of training tokens scale proportionally
  • for every doubling of model size the number of training tokens should also be doubled
  • test this hypothesis by training a predicted compute-optimal model, Chinchilla, that uses the same compute budget as Gopher but with 70B parameters and 4× more data
  • Chinchilla uniformly and significantly outperforms Gopher (280B), GPT-3 (175B), Jurassic-1 (178B), and Megatron-Turing NLG (530B) on a large range of downstream evaluation tasks.
    • Did Chinchilla really win on every downstream task?
  • Chinchilla reaches a state-of-the-art average accuracy of 67.5% on the MMLU benchmark, greater than a 7% improvement over Gopher


  • Issues that come up when training LLMs
    • accurately estimating the best model hyperparameters for a given compute budget is critical
      • Based on these findings, Chinchilla is 4× smaller than Gopher and trained on 4× more tokens!
  • revisit the question: Given a fixed FLOPs budget, how should one trade off model size and the number of training tokens?
  • They trained over 400 models across a range of parameter counts and data sizes, and estimated N_opt(C) and D_opt(C): the model size N and number of training tokens D that minimize L(N, D) under the constraint FLOPs(N, D) = C
  • we predict that for the compute budget used to train Gopher, an optimal model should be 4 times smaller, while being trained on 4 times more tokens
  • verify this by training a more compute-optimal 70B model, called Chinchilla, on 1.4 trillion tokens
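As a rough sanity check on the "same compute budget" claim, the sketch below uses the common approximation FLOPs ≈ 6·N·D, which the paper also adopts. Gopher's token count of 300B is an assumption taken from the Gopher paper rather than from these notes, and `approx_train_flops` is a hypothetical helper name.

```python
def approx_train_flops(n_params: float, n_tokens: float) -> float:
    """Approximate training compute with the rule of thumb C ~ 6 * N * D."""
    return 6.0 * n_params * n_tokens

# Gopher: 280B parameters, 300B training tokens (assumption from the Gopher paper).
gopher_flops = approx_train_flops(280e9, 300e9)
# Chinchilla: 70B parameters, 1.4T training tokens (from the notes above).
chinchilla_flops = approx_train_flops(70e9, 1.4e12)

ratio = chinchilla_flops / gopher_flops
tokens_per_param = 1.4e12 / 70e9

print(f"Gopher     ~ {gopher_flops:.2e} FLOPs")
print(f"Chinchilla ~ {chinchilla_flops:.2e} FLOPs")
print(f"ratio      ~ {ratio:.2f}")
print(f"Chinchilla tokens per parameter ~ {tokens_per_param:.0f}")
```

Under this approximation the two budgets agree to within about 20%, and Chinchilla ends up with roughly 20 training tokens per parameter.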
[Figure 1 and Figure A3 from the paper; images not included]

Estimating the optimal parameter/training tokens allocation

  • Research Question) Given a fixed FLOPs budget, how should one trade-off model size and the number of training tokens?
    • Honestly, this is the question I'm more curious about
  • Model parameters and training tokens should grow in equal proportion
    • parameter count and number of training tokens should be increased equally with more compute, with proportions reported in Table 2

3.1. Approach 1: Fix model sizes and vary number of training tokens

  • Did they use FLOPs because wall-clock time is awkward to use as a variable? I'm not sure why it has to be FLOPs specifically
  • Fix the model size within a range (from 70M to over 10B parameters) and measure under FLOPs(𝑁, 𝐷) = 𝐶
  • At 1500 logarithmically spaced FLOP values, we find which model size achieves the lowest loss of all models along with the required number of training tokens
  • fit power laws to estimate the optimal model size and number of training tokens for any given amount of compute
    (see the center and right panels of Figure 2)
    • obtaining a relationship 𝑁𝑜𝑝𝑡 ∝ 𝐶^𝑎 and 𝐷𝑜𝑝𝑡 ∝ 𝐶^𝑏
    • We find that 𝑎 = 0.50 and 𝑏 = 0.50, as summarized in Table 2.
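The power-law fit described above amounts to a straight-line fit in log-log space. The following is a minimal sketch using only the standard library; the data points are synthetic stand-ins generated from a known power law (not values from the paper), and `fit_power_law` is a hypothetical helper name.

```python
import math

def fit_power_law(xs, ys):
    """Least-squares fit of y = k * x**p in log-log space; returns (k, p)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(lx)
    mean_x = sum(lx) / n
    mean_y = sum(ly) / n
    sxx = sum((x - mean_x) ** 2 for x in lx)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(lx, ly))
    p = sxy / sxx                       # slope in log-log space = exponent
    k = math.exp(mean_y - p * mean_x)   # intercept in log-log space = log(k)
    return k, p

# Hypothetical (compute, optimal model size) pairs from N_opt = 0.1 * C**0.5.
compute = [1e18, 1e19, 1e20, 1e21, 1e22]
n_opt = [0.1 * c ** 0.5 for c in compute]

k, a = fit_power_law(compute, n_opt)
print(f"fitted exponent a = {a:.3f}")

# With a = 0.5, quadrupling compute doubles the optimal model size.
growth = 4 ** a
print(f"4x compute -> {growth:.2f}x parameters")
```

With 𝑎 = 𝑏 = 0.5, a 4× larger budget is split evenly: about 2× more parameters and 2× more tokens.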

3.2. Approach 2: IsoFLOP profiles

  • vary the model size for a fixed set of 9 different training FLOP counts (ranging from 6 × 10^18 to 3 × 10^21 FLOPs), and consider the final training loss for each point
  • in contrast with Approach 1 that considered points (𝑁, 𝐷, 𝐿) along the entire training runs. This allows us to directly answer the question: For a given FLOP budget, what is the optimal parameter count?
  • At the same FLOPs budget, more tokens (and thus a smaller model) can give a lower loss!
  • fit a parabola to each IsoFLOPs curve to directly estimate at what model size the minimum loss is achieved (Figure 3 (left))
  • fit a power law between FLOPs and loss-optimal model size and number of training tokens, shown in
    Figure 3 (center, right). Again, we fit exponents of the form 𝑁𝑜𝑝𝑡 ∝ 𝐶^𝑎 and 𝐷𝑜𝑝𝑡 ∝ 𝐶^𝑏, finding 𝑎 = 0.49 and 𝑏 = 0.51, as summarized in Table 2.
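To make the parabola step concrete, here is a small sketch that fits a quadratic to one IsoFLOP curve and reads off the minimum. It assumes the parabola is fitted in log10(N); the loss values are hypothetical (generated from an exact quadratic), and `fit_parabola` and `isoflop_minimum` are illustrative names, not the paper's code.

```python
def fit_parabola(xs, ys):
    """Least-squares fit of y = c0 + c1*x + c2*x**2; returns (c0, c1, c2)."""
    # Normal equations for the basis [1, x, x^2], stored as an augmented 3x4 matrix.
    s = [sum(x ** k for x in xs) for k in range(5)]
    t = [sum(y * x ** k for x, y in zip(xs, ys)) for k in range(3)]
    m = [[s[i + j] for j in range(3)] + [t[i]] for i in range(3)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for row in range(3):
            if row != col:
                factor = m[row][col] / m[col][col]
                m[row] = [a - factor * b for a, b in zip(m[row], m[col])]
    return tuple(m[i][3] / m[i][i] for i in range(3))

def isoflop_minimum(log_n, losses):
    """Return the log10(N) at which the fitted parabola is lowest."""
    center = sum(log_n) / len(log_n)   # centre x for numerical stability
    u = [x - center for x in log_n]
    _, c1, c2 = fit_parabola(u, losses)
    if c2 <= 0:
        raise ValueError("fitted curve has no minimum")
    return center - c1 / (2.0 * c2)    # vertex of the parabola

# Hypothetical IsoFLOP curve: final loss at five model sizes for one FLOP budget.
log_n = [8.0, 8.5, 9.0, 9.5, 10.0]
losses = [2.0 + 0.1 * (x - 9.2) ** 2 for x in log_n]

best_log_n = isoflop_minimum(log_n, losses)
print(f"loss-optimal model size ~ {10 ** best_log_n:.3g} parameters")
```

Repeating this for each of the 9 FLOP budgets gives one (C, N_opt) point per curve, which is what the subsequent power-law fit consumes.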


3.3. Approach 3: Fitting a parametric loss function

  • model all final losses from experiments in Approach 1 & 2 as a parametric function of model parameter count and the number of seen tokens
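These notes do not reproduce the formula, so the sketch below states it as an assumption taken from the paper: L(N, D) = E + A / N^α + B / D^β, with the constraint FLOPs(N, D) ≈ 6·N·D. The constants are the paper's rounded fitted values and the Gopher budget figure is likewise quoted from the paper, so the printed numbers are illustrative only.

```python
# Rounded fitted constants as reported in the paper (assumption; illustrative only).
E, A, B, ALPHA, BETA = 1.69, 406.4, 410.7, 0.34, 0.28

def parametric_loss(n_params, n_tokens):
    """L(N, D) = E + A / N**alpha + B / D**beta."""
    return E + A / n_params ** ALPHA + B / n_tokens ** BETA

def compute_optimal(flops):
    """Minimise L(N, D) subject to 6 * N * D = flops, using the closed form."""
    g = (ALPHA * A / (BETA * B)) ** (1.0 / (ALPHA + BETA))
    a = BETA / (ALPHA + BETA)    # exponent in N_opt ~ C**a
    b = ALPHA / (ALPHA + BETA)   # exponent in D_opt ~ C**b
    n_opt = g * (flops / 6.0) ** a
    d_opt = (flops / 6.0) ** b / g
    return n_opt, d_opt

budget = 5.76e23  # assumption: the Gopher compute budget quoted in the paper
n_opt, d_opt = compute_optimal(budget)

print(f"N_opt ~ {n_opt:.3g} parameters, D_opt ~ {d_opt:.3g} tokens")
print(f"loss at optimum ~ {parametric_loss(n_opt, d_opt):.3f}")
```

With these rounded constants the implied exponents are about 0.45 for parameters and 0.55 for tokens, close to but not identical with the values from Approaches 1 and 2.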

3.4. Optimal model scaling

  • Unlike the earlier work (Kaplan et al., 2020), parameters and data show nearly equal scaling

  • Honestly, the table below was the most important part of this paper

4. Chinchilla

4.1. Model and training details

  • train Chinchilla on MassiveText (the same dataset as Gopher) but use a slightly different subset distribution (shown in Table A1) to account for the increased number of training tokens
  • AdamW (Loshchilov and Hutter, 2019) for Chinchilla
  • train Chinchilla with a slightly modified SentencePiece (Kudo and Richardson, 2018)
    tokenizer that does not apply NFKC normalisation (why did they skip this?)
    • find that this particularly helps with the representation of mathematics and chemistry (might help on something like MMLU?!)
    • vocabulary is very similar: 94.15% of tokens are the same as those used for training Gopher
  • forward and backward pass are computed in bfloat16, but we store a float32 copy of the weights
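One plausible reason skipping NFKC helps with mathematics and chemistry is that NFKC folds superscripts and subscripts into plain digits. The paper only states the effect, so treat this as an illustration rather than the authors' explanation; `apply_nfkc` is a hypothetical helper built on Python's standard `unicodedata` module.

```python
import unicodedata

def apply_nfkc(text: str) -> str:
    """Apply Unicode NFKC (compatibility) normalisation to a string."""
    return unicodedata.normalize("NFKC", text)

samples = ["x²", "H₂O"]
for text in samples:
    # Superscript and subscript digits are replaced by ordinary digits.
    print(f"{text!r} -> {apply_nfkc(text)!r}")
```

After normalisation "x²" and "x2" become the same string, as do "H₂O" and "H2O", so a tokenizer that applies NFKC cannot tell them apart.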

4.2. Results


4.2.1. Language modelling

  • What is bits-per-byte (bpb)?
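In short, bits-per-byte is the model's total negative log-likelihood on a text, converted from nats to bits and divided by the number of UTF-8 bytes. Because it normalises by bytes rather than tokens, it allows comparison between models with different tokenizers. A minimal sketch follows; `bits_per_byte` is a hypothetical helper and the token probabilities are made up.

```python
import math

def bits_per_byte(token_nll_nats, text):
    """Convert per-token negative log-likelihoods (in nats) to bits per UTF-8 byte."""
    total_bits = sum(token_nll_nats) / math.log(2)   # nats -> bits
    n_bytes = len(text.encode("utf-8"))
    return total_bits / n_bytes

# Hypothetical example: a 12-byte string split into 3 tokens, each assigned
# probability 1/4 by the model (NLL = ln 4 nats = 2 bits per token).
text = "hello world!"
nll = [math.log(4)] * 3

bpb = bits_per_byte(nll, text)   # 6 bits spread over 12 bytes
print(f"bpb = {bpb:.2f}")
```

Lower is better: a model that predicts the text more confidently spends fewer bits per byte.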

4.2.2. MMLU

  • (the rest is omitted)

Discussion & Conclusion

  • The trend so far in large language model training has been to increase the model size, often without increasing the number of training tokens
    • The trend was to grow only model size without growing the number of tokens, but that turned out to be wrong
  • propose three predictive approaches towards optimally setting model size and training duration, based on the outcome of over 400 training runs
    • They ran a lot of experiments
  • Though there has been significant recent work allowing larger and larger models to be trained, our analysis suggests an increased focus on dataset scaling is needed
    • Dataset scaling is needed too! (Of course, data quality has to back it up)
  • Larger datasets will require extra care to ensure train-test set overlap is properly accounted for, both in the language modelling loss but also with downstream tasks
    • To do well on both LM loss and downstream tasks, pay attention to the amount of data
  • Chinchilla does suffer from bias and toxicity but interestingly it seems less affected than Gopher
    • Chinchilla also had bias and toxicity issues, but less so than Gopher





D.3. Predicted compute optimal frontier for all three methods





Joosung Yoon
