Pythia (A Suite for Analyzing Large Language Models Across Training and Scaling)


  • Compare with follow-up work later
  • Based on the 23/4/3 version


  • Stella Biderman * 1 2 Hailey Schoelkopf * 1 3 Quentin Anthony 1 Herbie Bradley 1 4 Kyle O’Brien 1 Eric Hallahan 1 Mohammad Aflah Khan 5 Shivanshu Purohit 1 USVSN Sai Prashanth 1 Edward Raff 2 Aviya Skowron 1 Lintang Sutawika 1 6 Oskar van der Wal 7
    • EleutherAI


  • Research Questions
    • How do large language models (LLMs) develop and evolve over the course of training?
    • How do these patterns change as models scale?
  • introduce Pythia, a suite of 16 LLMs all trained on public data seen in the exact same order and ranging in size from 70M to 12B parameters
  • provide public access to 154 checkpoints for each one of the 16 models
  • present several case studies including novel results in memorization, term frequency effects on few-shot performance, and reducing gender bias
  • Trained models, analysis code, training code, and training data can be found at


  • Critical to understanding the functioning of transformers is better understanding how these models behave along two axes: training and scaling.
    • It is well established that there are regular and predictable patterns in the behavior of trained language models as they scale, but prior work connecting these “Scaling Laws” to the learning dynamics of language models is minimal.
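      • Scaling laws describe predictable patterns in model behavior as size grows, but little work connects them to what happens during training itself (my reading; I am not fully sure about this part)
    • Because many models are non-public, a project like Pythia was needed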
  • The Pythia suite is the only publicly released suite of LLMs that satisfies three key properties:
    • Models span several orders of magnitude of model scale.
    • All models were trained on the same data in the same order.
    • The data and intermediate checkpoints are publicly available for study.
  • We train 8 model sizes each on both the Pile (Gao et al., 2020; Biderman et al., 2022) and the Pile after deduplication, providing 2 copies of the suite which can be compared.
    • The effect of deduplication will also be examined

Mitigating Gender Bias

  • Various prior work
    • Some work has explored finetuning’s effects on bias in language models
    • the relationship between the corpus statistics and the measured bias
  • researchers have generally lacked the tools to study the role of the training data on the learning dynamics of bias in large language models of different sizes.
    • Researchers have few tools for analyzing how training data affects the learning dynamics of bias across model sizes, which is the gap Pythia is meant to fill
  • we analyze whether deliberately modifying the frequency of gendered terms in the pretraining data of a language model can have an impact on its downstream behavior and biases.
    • This seems almost self-evident..
  • We leverage the known pretraining data and public training codebase of our model suite, and counterfactually retrain models such that the last 7% and 21% of model training has a majority of pronouns modified such that their grammatical gender is feminine rather than masculine.
    • That is, the last 7% and 21% of training are rerun with pronouns modified so that their grammatical gender is feminine rather than masculine
  • We demonstrate that such interventions are successful at reducing bias measures on a targeted benchmark

Memorization is a Poisson Point Process

  • Research Questions
    • does the location of a particular sequence in the training dataset influence the likelihood of it being memorized?
      • Leveraging Pythia’s reproducible dataloader setup we answer this question in the negative, and furthermore find that a poisson point process is a very good model for the occurrence of memorized sequences over the course of training.
        • Takeaway: a sequence's position within the training set does not matter for memorization!

Emergence of the Impact of Pretraining Frequencies

  • Recent work has identified the frequency of specific facts within a corpus as an important factor in how likely a model is capable of applying that fact in response to a natural language question
  • Existing work has been heavily dependent on the handful of models trained on public data, such as GPT-J (Wang & Komatsuzaki, 2021) and BLOOM (Scao et al., 2022), which lack frequent intermediate checkpoints, so none of these papers are able to look at the fine-grained evolution of this phenomenon over the course of training.
    • To address this gap in the literature, we examine how the role of pretraining term frequencies changes over the course of training
    • We find that a significant phase change occurs after 65,000 training steps (45% through training): the models with 2.8 billion parameters or more start to exhibit a correlation between task accuracy and occurrence of task-relevant terms

The Pythia Suite

  • we prioritize consistency in model design and controlling for as many potential sources of variation as possible rather than trying to eke out the most performance from each model.
    • The design controls modeling factors rather than squeezing out performance
  • For example we use the parallel attention and feedforward approach for all models, as it is becoming widely used for the largest models, even though it is generally not recommended for models with less than 2.7B parameters
    • This technique computes attention and the feedforward layer in parallel; worth reading alongside the code later

Requirements for a Scientific Suite of LLMs

  • Pythia is envisioned as a suite for enabling and empowering scientific research on the capacities and limitations of large language models
  • we found no existing suites of models which satisfied all the following conditions:
    • Public Access (Model, data)
    • Training Provenance
      • Intermediate checkpoints are available for analysis, all models are trained with the same data ordering, and intermediate checkpoints can be linked with the exact data seen up to that checkpoint. Training procedure as well as model and training hyperparameters are well-documented.
    • Consistency Across Scale
      • Model scaling sequences should have self-consistent design decisions that reasonably adhere to common practice for training state-of-the-art large models


Training Data

  • We train our models on the Pile (Gao et al., 2020; Biderman et al., 2022), a curated collection of English language datasets for training large language models that is popular for training large autoregressive transformers.
  • This dataset has three major benefits over its competitors:
    • first, it is freely and publicly available;
    • second, it reports a higher downstream performance (Le Scao et al., 2022) than popular crawl-based datasets C4 (Raffel et al., 2020; Dodge et al., 2021) and OSCAR (Suárez et al., 2019); and
    • third, it has been widely used by state-of-the-art models including GPT-J-6B (Wang & Komatsuzaki, 2021), GPT-NeoX-20B (Black et al., 2022), Jurassic-1 (Lieber et al., 2021), Megatron-Turing NLG 530B (Smith et al., 2022), OPT (Zhang et al., 2022), and WuDao (Tang, 2021).
  • We use the tokenizer developed by Black et al. (2022), which is a BPE tokenizer that is trained specifically on the Pile.
  • Multilingual training was considered but not used, for the following reasons
    • While we are confident that we are generally aware of the contents and quality of the Pile, we cannot say the same for multilingual datasets (the Pile's quality is known, but the quality of corpora in other languages cannot be verified)
    • As this framework is intended to be used as a baseline for future research, we feel it is important to stay close to currently accepted common practices. (the work is meant as a baseline for future research, so it is kept simple)
    • We do not have access to a multilingual evaluation framework that is anywhere near as comprehensive as Gao et al. (2021).
      • There is no multilingual evaluation framework as comprehensive as lm-evaluation-harness..?! Hmm, couldn't more languages simply be added to it? I don't quite follow this point
  • We train 2 copies of the Pythia suite using identical architectures. (to compare training on the original data vs. the deduplicated data)
    • Each suite contains 8 models spanning 8 different sizes.
    • We train one suite of 8 models on the Pile, and the other on a copy of the Pile after applying near-deduplication with MinHashLSH and a threshold of 0.87, following the advice that LLMs trained on deduplicated data are better and memorize less of their data (Lee et al., 2021).
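
The near-deduplication step above can be illustrated with a minimal MinHash sketch. This is not the authors' pipeline: the shingle size, the number of hash functions, and the helper names (`shingles`, `minhash_signature`, `is_near_duplicate`) are assumptions made for illustration, and a real MinHashLSH setup additionally buckets signatures into bands so that not every pair of documents has to be compared. Only the 0.87 threshold comes from the text.

```python
import hashlib
import re


def shingles(text, n=3):
    """Return the set of word n-grams (shingles) in `text`."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < n:
        return {" ".join(words)}
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def minhash_signature(items, num_perm=128):
    """Keep the minimum hash of the set under each of `num_perm` salted hashes."""
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "little")  # a different salt acts as a different hash function
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(s.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "little",
            )
            for s in items
        ))
    return signature


def estimated_jaccard(sig_a, sig_b):
    """The fraction of matching signature slots estimates the Jaccard similarity."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


def is_near_duplicate(text_a, text_b, threshold=0.87, num_perm=128):
    """Flag a pair as near-duplicates when the estimated similarity reaches the threshold."""
    sig_a = minhash_signature(shingles(text_a), num_perm)
    sig_b = minhash_signature(shingles(text_b), num_perm)
    return estimated_jaccard(sig_a, sig_b) >= threshold
```

With a threshold as high as 0.87, only documents that share almost all of their shingles are treated as duplicates, so lightly edited copies are removed while merely related documents are kept.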


  • Our model architecture and hyperparameters largely follow Brown et al. (2020) (i.e., GPT-3), with a few notable deviations based on recent advances in best practices for large scale language modeling (GPT-3 style, with only a few changes)
    • Brown et al. (2020) describes using sparse and dense attention layers in alternation, while we follow all subsequent work and use fully dense layers for our models
    • We use Flash Attention (Dao et al., 2022) during training for improved device throughput.
    • We use rotary embeddings introduced by Su et al. (2021) and now in widespread use (Black et al., 2022; Chowdhery et al., 2022; Zeng et al., 2022) as our positional embedding type of choice.
      • [Figure: rotary position embedding; source: Black et al. (2022) (GPT-NeoX-20B)]
        • Intuition behind Rotary Position Embedding: simply rotate the affine-transformed word embedding vector by an angle that is a multiple of its position index
    • We use the parallelized attention and feedforward technique and model initialization methods introduced by Wang & Komatsuzaki (2021) and adopted by (Black et al., 2022; Chowdhery et al., 2022), because they improve training efficiency and do not harm performance.
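# GPT-J style block; GPTJAttention and GPTJMLP come from the Hugging Face
# transformers GPT-J module (exact signatures may vary by library version).
from typing import Optional, Tuple, Union
import torch
from torch import nn
from transformers.models.gptj.modeling_gptj import GPTJAttention, GPTJMLP

class GPTJBlock(nn.Module):
    def __init__(self, config):
        super().__init__()
        inner_dim = config.n_inner if config.n_inner is not None else 4 * config.n_embd
        self.ln_1 = nn.LayerNorm(config.n_embd, eps=config.layer_norm_epsilon)
        self.attn = GPTJAttention(config)
        self.mlp = GPTJMLP(inner_dim, config)

    def forward(
        self,
        hidden_states: Optional[torch.FloatTensor],
        layer_past: Optional[Tuple[torch.Tensor]] = None,
        attention_mask: Optional[torch.FloatTensor] = None,
        position_ids: Optional[torch.LongTensor] = None,
        head_mask: Optional[torch.FloatTensor] = None,
        use_cache: Optional[bool] = False,
        output_attentions: Optional[bool] = False,
    ) -> Union[Tuple[torch.Tensor], Optional[Tuple[torch.Tensor, Tuple[torch.FloatTensor, ...]]]]:
        residual = hidden_states
        # A single LayerNorm feeds both the attention and the MLP branch.
        hidden_states = self.ln_1(hidden_states)
        attn_outputs = self.attn(
            hidden_states=hidden_states,
            layer_past=layer_past,
            attention_mask=attention_mask,
            position_ids=position_ids,
            head_mask=head_mask,
            use_cache=use_cache,
            output_attentions=output_attentions,
        )
        attn_output = attn_outputs[0]  # output_attn: a, present, (attentions)
        outputs = attn_outputs[1:]

        feed_forward_hidden_states = self.mlp(hidden_states)
        # Key point: attention and MLP outputs are added to the residual in parallel.
        hidden_states = attn_output + feed_forward_hidden_states + residual

        if use_cache:
            outputs = (hidden_states,) + outputs
        else:
            outputs = (hidden_states,) + outputs[1:]

        return outputs  # hidden_states, present, (attentions)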
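
The rotary embedding intuition from the list above (rotate each vector by an angle proportional to its position index) can also be sketched in a few lines of plain Python. This is a toy illustration, not the GPT-NeoX implementation: the function names, the pairing of adjacent dimensions, and the base of 10000 are assumptions, and real implementations differ in how they pair dimensions and may rotate only part of each attention head.

```python
import math


def rotary_embed(vector, position, base=10000.0):
    """Rotate each adjacent pair (x[i], x[i+1]) by the angle position * base**(-i/d)."""
    d = len(vector)
    if d % 2 != 0:
        raise ValueError("vector dimension must be even")
    out = []
    for i in range(0, d, 2):
        theta = position * base ** (-i / d)  # lower frequency for later pairs
        cos_t, sin_t = math.cos(theta), math.sin(theta)
        x, y = vector[i], vector[i + 1]
        out.extend([x * cos_t - y * sin_t, x * sin_t + y * cos_t])
    return out


def dot(a, b):
    """Plain dot product, standing in for the query-key attention score."""
    return sum(x * y for x, y in zip(a, b))
```

Because each pair is rotated by an angle proportional to the position, the dot product between a rotated query at position m and a rotated key at position n depends only on the offset m - n, which is the property that makes this a relative position encoding.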
Tuned Lens What is a Lens?


  • We train our models using the open source library GPT-NeoX (Andonian et al., 2021) developed by EleutherAI
    • We train with Adam and leverage the Zero Redundancy Optimizer (ZeRO) to efficiently scale to multi-machine set-ups
    • We additionally leverage data parallelism (Goyal et al., 2017) and tensor parallelism (Shoeybi et al., 2019)
    • We use Flash Attention (Dao et al., 2022) for improved hardware throughput
  • The most notable divergence from standard training procedures is that we use a much larger batch size than what is standard for training small language models
    • The conventional view is that larger batch sizes are desirable, but that smaller LLMs require smaller batch sizes to avoid convergence issues.
      • In other words, large batch sizes are considered good for large models
  • Consequently, we use a batch size of 1024 samples with a sequence length of 2048 (2,097,152 tokens) for all models
  • A maximum batch size therefore directly implies a minimum wall-clock training time and maximum number of compute-saturated GPUs. By inflating batch sizes beyond previous standards, we achieve wall-clock speed-ups of factors as large as 10× compared with standard batch sizes on our smaller models (Table 6).
  • Checkpoint schedule
    • We save model checkpoints at initialization and every 2,097,152,000 tokens (or 1,000 iterations), resulting in 144 checkpoints evenly spaced throughout training
    • Additionally, we save log-spaced checkpoints early in training at iterations {1, 2, 4, 8, 16, 32, 64, 128, 256, 512} (This gives a total of 154 checkpoints per model)
  • We train all models for 299,892,736,000 ≈ 300B tokens
    • This equates to 1 epoch on the original Pile, and ≈1.5 epochs on the deduplicated Pile, which is 207B tokens in size
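
The batch, checkpoint, and token numbers quoted above are mutually consistent, which a short calculation makes explicit. This is only a worked check of the figures in the text; the variable names are illustrative, and the step count of 143,000 is derived here from the quoted totals rather than taken from the post.

```python
BATCH_SIZE = 1024                   # sequences per step (from the text)
SEQ_LEN = 2048                      # tokens per sequence (from the text)
TOTAL_TOKENS = 299_892_736_000      # total training tokens (from the text)
DEDUP_PILE_TOKENS = 207_000_000_000 # size of the deduplicated Pile (from the text)

tokens_per_step = BATCH_SIZE * SEQ_LEN
total_steps = TOTAL_TOKENS // tokens_per_step  # derived, not quoted in the text

# Evenly spaced checkpoints: initialization (step 0) plus every 1,000 steps.
evenly_spaced = list(range(0, total_steps + 1, 1000))
# Log-spaced early checkpoints at steps 1, 2, 4, ..., 512.
log_spaced = [2 ** i for i in range(10)]
checkpoints = sorted(set(evenly_spaced + log_spaced))


def tokens_seen(step):
    """Number of training tokens consumed by the time `step` is reached."""
    return step * tokens_per_step


epochs_dedup = TOTAL_TOKENS / DEDUP_PILE_TOKENS

print(tokens_per_step, total_steps, len(evenly_spaced), len(checkpoints))
print(round(epochs_dedup, 2))
```

Since every model shares the same batch size and data order, `tokens_seen(step)` maps any checkpoint to the exact amount of data the model had consumed, which is what lets checkpoints be linked to the data seen so far.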


  • we find that Pythia and Pythia (Deduplicated) perform very similarly to OPT and BLOOM models on a variety of NLP benchmarks
  • We use the Language Model Evaluation Harness (Gao et al., 2021) to run evaluations on eight common language modeling benchmarks:
    • OpenAI’s LAMBADA variant,
    • PIQA,
    • the Winograd Schema Challenge,
    • WinoGrande,
    • ARC (easy and challenge sets separately),
    • SciQ, and
    • LogiQA


Novel Observations in Evaluation

  • We find three interesting phenomena that run counter to the prevailing narratives in the literature
    • Firstly, we find that deduplication of our training data has no clear benefit on language modeling performance.
      • This is consistent with the results of Black et al. (2022)(GPT-NeoX-20B), but inconsistent with other papers.
      • This may indicate that the upsampling of certain subsets of the Pile does not accord with conventional assumptions about duplicated data, or that the general tendency of deduplicated data to outperform non-deduplicated data is primarily a statement about the quality of the data used in other works
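        • The authors ask whether the Pile simply differs from the kind of data usually assumed to need deduplication, or whether the benefit of deduplication is ultimately a data-quality issue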
    • Secondly, we find that we achieve (equi-token and equi-parameter) performance on-par with OPT despite the use of parallel attention + MLP sublayers at all model scales
      • Performance holds up even with parallel attention + MLP sublayers! Prior work, however, reports a performance drop below 6B parameters
        • Both Black et al. (2022) and Chowdhery et al. (2022) state that this architecture choice causes a performance regression at scales < 6B parameters
    • Thirdly, we find a minimal and inconsistent “curse of multilinguality” (Conneau et al., 2020; Pfeiffer et al., 2022) for BLOOM
      • BLOOM underperforms on some tasks but not on others (WinoGrande, ARC-easy, ARC-challenge, SciQ, and LogiQA), so the authors argue that the curse of multilinguality should be re-examined


Public Release and Reproducibility

  • we use the open source GPT-NeoX and DeepSpeed libraries for training
  • For evaluating our models we use the Language Model Evaluation Harness
  • We release all of our models and checkpoints to the public under the Apache 2.0 license via the HuggingFace Hub
  • In addition to training our models on the public Pile dataset, we also provide a tool for downloading the pre-tokenized data files utilized by our dataloader in the GPT-NeoX library

Case Studies

How Does Data Bias Influence Learned Behaviors?

  • We seek to investigate a counterfactual claim—if we were to train our models on a corpus with different properties, how would these models’ properties change downstream?
    • To test the effects of corpus statistics on the biases learned by language models, we repeat segments of pretraining on specific models, with altered corpus statistics
    • (i.e., take an intermediate checkpoint, swap the word forms, and continue training to the end) In particular, for the Pythia-70M-deduped, Pythia-400M-deduped, Pythia-1.4B-deduped, and Pythia-6.9B-deduped models, we take a checkpoint and optimizer state 21B tokens (7%) prior to the end of training, and resume training of the model such that it sees the exact same data until the end of training, but with morphologically masculine pronouns replaced by their feminine counterparts
    • We also repeat this intervention for 63B tokens (21%) prior to the end of training on just the Pythia-1.4B-deduped model. We then measure model performance on the WinoBias (Zhao et al., 2018) coreference resolution benchmark and the English subset of the multilingual CrowS-Pairs (Névéol et al., 2022) stereotype benchmark
  • Result: downstream performance is maintained, while stereotype and gender bias measures decrease
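
The data intervention described above (replacing masculine pronouns with feminine ones before resuming training) can be sketched as a simple text transform. This is a rough illustration, not the authors' preprocessing code: the mapping table and the function name `swap_pronouns` are assumptions, and the sketch ignores the fact that "his" sometimes corresponds to "hers" rather than "her".

```python
import re

# Assumed mapping for illustration; "his" -> "her" is only right for the
# possessive determiner ("his book"), not the possessive pronoun ("it is his").
MASC_TO_FEM = {"he": "she", "him": "her", "his": "her", "himself": "herself"}
_PATTERN = re.compile(r"\b(" + "|".join(MASC_TO_FEM) + r")\b", re.IGNORECASE)


def swap_pronouns(text):
    """Replace masculine pronouns with feminine counterparts, preserving case."""
    def repl(match):
        word = match.group(0)
        swapped = MASC_TO_FEM[word.lower()]
        if word.isupper():
            return swapped.upper()
        if word[0].isupper():
            return swapped.capitalize()
        return swapped
    return _PATTERN.sub(repl, text)


print(swap_pronouns("He gave his book to him himself."))
```

Because the Pythia data order is fixed and public, a transform like this can be applied to exactly the batches seen after a chosen checkpoint, so the retrained model differs from the original only in the modified text.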

Does Training Order Influence Memorization?

  • In this experiment we test whether training order influences memorization.
  • Hypothesis (mental model)
    • This mental model predicts that data encountered later in training will be memorized more, as the model has had less time to incorporate it more fully into its representation space. If true, this would potentially be highly useful for mitigating the memorization of sequences for which verbatim memorization would be undesirable, by intentionally modifying a model’s training data order prior to training.
  • To test our hypothesis, we measure the memorization of an initial segment (first 64 tokens) of each sequence in the training corpus.
    • Definition of memorization from prior work (Carlini et al. (2021))
      • In their context, a string is (k, l)-memorized if prompting the model with a string of length k from the training data induces the model to generate the next l tokens from the training data correctly.
  • We choose k = l = 32 largely arbitrarily, and note that doing all reasonable pairs of (k, l) would have a computational cost comparable to retraining all of our models from scratch
  • To avoid potential covariate effects, we only use the first 64 tokens from each context seen during training.
  • Surprisingly, we find that a Poisson model fits the data extremely well (Figure 3), indicating that training order has little impact on memorization.
    • This model implies that memorized sequences are not spaced more densely toward the beginning or end of training, and that between each checkpoint roughly the same number of memorized sequences can be found.
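
The (k, l)-memorization definition above translates directly into a check over token sequences. The sketch below is a minimal version under stated assumptions: `generate` is a hypothetical callable that takes a prompt and a number of tokens and returns the model's continuation (for example, greedy decoding), and the function name `is_memorized` is illustrative.

```python
from typing import Callable, List, Sequence


def is_memorized(
    tokens: Sequence[int],
    generate: Callable[[List[int], int], List[int]],
    k: int = 32,
    l: int = 32,
) -> bool:
    """Return True if prompting with the first k tokens reproduces the next l tokens."""
    if len(tokens) < k + l:
        return False  # sequence too short to evaluate
    prompt = list(tokens[:k])
    target = list(tokens[k:k + l])
    # `generate` is a stand-in for the model's decoding routine (assumption).
    return generate(prompt, l) == target
```

With k = l = 32 this uses exactly the first 64 tokens of each training sequence, matching the setup described in the notes.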


  • The Poisson process here describes an event of the occurrence of a memorized sequence within a batch of training data
  • This finding is important for practitioners seeking to control which sequences are memorized by a model. It implies that one cannot simply place sequences that are undesirable to memorize at the beginning or end of training and successfully reduce the chance of memorization
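
One simple way to see what "a Poisson model fits" means is to look at the counts of memorized sequences per batch or per checkpoint interval: under a Poisson model the variance of those counts is close to their mean. The sketch below is an illustration of that check, not the analysis used in the paper; the function names are assumptions, and the sampler uses Knuth's multiplication method only to produce synthetic counts.

```python
import math
import random
import statistics


def poisson_pmf(k, lam):
    """Probability of observing exactly k events when the rate is lam."""
    return math.exp(-lam) * lam ** k / math.factorial(k)


def dispersion_index(counts):
    """Sample variance divided by mean; close to 1 for Poisson-distributed counts."""
    return statistics.variance(counts) / statistics.mean(counts)


def sample_poisson(lam, rng):
    """Draw one Poisson sample with Knuth's method (suitable for small lam)."""
    limit = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        k += 1
        p *= rng.random()
        if p <= limit:
            return k - 1


rng = random.Random(0)
synthetic_counts = [sample_poisson(4.0, rng) for _ in range(5000)]
print(dispersion_index(synthetic_counts))
```

If memorization were concentrated at the start or end of training, the per-interval counts would drift over time and the dispersion index would move well above 1, which is what the reported fit rules out.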

Do Pretraining Term Frequencies Influence Task Performance Throughout Training?

  • By charting the performance of an arithmetic task given an input operand and the frequency at which it is found in the pretraining corpus, Razeghi et al. (2022) concluded that accuracy tends to be higher for terms that are found more frequently compared to terms that are less frequent
  • A useful reference example for the operand ranges used in arithmetic reasoning
    • Following Razeghi et al. (2022), the formulation of the arithmetic task consists of input operands x1 ∈ [0, 99] and x2 ∈ [1, 50] and an output y
  • We observe that for both arithmetic and QA experiments, model sizes affect the correlation between average performance and the term frequencies, indicating that this correlation is an emergent property in larger models
    • Term frequencies appear to matter more for larger models?
    • Smaller models(below 1B) rarely produce accurate results on the task despite being given up to 16 few-shot examples, as shown in Figure 4
  • Arithmetic reasoning
    • The larger the model, the larger the effect
    • At the same model size, more training steps help
    • At the same step, higher term frequency helps
  • For multiplication, the result is as follows
    • For the multiplication arithmetic task, we also calculate the performance discrepancy between the top 10% most frequent input operands and the bottom 10% least frequent input operands, following Razeghi et al. (2022)
  • Similar patterns can be seen in Figure 5 where performance increase as training progresses mainly happens for larger models only
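
The performance discrepancy between the most and least frequent operands, mentioned above for the multiplication task, can be computed as follows. This is a minimal sketch assuming the per-operand results are available as `(term_frequency, accuracy)` pairs; that data layout and the function name `frequency_gap` are assumptions, not the paper's evaluation code.

```python
def frequency_gap(records, fraction=0.10):
    """Mean accuracy of the most frequent operands minus that of the least frequent.

    `records` is a list of (term_frequency, accuracy) pairs, one per operand.
    """
    if not records:
        raise ValueError("records must not be empty")
    ranked = sorted(records, key=lambda r: r[0])
    n = max(1, int(len(ranked) * fraction))  # size of each tail
    bottom, top = ranked[:n], ranked[-n:]

    def mean_accuracy(rows):
        return sum(acc for _, acc in rows) / len(rows)

    return mean_accuracy(top) - mean_accuracy(bottom)
```

A gap near zero means accuracy is unrelated to how often an operand appeared in pretraining; a gap that grows across checkpoints or model sizes corresponds to the emergent correlation described in the notes.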


  • We release Pythia, a suite of language models trained with consistent data ordering and model architecture across multiple orders of magnitude of scale
  • Pythia enables experiments at unprecedented levels of detail for a public model suite, demonstrated through novel analyses and results on gender debiasing, memorization, and term frequency effects


  • Full Configuration Details



Joosung Yoon
