(IA3) Few-Shot Parameter-Efficient Fine-Tuning is Better and Cheaper than In-Context Learning

Note

Author

  • Haokun Liu∗, Derek Tam∗, Mohammed Muqeeth∗, Jay Mohta, Tenghao Huang, Mohit Bansal, Colin Raffel
    • Department of Computer Science
    • University of North Carolina at Chapel Hill {haokunl,dtredsox,muqeeth,craffel}@cs.unc.edu

Summary

  • LoRA (2021, from Microsoft) is all the rage these days, but..?! (IA)3 has a lot going for it (it is cheaper and performs better)

Abstract

  • Few-shot in-context learning (ICL) enables a PLM to perform unseen tasks without any tuning, but it incurs substantial computational, memory, and storage costs
    • because it involves processing all of the training examples every time a prediction is made.
  • Parameter-efficient fine-tuning (PEFT) (e.g. adapter modules, prompt tuning, sparse update methods, etc.) offers an alternative paradigm where a small set of parameters are trained to enable a model to perform the new task.
  • we rigorously compare few-shot ICL and PEFT and demonstrate that the latter offers better accuracy as well as dramatically lower computational costs
  • we introduce a new PEFT method called (IA)3 that scales activations by learned vectors, attaining stronger performance while only introducing a relatively tiny amount of new parameters.
  • propose a simple recipe based on the T0 model [1] called T-Few that can be applied to new tasks without task-specific tuning or modifications.
    • applying it to the RAFT benchmark [2], attaining super-human performance for the first time and outperforming the state-of-the-art by 6% absolute.

Introduction

  • Background on progress in NLP
    • Pre-trained language models have become a cornerstone of natural language processing, thanks to the fact that they can dramatically improve data efficiency on tasks of interest
    • Notably, ICL requires no gradient-based training and therefore allows a single model to immediately perform a wide variety of tasks. Performing ICL therefore solely relies on the capabilities that a model learned during pre-training
  • Despite the practical benefits of ICL, it has several major drawbacks.
    • First, processing all prompted input-target pairs every time the model makes a prediction incurs significant compute costs.
    • Second, ICL typically produces inferior performance compared to fine-tuning [4].
    • Finally, the exact formatting of the prompt (including the wording [11] and ordering of examples [12]) can have significant and unpredictable impact on the model’s performance, far beyond inter-run variation of fine-tuning.
  • Recent work has also demonstrated that ICL can perform well even when provided with incorrect labels, raising questions as to how much learning is taking place at all
  • An additional paradigm for enabling a model to perform a new task with minimal updates is parameter-efficient fine-tuning (PEFT), where a pre-trained model is fine-tuned by only updating a small number of added or selected parameters.
    • Recent methods have matched the performance of fine-tuning the full model while only updating or adding a small fraction (e.g. 0.01%) of the full model’s parameters
    • Furthermore, certain PEFT methods allow mixed-task batches where different examples in a batch are processed differently [14], making both PEFT and ICL viable for multitask models
      • Brian Lester, Rami Al-Rfou, and Noah Constant. The power of scale for parameter-efficient prompt tuning. arXiv preprint arXiv:2104.08691, 2021.
  • An open question about PEFT
    • While the benefits of PEFT address some shortcomings of fine-tuning (when compared to ICL), there has been relatively little focus on whether PEFT methods work well when very little labeled data is available
  • Our primary goal in this paper is to close this gap by proposing a recipe
    • Specifically, we base our approach on the T0 model [1], a variant of T5 [15] fine-tuned on a multitask mixture of prompted datasets.
    • To improve performance on classification and multiple-choice tasks, we add unlikelihood [16, 17] and length normalization-based [4] loss terms
      • I haven't tried these loss terms yet; worth looking into applying them!
    • In addition, we develop (IA)3, a PEFT method that multiplies intermediate activations by learned vectors.
      • (IA)3 attains stronger performance than full-model fine-tuning while updating up to 10,000× fewer parameters
    • Finally, we demonstrate the benefits of pre-training the (IA)3 parameters before fine-tuning [18, 19].
  • Our overall recipe, which we dub “T-Few”, performs significantly better than ICL (even against 16× larger models) and outperforms humans for the first time on the real-world few-shot learning benchmark RAFT [2] while requiring dramatically less compute and allowing for mixed-task batches during inference.

Background

Parameter-efficient fine-tuning

  • Early methods proposed adding adapters [22–24], which are small trainable feed-forward networks inserted between the layers in the fixed pre-trained model
  • methods that choose a sparse subset of parameters to train [25, 26]
    • produce low-rank updates [13]
    • perform optimization in a lower-dimensional subspace [27]
    • add low-rank adapters using hypercomplex multiplication [28]
    • Relatedly, prompt tuning [14] and prefix tuning [29] concatenate learned continuous embeddings to the model’s input or activations to induce it to perform a task

A closer look at LoRA, which is popular these days

  • LoRA (2021, from Microsoft) is very popular these days!
  • How it works (a sketch of the update rule follows the figures below)
    image
  • pre-trained language models have a low “intrinsic dimension” and can still learn efficiently despite a low-dimensional reparametrization.
  • We limit our study to only changing the attention weights for downstream tasks and freeze the MLP modules
    image
    image
    image
    image
    image
    image
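  • A quick sketch of the LoRA update rule (my reconstruction from the LoRA paper, not something stated in this post): a frozen pre-trained weight W_0 is augmented with a trainable low-rank product, so the modified forward pass is

    h = W_0 x + \Delta W x = W_0 x + B A x,  \quad B \in \mathbb{R}^{d \times r},\; A \in \mathbb{R}^{r \times k},\; r \ll \min(d, k)

    Only A and B are trained (the LoRA paper additionally scales \Delta W by \alpha / r), and at inference time B A can be merged into W_0, so no extra latency is incurred.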

Designing the T-Few Recipe

Model and Datasets

  • In preliminary experiments applying PEFT methods to different pre-trained models, we attained the best performance with T0
    • T0 was released in three billion and eleven billion parameter variants, referred to as “T0-3B” and simply “T0” respectively
      • i.e., the eleven-billion-parameter variant is referred to simply as T0
  • we use T0-3B to reduce computational costs
  • we use the same number of few-shot training examples for each dataset as Brown et al. [4], which varies from 20 to 70
  • we train our model for 1K steps with a batch size of 8 and report performance at the end of training.

Unlikelihood Training and Length Normalization

  • Two additional loss terms are examined to improve few-shot fine-tuning of the LM
  • By default the LM is trained with the standard cross-entropy (language modeling) loss
  • Since evaluation uses rank classification (the choice with the higher likelihood wins), unlikelihood training is used to push down the probability of the incorrect choices
    • Unlikelihood -> the outer sum adds up the loss over the N incorrect choices, and the inner sum runs over their tokens; each term uses 1 - p(y|x, y<t), i.e. one minus the probability of the incorrect token, which flips the training signal away from generating the wrong token and toward generating the right one (reconstructed in LaTeX below)
      • So separate samples of the incorrect choices need to be set aside for unlikelihood training
        image
  • Now let's look at length normalization
    • The possible target sequences for a given training example can have significantly different lengths
      • i.e., the possible targets all have different lengths!
    • Ranking each choice based on probability can therefore “favor” shorter choices because the model’s assigned probability to each token is ≤ 1.
      • This is a problem for ranking: since each token probability is ≤ 1, shorter choices are favored (multiplying more per-token probabilities keeps shrinking the total)
      • To rectify this, we consider using length normalization when performing rank classification, which divides the model’s score on each possible answer choice by the number of tokens in the choice
      • First, compute the length-normalized log probability of a given output sequence
        • Then maximize the length-normalized log probability of the correct answer choice by minimizing a softmax cross-entropy loss (reconstructed in LaTeX below)
        • I'm not entirely sure why this has to be recast as an extra softmax cross-entropy term; it looks similar to the plain LM loss..?! (The difference seems to be that it contrasts the correct choice against the incorrect ones, rather than only raising the probability of the correct choice.)

image
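  • LaTeX reconstruction of the two loss terms (my transcription of the paper's definitions; the indexing may differ slightly from the original), where y is the correct choice and \hat{y}^{(1)}, \dots, \hat{y}^{(N)} are the N incorrect choices:

    L_{UL} = - \frac{\sum_{n=1}^{N} \sum_{t=1}^{T^{(n)}} \log\left(1 - p(\hat{y}^{(n)}_t \mid x, \hat{y}^{(n)}_{<t})\right)}{\sum_{n=1}^{N} T^{(n)}}

    \beta(x, y) = \frac{1}{T} \sum_{t=1}^{T} \log p(y_t \mid x, y_{<t})

    L_{LN} = - \log \frac{\exp(\beta(x, y))}{\exp(\beta(x, y)) + \sum_{n=1}^{N} \exp(\beta(x, \hat{y}^{(n)}))}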

  • Overall, accuracy improves as these loss terms are added
    • We find that adding L_LN improves the accuracy from 60.7% to 62.71% and including both L_UL and L_LN provides a further improvement to 63.3%
  • Python code example (snippet from the implementation)
    # Per-choice score: total token-level cross-entropy (negative log-likelihood) of each answer choice
    choices_scores = (
        F.cross_entropy(model_output.logits.flatten(0, 1), lm_target.flatten(0, 1), reduction="none")
        .view(bs, num_choices, -1)
        .sum(dim=-1)
    )
    if self.config.length_norm > 0:  # length normalization: divide by (number of non-pad target tokens)^length_norm
        choices_scores = choices_scores / torch.pow(
            (choices_ids != self.tokenizer.pad_token_id).sum(dim=-1), self.config.length_norm
        )
    # Standard LM loss L_LM computed on the correct choice only
    lm_loss = F.cross_entropy(
        model_output.logits.view(bs, num_choices, *model_output.logits.size()[1:])[range(bs), labels].flatten(
            0, 1
        ),
        lm_target.view(bs, num_choices, -1)[range(bs), labels].flatten(0, 1),
    )

    tensorboard_logs = {"lm_loss": lm_loss.item()}
    if self.config.mc_loss > 0:
        # Length-normalized loss L_LN: softmax cross-entropy over the (negated) per-choice scores
        mc_loss = F.cross_entropy(-choices_scores, labels)
        tensorboard_logs["mc_loss"] = mc_loss.item()
    else:
        mc_loss = 0.0

    if self.config.unlikely_loss > 0:  # unlikelihood loss L_UL
        cand_loglikely = -F.cross_entropy(
            model_output.logits.flatten(0, 1), lm_target.flatten(0, 1), reduction="none"
        ).view(bs, num_choices, -1)
        # Down-weight padding positions and exclude the correct choice so only incorrect-choice tokens contribute
        cand_loglikely += (lm_target < 0).view(bs, num_choices, -1) * -100
        cand_loglikely[range(bs), labels] = -100
        unlikely_loss = -torch.log(1 - torch.exp(cand_loglikely) + 1e-2).sum() / (cand_loglikely != -100).sum()
        tensorboard_logs["unlikely_loss"] = unlikely_loss.item()
    else:
        unlikely_loss = 0.0

    loss = lm_loss + mc_loss * self.config.mc_loss + unlikely_loss * self.config.unlikely_loss
    tensorboard_logs["loss"] = loss.item()

Parameter-efficient fine-tuning with (IA)3

image

  • we need a PEFT method that has the following properties:
    • First, it must add or update as few parameters as possible to avoid incurring storage and memory costs.
    • Second, it should achieve strong accuracy after few-shot training on new tasks.
    • Finally, it must allow for mixed-task batches, since that is a capability of ICL.
      • (In order to easily enable mixed-task batches, a PEFT method should ideally not modify the model itself.)
  • A more convenient alternative is provided by methods that directly modify the activations of the model since this can be done independently and cheaply to each example in the batch according to which task the example corresponds to.
  • As an alternative, we explored element-wise multiplication (i.e. rescaling) of the model’s activations against a learned vector.
  • Among the activation-modifying methods, one option is to adapt by element-wise multiplying a sequence of activations with a learnable vector
  • In their experiments, a learned rescaling vector was not needed on every set of activations; instead it was sufficient to
    • introduce rescaling vectors on the keys and values of the self-attention and encoder-decoder attention, and
    • introduce one on the intermediate activation of the position-wise feed-forward networks
  • In other words, three learnable vectors are introduced: l_k, l_v, l_ff (all initialized as ones vectors; the rescaled attention and FFN computations are sketched after this list)
    • (Interestingly, nothing is applied to the queries; why not? Once training is done, do the l_k, l_v, l_ff vectors stay around as separate parameters? If so, what does the checkpoint look like, and does the model architecture itself change?)
    • Looking at the attention mechanism, the keys are rescaled once and the values are rescaled once (these look like plain rescaling vectors rather than biases; I wondered whether a single set of vectors could be reused across all layers, which would be cheap, but the equations show each layer gets its own vectors)
  • The encoder needs L(d_k + d_v + d_ff) new parameters and the decoder needs L(2d_k + 2d_v + d_ff), since the decoder has both self-attention and encoder-decoder attention
    • Since this scales with the number of layers it doesn't feel all that cheap at first glance... hmm (it still works out to only about 0.01% of the model's parameters; the count is worked out in the storage-cost section below)
  • We call our method (IA)3, which stands for “Infused Adapter by Inhibiting and Amplifying Inner Activations”
  • (IA)3 makes mixed-task batches possible because each sequence of activations in the batch can be separately and cheaply multiplied by its associated learned task vector
  • This part is quite interesting: the vectors can be permanently folded into the weight matrices, so no architecture change is needed (which would also make checkpointing easier) -> no extra compute cost relative to the original model!
    • We also note that, in the event that a model will only be used on a single task, the modifications introduced by (IA)3 can also be applied to weight matrices permanently so that no elementwise multiplication is required and the model’s architecture remains unchanged.
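  • Reconstruction of the rescaled computations (my transcription; \odot denotes element-wise multiplication broadcast over the sequence):

    attention:     \mathrm{softmax}\left(\frac{Q (l_k \odot K)^{\top}}{\sqrt{d_k}}\right) (l_v \odot V)

    feed-forward:  (l_{ff} \odot \gamma(W_1 x)) W_2,  where \gamma is the feed-forward nonlinearity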

image

  • 9 PEFT methods are evaluated on T0-3B
  • 9 strong PEFT methods
    • BitFit [47] which updates only the bias parameters;
    • Adapters [23] which introduce task-specific layers after the self-attention and position-wise feed-forward networks;
    • Compacter and Compacter++ [28] which improve upon adapters by using low-rank matrices and hypercomplex multiplication;
    • prompt tuning [14] which learns task-specific prompt embeddings that are concatenated to the model’s input;
    • FISH Mask [26] which chooses a subset of parameters to update based on their approximate Fisher information;
    • Intrinsic SAID [27] which performs optimization in a low-dimensional subspace;
    • prefix-tuning [29] which learns task-specific vectors that are concatenated to the model’s activations;
    • LoRA [13] which assigns low-rank updates to parameter matrices.
      image

Pre-training (IA)3

  • pre-training the prompt embeddings in prompt tuning can improve performance when fine-tuning on downstream few-shot tasks. For pre-training, Gu et al. [18] use a suite of self-supervised tasks applied to unlabeled text data, and Vu et al. [19] consider using embeddings from a separate task or multitask mixture.
  • We follow Vu et al. [19] and simply pre-train the new parameters introduced by (IA)3 on the same multitask mixture used to train T0. We pre-train for 100,000 steps with a batch size of 16 before fine-tuning the (IA)3 parameters on each individual downstream dataset
  • We find that pre-training improves fine-tuned accuracy from 64.6 to 65.8 and therefore add it to our recipe
    • Performance does go up, but I don't fully get it: if the vectors will be permanently folded into the weight matrices anyway, why do they need to be trained again in this pre-training stage?
      image

Combining the ingredients

  • the T-Few recipe is defined as follows:
    • We use the T0 model as a backbone.
    • We add (IA)3 for downstream task adaptation
    • use parameters initialized from pre-training (IA)3 on the same multitask mixture for T0
      • T0 is already instruction-tuned, so what does “pre-training” on top of it even mean? Presumably it means training the (IA)3 vectors on the same multitask mixture that T0 was trained on..
    • As an objective, we use the sum of a standard language modeling loss L_LM, an unlikelihood loss L_UL for incorrect choices, and a length-normalized loss L_LN.
    • We train for 1,000 steps with a batch size of 8 sequences using the Adafactor optimizer [49] with a learning rate of 3e−3 and a linear decay schedule with a 60-step warmup (an optimizer setup sketch follows this list)
    • apply prompt templates to downstream datasets during training and inference to convert each example into an instructive text-to-text format.
    • we apply this recipe to every downstream dataset in exactly the same way without per-dataset hyperparameter tuning or modifications.
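  • A minimal sketch of the optimizer setup described above, assuming the HuggingFace transformers Adafactor implementation; the exact Adafactor flags are my assumption and are not spelled out in the paper summary:

    import torch
    from transformers.optimization import Adafactor, get_linear_schedule_with_warmup

    model = torch.nn.Linear(4, 4)  # stand-in for the (IA)3-augmented T0 model
    trainable_params = [p for p in model.parameters() if p.requires_grad]  # (IA)3 vectors only

    optimizer = Adafactor(
        trainable_params,
        lr=3e-3,                # fixed learning rate from the recipe
        scale_parameter=False,  # disable Adafactor's internal relative-step schedule so lr is used as-is
        relative_step=False,
        warmup_init=False,
    )
    scheduler = get_linear_schedule_with_warmup(
        optimizer, num_warmup_steps=60, num_training_steps=1_000
    )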

Outperforming ICL with T-Few

Performance on T0 tasks

image

Comparing computational costs

  • we now compare the relative costs of each few-shot learning approach.
  • For simplicity, we use the FLOPs-per-token estimates for Transformer-based language models introduced by Kaplan et al. [20].
    • Specifically, we estimate that a decoder-only Transformer (e.g. the GPT series) with N parameters uses 2N FLOPs per token for inference and 6N FLOPs per token for training.
      • Why? Roughly: in the forward pass each parameter contributes about one multiply and one add per token (≈2N FLOPs), and the backward pass costs roughly twice the forward pass, giving ≈6N FLOPs per token for training.
    • Encoder-decoder models like T0 and T5 (where the encoder and decoder have the same number of layers and layer sizes) only process each token with either the encoder or decoder (each having roughly half the parameters of the full model), so the FLOPs per token estimates are halved to N and 3N FLOPs per token for inference and training.
  • Caveats of measuring FLOPs
    • We note that FLOPs are not a direct measurement of real-world computational cost because latency, power usage, and other costs can vary significantly depending on hardware and other factors
  • Why FLOPs are used anyway
    • However, we focus on FLOPs because it is a hardware-independent metric that correlates closely with real-world costs; the hardware setup used for running the different methods we consider would likely vary significantly across methods

Inference cost.

  • Processing a single input and all target choices with T-Few requires 11e9 × 103 = 1.1e12 FLOPs, whereas few-shot ICL with GPT-3 175B requires 2 × 175e9 × (41 × 98 + 103) = 1.4e15 FLOPs– more than 3 orders of magnitude more.

Training cost

  • Since T-Few is the only method that involves updating parameters, it is the only method that incurs a training cost.
  • Training an eleven billion parameter encoder-decoder model for 1,000 steps with a batch size of 8 length-103 sequences requires approximately 3 × 11e9 × 1, 000 × 8 × 103 = 2.7e16 FLOPs
  • While not insignificant, this is only about 20 times larger than the FLOPs required to process a single example with few-shot ICL using GPT-3 175B. In other words, training T-Few costs as much as using GPT-3 175B to process 20 examples with few-shot ICL. (The arithmetic is sanity-checked in the snippet below.)
  • We also found that fine-tuning T0 with T-Few on a single dataset only takes about a half an hour on a single NVIDIA A100 GPU. As of writing, this would cost about $2 USD using Microsoft Azure
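  • A quick sanity check of the FLOPs arithmetic quoted in the inference- and training-cost estimates above (pure arithmetic using the figures from the text):

    t0_params = 11e9     # T0 (encoder-decoder): ~N FLOPs/token inference, ~3N/token training
    gpt3_params = 175e9  # GPT-3 (decoder-only): ~2N FLOPs/token inference
    seq_len = 103        # median tokenized length of one input plus all target choices

    # Inference: T-Few on one example vs. few-shot ICL with GPT-3 175B (41 shots of ~98 tokens each)
    tfew_infer = t0_params * seq_len                   # ~1.1e12 FLOPs
    icl_infer = 2 * gpt3_params * (41 * 98 + seq_len)  # ~1.4e15 FLOPs, >3 orders of magnitude more

    # Training: 1,000 steps x batch size 8 x length-103 sequences
    tfew_train = 3 * t0_params * 1_000 * 8 * seq_len   # ~2.7e16 FLOPs
    print(f"{tfew_infer:.1e} {icl_infer:.1e} {tfew_train:.1e} ratio={tfew_train / icl_infer:.0f}")  # ratio ≈ 19-20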

Storage cost

  • T-Few also incurs the largest storage cost.
  • When stored as single-precision floats, the parameters added by (IA)3 take up 4.2 MB of space on disk
  • In contrast, ICL methods only require storing the tokenized in-context examples (typically stored as 32-bit integers), resulting in a smaller 41 × 98 × 32 bits = 16 kB disk space requirement.
  • However, we note that 4.2 MB is dwarfed by the on-disk size of the model checkpoints themselves – storing the (IA)3 adaptation vectors for 10,000 tasks would take about as much space as the T0 checkpoint (41.5 GB).
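  • Working out where the 4.2 MB and 16 kB figures come from (the T0-11B / T5 v1.1 XXL dimensions below are my assumption, not stated in this post):

    d_k = d_v = 4096  # assumed key/value projection width of T0-11B
    d_ff = 10240      # assumed feed-forward intermediate width
    L = 24            # assumed number of layer blocks in the encoder and in the decoder

    encoder_params = L * (d_k + d_v + d_ff)          # 442,368
    decoder_params = L * (2 * d_k + 2 * d_v + d_ff)  # 638,976 (self-attention + enc-dec attention)
    total = encoder_params + decoder_params          # ~1.08M new parameters, ~0.01% of 11B
    print(total * 4 / 1e6)                           # ~4.3 MB as fp32, roughly the 4.2 MB quoted above

    print(41 * 98 * 32 / 8 / 1e3)                    # ~16 kB for the tokenized in-context examples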

Memory usage

  • T-Few will incur a lower memory cost during inference. Additional memory costs are incurred when training T-Few due to the need to cache intermediate activations for backpropagation and for the gradient accumulator variables in Adafactor.
  • it is possible to use the T-Few recipe on a single 80GB A100 GPU.

Performance on Real-world Few-shot Tasks (RAFT)

  • The evaluations so far weren't designed specifically for few-shot learning
    • So far, we have evaluated performance on a collection of datasets that were not explicitly designed for benchmarking few-shot learning
  • To better evaluate T-Few’s performance in the real world, we evaluated our approach on the RAFT benchmark
  • RAFT consists of 11 “economically valuable” tasks that aim to mirror real-world applications
  • Importantly, each RAFT dataset has only 50 training examples with no validation set and a (larger) test set with no public labels, so it is impossible to “cheat” by tuning on an unrealistically-large validation set or by peeking at the test set [32, 31].

image

image

Ablation experiments

  • While adding each ingredient does not always significantly increase the accuracy on each individual dataset, each ingredient consistently improves the average performance across datasets: Removing pre-training decreases accuracy by 1.6%, removing unlikelihood training and length normalization decreases accuracy by 4.1%, and removing both pre-training and our additional loss terms reduces accuracy by 2.5%.

image

Conclusion

  • We introduced T-Few, a parameter-efficient few-shot learning recipe that attains higher accuracy than few-shot ICL at a lower computational cost
  • T-Few uses (IA)3, a new PEFT method that rescales inner activations with learned vectors.
  • Using (IA)3 produces better performance than fine-tuning the full model while only introducing a tiny amount of additional parameters.
  • T-Few also uses two additional loss terms that encourage the model to output lower probabilities for incorrect choices and account for the length of different answer choices
  • we found that T-Few uses over 1,000× fewer FLOPs during inference than few-shot ICL with GPT-3 and only requires 30 minutes to train on a single NVIDIA A100 GPU

Appendix

Full Unlikelihood Training and Length Normalization Results

image

Full PEFT Results

with L_UL, L_LN: image
without L_UL, L_LN: image

Comparing T-Few with few-shot ICL methods

image

Implementation

  • (IA)3 is implemented by modifying the LoRA implementation
  • t-few/configs/ia3.json
{
  "lora_scaling_rank": 1,
  "lora_rank": 0,
  "lora_init_scale": 0.0,
  "lora_modules": ".*SelfAttention|.*EncDecAttention|.*DenseReluDense",
  "lora_layers": "k|v|wi_1.*",
  "trainable_param_names": ".*lora_b.*",
  "model_modifier": "lora",
  "lr": 3e-3,
  "num_steps": 1000
}
  • Snippet of the LoRA-based (IA)3 implementation (in the config above, lora_scaling_rank=1 adds the multiplicative rescaling vectors, lora_rank=0 disables the additive low-rank update, and lora_layers k|v|wi_1 targets the key/value projections and the FFN intermediate layer)

    import re

    import torch
    import torch.nn as nn
    import torch.nn.functional as F


    def modify_with_lora(transformer, config):
        # Replace every nn.Linear whose module/child name matches the config regexes with a LoRALinear wrapper
        for m_name, module in dict(transformer.named_modules()).items():
            if re.fullmatch(config.lora_modules, m_name):
                for c_name, layer in dict(module.named_children()).items():
                    if re.fullmatch(config.lora_layers, c_name):
                        assert isinstance(
                            layer, nn.Linear
                        ), f"LoRA can only be applied to torch.nn.Linear, but {layer} is {type(layer)}."
                        setattr(
                            module,
                            c_name,
                            LoRALinear(layer, config.lora_rank, config.lora_scaling_rank, config.lora_init_scale),
                        )
        return transformer


    class LoRALinear(nn.Module):
        def __init__(self, linear_layer, rank, scaling_rank, init_scale):
            super().__init__()
            self.in_features = linear_layer.in_features
            self.out_features = linear_layer.out_features
            self.rank = rank
            self.scaling_rank = scaling_rank
            self.weight = linear_layer.weight
            self.bias = linear_layer.bias
            if self.rank > 0:
                # additive low-rank update (classic LoRA); unused for (IA)3 since lora_rank = 0
                self.lora_a = nn.Parameter(torch.randn(rank, linear_layer.in_features) * init_scale)
                if init_scale < 0:
                    self.lora_b = nn.Parameter(torch.randn(linear_layer.out_features, rank) * init_scale)
                else:
                    self.lora_b = nn.Parameter(torch.zeros(linear_layer.out_features, rank))
            if self.scaling_rank:
                # multiplicative rescaling vectors, initialized at ones when init_scale = 0 (the (IA)3 case)
                self.multi_lora_a = nn.Parameter(
                    torch.ones(self.scaling_rank, linear_layer.in_features)
                    + torch.randn(self.scaling_rank, linear_layer.in_features) * init_scale
                )
                if init_scale < 0:
                    self.multi_lora_b = nn.Parameter(
                        torch.ones(linear_layer.out_features, self.scaling_rank)
                        + torch.randn(linear_layer.out_features, self.scaling_rank) * init_scale
                    )
                else:
                    self.multi_lora_b = nn.Parameter(torch.ones(linear_layer.out_features, self.scaling_rank))

        def forward(self, input):
            if self.scaling_rank == 1 and self.rank == 0:
                # parsimonious implementation for ia3 and lora scaling
                if self.multi_lora_a.requires_grad:
                    # this seems to be where the (IA)3 element-wise multiplication happens!! (input rescaling)
                    hidden = F.linear((input * self.multi_lora_a.flatten()), self.weight, self.bias)
                else:
                    hidden = F.linear(input, self.weight, self.bias)
                if self.multi_lora_b.requires_grad:
                    # output rescaling; with ia3.json only multi_lora_b is trainable
                    hidden = hidden * self.multi_lora_b.flatten()
                return hidden
            else:
                # general implementation for lora (adding and scaling)
                weight = self.weight
                if self.scaling_rank:
                    weight = weight * torch.matmul(self.multi_lora_b, self.multi_lora_a) / self.scaling_rank
                if self.rank:
                    weight = weight + torch.matmul(self.lora_b, self.lora_a) / self.rank
                return F.linear(input, weight, self.bias)
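  • A minimal usage sketch applying the snippet above with the ia3.json settings (my own example, not from the repo; the model name and freezing loop are assumptions):

    import re

    from transformers import T5ForConditionalGeneration

    class IA3Config:
        lora_rank = 0                 # no additive low-rank update
        lora_scaling_rank = 1         # one multiplicative rescaling vector per modified layer
        lora_init_scale = 0.0         # initialize the rescaling vectors to exactly ones
        lora_modules = ".*SelfAttention|.*EncDecAttention|.*DenseReluDense"
        lora_layers = "k|v|wi_1.*"
        trainable_param_names = ".*lora_b.*"

    config = IA3Config()
    model = T5ForConditionalGeneration.from_pretrained("bigscience/T0_3B")
    model = modify_with_lora(model, config)

    # Train only the parameters matched by trainable_param_names (the multi_lora_b vectors);
    # the pre-trained weights and the untrained multi_lora_a input-rescaling vectors stay frozen.
    for name, param in model.named_parameters():
        param.requires_grad = re.fullmatch(config.trainable_param_names, name) is not None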
  • Notes on F.linear
    image

  • For comparison, below is the LoRA config, which differs slightly (rank-4 additive updates on q/k/v/o, and the layer norms are also trainable)

class LoRAConfig:
    def __init__(self):
        self.lora_rank = 4
        self.lora_init_scale = 0.01
        self.lora_modules = ".*SelfAttention|.*EncDecAttention"
        self.lora_layers = "q|k|v|o"
        self.trainable_param_names = ".*layer_norm.*|.*lora_[ab].*"
        self.lora_scaling_rank = 1
