(IA3) Few-Shot Parameter-Efficient Fine-Tuning is Better and Cheaper than In-Context Learning
Note
- Still a preprint; under review!
- GitHub repo: https://github.com/r-three/t-few
- The (IA)3 paper: Infused Adapter by Inhibiting and Amplifying Inner Activations
- Presentation video on the related LoRA paper (recommended)
Author
- Haokun Liu∗ Derek Tam∗ Mohammed Muqeeth∗ Jay Mohta Tenghao Huang Mohit Bansal Colin Raffel
- Department of Computer Science
- University of North Carolina at Chapel Hill {haokunl,dtredsox,muqeeth,craffel}@cs.unc.edu
Summary
- LoRA (2021, from Microsoft) is hot these days, but (IA)3 has many advantages: it is cheaper and performs better.
Abstract
- Few-shot in-context learning (ICL) lets a pre-trained language model perform unseen tasks without any gradient-based tuning, but it incurs substantial computational, memory, and storage costs
- because it involves processing all of the training examples every time a prediction is made.
- Parameter-efficient fine-tuning (PEFT) (e.g. adapter modules, prompt tuning, sparse update methods, etc.) offers an alternative paradigm where a small set of parameters is trained to enable a model to perform the new task.
- We rigorously compare few-shot ICL and PEFT and demonstrate that the latter offers better accuracy as well as dramatically lower computational costs.
- We introduce a new PEFT method called (IA)3 that scales activations by learned vectors, attaining stronger performance while only introducing a relatively tiny amount of new parameters.
- We propose a simple recipe based on the T0 model [1], called T-Few, that can be applied to new tasks without task-specific tuning or modifications.
- Applying it to the RAFT benchmark [2], we attain super-human performance for the first time and outperform the state of the art by 6% absolute.
Introduction
- Background on NLP progress
- Pre-trained language models have become a cornerstone of natural language processing, thanks to the fact that they can dramatically improve data efficiency on tasks of interest.
- Notably, ICL requires no gradient-based training and therefore allows a single model to immediately perform a wide variety of tasks. Performing ICL therefore solely relies on the capabilities that a model learned during pre-training.
- Despite the practical benefits of ICL, it has several major drawbacks.
- First, processing all prompted input-target pairs every time the model makes a prediction incurs significant compute costs.
- Second, ICL typically produces inferior performance compared to fine-tuning [4].
- Finally, the exact formatting of the prompt (including the wording [11] and ordering of examples [12]) can have significant and unpredictable impact on the model’s performance, far beyond inter-run variation of fine-tuning.
- Recent work has also demonstrated that ICL can perform well even when provided with incorrect labels, raising questions as to how much learning is taking place at all.
- An additional paradigm for enabling a model to perform a new task with minimal updates is parameter-efficient fine-tuning (PEFT), where a pre-trained model is fine-tuned by only updating a small number of added or selected parameters.
- Recent methods have matched the performance of fine-tuning the full model while only updating or adding a small fraction (e.g. 0.01%) of the full model’s parameters.
- Furthermore, certain PEFT methods allow mixed-task batches where different examples in a batch are processed differently [14], making both PEFT and ICL viable for multitask models.
- Reference: Brian Lester, Rami Al-Rfou, and Noah Constant. The Power of Scale for Parameter-Efficient Prompt Tuning. arXiv preprint arXiv:2104.08691, 2021.
- Issues raised about PEFT
- While the benefits of PEFT address some shortcomings of fine-tuning (when compared to ICL), there has been relatively little focus on whether PEFT methods work well when very little labeled data is available.
- Our primary goal in this paper is to close this gap by proposing a recipe, i.e. a model, a PEFT method, and a fixed set of hyperparameters, that attains strong performance on new, unseen tasks while only updating a tiny fraction of the model's parameters.
- Specifically, we base our approach on the T0 model [1], a variant of T5 [15] fine-tuned on a multitask mixture of prompted datasets.
- To improve performance on classification and multiple-choice tasks, we add unlikelihood [16, 17] and length normalization-based [4] loss terms.
- I haven't tried these yet; I should look into applying them!
- In addition, we develop (IA)3, a PEFT method that multiplies intermediate activations by learned vectors.
- (IA)3 attains stronger performance than full-model fine-tuning while updating up to 10,000× fewer parameters.
- Finally, we demonstrate the benefits of pre-training the (IA)3 parameters before fine-tuning [18, 19].
- Our overall recipe, which we dub “T-Few”, performs significantly better than ICL (even against 16× larger models) and outperforms humans for the first time on the real-world few-shot learning benchmark RAFT [2] while requiring dramatically less compute and allowing for mixed-task batches during inference.
Background
Parameter-efficient fine-tuning
- Early methods proposed adding adapters [22–24], which are small trainable feed-forward networks inserted between the layers in the fixed pre-trained model.
- Other methods choose a sparse subset of parameters to train [25, 26],
- produce low-rank updates [13],
- perform optimization in a lower-dimensional subspace [27],
- or add low-rank adapters using hypercomplex multiplication [28].
- Relatedly, prompt tuning [14] and prefix tuning [29] concatenate learned continuous embeddings to the model’s input or activations to induce it to perform a task.
Since LoRA is so popular these days, let's look at it a little more closely!
- LoRA (2021, from Microsoft) is hot right now!
- Core idea
- Pre-trained language models have a low “intrinsic dimension” and can still learn efficiently despite a low-dimensional reparametrization.
- The LoRA paper limits its study to only changing the attention weights for downstream tasks and freezes the MLP modules (see the sketch below).
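For reference, the standard LoRA reparametrization (a quick summary of the well-known formulation, not taken from this note) adds a trainable low-rank update to a frozen weight matrix:

```latex
% Frozen pre-trained weight W_0 \in \mathbb{R}^{d \times k}; only B and A are trained,
% with rank r \ll \min(d, k), so the update \Delta W = BA has far fewer parameters than W_0.
h = W_0 x + \Delta W\, x = W_0 x + B A\, x,
\qquad B \in \mathbb{R}^{d \times r},\; A \in \mathbb{R}^{r \times k}
```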
Designing the T-Few Recipe
Model and Datasets
- In preliminary experiments applying PEFT methods to different pre-trained models, we attained the best performance with T0.
- T0 was released in three billion and eleven billion parameter variants, referred to as “T0-3B” and simply “T0” respectively.
- That is, the 11B variant is referred to simply as “T0”.
- We use T0-3B to reduce computational costs.
- We use the same number of few-shot training examples for each dataset as Brown et al. [4], which varies from 20 to 70.
- We train our model for 1K steps with a batch size of 8 and report performance at the end of training.
Unlikelihood Training and Length Normalization
- To improve few-shot fine-tuning of the LM, two additional loss terms are explored.
- By default the LM is trained with the standard cross-entropy (language modeling) loss.
- Since evaluation uses rank classification (the answer choice with the higher likelihood wins), unlikelihood training is used to push down the probability of the incorrect choices.
- Unlikelihood: the loss sums over the N incorrect choices (the outer sum); for each choice (the inner sum over its tokens), the term is log(1 − p(ŷ_t | x, ŷ_<t)), so subtracting the probability of the incorrect token from 1 flips the training signal away from generating the wrong tokens and toward the correct ones.
- So I will also need to keep the incorrect candidates around as separate samples for unlikelihood training.
- Now let's look at length normalization.
- The possible target sequences for a given training example can have significantly different lengths.
- That is, the targets all have different lengths!
- Ranking each choice based on probability can therefore “favor” shorter choices because the model’s assigned probability to each token is ≤ 1.
- This is an issue for ranking: since every token probability is ≤ 1, shorter choices have an advantage (multiplying more sub-1 factors only makes the product smaller).
- To rectify this, we consider using length normalization when performing rank classification, which divides the model’s score on each possible answer choice by the number of tokens in the choice.
- First, we compute the length-normalized log probability of a given output sequence.
- Then we maximize the length-normalized log probability of the correct answer choice by minimizing a softmax cross-entropy loss.
- I don't fully see why this has to be recast as a softmax cross-entropy; it looks similar to the plain LM loss..?! (The difference is that the softmax normalizes over the candidate set, so the loss pushes the correct choice's length-normalized score above the incorrect choices' scores rather than just raising its absolute likelihood.)
- In the end, adding these loss terms consistently helps.
- We find that adding L_LN improves the accuracy from 60.7% to 62.71%, and including both L_UL and L_LN provides a further improvement to 63.3% (the loss definitions are written out below).
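Written out explicitly (my reconstruction from the descriptions above; x is the input, y the correct answer choice with T tokens, and ŷ^(n) the n-th of N incorrect choices with T^(n) tokens):

```latex
% Unlikelihood loss over the incorrect choices
L_{\mathrm{UL}} = -\,\frac{\sum_{n=1}^{N}\sum_{t=1}^{T^{(n)}}
    \log\!\bigl(1 - p(\hat{y}^{(n)}_{t} \mid x,\, \hat{y}^{(n)}_{<t})\bigr)}
    {\sum_{n=1}^{N} T^{(n)}}

% Length-normalized log probability of a candidate sequence
\beta(x, y) = \frac{1}{T}\sum_{t=1}^{T} \log p(y_{t} \mid x,\, y_{<t})

% Length-normalization loss: softmax cross-entropy over the candidate set
L_{\mathrm{LN}} = -\log
    \frac{\exp\bigl(\beta(x, y)\bigr)}
         {\exp\bigl(\beta(x, y)\bigr) + \sum_{n=1}^{N} \exp\bigl(\beta(x, \hat{y}^{(n)})\bigr)}
```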
- Python code example (the corresponding snippet from the t-few repo, from the model's training step):

```python
# Rank-classification scores: per-choice token-level cross-entropy summed over tokens
choices_scores = (
    F.cross_entropy(model_output.logits.flatten(0, 1), lm_target.flatten(0, 1), reduction="none")
    .view(bs, num_choices, -1)
    .sum(dim=-1)
)
if self.config.length_norm > 0:  # length normalization by the number of target tokens
    choices_scores = choices_scores / torch.pow(
        (choices_ids != self.tokenizer.pad_token_id).sum(dim=-1), self.config.length_norm
    )
# Standard LM loss, computed on the correct choice only
lm_loss = F.cross_entropy(
    model_output.logits.view(bs, num_choices, *model_output.logits.size()[1:])[range(bs), labels].flatten(
        0, 1
    ),
    lm_target.view(bs, num_choices, -1)[range(bs), labels].flatten(0, 1),
)
tensorboard_logs = {"lm_loss": lm_loss.item()}
if self.config.mc_loss > 0:  # multiple-choice softmax cross-entropy (L_LN when length_norm is enabled)
    mc_loss = F.cross_entropy(-choices_scores, labels)
    tensorboard_logs["mc_loss"] = mc_loss.item()
else:
    mc_loss = 0.0
if self.config.unlikely_loss > 0:  # unlikelihood loss L_UL on the incorrect choices
    cand_loglikely = -F.cross_entropy(
        model_output.logits.flatten(0, 1), lm_target.flatten(0, 1), reduction="none"
    ).view(bs, num_choices, -1)
    cand_loglikely += (lm_target < 0).view(bs, num_choices, -1) * -100
    cand_loglikely[range(bs), labels] = -100
    unlikely_loss = -torch.log(1 - torch.exp(cand_loglikely) + 1e-2).sum() / (cand_loglikely != -100).sum()
    tensorboard_logs["unlikely_loss"] = unlikely_loss.item()
else:
    unlikely_loss = 0.0
loss = lm_loss + mc_loss * self.config.mc_loss + unlikely_loss * self.config.unlikely_loss
tensorboard_logs["loss"] = loss.item()
```
Parameter-efficient fine-tuning with (IA)3
- we need a PEFT method that has the following properties:
- First, it must add or update as few parameters as possible to avoid incurring storage and memory costs.
- Second, it should achieve strong accuracy after few-shot training on new tasks.
- Finally, it must allow for mixed-task batches, since that is a capability of ICL.
- (In order to easily enable mixed-task batches, a PEFT method should ideally not modify the model itself.)
- A more convenient alternative is provided by methods that directly modify the activations of the model, since this can be done independently and cheaply for each example in the batch according to which task the example corresponds to.
- As an alternative, we explored element-wise multiplication (i.e. rescaling) of the model’s activations against a learned vector.
- So among activation-based methods, one option is to adapt by element-wise multiplying a sequence of activations with a learnable vector.
- Experimentally, a learned rescaling vector is not needed on every set of activations; instead, it is sufficient to
- introduce rescaling vectors on the keys and values of the self-attention and encoder-decoder attention, and
- on the intermediate activation of the position-wise feed-forward networks.
- In other words, three learnable vectors are proposed: l_k, l_v, l_ff (all initialized to ones); see the sketch after this list.
- (Curiously, nothing is applied to the queries. Why not? And after training, do the l_k, l_v, l_ff vectors just stay around? If so, what does the checkpoint look like? Does the model itself change?)
- In the attention mechanism, there is one rescale on the key side and one on the value side (these look like plain rescaling vectors rather than biases). Are they created once and reused across layers, which would be cheap? Looking at the equations, no; each layer gets its own set.
- The encoder needs L(d_k + d_v + d_ff) new parameters and the decoder needs L(2d_k + 2d_v + d_ff), where L is the number of layers.
- Since it scales with the number of layers, it doesn't actually seem that cheap... hmm.
- We call our method (IA)3, which stands for “Infused Adapter by Inhibiting and Amplifying Inner Activations”.
- (IA)3 makes mixed-task batches possible because each sequence of activations in the batch can be separately and cheaply multiplied by its associated learned task vector.
- This part is quite interesting! The vectors can be folded into the weight matrices permanently, so no architecture change is needed (which also makes checkpoint saving easy) -> no additional compute cost compared to the original model!
- We also note that, in the event that a model will only be used on a single task, the modifications introduced by (IA)3 can also be applied to weight matrices permanently so that no elementwise multiplication is required and the model’s architecture remains unchanged.
- Using T0-3B, the authors compare against 9 strong PEFT methods:
- BitFit [47] which updates only the bias parameters;
- Adapters [23] which introduce task-specific layers after the self-attention and position-wise feed-forward networks;
- Compacter and Compacter++ [28] which improve upon adapters by using low-rank matrices and hypercomplex multiplication;
- prompt tuning [14] which learns task-specific prompt embeddings that are concatenated to the model’s input;
- FISH Mask [26] which chooses a subset of parameters to update based on their approximate Fisher information;
- Intrinsic SAID [27] which performs optimization in a low-dimensional subspace;
- prefix-tuning [29] which learns task-specific vectors that are concatenated to the model’s activations;
- LoRA [13] which assigns low-rank updates to parameter matrices.
Pre-training (IA)3
- Pre-training the prompt embeddings in prompt tuning can improve performance when fine-tuning on downstream few-shot tasks. For pre-training, Gu et al. [18] use a suite of self-supervised tasks applied to unlabeled text data, and Vu et al. [19] consider using embeddings from a separate task or multitask mixture.
- We follow Vu et al. [19] and simply pre-train the new parameters introduced by (IA)3 on the same multitask mixture used to train T0. We pre-train for 100,000 steps with a batch size of 16 before fine-tuning the (IA)3 parameters on each individual downstream dataset.
- We find that pre-training improves fine-tuned accuracy from 64.6 to 65.8 and therefore add it to our recipe.
- Performance does go up, but I don't quite get it: since the vectors will be folded into the matrices permanently anyway, why do they also need to be learned during a pre-training stage?
Combining the ingredients
- the T-Few recipe is defined as follows:
- We use the T0 model as a backbone.
- We add (IA)3 for downstream task adaptation,
- using parameters initialized from pre-training (IA)3 on the same multitask mixture used for T0.
- T0 itself is already instruction-tuned, so what does "pre-training" on top of it even mean? It just means training the (IA)3 vectors (with T0 frozen) on the same T0 training mixture before the few-shot stage.
- As an objective, we use the sum of a standard language modeling loss L_LM, an unlikelihood loss L_UL for incorrect choices, and a length-normalized loss L_LN.
- We train for 1,000 steps with a batch size of 8 sequences using the Adafactor optimizer [49] with a learning rate of 3e−3 and a linear decay schedule with a 60-step warmup (a minimal sketch of this setup follows after this list).
- apply prompt templates to downstream datasets during training and inference to convert each example into an instructive text-to-text format.
- we apply this recipe to every downstream dataset in exactly the same way without per-dataset hyperparameter tuning or modifications.
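A minimal sketch of that optimization setup (my own illustration using the Hugging Face transformers Adafactor; whether the authors use exactly these flags is an assumption, and the training loop itself is omitted):

```python
from transformers import Adafactor, T5ForConditionalGeneration, get_linear_schedule_with_warmup

model = T5ForConditionalGeneration.from_pretrained("bigscience/T0_3B")  # backbone (T0-3B here)
# In T-Few only the (IA)3 vectors are trainable; everything else in T0 stays frozen.
trainable_params = [p for p in model.parameters() if p.requires_grad]
optimizer = Adafactor(trainable_params, lr=3e-3, scale_parameter=False, relative_step=False)
scheduler = get_linear_schedule_with_warmup(
    optimizer, num_warmup_steps=60, num_training_steps=1000
)
```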
Outperforming ICL with T-Few
Performance on T0 tasks
Comparing computational costs
- we now compare the relative costs of each few-shot learning approach.
- For simplicity, we use the FLOPs-per-token estimates for Transformer-based language models introduced by Kaplan et al. [20].
- Specifically, we estimate that a decoder-only Transformer (e.g. the GPT series) with N parameters uses 2N FLOPs per token for inference and 6N FLOPs per token for training.
- Why? (Roughly: the forward pass costs about 2N FLOPs per token and the backward pass about twice that, giving ≈ 6N FLOPs per token for training.)
- Encoder-decoder models like T0 and T5 (where the encoder and decoder have the same number of layers and layer sizes) only process each token with either the encoder or decoder (each having roughly half the parameters of the full model), so the FLOPs-per-token estimates are halved to N and 3N FLOPs per token for inference and training.
- Downsides of measuring FLOPs?
- We note that FLOPs are not a direct measurement of real-world computational cost because latency, power usage, and other costs can vary significantly depending on hardware and other factors.
- Why use them anyway?
- However, we focus on FLOPs because it is a hardware-independent metric that correlates closely with real-world costs; in addition, the hardware setup used for running the different methods we consider would likely vary significantly across methods.
Inference cost.
- Processing a single input and all target choices with T-Few requires 11e9 × 103 = 1.1e12 FLOPs, whereas few-shot ICL with GPT-3 175B requires 2 × 175e9 × (41 × 98 + 103) = 1.4e15 FLOPs – more than 3 orders of magnitude more.
Training cost
- Since T-Few is the only method that involves updating parameters, it is the only method that incurs a training cost.
- Training an eleven-billion-parameter encoder-decoder model for 1,000 steps with a batch size of 8 length-103 sequences requires approximately 3 × 11e9 × 1,000 × 8 × 103 = 2.7e16 FLOPs.
- While not insignificant, this is only about 20 times larger than the FLOPs required to process a single example with few-shot ICL using GPT-3 175B. In other words, training T-Few costs as much as using GPT-3 175B to process 20 examples with few-shot ICL (the arithmetic is reproduced in the snippet below).
- We also found that fine-tuning T0 with T-Few on a single dataset only takes about half an hour on a single NVIDIA A100 GPU. As of writing, this would cost about $2 USD using Microsoft Azure.
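To sanity-check the numbers above, a tiny back-of-the-envelope calculation (my own; the token counts 103 and 41 × 98 are the ones used in the figures above):

```python
# Back-of-the-envelope FLOPs check using the per-token estimates above
# (encoder-decoder: N FLOPs/token inference, 3N FLOPs/token training;
#  decoder-only: 2N FLOPs/token inference).
N_T0, N_GPT3 = 11e9, 175e9
seq_len = 103              # tokenized input + target length used above
icl_context = 41 * 98      # in-context examples prepended for few-shot ICL

tfew_inference = N_T0 * seq_len                            # ≈ 1.1e12 FLOPs
gpt3_icl_inference = 2 * N_GPT3 * (icl_context + seq_len)  # ≈ 1.4e15 FLOPs
tfew_training = 3 * N_T0 * 1000 * 8 * seq_len              # ≈ 2.7e16 FLOPs

print(f"{tfew_inference:.1e}, {gpt3_icl_inference:.1e}, {tfew_training:.1e}")
print(f"training / one ICL example ≈ {tfew_training / gpt3_icl_inference:.0f}x")
```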
Storage cost
- T-Few also incurs the largest storage cost.
- When stored as single-precision floats, the parameters added by (IA)3 take up 4.2 MB of space on disk
- In contrast, ICL methods only require storing the tokenized in-context examples (typically stored as 32-bit integers), resulting in a smaller 41 × 98 × 32 bits = 16 kB disk space requirement.
- However, we note that 4.2 MB is dwarfed by the on-disk size of the model checkpoints themselves – storing the (IA)3 adaptation vectors for 10,000 tasks would take about as much space as the T0 checkpoint (41.5 GB).
Memory usage
- Because T0 is much smaller than the models required for ICL to work well, T-Few incurs a lower memory cost during inference. Additional memory costs are incurred when training T-Few due to the need to cache intermediate activations for backpropagation and for the gradient accumulator variables in Adafactor.
- it is possible to use the T-Few recipe on a single 80GB A100 GPU.
Performance on Real-world Few-shot Tasks (RAFT)
- The evaluations so far were not really designed for few-shot learning
- So far, we have evaluated performance on a collection of datasets that were not explicitly designed for benchmarking few-shot learning
- To better evaluate T-Few’s performance in the real world, we evaluated our approach on the RAFT benchmark
- RAFT consists of 11 “economically valuable” tasks that aim to mirror real-world applications.
- Importantly, each RAFT dataset has only 50 training examples with no validation set and a (larger) test set with no public labels, so it is impossible to “cheat” by tuning on an unrealistically large validation set or by peeking at the test set [32, 31].
- RAFT leaderboard
Ablation experiments
- While adding each ingredient does not always significantly increase the accuracy on each individual dataset, each ingredient consistently improves the average performance across datasets: removing pre-training decreases accuracy by 1.6%, removing unlikelihood training and length normalization decreases accuracy by 4.1%, and removing both pre-training and our additional loss terms reduces accuracy by 2.5%.
Conclusion
- We introduced T-Few, a parameter-efficient few-shot learning recipe that attains higher accuracy than few-shot ICL at a lower computational cost
- T-Few uses (IA)3, a new PEFT method that rescales inner activations with learned vectors.
- Using (IA)3 produces better performance than fine-tuning the full model while only introducing a tiny amount of additional parameters.
- T-Few also uses two additional loss terms that encourage the model to output lower probabilities for incorrect choices and account for the length of different answer choices
- we found that T-Few uses over 1,000× fewer FLOPs during inference than few-shot ICL with GPT-3 and only requires 30 minutes to train on a single NVIDIA A100 GPU
Appendix
Full Unlikelihood Training and Length Normalization Results
Full PEFT Results
[Table: results with vs. without L_UL and L_LN]
Comparing T-Few with few-shot ICL methods
Implementation
- The implementation modifies the existing LoRA code.
- t-few/configs/ia3.json
(IA)3 implementation snippet built on the LoRA code
```python
import re
import torch
import torch.nn as nn
import torch.nn.functional as F

def modify_with_lora(transformer, config):
    # Walk the module tree and swap matching nn.Linear layers for LoRALinear wrappers
    for m_name, module in dict(transformer.named_modules()).items():
        if re.fullmatch(config.lora_modules, m_name):
            for c_name, layer in dict(module.named_children()).items():
                if re.fullmatch(config.lora_layers, c_name):
                    assert isinstance(
                        layer, nn.Linear
                    ), f"LoRA can only be applied to torch.nn.Linear, but {layer} is {type(layer)}."
                    setattr(
                        module,
                        c_name,
                        LoRALinear(layer, config.lora_rank, config.lora_scaling_rank, config.lora_init_scale),
                    )
    return transformer

class LoRALinear(nn.Module):
    def __init__(self, linear_layer, rank, scaling_rank, init_scale):
        super().__init__()
        self.in_features = linear_layer.in_features
        self.out_features = linear_layer.out_features
        self.rank = rank
        self.scaling_rank = scaling_rank
        self.weight = linear_layer.weight
        self.bias = linear_layer.bias
        if self.rank > 0:
            self.lora_a = nn.Parameter(torch.randn(rank, linear_layer.in_features) * init_scale)
            if init_scale < 0:
                self.lora_b = nn.Parameter(torch.randn(linear_layer.out_features, rank) * init_scale)
            else:
                self.lora_b = nn.Parameter(torch.zeros(linear_layer.out_features, rank))
        if self.scaling_rank:
            self.multi_lora_a = nn.Parameter(
                torch.ones(self.scaling_rank, linear_layer.in_features)
                + torch.randn(self.scaling_rank, linear_layer.in_features) * init_scale
            )
            if init_scale < 0:
                self.multi_lora_b = nn.Parameter(
                    torch.ones(linear_layer.out_features, self.scaling_rank)
                    + torch.randn(linear_layer.out_features, self.scaling_rank) * init_scale
                )
            else:
                self.multi_lora_b = nn.Parameter(torch.ones(linear_layer.out_features, self.scaling_rank))

    def forward(self, input):
        if self.scaling_rank == 1 and self.rank == 0:
            # parsimonious implementation for ia3 and lora scaling
            if self.multi_lora_a.requires_grad:
                # This seems to be where the (IA)3 element-wise multiplication happens!!
                hidden = F.linear((input * self.multi_lora_a.flatten()), self.weight, self.bias)
            else:
                hidden = F.linear(input, self.weight, self.bias)
            if self.multi_lora_b.requires_grad:
                hidden = hidden * self.multi_lora_b.flatten()
            return hidden
        else:
            # general implementation for lora (adding and scaling)
            weight = self.weight
            if self.scaling_rank:
                weight = weight * torch.matmul(self.multi_lora_b, self.multi_lora_a) / self.scaling_rank
            if self.rank:
                weight = weight + torch.matmul(self.lora_b, self.lora_a) / self.rank
            return F.linear(input, weight, self.bias)
```
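A hedged usage sketch of the code above (the actual values live in t-few/configs/ia3.json; the regexes and field values below are illustrative assumptions chosen to trigger the `scaling_rank == 1 and rank == 0` branch, i.e. pure (IA)3 rescaling):

```python
from types import SimpleNamespace
from transformers import T5ForConditionalGeneration

# Illustrative config (field values assumed, not copied from ia3.json):
# rank 0 disables the low-rank update, scaling_rank 1 enables the rescaling vectors.
config = SimpleNamespace(
    lora_modules=".*SelfAttention|.*EncDecAttention|.*DenseReluDense",  # assumed regex
    lora_layers="k|v|wi_1.*",                                           # assumed regex
    lora_rank=0,
    lora_scaling_rank=1,
    lora_init_scale=0.0,
)

model = T5ForConditionalGeneration.from_pretrained("bigscience/T0_3B")
model = modify_with_lora(model, config)

# Freeze everything except the (IA)3 vectors
# (the actual recipe may restrict training to a subset of these, e.g. the output-side vectors only).
for name, param in model.named_parameters():
    param.requires_grad = "multi_lora" in name
```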
Below is the LoRA config; it differs slightly from the (IA)3 one.

```python
class LoRAConfig:
    ...
```