Posted 2022-06-20 · Updated 2022-08-30 · paper · 15 min read (about 2211 words)

COCO-LM: Correcting and Contrasting Text Sequences for Language Model Pretraining

Paper file

COCO-LM- Correcting and Contrasting Text Sequences for Language Model Pretraining.pdf

Ref

github: https://github.com/microsoft/COCO-LM
Presentation slides: COCO_LM_220622.pdf

Author

Authors: Yu Meng¹*, Chenyan Xiong², Payal Bajaj², Saurabh Tiwary², Paul Bennett², Jiawei Han¹, Xia Song²
¹ University of Illinois at Urbana-Champaign, ² Microsoft
- NeurIPS 2021 paper

Summary

A paper that improves on ELECTRA
- The papers came out roughly in the order ELECTRA, COCO-LM, SimCSE
- Folds RTD into a copy mechanism and trains All-Token MLM as multi-task learning
Adds sentence similarity to the self-supervised learning of the PLM
- present a self-supervised learning framework, COCO-LM, that pretrains Language Models by COrrecting and COntrasting corrupted text sequences

This is roughly an extension, or a variant, of the ELECTRA paper.
- ELECTRA's contribution: the share of tokens that provide a training signal goes from 15% to 100%, which improves efficiency
  - experiments with an input-token copy mechanism
- Uses tokens produced by the generator instead of [MASK] tokens, which removes the pre-train/fine-tune discrepancy caused by [MASK]
- Proposes the Replaced Token Detection (RTD) task

Abstract

present a self-supervised learning framework, COCO-LM, that pretrains Language Models by COrrecting and COntrasting corrupted text sequences
Following ELECTRA-style pretraining, COCO-LM employs an auxiliary language model to corrupt text sequences, upon which it constructs two new tasks for pretraining the main model
- The first token-level task, Corrective Language Modeling, is to detect and correct tokens replaced by the auxiliary model, in order to better capture token-level semantics.
- The second sequence-level task, Sequence Contrastive Learning, is to align text sequences originated from the same source input while ensuring uniformity in the representation space
- achieves the MNLI accuracy of ELECTRA with 50% of its pretraining GPU hours. With the same pretraining steps of standard base/large-sized models, COCO-LM outperforms the previous best models by 1+ GLUE average points.

Introduction

(Not sure what this means..) ELECTRA uses an auxiliary language model (“generator”) to replace tokens in input texts and pretrains the main Transformer (“discriminator”) to detect replaced tokens. This improves the pretraining efficiency and effectiveness, but pretraining via binary classification hinders the model’s usage on applications requiring language modeling capability (e.g., prompt-based learning [15, 28, 46]). It could further distort the representation space as the Transformers are pretrained to output the same “non-replacement” label for all actual tokens.
present a new self-supervised learning approach, COCO-LM, that pretrains Language Models by COrrecting and COntrasting corrupted text sequences
Following ELECTRA-style pretraining, COCO-LM employs an auxiliary model to corrupt the input texts, upon which it introduces two new pretraining tasks for the main Transformer, one at token level and one at sequence level.
The token-level task, corrective language modeling (CLM), pretrains the main Transformer to detect and correct the tokens in the corrupted sequences. It uses a multi-task setup to combine the benefits of replaced token detection and language modeling.
The sequence-level task, sequence contrastive learning (SCL), pretrains the model to align text sequences originated from the same source sequence and enforce uniformity of the representation space
GLUE [54] and SQuAD [41] benchmarks, COCO-LM not only outperforms state-of-the-art pretraining approaches in effectiveness, but also significantly improves the pretraining efficiency

Empirically, MLM is still among the most effective tasks to pretrain encoders
Instead of randomly altering texts, ELECTRA [7] uses a smaller auxiliary Transformer pretrained by MLM to replace some tokens in the text sequences using its language modeling probability, and pretrains the main Transformer to detect the replaced tokens. ELECTRA achieves state-of-the-art accuracy in many language tasks [7]. Later, Clark et al. [6] developed ELECTRIC, which pretrains encoders by contrasting original tokens against negatives sampled from a cloze model. ELECTRIC re-enables the language modeling capability but underperforms ELECTRA in downstream tasks.
Our work is also related to contrastive learning, which has shown great success in visual representation learning [4, 22, 34]. Its effectiveness in language is observed more in the fine-tuning stage, for example, in sentence representation [16], dense retrieval [60], and GLUE fine-tuning [19].

Method

We present the preliminaries of PLMs, their challenges, and the new COCO-LM framework.

Preliminary on Language Model Pretraining

In this work we focus on pretraining BERT-style bidirectional Transformer encoders
first recap the masked language modeling (MLM) task introduced by BERT [11] and then discuss the pretraining framework of ELECTRA

BERT Pretraining
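
As a reminder of what this subsection recaps, the MLM objective can be written roughly as below. This is a sketch under assumed notation that does not appear in these notes: $x_i^{\text{orig}}$ is the original token at position $i$, $\mathcal{M}$ is the set of masked positions, $\boldsymbol{h}_i$ is the Transformer output at position $i$, $\boldsymbol{x}_t$ is the embedding of vocabulary token $t$, and $V$ is the vocabulary.

```latex
p_{\text{MLM}}(x_t \mid \boldsymbol{h}_i)
  = \frac{\exp(\boldsymbol{x}_t^\top \boldsymbol{h}_i)}
         {\sum_{t'=1}^{|V|} \exp(\boldsymbol{x}_{t'}^\top \boldsymbol{h}_i)},
\qquad
\mathcal{L}_{\text{MLM}}
  = \mathbb{E}\Big( -\sum_{i \in \mathcal{M}}
      \log p_{\text{MLM}}\big(x_i^{\text{orig}} \mid \boldsymbol{h}_i\big) \Big)
```

Only the masked positions in $\mathcal{M}$ contribute to the loss, which is the 15% figure mentioned in the summary.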

ELECTRA Pretraining
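
The ELECTRA setup described in the introduction can be sketched in two steps, again under assumed notation: the auxiliary model fills the masked positions $\mathcal{M}$ by sampling from its MLM distribution, giving a corrupted sequence $X^{\text{MLM}}$, and the main Transformer (with its own outputs $\boldsymbol{h}_i$ and a learned vector $\boldsymbol{w}$, both hypothetical symbol names here) classifies every position as original or replaced.

```latex
x_i^{\text{MLM}} \sim p_{\text{MLM}}(x \mid \boldsymbol{h}_i) \ \text{if } i \in \mathcal{M},
\qquad
x_i^{\text{MLM}} = x_i^{\text{orig}} \ \text{otherwise}

p_{\text{RTD}}\big(x_i^{\text{MLM}} = x_i^{\text{orig}} \mid \boldsymbol{h}_i\big)
  = \frac{\exp(\boldsymbol{w}^\top \boldsymbol{h}_i)}{1 + \exp(\boldsymbol{w}^\top \boldsymbol{h}_i)}

\mathcal{L}_{\text{RTD}}
  = \mathbb{E}\Big(
      -\sum_{x_i^{\text{MLM}} = x_i^{\text{orig}}}
         \log p_{\text{RTD}}\big(x_i^{\text{MLM}} = x_i^{\text{orig}} \mid \boldsymbol{h}_i\big)
      -\sum_{x_i^{\text{MLM}} \neq x_i^{\text{orig}}}
         \log\Big(1 - p_{\text{RTD}}\big(x_i^{\text{MLM}} = x_i^{\text{orig}} \mid \boldsymbol{h}_i\big)\Big)
    \Big)
```

Because the sums run over all positions rather than only $\mathcal{M}$, every token provides a training signal, but the main model only ever learns a binary decision.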

Challenges of ELECTRA-Style Pretraining

Missing Language Modeling Benefits.
- classification task in ELECTRA is simpler and more stable [61], but raises two challenges.
  - first is the lack of language modeling capability which is a necessity in some tasks [6]. For example, prompt-based learning requires a language model to generate labels
  - second is that the binary classification task may not be sufficient to capture certain word-level semantics that are critical for token-level tasks
Squeezing Representation Space
- the representations from Transformer-based language models often reside in a narrow cone, where two random sentences have high similarity scores (lack of uniformity)
- closely related sentences may have more different representations (lack of alignment)
- Figure 1 illustrates such behaviors with random sentence pairs (from pretraining corpus) and semantically similar pairs (those annotated with maximum similarity from STS-B [3]). With RoBERTa, the cosine similarities of most random sentence pairs are near 0.8, bigger than many semantically similar pairs. The representation space from ELECTRA is even more squeezed. Nearly all sentence pairs, both random and similar ones, have around 0.9 cosine similarity. This may not be surprising as ELECTRA is pretrained to predict the same output (“non-replacement”) for all tokens in these sequences. The irregular representation space raises the risk of degeneration [37, 55] and often necessitates sophisticated post-adjustment or fine-tuning to improve the sequence representations [16, 30, 32, 60].
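
The kind of check behind Figure 1 can be reproduced in miniature: take a set of sentence vectors and look at the average cosine similarity over all pairs. The sketch below uses made-up toy vectors (not real model outputs), and the function names are hypothetical; it only shows that vectors packed into a narrow cone give a high average similarity even when they are unrelated.

```python
import math
from itertools import combinations


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mean_pairwise_cosine(vectors):
    """Average cosine similarity over all unordered pairs of vectors."""
    pairs = list(combinations(vectors, 2))
    return sum(cosine_similarity(u, v) for u, v in pairs) / len(pairs)


# Toy data (assumption): a "squeezed" space where every vector points
# in nearly the same direction, and a "spread" space with orthogonal vectors.
squeezed = [[1.0, 0.1, 0.0], [1.0, 0.0, 0.1], [1.0, 0.1, 0.1]]
spread = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

squeezed_mean = mean_pairwise_cosine(squeezed)
spread_mean = mean_pairwise_cosine(spread)
```

With real encoders, `vectors` would be the sequence representations of random sentence pairs; a mean close to 1 is the lack of uniformity described above.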

COCO-LM Pretraining

The auxiliary Transformer is pretrained by masked language modeling (MLM) and generates corrupted sequences.
The main Transformer is pretrained to correct the corruption (CLM) and to contrast the corrupted sequences with the cropped sequences (SCL)
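
To make the SCL part more concrete, here is a minimal pure-Python sketch of an in-batch contrastive loss over two views of each source sequence (the corrupted one and the cropped one). It is an illustration under assumptions, not the official implementation: the function names, the choice of negatives (every other representation in the batch, from both views), and the default temperature are all assumptions of this sketch.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def scl_loss(corrupted, cropped, temperature=1.0):
    """Symmetric contrastive loss.

    corrupted[i] and cropped[i] are representations of two views of the same
    source sequence i (the positive pair). All other representations in the
    batch, from both views, act as negatives.
    """
    n = len(corrupted)
    total = 0.0
    for anchors, positives in ((corrupted, cropped), (cropped, corrupted)):
        for i, anchor in enumerate(anchors):
            pos = math.exp(cosine(anchor, positives[i]) / temperature)
            neg = sum(
                math.exp(cosine(anchor, other[j]) / temperature)
                for other in (anchors, positives)
                for j in range(n)
                if j != i
            )
            total -= math.log(pos / (pos + neg))
    return total / (2 * n)


# Toy representations (assumption): two source sequences, two views each.
source_reps = [[1.0, 0.0], [0.0, 1.0]]
aligned = scl_loss(source_reps, [[0.9, 0.1], [0.1, 0.9]])
misaligned = scl_loss(source_reps, [[0.1, 0.9], [0.9, 0.1]])
```

The loss is lower when the two views of the same source are close to each other and far from everything else in the batch, which is the alignment-plus-uniformity behavior SCL is meant to encourage.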

Network Configurations

auxiliary model
- Similar to ELECTRA, the auxiliary Transformer is smaller than the main model
- We reduce the number of layers to 1/3 or 1/4 of the main model (under base or large model setup, respectively) but keep its hidden dimension the same as that of the main model, instead of shrinking it
- We disable dropout in it when sampling replacement tokens.
main model
- standard architecture of BERT/ELECTRA
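
The layer rule above can be written down as a tiny config helper. This is only a sketch: `EncoderConfig` and `auxiliary_config` are hypothetical names, and the 12-layer/768 and 24-layer/1024 main-model sizes are the base and large sizes listed in the experimental setup of these notes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EncoderConfig:
    num_layers: int
    hidden_size: int


def auxiliary_config(main: EncoderConfig, layer_divisor: int) -> EncoderConfig:
    """Shrink depth by `layer_divisor` but keep the hidden size of the main model."""
    return EncoderConfig(
        num_layers=main.num_layers // layer_divisor,
        hidden_size=main.hidden_size,
    )


base_main = EncoderConfig(num_layers=12, hidden_size=768)
large_main = EncoderConfig(num_layers=24, hidden_size=1024)

base_aux = auxiliary_config(base_main, layer_divisor=3)    # 1/3 of the layers
large_aux = auxiliary_config(large_main, layer_divisor=4)  # 1/4 of the layers
```

Keeping the hidden size fixed means the auxiliary model differs from the main model only in depth.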

Experimental Setup

Pretraining Settings
- base, base++, and large++. Base is the BERT Base training configuration [11]: pretraining on Wikipedia and BookCorpus [63] (16 GB of text) for 256 million samples of 512-token sequences
- 32,768-token uncased BPE vocabulary
Model Architecture
- base/base++ model uses the BERT Base architecture [11]: 12-layer Transformer, 768 hidden size, plus T5 relative position encoding.
- large++ model is the same as BERT Large: 24 layers and 1024 hidden size, plus T5 relative position encoding
- auxiliary network uses the same hidden size but a shallow 4-layer Transformer in base/base++ and a 6-layer one in large++. When generating X^MLM (the MLM-corrupted sequence), dropout is disabled in the auxiliary model
Downstream Tasks
- GLUE [54] and SQuAD 2.0
- Standard hyperparameter search in fine-tuning is performed, and the search space can be found in Appendix B.
- reported results are the median of five random seeds on GLUE and SQuAD

Evaluation Results

COCO-LM outperforms all recent state-of-the-art pretraining models on GLUE average and SQuAD

Efficiency

COCO-LM is more efficient in GPU hours. It outperforms RoBERTa & ELECTRA by 1+ points

Ablation Studies

base setting on GLUE DEV
A bit different from what I expected..

Architecture.

Removing relative position encoding (Rel-Pos) leads to better numbers on some tasks but significantly hurts MNLI.

Pretraining Signal Construction.

Using randomly replaced tokens to corrupt text sequence hurts significantly. Using a converged auxiliary network to pretrain the main model also hurts. It is better to pretrain the two Transformers together

CLM Setup.

Disabling the multi-task learning and using All-Token MLM [7] reduces model accuracy.
The copy mechanism is effective. The benefits of the stop gradient operation are more on stability (preventing training divergence).

Analyses of Contrastive Learning with SCL

Ablation on Data Augmentation

Alignment and Uniformity

The representation space from COCO-LM is drastically different from those in Figure 1
With COCO-LM, similar pairs are more aligned and random pairs are distributed more uniformly
For the similar pairs, the average cosine similarity is 0.925 when pretrained with SCL and 0.863 without SCL. COCO-LM achieves this better alignment and uniformity through SCL during pretraining

Regularizing the Representation Learning for Better Few-Shot Ability.

SCL is necessary to regularize the representation space and to reduce the risk of degeneration

Analyses of Language Modeling with CLM

Comparing CLM and All-Token MLM
It is quite an unbalanced task
- For the majority of the tokens (Original) the task is simply to copy its input at the same position.
- For the replaced tokens (7-8% of all tokens), however, the model needs to detect the abnormality brought by the auxiliary model and recover the original token
- Implicitly training the copy mechanism as part of the hard LM task is not effective: The copy accuracy of All-Token MLM is much lower, and thus the LM head may confuse original tokens with replaced ones
  - As shown in Table 3 and ELECTRA [7], pretraining with All-Token MLM performs worse than using the RTD task, though the latter is equivalent to only training the copy mechanism
  - The multi-task learning of CLM is necessary for the main Transformer to stably learn the language modeling task upon the corrupted text sequence.
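
The split between "copy the input" and "predict with the LM head" can be sketched for a single position as follows. This is an illustration under assumptions rather than the paper's exact code: `copy_prob` stands for the model's probability that the input token is original, `lm_probs` for the LM head's distribution, and `copy_weight` is a hypothetical weighting between the two losses. A real implementation would also apply a stop-gradient to `copy_prob` inside the LM term, which plain Python cannot express.

```python
import math


def clm_token_probability(copy_prob, lm_probs, input_token, target_token):
    """P(target) = 1[target == input] * p_copy + (1 - p_copy) * p_LM(target)."""
    copied = copy_prob if target_token == input_token else 0.0
    return copied + (1.0 - copy_prob) * lm_probs[target_token]


def clm_position_loss(copy_prob, lm_probs, input_token, original_token, copy_weight=1.0):
    """LM loss on the original token plus a binary copy/replace detection loss."""
    is_original = input_token == original_token
    copy_loss = -math.log(copy_prob if is_original else 1.0 - copy_prob)
    lm_loss = -math.log(
        clm_token_probability(copy_prob, lm_probs, input_token, original_token)
    )
    return lm_loss + copy_weight * copy_loss


# Toy position (assumption): the auxiliary model replaced "cat" with "dog".
lm_probs = {"cat": 0.7, "dog": 0.2, "car": 0.1}
detects_replacement = clm_position_loss(0.1, lm_probs, "dog", "cat")
misses_replacement = clm_position_loss(0.9, lm_probs, "dog", "cat")
total_mass = sum(clm_token_probability(0.9, lm_probs, "dog", t) for t in lm_probs)
```

The mixture still sums to one over the vocabulary, and the loss is much smaller when the copy head correctly flags the replaced token, so the detection signal and the correction signal reinforce each other.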

Prompt-Based Fine-Tuning with CLM

the prompt-based fine-tuning experiments on MNLI for RoBERTa and COCO-LM under base++ and large++ sizes
COCO-LM’s main Transformer does not even see any [MASK] tokens during pretraining but still performs well on predicting masked tokens for prompt-based learning.
Note that ELECTRA and COCO-LM variants without the CLM task are not applicable: Their main Transformers are not pretrained by language modeling tasks (thus no language modeling capability is learned to generate prompt label words).
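
Why a language modeling head is needed here can be seen from how prompt-based classification usually reads out a label: the model scores a few label words at the [MASK] position, and the best-scoring word decides the class. The sketch below assumes hypothetical label words ("Yes", "Maybe", "No") and made-up logits; it is not taken from the paper's experimental code.

```python
import math


def classify_with_label_words(mask_logits, label_words):
    """Softmax over the label words only, using the LM logits at the [MASK] position."""
    scores = {label: mask_logits[word] for label, word in label_words.items()}
    max_score = max(scores.values())  # subtract the max for numerical stability
    exp_scores = {label: math.exp(s - max_score) for label, s in scores.items()}
    total = sum(exp_scores.values())
    probs = {label: value / total for label, value in exp_scores.items()}
    return max(probs, key=probs.get), probs


# Hypothetical LM logits at the [MASK] position of an MNLI-style prompt.
mask_logits = {"Yes": 2.0, "Maybe": 0.5, "No": -1.0, "the": 3.0}
label_words = {"entailment": "Yes", "neutral": "Maybe", "contradiction": "No"}

prediction, probabilities = classify_with_label_words(mask_logits, label_words)
```

A model trained only with a binary replaced/original head has no vocabulary-sized output to supply `mask_logits`, which is the limitation noted above.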

Conclusion and Future Work

we present COCO-LM, which pretrains language models using Corrective Language Modeling and Sequence Contrastive Learning upon corrupted text sequences
With standard pretraining data and Transformer architectures, COCO-LM improves the accuracy on the GLUE and SQuAD benchmarks, while also being more efficient in utilizing pretraining computing resources and network parameters
One limitation of this work is that the contrastive pairs are constructed by simple cropping and MLM replacements
To better understand and tailor the training of the auxiliary model to the main model is another important future research direction

Code implementation

Loss-related code snippet: https://github.com/microsoft/COCO-LM/issues/2#issuecomment-1003639940
SCL part (the embedding is extracted once from the span and once from src)
Location of the code above: https://github.com/microsoft/COCO-LM/blob/6bb6e5f62d65349657dd51f2f535454a1c50c2e9/fairseq/fairseq/models/cocolm/model.py#L190
unofficial implementation: https://github.com/lucidrains/coco-lm-pytorch/blob/main/coco_lm_pytorch/coco_lm_pytorch.py

COCO-LM: Correcting and Contrasting Text Sequences for Language Model Pretraining

https://eagle705.github.io/COCO-LM-Correcting and Contrasting Text Sequences for Language Model Pretraining/

Author

Joosung Yoon

Posted on

2022-06-20

Updated on

2022-08-30

Licensed under

#nlp
