Posted 2022-08-01 | Updated 2022-08-30 | paper | 32 min read (about 4759 words)

Efficient Training of Language Models to Fill in the Middle (FIM)

Presentation slides

(발표)Efficient Training of Language Models to Fill in the Middle.pdf

Thoughts

First impression: roughly, an extensive validation of what is essentially a data augmentation technique..?
The paper is developed around the goal of enabling free-form generation.

Note

What does the 50% refer to?
- The fraction of the dataset that gets the FIM transformation (within FIM itself, the cut points are chosen at random)
What does caching mean for SPM?
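
As a minimal sketch of what the FIM rate means: each document is transformed with probability `fim_rate` and otherwise left as ordinary left-to-right text. The character-level split and the sentinel strings `<PRE>`, `<SUF>`, `<MID>` are assumptions for illustration, not the paper's actual tokens.

```python
import random

# Placeholder sentinel strings; the names are assumptions for illustration.
PRE, SUF, MID = "<PRE>", "<SUF>", "<MID>"


def split_three(doc, rng):
    """Cut doc at two random positions into (prefix, middle, suffix)."""
    i, j = sorted(rng.randint(0, len(doc)) for _ in range(2))
    return doc[:i], doc[i:j], doc[j:]


def maybe_fim(doc, rng, fim_rate=0.5):
    """With probability fim_rate, move the middle span to the end (PSM order)."""
    if rng.random() >= fim_rate:
        return doc  # left untouched: ordinary autoregressive example
    prefix, middle, suffix = split_three(doc, rng)
    return PRE + prefix + SUF + suffix + MID + middle


rng = random.Random(0)
docs = ["def add(a, b):\n    return a + b\n"] * 4
for d in docs:
    print(repr(maybe_fim(d, rng)))
```

With `fim_rate=0.5`, about half of the documents come out rearranged as prefix, suffix, middle, and the rest are unchanged.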

Author


Posted 2022-07-13 | Updated 2022-08-30 | paper | 13 min read (about 1898 words)

(ALiBi) TRAIN SHORT, TEST LONG: ATTENTION WITH LINEAR BIASES ENABLES INPUT LENGTH EXTRAPOLATION

Ref

Author

Authors: Ofir Press1,2 Noah A. Smith1,3 Mike Lewis2
1Paul G. Allen School of Computer Science & Engineering, University of Washington 2Facebook AI Research 3Allen Institute for AI

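Summary

Extrapolates well to longer inputs
11% faster and uses 11% less memory
Perplexity holds up even when training on shorter sequences, compared with a model trained at the full length
Simple to implement
Drop the position embeddings and instead penalize attention scores linearly in proportion to the query-key distance!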
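
The last point can be sketched in plain Python. The slope schedule follows the geometric sequence described in the ALiBi paper (for 8 heads: 1/2, 1/4, ..., 1/256) and assumes the head count is a power of two. The function names are made up for illustration.

```python
def alibi_slopes(n_heads):
    """Geometric head slopes; for 8 heads: 1/2, 1/4, ..., 1/256."""
    start = 2 ** (-8 / n_heads)  # assumes n_heads is a power of two
    return [start ** (h + 1) for h in range(n_heads)]


def alibi_bias(seq_len, slope):
    """Causal bias: query i attending to key j <= i gets slope * (j - i) <= 0."""
    neg_inf = float("-inf")  # future positions are masked out
    return [
        [slope * (j - i) if j <= i else neg_inf for j in range(seq_len)]
        for i in range(seq_len)
    ]


def biased_scores(scores, slope):
    """Add the linear bias to raw query-key scores before the softmax."""
    bias = alibi_bias(len(scores), slope)
    return [[s + b for s, b in zip(row, brow)] for row, brow in zip(scores, bias)]


print(alibi_slopes(8)[:3])   # [0.5, 0.25, 0.125]
print(alibi_bias(3, 0.5)[2])  # [-1.0, -0.5, 0.0]
```

No position embedding is added to the token embeddings. The only positional signal is this per-head penalty, which grows with distance.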

Abstract


Posted 2022-06-20 | Updated 2022-08-30 | paper | 15 min read (about 2211 words)

COCO-LM: Correcting and Contrasting Text Sequences for Language Model Pretraining

Paper file

COCO-LM- Correcting and Contrasting Text Sequences for Language Model Pretraining.pdf

Ref

github: https://github.com/microsoft/COCO-LM
Presentation slides: COCO_LM_220622.pdf

Author

Authors: Yu Meng1∗, Chenyan Xiong2, Payal Bajaj2, Saurabh Tiwary2, Paul Bennett2, Jiawei Han1, Xia Song2
1 University of Illinois at Urbana-Champaign 2 Microsoft
- NeurIPS 2021 paper

Summary


Posted 2022-05-23 | Updated 2022-08-30 | paper | 18 min read (about 2671 words)

Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism

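Terminology

reduce -> combines the values held by each process with an operation and collects the result on one designated process (into a single value)
all+* -> when the name is prefixed with all, every participating process receives the same result
all reduce -> the reduced value is delivered so that every participating process receives the same result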
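
The difference between the two terms can be shown with a single-process simulation. This only illustrates the semantics. It is not a real communication library, and the function names are made up for this sketch.

```python
from functools import reduce as fold
import operator


def reduce_to_root(values, op=operator.add, root=0):
    """Simulated reduce: only the root rank ends up holding the combined value."""
    total = fold(op, values)
    return [total if rank == root else None for rank in range(len(values))]


def all_reduce(values, op=operator.add):
    """Simulated all-reduce: every rank ends up holding the same combined value."""
    total = fold(op, values)
    return [total for _ in values]


per_rank = [1, 2, 3, 4]          # one local value per process
print(reduce_to_root(per_rank))  # [10, None, None, None]
print(all_reduce(per_rank))      # [10, 10, 10, 10]
```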

Paper file

Megatron-LM- Training Multi-Billion Parameter Language Models Using Model Parallelism.pdf

Ref

https://www.youtube.com/watch?v=w4a-ARCEiqU

Author


Posted 2022-05-19 | Updated 2022-08-30 | paper | 14 min read (about 2164 words)

RoBERTa: A Robustly Optimized BERT Pretraining Approach

Author

Authors: Yinhan Liu∗§ Myle Ott∗§ Naman Goyal∗§ Jingfei Du∗§ Mandar Joshi† Danqi Chen§ Omer Levy§ Mike Lewis§ Luke Zettlemoyer†§ Veselin Stoyanov§
- † Paul G. Allen School of Computer Science & Engineering, University of Washington, Seattle, WA
- § Facebook AI

Thoughts

Abstract

hyperparameter choices have significant impact on the final results
carefully measures the impact of many key hyperparameters and training data size
find that BERT was significantly undertrained, and can match or exceed the performance of every model published after it

Introduction

We present a replication study of BERT pretraining (Devlin et al., 2019), which includes a careful evaluation of the effects of hyperparameter tuning and training set size.
modifications
- (1) training the model longer, with bigger batches, over more data;
- (2) removing the next sentence prediction objective;
- (3) training on longer sequences; and
- (4) dynamically changing the masking pattern applied to the training data.
contributions
- (1) We present a set of important BERT design choices and training strategies and introduce alternatives that lead to better downstream task performance;
- (2) We use a novel dataset, CC-NEWS, and confirm that using more data for pretraining further improves performance on downstream tasks;
- (3) Our training improvements show that masked language model pretraining, under the right design choices, is competitive with all other recently published methods.
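
Modification (4) above, dynamic masking, can be sketched as follows: the masked positions are re-sampled every time a sequence is fed to the model instead of being fixed once during preprocessing. The 15% rate and the 80/10/10 replacement split follow the original BERT recipe and are assumptions relative to this note. The function name is made up for illustration.

```python
import random

MASK = "[MASK]"


def dynamic_mask(tokens, vocab, rng, mask_prob=0.15):
    """Pick fresh positions on every call; 80% [MASK], 10% random, 10% unchanged."""
    k = max(1, round(len(tokens) * mask_prob))
    chosen = set(rng.sample(range(len(tokens)), k))
    inputs, labels = [], []
    for i, tok in enumerate(tokens):
        if i not in chosen:
            inputs.append(tok)
            labels.append(None)  # position is not predicted
            continue
        labels.append(tok)  # the model must recover the original token
        r = rng.random()
        if r < 0.8:
            inputs.append(MASK)
        elif r < 0.9:
            inputs.append(rng.choice(vocab))
        else:
            inputs.append(tok)
    return inputs, labels


tokens = "the quick brown fox jumps over the lazy dog".split()
rng = random.Random(0)
for epoch in range(3):  # positions are re-sampled each time the sequence is seen
    print(dynamic_mask(tokens, tokens, rng)[0])
```

Static masking would call this once per sequence and reuse the result for every epoch.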


Posted 2022-01-10 | Updated 2022-08-30 | paper | 2 min read (about 369 words)

A Neural Network Solves and Generates Mathematics Problems by Program Synthesis: Calculus, Differential Equations, Linear Algebra, and More

Author

Authors:
- Iddo Drori1,a,b, Sunny Trana, Roman Wangb, Newman Chengb, Kevin Liua, Leonard Tangc, Elizabeth Kea, Nikhil Singha, Taylor L. Pattic, Jayson Lynchd, Avi Shporera, Nakul Vermab, Eugene Wub, and Gilbert Strang (yes, that famous Gilbert Strang..)a
- aMIT; bColumbia University; cHarvard University; dUniversity of Waterloo

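Thoughts

Learned that large-scale models can do more than I expected.. fine-tune them on code and they even produce code that solves math problems (though such code was probably already on GitHub..!)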

Abstract

Argues that a model pretrained on text and fine-tuned on code (the Codex Transformer model) can solve math problems through program synthesis
Also a study on generating university-level mathematics course questions (?)

Introduction


Posted 2021-12-20 | Updated 2022-08-30 | paper | 12 min read (about 1834 words)

CLINE: Contrastive Learning with Semantic Negative Examples for Natural Language Understanding

Author

Authors:
- Dong Wang1,2∗ , Ning Ding1,2∗, Piji Li3† , Hai-Tao Zheng1,2†
- 1Department of Computer Science and Technology, Tsinghua University 2Tsinghua ShenZhen International Graduate School, Tsinghua University 3Tencent AI Lab
- Hard to find on Google Scholar

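Thoughts

This paper seems to use "adversarial" for same-meaning examples and "contrastive" for opposite-meaning examples..
The key idea: when training a PLM, the second sentence of the pair is not an arbitrary sentence but one that is semantically different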
https://github.com/kandorm/CLINE

Abstract

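PLMs produce high-quality semantic representations but are vulnerable to simple perturbations
Work on making PLMs robust has focused on adversarial training
Unlike image processing, text is discrete, so replacing just a few words can change the meaning drastically
To study this effect, the authors ran several pilot experiments on perturbations
They found that adversarial training is useless or even harmful to the model at detecting such semantic changes
To address this, they propose Contrastive Learning with semantIc Negative Examples (CLINE)
Semantically negative examples are constructed in an unsupervised manner, aiming to improve robustness under semantically adversarial attacks
Empirically, it improves tasks such as sentiment analysis, reasoning, and MRC
At the sentence level, CLINE also separates different meanings and pulls together the same meaning (presumably referring to the embeddings..)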
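
The "pull together, push apart" behaviour can be sketched with a generic contrastive loss over one positive (same meaning) and one negative (different meaning). This is a sketch under assumptions and not necessarily the paper's exact objective: the toy vectors and the dot-product similarity are chosen only for illustration.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def contrastive_loss(h_ori, h_pos, h_neg):
    """-log( exp(s_pos) / (exp(s_pos) + exp(s_neg)) ) with dot-product similarity."""
    s_pos, s_neg = dot(h_ori, h_pos), dot(h_ori, h_neg)
    m = max(s_pos, s_neg)  # subtract the max for numerical stability
    log_denom = m + math.log(math.exp(s_pos - m) + math.exp(s_neg - m))
    return log_denom - s_pos


h_ori = [1.0, 0.0]
good = contrastive_loss(h_ori, h_pos=[0.9, 0.1], h_neg=[-0.8, 0.2])
bad = contrastive_loss(h_ori, h_pos=[-0.8, 0.2], h_neg=[0.9, 0.1])
print(good < bad)  # True
```

The loss is small when the original sits close to the same-meaning example and far from the different-meaning one, and large when the two are swapped.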

Introduction


Posted 2021-12-13 | Updated 2022-08-30 | paper | 5 min read (about 812 words)

GPT Understands, Too

Author

Authors:
- Xiao Liu* 1 2 Yanan Zheng* 1 2 Zhengxiao Du1 2

Thoughts

Neural models are.. overly sensitive to small changes?!

Abstract

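For GPT-style models, conventional fine-tuning has struggled to produce good results on NLU tasks
The paper proposes a new method, p-tuning, which lets GPT models of a size similar to BERT achieve good results
Reaches 64% (P@1) on the knowledge probing (LAMA) benchmark, and on SuperGLUE it outperforms BERT in the supervised setting
P-tuning is also found to improve BERT (in both few-shot and supervised settings)
P-tuning is SOTA on few-shot SuperGLUE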

Introduction


Posted 2021-12-06 | Updated 2022-08-30 | paper | 7 min read (about 1089 words)

WARP: Word-level Adversarial ReProgramming

Author

Authors:
- Karen Hambardzumyan1, Hrant Khachatrian1,2, Jonathan May3 (1YerevaNN, 2Yerevan State University,
  3Information Sciences Institute, University of Southern California), 2021

Thoughts

PET + p-tuning

Abstract

Most transfer learning maximizes parameter sharing, training one or more task-specific layers stacked on top of the LM
This paper takes a different route: adversarial reprogramming, building on prior work on automatic prompt generation
Adversarial reprogramming learns task-specific word embeddings which, when concatenated with a given input text, make the LM solve the specified task (maybe this is why it is called an extension of prompt research..)
With 25K trainable params, it outperforms models with up to 25M trainable params (on the GLUE benchmark)
With task-specific human-readable prompts, it outperforms GPT-3 on 2 SuperGLUE tasks in a few-shot setting (32 training samples)

Introduction


Posted 2021-11-29 | Updated 2022-08-30 | paper | 7 min read (about 1079 words)

Knowledge Graph Based Synthetic Corpus Generation for Knowledge-Enhanced Language Model Pre-training

Author

Authors:
- Oshin Agarwal∗1 Heming Ge2 Siamak Shakeri2 Rami Al-Rfou2
  (1 University of Pennsylvania 2Google Research), 2021

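Thoughts

I wondered whether there had been less prior work on turning KG triples into natural-language sentences than I expected (or perhaps it just had not worked well, e.g. the facts could not be expressed explicitly)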

Abstract

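Work on converting KG triples to natural language (Data-To-Text Generation) has mostly centered on domain-specific benchmark datasets
This work shows that a dataset like Wikidata can also be used to combine structured KGs with natural language corpora (and can be integrated with existing LMs)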

Introduction
