(Chinchilla) Training Compute-Optimal Large Language Models

Note

Author

  • Jordan Hoffmann★, Sebastian Borgeaud★, Arthur Mensch★, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, Tom Hennigan, Eric Noland, Katie Millican, George van den Driessche, Bogdan Damoc, Aurelia Guy, Simon Osindero, Karen Simonyan, Erich Elsen, Jack W. Rae, Oriol Vinyals and Laurent Sifre★ (★Equal contributions)
    • DeepMind

Abstract

  • investigate the optimal model size and number of tokens for training a transformer language model
  • By training over 400 language models ranging from 70 million to over 16 billion parameters on 5 to 500 billion tokens, find that for compute-optimal training, the model size and the number of training tokens should be scaled equally
    • Model size and number of training tokens should be scaled in equal proportion
  • for every doubling of model size the number of training tokens should also be doubled (see the budget-splitting sketch after this list)
  • test this hypothesis by training a predicted compute-optimal model, Chinchilla, that uses the same compute budget as Gopher but with 70B parameters and 4× more data
  • Chinchilla uniformly and significantly outperforms Gopher (280B), GPT-3 (175B), Jurassic-1 (178B), and Megatron-Turing NLG (530B) on a large range of downstream evaluation tasks.
    • So Chinchilla beats all of them on downstream tasks?
  • Chinchilla reaches a state-of-the-art average accuracy of 67.5% on the MMLU benchmark, greater than a 7% improvement over Gopher
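
As a rough illustration of the equal-scaling rule above, the sketch below splits a FLOP budget between parameters and tokens using the common C ≈ 6·N·D approximation and the oft-quoted ~20 tokens-per-parameter ratio; the constants are approximate and not taken verbatim from the paper.

# Rough sketch (not from the paper): allocate a FLOP budget between model size N
# and training tokens D using C ~= 6 * N * D and the approximate rule D ~= 20 * N.
def compute_optimal_allocation(flop_budget, tokens_per_param=20.0):
    # C = 6 * N * D and D = tokens_per_param * N  =>  N = sqrt(C / (6 * tokens_per_param))
    n_params = (flop_budget / (6.0 * tokens_per_param)) ** 0.5
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

# A Gopher/Chinchilla-scale budget (~5.8e23 FLOPs) lands near 70B params / 1.4T tokens.
n, d = compute_optimal_allocation(5.76e23)
print(f"params ~ {n / 1e9:.0f}B, tokens ~ {d / 1e12:.2f}T")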

Introduction


Robust Conversational Agents against Imperceptible Toxicity Triggers

Note

Author

  • Ninareh Mehrabi1, Ahmad Beirami2∗, Fred Morstatter1, Aram Galstyan1
  • 1University of Southern California - Information Sciences Institute 2Meta AI

Abstract

  • Recent NLP work has made progress on various forms of toxicity detection
    • toxicity detection models with the intention of identifying and mitigating toxic language from existing systems.
  • Despite that body of work, adversarial attacks that force systems to generate toxic language, and defenses against them, remain under-explored
    • adversarial attacks that force the system to generate toxic language and the defense against them
  • Prior attacks have mostly relied on human-written attack sentences, which is costly and does not scale
  • Automatically generated attacks, on the other hand, do not read like human language, so they can be detected with a language-model loss (see the perplexity-filter sketch after this list)
    • Existing work to generate such attacks is either based on human-generated attacks which is costly and not scalable or, in case of automatic attacks, the attack vector does not conform to human-like language, which can be detected using a language model loss
  • This work proposes attacks on conversational agents that are imperceptible (unlike the automatic attacks above) in terms of coherency, relevancy, and fluency
    • propose attacks against conversational agents that are imperceptible, i.e., they fit the conversation in terms of coherency, relevancy, and fluency, while they are effective and scalable, i.e., they can automatically trigger the system into generating toxic language
  • It also proposes a defense mechanism against such attacks that not only mitigates the attack but also tries to maintain the conversational flow
    • propose a defense mechanism against such attacks which not only mitigates the attack but also attempts to maintain the conversational flow
  • Automatic and human evaluations show the defense remains effective even when attacks are imperceptible
    • our defense is effective at avoiding toxic language generation even against imperceptible toxicity triggers while the generated language fits the conversation in terms of coherency and relevancy
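
The perplexity-filter point above can be illustrated with a generic sketch: score candidate attack text with an off-the-shelf LM and flag anything with unusually high perplexity. Hugging Face GPT-2 and the threshold are illustrative placeholders, not the paper's actual detection setup.

# Generic sketch of an LM-loss (perplexity) filter for non-human-like attack text.
# GPT-2 and the threshold below are illustrative placeholders, not the paper's setup.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def lm_perplexity(text):
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = model(ids, labels=ids).loss  # mean token-level cross-entropy
    return float(torch.exp(loss))

def looks_machine_generated(text, threshold=500.0):
    # Very high perplexity under a general-purpose LM suggests non-human-like text.
    return lm_perplexity(text) > threshold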

Introduction


SOCIAL CHEMISTRY 101 - Learning to Reason about Social and Moral Norms

Note

  • web site: https://maxwellforbes.com/social-chemistry/
  • paper: SOCIAL CHEMISTRY 101- Learning to Reason about Social and Moral Norms.pdf
  • Dataset collection pipeline
    • Collect situations (here from Reddit communities such as advice and AITA-style boards where debatable, judgment-inviting situations are posted)
    • → Write 1-5 rules of thumb (RoTs) per situation
    • → Collect multiple annotations per RoT (different people can judge the same RoT differently)
    • → RoT breakdown and action breakdown (why annotate the action separately when it seems contained in the RoT? Because the attributes differ: the RoT is judged from a moral standpoint, while the action is judged from a legal or cultural standpoint)
    • → Additional tags are also collected; workers may skip situations that are unclear, adult-only, or too uncomfortable (abuse, ...)
  • The RoT writing guidelines look reusable largely as-is
  • 137 workers did the annotation
  • split our dataset into 80/10/10% train/dev/test partitions by situation
  • Data example (columns listed first, then three rows shown as field = value)
    • Columns: area, m, split, rot-agree, rot-categorization, rot-moral-foundations, rot-char-targeting, rot-bad, rot-judgment, action, action-agency, action-moral-judgment, action-agree, action-legal, action-pressure, action-char-involved, action-hypothetical, situation, situation-short-id, rot, rot-id, rot-worker-id, breakdown-worker-id, n-characters, characters
    • Example 1: area=amitheasshole, m=1, split=train, rot-agree=4, rot-categorization=advice, rot-moral-foundations=loyalty-betrayal, rot-char-targeting=char-1, rot-bad=0, rot-judgment="it's bad", action="doing something that causes other people to lose trust in you.", action-agency=agency, action-moral-judgment=-1, action-agree=3, action-legal=legal, action-pressure=-2, action-char-involved=char-1, action-hypothetical=hypothetical, situation="losing trust in my friend", situation-short-id=reddit/amitheasshole/aypvmz, rot="It's bad to do something that causes other people to lose trust in you.", rot-id=rot/reddit/amitheasshole/aypvmz/3K5TEWLKGYQFYAIY0H6JQMIY5MEIVM/127/2, rot-worker-id=127, breakdown-worker-id=0, n-characters=2, characters=narrator|my friend
    • Example 2: area=amitheasshole, m=1, split=dev, rot-agree=3, rot-categorization=social-norms, rot-moral-foundations=loyalty-betrayal, rot-char-targeting=char-0, rot-bad=0, rot-judgment="expected", action="people participating in the big events in their friends' lives if asked.", action-agency=agency, action-moral-judgment=0, action-agree=3, action-legal=legal, action-pressure=2, action-char-involved=char-0, action-hypothetical=explicit-no, situation="saying no to being a bridesmaid at a friend's wedding", situation-short-id=reddit/amitheasshole/9tzn0z, rot="People are expected to participate in the big events in their friends' lives if asked.", rot-id=rot/reddit/amitheasshole/9tzn0z/3EG49X351XRR9FSSYVYCH4PEC656XX/89/1, rot-worker-id=89, breakdown-worker-id=39, n-characters=3, characters=narrator|a bridesmaid|a friend
    • Example 3: area=amitheasshole, m=1, split=test, rot-agree=3, rot-categorization=social-norms, rot-moral-foundations=care-harm|loyalty-betrayal, rot-char-targeting=char-1, rot-bad=0, rot-judgment="Partners should", action="Listening to each other's issues.", action-agency=agency, action-moral-judgment=2, action-agree=3, action-legal=legal, action-pressure=2, action-char-involved=char-1, action-hypothetical=probable, situation="telling my boyfriend I am bored and unhappy at my job", situation-short-id=reddit/amitheasshole/a1311q, rot="Partners should listen to each other's issues.", rot-id=rot/reddit/amitheasshole/a1311q/3JV9LGBJWWT6CZ369HK2AIBAUGUGOV/111/2, rot-worker-id=111, breakdown-worker-id=145, n-characters=2, characters=narrator|my boyfriend
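
A minimal loading sketch with pandas, assuming the released data is a single tab-separated file with the columns above (the file name here is a guess; adjust it to the actual release):

# Minimal sketch: load the Social-Chem-101 TSV and slice out one split.
# The file name and tab-separated layout are assumptions about the release.
import pandas as pd

df = pd.read_csv("social-chem-101.v1.0.tsv", sep="\t")

train = df[df["split"] == "train"]
# e.g. rules-of-thumb with strong annotator agreement
strong = train[train["rot-agree"] >= 4][["situation", "rot", "rot-categorization"]]
print(strong.head())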

Author

Maxwell Forbes†‡ Jena D. Hwang‡ Vered Shwartz†‡ Maarten Sap† Yejin Choi†‡
†Paul G. Allen School of Computer Science & Engineering, University of Washington ‡Allen Institute for AI

Abstract

  • introduce SOCIAL-CHEM- 101, a large-scale corpus that catalogs 292k rules-of-thumb such as “It is rude to run a blender at 5am” as the basic conceptual units.
  • Each rule-of-thumb is further broken down with 12 different dimensions of people’s judgments, including social judgments of good and bad, moral foundations, expected cultural pressure, and assumed legality
    • which together amount to over 4.5 million annotations of categorical labels and free-text descriptions.
  • The NEURAL NORM TRANSFORMER learns and generalizes SOCIAL-CHEM-101 to reason successfully about previously unseen situations, generating relevant (and potentially novel) attribute-aware social rules-of-thumb

A Contrastive Framework for Neural Text Generation (NeurIPS 2022)

Presentation slides

Impressions

  • Reads like a well-written paper
  • A simple idea, but effective
  • Surprisingly, decoding takes less extra time than I expected
  • Contrastive methods ultimately push the model away from producing tokens similar to what is already in the context. In SimCTG the contrastive training objective is balanced by the MLE loss, but the decoding step directly penalizes candidates that resemble the context, and it is interesting that the results are still good (the original model probability does provide some compensation); it was also surprising that perplexity stays strong even with a large degeneration penalty
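
A rough sketch of one step of the contrastive search decoding rule as I understand it: among the top-k candidates, balance model confidence against a degeneration penalty (the maximum cosine similarity between the candidate's hidden state and the hidden states of the previous context). Names and shapes are illustrative, not the authors' code.

# Rough sketch of one contrastive search step (illustrative, not the authors' code).
import torch
import torch.nn.functional as F

def contrastive_search_step(logits, context_hidden, candidate_hidden, k=5, alpha=0.6):
    # logits:           (vocab,)        next-token distribution at the current step
    # context_hidden:   (seq_len, dim)  hidden states of the already-generated prefix
    # candidate_hidden: (k, dim)        hidden states of each top-k candidate token,
    #                                   aligned with the order returned by probs.topk(k)
    probs = F.softmax(logits, dim=-1)
    topk_probs, topk_ids = probs.topk(k)
    # degeneration penalty: max cosine similarity of each candidate to the context
    sim = F.cosine_similarity(
        candidate_hidden.unsqueeze(1), context_hidden.unsqueeze(0), dim=-1
    )                                   # (k, seq_len)
    penalty = sim.max(dim=-1).values    # (k,)
    score = (1 - alpha) * topk_probs - alpha * penalty
    return topk_ids[score.argmax()]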

Note

Author

Yixuan Su* Tian Lan** Yan Wang** Dani Yogatama+ Lingpeng Kong++ Nigel Collier*
*Language Technology Lab, University of Cambridge
**Tencent AI Lab +DeepMind
++Department of Computer Science, The University of Hong Kong


Learning rate & warmup step & LR scheduling

Background

These days in deep learning you constantly run into terms like learning rate, warmup, and LR scheduler.
Plenty of techniques are already in common use, and I had been meaning to organize them; I happened to come across a good thread on the topic, so I am sharing it here.

The BERT paper includes a figure of its training hyperparameters; the underlined parts there, learning rate warmup and linear decay, are what this post covers.

Warmup

In general, LR warmup means exactly what the name says: slowly ramping the learning rate up.
If you set the LR to 2e-5 and the warmup steps to 10,000, then (with linear warmup) the LR grows from 0 to 2e-5 over the first 10,000 steps.
The code implementation looks like the following.
It computes how far the current step (global step) has progressed relative to the warmup steps (warmup_percent_done = global_steps_float / warmup_steps_float) and uses that ratio to set the warmup LR (warmup_learning_rate = init_lr * warmup_percent_done).

def create_optimizer(loss, init_lr, num_train_steps, num_warmup_steps, use_tpu):
  """Creates an optimizer training op."""
  global_step = tf.train.get_or_create_global_step()

  learning_rate = tf.constant(value=init_lr, shape=[], dtype=tf.float32)

  # Implements linear decay of the learning rate.
  learning_rate = tf.train.polynomial_decay(
      learning_rate,
      global_step,
      num_train_steps,
      end_learning_rate=0.0,
      power=1.0,
      cycle=False)

  # Implements linear warmup. I.e., if global_step < num_warmup_steps, the
  # learning rate will be `global_step/num_warmup_steps * init_lr`.
  if num_warmup_steps:
    global_steps_int = tf.cast(global_step, tf.int32)
    warmup_steps_int = tf.constant(num_warmup_steps, dtype=tf.int32)

    global_steps_float = tf.cast(global_steps_int, tf.float32)
    warmup_steps_float = tf.cast(warmup_steps_int, tf.float32)

    warmup_percent_done = global_steps_float / warmup_steps_float
    warmup_learning_rate = init_lr * warmup_percent_done

    is_warmup = tf.cast(global_steps_int < warmup_steps_int, tf.float32)
    learning_rate = (
        (1.0 - is_warmup) * learning_rate + is_warmup * warmup_learning_rate)

  # It is recommended that you use this optimizer for fine tuning, since this
  # is how the model was trained (note that the Adam m/v variables are NOT
  # loaded from init_checkpoint.)
  optimizer = AdamWeightDecayOptimizer(
      learning_rate=learning_rate,
      weight_decay_rate=0.01,
      beta_1=0.9,
      beta_2=0.999,
      epsilon=1e-6,
      exclude_from_weight_decay=["LayerNorm", "layer_norm", "bias"])

Broad and shallow knowledge for LLMs (Large-Scale Language Models)

Working on LLM-related projects recently, I realized just how much there is to pick up. I used to think "you just train it", but there are far more details than expected; I have read a lot here and there, yet little of it stuck, so I am writing it down briefly, and somewhat unavoidably in fragmented form. I hope it helps someone, myself included.

Writing Distributed Applications with PyTorch

  • Read this first!
  • Reference
  • PyTorch Multi-GPU notes (in progress)
  • node_gpu
  • When you log during distributed training, every process prints the message, so each line appears once per process -> this needs separate handling, e.g. only logging when rank == 0 (see the sketch below)
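
A minimal sketch of rank-0-only logging, assuming torch.distributed has already been initialized (e.g. via init_process_group):

# Minimal sketch: only rank 0 prints, so each message appears once instead of once per process.
import torch.distributed as dist

def log_rank0(msg):
    if (not dist.is_available()) or (not dist.is_initialized()) or dist.get_rank() == 0:
        print(msg, flush=True)

log_rank0("printed by a single process, not once per GPU")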

code snippet

all-reduce

  • EleutherAI gpt-neox
    # assumes torch.distributed has been initialized
    import torch
    import torch.distributed

    def reduce_losses(losses):
        """Reduce a tensor of losses across all GPUs."""
        reduced_losses = torch.cat([loss.clone().detach().view(1) for loss in losses])
        torch.distributed.all_reduce(reduced_losses)
        reduced_losses = reduced_losses / torch.distributed.get_world_size()
        return reduced_losses

Multi GPU & Node


Efficient Training of Language Models to Fill in the Middle (FIM)

Presentation slides

(발표)Efficient Training of Language Models to Fill in the Middle.pdf

Impressions

  • First impression: an extensive validation of what is essentially a data augmentation technique
  • The paper is built around the goal of enabling free-form (fill-in-the-middle) generation

Note

  • What does the 50% actually refer to?
    • The fraction of documents in the dataset transformed into FIM format (the FIM cut points themselves are random); see the sketch after this list
  • What does caching mean in SPM mode?
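
A rough sketch of the PSM-style FIM transformation applied to a fraction of documents; the sentinel strings are placeholders and the paper operates on tokens rather than raw characters, so this only shows the reordering idea.

# Rough sketch of the FIM transformation in PSM (prefix-suffix-middle) order.
# Sentinel strings are placeholders; the paper applies this to a fixed fraction
# of documents (e.g. 50%) and works on tokens, not characters.
import random

PRE, SUF, MID = "<|fim_prefix|>", "<|fim_suffix|>", "<|fim_middle|>"

def maybe_fim_transform(doc, fim_rate=0.5):
    if random.random() >= fim_rate or len(doc) < 2:
        return doc  # leave the document in ordinary left-to-right form
    # two random cut points split the document into prefix / middle / suffix
    i, j = sorted(random.sample(range(len(doc) + 1), 2))
    prefix, middle, suffix = doc[:i], doc[i:j], doc[j:]
    # PSM order: the model conditions on prefix and suffix, then predicts the middle
    return f"{PRE}{prefix}{SUF}{suffix}{MID}{middle}"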

Author


(ALiBi) TRAIN SHORT, TEST LONG: ATTENTION WITH LINEAR BIASES ENABLES INPUT LENGTH EXTRAPOLATION

Ref

Author

  • 저자: Ofir Press1,2 Noah A. Smith1,3 Mike Lewis2
    1Paul G. Allen School of Computer Science & Engineering, University of Washington 2Facebook AI Research 3Allen Institute for AI

Summary

  • Extrapolates well to longer inputs
  • 11% faster and uses 11% less memory
  • Keeps perplexity on long inputs even when trained on shorter sequences than a model trained at the full evaluation length
  • Simple to implement
  • Remove the position embeddings and instead penalize the attention scores linearly in the query-key distance (sketch below)
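
A small sketch of the ALiBi bias as I read it: each head gets a slope from a geometric sequence, and the score between query i and key j is penalized in proportion to the distance. This is illustrative rather than the authors' exact implementation, and the slope helper assumes the number of heads is a power of two.

# Small sketch of the ALiBi attention bias (illustrative, not the authors' code).
import torch

def alibi_slopes(n_heads):
    # geometric sequence starting at 2^(-8/n_heads); simplified to the power-of-two case
    start = 2.0 ** (-8.0 / n_heads)
    return torch.tensor([start ** (i + 1) for i in range(n_heads)])

def alibi_bias(n_heads, seq_len):
    slopes = alibi_slopes(n_heads)                   # (heads,)
    pos = torch.arange(seq_len)
    rel = (pos[None, :] - pos[:, None]).float()      # rel[i, j] = j - i (<= 0 for past keys)
    # more distant keys get a larger negative bias; future positions are handled by the causal mask
    return slopes[:, None, None] * rel[None, :, :]   # (heads, q_len, k_len), added to scores pre-softmax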

Abstract


COCO-LM: Correcting and Contrasting Text Sequences for Language Model Pretraining

Paper file

Ref

Author

  • 저자: Yu Meng1∗, Chenyan Xiong2, Payal Bajaj2, Saurabh Tiwary2, Paul Bennett2, Jiawei Han1, Xia Song2
    1 University of Illinois at Urbana-Champaign 2 Microsoft
    • NeurIPS 2021 paper

Summary


Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism

Terminology

  • reduce -> gathers the values held by every process onto one specific process by combining them with an operation (collapses them into a single value)
  • all-* -> when a collective's name starts with all, every participating process receives the same result of the operation
  • all-reduce -> the reduced value is distributed so that every participating process receives the same result (sketch below)
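
A tiny sketch of the difference, assuming a torch.distributed process group has already been initialized:

# Tiny sketch (assumes torch.distributed is initialized):
# reduce     -> the summed result lands only on the destination rank (dst=0 here)
# all_reduce -> every rank ends up holding the same summed result
import torch
import torch.distributed as dist

t = torch.tensor([float(dist.get_rank())])

reduced = t.clone()
dist.reduce(reduced, dst=0, op=dist.ReduceOp.SUM)  # only rank 0's `reduced` holds the sum
dist.all_reduce(t, op=dist.ReduceOp.SUM)           # every rank's `t` now holds the same sum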

Paper file

Megatron-LM- Training Multi-Billion Parameter Language Models Using Model Parallelism.pdf

Ref

Author
