Posted on 2022-09-20 | Updated on 2022-09-20 | paper | 5 min read (about 812 words)

Learning rate & warmup step & LR scheduling

Background

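Anyone doing deep learning these days runs into terms like learning rate, warmup, and LR scheduler all the time.
Plenty of techniques are already out there and I had been meaning to write them up at some point. I happened to come across a good thread, so I am sharing it here.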

Original thread: What is exactly the learning rate warmup described in the paper?

The BERT paper contains the following figure. This post is about the parts underlined in it: learning rate warmup and linear decay.

Warmup

In general, lr warmup means exactly what the name says: raising the lr slowly.
If the lr is set to 2e-5 and the warmup steps to 10,000, then (in the linear case) the lr increases from 0 to 2e-5 over those 10,000 steps.
The implementation first computes how far the current step (global step) is through the warmup steps as a ratio (warmup_percent_done = global_steps_float / warmup_steps_float), then scales the lr by that ratio (warmup_learning_rate = init_lr * warmup_percent_done).
In code, it looks like this:

import tensorflow as tf  # TensorFlow 1.x (the snippet uses the tf.train.* API)

def create_optimizer(loss, init_lr, num_train_steps, num_warmup_steps, use_tpu):
  """Creates an optimizer training op."""
  global_step = tf.train.get_or_create_global_step()

  learning_rate = tf.constant(value=init_lr, shape=[], dtype=tf.float32)

  # Implements linear decay of the learning rate.
  learning_rate = tf.train.polynomial_decay(
      learning_rate,
      global_step,
      num_train_steps,
      end_learning_rate=0.0,
      power=1.0,
      cycle=False)

  # Implements linear warmup. I.e., if global_step < num_warmup_steps, the
  # learning rate will be `global_step/num_warmup_steps * init_lr`.
  if num_warmup_steps:
    global_steps_int = tf.cast(global_step, tf.int32)
    warmup_steps_int = tf.constant(num_warmup_steps, dtype=tf.int32)

    global_steps_float = tf.cast(global_steps_int, tf.float32)
    warmup_steps_float = tf.cast(warmup_steps_int, tf.float32)

    warmup_percent_done = global_steps_float / warmup_steps_float
    warmup_learning_rate = init_lr * warmup_percent_done

    is_warmup = tf.cast(global_steps_int < warmup_steps_int, tf.float32)
    learning_rate = (
        (1.0 - is_warmup) * learning_rate + is_warmup * warmup_learning_rate)

  # It is recommended that you use this optimizer for fine tuning, since this
  # is how the model was trained (note that the Adam m/v variables are NOT
  # loaded from init_checkpoint.)
  optimizer = AdamWeightDecayOptimizer(
      learning_rate=learning_rate,
      weight_decay_rate=0.01,
      beta_1=0.9,
      beta_2=0.999,
      epsilon=1e-6,
      exclude_from_weight_decay=["LayerNorm", "layer_norm", "bias"])
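
To make the shape of this schedule easier to see, here is a minimal framework-free sketch of the same logic: linear decay computed from step 0, overridden by a linear ramp while the step is below the warmup step count. The function name `bert_style_lr` and the total of 100,000 training steps are assumptions made for illustration; only the lr of 2e-5 and the 10,000 warmup steps come from the example above.

```python
def bert_style_lr(step, init_lr, num_train_steps, num_warmup_steps):
    """Learning rate at `step`: linear warmup, then linear decay to zero."""
    # Linear decay from init_lr to 0 over num_train_steps
    # (polynomial decay with power=1 and end_learning_rate=0, no cycling).
    clipped_step = min(step, num_train_steps)
    decayed_lr = init_lr * (1.0 - clipped_step / num_train_steps)

    # While still in warmup, use a linear ramp from 0 instead of the decayed value.
    if num_warmup_steps and step < num_warmup_steps:
        return init_lr * step / num_warmup_steps
    return decayed_lr


if __name__ == "__main__":
    # num_train_steps=100_000 is an assumed value, chosen only for this demo.
    for step in (0, 5_000, 9_999, 10_000, 50_000, 100_000):
        lr = bert_style_lr(step, init_lr=2e-5,
                           num_train_steps=100_000, num_warmup_steps=10_000)
        print(f"step {step:>7}: lr = {lr:.3e}")
```

Under these assumed settings the ramp climbs to just under 2e-5 by step 9,999, and at step 10,000 the rate switches to the decay line, which is already down to about 1.8e-5 because the decay is counted from step 0 rather than from the end of warmup.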

Why is warmup required for BERT (MLM, from scratch)?

Adjusting the LR over the course of training is called LR scheduling, and it is known to help improve the performance of DL models.
- Related paper: A disciplined approach to neural network hyper-parameters: Part 1 – learning rate, batch size, momentum, and weight decay
Another advantage of a high learning rate near the beginning (after warmup, which is another issue) is that it has a regularisation effect as it ends up in a relatively “flat” part of parameter space (ie: the hessian of the loss is relatively small). The idea of “super-convergence” tries to utilise this (paper, blog).
- The blog post is quite informative and worth a read. It covers schedulers such as linear and cyclical ones.
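
As a rough illustration of what a cyclical scheduler does, the sketch below implements the triangular policy from Leslie Smith's cyclical learning rate work: the rate rises linearly from a lower bound to an upper bound and then falls back, over and over. This is not taken from the linked blog post. The function name `triangular_clr`, its parameters, and the bounds used in the demo are all hypothetical.

```python
import math


def triangular_clr(step, base_lr, max_lr, half_cycle_steps):
    """Triangular cyclical learning rate at `step`.

    The rate goes base_lr -> max_lr over `half_cycle_steps` steps,
    then back down to base_lr over the next `half_cycle_steps` steps.
    """
    # Index of the current full cycle, starting at 1.
    cycle = math.floor(1 + step / (2 * half_cycle_steps))
    # Distance from the peak of the current cycle, normalised to [0, 1].
    x = abs(step / half_cycle_steps - 2 * cycle + 1)
    return base_lr + (max_lr - base_lr) * max(0.0, 1.0 - x)


if __name__ == "__main__":
    # Assumed bounds and cycle length, for demonstration only.
    for step in (0, 50, 100, 150, 200, 300):
        lr = triangular_clr(step, base_lr=1e-5, max_lr=1e-4, half_cycle_steps=100)
        print(f"step {step:>4}: lr = {lr:.3e}")
```

With these demo values the rate peaks at steps 100 and 300 and returns to the lower bound at steps 0 and 200.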

Weight decay & AdamW

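Reference: https://hiddenbeginner.github.io/deeplearning/paperreview/2019/12/29/paper_review_AdamW.html
Weight decay is a trick that shrinks the previous weights by a small fraction at every weight update, which keeps them from growing too large and acts as a regularizer. It is set between 0 and 1, and for PLMs (pretrained language models) it is usually set to around 0.01.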
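
The idea can be sketched for a single scalar weight. The code below is a minimal illustration of an Adam step with decoupled weight decay in the style of AdamW, where the decay term is added to the update directly instead of being folded into the gradient. It is not the `AdamWeightDecayOptimizer` from the snippet earlier in this post: the function name `adamw_step` is hypothetical, and the bias correction terms are an assumption of this sketch.

```python
import math


def adamw_step(w, grad, m, v, t, lr,
               beta1=0.9, beta2=0.999, eps=1e-6, weight_decay=0.01):
    """One Adam update with decoupled weight decay for a scalar weight.

    `t` is the 1-based step count; `m` and `v` are the running first and
    second moment estimates. Returns the updated (w, m, v).
    """
    # Standard Adam moment estimates, computed from the raw gradient only.
    m = beta1 * m + (1.0 - beta1) * grad
    v = beta2 * v + (1.0 - beta2) * grad * grad
    m_hat = m / (1.0 - beta1 ** t)  # bias correction (assumption of this sketch)
    v_hat = v / (1.0 - beta2 ** t)

    # Decoupled weight decay: shrink the previous weight by a fixed fraction,
    # separately from the adaptive gradient term.
    update = m_hat / (math.sqrt(v_hat) + eps) + weight_decay * w
    return w - lr * update, m, v


if __name__ == "__main__":
    w, m, v = 1.0, 0.0, 0.0
    for t in range(1, 4):
        # Zero gradient: only the decay term moves the weight.
        w, m, v = adamw_step(w, grad=0.0, m=m, v=v, t=t, lr=0.1)
        print(f"step {t}: w = {w:.6f}")
```

With a zero gradient, each step multiplies the weight by (1 - lr * weight_decay), which is the "shrink the previous weight by a fixed fraction" behaviour described above.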

https://eagle705.github.io/Learning-rate-warmup-scheduling/

Author

Joosung Yoon

Posted on

2022-09-20

Updated on

2022-09-20

#nlp
