Robust Conversational Agents against Imperceptible Toxicity Triggers

Author

  • Ninareh Mehrabi¹, Ahmad Beirami²*, Fred Morstatter¹, Aram Galstyan¹
  • ¹University of Southern California - Information Sciences Institute, ²Meta AI

Abstract

  • Recent NLP research has seen improvements in various toxicity detection models
    • toxicity detection models with the intention of identifying and mitigating toxic language from existing systems.
  • Although there is much existing work, adversarial attacks that force toxic generation and defenses against them have been under-explored
    • adversarial attacks that force the system to generate toxic language and the defense against them
  • Most existing attacks are human-generated, which is costly and not scalable
  • Automatically generated attacks, in contrast, produce attack vectors that do not conform to human-like language and can therefore be detected using a language model loss
    • Existing work to generate such attacks is either based on human-generated attacks which is costly and not scalable or, in case of automatic attacks, the attack vector does not conform to human-like language, which can be detected using a language model loss
  • This work proposes attacks on conversational agents that are imperceptible, i.e., unlike the automatic attacks above, they fit the conversation in terms of coherency, relevancy, and fluency
    • propose attacks against conversational agents that are imperceptible, i.e., they fit the conversation in terms of coherency, relevancy, and fluency, while they are effective and scalable, i.e., they can automatically trigger the system into generating toxic language
  • It also proposes a defense mechanism against such attacks that not only mitigates the attack but also tries to maintain the conversational flow
    • propose a defense mechanism against such attacks which not only mitigates the attack but also attempts to maintain the conversational flow
  • Automatic and human evaluations show the proposed defense is effective even against such well-crafted attacks
    • our defense is effective at avoiding toxic language generation even against imperceptible toxicity triggers while the generated language fits the conversation in terms of coherency and relevancy

Introduction

  • Considering adversarial attacks on dialogue systems is important for keeping conversations safe and robust
    • consider adversarial attacks on human-centric chatbots and dialogue systems. It is important for these systems to be safe and robust in the face of natural(-looking) human conversations
  • Example conversation between adversary and defender (shown as a figure in the paper)
  • attacks
    • Our proposed approach works by augmenting the universal adversarial triggers (UAT) from Wallace et al. (2019) with additional selection criteria to generate imperceptible yet effective triggers
  • defense
    • then focus on a defense mechanism for the non-adversarial (defender) model to avoid generating toxic utterances
    • A simple approach (Xu et al., 2020) can detect adversarial triggers, but it may break the conversation flow; the focus here is therefore on a defense mechanism that “detoxifies” the response without breaking the flow
    • Our proposed method relies on two levels of interpretable reasoning that helps the model to
      • (1) identify the key adversarial tokens responsible for the attack and
      • (2) avoid generating toxic responses by masking those tokens during the generation process.

Attack Approaches

  • first discuss the universal adversarial trigger(UAT) attack proposed by Wallace et al. (2019), which we use as our baseline
  • then propose alterations to this baseline to make the universal triggers more natural-looking and suitable for conversational domain

Methodology

  • Universal Adversarial Trigger (UAT) (Wallace et al., 2019)

    • The goal in universal adversarial trigger attack is to find a universal trigger sequence for a given trained model, which if attached to the start of any given input can cause the model to output the desired outcome (Wallace et al., 2019)
    • i.e., a trigger sequence that, when prepended to any given input, can steer the model’s output toward the adversary’s desired outcome
    • This attack starts with a fixed-length sequence as the initial trigger, e.g., “the the the the the the” and tries to iteratively replace the tokens in the sequence to satisfy an objective.
    • The iterations terminate when no improvement (replacement) can be made to further optimize the objective
    • Since the trigger is optimized only to elicit toxic outputs, it need not satisfy any language-modeling objective; the result is often a repetitive, very high-perplexity sequence that is easily detectable
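The core of the UAT search is a gradient-guided (HotFlip-style) token replacement. Below is a minimal sketch of that loop, assuming GPT-2 as a stand-in for the attacked model and a single placeholder target phrase (Wallace et al. optimize over a set of toxic target outputs, and the paper attacks a dialogue model); model choice, iteration count, and the target string are illustrative assumptions, not the paper's exact setup.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()
embed = model.get_input_embeddings()

# Placeholder target: in the actual attack this would be a set of toxic outputs
# the adversary wants the model to produce.
target_ids = tokenizer("some target phrase", return_tensors="pt").input_ids
trigger_ids = tokenizer("the the the the the the", return_tensors="pt").input_ids

def target_loss(trig):
    # Cross-entropy of the target continuation given the trigger prefix.
    input_ids = torch.cat([trig, target_ids], dim=1)
    labels = input_ids.clone()
    labels[:, : trig.size(1)] = -100  # score only the target tokens
    return model(input_ids, labels=labels).loss

for _ in range(20):  # iterate until no replacement improves the objective
    model.zero_grad(set_to_none=True)
    trigger_embeds = embed(trigger_ids).detach().requires_grad_(True)
    inputs_embeds = torch.cat([trigger_embeds, embed(target_ids).detach()], dim=1)
    labels = torch.cat([torch.full_like(trigger_ids, -100), target_ids], dim=1)
    model(inputs_embeds=inputs_embeds, labels=labels).loss.backward()

    # HotFlip first-order approximation: estimated loss change when swapping the
    # token at each trigger position with every vocabulary token.
    grad = trigger_embeds.grad[0]                # (trigger_len, dim)
    scores = grad @ embed.weight.detach().T      # (trigger_len, vocab)
    current = scores.gather(1, trigger_ids.T)    # score of the current tokens
    delta = scores - current                     # lower = larger estimated decrease
    pos = int(delta.min(dim=1).values.argmin())
    cand = int(delta[pos].argmin())

    new_trigger = trigger_ids.clone()
    new_trigger[0, pos] = cand
    with torch.no_grad():
        if target_loss(new_trigger) < target_loss(trigger_ids):
            trigger_ids = new_trigger            # keep the improving flip
        else:
            break                                # terminate: no improvement found

print(tokenizer.decode(trigger_ids[0]))
```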
  • Universal Adversarial Trigger with Language Model Loss (UAT-LM)

    • An intuitive solution to address the above shortcoming of UAT is to impose a language modeling objective on the trigger tokens.
    • Even with this added LM loss, the generated triggers may read naturally on their own, but there is no guarantee that the resulting conversation flow is coherent and relevant
    • This motivates the modified approach proposed below
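A minimal sketch of the UAT-LM idea, reusing `model` and `target_loss` from the UAT sketch above: candidate triggers are additionally scored with a language-model loss so that high-perplexity triggers are penalized during the flip search. The weight `lambda_lm` and the use of GPT-2 for the LM term are assumptions, not necessarily the paper's exact formulation.

```python
def uat_lm_loss(trig, lambda_lm=1.0):
    # Adversarial objective (from the UAT sketch) plus an LM penalty on the
    # trigger itself, so the search prefers natural-looking trigger sequences.
    adv = target_loss(trig)
    lm = model(trig, labels=trig).loss  # GPT-2 negative log-likelihood of the trigger
    return adv + lambda_lm * lm
```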
  • Unigram Trigger with Selection Criteria (UTSC)

    • propose an alternative approach in which we generate a collection of unigram triggers (with sequence length one) from UAT
    • then feed these triggers along with the history of the conversation h to our dialogue model and generate different attack utterances
      • i.e., each unigram trigger is appended to the conversation so that the dialogue model produces a different candidate attack utterance
    • Next, we pick the best suited attack utterance amongst all the generated attack utterances according to our selection criterion as demonstrated in Figure 2
      • Each unigram trigger is appended to the conversation history, DialoGPT generates a candidate attack utterance for it, the candidates are scored with a toxicity classifier (a single one or an ensemble), and the final utterance is chosen according to criterion UTSC-N (a minimal sketch follows this list):
        • UTSC-1: pick the candidate with the highest toxicity score
        • UTSC-2: among candidates whose toxicity exceeds a threshold, pick the one with the lowest score (fall back to the highest-scoring candidate if none pass the threshold)
        • UTSC-3: pick the candidate with the lowest toxicity score
      • Because only a single unigram is inserted, fluency is not noticeably sacrificed
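A minimal sketch of the UTSC selection step described above. `generate_reply` (the adversary's dialogue-model wrapper) and `toxicity` (a score in [0, 1] from the chosen classifier or an ensemble) are hypothetical helpers, and the threshold value is illustrative.

```python
def utsc_attack(history, unigram_triggers, criterion="UTSC-1", threshold=0.5):
    # Generate one candidate attack utterance per unigram trigger and score it.
    candidates = []
    for trigger in unigram_triggers:
        utterance = generate_reply(history + [trigger])  # adversary's dialogue model
        candidates.append((utterance, toxicity(utterance)))

    if criterion == "UTSC-1":
        # Highest predicted toxicity.
        return max(candidates, key=lambda c: c[1])[0]
    if criterion == "UTSC-2":
        # Lowest toxicity among candidates above the threshold; fall back to the
        # most toxic candidate if none pass the threshold.
        above = [c for c in candidates if c[1] >= threshold]
        pool = above if above else [max(candidates, key=lambda c: c[1])]
        return min(pool, key=lambda c: c[1])[0]
    if criterion == "UTSC-3":
        # Lowest predicted toxicity.
        return min(candidates, key=lambda c: c[1])[0]
    raise ValueError(f"unknown criterion: {criterion}")
```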

Experimental Setup

  • General Setup
    • use DialoGPT
    • to generate 100 conversations around a specific topic
    • The topic is determined by the context sentence that starts the conversation between the adversary and the defender.
    • Each conversation runs for 10 turns
  • Toxicity Detection Models
    • utilize an ensemble of three different toxicity detection models:
      • Toxic-bert, Perspective API, and Safety classifier (Xu et al., 2020)
      • Toxic-bert is the least sensitive of the three, followed by Perspective API, and the Safety classifier
    • allow the adversary to use only one of the toxicity detection models to design its attack. We then quantify toxicity using the other two toxicity detection methods, not accessed by the adversary (a minimal scoring sketch follows this setup list).
  • Data
    • context sentences from two different datasets, Wizard of Wikipedia (Dinan et al., 2018) and ConvoKit’s Reddit Corpus
    • Wikipedia: neutral topics
    • Reddit: sensitive topics
    • picked 50 random context sentences from the Wizard of Wikipedia dataset and 50 from the Reddit corpus
  • AMT Experiments
    • To compare and verify the quality of conversations generated during and after the attacks, we conduct human experiments
    • AMT workers annotated 100 conversations from each of the three attacks, and each conversation was annotated by 3 AMT workers, giving us 900 annotations overall, 300 from each attack
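As a concrete reference for the toxicity detection models above, here is a hedged sketch of scoring an utterance with Toxic-bert via the Hugging Face pipeline (using the public `unitary/toxic-bert` checkpoint from the Detoxify project); Perspective API and the Safety classifier require their own clients and are only stubbed as placeholder functions, and the equal-weight ensemble is an assumption for illustration.

```python
from transformers import pipeline

# Toxic-bert: a multi-label toxicity classifier; top_k=None returns all label scores.
toxic_bert = pipeline("text-classification", model="unitary/toxic-bert", top_k=None)

def toxic_bert_score(text: str) -> float:
    scores = {d["label"]: d["score"] for d in toxic_bert([text])[0]}
    return scores["toxic"]  # probability of the checkpoint's generic "toxic" label

def ensemble_score(text: str) -> float:
    # Hypothetical ensemble: average with the other two detectors;
    # perspective_score / safety_score stand in for their respective APIs.
    return (toxic_bert_score(text) + perspective_score(text) + safety_score(text)) / 3
```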

Results

  • Attack Effectiveness
    • two of our proposed attacks, UAT-LM and UTSC-1, perform best according to the Perspective API and Toxic-bert classifiers
    • the UAT baseline performs best according to the Safety classifier
    • Overall results show that UTSC-1 and UAT-LM attacks are competitive attacks in terms of attack effectiveness.
    • The UAT baseline tends to generate meaningless phrases, e.g., “acist neighborhoodsJohnson carry morals Ukrain”, which can easily be detected as an anomaly and break the natural flow of the conversation
    • Perplexity difference measured with GPT-2 (a minimal sketch of this measurement is given at the end of this Results section)
      • UAT is absurdly high (∼10^7) compared to ∼10^4 for UAT-LM, and ∼160 for UTSC-1
      • the no-attack case is ∼39
  • Attack Transferability
    • Toxicity is quantified with the classifiers the adversary did not access, so transferability indicates the attack forces the defender to generate actually toxic language rather than merely fooling a single toxicity classifier
  • Human Evaluation
    • Our UTSC-1 attack is rated to have the highest coherency
    • UTSC-1 is rated as generating more fluent attacks, with mostly moderate to good scores and a higher average (the black dotted lines in the paper's figure) compared to the UAT and UAT-LM baselines
    • Fleiss Kappa (Fleiss, 1971) annotator agreement results from this evaluation are reported in Table 1. Annotators have reasonable overall agreement for all the qualities
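For reference, a minimal sketch of the kind of GPT-2 perplexity measurement cited under Attack Effectiveness; which exact text gets scored (the trigger alone vs. the full attack utterance) is an assumption here, not the paper's stated protocol.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
lm = GPT2LMHeadModel.from_pretrained("gpt2")
lm.eval()

def gpt2_perplexity(text: str) -> float:
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = lm(ids, labels=ids).loss  # mean token-level cross-entropy
    return torch.exp(loss).item()

# Repetitive, unnatural UAT-style triggers score far higher than fluent text.
print(gpt2_perplexity("acist neighborhoodsJohnson carry morals Ukrain"))
print(gpt2_perplexity("Do you want to talk about something else?"))
```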

Defense Approaches

  • two components
    • (a) detecting the attack and
    • (b) mitigating its effect by ensuring that the defender does not generate a toxic response
  • detection
    • The detection problem is rather straightforward, as the defense can simply run a toxicity classifier on the generated response
  • mitigation
    • Xu et al. (2020) suggested a mitigating approach which, when a toxic response is detected, simply resets the dialogue and generates a (non-toxic) utterance by randomly sampling from a predefined set of topics
      • i.e., prior work stops the current thread and restarts the conversation from a randomly sampled predefined topic
    • In contrast, this work aims to avoid generating toxic utterances while maintaining the conversation flow

Methodology

  • defense mechanism in the second stage utilizes two layers of reasoning using two different interpretability techniques
    • The first layer aims to detect which tokens in the defender’s utterance are making the toxicity detection model label the utterance as toxic.
      • i.e., find the problematic tokens in the defender’s utterance; we call these the L1 tokens
    • The second layer aims to detect which tokens in the adversary’s attack utterance are responsible for the generation of the L1 tokens in the defender’s utterance
      • i.e., find the tokens in the adversary’s attack utterance that caused the L1 tokens; we call these the L2 tokens
    • The defender then masks the L2 tokens from the adversary, which were responsible for triggering the defender model to generate toxic tokens, and generates a new utterance
      • i.e., mask the L2 tokens that led to the L1 tokens and regenerate the utterance
    • then apply a toxicity classifier on this new utterance
      • check the toxicity of the newly generated utterance
        • if it is non-toxic, accept it; otherwise mask more tokens and repeat (a sketch of this loop is given at the end of this section)
  • For the first layer, we use transformers-interpret, which provides explanations and identifies the L1 tokens according to the Toxic-bert model. Below is a basic usage example of the library (shown with a sentiment model rather than Toxic-bert):
```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "distilbert-base-uncased-finetuned-sst-2-english"
model = AutoModelForSequenceClassification.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# With both the model and tokenizer initialized we are now able to get explanations on an example text.
from transformers_interpret import SequenceClassificationExplainer

cls_explainer = SequenceClassificationExplainer(model, tokenizer)
word_attributions = cls_explainer("I love you, I like you")
```

```
>>> word_attributions
[('[CLS]', 0.0),
 ('i', 0.2778544699186709),
 ('love', 0.7792370723380415),
 ('you', 0.38560088858031094),
 (',', -0.01769750505546915),
 ('i', 0.12071898121557832),
 ('like', 0.19091105304734457),
 ('you', 0.33994871536713467),
 ('[SEP]', 0.0)]
```
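In the defense itself, the explainer would be run with the Toxic-bert model rather than the sentiment model above; a hedged sketch follows, where the `unitary/toxic-bert` checkpoint and the "take the highest-attribution tokens" rule are assumptions about the details.

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from transformers_interpret import SequenceClassificationExplainer

toxic_name = "unitary/toxic-bert"
toxic_explainer = SequenceClassificationExplainer(
    AutoModelForSequenceClassification.from_pretrained(toxic_name),
    AutoTokenizer.from_pretrained(toxic_name),
)

def l1_tokens(defender_utterance: str, top_k: int = 1):
    # Attribution of each token towards the predicted (toxicity) label.
    attributions = toxic_explainer(defender_utterance)
    candidates = [(tok, score) for tok, score in attributions if tok not in ("[CLS]", "[SEP]")]
    candidates.sort(key=lambda pair: pair[1], reverse=True)
    return [tok for tok, _ in candidates[:top_k]]
```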
  • For the second layer, we use LERG (Tuan et al., 2021) that provides local explanations for dialogue response generation and identifies the L2 token
    • LERG (Local Explanation of Response Generation) is a unified approach to explain why a conditional text generation model will predict a text
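Putting the two layers together, a minimal sketch of the defense loop described above. `l1_tokens` is the helper sketched earlier, `l2_tokens` stands in for the LERG step, and `generate_reply`, `toxicity`, and `mask` are placeholders for the defender's dialogue model, the toxicity classifier, and token masking; the threshold and mask budget are illustrative.

```python
def defend(history, adversary_utterance, max_masks=3, tox_threshold=0.5):
    masked = adversary_utterance
    reply = generate_reply(history + [masked])
    for n_masks in range(1, max_masks + 1):
        if toxicity(reply) < tox_threshold:
            return reply  # non-toxic: keep the reply and the conversational flow
        # Layer 1: which tokens in the defender's reply make it look toxic?
        l1 = l1_tokens(reply, top_k=1)
        # Layer 2: which adversary tokens were responsible for those L1 tokens (LERG)?
        l2 = l2_tokens(masked, reply, l1, top_k=n_masks)
        # Mask the responsible adversary tokens and regenerate the defender's reply.
        masked = mask(masked, l2)
        reply = generate_reply(history + [masked])
    return reply  # mask budget exhausted; in the paper up to 3 masks sufficed
```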

Experimental Setup

Baselines

  • Two-stage Non Sequitur Baseline
    • If toxicity is detected, change the topic of the conversation (a minimal sketch follows this list)
      • uses a toxicity classifier to detect if the utterance is toxic or not. It then changes the topic of the conversation if the utterance was detected to be toxic, e.g., “Hey do you want to talk about something else? How about we talk about X?” where X is a randomly chosen topic from 1087 topics judged as safe from the Wizard of Wikipedia conversational topic list
  • Trigger Masking (TM) Baseline
    • consider masking the adversarial trigger tokens. Note that the defender does not generally know which tokens were the trigger-tokens used by the adversary, so this approach is not applicable in realistic settings.
    • in practice the defender does not know which tokens the adversary used as triggers; this is included only for insight, i.e., as an oracle baseline
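For reference, a minimal sketch of the Two-stage Non Sequitur baseline described above; `toxicity` and `safe_topics` (the 1087 Wizard of Wikipedia topics judged as safe) are placeholders here.

```python
import random

def non_sequitur_defense(reply, safe_topics, tox_threshold=0.5):
    if toxicity(reply) < tox_threshold:
        return reply  # keep the original response if it is not toxic
    topic = random.choice(safe_topics)
    return f"Hey do you want to talk about something else? How about we talk about {topic}?"
```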

AMT Experiments

  • evaluate the defense quality according to relevancy and fluency, the coherency of the overall conversation, and the toxicity of the defense utterance
  • 27 conversations were rated for each of the three defenses (TM, Two-stage Non Sequitur, and our proposed defense). 3 AMT workers rated each conversation, giving 243 annotations in total, 81 from each defense

Results

  • Defense Effectiveness
    • our proposed defense mechanism as well as the Non Sequitur baseline achieve 100% defense effectiveness according to the Toxic-bert classifier
    • with our proposed method, for all attacks except UAT-LM, 100% defense effectiveness was reached by masking only one token
      • For UAT-LM, almost 90% of cases were resolved by masking one token and the rest were resolved by the iterative approach that masked multiple tokens (up to 3)
  • Defense Transferability
    (results reported as a figure in the paper)
  • Human Evaluation
    (human evaluation results reported as figures in the paper)

Beyond Conversational Agents

  • show the generalizability of our defense method against non-conversational generation tasks, by conducting experiments with RealToxicityPrompts dataset
  • (results reported as a figure in the paper)

Conclusion

  • studied the possibility of generating imperceptible attacks against conversational agents that, while fluent and coherent, trigger the model into generating toxic responses
  • proposed a defense mechanism that was shown to be effective through various automatic and human evaluations as well as its transferability to human attacks, general generation tasks, and different toxicity classifiers
  • Future work can focus on improving our proposed attacks both in terms of imperceptibility and effectiveness as well as more advanced defense mechanisms.
