Robust Conversational Agents against Imperceptible Toxicity Triggers
Note
- Github: https://github.com/Ninarehm/Robust-Agents
- Slides: Robust Conversational Agents against Imperceptible Toxicity Triggers.pdf
Author
- Ninareh Mehrabi¹, Ahmad Beirami²*, Fred Morstatter¹, Aram Galstyan¹
- ¹University of Southern California - Information Sciences Institute, ²Meta AI
Abstract
- Recent NLP research has made improvements to a variety of toxicity detection models
- toxicity detection models with the intention of identifying and mitigating toxic language from existing systems.
- Despite this prior work, research on adversarial attacks that force toxic generation, and on defenses against them, has been lacking
- adversarial attacks that force the system to generate toxic language and the defense against them
- Most prior work generated attack sentences with humans, which is costly and not scalable
- Automatically generated attacks, on the other hand, use attack vectors that do not conform to human-like language, so they can be detected with a language model loss
- Existing work to generate such attacks is either based on human-generated attacks which is costly and not scalable or, in case of automatic attacks, the attack vector does not conform to human-like language, which can be detected using a language model loss
- This work proposes attacks on conversational agents that are imperceptible (unlike the automatic attacks above) in terms of coherency, relevancy, and fluency
- propose attacks against conversational agents that are imperceptible, i.e., they fit the conversation in terms of coherency, relevancy, and fluency, while they are effective and scalable, i.e., they can automatically trigger the system into generating toxic language
- The paper also proposes a defense mechanism against these attacks that not only mitigates the attack but also maintains the conversational flow
- propose a defense mechanism against such attacks which not only mitigates the attack but also attempts to maintain the conversational flow
- Automatic and human evaluations show that the defense blocks even strong incoming attacks effectively
- our defense is effective at avoiding toxic language generation even against imperceptible toxicity triggers while the generated language fits the conversation in terms of coherency and relevancy
Introduction
- Considering adversarial attacks on dialogue systems is important for safe and robust conversations
- consider adversarial attacks on human-centric chatbots and dialogue systems. It is important for these systems to be safe and robust in the face of natural(-looking) human conversations
- Example conversation
- attacks
- Our proposed approach works by augmenting the universal adversarial triggers (UAT) from Wallace et al. (2019) with additional selection criteria to generate imperceptible yet effective triggers
- defense
- then focus on a defense mechanism for the non-adversarial (defender) model to avoid generating toxic utterances
- A simple method (Xu et al., 2020) can already detect adversarial triggers, but it can break the conversational flow, so this work focuses on a defense mechanism that uses responses that "detoxify" the conversation without breaking its flow
- Our proposed method relies on two levels of interpretable reasoning that help the model to
- (1) identify the key adversarial tokens responsible for the attack and
- (2) avoid generating toxic responses by masking those tokens during the generation process.
Attack Approaches
- first discuss the universal adversarial trigger (UAT) attack proposed by Wallace et al. (2019), which we use as our baseline
- then propose alterations to this baseline to make the universal triggers more natural-looking and suitable for the conversational domain
Methodology
Universal Adversarial Trigger (UAT) (Wallace et al., 2019)
- The goal of the universal adversarial trigger attack is to find a universal trigger sequence for a given trained model which, if attached to the start of any given input, can cause the model to output the desired outcome (Wallace et al., 2019)
- In other words, a trigger sequence is a sequence that, once prepended to any given input, can steer the model's output toward the attacker's desired outcome
- This attack starts with a fixed-length sequence as the initial trigger, e.g., "the the the the the the", and tries to iteratively replace the tokens in the sequence to satisfy an objective.
- The iterations terminate when no improvement (replacement) can be made to further optimize the objective.
- Because the trigger is optimized only to elicit toxic tokens, it does not have to satisfy a language-model loss, so the result is often a repetitive, very high-perplexity sequence that is easily detectable; a sketch of the replacement step follows below
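To make the iterative replacement concrete, here is a minimal sketch of the HotFlip-style candidate ranking that UAT uses to swap trigger tokens. The gradient argument and helper name are illustrative assumptions, not the authors' code.

```python
# Minimal sketch of the HotFlip-style replacement step behind UAT (Wallace et al., 2019).
# `adversarial_loss_grad` is an assumed (embed_dim,) gradient of the attack objective
# with respect to one trigger token's embedding; this is not the paper's implementation.
import torch

def rank_replacement_candidates(adversarial_loss_grad: torch.Tensor,
                                embedding_matrix: torch.Tensor,
                                num_candidates: int = 10) -> torch.Tensor:
    """Rank vocabulary tokens by a first-order estimate of how much swapping them
    in for the current trigger token would decrease the attack loss."""
    # First-order Taylor approximation: loss change ~ (e_new - e_old)^T grad,
    # so the best candidates minimize e_new^T grad (the e_old term is constant).
    scores = embedding_matrix @ adversarial_loss_grad      # (vocab_size,)
    return torch.topk(-scores, num_candidates).indices     # most loss-decreasing tokens

# The attack starts from "the the the the the the" and, at each iteration, tries these
# candidates at every trigger position, keeping the swap that best optimizes the
# objective; it stops when no replacement improves the objective further.
```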
Universal Adversarial Trigger with Language Model Loss (UAT-LM)
- An intuitive solution to address the above shortcoming of UAT is to impose a language modeling objective on the trigger tokens (a sketch of the combined objective follows below).
- Even with this LM loss added, the generated trigger tokens may read naturally on their own, but there is no guarantee that the resulting conversation is coherent and relevant
- The paper therefore also proposes an alternative, modified approach, described next
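A minimal sketch of what adding an LM term to the trigger objective could look like, using GPT-2 to score the trigger's naturalness. The weight `lambda_lm` and the exact combination are assumptions rather than the paper's settings.

```python
# Sketch: combine the adversarial objective with a GPT-2 language-model loss on the
# trigger so that optimized triggers also look like natural text (the UAT-LM idea).
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

lm_tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
lm_model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def trigger_lm_loss(trigger_text: str) -> torch.Tensor:
    """Mean negative log-likelihood of the trigger under GPT-2."""
    ids = lm_tokenizer(trigger_text, return_tensors="pt").input_ids
    with torch.no_grad():                      # discrete candidate search: no gradient needed here
        return lm_model(ids, labels=ids).loss

def uat_lm_objective(adversarial_loss: torch.Tensor, trigger_text: str,
                     lambda_lm: float = 1.0) -> torch.Tensor:
    # Lower is better for both terms: the attack loss drives toxic generations,
    # the LM term penalizes unnatural-looking trigger sequences.
    return adversarial_loss + lambda_lm * trigger_lm_loss(trigger_text)
```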
Unigram Trigger with Selection Criteria (UTSC)
- propose an alternative approach in which we generate a collection of unigram triggers (with sequence length one) from UAT
- Collect unigram triggers from UAT
- then feed these triggers along with the history of the conversation h to our dialogue model and generate different attack utterances
- Append the unigram triggers to generate different attack utterances
- Next, we pick the best suited attack utterance amongst all the generated attack utterances according to our selection criterion as demonstrated in Figure 2
- Among the generated utterances, select the one that best fits the selection criterion
- Concretely: append a unigram trigger to the conversation history, generate candidate utterances with DialoGPT, score them with toxicity classifiers (a single one or an ensemble), and pick a sentence according to each criterion (UTSC-N); a sketch of the selection rules follows below
- UTSC-1: pick the utterance with the highest toxicity score
- UTSC-2: pick the lowest-toxicity utterance among those above a threshold (if none exceeds the threshold, fall back to the highest-scoring one)
- UTSC-3: pick the utterance with the lowest toxicity score
- Because the trigger is a single token, fluency is not noticeably sacrificed
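The three selection criteria can be summarized in a few lines. This is a sketch over candidate utterances and their toxicity scores in [0, 1]; the 0.5 threshold is an illustrative assumption.

```python
# Sketch of the UTSC selection rules over candidate attack utterances generated with
# different unigram triggers; `scores` are toxicity scores from one or more classifiers.
from typing import List

def utsc_select(candidates: List[str], scores: List[float],
                criterion: str = "UTSC-1", threshold: float = 0.5) -> str:
    paired = list(zip(candidates, scores))
    if criterion == "UTSC-1":            # most toxic candidate
        return max(paired, key=lambda p: p[1])[0]
    if criterion == "UTSC-2":            # least toxic candidate that still clears the threshold,
        above = [p for p in paired if p[1] >= threshold]
        if above:                        # otherwise fall back to the most toxic candidate
            return min(above, key=lambda p: p[1])[0]
        return max(paired, key=lambda p: p[1])[0]
    if criterion == "UTSC-3":            # least toxic candidate
        return min(paired, key=lambda p: p[1])[0]
    raise ValueError(f"unknown criterion: {criterion}")
```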
Experimental Setup
- General Setup
- use DialoGPT to generate 100 conversations around a specific topic (a minimal self-chat sketch follows below)
- The topic is determined by the context sentence that starts the conversation between the adversary and the defender.
- Each conversation runs for 10 turns
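A minimal sketch of rolling out a conversation with DialoGPT from a context sentence. The checkpoint size and decoding settings are assumptions, not the paper's exact configuration.

```python
# Sketch: seed a DialoGPT self-chat with a context sentence and run it for 10 turns.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/DialoGPT-medium")
model = AutoModelForCausalLM.from_pretrained("microsoft/DialoGPT-medium")

history = "I love exploring national parks." + tokenizer.eos_token  # context sentence sets the topic
for turn in range(10):                                              # 10 turns per conversation
    input_ids = tokenizer(history, return_tensors="pt").input_ids
    output_ids = model.generate(
        input_ids,
        max_length=input_ids.shape[-1] + 40,
        do_sample=True, top_p=0.9,
        pad_token_id=tokenizer.eos_token_id,
    )
    reply = tokenizer.decode(output_ids[0, input_ids.shape[-1]:], skip_special_tokens=True)
    history += reply + tokenizer.eos_token
    print(f"turn {turn + 1}: {reply}")
```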
- Toxicity Detection Models
- utilize an ensemble of three different toxicity detection models: Toxic-bert, Perspective API, and the Safety classifier (Xu et al., 2020)
- Toxic-bert is the least sensitive of the three, followed by Perspective API, and the Safety classifier
- allow the adversary to only use one of the toxicity detection models to design its attack. We then quantify toxicity using the other two toxicity detection methods, not accessed by the adversary (a scoring sketch follows below)
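A sketch of how an utterance could be scored by such an ensemble. Toxic-bert is available through the detoxify package; the Perspective API and Safety classifier calls are left as hypothetical stubs since they require external access.

```python
# Sketch: score an utterance with the toxicity ensemble. Only the Toxic-bert path is
# runnable as-is (via the `detoxify` package); the other two detectors are stubbed.
from detoxify import Detoxify

toxic_bert = Detoxify("original")   # Toxic-bert style multi-label toxicity classifier

def ensemble_toxicity(utterance: str) -> dict:
    scores = {"toxic_bert": float(toxic_bert.predict(utterance)["toxicity"])}
    # scores["perspective"] = query_perspective_api(utterance)          # hypothetical helper
    # scores["safety_classifier"] = query_safety_classifier(utterance)  # hypothetical helper
    return scores

print(ensemble_toxicity("You are wonderful."))
```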
- Data
- context sentences from two different datasets, Wizard of Wikipedia (Dinan et al., 2018) and ConvoKit’s Reddit Corpus
- Wikipedia: neutral topics
- Reddit: sensitive topics
- picked 50 random context sentences from the Wizard of Wikipedia and 50 from the Reddit datasets.
- AMT Experiments
- To compare and verify the quality of conversations generated during and after the attacks, we conduct human experiments
- AMT workers annotated 100 conversations from each of the three attacks, and each conversation was annotated by 3 AMT workers, giving us 900 annotated conversations overall, 300 from each attack
Results
- Attack Effectiveness
- two of our proposed attacks, UAT-LM and UTSC-1, are performing the best according to the Perspective API and Toxic-bert classifiers
- UAT baseline performs the best according to the Safety classifier.
- Overall, results show that UTSC-1 and UAT-LM attacks are competitive attacks in terms of attack effectiveness.
- UAT (baseline) attack tends to generate meaningless phrases, e.g., "acist neighborhoodsJohnson carry morals Ukrain", which can easily be detected as an anomaly and make the conversation not flow naturally
- GPT-2 perplexity comparison (a measurement sketch follows below):
- UAT is absurdly high (~10^7) compared to ~10^4 for UAT-LM, and ~160 for UTSC-1
- the no-attack case is ~39
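A sketch of how these GPT-2 perplexity numbers can be reproduced in spirit; the exact tokenization and averaging choices are assumptions.

```python
# Sketch: GPT-2 perplexity of an attack utterance, the quantity used above to show
# that UAT triggers (~1e7) are far easier to flag than UTSC-1 attacks (~160).
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def gpt2_perplexity(text: str) -> float:
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = model(ids, labels=ids).loss   # mean token-level negative log-likelihood
    return float(torch.exp(loss))

print(gpt2_perplexity("acist neighborhoodsJohnson carry morals Ukrain"))
```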
- Attack Transferability
- the attack is forcing the defender to generate actual toxic language rather than fooling the toxicity classifier
- Human Evaluation
- Our UTSC-1 attack is rated to have the highest coherency
- UTSC-1 is rated to have more fluent attacks, with mostly moderate to good scores and a higher average (shown by the black dotted lines) compared to the UAT and UAT-LM baselines
- Fleiss Kappa (Fleiss, 1971) annotator agreement results from this evaluation are reported in Table 1. Annotators have reasonable overall agreement for all the qualities
Defense Approaches
- two components
- (a) detecting the attack and
- (b) mitigating its effect by ensuring that the defender does not generate a toxic response
- detection
- The detection problem is rather straightforward, as the defense can simply run a toxicity classifier on the generated response
- mitigation
- Xu et al. (2020) suggested a mitigating approach which, when a toxic response is detected, simply resets the dialogue and generates a (non-toxic) utterance by randomly sampling from a predefined set of topics
- In that prior approach, the dialogue is reset after toxicity is detected and a new utterance is generated by sampling randomly from predefined topics
- This work instead tries to avoid generating toxic utterances while maintaining the conversational flow
Methodology
- The defense mechanism in the second (mitigation) stage utilizes two layers of reasoning using two different interpretability techniques
- The first layer aims to detect which tokens in the defender's utterance are making the toxicity detection model label the utterance as toxic.
- i.e., find the problematic tokens in the defender's utterance; we call these tokens the L1 tokens
- The second layer aims to detect which tokens in the adversary's attack utterance are responsible for the generation of the L1 tokens from the defender's utterance
- i.e., find the tokens in the adversary's attack utterance that cause the L1 tokens (the L2 tokens)
- The defender then masks the L2 tokens from the adversary, which were responsible for triggering the defender model to generate toxic tokens, and generates a new utterance
- i.e., mask the L2 tokens that trigger the L1 tokens, then regenerate the utterance
- then apply a toxicity classifier on this new utterance
- i.e., check the toxicity of the newly generated utterance
- If it is non-toxic, accept it; otherwise mask more tokens and repeat (a sketch of this loop follows below)
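A sketch of the two-layer masking loop described above. The component callables (generation, toxicity scoring, L1/L2 attribution) are passed in as placeholders rather than the paper's implementations, and the threshold and mask budget are assumptions.

```python
# Sketch of the iterative defense: find L1 tokens in a toxic reply, trace them back to
# L2 tokens in the adversary's utterance, mask those L2 tokens, and regenerate.
from typing import Callable, List

def defend(history: str, attack_utterance: str,
           generate_reply: Callable[[str, str], str],
           toxicity: Callable[[str], float],
           find_l1_tokens: Callable[[str], List[str]],
           find_l2_tokens: Callable[[str, List[str]], List[str]],
           max_rounds: int = 3, threshold: float = 0.5, mask: str = "[MASK]") -> str:
    masked_attack = attack_utterance
    for _ in range(max_rounds):
        reply = generate_reply(history, masked_attack)
        if toxicity(reply) < threshold:              # reply is safe: keep it
            return reply
        l1 = find_l1_tokens(reply)                   # layer 1: toxic tokens in the reply
        l2 = find_l2_tokens(masked_attack, l1)       # layer 2: attack tokens that caused them
        for token in l2:                             # hide the responsible attack tokens
            masked_attack = masked_attack.replace(token, mask)
    return generate_reply(history, masked_attack)    # best effort after the mask budget
```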
- For the first layer, we use transformers-interpret, which provides explanations and identifies the L1 tokens according to the Toxic-bert model
- i.e., the L1 tokens are found via a BERT-based toxicity classifier
- uses https://github.com/cdpierse/transformers-interpret

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer
```
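Expanding the snippet above into a minimal usage sketch of transformers-interpret for the first layer; the public unitary/toxic-bert checkpoint, the "toxic" class name, and the attribution cutoff are assumptions, not the paper's exact settings.

```python
# Sketch: identify L1 tokens, i.e. the words that push Toxic-bert toward the "toxic" label.
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from transformers_interpret import SequenceClassificationExplainer

model = AutoModelForSequenceClassification.from_pretrained("unitary/toxic-bert")
tokenizer = AutoTokenizer.from_pretrained("unitary/toxic-bert")
explainer = SequenceClassificationExplainer(model, tokenizer)

utterance = "an example defender utterance"
word_attributions = explainer(utterance, class_name="toxic")        # list of (word, attribution)
l1_tokens = [word for word, score in word_attributions if score > 0.3]  # assumed cutoff
print(l1_tokens)
```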
- For the second layer, we use LERG (Tuan et al., 2021), which provides local explanations for dialogue response generation and identifies the L2 tokens
- LERG (Local Explanation of Response Generation) is a unified approach to explain why a conditional text generation model will predict a text
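LERG's own API is not shown here. Instead, below is a simplified leave-one-out stand-in that only conveys what the second layer computes: how much each token in the adversary's utterance contributes to the dialogue model producing the L1 tokens. `response_log_prob` is a hypothetical helper, not part of LERG.

```python
# Not LERG itself: a leave-one-out approximation of layer-2 attribution. Each attack
# token is scored by how much deleting it reduces the dialogue model's log-probability
# of the toxic part (L1 tokens) of the defender's reply.
from typing import Callable, List

def find_l2_tokens(attack_utterance: str, l1_span: str,
                   response_log_prob: Callable[[str, str], float],
                   top_k: int = 1) -> List[str]:
    base = response_log_prob(attack_utterance, l1_span)
    tokens = attack_utterance.split()
    drops = []
    for i, token in enumerate(tokens):
        ablated = " ".join(tokens[:i] + tokens[i + 1:])            # drop one attack token
        drops.append((token, base - response_log_prob(ablated, l1_span)))
    drops.sort(key=lambda pair: pair[1], reverse=True)             # largest drop = most responsible
    return [token for token, _ in drops[:top_k]]
```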
Experimental Setup
Baselines
- Two-stage Non Sequitur Baseline
- If toxicity is detected, the defender changes the topic of the conversation (a sketch follows below)
- uses a toxicity classifier to detect if the utterance is toxic or not. It then changes the topic of the conversation if the utterance was detected to be toxic, e.g., "Hey do you want to talk about something else? How about we talk about X?" where X is a randomly chosen topic from 1087 topics judged as safe from the Wizard of Wikipedia conversational topic list
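A sketch of the Non Sequitur baseline's behavior; the threshold and topic list are illustrative placeholders.

```python
# Sketch: Two-stage Non Sequitur defense. If the candidate reply is judged toxic,
# swap it for a topic-change template built from a randomly chosen safe topic.
import random
from typing import List

def non_sequitur_defense(reply: str, toxicity_score: float,
                         safe_topics: List[str], threshold: float = 0.5) -> str:
    if toxicity_score < threshold:
        return reply                          # keep the reply when it is safe
    topic = random.choice(safe_topics)        # e.g., one of the 1087 safe Wizard of Wikipedia topics
    return f"Hey do you want to talk about something else? How about we talk about {topic}?"
```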
- Trigger Masking (TM) Baseline
- consider masking the adversarial trigger tokens. Note that the defender does not generally know which tokens were the trigger-tokens used by the adversary, so this approach is not applicable in realistic settings.
- In practice the defender would not know which tokens were the trigger tokens; this baseline is included only for insight
- i.e., it is an oracle baseline
AMT Experiments
- evaluate the defense quality according to relevancy and fluency, the coherency of the overall conversation, and the toxicity of the defense utterance
- 27 conversations were rated from each of the three defenses (TM, Two-stage Non Sequitur, and our proposed defense). 3 AMT workers rated each conversation, which gave us 243 annotations, 81 from each defense
Results
- Defense Effectiveness
- our proposed defense mechanism as well as the Non Sequitur baseline achieve 100% defense effectiveness according to the Toxic-bert classifier
- with our proposed method, for all the attacks except UAT-LM, we were able to reach 100% defense effectiveness by only masking one token
- For UAT-LM, almost 90% of cases were resolved by masking one token and the rest were resolved by the iterative approach that masked multiple tokens (up to 3)
- Defense Transferability
- Human Evaluation
Beyond Conversational Agents
- show the generalizability of our defense method to non-conversational generation tasks by conducting experiments with the RealToxicityPrompts dataset
Conclusion
- studied the possibility of generating imperceptible attacks against conversational agents that, while fluent and coherent, target the model into generating toxic responses
- proposed a defense mechanism that was shown to be effective through various automatic and human evaluations as well as its transferability to human attacks, general generation tasks, and different toxicity classifiers
- Future work can focus on improving our proposed attacks both in terms of imperceptibility and effectiveness as well as more advanced defense mechanisms.