Maxwell Forbes†‡ Jena D. Hwang‡ Vered Shwartz†‡ Maarten Sap† Yejin Choi†‡ †Paul G. Allen School of Computer Science & Engineering, University of Washington ‡Allen Institute for AI
Abstract
The paper introduces SOCIAL-CHEM-101, a large-scale corpus that catalogs 292k rules-of-thumb such as "It is rude to run a blender at 5am" as the basic conceptual units.
Each rule-of-thumb is further broken down along 12 dimensions of people's judgments, including social judgments of good and bad, moral foundations, expected cultural pressure, and assumed legality, which together amount to over 4.5 million annotations of categorical labels and free-text descriptions.
The accompanying model, NEURAL NORM TRANSFORMER, learns and generalizes SOCIAL-CHEM-101 to reason about previously unseen situations, generating relevant (and potentially novel) attribute-aware social rules-of-thumb.
Introduction
Motivating example: "wanting to call the cops on my neighbors."
Multiple considerations come into play (legality, cultural pressure, ..., morality).
"Reporting a crime" and "being friends with your neighbor" are conflicting norms here.
Figure 1
The central hexagon shows the situation.
The smaller hexagons around it are the categories.
The tubes inside each category hold the RoTs.
Situation & RoTs
RoTs
we organize descriptive norms via free-text rules-of-thumb (RoTs) as the basic conceptual units.
An RoT consists of a judgment and an action.
Each RoT is annotated along 12 dimensions, such as social judgments of good and bad, theoretical categories of moral foundations, expected cultural pressure, and assumed legality.
SOCIAL-CHEM-101, a new type of NLP resource that catalogs 292k RoTs over 104k real life situations, along with 365k sets of structural annotations, which break each RoT into 12 dimensions of norm attributes. Together, this amounts to over 4.5M categorical and free-text annotations.
Combined, this gives over 4.5M categorical & free-text annotations. (The released data has roughly 365k breakdown rows; the 4.5M figure presumably counts each of the 12 attribute annotations per breakdown separately, since 365k × 12 ≈ 4.4M.)
Even so, the breadth of this task proves challenging to current neural models, with humans rating models' adherence to different attributes from 0.28 to 0.91 micro-F1.
Approach
social norms are culturally-sensitive standards
Social norms
preserving biological needs for survival (e.g., refraining from harming or killing)
maintaining social civility and order (e.g., maintaining politeness, recognizing personal space)
providing identity and belonging to a community (e.g., respecting the elderly)
RoTs
The aim is then to bring these implicit expectations about social norms to the forefront via RoTs.
formalize the definition of RoTs as situationally relevant evaluative judgments of a social norm
Example:
Situation: Punching someone.
RoT: It is unacceptable to injure a person.
More complex situations can be associated with multiple RoTs
For the example below, the RoTs illustrate:
RoTs about punching (RoT 1) vs. stealing (RoT 2)
RoTs targeting the different characters in the situation (RoTs 1, 4 target the narrator; RoTs 2, 3 target narrator’s friend)
additional social interpretation implicit in the situation (RoT 3: theft from a friend is cast as an act of betrayal)
Situation: Punching a friend who stole from me.
RoT 1: It is unacceptable to injure a person.
RoT 2: People should not steal from others.
RoT 3: It is bad to betray a friend.
RoT 4: It is OK to want to take revenge.
Further examples are given in the paper.
SOCIAL-CHEM-101 Dataset
obtained 104k source situations from 4 text domains (§3.1), for which we elicited 292k RoTs from crowd workers (§3.2)
define a structured annotation task where workers isolate the central action described by the RoT and provide a series of judgments about the RoT and the action (§3.3).
In total, we collect 365k structured annotations, performing multiple annotations per RoT for a subset of the RoTs to study the variance in annotations.
Situations
A total of 104k real-life situations were gathered from four domains:
scraped titles of posts in the subreddits r/confessions (32k)
r/amitheasshole (r/AITA, 30k)
30k sentences from the ROCStories corpus (rocstories, Mostafazadeh et al., 2016)
scraped titles from the Dear Abby advice column archives (dearabby, 12k)
Rules-of-Thumb (RoTs)
To collect RoTs, workers are given a situation as a prompt and asked to write 1–5 RoTs inspired by that situation.
From the 104k situations, a total of 292k RoTs were elicited.
Despite RoTs averaging just 10 words, 260k of the 292k RoTs are unique across the dataset.
Workers are instructed to produce RoTs that explain the basics of social norms and are:
1. inspired by the situation, to maintain a lower bound on relevance;
2. self-contained, to be understandable without additional explanation; and
3. structured as a judgment of acceptability (e.g., good/bad, (un)acceptable, okay) applied to an action being assessed.
For RoT diversity, workers must strike a balance between vagueness and specificity:
Vagueness: "It is rude to be selfish."
Specificity: “It is rude not to share your mac’n’cheese with your younger brother.”
Workers are also asked to write RoTs illustrating distinct ideas and to avoid trivial inversions, to prevent low-information RoTs that rephrase the same idea or simply invert the judgment and action.
Character Identification
ask workers to identify phrases in each situation that refer to people
Example: "My brother chased after __the Uber driver__"
Workers mark such spans; each marked span is called a character. Spans from three workers are collected, and every identified character becomes a candidate for grounding RoTs and actions in the structured annotation. To optimize for recall over precision, the largest set of characters identified by any worker is used, and a narrator character is included by default.
RoT Breakdowns
We perform a structured annotation, which we term a breakdown.
In an RoT breakdown, a worker isolates the underlying action contained in the RoT
There are two central annotation goals:
The first goal is to tightly ground RoTs to their respective situations.
The second goal is to partition social expectations using theoretically motivated categories.
As the breakdown figure shows, the RoT-level attributes differ from the action-level attributes (which likely raises the difficulty of dataset construction).
Grounding Attributes
Three attributes, called grounding attributes, ground the RoT and action to the situation and its characters.
At the RoT-level, workers mark which character should heed the RoT with the RoT Targeting attribute.
At the action level, workers first pick the action’s best candidate character, for whom the action is most relevant
Social Attributes
The social attributes capture the social expectations in an RoT. The first two both label anticipated agreement:
For an RoT, this attribute asks how many people probably agree with the RoT as stated.
At the action level, it asks what portion of people probably agree with the judgment given the action.
RoTs are assessed from a moral perspective, while actions are assessed from a legal or cultural perspective (a tricky distinction).
An RoT-level attribute is the set of Moral Foundations, based on a well-known social psychology theory that outlines culturally innate moral reasoning (Haidt, 2012).
The action-level attributes legality and cultural pressure are designed to reflect the two coarse-grained categories proposed by the Social Norms Theory (Kitts and Chiang, 2008; Perkins and Berkowitz, 1986)
Finally, the social judgment aims to capture subjective moral judgment. A base judgment of what is good or bad is thought to intrinsically motivate social norms
The RoT Category attribute estimates distinctions between morality, social norms, and other kinds of advice or general world knowledge (e.g., "It is good to eat when you are hungry").
The agency attribute is designed to let workers distinguish RoTs that involve agentive action from those that indicate an experience (e.g., "It is sad to lose a family member").
Analysis
three key aspects of our formalism: social judgment, anticipated agreement, and cultural pressure
Figure 5 analyzes these three attributes over the collected RoTs.
Moral judgment vs. cultural pressure: the moral-judgment axis is social judgment scaled by agreement (see below).
Example: "serving customers after close" (moral judgment: highest level, +8 / cultural pressure: discretionary).
Morally it is judged as very good, yet culturally it is left to discretion.
Social judgment vs. agreement:
Example: "giving ultimatums" (social judgment: bad / agreement: controversial, around 50%).
Socially it is judged as bad, but opinions on it are split.
In the left plot (Figure 5 (a)), the x-axis contains a new quantity, where social judgment (∈ [−2, 2]) is multiplied by agreement (∈ [0, 4]) to scale it.
x values range from universally-agreed bad actions (-8) to universally-agreed good actions (+8).
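A minimal sketch of this scaled quantity, assuming the integer codings just described (judgment in [-2, 2], agreement in [0, 4]); the function and example values are illustrative, not the dataset's actual field names.

```python
def scaled_judgment(judgment: int, agreement: int) -> int:
    """Scale a social judgment (-2 = very bad .. +2 = very good) by
    anticipated agreement (0 = almost no one .. 4 = almost everyone agrees)."""
    assert -2 <= judgment <= 2 and 0 <= agreement <= 4
    return judgment * agreement  # ranges from -8 (universally bad) to +8 (universally good)

# e.g., an action judged very good and near-universally agreed with:
print(scaled_judgment(2, 4))   # 8
# a bad but controversial action:
print(scaled_judgment(-1, 2))  # -2
```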
Model
investigate neural models based on pre-trained language models for learning various sub-tasks derived from SOCIAL-CHEM-101
Training Objectives
Because an action exists only via an RoT (an RoT being a judgment plus an action), the natural factorization is: given a situation, generate the RoT and its attributes, then the action and its attributes.
In this paper, we instead focus our study of actions on a more difficult distribution that conditions only on the situation:
(original training objective vs. the action-focused objective; equations omitted)
Several training setups are explored.
Table 1 shows the setups that we consider, and Figure 6 illustrates an example objective.
combine and shuffle all objectives’ views of the data.
Table 1
Figure 6
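A rough sketch of how one objective's view of the data might be linearized into a single sequence for a left-to-right LM, using the sequence-type tokens described in the appendix ([attrs], [rot]); the "<key=value>" attribute serialization and field order are illustrative assumptions, not the paper's exact format.

```python
# Hypothetical linearization for an objective like p(RoT attributes, RoT | situation).
# The [attrs] / [rot] tokens follow the appendix; the attribute encoding is illustrative.

def linearize(situation: str, rot_attrs: dict, rot: str) -> str:
    attr_str = " ".join(f"<{k}={v}>" for k, v in rot_attrs.items())
    return f"{situation} [attrs] {attr_str} [rot] {rot}"

print(linearize(
    situation="Punching a friend who stole from me.",
    rot_attrs={"judgment": "bad", "agreement": "most", "pressure": "strong-against"},
    rot="It is unacceptable to injure a person.",
))
```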
Architectures
present results for the GPT and GPT-2 architectures (Radford et al., 2018, 2019),
as well as two encoder-decoder language models (BART and T5, Lewis et al., 2019; Raffel et al., 2019).
term these architectures trained on our objectives the NEURAL NORM TRANSFORMER.
Experiments and Results
Tasks
pick two particular objectives to assess the models
The first is p(y, b_y | s) — “model choice.”
each model is allowed to pick the most likely attributes b_y, given a situation s
generate an RoT (or action) y that adheres to those attributes
This setup should be easier because the model is allowed to pick its own conditions: it first chooses attributes, then generates the RoT (or action).
The second setting is p(y | s, b_y), the "conditional" setting. Models are provided with a set of attributes b_y that they must follow when generating an RoT (or action) y.
more challenging setup, because models cannot simply condition on the set of attributes that they find most likely
The attribute sets b_y are taken from those provided by human annotators for the situation s, so models are never asked to generate from impossible constraints. The difficulty comes instead from having to satisfy constraints the model did not choose itself; a small sketch of both prompt formats follows.
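A tiny sketch contrasting the two settings as prompts, reusing the illustrative serialization above; in "model choice" the model continues from the situation alone, while in "conditional" the annotator-provided attributes are fixed in the prompt.

```python
situation = "Punching a friend who stole from me."
human_attrs = "<judgment=bad> <agreement=most>"  # illustrative encoding of b_y

# "model choice", p(y, b_y | s): the model generates attributes, then the RoT.
prompt_model_choice = f"{situation} [attrs]"

# "conditional", p(y | s, b_y): attributes are given; the model must generate
# an RoT that adheres to them.
prompt_conditional = f"{situation} [attrs] {human_attrs} [rot]"
```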
Setup
split our dataset into 80/10/10% train/dev/test partitions by situation
For all models we use top-p decoding with p = 0.9
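A minimal decoding sketch with top-p (nucleus) sampling at p = 0.9 via HuggingFace Transformers; the "gpt2" checkpoint stands in for a model fine-tuned on the linearized objectives, and the prompt format is the illustrative one from above.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")     # placeholder checkpoint
model = AutoModelForCausalLM.from_pretrained("gpt2")  # (a fine-tuned model would be used)

prompt = "Punching a friend who stole from me. [attrs]"
inputs = tokenizer(prompt, return_tensors="pt")

outputs = model.generate(
    **inputs,
    do_sample=True,
    top_p=0.9,            # nucleus sampling with p = 0.9, as in the paper's setup
    max_new_tokens=40,
    pad_token_id=tokenizer.eos_token_id,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```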
Baselines
A Random RoT baseline verifies dataset diversity (randomly drawn RoTs should have low relevance to test situations) and sanity-checks the evaluation setup (a randomly drawn RoT or action is still a well-formed, internally consistent statement, so low scores indicate irrelevance rather than broken text).
A BERT-Score (Zhang et al., 2020) retrieval baseline finds the training situation most similar to the test situation. If attributes b_y are provided, the retriever picks the RoT (or action) from the retrieved situation with the most similar attributes. (Situation similarity is measured over the situation text with BERTScore; the attributes are only used afterwards to choose among that situation's RoTs, e.g., by matching categorical values, as sketched below.)
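A rough sketch of such a retrieval baseline with the bert-score package: score the test situation against every training situation, take the best match, and then pick among its RoTs by attribute overlap. The exact-match attribute overlap is an assumption; the notes do not specify the paper's matching rule.

```python
from bert_score import score  # pip install bert-score

def retrieve_rot(test_situation, train_situations, train_rots, target_attrs=None):
    """train_rots[i]: list of (rot_text, attrs_dict) pairs for train_situations[i];
    target_attrs: the attribute set b_y, if the conditional setting provides one."""
    cands = [test_situation] * len(train_situations)
    _, _, f1 = score(cands, train_situations, lang="en", verbose=False)
    best = int(f1.argmax())  # most similar training situation by BERTScore F1

    rots = train_rots[best]
    if target_attrs is None:
        return rots[0][0]  # no attributes given: return any RoT of that situation
    # pick the RoT whose annotated attributes overlap most with b_y
    # (simple exact-match overlap; an assumption, not the paper's exact rule)
    return max(rots, key=lambda rv: sum(rv[1].get(k) == v for k, v in target_attrs.items()))[0]
```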
Ablations
-Small: fine-tune GPT-2 Small with the same general architecture.
-No pretrain: randomly initialize the model's weights.
Results
Human Evaluation
Table 2 presents a human evaluation measuring how effective models are at generating RoTs and actions for both task settings
The relevance score matters most here (whether RoTs actually apply to the provided situation).
In both setups, T5's generations rank as the most tightly relevant to the situation, but in terms of correctly following attributes, GPT-2 is more consistent, especially in the controlled task setup (bottom half of the table; top scores on 5 of 9 attributes).
However, no model is able to achieve a high score on all columns in the bottom half of the table
Automatic Evaluation
Attribute classifiers are trained with RoBERTa and used to classify the models' outputs.
Table 3 presents test set model performance on perplexity, BLEU (Papineni et al., 2002), and attribute micro-F1 classifier score.
The automatic metrics are consistent with the human evaluation: T5 is a strong generator overall, achieving the highest BLEU score and the highest relevance score in §5.2, while GPT-2 more consistently adheres to attributes, outperforming T5 in attribute F1.
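A minimal sketch of this automatic attribute check, assuming a RoBERTa sequence classifier fine-tuned per attribute (here initialized from "roberta-base" as a placeholder) and scikit-learn's micro-F1; this is not the authors' released evaluation code.

```python
import torch
from sklearn.metrics import f1_score
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
# Placeholder: in practice this classifier would be fine-tuned to predict one
# attribute (e.g., social judgment) from RoT text.
classifier = AutoModelForSequenceClassification.from_pretrained("roberta-base", num_labels=5)

def predict_attribute(rots):
    batch = tokenizer(rots, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        logits = classifier(**batch).logits
    return logits.argmax(dim=-1).tolist()

# Compare the classifier's predictions on generated RoTs against the attributes
# the generator was asked to follow, reporting micro-F1 as in Table 3.
generated = ["It is unacceptable to injure a person."]
target_labels = [0]  # placeholder label ids
print(f1_score(target_labels, predict_attribute(generated), average="micro"))
```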
Morality & Political Bias
Table 4 shows the correlations between RoT attributes and the political leaning and reliability of sources
Conclusion
present SOCIAL-CHEM-101, an attempt at providing a formalism and resource around the study of grounded social, moral, and ethical norms.
preliminary success in generative modeling of structured RoTs, and corroborate findings of moral leaning in an extrinsic task
Appendix
Additional Dataset Details
Situations
Domains
provide here a more thorough description of how we collected situations from the four domains we consider. Figure 7 gives more example situations from each domain.
Annotators may mark each situation with any of the following labels that apply:
Unclear: the situation was too simple, vague, or confusing to understand what happened.
NSFW: the situation contains suggestive or adult content.
Dark / disturbing / controversial: the situation contains content that may make folks uncomfortable, such as suicide, torture, or abuse.
Annotators may pass on writing RoTs for a situation marked with any of those labels, or they may still choose to write them.
We keep all of the labels collected. They are included in the dataset as additional fields.
they could be used to omit certain training data to keep a model biased away from potentially controversial subjects
Character Identification
The goal is to find the most descriptive phrase referring to each unique non-narrator person in the passage exactly once; always having a single, best reference to each person in the situation enables more consistent grounding.
The character identification guidelines given to the crowd worker annotators:
Character Identification Guidelines
(Guideline screenshots: guide_1, guide_2.)
Rules-of-Thumb (RoTs)
Figure 8 shows a sample of RoTs organized both by situation domain and topic.
RoT Writing Guidelines
RoTs should capture everyday social conventions, distinct from purely encyclopedic information.
They must not be plain factual statements ("boats are expensive") but split into a judgment and an action.
They must not require external information (nothing absent from the situation, no invented events).
They should balance vagueness and specificity.
Avoid phrasings that merely paraphrase and repeat the same idea.
(Example RoT screenshots: RoT_1, RoT_2.)
RoT Breakdowns
The breakdown annotations cover the following attributes: RoT Categorization, Moral Foundations, Action and Judgment, Agency, Social Judgment, Anticipated Agreement, Legality, Cultural Pressure, and Taking Action.
RoT Categorization
The category attribute distinguishes more desired annotation topics (morality/ethics, social norms) from less desired ones (advice and "it is what it is" statements).
RoT categories are not mutually exclusive, and the lines between them are not always clear.
This paper's experiments use all data regardless of RoT category; future work using the dataset may consider filtering on it.
Moral Foundations
Action and Judgment
Agency
It is challenging to distinguish agency from experience in cases where the action involves thinking thoughts or feeling emotions.
Social Judgment
Workers transcribe the intent of the RoT's original judgment rather than picking their own; disagreement can instead be expressed through the anticipated agreement attribute.
Anticipated Agreement
Legality
Cultural Pressure
Taking Action
Crowdsourcing
Workers undergo an extensive vetting process before working on RoTs
This includes a paid qualification (qual) with a quiz on each of the guidelines and a manual review of sample RoTs
Workers who pass the qual then move to a staging pool, where they can work on a small number of situations.
Annotator Demographics
With an extensive qualification process, 137 workers participated in our tasks.
Of those, 55% were women and 45% men. 89% of workers identified as white, 7% as Black.
39% were in the 30-39 age range, 27% in the 21-29 and 19% in the 40-49 age ranges.
A majority (53%) of workers were single, and 35% were married.
47% of workers considered themselves as middle class, and 41% working class.
In terms of education level, 44% had a bachelor's degree and 36% had some college experience or an associate's degree.
Experimental Details
Generative Models: We use the Transformers package (Wolf et al., 2019) to implement our models. All models are trained for a single epoch with a batch size of 64 and random seed 42.
Each input and output sequence is prefixed with a special token indicating its type (e.g. [attrs], [rot], [action]).
A special token is also defined for each attribute value (example tokens omitted here).
initialize the special token embeddings with the embedding of their corresponding words, taking the average for multiword expressions
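A sketch of this special-token setup for GPT-2 with HuggingFace Transformers: register the sequence-type tokens, resize the embeddings, and initialize each new token's vector from the (averaged) embeddings of its corresponding words. The word glosses chosen for each token below are assumptions for illustration.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

# Sequence-type tokens from the paper, mapped to illustrative corresponding words
# (attribute-value tokens would be added the same way).
special_tokens = {"[attrs]": "attributes", "[rot]": "rule of thumb", "[action]": "action"}

tokenizer.add_special_tokens({"additional_special_tokens": list(special_tokens)})
model.resize_token_embeddings(len(tokenizer))

embeddings = model.get_input_embeddings()
with torch.no_grad():
    for token, words in special_tokens.items():
        token_id = tokenizer.convert_tokens_to_ids(token)
        word_ids = tokenizer(words, add_special_tokens=False)["input_ids"]
        # average the embeddings of the corresponding words (handles multiword expressions)
        embeddings.weight[token_id] = embeddings.weight[word_ids].mean(dim=0)
```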
SOCIAL CHEMISTRY 101: Learning to Reason about Social and Moral Norms