Maxwell Forbes†‡ Jena D. Hwang‡ Vered Shwartz†‡ Maarten Sap† Yejin Choi†‡ †Paul G. Allen School of Computer Science & Engineering, University of Washington ‡Allen Institute for AI
introduce SOCIAL-CHEM-101, a large-scale corpus that catalogs 292k rules-of-thumb such as “It is rude to run a blender at 5am” as the basic conceptual units.
Each rule-of-thumb is further broken down with 12 different dimensions of people’s judgments, including social judgments of good and bad, moral foundations, expected cultural pressure, and assumed legality
which together amount to over 4.5 million annotations of categorical labels and free-text descriptions.
The NEURAL NORM TRANSFORMER learns and generalizes SOCIAL-CHEM-101 to successfully reason about previously unseen situations, generating relevant (and potentially novel) attribute-aware social rules-of-thumb
An example situation one might encounter:
“wanting to call the cops on my neighbors”
Several considerations are in play (legality, cultural pressure, …, morality)
“reporting a crime” and “being friends with your neighbor” are conflicting norms
In the figure, the central hexagon holds the situation;
the first ring of smaller hexagons holds the categories;
and the tubes inside each category hold the RoTs.
Situation & RoTs
we organize descriptive norms via free-text rules-of-thumb (RoTs) as the basic conceptual units.
An RoT consists of a judgment and an action.
Each RoT is further annotated along 12 dimensions,
such as social judgments of good and bad, theoretical categories of moral foundations, expected cultural pressure, and assumed legality
SOCIAL CHEM-101, a new type of NLP resource that catalogs 292k RoTs over 104k real life situations, along with 365k sets of structural annotations, which break each RoT into 12 dimensions of norm attributes. Together, this amounts to over 4.5M categorical and free-text annotations.
Together that amounts to 4.5M categorical & free-text annotations.. -> but the released dataset seems to have only ~330k rows; where does the gap come from? (Likely because each of the 365k breakdowns contributes labels across 12 dimensions, ≈ 4.4M individual annotations.)
Even so, the breadth of this task proves challenging for current neural models, with humans rating models’ adherence to different attributes from 0.28 to 0.91 micro-F1.
social norms are culturally-sensitive standards
preserving biological needs for survival (e.g., refraining from harming or killing)
maintaining social civility and order (e.g., maintaining politeness, recognizing personal space)
providing identity and belonging to a community (e.g., respecting the elderly)
Our aim is then to bring these implicit expectations about social norms to the forefront via RoTs
formalize the definition of RoTs as situationally relevant evaluative judgments of a social norm
Situation: Punching someone. RoT: It is unacceptable to injure a person.
More complex situations can be associated with multiple RoTs
RoTs about stealing (RoT 1) vs. punching (RoT 2)
RoTs targeting the different characters in the situation (RoTs 1, 4 target the narrator; RoTs 2, 3 target narrator’s friend)
additional social interpretation implicit in the situation (RoT 3: theft from a friend is cast as an act of betrayal)
Situation: Punching a friend who stole from me.
RoT 1: It is unacceptable to injure a person.
RoT 2: People should not steal from others.
RoT 3: It is bad to betray a friend.
RoT 4: It is OK to want to take revenge.
obtained 104k source situations from 4 text domains (§3.1), for which we elicited 292k RoTs from crowd workers (§3.2)
define a structured annotation task where workers isolate the central action described by the RoT and provide a series of judgments about the RoT and the action (§3.3).
In total, we collect 365k structured annotations, performing multiple annotations per RoT for a subset of the RoTs to study the variance in annotations.
In the end they collect ~365k breakdowns, with multiple annotations per RoT for a subset!
They gathered ~100k situation sentences from Reddit and prior corpora.
gather a total of 104k real life situations from four domains
scraped titles of posts in the subreddits r/confessions (32k)
r/amitheasshole (r/AITA, 30k)
30k sentences from the ROCStories corpus (rocstories, Mostafazadeh et al., 2016)
scraped titles from the Dear Abby advice column archives (dearabby, 12k)
Workers are asked to write about 1–5 RoTs for each given situation.
To collect RoTs, we provide workers with a situation as a prompt and ask them to write 1–5 RoTs inspired by that situation
From ~100k situations, ~300k RoTs are elicited.
From the 104k situations, we elicit a total of 292k RoTs.
RoTs are typically about 10 words long.
Despite RoTs averaging just 10 words, we observe that 260k/292k RoTs are unique across the dataset
To elicit RoTs, workers were briefed on the basics of social norms.
instruct the workers to produce RoTs that explain the basics of social norms
1. inspired by the situation, to maintain a lower bound on relevance; (must be grounded in the situation)
2. self-contained, to be understandable without additional explanation; (no extra context needed)
3. structured as a judgment of acceptability (e.g., good/bad, (un)acceptable, okay) and an action that is assessed. (a judgment such as good/bad plus an action)
For RoT diversity, a balance must be struck between vagueness and specificity.
Vagueness: “It is rude to be selfish.”
Specificity: “It is rude not to share your mac’n’cheese with your younger brother.”
Workers are advised to write distinct ideas and to avoid trivial inversions or rewordings.
also ask workers to write RoTs illustrating distinct ideas and avoid trivial inversions, to prevent low-information RoTs that rephrase the same idea or simply invert the judgment and action.
ask workers to identify phrases in each situation that refer to people
e.g., “My brother chased after __the Uber driver__”
It seems they collected RoTs grounded to the underlined characters, with a narrator character also included by default.
workers mark the underlined spans. We collect three workers’ spans, calling each span a character. All characters identified become candidates for grounding RoTs and actions in the structured annotation. As such, we optimize for recall instead of precision by using the largest set of characters identified by any worker. We also include a narrator character by default.
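The recall-oriented character aggregation described above can be sketched as follows; the function name and span strings are illustrative, not from the paper's code.

```python
def aggregate_characters(worker_spans):
    """Keep the largest single worker's set of character spans
    (optimizing recall over precision) and always add the narrator,
    who is included by default."""
    largest = max(worker_spans, key=len, default=set())
    return sorted(largest | {"narrator"})

# Three workers mark spans in "My brother chased after the Uber driver":
workers = [
    {"My brother", "the Uber driver"},  # most complete annotation wins
    {"My brother"},
    {"the Uber driver"},
]
print(aggregate_characters(workers))
# -> ['My brother', 'narrator', 'the Uber driver']
```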
The structured annotation is termed a “breakdown”.
We perform a structured annotation, which we term a breakdown
In an RoT breakdown, a worker isolates the underlying action contained in the RoT
central annotation goals
The first goal is to tightly ground RoTs to their respective situations.
The second goal is to partition social expectations using theoretically motivated categories.
Looking at the figure below, RoT-level attributes differ from action-level attributes (this likely raises the difficulty of dataset construction..)
We call three attributes grounding attributes
to ground the RoT and action to the situation and characters
At the RoT-level, workers mark which character should heed the RoT with the RoT Targeting attribute.
At the action level, workers first pick the action’s best candidate character, for whom the action is most relevant
The social attributes capture social expectations in an RoT. The first two social attributes both label anticipated agreement
For an RoT, this attribute asks how many people probably agree with the RoT as stated.
At the action level, it asks what portion of people probably agree with the judgment given the action.
RoTs are judged from a moral perspective, while actions are judged from legal and cultural perspectives (tricky..)
An RoT-level attribute is the set of Moral Foundations, based on a well-known social psychology theory that outlines culturally innate moral reasoning (Haidt, 2012).
The action-level attributes legality and cultural pressure are designed to reflect the two coarse-grained categories proposed by the Social Norms Theory (Kitts and Chiang, 2008; Perkins and Berkowitz, 1986)
Finally, the social judgment aims to capture subjective moral judgment. A base judgment of what is good or bad is thought to intrinsically motivate social norms
The RoT Category attribute estimates distinctions between morality, social norms, and other kinds of advice, as well as general world knowledge (e.g., “It is good to eat when you are hungry”)
The attribute agency is designed to let workers distinguish RoTs that involve agentive action from those that indicate an experience (e.g., “It is sad to lose a family member” is experiential rather than agentive).
three key aspects of our formalism: social judgment, anticipated agreement, and cultural pressure
Figure 5 analyzes these three attributes over the RoTs.
Moral judgment vs. cultural pressure (is “moral judgment” here constructed as social judgment × agreement..?)
"serving customers after close" (moral judgment: highest (+8) / cultural pressure: discretionary)
Morally rated as very good, but culturally it is left to personal discretion.
Social judgment vs. agreement
"giving ultimatums" (social judgment: bad / agreement: controversial (~50%))
Judged as socially bad, but opinions are split.
In the left plot (Figure 5 (a)), the x-axis contains a new quantity, where social judgment (∈ [−2, 2]) is multiplied by agreement (∈ [0, 4]) to scale it.
x values range from universally-agreed bad actions (-8) to universally-agreed good actions (+8).
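As a quick sanity check on the ranges, the scaled quantity can be computed directly; a minimal sketch, with the integer scales taken from the ranges quoted above:

```python
def scaled_judgment(social_judgment, agreement):
    """Figure 5(a)'s x-axis: social judgment in [-2, 2] scaled by
    anticipated agreement in [0, 4], yielding values in [-8, 8]."""
    assert -2 <= social_judgment <= 2 and 0 <= agreement <= 4
    return social_judgment * agreement

print(scaled_judgment(2, 4))   # universally-agreed good -> 8
print(scaled_judgment(-2, 4))  # universally-agreed bad -> -8
print(scaled_judgment(-1, 2))  # mildly bad, controversial -> -2
```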
investigate neural models based on pre-trained language models for learning various sub-tasks derived from SOCIAL-CHEM-101
An action exists only as part of an RoT (an RoT being a judgment plus an action),
so given a situation, one could first generate RoT attributes and then action attributes.
This work instead focuses on generating actions directly from the situation.
In this paper, we instead focus our study of actions on a more difficult distribution that conditions only on the situation:
original training objective -> focus on actions given only the situation
Several experimental setups were tried.
Table 1 shows the setups that we consider, and Figure 6 illustrates an example objective.
combine and shuffle all objectives’ views of the data.
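A hedged sketch of this multi-task data construction: each annotated example is linearized once per objective, and all the resulting "views" are shuffled into one training stream. The field tokens (`<situation>`, `<attrs>`, `<rot>`) are illustrative placeholders, not the paper's actual special tokens.

```python
import random

def build_views(example):
    """Linearize one example once per training objective."""
    s, y, attrs = example["situation"], example["rot"], example["attrs"]
    return [
        # attributes given, RoT generated: p(y | s, b_y)
        f"<situation> {s} <attrs> {attrs} <rot> {y}",
        # attributes and RoT both generated: p(y, b_y | s)
        f"<situation> {s} <rot> {y} <attrs> {attrs}",
    ]

examples = [{
    "situation": "running a blender at 5am",
    "rot": "It is rude to run a blender at 5am.",
    "attrs": "<legal> <pressure-strong-against>",
}]
views = [v for ex in examples for v in build_views(ex)]
random.shuffle(views)  # mix all objectives' views together
print(len(views))  # -> 2
```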
present results for the GPT and GPT-2 architectures (Radford et al., 2018, 2019),
as well as two encoder-decoder language models (BART and T5, Lewis et al., 2019; Raffel et al., 2019).
term these architectures trained on our objectives the NEURAL NORM TRANSFORMER.
Experiments and Results
pick two particular objectives to assess the models
The first is p(y, b_y | s) — “model choice.”
each model is allowed to pick the most likely attributes b_y, given a situation s
generate an RoT (or action) y that adheres to those attributes
setup should be easier because a model is allowed to pick the conditions
First pick attributes, then generate the RoT (or action).
second setting is p(y|s, b_y) — “conditional.”
provide models with a set of attributes b_y that they must follow when generating an RoT (or action) y.
more challenging setup, because models cannot simply condition on the set of attributes that they find most likely
select sets of attributes b_y provided by the human annotators for the situation s to ensure models are not tasked with generating from impossible constraints
Seems harder because the model must follow given constraints rather than the ones it finds most likely; using human-annotated attributes ensures the constraints are at least feasible.
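For a left-to-right model, the two evaluation settings differ only in where the attributes sit: in "model choice" the model continues the prompt with its own attributes and then the RoT, while in "conditional" the human-annotated attributes are fixed in the prompt. A minimal illustration (token names are placeholders, not the paper's):

```python
def make_prompt(situation, setting, human_attrs=None):
    if setting == "model_choice":
        # p(y, b_y | s): model picks attributes, then generates y
        return f"<situation> {situation} <attrs>"
    if setting == "conditional":
        # p(y | s, b_y): attributes are given; only y is generated
        return f"<situation> {situation} <attrs> {human_attrs} <rot>"
    raise ValueError(f"unknown setting: {setting}")

print(make_prompt("wanting to call the cops on my neighbors",
                  "conditional", "<morality-ethics> <legal>"))
```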
split our dataset into 80/10/10% train/dev/test partitions by situation
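Splitting by situation (rather than by RoT) keeps every RoT for a given situation in the same partition, so test situations are genuinely unseen. A sketch with illustrative names:

```python
import random

def split_by_situation(situations, seed=0):
    """80/10/10 split at the situation level."""
    situations = sorted(situations)
    random.Random(seed).shuffle(situations)
    n = len(situations)
    n_train, n_dev = int(0.8 * n), int(0.1 * n)
    return (situations[:n_train],
            situations[n_train:n_train + n_dev],
            situations[n_train + n_dev:])

train, dev, test = split_by_situation([f"situation {i}" for i in range(100)])
print(len(train), len(dev), len(test))  # -> 80 10 10
```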
For all models we use top-p decoding with p = 0.9
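Top-p (nucleus) sampling truncates the distribution to the smallest set of tokens whose cumulative probability reaches p, then renormalizes and samples. A minimal sketch over an explicit distribution (a real decoder applies this to the model's per-step logits):

```python
import random

def top_p_sample(probs, p=0.9, rng=None):
    """Sample from the smallest high-probability 'nucleus' with mass >= p."""
    rng = rng or random.Random(0)
    items = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, mass = [], 0.0
    for tok, pr in items:
        kept.append((tok, pr))
        mass += pr
        if mass >= p:
            break  # nucleus complete; the low-probability tail is discarded
    # Renormalize over the nucleus and draw a sample.
    total = sum(pr for _, pr in kept)
    r, acc = rng.random() * total, 0.0
    for tok, pr in kept:
        acc += pr
        if r <= acc:
            return tok
    return kept[-1][0]

probs = {"good": 0.55, "bad": 0.25, "okay": 0.15, "rude": 0.05}
# With p = 0.9 the nucleus is {good, bad, okay}; "rude" is never sampled.
print(top_p_sample(probs))
```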
Random RoT baseline to verify the dataset diversity (selections should have low relevance to test situations)
evaluation setup (RoTs and actions should still be internally consistent) -> what does this mean?
use a BERT-Score (Zhang et al., 2020) retrieval baseline that finds the most similar training situation
If attributes b_y are provided, the retriever picks the RoT (or action) from the retrieved situation with the most similar attributes
What form do the attributes take such that "most similar attributes" can be computed?
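One plausible reading of the second step: the attributes are categorical labels, so "most similar attributes" can be as simple as label overlap between the target breakdown b_y and each candidate RoT's breakdown. A hedged sketch with made-up label names:

```python
def pick_by_attributes(candidates, target_attrs):
    """candidates: (rot_text, attribute_set) pairs from the retrieved
    situation; return the RoT whose labels overlap target_attrs most."""
    return max(candidates, key=lambda c: len(c[1] & set(target_attrs)))[0]

candidates = [
    ("It is rude to be loud at night.", {"bad", "strong-against", "legal"}),
    ("You should report crimes you witness.", {"good", "legal"}),
]
print(pick_by_attributes(candidates, {"good", "legal"}))
# -> "You should report crimes you witness."
```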
also train a model with the same GPT-2 Small architecture but randomly initialized weights (i.e., no pretraining).
Table 2 presents a human evaluation measuring how effective models are at generating RoTs and actions for both task settings
The relevance score matters (whether RoTs actually apply to the provided situation)
In both setups, T5’s generations rank as most tightly relevant to the situation. But in terms of correctly following attributes, GPT-2 is more consistent, especially in the controlled task setup (lower; top scores on 5/9 attributes).
However, no model is able to achieve a high score on all columns in the bottom half of the table
train attribute classifiers using RoBERTa
use them to classify the model outputs
Table 3 presents test set model performance on perplexity, BLEU (Papineni et al., 2002), and attribute micro-F1 classifier score
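The attribute micro-F1 pools true/false positives over every label decision the classifier makes, rather than averaging per-attribute scores. A small self-contained sketch (label names are illustrative):

```python
def micro_f1(gold, pred):
    """gold, pred: lists of label sets, one per generated RoT/action."""
    tp = sum(len(g & p) for g, p in zip(gold, pred))
    fp = sum(len(p - g) for g, p in zip(gold, pred))
    fn = sum(len(g - p) for g, p in zip(gold, pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    return 2 * precision * recall / denom if denom else 0.0

gold = [{"care-harm"}, {"fairness", "legal"}]
pred = [{"care-harm"}, {"fairness"}]
print(round(micro_f1(gold, pred), 3))  # -> 0.8
```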
The automatic metrics are consistent with human evaluation. T5 is a strong generator overall, achieving the highest BLEU score and the highest relevance score in §5.2. However, GPT-2 more consistently adheres to attributes, outperforming T5 in attribute F1
The automatic eval results mirror the human eval!
Morality & Political Bias
Table 4 shows the correlations between RoT attributes and the political leaning and reliability of sources
present SOCIAL-CHEM-101, an attempt at providing a formalism and resource around the study of grounded social, moral, and ethical norms.
preliminary success in generative modeling of structured RoTs, and corroborate findings of moral leaning in an extrinsic task
Additional Dataset Details
provide here a more thorough description of how we collected situations from the four domains we consider. Figure 7 gives more example situations from each domain.