#MeToo on Campus: Studying College Sexual Assault at Scale Using Data Reported on Social Media

01/16/2020 ∙ by Viet Duong, et al. ∙ University of Rochester

Recently, the emergence of the #MeToo trend on social media has empowered thousands of people to share their own sexual harassment experiences. This viral trend, in conjunction with the massive amount of personal information and content available on Twitter, presents a promising opportunity to extract data-driven insights that complement the ongoing survey-based studies of sexual harassment in college. In this paper, we analyze the influence of the #MeToo trend on a pool of college followers. The results show that the majority of topics embedded in those #MeToo tweets detail sexual harassment stories, and that there is a significant correlation between the prevalence of this trend and official reports in several major geographical regions. Furthermore, we uncover the prominent sentiments of the #MeToo tweets using deep semantic meaning representations, and their implications for the affected users experiencing different types of sexual harassment. We hope this study can raise further awareness of sexual misconduct in academia.





Introduction

Sexual harassment is defined as "bullying or coercion of a sexual nature, or the unwelcome or inappropriate promise of rewards in exchange for sexual favors" (footnote 1: www.en.wikipedia.org/wiki/Sexual_Harassement). It is an ongoing problem in the U.S., especially within the higher education community. According to the National Sexual Violence Resource Center (NSVRC), one in five women and one in sixteen men are sexually assaulted while attending college (footnote 2: https://goo.gl/rgvYH2). Beyond its prevalence, campus sexual harassment has been shown to have detrimental effects on students' well-being, including health-related disorders and psychological distress [14, 12]. However, studies of college sexual misconduct usually collect data through questionnaires from a small sample of the college population, which may not be large enough to capture the full picture of sexual harassment risk across the entire student body.

Alternatively, social media opens up new opportunities to gather a larger and more comprehensive body of data and to mitigate the risk of false or inaccurate narratives from the studied subjects. On October 15, 2017, prominent Hollywood actress Alyssa Milano accused Oscar-winning film producer Harvey Weinstein of multiple acts of sexual impropriety against herself and many other women in the film industry, igniting the "MeToo" trend on social media, which called on women and men to share their own experiences of sexual harassment. According to CNN, over 1.7 million users in 85 countries had used the hashtag (footnote 3: http://www.cnn.com/2017/10/30/health/metoo-legacy/index.html). Benefiting from the tremendous amount of data supplied by this trend, an existing state-of-the-art semantic parser, and generative statistical models, we propose a new approach to characterizing sexual harassment by mining the tweets of college users that carry the hashtag #metoo on Twitter.

Our main contributions are threefold. First, we investigate campus sexual harassment using a big-data approach by collecting data from Twitter. Second, we apply traditional topic modeling and linear regression methods to a new dataset to highlight patterns of ongoing troubling social behaviors at both the institutional and individual levels. Third, we propose a novel approach that combines domain-general deep semantic parsing with sentiment analysis to dissect personal narratives.

Related Work

Previous work on sexual misconduct in academia and the workplace dates back several decades, when researchers studied the existence of this social issue, as well as psychometric and demographic insights about it, based on survey and official data [9, 10, 19]. However, these methods of gathering data are limited in scale and might be influenced by the psychological and cognitive tendencies of respondents not to provide faithful answers [3].

The ubiquity of social media has motivated a variety of research on widely debated social topics such as gang violence, hate code, and presidential elections using Twitter data [2, 6, 13, 20]. Recently, researchers have taken the earliest steps toward understanding sexual harassment using textual data from Twitter. Using machine learning techniques, Modrek and Chakalov (2019) built predictive models for the identification and categorization of lexical items pertaining to sexual abuse, while the analysis of semantic content remains untouched [15]. Although they did not use Twitter data, Field et al. (2019) conducted a study more closely related to ours, as their approach to the subject is geared toward linguistic tasks such as event, entity, and sentiment analysis [8]. Their work on event-entity extraction and contextual sentiment analysis has provided many useful insights, which enable us to tap into the potential of our Twitter dataset.

Our approach to the #MeToo problem is novel in several respects. Our target population is restricted to college followers on Twitter, with the goal of exploring people's sentiment toward the sexual harassment they experienced and its implications for society's awareness and perception of the issue. Moreover, the focus on the reality of sexual harassment in colleges calls for an analysis of the metadata of this demographic to reveal meaningful knowledge of its distinctive characteristics [11].


Data Collection

In this study, we limit the sample to followers of the U.S. News Top 200 National Universities who are identified as English speakers. We use the Jefferson-Henrique script (footnote 4: https://git.io/JvUnQ), a web scraper designed for Twitter, to retrieve a total of over 300,000 #MeToo tweets from October 15, 2017, when Alyssa Milano posted the first #MeToo tweet, to November 15, 2017. This covers the month during which the trend was on the rise and attracting mass attention. Since the follower lists of the studied colleges might overlap, and many Twitter users tend to repeat others' tweets, simply pooling all the collected data would create a major redundancy problem. We therefore extract unique users and tweets from the combined result set to generate a dataset of about 60,000 unique tweets, pertaining to 51,104 unique users.
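The de-duplication step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: the record layout (`user`, `text` pairs), the function name `deduplicate`, and the choice to treat tweets as duplicates when their lower-cased, whitespace-normalized text matches are all hypothetical.

```python
def deduplicate(records):
    """Return (unique_tweets, unique_users) from (user, text) pairs.

    Assumption: two tweets count as the same tweet when their text matches
    after lower-casing and collapsing whitespace. The paper does not state
    its exact matching rule.
    """
    seen_texts = set()
    unique_tweets = []
    for user, text in records:
        key = " ".join(text.lower().split())
        if key in seen_texts:
            continue  # repeated tweet, or the same tweet seen via another follower list
        seen_texts.add(key)
        unique_tweets.append((user, text))
    # Users are counted once each, however many follower lists they appear in.
    unique_users = {user for user, _ in records}
    return unique_tweets, unique_users


# Made-up records for illustration only.
records = [
    ("alice", "#MeToo it happened to me"),
    ("alice", "#MeToo it happened to me"),  # same user found via two colleges
    ("bob", "#metoo  It happened to me"),   # another user repeating the same text
    ("carol", "#MeToo we must speak up"),
]
tweets, users = deduplicate(records)
print(len(tweets), len(users))
```

Keying on normalized text rather than on a tweet identifier is a design choice made here so that repeated content is collapsed as well as overlapping follower lists.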

Text Preprocessing

We pre-process the Twitter textual data so that its lexical items are, to a high degree, comparable to those of natural language. This is done by performing sentiment-aware tokenization, spelling correction, word normalization, segmentation (for splitting hashtags), and annotation. The implemented tokenizer, which uses the SentiWordNet corpus [7], avoids splitting expressions or words that should be kept intact (as one token) and identifies most emoticons, emojis, and expressions such as dates, currencies, acronyms, and censored words (e.g. s**t). In addition, we modify the extracted tokens. For spelling correction, we compose a dictionary of the most commonly seen abbreviations, censored words, and elongated words (for emphasis, e.g. "reallyyy"). The Viterbi algorithm is used for word segmentation, with word statistics (unigrams and bigrams) computed from the NLTK English Corpus, to obtain the most probable segmentation from the unigram and bigram probabilities. Moreover, all texts are lower-cased, and URLs, emails, and mentioned usernames are replaced with common designated tags so that they do not need to be annotated by the semantic parser.
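To make the segmentation and tag-replacement steps concrete, here is a minimal sketch. It is an illustration under stated assumptions, not the authors' implementation: it uses a unigram-only Viterbi search (the paper also uses bigrams), a tiny made-up count table instead of corpus statistics, and hypothetical tag names (`<url>`, `<email>`, `<user>`).

```python
import math
import re

# Toy unigram counts (hypothetical); real statistics would come from a corpus.
UNIGRAM_COUNTS = {"me": 50, "too": 40, "met": 5, "oo": 1,
                  "time": 30, "times": 10, "is": 60, "up": 45}
TOTAL = sum(UNIGRAM_COUNTS.values())
MAX_WORD_LEN = 12


def word_logprob(word):
    """Log-probability of a word; unknown strings are penalized by length."""
    count = UNIGRAM_COUNTS.get(word)
    if count is None:
        return math.log(1.0 / (TOTAL * 10 ** len(word)))
    return math.log(count / TOTAL)


def segment(text):
    """Viterbi search for the most probable split of text into words."""
    text = text.lower()
    n = len(text)
    # best[i] = (best log-probability of text[:i], start index of its last word)
    best = [(-math.inf, 0)] * (n + 1)
    best[0] = (0.0, 0)
    for end in range(1, n + 1):
        for start in range(max(0, end - MAX_WORD_LEN), end):
            score = best[start][0] + word_logprob(text[start:end])
            if score > best[end][0]:
                best[end] = (score, start)
    words, end = [], n
    while end > 0:  # follow back-pointers from the end of the string
        start = best[end][1]
        words.append(text[start:end])
        end = start
    return words[::-1]


def normalize(tweet):
    """Lower-case, replace URLs, emails and mentions with tags, split hashtags."""
    tweet = tweet.lower()
    tweet = re.sub(r"https?://\S+", "<url>", tweet)
    tweet = re.sub(r"\S+@\S+\.\S+", "<email>", tweet)
    tweet = re.sub(r"@\w+", "<user>", tweet)
    return re.sub(r"#(\w+)", lambda m: " ".join(segment(m.group(1))), tweet)


print(segment("metoo"))
print(normalize("Thanks @Alyssa_Milano #MeToo https://t.co/abc"))
```

With these toy counts the search prefers "me too" over "met oo" because the product of the unigram probabilities is higher; adding bigram statistics would refine the choice in the same dynamic-programming framework.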

College Metadata

We obtain meta-statistics on college demographics, covering enrollment, geographical location, private/public categorization, and male-to-female ratio. Furthermore, we acquire the Campus Safety and Security Survey dataset from the official U.S. Department of Education website and use its statistics on rape-related cases as an attribute to complete the data for our linear regression model. The number of such cases reported by these 200 colleges in 2015 amounts to 2,939.


Regression Analysis

We examine other features describing the characteristics of the studied colleges, which might be significant factors in sexual harassment. Four factual attributes pertaining to the 200 colleges are extracted from the U.S. News statistics: Undergraduate Enrollment, Male/Female Ratio, Private/Public, and Region (Northeast, South, West, and Midwest). We also use the normalized count of rape-related cases (number of cases reported per student enrolled) from the government source described above as another attribute, to examine how closely our dataset tracks the official one. This feature vector is then fitted in a linear regression to predict the normalized #MeToo user count (number of unique users who posted #MeToo tweets per student enrolled) for each individual college.
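The feature construction and fit described above can be sketched as follows. This is a hedged illustration: the dictionary keys, the function names, and the example college are hypothetical, the Midwest is assumed to be the baseline region (it is the one region without a coefficient in Table 2), and the small normal-equation solver merely stands in for whatever regression routine the authors used.

```python
REGIONS = ("Northeast", "West", "South")  # assumption: Midwest is the baseline


def build_features(college):
    """Feature vector for one college, in the order of Table 2."""
    return (
        [college["male_female_ratio"],
         float(college["enrollment"]),
         1.0 if college["private"] else 0.0]
        + [1.0 if college["region"] == r else 0.0 for r in REGIONS]
        + [college["reported_cases"] / college["enrollment"]]  # normalized cases
    )


def build_target(college):
    """Unique #MeToo users per student enrolled."""
    return college["metoo_users"] / college["enrollment"]


def solve(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def fit_ols(rows, targets):
    """Ordinary least squares; returns [intercept, coef_1, ..., coef_k]."""
    xs = [[1.0] + list(row) for row in rows]
    k = len(xs[0])
    xtx = [[sum(x[i] * x[j] for x in xs) for j in range(k)] for i in range(k)]
    xty = [sum(x[i] * y for x, y in zip(xs, targets)) for i in range(k)]
    return solve(xtx, xty)


# Made-up college record, for illustration only.
example = {"male_female_ratio": 1.0, "enrollment": 10000, "private": True,
           "region": "West", "reported_cases": 20, "metoo_users": 300}
features = build_features(example)
target = build_target(example)

# Sanity check of the solver on noise-free synthetic data: y = 1 + 2a - 3b.
rows = [[0, 0], [1, 0], [0, 1], [1, 1], [2, 1]]
coefs = fit_ols(rows, [1 + 2 * a - 3 * b for a, b in rows])
print(features, target, coefs)
```

In practice the features differ in scale by several orders of magnitude (enrollment versus per-student rates), so a numerically robust library routine would be a safer choice than this hand-written solver.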

Labeling Sexual Harassment

Per our topic modeling results, we decide to look deeper into the narratives of #MeToo users who reveal their personal stories. We examine 6,760 tweets from the most relevant topic of our LDA model, and categorize them based on the following metrics: harassment types (verbal, physical, and visual abuse) and context (peer-to-peer, school employee or work employer, and third-parties). These labels are based on definitions by the U.S. Dept. of Education [5].

Topic Modeling on #MeToo Tweets

In order to understand the latent topics of those #MeToo tweets for college followers, we first utilize Latent Dirichlet Allocation (LDA) to label universal topics demonstrated by the users. We determine the optimal topic number by selecting the one with the highest coherence score. Since certain words frequently appear in those #MeToo tweets (e.g., sexual harassment, men, women, story, etc.), we transform our corpus using TF-IDF, a term-weighting scheme that discounts the influence of common terms.
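To illustrate why TF-IDF discounts terms that appear in nearly every #MeToo tweet, here is a minimal sketch of one plain TF-IDF variant. The function name `tfidf` and the three toy documents are made up for illustration, and library implementations typically add smoothing and normalization, so their numbers will differ.

```python
import math
from collections import Counter


def tfidf(docs):
    """Weight each term by (term frequency) * log(N / document frequency)."""
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    weighted = []
    for doc in docs:
        counts = Counter(doc)
        weighted.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weighted


# Toy tokenized tweets (made up for illustration).
docs = [
    ["women", "story", "scared"],
    ["women", "fight", "stronger"],
    ["women", "story", "teacher"],
]
weights = tfidf(docs)
print(weights[0])
```

A term that occurs in every document ("women" here) receives weight zero, while a term unique to one document ("scared") outweighs one shared by two documents ("story"), which is the discounting effect the paragraph above relies on.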

Semantic Parsing with TRIPS

Learning deep meaning representations, which enable the preservation of the rich semantic content of entities, the resolution of meaning ambiguity, and a partial relational understanding of texts, is one of the challenges that the TRIPS parser [1] is designed to tackle. This kind of meaning is represented by the TRIPS Logical Form (LF), a graph-based representation that serves as the interface between the structural analysis of text (i.e., the parse) and the subsequent use of the information to produce knowledge. The LF graphs are obtained by using the semantic types, roles, and rule-based relations defined by the TRIPS Ontology [1] at its core, in combination with various linguistic techniques such as Dialogue Act Identification, Dependency Parsing, Named Entity Recognition, and a crowd-sourced lexicon (WordNet).

Figure 1: The meaning representation of the example sentence "He harassed me." in TRIPS LF. The Ontology types of the words are indicated by ":*", and the role-argument relations between them are denoted by named arcs.

Figure 1 illustrates an example of the TRIPS LF graph depicting the meaning of the sentence "He harassed me," where the event described through the speech act TELL (i.e. telling a story) is the verb predicate HARASS, which is caused by the agent HE and influences the affected (also called "theme" in traditional literature) ME. As this example shows, the action-agent-affected relational structure applies even to the simplest sentences used for storytelling, and it is very common in both spoken and written language. This makes it well suited for event extraction from short texts, which is useful for analyzing tweets given Twitter's 280-character limit. Therefore, our implementation of the TRIPS parser is tailored to identifying the verb predicates in tweets and their corresponding agent-affected arguments (with F1 score), so that we have solid ground for further analysis.

Connotation Frames and Sentiment Analysis

In order to develop an interpretable analysis that focuses on sentiment scores pertaining to the entities and events mentioned in the narratives, as well as the perceptions of readers of such events, we draw on the existing literature on connotation frames: a set of verbs annotated according to what they imply about semantically dependent entities. Connotation frames, first introduced by Rashkin, Singh, and Choi (2016), provide a framework for analyzing nuanced dimensions in text by combining polarity annotations with frame semantics (Fillmore 1982). More specifically, verbs are annotated across various dimensions and perspectives, so that a verb might elicit a positive sentiment for its subject (i.e. sympathy) but imply a negative effect for its object. We target the sentiments toward the entities and verb predicates through a pre-collected set of 950 verbs that have been annotated for these traits. The example "He harassed me." demonstrates these dimensions:

  • Something negative happened to the writer.

  • The writer (affected) most likely feels negative about the event.

  • The writer most likely has negative feelings toward the agent as a result of the event.

  • The reader most likely views the agent as the antagonist.

  • The reader most likely feels sympathetic toward the writer.

In addition to extracting sentiment scores from the pre-annotated corpus, we also need to predict sentiment scores of unknown verbs. To achieve this task, we rely on the 200-dimensional GloVe word embeddings [17], pretrained on their Twitter dataset, to compute the scores of the nearest neighboring synonyms contained in the annotated verb set and normalize their weighted sum to get the resulting sentiment (Equation 1).


Here, an indicator function records whether a verb predicate is in the annotation set, and the neighbor set contains the annotated verbs that are the nearest neighbors of that verb predicate. Because our predictive model computes event-entity sentiment scores and generates verb predicate knowledge simultaneously, it is sensitive to data initialization. Therefore, we train the model iteratively over a number of random initializations to achieve the best results.
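The scoring of unknown verbs described above can be sketched as follows. Since Equation 1 itself is not reproduced in this text, the sketch shows one plausible reading rather than the authors' formula: an annotated verb keeps its annotated score, and an unknown verb receives the similarity-weighted average of its nearest annotated neighbors. The use of cosine similarity as the weight, the value of `k`, the two-dimensional vectors (the paper uses 200-dimensional GloVe embeddings), and all scores below are assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def verb_sentiment(verb, annotated, embeddings, k=2):
    """Annotated score if available, else a weighted average over neighbors."""
    if verb in annotated:  # the indicator-function case
        return annotated[verb]
    if verb not in embeddings:
        return None
    neighbors = sorted(
        ((cosine(embeddings[verb], embeddings[other]), other)
         for other in annotated if other in embeddings),
        reverse=True,
    )[:k]
    neighbors = [(sim, other) for sim, other in neighbors if sim > 0]
    total = sum(sim for sim, _ in neighbors)
    if total == 0:
        return None
    # Normalized weighted sum of the neighbors' annotated scores.
    return sum(sim * annotated[other] for sim, other in neighbors) / total


# Made-up annotations and toy embeddings, for illustration only.
annotated = {"harass": -0.8, "assault": -0.9, "help": 0.7}
embeddings = {
    "harass": [1.0, 0.1],
    "assault": [0.9, 0.2],
    "help": [-0.8, 0.6],
    "grope": [0.95, 0.15],
}
score = verb_sentiment("grope", annotated, embeddings)
print(score)
```

Because the result is a normalized weighted average, the predicted score for an unknown verb always lies between the lowest and highest scores of the neighbors used.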

Experimental Results

Topical Themes of #MeToo Tweets

The results of LDA on the #MeToo tweets of college users (Table 1) follow the same pattern as the research of Modrek and Chakalov (2019), which suggests that a large portion of #MeToo tweets on Twitter focus on sharing personal traumatic stories about sexual harassment [15]. In fact, among our top 5 topics, Topics 1 and 5 mainly depict gruesome stories and experiences from childhood or college. This finding seems to support the validity of the Twitter sample of Modrek and Chakalov (2019), in which 11% of tweets disclose personal memories of sexual harassment and 5.8% of them concern the formative years [15]. These users also show multiple emotions toward the movement, such as compassion (Topic 2), determination (Topic 3), and hope (Topic 4). We examine these emotional features further in the later results.

| Topic | Topic Label | Keywords |
|---|---|---|
| 1 | Recalling the experience | times, old, man, got, called, girl, wrong, asked, home, done, scared, tried, room |
| 2 | Showing sympathy, sharing news | sexually, harassed, assaulted, women, experiences, magnitude, sense, problem, fear, men, status, getting, heartbreaking |
| 3 | Calling for actions | story, normal, denial, please, forward, tell, shame, shared, wish, women, daughters, worse, glad |
| 4 | Showing optimism | instant, take, right, fight, past, action, self, write, hey, loved, spoke, sisters, body, former, front, claims, stronger |
| 5 | Detailing early life experience | call, issue, hashtag, child, awareness, guy, telling, trauma, number, party, teacher, sexual, raise, sometimes |

Table 1: Top 5 topics from all #MeToo Tweets from 51,104 college followers.

Regression Result

| Feature | Coefficient | Std. Err. | t-stat | p-value |
|---|---|---|---|---|
| M/F Ratio | 3.613e-04 | 6.690e-03 | 0.054 | 0.9570 |
| Enrollment | 1.075e-06 | 1.639e-06 | 0.656 | 0.5128 |
| Private | 2.858e-03 | 3.676e-02 | 0.078 | 0.9381 |
| Northeast | 7.840e-02 | 3.734e-02 | 2.100 | 0.0370 |
| West | 8.909e-02 | 4.032e-02 | 2.210 | 0.0283 |
| South | 7.529e-02 | 3.763e-02 | 2.001 | 0.0468 |
| Normalized cases count | 9.098e+01 | 1.176e+01 | 7.735 | 5.7e-13 |
| constant | -7.943e-02 | 4.990e-02 | -1.592 | 0.1131 |

Table 2: Linear regression results.

Observing the results of the linear regression in Table 2, we find the normalized count of government-reported cases and the regional features to be statistically significant predictors of the normalized #MeToo user count in the Twitter data (p < 0.05). Specifically, a change in the number of reported cases is associated with a considerable change in the number of #MeToo users on Twitter, with an extremely small p-value of 5.7e-13. This corresponds to the research by Napolitano (2014) regarding the "Yes means yes" movement in higher education institutions in recent years: even with some limitations and inconsistencies, the sexual assault reporting system is gradually becoming more rigorous [16]. Meanwhile, attending college in the Northeast, West, or South is associated with a higher likelihood of posting about sexual harassment than attending college in the Midwest (positive coefficients). This finding is interesting and warrants further scrutiny.
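As a quick consistency check on Table 2, each t-statistic should equal the coefficient divided by its standard error, and the significant features are those whose p-value falls below a chosen threshold. The sketch below recomputes both from the rounded values printed in the table; the 0.05 threshold and the 0.01 tolerance for rounding are assumptions made for this illustration.

```python
# (coefficient, standard error, reported t-stat, reported p-value) from Table 2
TABLE_2 = {
    "M/F Ratio": (3.613e-04, 6.690e-03, 0.054, 0.9570),
    "Enrollment": (1.075e-06, 1.639e-06, 0.656, 0.5128),
    "Private": (2.858e-03, 3.676e-02, 0.078, 0.9381),
    "Northeast": (7.840e-02, 3.734e-02, 2.100, 0.0370),
    "West": (8.909e-02, 4.032e-02, 2.210, 0.0283),
    "South": (7.529e-02, 3.763e-02, 2.001, 0.0468),
    "Normalized cases count": (9.098e+01, 1.176e+01, 7.735, 5.7e-13),
    "constant": (-7.943e-02, 4.990e-02, -1.592, 0.1131),
}
ALPHA = 0.05  # assumed significance threshold

# t-statistic recomputed as coefficient / standard error
recomputed = {name: coef / se for name, (coef, se, _, _) in TABLE_2.items()}

# Features whose reported p-value is below the threshold
significant = [name for name, (_, _, _, p) in TABLE_2.items() if p < ALPHA]

for name, t_value in recomputed.items():
    reported = TABLE_2[name][2]
    print(f"{name}: recomputed t = {t_value:.3f}, reported t = {reported:.3f}")
print("Significant at 0.05:", significant)
```

The recomputed values agree with the reported t-statistics up to the rounding of the printed coefficients and standard errors.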

Event-Entity Sentiment Analysis

We find that approximately half of the users who detailed their sexual harassment experiences with the #MeToo hashtag suffered physical aggression. In addition, more than half of them reported encountering the perpetrators outside the college and work environment. The sentiment scores for the affected entities and the verbs of cases pertaining to faculty are strictly negative, suggesting that the actions of academic personnel might be described as more damaging to students' mental health. This finding resonates with recent research by Cantalupo and Kidder on the potential hazard of sexual harassment by university faculty, based on data from federal investigations and the relevant social science literature [4]. Furthermore, many in this group tend to mention their age at the time, typically between 5 and 20 (24% of the studied subset). This observation reveals an alarming number of cases of child and teenage sexual abuse, indicating that although college students are not as prone to sexual harassment from their peers and teachers, they might still be traumatized by their childhood experiences.

In addition, although verbal abuse experiences account for a large proportion of the tweets, it is challenging to gain sentiment insights into them, as the majority of them contain insinuations and sarcasm regarding sexual harassment. This explains why the sentiment scores of these events and entities are very close to neutral.

| Harassment | Participant | Event-Sentiment | Affected-Sentiment | Percentage |
|---|---|---|---|---|
| Physical | 3rd-Party | -0.0429 | -0.0479 | 23.63% |
| Physical | Faculty | -0.1999 | -0.2308 | 11.39% |
| Physical | Peer | -0.0136 | -0.1018 | 16.03% |
| Verbal | 3rd-Party | 0.1385 | 0.0077 | 25.32% |
| Verbal | Faculty | 0.1454 | 0.0051 | 5.91% |
| Verbal | Peer | 0.0819 | -0.0024 | 5.91% |
| Visual | 3rd-Party | 0.1015 | -0.0407 | 6.33% |
| Visual | Faculty | 0.1333 | 0.0500 | 0.42% |
| Visual | Peer | -0.3946 | 0.0000 | 0.84% |

Table 3: Semantic sentiment results.
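The aggregate shares quoted in the text ("approximately half" physical, "more than half" involving perpetrators outside college and work) can be recovered by summing the Percentage column of Table 3 by harassment type and by participant. The short sketch below does that arithmetic; the variable and function names are illustrative only.

```python
from collections import defaultdict

# (harassment type, participant, percentage of labeled tweets) from Table 3
TABLE_3 = [
    ("Physical", "3rd-Party", 23.63), ("Physical", "Faculty", 11.39),
    ("Physical", "Peer", 16.03),
    ("Verbal", "3rd-Party", 25.32), ("Verbal", "Faculty", 5.91),
    ("Verbal", "Peer", 5.91),
    ("Visual", "3rd-Party", 6.33), ("Visual", "Faculty", 0.42),
    ("Visual", "Peer", 0.84),
]


def share_by(column):
    """Sum percentages grouped by column 0 (type) or column 1 (participant)."""
    totals = defaultdict(float)
    for row in TABLE_3:
        totals[row[column]] += row[2]
    return {key: round(value, 2) for key, value in totals.items()}


by_type = share_by(0)
by_participant = share_by(1)
listed_total = round(sum(row[2] for row in TABLE_3), 2)
print(by_type)
print(by_participant)
print(listed_total)
```

Physical cases sum to 51.05% and third-party cases to 55.28%, matching the statements in the text. The nine listed rows sum to 95.78%, so the table does not break down the remaining share.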

Limitations and Ethical Implications

Our dataset is taken from only a sample of a specific set of colleges, and different samples might yield different results. Our method of identifying college students is simple, and might not reflect the whole student population. Furthermore, the majority of posts on Twitter are short texts (under 50 words). This factor, according to previous research, might hamper the performance of the LDA results, despite the use of the TF-IDF scheme [18].

Furthermore, while the main goal of this paper is to shed light on the ongoing problems in academia and to contribute to future sociological studies using big data analysis, our dataset might be misused for detrimental purposes. In addition, data regarding sexual harassment is sensitive in nature and might have unanticipated effects on the users it concerns.


In this study, we discover a novel correlation between the number of college users who participate in the #MeToo movement and the number of officially reported cases in the government data. This is a positive sign suggesting that the higher education system is moving in the right direction to effectively utilize Title IX, a portion of the Education Amendments of 1972 (footnote 5: https://goo.gl/J5ZSpb), which requires colleges to submit their sexual misconduct reports to officials and to protect the victims. In addition, we capture several geographic and behavioral characteristics of the #MeToo users related to sexual assault, such as region, reaction, and narrative content following the trend, as well as sentiment and social interactions, some of which are supported by the literature on sexual harassment. Importantly, our semantic analysis reveals interesting patterns in the assault cases. We believe our methodologies for defining these #MeToo users and their features will be applicable to further studies of this and other alarming social issues.

Furthermore, we find that the social media-driven approach is highly useful in facilitating crime-related sociology research on a large scale and across a broad spectrum. Moreover, since social networks appeal to a broad audience, especially those outside academia, studies using these resources are highly useful for raising community awareness of current social problems.

Last but not least, many other aspects of the text data from social media, which could provide many interesting insights on sexual harassment, remain largely untouched. In the future, we intend to explore more sophisticated language features and implement more supervised models with advanced neural network parsing and classification. We believe that with our current dataset, an extension to take advantage of cutting-edge linguistic techniques will be the next step to address the previously unanswered questions and uncover deeper meanings of the tweets on sexual harassment.


  • [1] J. Allen and C. M. Teng (2018) Putting semantics into semantic roles. In Proceedings of the Seventh Joint Conference on Lexical and Computational Semantics, pp. 235–244.
  • [2] P. Blandfort, D. U. Patton, W. R. Frey, S. Karaman, S. Bhargava, F. Lee, S. Varia, C. Kedzie, M. B. Gaskell, R. Schifanella, et al. (2019) Multimodal social media analysis for gang violence prevention. In Proceedings of the International AAAI Conference on Web and Social Media, Vol. 13, pp. 114–124.
  • [3] S. Brutus, H. Aguinis, and U. Wassmer (2013) Self-reported limitations and future directions in scholarly reports: analysis and recommendations. Journal of Management 39 (1), pp. 48–75. https://doi.org/10.1177/0149206312455245
  • [4] N. C. Cantalupo and W. C. Kidder (2018) A systematic look at a serial problem: sexual harassment of students by university faculty. Utah L. Rev., pp. 671.
  • [5] N. Cantu (2020-01) Sexual harassment guidance. US Department of Education (ED).
  • [6] M. ElSherief, V. Kulkarni, D. Nguyen, W. Y. Wang, and E. Belding (2018) Hate lingo: a target-based linguistic analysis of hate speech in social media. In Twelfth International AAAI Conference on Web and Social Media.
  • [7] A. Esuli and F. Sebastiani (2006) Sentiwordnet: a publicly available lexical resource for opinion mining. In LREC, Vol. 6, pp. 417–422.
  • [8] A. Field, G. Bhat, and Y. Tsvetkov (2019) Contextual affective analysis: a case study of people portrayals in online #metoo stories. In Proceedings of the International AAAI Conference on Web and Social Media, Vol. 13, pp. 158–169.
  • [9] L. F. Fitzgerald, S. L. Shullman, N. Bailey, M. Richards, J. Swecker, Y. Gold, M. Ormerod, and L. Weitzman (1988) The incidence and dimensions of sexual harassment in academia and the workplace. Journal of Vocational Behavior 32 (2), pp. 152–175. ISSN 0001-8791.
  • [10] L. F. Fitzgerald, M. J. Gelfand, and F. Drasgow (1995) Measuring sexual harassment: theoretical and psychometric advances. Basic and Applied Social Psychology 17 (4), pp. 425–445. https://doi.org/10.1207/s15324834basp1704_2
  • [11] L. He, L. Murphy, and J. Luo (2016) Using social media to promote STEM education: matching college students with role models. CoRR abs/1607.00405.
  • [12] M. Huerta, L. M. Cortina, J. S. Pang, C. M. Torges, and V. J. Magley (2006) Sex and power in the academy: modeling sexual harassment in the lives of college women. Personality and Social Psychology Bulletin 32 (5), pp. 616–628. PMID: 16702155. https://doi.org/10.1177/0146167205284281
  • [13] R. Magu, K. Joshi, and J. Luo (2017) Detecting the hate code on social media. CoRR abs/1703.05443.
  • [14] J. F. McDermut, D. A. F. Haaga, and L. Kirk (2000) An evaluation of stress symptoms associated with academic sexual harassment. Journal of Traumatic Stress 13 (3), pp. 397–411. ISSN 1573-6598.
  • [15] S. Modrek and B. Chakalov (2019) The #metoo movement in the united states: text analysis of early twitter conversations. Journal of Medical Internet Research 21 (9), pp. e13837.
  • [16] J. Napolitano (2014) Only yes means yes: an essay on university policies regarding sexual violence and sexual assault. Yale L. & Pol'y Rev. 33, pp. 387.
  • [17] J. Pennington, R. Socher, and C. Manning (2014) Glove: global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543.
  • [18] J. Tang, Z. Meng, X. Nguyen, Q. Mei, and M. Zhang (2014) Understanding the limiting factors of topic modeling via posterior contraction analysis. In International Conference on Machine Learning, pp. 190–198.
  • [19] R. Timothy, C. Sandra, D. Valerie, B. Kim, and B. M. B. (1982) The factorial survey: an approach to defining sexual harassment on campus. Journal of Social Issues 38 (4), pp. 99–110. https://spssi.onlinelibrary.wiley.com/doi/pdf/10.1111/j.1540-4560.1982.tb01912.x
  • [20] Y. Wang, Y. Feng, X. Zhang, and J. Luo (2016) Gender politics in the 2016 U.S. presidential election: a computer vision approach. CoRR abs/1611.02806.