Bollywood Movie Corpus for Text, Images and Videos

by Nishtha Madaan, et al.

In the past few years, several data-sets have been released for text and images. We present an approach to creating a data-set for use in detecting and removing gender bias from text. We also describe the challenges we faced while creating this corpus. In this work, we use movie plots from Wikipedia and movie trailers from YouTube. Our Bollywood Movie corpus contains 4000 movies extracted from Wikipedia and 880 trailers extracted from YouTube, covering releases from 1970 to 2017. The corpus contains CSV files with the following data about each movie: Wikipedia title of the movie, cast, plot text, co-referenced plot text, soundtrack information, link to the movie poster, caption of the movie poster, number of males in the poster, and number of females in the poster. In addition, the following data is available for each cast member: cast name, cast gender, cast verbs, cast adjectives, cast relations, cast centrality, and cast mentions. We present some preliminary results on the task of bias removal, which suggest that the data-set is useful for such tasks.
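
As a rough illustration of how the movie-level CSV fields listed above might be used, the following sketch totals the male and female counts in posters and computes the female share. The column names (`title`, `males_in_poster`, `females_in_poster`) and the sample rows are assumptions made for this example; they are placeholders, not taken from the corpus, whose actual file layout may differ.

```python
import csv
import io

# Hypothetical column names and placeholder rows, for illustration only.
# The real corpus files may use different headers and contain other fields.
SAMPLE_CSV = """title,males_in_poster,females_in_poster
Example Movie A,2,1
Example Movie B,3,0
Example Movie C,1,1
"""


def poster_gender_counts(csv_file):
    """Sum the male and female poster counts over all rows of a movie-level CSV."""
    males = females = 0
    for row in csv.DictReader(csv_file):
        males += int(row["males_in_poster"])
        females += int(row["females_in_poster"])
    return males, females


def female_share(males, females):
    """Fraction of people in posters who are female; 0.0 if nobody was counted."""
    total = males + females
    return females / total if total else 0.0


males, females = poster_gender_counts(io.StringIO(SAMPLE_CSV))
print(males, females, female_share(males, females))
```

To run the same aggregation on a real file, pass an opened file object (for example `open(path, newline="", encoding="utf-8")`) in place of the in-memory sample, after adjusting the column names to match the actual headers.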




