Disturbed YouTube for Kids: Characterizing and Detecting Disturbing Content on YouTube

01/21/2019 ∙ by Kostantinos Papadamou, et al. ∙ Boston University Telefonica The University of Alabama at Birmingham 0

A considerable number of the most-subscribed YouTube channels feature content popular among very young children. Hundreds of toddler-oriented channels on YouTube offer inoffensive, well-produced, and educational videos. Unfortunately, inappropriate (disturbing) content that targets this demographic is also common. YouTube's algorithmic recommendation system regrettably suggests inappropriate content because some of it mimics or is derived from otherwise appropriate content. Considering the risk for early childhood development, and an increasing trend in toddlers' consumption of YouTube media, this is a worrying problem. While there are many anecdotal reports of the scale of the problem, there is no systematic quantitative measurement. Hence, in this work, we develop a classifier able to detect toddler-oriented inappropriate content on YouTube with 82.8% accuracy, and we perform a first-of-its-kind, large-scale, quantitative characterization that reveals some of the risks of YouTube media consumption by young children. Our analysis indicates that YouTube's currently deployed counter-measures are ineffective in terms of detecting disturbing videos in a timely manner. Finally, using our classifier, we assess how prominent the problem is on YouTube, finding that young children are likely to encounter disturbing videos when they randomly browse the platform starting from benign videos.




1 Introduction

YouTube has emerged as an alternative to traditional children’s TV, and a plethora of popular children’s videos can be found on the platform. For example, consider the millions of subscribers that the most popular toddler-oriented YouTube channels have: ChuChu TV is the most-subscribed “children-themed” channel, with 19.9M subscribers [24] as of September 2018. While most toddler-oriented content is inoffensive, and is actually entertaining or educational, recent reports have highlighted the trend of inappropriate content targeting this demographic [25, 18]. We refer to this new class of content as disturbing. A prominent example of this trend is the Elsagate controversy [22, 7], where malicious users uploaded videos featuring popular cartoon characters like Spiderman, Elsa, Peppa Pig, and Mickey Mouse, combined with disturbing content containing, for example, mild violence and sexual connotations. Those videos usually include an innocent thumbnail that aims to trick toddlers and their custodians. Figure 1 shows an example of such a video in which Peppa Pig, a popular cartoon character, “accidentally” kills her dad. The issue at hand is that this video has 1.3M views, many more likes than dislikes, and has been available on the platform since 2016.

Figure 1: Example of inappropriate video of a popular cartoon character that includes violent content not suitable for toddlers.

In an attempt to offer a safer online experience for its young audience, YouTube launched the YouTube Kids application (https://www.youtube.com/yt/kids/), which offers parents several control features enabling them to decide what their children are allowed to watch on YouTube. Unfortunately, despite YouTube’s attempts to curb the phenomenon of inappropriate videos for toddlers, disturbing videos still appear, even in YouTube Kids [29], due to the difficulty of identifying them. An explanation for this may be that YouTube relies heavily on users reporting videos they consider disturbing, and on YouTube employees then manually inspecting them (https://support.google.com/youtube/answer/2802027). However, since the process involves manual labor, the whole mechanism does not easily scale to the number of videos that a platform like YouTube serves.

In this paper, we provide the first study of toddler-oriented disturbing content on YouTube. For the purposes of this work, we extend the definition of a toddler to any child aged between 1 and 5 years.

Our study comprises three phases. First, we aim to characterize the phenomenon of disturbing videos geared towards toddlers. To this end, we collect, manually review, and characterize toddler-oriented videos. For a more detailed analysis of the problem, we label each of these videos as one of four categories: 1) suitable, 2) disturbing, 3) restricted (equivalent to the R and NC-17 categories of the Motion Picture Association of America, MPAA; see https://www.mpaa.org/film-ratings/), and 4) irrelevant (cf. Section 2.2). Our characterization confirms that unscrupulous and potentially profit-driven uploaders create disturbing videos with characteristics similar to those of benign toddler-oriented videos in an attempt to make them show up as recommendations to toddlers browsing the platform.

Second, we develop a deep-learning classifier to automatically detect toddler-oriented disturbing videos. Even though this classifier performs better than baseline models, it still has a lower than desired performance. In fact, this low performance reflects the high degree of similarity between disturbing and suitable or restricted videos, and the subjectivity in deciding how to label these controversial videos, as confirmed by our trained annotators’ experience. For the sake of our analysis in the next steps, we collapse the initially defined labels into two categories and develop a more accurate classifier that is able to discern inappropriate from non-disturbing videos. Our experimental evaluation shows that the developed classifier outperforms several baseline models with an accuracy of 82.8% (cf. Table 7).


In the last phase, we leverage the developed classifier to understand how prominent the problem at hand is. From our analysis on the 133,806 collected videos, we find that are inappropriate for toddlers, which indicates that the problem is substantial. We also find that there is a probability that a toddler viewing a benign video suitable for her demographic will be given a top-ten recommendation for an inappropriate video. To further assess how safe YouTube is for toddlers, we run a live simulation in which we mimic a toddler randomly clicking on YouTube’s suggested videos. We find that there is a probability that a toddler following YouTube’s recommendations will encounter an inappropriate video within 10 hops if she starts from a video that appears among the top ten results of a toddler-appropriate keyword search (e.g., Peppa Pig).

Last, our assessment on YouTube’s current mitigations shows that the platform struggles to keep up with the problem: only and of our manually reviewed disturbing and restricted videos, respectively, have been removed by YouTube. Further, disturbing and restricted videos in our ground truth dataset have been live on YouTube for a mean of 794 and 827 days, respectively.

Contributions. In summary, our contribution is threefold:

  • We propose a sufficiently accurate classifier that can be used to discern disturbing videos that target toddlers.

  • We undertake a large-scale analysis of the disturbing videos problem that is currently plaguing YouTube.

  • The implementation of the classifier, the manually reviewed ground truth dataset, and all the collected and examined videos will be publicly available so that the research community can build on our results to further investigate the problem.

2 Methodology

In this section, we present our data collection process and the methodology followed for building our ground truth.

2.1 Data Collection

For our data collection, we use the YouTube Data API (https://developers.google.com/youtube/v3/), which provides metadata of videos uploaded on the platform. First, we collect a set of seed videos using three different approaches: the first two use information from /r/ElsaGate, which is a subreddit dedicated to studying this phenomenon [22], whereas the third approach focuses on obtaining a set of random videos. Specifically: 1) we create a list of 64 keywords (https://drive.google.com/open?id=1ZDokscj9c1wl6FXGk5Rk0gjJHzsBAj5z) by extracting the title and tags of videos posted on /r/ElsaGate. Subsequently, for each keyword, we obtain the first 30 videos as returned by the search functionality of the YouTube Data API. This approach resulted in the acquisition of 893 seed videos. 2) We create a list of 33 channels (https://drive.google.com/file/d/1jj8Z1pXIvBNwvZGicN5S9KVFaB85lOgN/view?usp=sharing), which are mentioned by users on /r/ElsaGate for publishing inappropriate videos [7, 22]. Then, for each channel we collect all of its videos, hence acquiring a set of 181 seed videos. 3) To obtain a random sample of videos, we collect the most popular videos in the United States, Great Britain, Russia, India, and Canada, between November 18 and November 21, 2018, hence acquiring another 500 seed videos.

Using the above-mentioned approaches, we collect 1,574 seed videos. However, this dataset is not big enough to study the idiosyncrasies of this problem. Therefore, to expand our dataset, for each seed video we iteratively collect the top 10 recommended videos associated with it, as returned by the YouTube Data API, for up to three hops within YouTube’s recommendation graph. Table 1 summarizes the collected dataset. In total, our dataset comprises 1.5K seed videos and 132K videos that are recommended from the seed videos. For each video, we collect the following data descriptors: (a) video frames taken every 5th second of the video, using the FFmpeg library (https://www.ffmpeg.org/); (b) title; (c) thumbnail; and (d) video statistics like number of views, likes, dislikes, etc.
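The hop-limited expansion of the seed set can be pictured as a breadth-first traversal of the recommendation graph. The following is a minimal sketch under stated assumptions: `get_related` is a hypothetical callable standing in for whatever API request returns the recommended videos of a given video, and the toy graph and its video ids are made up for illustration. It is not the crawler used in the study.

```python
from collections import deque
from typing import Callable, Dict, Iterable, List


def crawl_recommendations(
    seeds: Iterable[str],
    get_related: Callable[[str], List[str]],  # hypothetical stand-in for an API call
    max_hops: int = 3,
    top_k: int = 10,
) -> Dict[str, int]:
    """Breadth-first expansion of the seeds; returns video id -> hop distance."""
    hops: Dict[str, int] = {}
    queue = deque()
    for seed in seeds:
        if seed not in hops:
            hops[seed] = 0
            queue.append(seed)
    while queue:
        video = queue.popleft()
        if hops[video] == max_hops:
            continue  # videos at the hop limit are kept but not expanded
        for related in get_related(video)[:top_k]:
            if related not in hops:  # skip videos that were already collected
                hops[related] = hops[video] + 1
                queue.append(related)
    return hops


# Toy recommendation graph (made-up ids) used instead of a real API.
toy_graph = {"s1": ["a", "b"], "a": ["c"], "c": ["d"], "d": ["e"]}
collected = crawl_recommendations(["s1"], lambda v: toy_graph.get(v, []))
print(collected)
```

Tracking visited ids keeps a video from being collected twice when several seeds lead to it, which matters because popular videos tend to be recommended from many places.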

Ethics. We confirm that for this study we only collect publicly available data, while making no attempt to de-anonymize users.

Crawling Strategy # of videos
Seed Channels 181
Seed Popular Videos 500
Seed Keywords 893
Recommended Videos 132,232
Table 1: Overview of the collected data: number of seed videos and number of their recommended videos collected for up to three hops.

2.2 Manual Annotation Process

In order to get labeled data, we manually review a subset of the videos (2.5K) by inspecting their video content, title, thumbnail, and tags (defined by the uploader of the video). To select the videos for the annotation process, we elect to use all of the seed videos except the ones that are marked as age-restricted by YouTube (1,329), as well as a small set (1,171) of randomly selected recommended videos. We did not consider the age-restricted videos in our annotation process as these videos are already annotated by YouTube, hence we consider them as “Restricted” by default (see below for the definition of our labels). Each video is presented to three annotators that inspect its content and metadata to assign one of the following labels:

Suitable. A video is suitable when its content is appropriate for toddlers (aged 1-5 years) and it is relevant to their typical interests. Some examples include normal cartoon videos, children’s songs, children that are playing, and educational videos (e.g., learning colors). In other words, any video that can be classified as G by the MPAA and whose target audience is toddlers.

Disturbing. A video is disturbing when it targets toddlers but it contains sexual hints, depiction of unusual eating habits (e.g., eating big portions of junk food), children driving, child abuse (e.g., children hitting each other), scream and horror sound effects, scary scenes or characters (e.g., injections, attacks by insects, etc.). In general, any video targeted at toddlers that should be classified as PG or PG-13 by MPAA is considered disturbing.

Restricted. We consider a video restricted when it contains content that is inappropriate for individuals under the age of 17. These videos are rated as R or NC-17 according to MPAA’s ratings. Such videos usually contain sexually explicit language, graphic nudity, pornography, violence (e.g., gaming videos featuring violence like God of War, or life-like violence, etc.), abusive/inappropriate language, online gambling, drug use, alcohol, or upsetting situations and activities.

Irrelevant. We consider a video irrelevant when it contains content that is not relevant to a toddler’s interests. Also, videos that are not disturbing or restricted but are only suitable for school-aged children (aged 6-11 years), adolescents (aged 12-17 years) and adults, like gaming videos (e.g., Minecraft) or music videos (e.g., a video clip of John Legend’s song), fall into this category. In general, G, PG and PG-13 videos that do not target toddlers are considered irrelevant.

We opted to use these labels for our annotation process instead of adopting the five MPAA ratings for two reasons. First, our scope is videos that would normally be rated as PG and PG-13 but target very young audiences. We consider such targeting a malevolent activity that needs to be treated separately. At the same time, we have observed that the vast majority of videos that would normally be rated as R or NC-17 are already classified by YouTube as “age-restricted” and target either adolescents or adults. Second, YouTube does not use MPAA ratings to flag videos, thus, a ground truth dataset with such labels is not available.

The annotation process is carried out by two of the authors of this study and 66 undergraduate students. Each video is annotated by the two authors and one of the undergraduate students. The students come from different backgrounds and receive no specific training with regard to our study. To ease the annotation process, we develop a platform (http://www.disturbedyoutubeforkids.xyz:3333/) that includes a clear description of our labels, as well as all the video’s information that an annotator needs in order to inspect and correctly annotate a video.

After obtaining all the annotations, we compute the Fleiss agreement score [14] across all annotators: we find , which is considered “moderate” agreement. We also assess the level of agreement between the two authors, as we consider them experienced annotators, finding , which is considered “substantial” agreement. Finally, for each video we assign one of the labels according to the majority agreement of all the annotators, except for a small percentage of videos where all annotators disagreed, which we exclude from our ground truth dataset. Table 2 summarizes our ground truth dataset, which consists of 837 suitable, 710 disturbing, 818 restricted (73 annotated by our annotators and 745 annotated as age-restricted by YouTube), and 755 irrelevant videos.
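For readers unfamiliar with the agreement score, the following sketch computes Fleiss' kappa from per-video label lists and derives a majority label. It assumes every video received the same number of annotations (three in this study); the function names and the toy annotations are illustrative and not taken from the authors' code.

```python
from collections import Counter
from typing import Optional, Sequence


def fleiss_kappa(ratings: Sequence[Sequence[str]]) -> float:
    """Fleiss' kappa for items that each received the same number of ratings.

    Undefined (division by zero) when every rating is the same category.
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])
    categories = sorted({label for item in ratings for label in item})
    counts = [Counter(item) for item in ratings]
    # Observed agreement: share of agreeing rater pairs, averaged over items.
    p_items = [
        (sum(c[cat] ** 2 for cat in categories) - n_raters)
        / (n_raters * (n_raters - 1))
        for c in counts
    ]
    p_bar = sum(p_items) / n_items
    # Chance agreement from the overall share of each category.
    shares = [sum(c[cat] for c in counts) / (n_items * n_raters) for cat in categories]
    p_e = sum(share ** 2 for share in shares)
    return (p_bar - p_e) / (1 - p_e)


def majority_label(item: Sequence[str]) -> Optional[str]:
    """Label chosen by more than half of the annotators, or None if there is none."""
    label, votes = Counter(item).most_common(1)[0]
    return label if votes > len(item) / 2 else None


# Made-up annotations: one list of three labels per video.
toy_annotations = [
    ["suitable", "suitable", "suitable"],
    ["disturbing", "disturbing", "restricted"],
    ["restricted", "restricted", "restricted"],
    ["suitable", "disturbing", "irrelevant"],
]
kappa = fleiss_kappa(toy_annotations)
labels = [majority_label(item) for item in toy_annotations]
print(round(kappa, 3), labels)
```

With three annotators and four labels, `majority_label` returns `None` exactly when all three annotators chose different labels, which corresponds to the videos excluded from the ground truth.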

Class Suitable Disturbing Restricted Irrelevant
# of videos 837 710 818 755
Table 2: Summary of our final ground truth dataset.
Figure 2: Per class proportion of videos for (a) the top 15 stems found in titles; and (b) the top 15 stems found in video tags.
Figure 3: Per class proportion of videos for (a) the top 15 labels found in thumbnails; and (b) videos whose thumbnails contain spoofed, adult, medical, violent, and/or racy content.
Category Suitable (%) Disturbing (%) Restricted (%) Irrelevant (%)
Entertainment 365 (43.6%) 184 (25.9%) 246 (30.1%) 248 (32.8%)
Film & Animation 140 (16.7%) 147 (20.7%) 183 (22.4%) 45 (6.0%)
Education 130 (15.5%) 18 (2.5%) 18 (2.2%) 24 (3.2%)
People & Blogs 111 (13.3%) 187 (26.3%) 222 (27.1%) 83 (11.0%)
Music 21 (2.5%) 11 (1.5%) 12 (1.5%) 105 (13.9%)
Comedy 20 (2.4%) 61 (8.6%) 49 (6.0%) 41 (5.4%)
Howto & Style 17 (2.0%) 11 (1.5%) 5 (0.6%) 56 (7.4%)
Gaming 13 (1.6%) 75 (10.6%) 56 (6.8%) 50 (6.6%)
Pets & Animals 5 (0.6%) 2 (0.3%) 1 (0.1%) 8 (1.1%)
Travel & Events 4 (0.5%) 4 (0.6%) 2 (0.2%) 4 (0.5%)
Sports 2 (0.2%) 2 (0.3%) 3 (0.4%) 33 (4.4%)
Autos & Vehicles 2 (0.2%) 2 (0.3%) 3 (0.4%) 7 (0.9%)
Science & Technology 2 (0.2%) 1 (0.1%) 2 (0.2%) 13 (1.7%)
News & Politics 1 (0.1%) 3 (0.4%) 10 (1.2%) 33 (4.4%)
Shows 4 (0.5%) 1 (0.1%) 3 (0.4%) 3 (0.4%)
Nonprofits & Activism 0 (0.0%) 1 (0.1%) 3 (0.4%) 2 (0.3%)
Table 3: Number of videos in each category per class in our ground truth dataset.

2.3 Ground Truth Dataset Analysis

Category. First, we look at the categories of the videos in our ground truth dataset. Table 3 reports the categories found in the videos, as well as the distribution of videos across the different classes. Most of the disturbing and restricted videos are in Entertainment (25.9% and 30.1%), People & Blogs (26.3% and 27.1%), Film & Animation (20.7% and 22.4%), Gaming (10.6% and 6.8%), and Comedy (8.6% and 6.0%): these results are similar to previous work [10]. In addition, we find a non-negligible percentage of disturbing and restricted videos in seemingly innocent categories like Education (2.2%-2.5% for both classes) and Music (1.5% for both classes), in which one also expects to find a lot of suitable videos. This is alarming since it indicates that disturbing videos “infiltrate” categories of videos that are likely to be selected by the toddler’s parents.

Title. The title of a video is an important factor that affects whether a particular video will be recommended when viewing other toddler-related videos. For this reason, we study the titles in our ground truth dataset to understand the tactics and terms that are usually used by uploaders of disturbing or restricted videos on YouTube. First, we pre-process each title by tokenizing the text into words and performing stemming using the Porter Stemmer algorithm. Figure 2(a) shows the top 15 stems found in titles along with their proportion for each class of videos. Unsurprisingly, the top 15 stems refer to popular cartoons like Peppa Pig, Elsa, Spiderman, etc. When looking at the results, we find that a substantial percentage of the videos that include these terms in their title are actually disturbing or restricted.
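The per-class proportions of the most frequent title stems can be computed along the following lines. This is a sketch under assumptions: the regular-expression tokenizer is a simplification, the default `stem` argument is an identity function rather than a real stemmer (a Porter stemmer such as NLTK's `PorterStemmer().stem` could be passed instead), and the titles are invented.

```python
import re
from collections import Counter, defaultdict
from typing import Callable, Dict, Iterable, List, Tuple


def tokenize(title: str) -> List[str]:
    """Lowercase the title and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", title.lower())


def stem_class_proportions(
    videos: Iterable[Tuple[str, str]],       # (title, class label) pairs
    stem: Callable[[str], str] = lambda w: w,  # placeholder; plug in a real stemmer
    top_n: int = 15,
) -> Dict[str, Dict[str, float]]:
    """For the top_n most frequent stems, return the share of videos per class."""
    per_stem: Dict[str, Counter] = defaultdict(Counter)
    for title, label in videos:
        # Use a set so that a video counts once per stem, even if a word repeats.
        for s in {stem(token) for token in tokenize(title)}:
            per_stem[s][label] += 1
    top = sorted(per_stem, key=lambda s: (-sum(per_stem[s].values()), s))[:top_n]
    return {
        s: {label: n / sum(per_stem[s].values()) for label, n in per_stem[s].items()}
        for s in top
    }


# Invented titles and labels for illustration only.
toy_videos = [
    ("Peppa Pig learns colors", "suitable"),
    ("Peppa Pig goes to the dentist", "disturbing"),
    ("Elsa and Spiderman prank", "disturbing"),
    ("Elsa sings a song", "suitable"),
    ("Peppa scary injection", "restricted"),
]
proportions = stem_class_proportions(toy_videos)
print(proportions["peppa"])
```

The same function applies unchanged to tags if each video's tags are joined into one string before tokenizing.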

For example, from the videos that contain the terms “peppa” and “pig”, and , respectively are disturbing. Also, of the videos that contain the term “Elsa” are disturbing, while of them are restricted. Similar trends are observed with other terms like “frozen” ( and ), “spiderman” ( and ), “superhero”( and ), “real”( and ), and “joker”( and ). These results unveil that disturbing and restricted videos on YouTube refer to seemingly “innocent” cartoons on their title, but in reality the content of the video is likely to be either restricted or disturbing. Note that we find these terms in suitable videos too. This demonstrates that it is quite hard to distinguish suitable from disturbing or restricted videos by inspecting only the titles of the videos.

Figure 4: CDF of the number of (a) views and (b) likes of videos per class.
Figure 5: CDF of the number of (a) dislikes and (b) comments of videos per class.

Tags. Tags are words that uploaders define when posting a video on YouTube. They are an important feature, since they determine for which search results the video will appear. To study the effects of tags in this problem, we plot in Figure 2(b) the top 15 stems from tags. We make several observations: first, there is a substantial overlap between the stems found in the tags and title (cf. Figure 2(a) and Figure 2(b)). Second, we find that each class of videos has a considerable percentage for each tag, hence highlighting that disturbing or restricted videos use the same tags as suitable videos. Inspecting these results, we find that the tags “spiderman” and “superhero” appear mostly on disturbing and restricted videos, respectively. Also, “mous” appears to have a higher portion of disturbing videos than the other tags. Third, we note that for all of the top 15 tags, the percentage of videos that are suitable is below .

In summary, the main take-away from this analysis is that it is hard to detect disturbing content just by looking at the tags, and that popular tags are shared among disturbing and suitable videos.

Thumbnails. To study the thumbnails of the videos in our ground truth, we make use of the Google Cloud Vision API (https://cloud.google.com/vision/), which is a RESTful API that derives useful insights from images using pre-trained machine learning models. Using this API we are able to: (a) extract descriptive labels from all the thumbnails in our ground truth; and (b) check whether a modification was made to a thumbnail, and whether a thumbnail contains adult, medical-related, violent, and/or racy content. Figure 3(a) depicts the top 15 labels derived from the thumbnails in our ground truth. We observe that the thumbnails of disturbing and restricted videos contain similar entities as the thumbnails of suitable videos (cartoons, fictional characters, girls, etc.). This indicates that it is hard to determine whether a video is disturbing or not from its thumbnail.

Additionally, we see that almost all labels are found in the same number of disturbing and restricted videos, which denotes subjectivity between the two classes of videos. Figure 3(b) shows the proportion of each class for videos whose thumbnails contain spoofed, adult, medical-related, violent, and/or racy content. As expected, most of the videos whose thumbnails contain adult and medical content are restricted. However, this is not the case for videos whose thumbnails contain spoofed, violent or racy content, where we observe almost the same number of disturbing and restricted videos.

Statistics. Lastly, we examine various statistics related to the videos in our dataset. Figures 4 and 5 show the CDF of each video statistic type (views, likes, dislikes, and comments) for all four classes of videos in our ground truth dataset. As perhaps expected, disturbing and restricted videos tend to have fewer views and likes than all the other types of videos. However, an unexpected observation is that suitable videos have more dislikes than disturbing videos. Overall, similar to [17, 8], we notice less user engagement on disturbing and restricted videos compared to suitable and irrelevant videos.

A general conclusion from this ground truth analysis is that none of the video’s metadata can clearly indicate that a video is disturbing or not, thus, in most cases one (e.g., the toddler’s guardian) has to carefully inspect all the available video metadata, and potentially the actual video, to accurately determine if it is safe for a toddler to watch.

Assessing YouTube’s Counter-measures. To assess how fast YouTube detects and removes inappropriate videos, we leverage the YouTube Data API to count the number of offline videos (either removed by YouTube due to a Terms of Service violation or deleted by the uploader) in our manually reviewed ground truth dataset. We note that for our calculations, we do not consider the videos that were already marked as age-restricted, since YouTube took the appropriate measures.

As of January 15, 2019 only of the suitable, of the disturbing, of the restricted, and of the irrelevant videos were removed, while from those that were still available, , , , and , respectively, were marked as age-restricted. Alarmingly, the amount of the deleted disturbing and restricted videos, is significantly low. The same observation stands for the amount of disturbing and restricted videos marked as age-restricted. A potential issue here is that the videos on our dataset were recently uploaded and YouTube simply did not have time to detect them. However, we calculate the mean number of days from publication up to January, 2019, and find that this is not the case. The mean number of days since upload for the suitable, disturbing, restricted, and irrelevant videos is 612, 794, 827, and 455, respectively, with a mean of 623 days across the entire manually reviewed ground truth dataset. The indication from this finding is that YouTube’s deployed counter-measures are unable to effectively tackle the problem in a timely manner.

3 Detection of Disturbing Videos

In this section we provide the details of our deep learning model for detecting disturbing videos on YouTube.

3.1 Dataset and Feature Description

To train and test our proposed deep learning model we use our ground truth dataset of 3,120 videos, summarized in Table 2. For each video in our ground truth dataset our model processes the following:


Title. Our classifier considers the text of the title by training an embedding layer, which encodes each word in the text in an N-dimensional vector space.

Tags. Similarly to the title, we encode the video tags into an N-dimensional vector space by training a separate embedding layer.

Thumbnail. We scale down the thumbnail images to 299x299 while preserving all three color channels.

Statistics. We consider all available statistical metadata for videos (number of views, likes, dislikes, and comments).

Style Features. In addition to the above, we also consider the style of the actual video (e.g., duration), the title (e.g., number of bad words), the video description, and the tags. For this we use features proposed in [17] that help the model to better differentiate between the videos of each class. Table 4 summarizes the types of style features that we use.

Type | Style Features Description
Video-related | # of frames in the video, video duration
Statistics-related | ratio of # of likes to dislikes
Title- & description-related | length of title, length of description, ratio of description to title, jaccard similarity between title and description, # of ’!’ and ’?’ in title and description, # of emoticons in title and description, # of bad words in title and description, # of kids-related words in title and description
Tags-related | # of tags, # of bad words in tags, # of kids-related words in tags, jaccard similarity between tags and video title
Table 4: List of the style features extracted from the available metadata of a video.
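Several of the style features in Table 4 are simple text statistics. The sketch below shows one way they might be computed; it is not the authors' feature extractor. The word lists (`bad_words`, `kids_words`), the character-based notion of length, the fallback values for empty inputs, and the omission of the emoticon and video-related features are all assumptions made for illustration.

```python
import re
from typing import Dict, Iterable, List, Set


def tokens(text: str) -> List[str]:
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def jaccard(a: Iterable[str], b: Iterable[str]) -> float:
    """Jaccard similarity of two token collections (0.0 when both are empty)."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0


def style_features(
    title: str,
    description: str,
    tags: List[str],
    likes: int,
    dislikes: int,
    bad_words: Set[str],   # hypothetical word list
    kids_words: Set[str],  # hypothetical word list
) -> Dict[str, float]:
    title_tok, desc_tok = tokens(title), tokens(description)
    tag_tok = [t for tag in tags for t in tokens(tag)]
    text_tok = title_tok + desc_tok
    return {
        "title_length": len(title),  # assumed to be measured in characters
        "description_length": len(description),
        "description_to_title_ratio": len(description) / len(title) if title else 0.0,
        "title_description_jaccard": jaccard(title_tok, desc_tok),
        "exclamation_question_marks": sum((title + description).count(c) for c in "!?"),
        "bad_words_in_text": sum(t in bad_words for t in text_tok),
        "kids_words_in_text": sum(t in kids_words for t in text_tok),
        "num_tags": len(tags),
        "bad_words_in_tags": sum(t in bad_words for t in tag_tok),
        "kids_words_in_tags": sum(t in kids_words for t in tag_tok),
        "tags_title_jaccard": jaccard(tag_tok, title_tok),
        # Assumed fallback: with zero dislikes, use the raw like count.
        "likes_to_dislikes_ratio": likes / dislikes if dislikes else float(likes),
    }


features = style_features(
    title="Peppa Pig scary dentist!!",
    description="Peppa Pig visits the dentist",
    tags=["peppa pig", "dentist"],
    likes=30,
    dislikes=10,
    bad_words={"scary"},
    kids_words={"peppa", "pig"},
)
print(features)
```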

Figure 6: Architecture of our deep learning model for detecting disturbing videos. The model processes almost all the video features: (a) tags; (b) title; (c) statistics and style; and (d) thumbnail.

3.2 Model Architecture

Figure 6 depicts the architecture of our classifier, which combines the above-mentioned features. Initially, the classifier consists of four different branches, where each branch processes a distinct feature type: title, tags, thumbnail, and statistics together with style features. Then the outputs of all the branches are concatenated and passed to a two-layer, fully connected neural network that merges their output and drives the final classification.

The title feature is fed to a trainable embedding layer that outputs a 32-dimensional vector for each word in the title text. Then, the output of the embedding layer is fed to a Long Short-Term Memory (LSTM) [16] Recurrent Neural Network (RNN) that captures the relationships between the words in the title. For the tags, we use an architecturally identical branch trained separately from the title branch.

For thumbnails, due to the limited number of training examples in our dataset, we use transfer learning and the pre-trained Inception-v3 Convolutional Neural Network (CNN), which is built from the large-scale ImageNet dataset (http://image-net.org/). We use the pre-trained CNN to extract a meaningful feature representation (2,048-dimensional vector) of each thumbnail. Last, the statistics and style features are fed to a fully-connected dense neural network comprising 25 units.

The second part of our classifier is essentially a two-layer, fully-connected dense neural network. At the first layer (dubbed the Fusing Network), we merge the outputs of the four branches, creating a 2,137-dimensional vector. This vector is subsequently processed by the 512 units of the Fusing Network. Next, to avoid possible over-fitting issues we regularize via the prominent Dropout technique [23]. We apply a Dropout level of 0.5, which means that during each iteration of training, half of the units in this layer do not update their parameters. Finally, the output of the Fusing Network is fed to the last dense layer of four units with softmax activation, whose outputs are essentially the probabilities that a particular video is suitable, disturbing, restricted, or irrelevant.
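The size of the fused vector follows from the branch outputs, and the final layer turns four scores into class probabilities. The short sketch below works through both. The text does not state the output size of the two LSTM branches; the value of 32 per branch is an inference from the fused size (2,137 minus 2,048 minus 25 leaves 64 for two identical branches), and the example scores are made up.

```python
import math
from typing import List

# Output size of each branch. The two LSTM sizes are inferred, not stated in the text.
BRANCH_OUTPUT_DIMS = {
    "title_lstm": 32,         # assumption inferred from the fused size
    "tags_lstm": 32,          # assumption inferred from the fused size
    "thumbnail_cnn": 2048,    # Inception-v3 feature vector
    "stats_style_dense": 25,  # dense layer on statistics and style features
}
CLASSES = ["suitable", "disturbing", "restricted", "irrelevant"]


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax: exponentiate shifted scores, then normalize."""
    shift = max(scores)
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


fused_dim = sum(BRANCH_OUTPUT_DIMS.values())
example_scores = [0.1, 2.0, 0.3, -1.0]  # made-up outputs of the last dense layer
probabilities = softmax(example_scores)
predicted = CLASSES[probabilities.index(max(probabilities))]
print(fused_dim, predicted)
```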

3.3 Experimental Evaluation

We implement our model using Keras with TensorFlow as the backend [1]. To train our model we randomly select of the videos from our ground truth dataset and test on the remaining out-of-sample dataset, while ensuring that both the train and the test set contain a balanced number of videos from each class.

We train and test our model using all the aforementioned features. For the stochastic optimization of this model we use the Adam algorithm with an initial learning rate of , and . To evaluate our model, we compare it in terms of accuracy, precision, recall, and F1 score against the following five baselines: (i) a Support Vector Machine (SVM) with parameters and ; (ii) a K-Nearest Neighbors classifier with neighbors and leaf size equal to ; (iii) a Bernoulli Naive Bayes classifier with ; (iv) a Decision Tree classifier with an entropy criterion; and (v) a Random Forest classifier with an entropy criterion and number of trees equal to

For hyper-parameter tuning of all the baselines we use the grid search strategy. Table 5 reports the performance of the proposed model as well as the five baselines, while Figure 7 shows their ROC curves. Although the proposed model outperforms all the baselines in terms of accuracy, precision, recall, and F1 score, it still has poor performance.
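For reference, the four reported metrics can be computed from true and predicted labels as sketched below. The text does not say how precision, recall, and F1 are averaged across the four classes; macro-averaging (an unweighted mean of per-class values) is assumed here, and the labels are toy data.

```python
from typing import Dict, Sequence


def macro_metrics(y_true: Sequence[str], y_pred: Sequence[str]) -> Dict[str, float]:
    """Accuracy plus macro-averaged precision, recall, and F1 (assumed averaging)."""
    labels = sorted(set(y_true) | set(y_pred))
    precisions, recalls, f1s = [], [], []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    return {
        "accuracy": sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true),
        "precision": sum(precisions) / len(labels),
        "recall": sum(recalls) / len(labels),
        "f1": sum(f1s) / len(labels),
    }


# Toy labels for illustration only.
y_true = ["suitable", "suitable", "disturbing", "disturbing", "restricted", "irrelevant"]
y_pred = ["suitable", "disturbing", "disturbing", "disturbing", "restricted", "suitable"]
scores = macro_metrics(y_true, y_pred)
print(scores)
```

Under macro-averaging, the reported F1 is the mean of per-class F1 values, so it need not equal the harmonic mean of the reported precision and recall.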

In an attempt to achieve better accuracy, we consider the video itself as an additional input to our model. We train our model considering frames per video along with all the other types of input. We then evaluate in terms of accuracy, precision, recall, and F1 score, which amount to , , , and , respectively. Unfortunately, adding the video frames in the set of features of our model yields worse performance in terms of accuracy, while it is more than times more computationally expensive compared to the model without video frames. Hence, we decided to keep the previous model formulation ignoring the video frames.

Model Accuracy Precision Recall F1 Score
SVM 27.1 56.7 25.3 11.3
Decision Tree 32.4 32.4 32.2 32.2
K-Nearest Neighbors 34.2 35.0 33.7 33.4
Naive Bayes 42.1 43.9 42.3 41.9
Random Forest 53.1 53.0 52.5 51.3
Proposed Model 61.3 62.2 60.9 61.3
Table 5: Performance metrics for the evaluated baselines as well as for the proposed deep learning model.
Figure 7: ROC Curves of all the baselines as well as of the proposed model trained for multi-class classification.

Understanding Misclassified Videos. While performing the manual annotation process, the two experienced annotators noticed a fair degree of subjectivity with respect to discriminating disturbing and restricted videos.

In order to understand what might be causing poor performance of our model, we look at the agreement of our annotators on the misclassified videos of the classifier trained without the video frames. More precisely, we calculate the agreement score of our annotators on all videos that are actually disturbing and misclassified as restricted, as well as on all restricted videos misclassified as disturbing. The Fleiss agreement score of all the annotators is , while the agreement score of our experienced annotators, while a bit higher (), is still very low. Both scores indicate a “slight” agreement, which in turn indicates that even our human annotators are not that good at differentiating disturbing and restricted videos. Table 6 summarizes our ground truth dataset after collapsing our labels.

To perform a more accurate analysis of the inappropriate videos on YouTube, we need a more accurate classifier. Thus, for the sake of our analysis in the next steps, we collapse our four labels into two general categories, by combining the suitable with the irrelevant videos into one “non-disturbing” category and the disturbing with the restricted videos into a second “inappropriate” category. In this way, we alleviate the subjectivity in discerning disturbing from restricted videos.
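The label collapse amounts to a fixed mapping from four classes to two. The sketch below applies it to the per-class counts of the ground truth dataset (Table 2); the dictionary and variable names are illustrative.

```python
from collections import Counter

# Mapping from the four annotation labels to the two collapsed categories.
COLLAPSE = {
    "suitable": "non-disturbing",
    "irrelevant": "non-disturbing",
    "disturbing": "inappropriate",
    "restricted": "inappropriate",
}

# Per-class counts of the ground truth dataset (Table 2).
four_class_counts = {"suitable": 837, "disturbing": 710, "restricted": 818, "irrelevant": 755}

binary_counts = Counter()
for label, count in four_class_counts.items():
    binary_counts[COLLAPSE[label]] += count

total = sum(binary_counts.values())
shares = {label: round(100 * count / total, 1) for label, count in binary_counts.items()}
print(dict(binary_counts), shares)
```

The resulting counts are 1,592 non-disturbing and 1,528 inappropriate videos, so the collapsed dataset stays close to balanced.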

We call the first category “non-disturbing” despite including PG and PG-13 videos because those videos are not aimed at toddlers (irrelevant). On the other hand, videos rated as PG or PG-13 that target toddlers (aged 1 to 5) are disturbing and fall under the inappropriate category. When such videos appear on the video recommendation list of toddlers, it is a strong indication that they are disturbing and our binary classifier will detect them as inappropriate. In this way we are able to build a classifier that achieves better accuracy and detects videos that are safe to be watched by a toddler.

Class        Non-disturbing (%)  Inappropriate (%)
# of videos  1,592 (51.0)        1,528 (49.0)
Table 6: Summary of our ground truth dataset after collapsing our labels into two categories.
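
The label collapsing described above amounts to a fixed mapping from the four annotation labels to two classes. The sketch below illustrates it; the dictionary and function names are hypothetical, and the class shares are simply recomputed from the counts in Table 6.

```python
from collections import Counter

# Mapping from the four annotation labels to the two collapsed classes.
COLLAPSE = {
    "suitable": "non-disturbing",
    "irrelevant": "non-disturbing",
    "disturbing": "inappropriate",
    "restricted": "inappropriate",
}


def collapse_labels(labels):
    """Map each four-class label to its binary class."""
    return [COLLAPSE[label] for label in labels]


def class_shares(counts):
    """Return the percentage share of each class, rounded to one decimal."""
    total = sum(counts.values())
    return {name: round(100.0 * n / total, 1) for name, n in counts.items()}


# Toy input (hypothetical labels for five videos).
binary = collapse_labels(
    ["suitable", "restricted", "irrelevant", "disturbing", "suitable"]
)
summary = Counter(binary)

# Shares implied by the video counts in Table 6.
table_6_shares = class_shares({"non-disturbing": 1592, "inappropriate": 1528})
```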

Binary Classifier. Next, we train and evaluate the proposed model for binary classification on our reshaped ground truth dataset following the same approach as the one presented above. Table 7 reports the performance of our model as well as the baselines, while Figure 8 shows their ROC curves. We observe that our deep learning model outperforms all baseline models across all performance metrics. Specifically, our model substantially outperforms the Random Forest classifier, the best-performing of the five baselines, on accuracy, precision, recall, and F1 score, by 15.2, 17.4, 14.1, and 15.7 percentage points, respectively (Table 7).

Model                Accuracy (%)  Precision (%)  Recall (%)  F1 Score (%)
SVM                  51.4          76.0           50.0        35.0
Decision Tree        54.6          53.7           51.8        52.8
K-Nearest Neighbors  58.1          58.2           51.5        54.6
Naive Bayes          63.2          62.2           63.6        62.9
Random Forest        67.6          67.7           64.6        66.1
Proposed Model       82.8          85.1           78.7        81.8
Table 7: Performance of the evaluated baselines trained for binary classification and of our proposed binary classifier.
Figure 8: ROC Curves of all the baselines as well as of the proposed model trained for binary classification.
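
As a quick sanity check on Table 7, the margins between the proposed model and Random Forest can be recomputed directly from the table, and the F1 score of the proposed model can be compared against the harmonic mean of its precision and recall. The sketch below assumes the reported values are percentages; the helper names are hypothetical. The harmonic-mean relation need not hold for rows whose scores are averaged per class, so it is checked only for the proposed model.

```python
# (accuracy, precision, recall, F1) in percent, copied from Table 7.
TABLE_7 = {
    "Random Forest": (67.6, 67.7, 64.6, 66.1),
    "Proposed Model": (82.8, 85.1, 78.7, 81.8),
}


def f1_from(precision, recall):
    """F1 score as the harmonic mean of precision and recall."""
    return 2.0 * precision * recall / (precision + recall)


def margins(model, baseline):
    """Per-metric difference in percentage points between two table rows."""
    return tuple(
        round(m - b, 1) for m, b in zip(TABLE_7[model], TABLE_7[baseline])
    )


gap = margins("Proposed Model", "Random Forest")  # (15.2, 17.4, 14.1, 15.7)
f1_check = round(f1_from(85.1, 78.7), 1)          # 81.8
```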

4 Analysis

In this section, we study the interplay of non-disturbing and inappropriate videos on YouTube using our binary classifier. First, we assess the prevalence of inappropriate videos in our dataset and investigate how likely it is for YouTube to recommend an inappropriate video. Second, we perform random walks on YouTube’s recommendation graph to simulate the behavior of a toddler who selects videos based on the recommendations.

4.1 Recommendation Graph Analysis

First, we investigate the prevalence of inappropriate videos in our dataset by running our binary classifier on the whole dataset, which allows us to label each video as inappropriate or non-disturbing. We find 122K () non-disturbing videos and 11K () inappropriate videos. These findings highlight the gravity of the problem: a parent searching on YouTube with simple toddler-related keywords and casually selecting from the recommended videos is likely to expose their child to a substantial number of inappropriate videos.

But what is the interplay between the inappropriate and non-disturbing videos in our dataset? To shed light on this question, we create a directed graph, where nodes are videos and edges represent recommendations (up to 10 recommended videos per node due to our data collection methodology). For instance, if video B is recommended via video A, then we add an edge from A to B. Then, for each video in our graph, we calculate the out-degree in terms of non-disturbing and inappropriate labeled nodes. From here, we can count the number of transitions the graph makes between differently labeled nodes. Table 8 summarizes the percentages of transitions between the two types of videos in our dataset. Unsurprisingly, we find that almost 84% of the transitions are between non-disturbing videos, which is mainly because of the large number of non-disturbing videos in our sample. Interestingly, 5.79% of all transitions lead from a non-disturbing video to an inappropriate one.

Taken together, these findings show that the problem of toddler-related inappropriate videos on YouTube is not negligible and that there is a very real chance that a toddler will be recommended an inappropriate video when watching a non-disturbing video.

Source          Destination     % of total transitions
Non-disturbing  Non-disturbing  83.96
Non-disturbing  Inappropriate   5.79
Inappropriate   Non-disturbing  7.88
Inappropriate   Inappropriate   2.37
Table 8: Probability of each possible transition between non-disturbing and inappropriate videos in our dataset.
Figure 9: CDF of the number of inappropriate videos encountered per random walk (a) and cumulative percentage of inappropriate videos encountered at each hop out of all the inappropriate videos found (b) for non-disturbing and inappropriate seed keywords.
Figure 10: CDF of the number of inappropriate videos encountered per random walk (a) and cumulative percentage of inappropriate videos encountered at each hop out of all the inappropriate videos found (b) for clusters of seed keywords.
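
The transition shares in Table 8 can be obtained by counting labeled edges of the recommendation graph. The following is a minimal sketch under stated assumptions: edges are given as (source, recommended) pairs, every video carries a binary label, and the function names are hypothetical. The second helper conditions on the source label, which is a different quantity from the share of total transitions reported in Table 8.

```python
from collections import Counter


def transition_shares(edges, labels):
    """Percentage of all edges for each (source label, destination label) pair.

    edges: iterable of (source_id, recommended_id) pairs.
    labels: dict mapping video id to "non-disturbing" or "inappropriate".
    Edges touching an unlabeled video are skipped.
    """
    counts = Counter(
        (labels[src], labels[dst])
        for src, dst in edges
        if src in labels and dst in labels
    )
    total = sum(counts.values())
    return {pair: 100.0 * n / total for pair, n in counts.items()}


def conditional_share(shares, source, destination):
    """Percentage of transitions leaving `source` that end in `destination`."""
    from_source = sum(v for (src, _), v in shares.items() if src == source)
    return 100.0 * shares.get((source, destination), 0.0) / from_source


# Toy graph (hypothetical video ids).
toy_labels = {"a": "non-disturbing", "b": "non-disturbing", "c": "inappropriate"}
toy_edges = [("a", "b"), ("a", "c"), ("b", "a"), ("c", "a")]
toy_shares = transition_shares(toy_edges, toy_labels)

# Values copied from Table 8.
TABLE_8 = {
    ("non-disturbing", "non-disturbing"): 83.96,
    ("non-disturbing", "inappropriate"): 5.79,
    ("inappropriate", "non-disturbing"): 7.88,
    ("inappropriate", "inappropriate"): 2.37,
}
given_non_disturbing = conditional_share(
    TABLE_8, "non-disturbing", "inappropriate"
)
```

Applied to the values in Table 8, the conditional share of recommendations that lead from a non-disturbing video to an inappropriate one is roughly 6.45%, slightly higher than the 5.79% share of total transitions because it ignores edges that start at inappropriate videos.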

4.2 How likely is it for a toddler to come across inappropriate videos?

In the previous section, we showed that the problem of toddler-related inappropriate videos is not negligible. However, it is unclear whether the previous results generalize to YouTube at large, since our dataset is based on snowball sampling up to 3 hops from a set of seed videos. In reality, YouTube is an immense platform comprising billions of videos, which are recommended over many hops within YouTube’s recommendation graph. Therefore, to assess how prominent the problem is on a larger scale, we perform random walks on YouTube’s recommendation graph. This allows us to simulate the behavior of a user who searches the platform for a video and then watches several videos according to the recommendations. To do this, we use the list of seed keywords used for constructing our dataset: for each seed keyword, we initially perform a search query on YouTube and randomly select one video from the top 10. Then, we obtain the recommendations of the video and select one randomly. We repeat this process until we reach 10 hops, which constitutes the end of a single random walk. We perform 100 random walks for each seed keyword, classifying each video we visit with our binary classifier.
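
The random walk procedure can be sketched as follows. The `search`, `recommend`, and `classify` callables are hypothetical stand-ins for a YouTube search query, a recommendation lookup, and the binary classifier; none of them is a real API. The sketch also assumes that the search result counts as hop zero, so a complete walk visits 11 videos.

```python
import random


def random_walk(seed_keyword, search, recommend, classify,
                hops=10, top_k=10, rng=random):
    """Return the list of labels of the videos visited in one random walk.

    search(keyword) -> ranked list of video ids (hypothetical helper)
    recommend(video_id) -> list of recommended video ids (hypothetical helper)
    classify(video_id) -> "non-disturbing" or "inappropriate"
    """
    results = search(seed_keyword)[:top_k]
    if not results:
        return []
    video = rng.choice(results)        # hop zero: a random top-k search result
    visited = [classify(video)]
    for _ in range(hops):
        recommendations = recommend(video)
        if not recommendations:        # dead end: stop the walk early
            break
        video = rng.choice(recommendations)
        visited.append(classify(video))
    return visited


def walks_for_keyword(seed_keyword, search, recommend, classify,
                      n_walks=100, rng=random):
    """Repeat the random walk n_walks times for one seed keyword."""
    return [
        random_walk(seed_keyword, search, recommend, classify, rng=rng)
        for _ in range(n_walks)
    ]


# Toy recommendation graph (hypothetical video ids and labels).
toy_graph = {"a": ["b", "c"], "b": ["a", "c"], "c": ["a", "b"]}
toy_labels = {"a": "non-disturbing", "b": "non-disturbing", "c": "inappropriate"}
toy_walks = walks_for_keyword(
    "peppa pig",
    search=lambda keyword: ["a", "b"],
    recommend=lambda video: toy_graph[video],
    classify=lambda video: toy_labels[video],
    rng=random.Random(0),
)
```

Passing a seeded `random.Random` instance makes a set of walks reproducible, which helps when comparing results across seed keywords.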

First, we report our results by grouping the random walks based on the keywords used to seed them. That is, we separate the seed keywords into non-disturbing and inappropriate based on the words they include (we find 31 non-disturbing and 33 inappropriate seed keywords). Fig. 9(a) shows the CDF of the number of inappropriate videos that we find in each random walk according to the seed keyword. We observe that we find at least one inappropriate video in of the walks when using non-disturbing keywords, while for inappropriate keywords we find at least one inappropriate video in of the walks. We also plot the cumulative percentage of inappropriate videos encountered at each hop of the random walks for both non-disturbing and inappropriate search keywords in Fig. 9(b). This allows us to observe at which hop we find the most inappropriate videos. Interestingly, we find that most of the inappropriate videos are found at the start of our random walks (i.e., during the selection of the search results) and that this number decreases as the number of hops increases. These findings highlight that the problem of inappropriate videos on YouTube already emerges in users’ search results.
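
The two per-walk statistics discussed above can be computed from the label sequences of the walks. The sketch below is illustrative only: it assumes each walk is a list of labels with hop zero first, and the function names and toy walks are hypothetical.

```python
def inappropriate_per_walk(walks):
    """Number of inappropriate videos encountered in each walk."""
    return [sum(1 for label in walk if label == "inappropriate") for walk in walks]


def empirical_cdf(values):
    """List of (value, share of observations <= value), sorted by value."""
    n = len(values)
    return [
        (v, sum(1 for x in values if x <= v) / n) for v in sorted(set(values))
    ]


def share_with_at_least_one(walks):
    """Percentage of walks that contain at least one inappropriate video."""
    hits = sum(1 for count in inappropriate_per_walk(walks) if count > 0)
    return 100.0 * hits / len(walks)


def cumulative_share_per_hop(walks):
    """Cumulative percentage of all inappropriate videos found up to each hop."""
    n_hops = max(len(walk) for walk in walks)
    per_hop = [
        sum(1 for walk in walks if len(walk) > hop and walk[hop] == "inappropriate")
        for hop in range(n_hops)
    ]
    total = sum(per_hop)
    shares, running = [], 0
    for count in per_hop:
        running += count
        shares.append(100.0 * running / total if total else 0.0)
    return shares


# Toy walks of three videos each (hypothetical labels); I = inappropriate.
I, N = "inappropriate", "non-disturbing"
toy_walks = [[I, N, N], [N, N, I], [N, N, N], [I, I, N]]
cdf = empirical_cdf(inappropriate_per_walk(toy_walks))
at_least_one = share_with_at_least_one(toy_walks)
per_hop = cumulative_share_per_hop(toy_walks)
```

In the toy input, half of all inappropriate videos already appear at hop zero, which mirrors the kind of pattern the cumulative per-hop plot is meant to reveal.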

Next, we aim to assess whether our results change according to the content of the videos. To do this, we inspect all the seed keywords and create clusters based on the words they include. For example, all seed keywords related to “Peppa pig” are grouped into one cluster. Table 9 summarizes the clusters and the number of seed keywords in each cluster. Then, we group the random walks according to the clusters of their seed keywords. Fig. 10(a) shows the CDF of the number of inappropriate videos per random walk, while Fig. 10(b) shows the cumulative percentage of inappropriate videos per hop. We find that the results vary according to the keywords that we use and that we find the most inappropriate videos in random walks that start from “Peppa pig” (), “Elsa and Spiderman and Joker” (), and “Minnie and Mickey mouse” (). Also, we find a substantial number of inappropriate videos emerging from the search results, in particular for the terms “Peppa pig” () and “Elsa and Spiderman and Joker” () (see hop zero in Fig. 10(b)).

Cluster name                  # of seed keywords
Elsa and Spiderman and Joker  26
Other                         9
Peppa Pig                     7
Superheroes                   5
Minnie and Mickey mouse       5
Wrong Heads                   5
Finger Family                 4
Bad kids and Bad babys        3
Table 9: Summary of our clusters and the number of seed keywords in each cluster.

5 Related Work

Prior work studied YouTube videos with inappropriate content for children, as well as spam, hate or malicious activity.

Inappropriate Content for Children. Several studies focused on understanding videos that target young children, and how they interact with such videos and the platform. Buzzi [9] suggests the addition of extra parental controls on YouTube in an attempt to prevent children from accessing inappropriate content. Araújo et al. [5] study the audience profiles and comments posted on YouTube videos in popular children-oriented channels, and conclude that children under the age of 13 use YouTube and are exposed to advertising, inappropriate content, and privacy issues. Eickhoff et al. [13] propose a binary classifier, based on video metadata, for identifying suitable YouTube videos for children. Kaushal et al. [17] focus on the characterization and detection of unsafe content for children and its promoters on YouTube. They propose a machine learning classifier that considers a set of video-, user-, and comment-level features for the detection of users that promote unsafe content.

Spam, Hate and Malicious Activity. A large body of previous work focused on the detection of malicious activity on YouTube. Sureka et al. [27] use social network analysis techniques to discover hate and extremist YouTube videos, as well as hidden communities in the ecosystem. Agarwal et al. [2] develop a binary classifier trained with user and video features for detecting YouTube videos that promote hate and extremism. Giannakopoulos et al. [15] use video, audio, and textual features for training a k-nearest neighbors classifier for detecting YouTube videos containing violence. Ottoni et al. [21] perform an in-depth analysis of video comments posted by alt-right channels on YouTube. They conclude that the comments of a video are a better indicator for detecting alt-right videos than the video’s title. Aggarwal et al. [3] use video features for detecting videos violating privacy or promoting harassment.

With regard to spam detection, Chowdury et al. [12] explore video attributes that may enable the detection of spam videos on YouTube. A similar study by Sureka [26] focuses on both user features and comment activity logs to propose formulas/rules that can accurately detect spamming YouTube users. Using similar features, Bulakh et al. [8] characterize and identify fraudulently promoted YouTube videos. Chaudhary et al. [10] use only video features, and propose a one-class classifier approach for detecting spam videos.

O’Callaghan et al. [19] use dynamic network analysis methods to identify the nature of different spam campaign strategies. Benevenuto et al. [6] propose two supervised classification algorithms to detect spammers, promoters, and legitimate YouTube users. Also, in an effort to improve the performance of spam filtering on the platform, Alberto et al. [4] test numerous approaches and propose a tool, based on Naive Bayes, that filters spam comments on YouTube. Finally, Zannettou et al. [31] propose a deep learning classifier for identifying videos that use manipulative techniques in order to increase their views (i.e., clickbait videos).

In contrast to all these studies, we are the first to focus on the characterization and detection of disturbing videos, i.e., inappropriate videos that explicitly target toddlers. We collect thousands of YouTube videos and manually annotate them according to four relevant categories. We develop a deep learning classifier that can detect inappropriate videos with an accuracy of 82.8%. By classifying and analyzing these videos, we shed light on the prevalence of the problem on YouTube, and how likely it is for an inappropriate video to be served to a toddler who casually browses the platform.

6 Conclusions

An increasing number of young children are shifting from broadcast to streaming video consumption, with YouTube providing an endless array of content tailored toward young viewers. While much of this content is age-appropriate, there is also an alarming amount of inappropriate material available.

In this paper, we present the first characterization of inappropriate or disturbing videos targeted at toddlers. From a ground truth labeled dataset, we develop a deep learning classifier that achieves an accuracy of 82.8%. We leverage this classifier to perform a large-scale study of toddler-oriented content on YouTube, finding of the 133,806 videos in our dataset to be inappropriate. Even worse, we discover a chance of a toddler who starts watching non-disturbing videos to be recommended inappropriate ones within ten recommendations.

Although scientific debate (and public opinion) on the risks associated with “screen time” for young children is still ongoing, based on our findings, we believe a more pressing concern to be the dangers of crowd-sourced, uncurated content combined with engagement-oriented, gameable recommendation systems. Considering the advent of algorithmic content creation (e.g., “deep fakes” [30]) and the monetization opportunities on sites like YouTube, there is no reason to believe there will be an organic end to this problem. Our classifier, and the insights gained from our analysis, can be used as a starting point to gain a deeper understanding and begin mitigating this issue.

7 Acknowledgments

This project has received funding from the European Union’s Horizon 2020 Research and Innovation program under the Marie Skłodowska-Curie ENCASE project (Grant Agreement No. 691025). This work reflects only the authors’ views; the Agency and the Commission are not responsible for any use that may be made of the information it contains.

References

  • [1] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, et al. Tensorflow: a system for large-scale machine learning. In OSDI, 2016.
  • [2] S. Agarwal and A. Sureka. A focused crawler for mining hate and extremism promoting videos on youtube. In ACM Hypertext, 2014.
  • [3] N. Aggarwal, S. Agrawal, and A. Sureka. Mining youtube metadata for detecting privacy invading harassment and misdemeanor videos. In IEEE PST, 2014.
  • [4] T. C. Alberto, J. V. Lochter, and T. A. Almeida. Tubespam: Comment spam filtering on youtube. In IEEE ICMLA, 2015.
  • [5] C. S. Araújo, G. Magno, W. Meira, V. Almeida, P. Hartung, and D. Doneda. Characterizing videos, audience and advertising in youtube channels for kids. In SocInfo, 2017.
  • [6] F. Benevenuto, T. Rodrigues, A. Veloso, J. Almeida, M. Gonçalves, and V. Almeida. Practical detection of spammers and content promoters in online video sharing systems. IEEE Cybernetics, 2012.
  • [7] R. Brandom. Inside elsagate, the conspiracy-fueled war on creepy youtube kids videos. https://www.theverge.com/2017/12/8/16751206/elsagate-youtube-kids-creepy-conspiracy-theory, 2017.
  • [8] V. Bulakh, C. W. Dunn, and M. Gupta. Identifying fraudulently promoted online videos. In WWW, 2014.
  • [9] M. Buzzi. Children and youtube: access to safe content. In ACM SIGCHI, 2011.
  • [10] V. Chaudhary and A. Sureka. Contextual feature based one-class classifier approach for detecting video response spam on youtube. In IEEE PST, 2013.
  • [11] F. Chollet. Keras. https://github.com/fchollet/keras, 2015.
  • [12] R. Chowdury, M. N. M. Adnan, G. Mahmud, and R. M. Rahman. A data mining based spam detection system for youtube. In IEEE ICDIM, 2013.
  • [13] C. Eickhoff and A. P. de Vries. Identifying suitable youtube videos for children. Networked and electronic media summit, 2010.
  • [14] J. L. Fleiss. Measuring nominal scale agreement among many raters. Psychological bulletin, 1971.
  • [15] T. Giannakopoulos, A. Pikrakis, and S. Theodoridis. A multimodal approach to violence detection in video sharing sites. In IEEE ICPR, 2010.
  • [16] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural computation, 1997.
  • [17] R. Kaushal, S. Saha, P. Bajaj, and P. Kumaraguru. Kidstube: Detection, characterization and analysis of child unsafe content & promoters on youtube. In IEEE PST, 2016.
  • [18] S. Maheshwari. On youtube kids, startling videos slip past filters. https://www.nytimes.com/2017/11/04/business/media/youtube-kids-paw-patrol.html, 2017.
  • [19] D. O’Callaghan, M. Harrigan, J. Carthy, and P. Cunningham. Network analysis of recurring youtube spam campaigns. In ICWSM, 2012.
  • [20] M. Oquab, L. Bottou, I. Laptev, and J. Sivic. Learning and transferring mid-level image representations using convolutional neural networks. In IEEE CVPR, 2014.
  • [21] R. Ottoni, E. Cunha, G. Magno, P. Bernadina, W. Meira Jr, and V. Almeida. Analyzing right-wing youtube channels: Hate, violence and discrimination. arXiv 1804.04096, 2018.
  • [22] Reddit. What is elsagate? https://www.reddit.com/r/ElsaGate/comments/6o6baf/, 2017.
  • [23] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 2014.
  • [24] Statista. Youtube: most subscribed kids content channels. https://www.statista.com/statistics/785626/most-popular-youtube-children-channels-ranked-by-subscribers/, 2018.
  • [25] A. Subedar and W. Yates. The disturbing youtube videos that are tricking children. https://www.bbc.com/news/blogs-trending-39381889, 2017.
  • [26] A. Sureka. Mining user comment activity for detecting forum spammers in youtube. arXiv 1103.5044, 2011.
  • [27] A. Sureka, P. Kumaraguru, A. Goyal, and S. Chhabra. Mining youtube to discover extremist videos, users and hidden communities. In AIRS, 2010.
  • [28] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In IEEE CVPR, 2015.
  • [29] P. Weston. Youtube kids app is still showing disturbing videos. https://www.dailymail.co.uk/sciencetech/article-5358365/YouTube-Kids-app-showing-disturbing-videos.html, 2018.
  • [30] Wikipedia. Deepfake. https://en.wikipedia.org/wiki/Deepfake.
  • [31] S. Zannettou, S. Chatzis, K. Papadamou, and M. Sirivianos. The Good, the Bad and the Bait: Detecting and Characterizing Clickbait on YouTube. In IEEE S&P Workshops, 2018.