Combating Human Trafficking with Deep Multimodal Models

05/08/2017 ∙ by Edmund Tong, et al. ∙ Marinus Analytics, Carnegie Mellon University

Human trafficking is a global epidemic affecting millions of people across the planet. Sex trafficking, the dominant form of human trafficking, has seen a significant rise mostly due to the abundance of escort websites, where human traffickers can openly advertise among at-will escort advertisements. In this paper, we take a major step in the automatic detection of advertisements suspected to pertain to human trafficking. We present a novel dataset called Trafficking-10k, with more than 10,000 advertisements annotated for this task. The dataset contains two sources of information per advertisement: text and images. For the accurate detection of trafficking advertisements, we designed and trained a deep multimodal model called the Human Trafficking Deep Network (HTDN).




1 Introduction

Human trafficking, “a crime that shames us all” unodc2008, has seen a steep rise in the United States since 2012. The number of cases reported rose from 3,279 in 2012 to 7,572 in 2016, more than doubling over the course of five years hotline2017. Sex trafficking is a form of human trafficking, and is a global epidemic affecting millions of people each year mccarthy2014human. Victims of sex trafficking are subjected to coercion, force, and control, and are not able to ask for help. Put plainly, sex trafficking is modern-day slavery and is one of the top priorities of law enforcement agencies at all levels.

A major advertising ground for human traffickers is the World Wide Web. The Internet has brought traffickers the ability to advertise online and has fostered the growth of numerous adult escort sites. Each day, there are tens of thousands of Internet advertisements posted in the United States and Canada that market commercial sex. Hiding among the noise of at-will adult escort ads are ads posted by sex traffickers. Often long undetected, trafficking rings and escort websites form a profit cycle that fuels the increase of both trafficking rings and escort websites.

For law enforcement, this presents a significant challenge: how should we identify advertisements that are associated with sex trafficking? Police have limited human and technical resources, and manually sifting through thousands of ads in the hopes of finding something suspicious is a poor use of those resources, even if they know what they are looking for. Leveraging state-of-the-art machine learning approaches in Natural Language Processing and computer vision to detect and report advertisements suspected of trafficking is the main focus of our work. In other words, we strive to find the victims and perpetrators of trafficking who hide in plain sight in the massive amounts of data online. By narrowing down the number of advertisements that law enforcement must sift through, we endeavor to provide a real opportunity for law enforcement to intervene in the lives of victims. However, there are non-trivial challenges facing this line of research:

Adversarial Environment. Human trafficking rings are aware that law enforcement monitors their online activity. Over the years, law enforcement officers have populated lists of keywords that frequently occur in trafficking advertisements. However, these simplistic queries fail when traffickers use complex obfuscation. Traffickers, again aware of this, move to new keywords to blend in with the at-will escort advertisements. This trend creates an adversarial environment for any machine learning system that attempts to find trafficking rings hiding in plain sight.

Defective Language Compositionality. Online escort advertisements are difficult to analyze, because they lack grammatical structures such as constituency. Therefore, any form of inference must rely more on context than on grammar. This presents a significant challenge to the NLP community. Furthermore, the majority of the ads contain emojis and non-English characters.

Generalizable Language Context. Machine learning techniques can easily learn unreliable cues in training sets, such as phone numbers, keywords, and other semantically unreliable discriminators, to reduce the training loss. Because the large number of ads available online leaves little similarity between the training and test data, relying on these cues is futile. Learned discriminative features should be generalizable and should model the semantics of trafficking.

Multimodal Nature. Escort advertisements are composed of both textual and visual information. Our model should treat these features interdependently. For instance, if the text indicates that the escort is in a hotel room, our model should consider the effect that such knowledge may have on the importance of certain visual features.

We believe that studying human trafficking advertisements can be seen as a fundamental challenge to the NLP, computer vision, and machine learning communities dealing with language and vision problems. In this paper, we present the following contributions to this research direction. First, we study the language and vision modalities of the escort advertisements through deep neural modeling. Second, we take a significant step in automatic detection of advertisements suspected of sex trafficking. While previous methods dubrawski2015leveraging have used simplistic classifiers, we build an end-to-end-trained multimodal deep model called the Human Trafficking Deep Network (HTDN). The HTDN uses information from both text and images to extract cues of human trafficking, and shows outstanding performance compared to previously used models. Third, we present the first rigorously annotated dataset for detection of human trafficking, called Trafficking-10k, which includes more than 10,000 ads labeled with likelihoods of having been posted by traffickers.

Footnote 1: Due to the sensitive nature of this dataset, access can only be granted by emailing Cara Jones. Different levels of access are provided only to the scientific community.

2 Related Works

Automatic detection of human trafficking has been a relatively unexplored area of machine learning research. Very few machine learning approaches have been proposed to detect signs of human trafficking online. Most of these approaches use simplistic methods such as multimedia matching zhou2016multimedia, text-based filtering classifiers such as random forests, logistic regression, and SVMs, and named-entity recognition to isolate the instances of trafficking nagpal2015entity. Studies have suggested using statistical methods to find keywords and signs of trafficking from data to help law enforcement agencies kennedy2012predictive as well as adult content filtering using textual information zhou2016multimedia.

Multimodal approaches have gained popularity over the past few years. These multimodal models have been used for medical purposes, such as detection of suicidal risk, PTSD and depression scherer2016self; venek2016adolescent; yu2013multimodal; valstar2016avec; sentiment analysis zadeh2016multimodal; poria2016convolutional; zadeh2016mosi; emotion recognition poria2017review; image captioning and media description you2016image; donahue2015long; question answering antol2015vqa; and multimodal translation specia2016shared.

To the best of our knowledge, this paper presents the first multimodal and deep model for detection of human trafficking.

3 Trafficking-10k Dataset

In this section, we present the dataset for our studies. We formalize the problem of recognizing sex trafficking as a machine learning task. The input consists of the text and images of an advertisement, which are mapped to a measure of how suspicious the advertisement is with regard to human trafficking.

Figure 1: Distribution of advertisements in the Trafficking-10k dataset across the United States and Canada.

3.1 Data Acquisition and Preprocessing

A subset of 10,000 ads was sampled randomly from a large cache of escort ads for annotation in the Trafficking-10k dataset. The distribution of advertisements across the United States and Canada is shown in Figure 1, which indicates the diversity of advertisements in Trafficking-10k. This diversity helps models trained on Trafficking-10k remain applicable nationwide. Each of the 10,000 collected ads consists of text and zero or more images. The text in the dataset is in plain text format, derived by stripping the HTML tags from the raw source of the ads. The characters in each advertisement are encoded as UTF-8, because there is ample usage of smileys and non-English characters. Advertisements are truncated to the first 184 words, as this covers more than 90% of the ads. Images are resized to pixels with RGB channels.
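The text preprocessing described above (strip HTML tags, keep UTF-8 text, truncate to the first 184 words) can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the helper names `strip_html` and `preprocess_ad` are hypothetical, whitespace tokenization is an assumption, and the sketch relies only on Python's standard `html.parser` module.

```python
from html.parser import HTMLParser

MAX_WORDS = 184  # truncation length stated in the text


class _TextExtractor(HTMLParser):
    """Collects the text nodes of an HTML fragment and discards the tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_html(raw_html: str) -> str:
    """Return the text content of raw_html (hypothetical helper).

    Text nodes are joined with spaces, so a word split by an inline tag
    becomes two tokens; this is a simplification of the sketch.
    """
    parser = _TextExtractor()
    parser.feed(raw_html)
    parser.close()
    return " ".join(parser.parts)


def preprocess_ad(raw_html: str, max_words: int = MAX_WORDS) -> str:
    """Strip tags, then keep only the first max_words whitespace tokens."""
    # Assumption: a "word" is a whitespace-delimited token.
    words = strip_html(raw_html).split()
    return " ".join(words[:max_words])


if __name__ == "__main__":
    # Python strings are Unicode, so emojis and non-English characters
    # pass through unchanged; decode raw bytes with .decode("utf-8") first.
    print(preprocess_ad("<p>Hello <b>world</b> 😊</p>"))
```

Joining text nodes with spaces avoids gluing together words from adjacent block elements, at the cost of occasionally splitting a word that is interrupted by an inline tag.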

3.2 Trafficking Annotation

Detecting whether or not an advertisement is suspicious requires years of practice and experience in working closely with law enforcement. As a result, annotation is a highly complicated and expensive process that cannot be scaled using crowdsourcing. In our dataset, annotation was carried out by three experts: two with at least five years of experience in the detection of human trafficking, and one with at least one year of experience. To calculate the inter-annotator agreement, each annotator was given the same set of 1,000 ads to annotate, and the nominal agreement was computed: there was an 83% pairwise agreement (0.62 Krippendorff’s alpha). Also, to check that annotations generalize between our annotators and law enforcement officers, two law enforcement officers annotated subsets of 500 and 100 advertisements, respectively. We found a 62% average pairwise agreement (0.42 Krippendorff’s alpha) with our annotators. This gap is reasonable, as law enforcement officers only have experience with local advertisements, while Trafficking-10k annotators have experience with cases across the United States.
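For readers unfamiliar with the two agreement measures reported above, the following sketch computes pairwise agreement and Krippendorff's alpha for nominal labels. It assumes every ad is labeled by every annotator (no missing ratings) and uses the standard coincidence-matrix formulation; whether the authors computed their figures in exactly this way is not stated in the text, and the function names are hypothetical.

```python
from collections import Counter
from itertools import combinations, permutations


def pairwise_agreement(ratings):
    """Fraction of annotator pairs, over all units, that chose the same label.

    ratings: list of units; each unit is a list with one label per annotator.
    """
    agree = total = 0
    for unit in ratings:
        for a, b in combinations(unit, 2):
            agree += (a == b)
            total += 1
    return agree / total


def krippendorff_alpha_nominal(ratings):
    """Krippendorff's alpha with the nominal (match / no match) metric."""
    # Coincidence matrix: each ordered pair of labels from two different
    # annotators of the same unit contributes 1 / (m - 1).
    coincidences = Counter()
    for unit in ratings:
        m = len(unit)
        if m < 2:
            continue  # a unit with a single rating carries no agreement info
        for a, b in permutations(unit, 2):
            coincidences[(a, b)] += 1.0 / (m - 1)

    totals = Counter()  # marginal count per label
    for (a, _), weight in coincidences.items():
        totals[a] += weight
    n = sum(totals.values())

    observed = sum(w for (a, b), w in coincidences.items() if a != b)
    expected = sum(
        totals[a] * totals[b] for a in totals for b in totals if a != b
    ) / (n - 1)
    if expected == 0:
        return float("nan")  # undefined when only one label is ever used
    return 1.0 - observed / expected


if __name__ == "__main__":
    toy = [[0, 0], [0, 1], [1, 1], [1, 1]]  # 4 ads, 2 annotators (toy data)
    print(pairwise_agreement(toy), krippendorff_alpha_nominal(toy))
```

Unlike raw pairwise agreement, alpha corrects for the agreement expected by chance given the label distribution, which is why the two numbers reported for the same annotations differ.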

Annotators used an annotation interface specifically designed for the Trafficking-10k dataset. In the annotation interface, each advertisement was displayed on a separate webpage. The order of the advertisements was determined uniformly at random, and annotators were unable to move to the next advertisement without labeling the current one. For each advertisement, the annotator was presented with the question: “In your opinion, would you consider this advertisement suspicious of human trafficking?” The annotator was presented with the following options: “Certainly no,” “Likely no,” “Weakly no,” “Unsure” (this option is greyed out for 10 seconds to encourage annotators to make an intuitive decision), “Weakly yes,” “Likely yes,” and “Certainly yes.” Thus, the degree to which advertisements are suspicious is quantized into seven levels.
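The seven-level labeling scheme and the random presentation order can be represented as in the sketch below. The integer encoding (0 for “Certainly no” through 6 for “Certainly yes”) and the names `SUSPICION_LEVELS`, `LEVEL_TO_SCORE`, and `presentation_order` are assumptions made for illustration; the text does not specify how the levels are stored.

```python
import random

# The seven answer options, in the order given in the text.
SUSPICION_LEVELS = [
    "Certainly no",
    "Likely no",
    "Weakly no",
    "Unsure",
    "Weakly yes",
    "Likely yes",
    "Certainly yes",
]

# Assumption: encode each option by its position in the list (0..6).
LEVEL_TO_SCORE = {label: score for score, label in enumerate(SUSPICION_LEVELS)}


def presentation_order(ad_ids, seed=None):
    """Return the ad ids in a uniformly random order (hypothetical helper)."""
    order = list(ad_ids)
    random.Random(seed).shuffle(order)
    return order


if __name__ == "__main__":
    print(LEVEL_TO_SCORE["Unsure"])
    print(presentation_order(range(5), seed=0))
```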

3.3 Analysis of Language

The language used in these advertisements introduces fundamental challenges to the field of NLP. The nature of the textual content in these advertisements raises the question of how we can make inferences in a linguistic environment with a constantly evolving lexicon. Language used in the Trafficking-10k dataset is highly inconsistent with standard grammar. Often, words are obfuscated by emojis and symbols. The word ordering is inconsistent, and there is rarely any form of constituency. This form of language is completely different from standard spoken and written English. These attributes make escort advertisements appear somewhat similar to tweets, specifically since these ads are normally short (more than 90% of the ads have at most 184 words). Another point of complexity in these advertisements is the high number of distinct unigrams, due to the usage of uncommon words and obfuscation. On top of this unigram complexity, advertisers continuously change their writing patterns, making the problem more complex.
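The two corpus statistics mentioned above, the share of ads within a word limit and the number of distinct unigrams, could be measured with a sketch like the one below. Whitespace tokenization and lowercasing are assumptions, the function names are hypothetical, and the toy ads are invented for illustration only.

```python
from collections import Counter


def length_coverage(ads, max_words=184):
    """Fraction of ads that contain at most max_words whitespace tokens."""
    if not ads:
        raise ValueError("ads must be non-empty")
    return sum(len(ad.split()) <= max_words for ad in ads) / len(ads)


def unigram_counts(ads):
    """Count lowercased whitespace tokens over all ads."""
    counts = Counter()
    for ad in ads:
        counts.update(ad.lower().split())
    return counts


if __name__ == "__main__":
    toy_ads = ["a b c", "A a", "x y z w"]  # invented placeholder data
    counts = unigram_counts(toy_ads)
    print(length_coverage(toy_ads, max_words=3))
    print(len(counts), counts["a"])
```

On real advertisement text, obfuscated spellings each count as a separate unigram, which is one way the vocabulary growth described above would show up in these counts.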