Identifying Products in Online Cybercrime Marketplaces: A Dataset for Fine-grained Domain Adaptation

08/31/2017
by Greg Durrett, et al.

One weakness of machine-learned NLP models is that they typically perform poorly on out-of-domain data. In this work, we study the task of identifying products being bought and sold in online cybercrime forums, which exhibits particularly challenging cross-domain effects. We formulate a task that represents a hybrid of slot-filling information extraction and named entity recognition and annotate data from four different forums. Each of these forums constitutes its own "fine-grained domain" in that the forums cover different market sectors with different properties, even though all forums are in the broad domain of cybercrime. We characterize these domain differences in the context of a learning-based system: supervised models see decreased accuracy when applied to new forums, and standard techniques for semi-supervised learning and domain adaptation have limited effectiveness on this data, which suggests the need to improve these techniques. We release a dataset of 1,938 annotated posts from across the four forums.
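The cross-forum accuracy drop described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the toy posts are invented (not drawn from the released dataset), the forum names `FORUM_A` and `FORUM_B` are placeholders, and the lexicon-memorizing tagger is a deliberately simple stand-in rather than the learning-based system studied in the paper. It only shows how token-level F1 might be compared when a tagger trained on one forum is applied to the same forum and to a different one.

```python
from collections import Counter

# Hypothetical toy posts: each post is a list of (token, is_product) pairs.
# Invented for illustration only; not taken from the released dataset.
FORUM_A = [
    [("selling", 0), ("fresh", 0), ("rat", 1), ("cheap", 0)],
    [("buying", 0), ("crypter", 1), ("today", 0)],
]
FORUM_B = [
    [("selling", 0), ("aged", 0), ("accounts", 1), ("cheap", 0)],
    [("need", 0), ("a", 0), ("crypter", 1), ("fast", 0)],
]


def train_lexicon(posts):
    """Memorize every token that is labeled as a product in a majority of its occurrences."""
    positive, total = Counter(), Counter()
    for post in posts:
        for token, label in post:
            total[token] += 1
            positive[token] += label
    return {tok for tok in total if positive[tok] * 2 > total[tok]}


def token_f1(lexicon, posts):
    """Token-level F1 of the rule 'a token is a product if it is in the lexicon'."""
    tp = fp = fn = 0
    for post in posts:
        for token, label in post:
            predicted = token in lexicon
            if predicted and label:
                tp += 1
            elif predicted and not label:
                fp += 1
            elif not predicted and label:
                fn += 1
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Train on one forum, then score on the same forum and on a different one.
# Scoring on the training posts themselves overstates in-domain accuracy;
# a real experiment would use a held-out split from the same forum.
lexicon = train_lexicon(FORUM_A)
in_domain_f1 = token_f1(lexicon, FORUM_A)
cross_domain_f1 = token_f1(lexicon, FORUM_B)
print(f"in-domain F1:    {in_domain_f1:.2f}")
print(f"cross-domain F1: {cross_domain_f1:.2f}")
```

In this toy setup the tagger misses the product term that appears only in the second forum, so recall falls on the cross-forum evaluation. This is the kind of vocabulary shift between market sectors that the abstract refers to as a "fine-grained domain" difference.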


