“Exploring Transfer Learning Techniques for Named Entity Recognition in Noisy User-Generated Text”, 2021-05-31:
Modern law enforcement agencies strive to identify current trends and developments in Darknet markets. Extracting information from such markets requires knowledge about the contained entities, which can be extracted via Named Entity Recognition (NER).
Modern NER models are trained via supervised learning, which requires an annotated dataset, but such datasets for specific application domains, eg. drug detection in Darknet markets, are rarely available. In this work, we created a NER dataset focused on drugs in Darknet markets and evaluated resources and techniques for domain and task adaptation of our NER models. The dataset, with about 3,500 item listings, was created via crowd-sourcing and refined via a manual review. It is ~3× the size of the only other NER dataset for Darknet markets that we were aware of at the time.
We found that we could improve our NER prediction performance by ‘domain adaptation’, that is, by fine-tuning our language models on Darknet item descriptions and reduced versions of Wikipedia texts about illicit drugs. Our models were able to predict drug entities with an F1-score of up to 84.04 points according to the CoNLL2003 NER evaluation metric.
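As an illustration of what an entity-level F1-score in the CoNLL2003 style measures, the following is a minimal sketch, assuming BIO-tagged sequences and exact matching of both span boundaries and entity type. The function names and the `DRUG` label are hypothetical and chosen for illustration; this is not the evaluation script used in this work.

```python
def extract_entities(tags):
    """Return the set of (type, start, end) spans encoded by a BIO tag sequence."""
    entities, start, etype = set(), None, None
    # A trailing "O" sentinel closes an entity that runs to the end of the sequence.
    for i, tag in enumerate(list(tags) + ["O"]):
        continues = tag.startswith("I-") and etype == tag[2:]
        if not continues:
            if etype is not None:
                entities.add((etype, start, i))  # end index is exclusive
            if tag == "O":
                start, etype = None, None
            else:
                # "B-X", or an "I-X" that does not continue the open entity
                start, etype = i, tag[2:]
    return entities


def entity_f1(gold_tags, pred_tags):
    """Entity-level F1: a prediction counts only if span and type match exactly."""
    gold, pred = extract_entities(gold_tags), extract_entities(pred_tags)
    true_positives = len(gold & pred)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


gold = ["B-DRUG", "I-DRUG", "O", "B-DRUG", "O"]
pred = ["B-DRUG", "I-DRUG", "O", "O", "B-DRUG"]
score = entity_f1(gold, pred)  # one of two entities matches on each side
```

In this example one of the two predicted entities matches a gold entity exactly, so precision and recall are both 0.5. A partially overlapping or mistyped entity would count as both a false positive and a false negative, which is why this metric is stricter than token-level accuracy.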
[Keywords: NER, Named Entity Recognition, noisy user-generated text, darknet, drug detection, crowd-sourcing, Mechanical Turk]
…The Darknet data is loaded from 2 primary sources, the Darknet Market Archives [BCDH+15] and AZSecure-data [DZE+18].
The Darknet Market Archives contain multiple datasets about Darknet market platforms and forums. We only used the “grams” dataset, which contains nearly daily scrapes of multiple market platforms (eg. “Agora”). We used the scrape from the last date on which these markets were scraped, “2015-07-12”, and only a subset of the markets, namely: “Abraxas”, “Agora”, “Alpha”, “ME”, and “Oxygen”. This dataset was only used for adjusting our language models to the target domain, called domain adaptation (see §2.1). For the dataset creation we used a dataset from AZSecure-data, which was scraped from a platform called “Dream Market”. At that time it was the largest Darknet market platform according to [DZE+18]. The data was collected 2013–2017 and contained 91,463 listings, of which 61,420 were found in a category associated with drugs. The dataset contains a variety of product and vendor information.
Within the scope of this work, we were only interested in the product name and description. The item description was used for the annotation of named entities, and the product name was used to provide context to the annotators. However, other types of information were used during pre-processing for pseudonymization purposes. The pseudonymization included removing all vendor names from the item listings, as well as all email addresses, telephone numbers, and links found in the dataset (links might also identify a vendor profile). A recent example of a drug item listing, which was online at the time of our project, can be seen in Figure 3.1.
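The pseudonymization step described above could be sketched roughly as follows. This is a minimal illustration, not the pipeline used in this work: the regular expressions, the placeholder tokens (the source only says the items were removed), and the function name `pseudonymize` are all assumptions, and the phone pattern in particular is crude and may also match other long digit sequences.

```python
import re

# Hypothetical patterns; the source does not specify how these items were matched.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
LINK_RE = re.compile(r"(?:https?://|www\.)\S+|\b[\w-]+\.onion\b\S*", re.IGNORECASE)
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def pseudonymize(text, vendor_names):
    """Replace emails, links, phone numbers, and known vendor names with placeholders."""
    # Emails go first so that their domain part is not mistaken for a link.
    text = EMAIL_RE.sub("[EMAIL]", text)
    text = LINK_RE.sub("[LINK]", text)
    text = PHONE_RE.sub("[PHONE]", text)
    # Longest names first, so a name that contains another name is replaced whole.
    for name in sorted(vendor_names, key=len, reverse=True):
        if name:  # skip empty strings, which would otherwise match everywhere
            text = re.sub(re.escape(name), "[VENDOR]", text, flags=re.IGNORECASE)
    return text


listing = ("Contact DrugKing at drugking@mail.com or +1 (555) 123-4567, "
           "see http://abc.onion/shop")
cleaned = pseudonymize(listing, ["DrugKing"])
```

The vendor names are taken from the listing metadata rather than guessed from the text, which matches the remark that other fields were used during pre-processing. Replacing with placeholders instead of deleting is a design choice made here so that sentence structure is preserved for annotation.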
Our experiment design required further datasets as representatives for standard NER corpora and text corpora with noisy user-generated data. Our standard NER text corpus is the well-known CoNLL2003 NER dataset [TKSDM03], which is based on newswire texts annotated with Person, Location, Organization and Miscellaneous entities. As representatives for the noisy user-generated text datasets we chose the Broad Twitter Corpus [DBR16] and the WNUT 2017 dataset [DNEL17]. The Broad Twitter Corpus contains 9,551 Tweets with annotations for entities of type Person, Location and Organization. The WNUT 2017 dataset contains 2,295 texts from various sources (Reddit, Twitter, YouTube, and StackExchange comments) with annotations for Person, Location, Corporation, Product, Creative-Work and Group as named entity types. Furthermore, we used the extension of the WNUT 2017 dataset by Al-Nabki [NFAFR20], called “NuToT”. This version adds Darknet market listings that advertise illicit goods.