“Machine Learning Techniques for the Classification of Product Descriptions from Darknet Marketplaces”, 2020-01-29:
Over the past decade, the darknet has created unprecedented opportunities for trafficking in illicit goods, such as weapons and drugs, and it has provided new ways to offer crime as a service. Natural language processing techniques can be applied to find the types of goods that are traded in these markets. In this paper we present the results of evaluating state-of-the-art machine learning methods for the classification of darknet market offers.
Several embeddings, such as GloVe [20], FastText [15], the TensorFlow Universal Sentence Encoder [7], Flair’s contextual string embeddings [2], and term frequency-inverse document frequency (TF-IDF), as well as our domain-specific darknet embedding, were evaluated with a series of machine learning models, such as Random Forest, SVM, Naïve Bayes, and Multilayer Perceptron.
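As a rough illustration of pairing one feature set with several classifiers, the following sketch fits the four model families named above on TF-IDF features using scikit-learn. It is not the authors' setup: the listings, the labels, and all hyperparameters are made up for the example, and TF-IDF stands in for the heavier embeddings, which require pretrained models.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy, made-up listings and labels (assumption: not taken from the paper's corpus).
TEXTS = [
    "9mm pistol with two magazines",
    "semi automatic rifle, discreet shipping",
    "box of 50 rounds 9mm ammunition",
    "hollow point ammunition bulk pack",
    "ebook guide to building a rifle",
    "pdf manual pistol maintenance",
]
LABELS = ["firearm", "firearm", "ammunition", "ammunition", "guide", "guide"]

# Hyperparameters are illustrative assumptions, chosen only to keep the sketch fast.
MODELS = {
    "random_forest": RandomForestClassifier(n_estimators=50, random_state=0),
    "linear_svm": LinearSVC(),
    "naive_bayes": MultinomialNB(),
    "mlp": MLPClassifier(hidden_layer_sizes=(16,), max_iter=500, random_state=0),
}


def fit_all(texts, labels):
    """Fit one TF-IDF + classifier pipeline per model and return them by name."""
    fitted = {}
    for name, model in MODELS.items():
        pipeline = make_pipeline(TfidfVectorizer(), model)
        pipeline.fit(texts, labels)
        fitted[name] = pipeline
    return fitted


FITTED = fit_all(TEXTS, LABELS)
```

Swapping `TfidfVectorizer()` for another feature extractor while keeping the loop unchanged is one way to compare feature sets across the same models.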
To find the best combination of feature set and machine learning model for this task, performance was evaluated on a publicly available collection covering 13 darknet markets with more than 10 million product offers [6]. After extracting the unique advertisements from the corpus, the classifier was trained on the subset of advertisements that contain strings related to weapons. The purpose was to determine how well the classifier can distinguish between different types of advertisements that all appear to be related to weapons according to the keywords they contain.
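The subset construction described above can be sketched as deduplication followed by a keyword filter. This is a minimal sketch under stated assumptions: the keyword list, the normalization rule, and the sample advertisements are hypothetical, since the actual search strings are not given here.

```python
# Hypothetical keyword list (assumption: the paper's actual strings are not listed here).
WEAPON_KEYWORDS = ("gun", "pistol", "rifle", "ammo")


def unique_ads(ads):
    """Drop duplicates after case and whitespace normalization, keeping first occurrences."""
    seen = set()
    result = []
    for ad in ads:
        key = " ".join(ad.lower().split())
        if key not in seen:
            seen.add(key)
            result.append(ad)
    return result


def mentions_weapon(ad, keywords=WEAPON_KEYWORDS):
    """Return True if the ad contains any keyword as a case-insensitive substring."""
    text = ad.lower()
    return any(keyword in text for keyword in keywords)


def weapon_subset(ads):
    """Unique advertisements that match at least one weapon-related keyword."""
    return [ad for ad in unique_ads(ads) if mentions_weapon(ad)]


# Made-up sample advertisements for illustration only.
SAMPLE_ADS = [
    "Glock 19 pistol, brand new",
    "glock 19  pistol, brand new",   # duplicate after normalization
    "Heat gun for shrink wrap",      # keyword match, but not a weapon
    "Premium coffee beans 1kg",
    "Rifle cleaning kit",
]

SUBSET = weapon_subset(SAMPLE_ADS)
```

The "Heat gun" entry passes the filter even though it is not a weapon, which is the kind of keyword-level ambiguity the classifier is meant to resolve.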
The best performance for this classification task was achieved by the Linear Support Vector Machine model with the TensorFlow Universal Sentence Encoder for feature extraction, resulting in a micro-F1 score of 96%.
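For reference, the micro-averaged F1 score pools true positives, false positives, and false negatives over all classes before computing F1. The sketch below uses made-up labels, not the paper's data. For single-label classification, the pooled score equals plain accuracy.

```python
def micro_f1(y_true, y_pred):
    """Micro-averaged F1: pool TP, FP, and FN over all classes, then compute F1."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    tp = fp = fn = 0
    for label in set(y_true) | set(y_pred):
        for truth, pred in zip(y_true, y_pred):
            if pred == label and truth == label:
                tp += 1
            elif pred == label:
                fp += 1
            elif truth == label:
                fn += 1
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


# Made-up labels for illustration: four of the five predictions are correct.
Y_TRUE = ["firearm", "firearm", "ammunition", "other", "other"]
Y_PRED = ["firearm", "other", "ammunition", "other", "other"]
SCORE = micro_f1(Y_TRUE, Y_PRED)
```

Here the one misclassified advertisement counts once as a false negative for "firearm" and once as a false positive for "other", so the score is 2·4 / (2·4 + 1 + 1) = 0.8, matching the accuracy of 4/5.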
[Keywords: Natural language processing, machine learning, text classification, document embedding, darknet market]