Identifying Hidden Buyers in Darknet Markets via Dirichlet Hawkes Process

11/12/2019 ∙ by Panpan Zheng, et al. ∙ University of Arkansas Georgia State University 0

The darknet markets are notorious black markets in cyberspace, which involve selling or brokering drugs, weapons, stolen credit cards, and other illicit goods. To combat illicit transactions in cyberspace, it is important to analyze the behaviors of participants in darknet markets. Currently, many studies focus on the behavior of vendors. However, there is not much work on analyzing buyers. The key challenge is that the buyers are anonymized in darknet markets. For most of the darknet markets, we only observe the first and last digits of a buyer's ID, such as “a**b”. To tackle this challenge, we propose a hidden buyer identification model, called UNMIX, which can group the transactions from one hidden buyer into one cluster given a transaction sequence from an anonymized ID. UNMIX is able to model the temporal dynamics information as well as the product, comment, and vendor information associated with each transaction. As a result, the transactions with similar patterns in terms of time and content group together as the subsequence from one hidden buyer. Experiments on the data collected from three real-world darknet markets demonstrate the effectiveness of our approach measured by various clustering metrics. Case studies on real transaction sequences explicitly show that our approach can group transactions with similar patterns into the same clusters.


Introduction

Darknet markets are online commercial websites that provide strong privacy guarantees to both vendors and buyers. The markets are hosted on the darknet via the Tor service to hide IP addresses, and they adopt cryptocurrencies, such as Bitcoin, as payment methods. Due to this anonymity, most of the transactions on darknet markets involve trading illicit goods, such as illicit drugs, stolen credit cards, or even weapons.

To combat illicit transactions in cyberspace, it is important to analyze the behavior of participants in darknet markets. Currently, many studies focus on the behavior of vendors, such as linking multiple accounts from the same vendor (Sybil accounts) [Tai, Soska, and Christin2019, Zhang et al.2019, Wang et al.2018]. However, there is not much work on analyzing buyers. One of the key challenges is that the buyers are anonymized in darknet markets.

In order to protect the buyers’ privacy and encourage the buyers to publish their comments, most of the darknet markets only reveal the first and last digits of a buyer’s ID, such as “a**b”, on the comment page. As a result, one observed anonymized ID can link to many different real-world buyers. Figure 1 shows an illustrative example of one mixed transaction sequence from the anonymized ID J***e. Each transaction contains information about four attributes: product, date, vendor, and comment in addition to the anonymized buyer name information. Our goal is to group those mixed transactions into clusters based on both content and temporal dynamics such that each cluster contains all transactions from one particular real-world user. Such disambiguation will allow us to learn transaction patterns of darknet markets and predict future transactions.

Figure 1: Illustrative example of transaction sequence from anonymized ID “J***e”, where transactions are from various real-world buyers.

In this work, we propose UNMIX, a hidden buyer identification model, based on the Dirichlet-Hawkes Process (DHP) [Du et al.2015], which is able to group continuous-time transactions for each hidden buyer by modeling the temporal dynamics, product, title, and comment of each transaction. Temporal dynamics here refers to the time patterns of purchase behavior in a darknet market. Each buyer has their own temporal dynamics. For instance, buyer “Jaae” often purchases one kind of heroin on Monday night once a week, while buyer “Jbbe” buys the same drug on Tuesday and Thursday (twice a week). The output of our model is a set of clusters, each of which contains transactions from one specific buyer. The idea of our proposed model is to have the Hawkes process model the intensity rate of transactions, while the Dirichlet process captures the buyer-transaction cluster relationships (i.e., each cluster contains transactions from one buyer). In practice, different buyers may have similar transaction patterns, and even the transaction patterns of the same buyer may change over time. However, in our darknet market scenario, each anonymized transaction sequence (i.e., the sequence observed under a single anonymized ID) consists of only a few real buyers. Based on our observation, these few real buyers tend to have different transaction patterns. With rich information such as transaction comment, time, and vendor, UNMIX, which groups transactions with similar patterns, can therefore identify the hidden buyers.

UNMIX is a novel approach to hidden buyer identification that integrates all the information associated with transactions, including temporal dynamics, products, comments, and vendors. The prior probability of a transaction belonging to a hidden buyer is determined by its temporal dynamics, as different hidden buyers often exhibit different temporal dynamics. Specifically, the Hawkes process, one type of temporal point process, is adopted to model the self-excitation phenomenon among transactions over continuous time (e.g., buying illicit drugs in the past can raise the probability of buying them again in the future). The temporal dynamics of each identified hidden buyer is then characterized by one Hawkes process. Besides the temporal dynamics, the texts in product titles and comments and the vendors involved in transactions are also incorporated into our model, characterized by a multinomial distribution and a categorical distribution, respectively. Meanwhile, by leveraging the Dirichlet process, the model complexity grows as more transactions are collected over time, so our approach does not require the number of hidden buyers in a mixed transaction stream to be known a priori or fixed.

The main contributions of our model are as follows. First, UNMIX does not need to assign a fixed number of hidden buyers underlying the unlimited number of transactions from one anonymized ID. Second, together with transaction content information, the temporal information provides important clues that improve accuracy in identifying hidden buyers in the same darknet markets. Third, experimental results on three real-world darknet markets indicate UNMIX is able to identify the various hidden buyers with different transaction patterns.

Related Work

Darknet Market Analysis

Darknet markets are online markets hosted on the Tor service that guarantee strong anonymity to participants. As a result, darknet markets are often involved in illegal online activities. In the public interest, authorities and researchers have a growing interest in understanding the darknet markets. Researchers have collected a large amount of data from darknet markets to analyze the active vendors, buyers, and goods being sold over time so that we can understand the growth of the darknet market ecosystem [Christin2013, Soska and Christin2015]. [Dittus, Wright, and Graham2018] conduct empirical studies to understand the supply chain underlying the markets. Besides analyzing the volumes of whole darknet markets, some studies analyze specific categories in the darknet markets. For example, [Broséus et al.2016] investigate the structure and organization of illicit drug trafficking. Since the darknet markets have strong correlations with cybercrime, [Van Wegberg et al.2018] focus on measuring the commoditization of cybercrime via darknet markets.

Many researchers target micro-level analysis, which studies the participants in the darknet markets. Due to the anonymity of darknet markets, the challenge in analyzing the behavior of participants is how to link user identities in the markets. Recently, several studies aim to link multiple accounts created by a real-world vendor [Wang et al.2018, Tai, Soska, and Christin2019, Zhang et al.2019]. The key idea of these studies is based on “stylometry” analysis, which was originally used to attribute authorship to anonymous documents. For example, [Zhang et al.2019] link multiple vendors by analyzing the styles of the product pictures and descriptions published by vendors. Unlike vendor matching, which can draw on lengthy product descriptions and photos, the information that can be used for identifying hidden buyers is very limited. To the best of our knowledge, how to identify hidden buyers in the darknet markets has not been studied in the literature.

Sequential Data Clustering

Identifying hidden buyers from a mixed transaction sequence can be viewed as a task for clustering sequential data. The widely-used models for clustering data from the topic modeling literature are the Latent Dirichlet Allocation (LDA) [Blei, Ng, and Jordan2003], where the number of topics is fixed, and its improved model, Hierarchical Dirichlet Process (HDP) [Teh et al.2005] with an unbounded number of topics. Many models are further proposed to fit the scenarios with online streaming text data [Wang, Blei, and Heckerman2008, Ahmed et al.2011, Liang, Yilmaz, and Kanoulas2016]. Recently, several studies further incorporate temporal dynamics to group streaming data [Du et al.2015, Mavroforakis, Valera, and Gomez-Rodriguez2017, Xu and Zha2017, Seonwoo, Oh, and Park2018]. For example, the Dirichlet Hawkes Process adopts the temporal point process, e.g., Hawkes process, to model the continuous-time information and the Dirichlet Process to solve the clustering problems [Du et al.2015]. In our work, we adapt the Dirichlet Hawkes Process for hidden buyer identification using the Hawkes process to model the temporal dynamics, the multinomial distribution to model texts in product titles and comments, and the categorical distribution to model vendors involved in transactions.

Preliminary

Dirichlet Process

The Dirichlet process (DP) is a Bayesian nonparametric model, which is parameterized by a concentration parameter $\alpha > 0$ and a base distribution $G_0$ over a space $\Theta$. It indicates that a random distribution $G$ drawn from DP is a distribution over $\Theta$, denoted as $G \sim DP(\alpha, G_0)$. The expectation of the distribution $G$ is the base distribution $G_0$. The concentration parameter $\alpha$ controls the variance of $G$: a larger $\alpha$ leads to a tighter distribution around $G_0$. DP is widely used for clustering with an unknown number of clusters.

The Dirichlet process can also be represented as the Chinese Restaurant Process (CRP). CRP assumes a restaurant with an infinite number of tables, and each of the tables can seat an infinite number of customers. Within the context of clustering, each table indicates a cluster while each customer is a data point. The simulation process of CRP is as follows:

  1. The first customer always sits at the first table.

  2. Customer $n$ ($n > 1$) sits at:

    1. a new table with probability $\frac{\alpha}{\alpha + n - 1}$.

    2. an existing table $k$ with probability $\frac{m_k}{\alpha + n - 1}$, where $m_k$ is the number of customers at table $k$.

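Let $\theta_{1:n} = \{\theta_1, \ldots, \theta_n\}$ be a sequence sampled from $G$. The conditional distribution of $\theta_n$ can be written as:

$$\theta_n \mid \theta_{1:n-1} \sim \frac{\alpha}{\alpha + n - 1}\, G_0 + \sum_{k} \frac{m_k}{\alpha + n - 1}\, \delta_{\theta_k} \tag{1}$$

where the sum runs over the distinct values $\theta_k$ (tables) among $\theta_{1:n-1}$, $m_k$ is the number of previous samples equal to $\theta_k$, and $\delta_{\theta_k}$ is a point mass centred at $\theta_k$. Equation 1 indicates that a new sample belongs to a new table with probability proportional to the constant $\alpha$, or to an existing table with probability proportional to $m_k$. A larger $m_k$ indicates a higher probability that a customer will belong to the table $k$. Hence, DP has a special clustering property: the rich get richer.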
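The seating rule above can be simulated in a few lines. The following is a minimal sketch, not the authors' code; the function name `crp_assignments` and its parameters are hypothetical, and the concentration parameter is assumed to be strictly positive.

```python
import random


def crp_assignments(n_customers, alpha, seed=0):
    """Simulate table assignments under a Chinese Restaurant Process."""
    rng = random.Random(seed)
    assignments = []  # table index chosen by each customer
    table_sizes = []  # number of customers currently at each table
    for _ in range(n_customers):
        # Existing tables are weighted by their size, a new table by alpha.
        # random.choices normalises the weights, which matches dividing by
        # (alpha + n - 1) in the seating probabilities.
        weights = table_sizes + [alpha]
        k = rng.choices(range(len(weights)), weights=weights)[0]
        if k == len(table_sizes):
            table_sizes.append(1)  # open a new table
        else:
            table_sizes[k] += 1    # join an existing table
        assignments.append(k)
    return assignments, table_sizes
```

Running it with a small concentration value typically yields a few large tables and several small ones, which is the rich-get-richer behaviour described above.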

Temporal Point Process

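A temporal point process is a random process that models the observed random event patterns over time. Given an event time sequence $\mathcal{T} = \{t_1, \ldots, t_n\}$, a temporal point process can be characterized by the conditional intensity function $\lambda^*(t) = \lambda(t \mid \mathcal{H}_t)$, which indicates the expected instantaneous rate of the next event at time $t$ ($t > t_n$):

$$\lambda^*(t)\, dt = \mathbb{E}\big[ N[t, t + dt) \mid \mathcal{H}_t \big] \tag{2}$$

where $N[t, t + dt)$ indicates the number of events that occurred in the time interval $[t, t + dt)$; $\mathcal{H}_t$ is the collection of historical events until time $t$.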

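Let $f^*(t)$ be the conditional density function of the event happening at time $t$ given the historical events up to time $t$, which is defined as

$$f^*(t) = \lambda^*(t)\, S^*(t) \tag{3}$$

where $S^*(t) = \exp\left(-\int_{t_n}^{t} \lambda^*(\tau)\, d\tau\right)$ is the survival function that indicates the probability that no new event has happened up to time $t$ since $t_n$.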

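With an observation window $[0, T]$, the joint likelihood of the observed sequence $\mathcal{T} = \{t_1, \ldots, t_n\}$ is formalized as

$$\mathcal{L}(\mathcal{T}) = \prod_{t_i \in \mathcal{T}} \lambda^*(t_i) \cdot \exp\left(-\int_{0}^{T} \lambda^*(\tau)\, d\tau\right) \tag{4}$$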
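To make Equation 4 concrete, the sketch below evaluates the log of the joint likelihood for a user-supplied intensity function, approximating the integral with the trapezoidal rule. The function name `point_process_loglik`, the grid size, and the assumption that the intensity can be passed as a plain function of time (with any history captured in a closure) are illustrative choices, not part of the paper.

```python
import math


def point_process_loglik(event_times, intensity, t_end, n_grid=10_000):
    """Log-likelihood: sum of log-intensities minus the integrated intensity."""
    log_term = sum(math.log(intensity(t)) for t in event_times)
    # Trapezoidal rule for the integral of the intensity over [0, t_end].
    h = t_end / n_grid
    integral = 0.5 * (intensity(0.0) + intensity(t_end))
    integral += sum(intensity(i * h) for i in range(1, n_grid))
    integral *= h
    return log_term - integral
```

For a constant intensity the result reduces to the familiar homogeneous Poisson form, the number of events times the log rate minus the rate times the window length, which gives a quick sanity check.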

Hawkes process. A Hawkes process is one type of temporal point process, which captures the self-excitation phenomenon among events [Hawkes1971]. In the Hawkes process, the conditional intensity function is defined as:

$$\lambda^*(t) = \lambda_0 + \sum_{t_i < t} \gamma(t, t_i) \tag{5}$$

where $\lambda_0$ is the base intensity that indicates the intensity of events triggered by external signals instead of previous events; $\gamma(t, t_i)$ is the triggering kernel, which is usually a monotonically decreasing function so that recent events have a higher influence on the intensity of the next event. The Hawkes process models the self-excitation phenomenon in which a new event arrival immediately increases the conditional intensity of the oncoming event, which then decays back towards $\lambda_0$ in the long term. Recently, the Hawkes process has been widely used to model clustered event patterns, such as information diffusion on social networks or earthquake occurrences [Zhao et al.2015, Reinhart2017, Farajtabar2018].
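As an illustration of self-excitation, the following sketch evaluates a Hawkes intensity with an exponential triggering kernel. The exponential kernel and the parameter names `base_rate`, `jump`, and `decay` are assumptions made for this example only; the model described later uses Gaussian RBF base kernels instead.

```python
import math


def hawkes_intensity(t, history, base_rate=0.2, jump=0.8, decay=1.0):
    """Base rate plus an exponentially decaying bump for each past event."""
    excitation = sum(
        jump * math.exp(-decay * (t - t_i)) for t_i in history if t_i < t
    )
    return base_rate + excitation
```

Evaluating the function shortly after an event gives a value well above the base rate, and the value decays back towards the base rate as time passes without new events.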

Hidden Buyer Identification

In a darknet market, a buyer purchases products from vendors $\mathcal{V}$, and then publishes comments about the products. Notably, we cannot see the real user names of buyers. Instead, what we can observe are anonymized IDs, each of which may cover an unbounded number of real buyers. Given a series of transactions marked by one specific anonymized ID, our goal is to uncover these real buyers and then group the transactions by buyer. In our scenario, these distinct real buyers are called hidden buyers. Given a series of transactions $E = \{e_1, \ldots, e_N\}$ underlying one specific anonymized ID, its corresponding sequence of real buyers is denoted as $S = \{s_1, \ldots, s_N\}$, with the set of real buyers as $U = \{u_1, \ldots, u_K\}$. Then, the hidden buyer associated with a given event $e_n$ is expressed as $s_n$.

Formally, transaction $n$ in $E$ is denoted as $e_n = (t_n, p_n, v_n, c_n)$, which means that at time $t_n$, a buyer purchases a product $p_n$ from a vendor $v_n \in \mathcal{V}$, where $v_{1:N}$ is the corresponding vendor sequence, and publishes a comment $c_n$. Since product titles and comments are both text information, we further combine them into a content vector $\mathbf{w}_n$ using a bag-of-words model. Finally, we define one transaction in $E$ as $e_n = (t_n, \mathbf{w}_n, v_n)$. Note that since we only observe the time at which a comment is published, we assume in our scenario that purchasing a product and publishing a comment are synchronous.
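The combination of product title and comment into a single bag-of-words vector can be sketched as follows. The whitespace tokenizer, lowercasing, and the function name `content_vector` are simplifying assumptions for illustration; the paper does not specify its preprocessing.

```python
from collections import Counter


def content_vector(title, comment, vocabulary):
    """Count how often each vocabulary word occurs in title plus comment."""
    tokens = (title + " " + comment).lower().split()
    counts = Counter(tokens)
    # Words outside the vocabulary are ignored; missing words count as 0.
    return [counts[word] for word in vocabulary]
```

For example, with the vocabulary `["lsd", "fast", "shipping"]`, the title "LSD 100ug tabs" and the comment "fast shipping fast" map to a vector with one count for "lsd", two for "fast", and one for "shipping".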

To identify the hidden buyers, we assume that different hidden buyers have their own unique hidden transaction patterns. For example, buyer $u_1$ always buys fentanyl from one certain vendor without leaving comments, while buyer $u_2$ often buys fentanyl from the same vendor as well but likes to leave comments. Given this toy example, we ask whether transactions with a similar purchasing pattern are associated with the same hidden buyer. To explore and solve this problem, in this work we aim to unmix the transaction sequence marked by one anonymized ID, and for this goal we propose a novel identification framework named UNMIX.

UNMIX is a Dirichlet process framework implemented as a Chinese restaurant process. In UNMIX, each table encapsulates a marked Hawkes process model, which covers the time and vendor type information, and a bag-of-words model, which covers the textual content information. Here, each table corresponds to a real hidden buyer in our scenario. For one specific transaction, its hidden buyer assignment is drawn from a discrete probability distribution derived from the posterior predictive distribution. The estimated occurrence likelihoods depend on the historical transactions of these hidden buyers. Hence, transactions with similar patterns tend to be assigned to the same hidden buyer, and an oncoming transaction tends to be assigned to the hidden buyer (table) whose previous transactions (restaurant customers) are mostly similar to it.

Modeling Buyer Transactions

From the perspective of features, we consider three categories of information: time, content (product titles and comments) and vendor. Each of them has its own distinctive characteristics and should be captured by different models. For instance, due to the drug addiction effects, once a user starts to purchase illicit drugs, he may keep purchasing constantly in a short period of time. Since the behavior of purchasing drugs is self-exciting, it is natural to adopt the Hawkes process to model the purchasing behavior in terms of time. Meanwhile, vendor type and content information are characterized by categorical and multinomial distributions, respectively. Given the unbounded number of hidden buyers in a dynamic transaction sequence, we adopt the Dirichlet process as a prior probability distribution to model the generation of hidden buyers.

Generally, UNMIX is a hierarchical framework with two layers: in the outer layer, it employs the Dirichlet process to capture the diversity of hidden transaction patterns across distinct hidden buyers; in the inner layer (inside each hidden buyer), it makes use of the Hawkes process, the multinomial distribution, and the categorical distribution to model the time, content, and vendor type information, respectively.

Intensity of the buyer transaction activity. We adopt the Hawkes process to model the buyer transactions over time. In our scenario, the transactions with the same anonymized ID are actually conducted by different hidden buyers. For each hidden buyer, we adopt one Hawkes process to model its temporal information. As a result, the intensity function of the Hawkes process over the whole transaction sequence from all existing hidden buyers is defined as:

$$\lambda(t) = \lambda_0 + \sum_{k=1}^{K} \lambda_k(t) \tag{6}$$

where $K$ is the total number of identified hidden buyers until time $t$, and $\lambda_k(t)$ is the intensity of hidden buyer $u_k$, which can be expressed as follows:

$$\lambda_k(t) = \sum_{t_i \in \mathcal{T}_k,\, t_i < t} \gamma_k(t, t_i) = \sum_{i:\, t_i < t} \gamma_k(t, t_i)\, \mathbb{I}[s_i = k] \tag{7}$$

where $\mathcal{T}_k$ is the corresponding event time sequence of $u_k$; $\gamma_k$ is the triggering kernel associated with hidden buyer $u_k$; $s_i$ is the index of the hidden buyer associated with the $i$-th transaction, and $s_i = k$ denotes that the $i$-th transaction has been assigned to the $k$-th buyer in the Chinese restaurant process. Here, the triggering kernel function with $L$ base kernel functions takes the form $\gamma_k(t, t_i) = \sum_{l=1}^{L} \alpha_k^l\, \kappa(\tau_l, t - t_i)$, where $\alpha_k^l$ controls the self-excitation of the Hawkes process with $\alpha_k^l \ge 0$, and $\tau_l$ is a typical reference time point that controls the event decay. We adopt the Gaussian RBF kernel as the base kernel function $\kappa$.
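The per-buyer intensity in Equation 7 can be sketched as a sum of kernel evaluations over the past transactions assigned to that buyer. The code below is a sketch under assumptions: the Gaussian RBF is written in its normalised density form, each base kernel has its own bandwidth `sigmas[l]`, and all function and parameter names are hypothetical.

```python
import math


def rbf(tau, sigma, delta):
    """Gaussian RBF base kernel centred at the reference time tau."""
    norm = math.sqrt(2 * math.pi * sigma ** 2)
    return math.exp(-((delta - tau) ** 2) / (2 * sigma ** 2)) / norm


def trigger_kernel(delta, weights, taus, sigmas):
    """Weighted sum of base kernels evaluated at the time gap delta."""
    return sum(a * rbf(tau, s, delta) for a, tau, s in zip(weights, taus, sigmas))


def buyer_intensity(t, times, assignments, k, weights, taus, sigmas):
    """Sum the kernel over past transactions assigned to hidden buyer k."""
    return sum(
        trigger_kernel(t - t_i, weights, taus, sigmas)
        for t_i, s_i in zip(times, assignments)
        if t_i < t and s_i == k
    )
```

With reference time points such as one day and one week (in days), a buyer who purchases weekly receives a bump in intensity roughly seven days after each purchase, which is how the mixture of base kernels can encode periodic habits.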

Distribution of content information (product titles and comments). Since both product titles and comments are text information, we represent them with a bag-of-words language model. We call the product title and comment in a transaction, taken together, the content of the transaction. As a result, we use a vector $\mathbf{w}_n$ to represent the content in transaction $e_n$, where each dimension refers to the frequency of the corresponding word sampled from a vocabulary $\mathcal{W}$. In particular, when provided by hidden buyer $u_k$, $\mathbf{w}_n$ is described as follows:

$$\mathbf{w}_n \sim \text{Multinomial}(\theta_k) \tag{8}$$

where $\theta_k$ is the prior of the multinomial distribution with size $|\mathcal{W}|$, which indicates the occurrence likelihood of each word in the content given the hidden buyer $u_k$.

Distribution of vendors. In this work, we use the vendor ID to indicate each vendor. Due to its discreteness, at each time $t_n$, the vendor type $v_n$ is sampled from a categorical distribution with sample space size $|\mathcal{V}|$:

$$v_n \sim \text{Categorical}(\phi_k) \tag{9}$$

where $\phi_k$ is the prior of the categorical distribution with size $|\mathcal{V}|$, which refers to the occurrence probability of each vendor type given the hidden buyer $u_k$.

Figure 2: Graphical representation of UNMIX

The Generative Process

We can describe our model as a generative process similar to the CRP. At time $t_n$, the oncoming transaction $e_n$ may be from either a new buyer or an existing buyer. To assign a hidden buyer to event $e_n$, our proposed framework UNMIX, which runs on a Dirichlet process, dynamically reuses an existing hidden buyer or generates a new one to accommodate the upcoming event $e_n$. Concretely, the hidden buyer $s_n$ of the oncoming event $e_n$ can be chosen in a Metropolis sampling-based way:

$$p(s_n = k \mid t_n, \text{rest}) = \begin{cases} \dfrac{\lambda_k(t_n)}{\lambda_0 + \sum_{k'=1}^{K} \lambda_{k'}(t_n)}, & k = 1, \ldots, K \text{ (existing buyer)} \\[2ex] \dfrac{\lambda_0}{\lambda_0 + \sum_{k'=1}^{K} \lambda_{k'}(t_n)}, & k = K + 1 \text{ (new buyer)} \end{cases} \tag{10}$$

where $K$ is the number of existing hidden buyers up to but not including time $t_n$; $\lambda_k(t_n)$ indicates the intensity of the Hawkes process for the hidden buyer $u_k$ defined in Equation 7. We can notice that $\lambda_0$ plays a role similar to the concentration parameter $\alpha$ in DP, and the probability of $e_n$ belonging to $u_k$ is proportional to the intensity function of the corresponding Hawkes process.
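The assignment rule in Equation 10 amounts to normalising the base intensity together with the per-buyer intensities. A minimal sketch, with hypothetical function names and 0-based buyer indices, might look like this:

```python
import random


def assignment_probs(base_rate, buyer_intensities):
    """Return [p(buyer 1), ..., p(buyer K), p(new buyer)]."""
    total = base_rate + sum(buyer_intensities)
    probs = [lam / total for lam in buyer_intensities]
    probs.append(base_rate / total)  # last entry: open a new hidden buyer
    return probs


def sample_buyer(base_rate, buyer_intensities, rng):
    """Draw an index; the value K (0-based) means a new hidden buyer."""
    probs = assignment_probs(base_rate, buyer_intensities)
    return rng.choices(range(len(probs)), weights=probs)[0]
```

For instance, with a base intensity of 1 and two existing buyers whose current intensities are 3 and 4, the probabilities are 3/8 and 4/8 for the existing buyers and 1/8 for a new one, so a recently active buyer is the most likely owner of the next transaction.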

The generative process is shown in Algorithm 1, where $\lambda_0$ is the base intensity, $\alpha_0$ is the initial parameter setting of the triggering kernels in Equation 7, and $\phi_0$ and $\theta_0$ are the initial priors for the categorical and multinomial distributions, respectively. Line 2 samples the time $t_n$ via a Hawkes process. Based on the temporal dynamics of historical events, line 3 chooses a hidden buyer for the current event at time $t_n$. Lines 4 to 8 show how the $\phi$s and $\theta$s are updated (reused for an existing buyer, or freshly sampled for a new one) before the vendor type and content information of the current event are sampled. Given these priors, lines 9 and 10 draw the corresponding content and vendor type information.

Algorithm 1 The generative process of UNMIX
Input: $\lambda_0$, $\alpha_0$, $\phi_0$, $\theta_0$
Output: $E = \{e_1, \ldots, e_N\}$, where $N$ is the total number of transactions produced by the generative process.
1   for $n = 1, \ldots, N$ do
2       Sample the time $t_n$ from the Hawkes process with intensity $\lambda(t)$ (Eq. 6);
3       Sample the hidden buyer $s_n$ for the transaction at time $t_n$ by Eq. 10;
4       if $s_n \le K$ then
5           Reuse $\theta_{s_n}$ and $\phi_{s_n}$ for $\mathbf{w}_n$ and $v_n$;
6       else
7           Sample $\alpha_{s_n}$ from $Dir(\alpha_0)$, $\theta_{s_n}$ from $Dir(\theta_0)$, and $\phi_{s_n}$ from $Dir(\phi_0)$ for the new buyer;
8       end if
9       Sample each word in the content $\mathbf{w}_n$ of transaction $e_n$ by Eq. 8;
10      Sample the vendor $v_n$ of transaction $e_n$ by Eq. 9;
11  end for

Inference

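Given a sequence of transactions $e_{1:n}$ from an anonymized ID, we aim to infer the hidden buyer $s_n$ (hidden transaction pattern) of the oncoming transaction $e_n$. We adopt a Sequential Monte Carlo (SMC) algorithm to sample the hidden buyer $s_n$ associated with each transaction $e_n$. SMC adopts a set of particles to approximate the posterior distribution $p(s_{1:n} \mid e_{1:n})$, in which $p(s_n \mid s_{1:n-1}, e_{1:n})$ is taken as the proposal distribution. In particular, based on Figure 2, the posterior distribution at time $t_n$ can be factorized as

$$p(s_n \mid \mathbf{w}_n, v_n, t_n, \text{rest}) \propto p(\mathbf{w}_n \mid s_n, \text{rest})\; p(v_n \mid s_n, \text{rest})\; p(s_n \mid t_n, \text{rest}) \tag{11}$$

where "rest" denotes all other latent variables and earlier observations. In Equation 11, the prior $p(s_n \mid t_n, \text{rest})$ is given by:

$$p(s_n = k \mid t_n, \text{rest}) = \begin{cases} \dfrac{\lambda_k(t_n)}{\lambda_0 + \sum_{k'=1}^{K} \lambda_{k'}(t_n)}, & k = 1, \ldots, K \\[2ex] \dfrac{\lambda_0}{\lambda_0 + \sum_{k'=1}^{K} \lambda_{k'}(t_n)}, & k = K + 1 \end{cases} \tag{12}$$

where $\lambda_k(t_n)$ indicates the intensity from buyer $u_k$. For the inference of $\alpha_k$, which is used to parameterize the triggering kernels in the intensity function, we follow the literature [Cappe, Godsill, and Moulines2007, Carvalho et al.2010] and update $\alpha_k$ by maximum likelihood estimation (Equation 4).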

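Based on the conjugate relation between the multinomial and Dirichlet distributions, the likelihood of the content distribution is:

$$p(\mathbf{w}_n \mid s_n = k, \text{rest}) = \frac{\Gamma\left(C^{k \setminus n} + \sum_{w} \theta_{0,w}\right)}{\Gamma\left(C^{k \setminus n} + C^{n} + \sum_{w} \theta_{0,w}\right)} \prod_{w \in \mathcal{W}} \frac{\Gamma\left(C_w^{k \setminus n} + C_w^{n} + \theta_{0,w}\right)}{\Gamma\left(C_w^{k \setminus n} + \theta_{0,w}\right)} \tag{13}$$

where $C^{k \setminus n}$ and $C_w^{k \setminus n}$ indicate the total word count and the count of word $w$ in the content from buyer $u_k$ excluding $\mathbf{w}_n$, respectively; $C^{n}$ and $C_w^{n}$ refer to the total word count and the count of word $w$ in content $\mathbf{w}_n$, respectively; $\theta_{0,w}$ is the value in the Dirichlet prior for word $w$.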
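The content likelihood in Equation 13 is the standard Dirichlet-multinomial posterior predictive, which is most safely computed in log space. The sketch below assumes that form, omits the multinomial coefficient (which is the same for every candidate buyer and so does not affect the comparison), and uses the hypothetical name `content_loglik` with plain lists of counts aligned to the vocabulary.

```python
import math


def content_loglik(doc_counts, buyer_counts, prior):
    """Log predictive probability of a new content vector for one buyer."""
    c_buyer = sum(buyer_counts)  # words already assigned to this buyer
    c_doc = sum(doc_counts)      # words in the new transaction
    prior_sum = sum(prior)
    ll = math.lgamma(c_buyer + prior_sum)
    ll -= math.lgamma(c_buyer + c_doc + prior_sum)
    for c_kw, c_nw, theta in zip(buyer_counts, doc_counts, prior):
        ll += math.lgamma(c_kw + c_nw + theta) - math.lgamma(c_kw + theta)
    return ll
```

With a two-word vocabulary and a uniform prior of 1, a one-word document has probability 1/2 under a buyer with no history, and probability 6/7 under a buyer who has already used that same word five times, so past vocabulary strongly pulls similar content into the same cluster.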

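Similarly, the likelihood of the vendor distribution is:

$$p(v_n = v \mid s_n = k, \text{rest}) = \frac{M_v^{k \setminus n} + \phi_{0,v}}{M^{k \setminus n} + \sum_{v' \in \mathcal{V}} \phi_{0,v'}} \tag{14}$$

where $M_v^{k \setminus n}$ is the count of the vendor type $v$ from buyer $u_k$ excluding the current vendor $v_n$; $M^{k \setminus n}$ is the total number of vendors associated with the buyer $u_k$ excluding the current vendor $v_n$; $\phi_{0,v}$ is the value in the Dirichlet prior for vendor $v$.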
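The vendor likelihood follows the usual Dirichlet-categorical predictive rule: smoothed count divided by smoothed total. A minimal sketch, assuming dictionary inputs keyed by vendor ID and the hypothetical name `vendor_prob`:

```python
def vendor_prob(vendor, buyer_vendor_counts, prior):
    """Predictive probability that this buyer's next vendor is `vendor`."""
    total = sum(buyer_vendor_counts.values())
    prior_sum = sum(prior.values())
    count = buyer_vendor_counts.get(vendor, 0)
    return (count + prior[vendor]) / (total + prior_sum)
```

For a buyer with three past purchases from vendor "a" and a uniform prior of 1 over two vendors, the probability of another purchase from "a" is 4/5 and from "b" is 1/5.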

Experiments

Datasets and Baselines.

| Darknet Markets | Vendors | Anonymized Buyer IDs | Transactions |
|---|---|---|---|
| Wall Street Market | 440 | 1896 | 18603 |
| Empire Market | 273 | 1492 | 12937 |
| Dream Market | 606 | 2587 | 102378 |

Table 1: Statistics of three darknet markets
Figure 3: Distributions of transaction numbers conducted by anonymized IDs over three darknet markets

Datasets. To evaluate our approach, we have crawled the data from three popular darknet markets, i.e., Dream Market, Wall Street Market, and Empire Market. The statistics of the crawled darknet markets are shown in Table 1. Figure 3 further shows the distributions of transaction numbers over anonymized IDs in three darknet markets. Overall, it is a long-tail distribution, which indicates most of the anonymized IDs only conduct a small number of transactions.

Note that in the Dream Market, the buyers comment to the vendors instead of products. Hence, for the Dream Market, we only adopt the texts from comments as the content information.

Baselines. We compare our approach with two baselines.

  • Hierarchical Dirichlet Process (HDP) is a nonparametric Bayesian approach for topic modeling [Teh et al.2005]. We adopt DBSCAN to group the transactions, each of which is represented as the corresponding topic distribution. HDP only considers the information of product titles and buyer comments.

  • Dirichlet Hawkes Process (DHP) is a simplified version of our approach which does not adopt the vendor information for clustering.

Experiments on Transaction Sequences with Ground-truth

Experimental setup. Due to the anonymity of darknet markets, it is infeasible to obtain the ground-truth regarding the actual buyers behind the same anonymized ID. To quantify the performance of our proposed approach, we propose a procedure to generate transaction sequences with ground-truth. Specifically, based on our observations, the transactions conducted by one anonymized ID with one vendor in a short time have a high chance of coming from one real-world buyer due to the consistent transaction behavior.

Therefore, for each darknet market, we first select anonymized IDs, where each anonymized ID has around five to eight transactions from one vendor in a month. Then, we combine all the transactions from these anonymized IDs to compose one transaction sequence and sort the sequence by transaction time. Hence, in this setting, we generate one transaction sequence for each darknet market, and each transaction sequence actually comes from various anonymized IDs. The goal of this task is to group the transactions from one anonymized ID into one cluster. The statistics of transaction sequences with ground-truth are shown in Table 2.

We evaluate the performance by four clustering metrics: adjusted Rand score (ARS), normalized mutual information score (NMI), V-measure score (V-score), and homogeneity score (H-score). These metrics are computed by comparing the predicted clusters with the ground-truth labels.
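To illustrate how two of these metrics behave, the sketch below computes homogeneity, completeness, and the V-measure from their standard entropy-based definitions (V-measure as the harmonic mean of the other two). It is an illustrative pure-Python version under the assumption that the standard definitions are intended, not the evaluation code used in the experiments.

```python
import math
from collections import Counter


def _entropy(labels):
    n = len(labels)
    return -sum(c / n * math.log(c / n) for c in Counter(labels).values())


def _conditional_entropy(labels, given):
    """Entropy of `labels` given `given`, from the joint label counts."""
    n = len(labels)
    joint = Counter(zip(given, labels))
    given_counts = Counter(given)
    return -sum(
        c / n * math.log(c / given_counts[g]) for (g, _), c in joint.items()
    )


def homogeneity_completeness_v(truth, pred):
    h_truth, h_pred = _entropy(truth), _entropy(pred)
    hom = 1.0 if h_truth == 0 else 1.0 - _conditional_entropy(truth, pred) / h_truth
    com = 1.0 if h_pred == 0 else 1.0 - _conditional_entropy(pred, truth) / h_pred
    v = 0.0 if hom + com == 0 else 2 * hom * com / (hom + com)
    return hom, com, v
```

Splitting one true group into two predicted clusters keeps homogeneity at 1 (every cluster is pure) while lowering completeness and therefore the V-measure. This is why a method that over-segments can still report a perfect H-score.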

| | Wall Street Market | Empire Market | Dream Market |
|---|---|---|---|
| Sequence length | 42 | 188 | 229 |
| # of anonymized IDs | 6 | 27 | 36 |

Table 2: Statistics of sequences with ground-truth

Experimental results. Table 3 shows the clustering results on the transaction sequences. Overall, by incorporating the content, vendor, and time information for hidden buyer identification, our proposed approach achieves good performance in terms of the various clustering metrics. On Empire Market and Dream Market, both baselines perform worse than our proposed approach, which indicates that omitting vendor or temporal dynamics information can damage the performance of hidden buyer identification. Meanwhile, we can observe that the performance of all three approaches drops as the sequences become more complicated. For example, our approach achieves its highest scores on Wall Street Market and its lowest scores on Dream Market. First, this is because the sequence of Wall Street Market is simple, consisting of subsequences from only 6 anonymized IDs, while the sequence of Dream Market consists of subsequences from 36 anonymized IDs. Moreover, for Dream Market, we only observe the comments as content information. Not using the texts in product titles can damage the clustering performance.

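| Market | Approach | ARS | NMI | V-score | H-score | # of IDs |
|---|---|---|---|---|---|---|
| Wall Street Market | HDP | 0.1612 | 0.3380 | 0.3316 | 0.2777 | 3 |
| Wall Street Market | DHP | 0.9675 | 0.9804 | 0.9802 | 0.9999 | 7 |
| Wall Street Market | Our approach | 0.9385 | 0.9627 | 0.9621 | 1.000 | 8 |
| Empire Market | HDP | 0.0422 | 0.2537 | 0.2197 | 0.1464 | 7 |
| Empire Market | DHP | 0.4874 | 0.8282 | 0.8281 | 0.8192 | 41 |
| Empire Market | Our approach | 0.5236 | 0.8549 | 0.8549 | 0.8588 | 44 |
| Dream Market | HDP | 0.0215 | 0.3127 | 0.2773 | 0.1896 | 10 |
| Dream Market | DHP | 0.1391 | 0.6171 | 0.6151 | 0.5697 | 45 |
| Dream Market | Our approach | 0.1831 | 0.6881 | 0.6878 | 0.6707 | 59 |

Table 3: Results of hidden buyer identification on transaction sequences with ground-truth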

We can notice that for Wall Street Market, DHP achieves slightly better performance than our proposed approach. This is because the number of hidden buyers identified by DHP is closer to the ground-truth number. However, we argue that although we combine the sequences from different anonymized IDs to compose the sequence with ground-truth, such a sequence is only weakly labeled, since the short sequence from one anonymized ID could actually come from various hidden buyers. Based on our observation, our approach groups the subsequence from one anonymized ID into three clusters. However, these three hidden buyers do not share any common words in product titles and comments, which indicates that the identified hidden buyers have different patterns: they buy different products and have different comment styles. From the content aspect, the three hidden buyers identified by our approach have a high chance of being three different real-world buyers. Hence, the identification result of our approach on Wall Street Market is also reasonable.

Visualization. We further visualize the results on the transaction sequence from Wall Street Market to illustrate the effectiveness of the proposed approach. We investigate our approach for hidden buyer identification from the content and temporal dynamics aspects. To show the content information, we select the top 15 words from each predicted hidden buyer (cluster) and compare the word distributions between the ground-truth and predicted hidden buyers. Figures 4(a) and 4(b) show the word distributions of the sequence from Wall Street Market. Each color indicates the word distribution of one anonymized ID, while each bar indicates one word. We can observe that the word distributions of predicted hidden buyers are very close to those of the ground-truth, which indicates the importance of adopting content information for hidden buyer identification.

To show the temporal dynamics, we plot the intensity values of six identified hidden buyers over time (Figure 4(c)). For the other two identified hidden buyers, since each of them only has one transaction, we omit their intensity curves for simplicity. We can observe that these six hidden buyers are active in different months. Meanwhile, due to the self-excitation property of the Hawkes process, once a transaction occurs, the intensity increases. Hence, when a hidden buyer becomes active, the following transactions have a high chance of coming from the same hidden buyer based on Equation 10.

Figure 4: Word and intensity distributions of the transaction sequence from Wall Street Market, with panels (a) Ground-truth, (b) Predicted, and (c) Intensity. Each color indicates a hidden user detected by the proposed approach.

Experiments on Transaction Sequences without Ground-truth

Experimental setup. In this experiment, we apply our algorithm to the transaction sequences without ground-truth from various anonymized IDs. For each darknet market, we select anonymized IDs with at least 50 transactions. We then have 28, 16, and 579 anonymized IDs for Wall Street Market, Empire Market, and Dream Market, respectively.

We adopt the Silhouette coefficient (Silhouette) and the topic coherence to measure the consistency of clustering results [Rousseeuw1987, Röder, Both, and Hinneburg2015]. Both of these metrics evaluate the clustering performance without ground-truth. Originally, topic coherence evaluates topic models via top-k topic words. In this work, we extract the top-k frequent words from each cluster and evaluate their coherence. If the transactions in a cluster have high coherence in product titles and comments, we can then reasonably consider the cluster of transactions to be conducted by one hidden buyer. The topic coherence metric is implemented with Gensim (https://radimrehurek.com/gensim/index.html). For the Silhouette coefficient, we use the word distribution as the feature vector for each transaction. We report the mean value of each metric over the anonymized IDs in each market.
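The step of extracting the top-k frequent words from each cluster, which feeds the coherence computation, can be sketched as follows. The whitespace tokenizer and the function name `top_k_words` are assumptions for illustration; the coherence score itself would then be computed on these word lists by a separate library.

```python
from collections import Counter, defaultdict


def top_k_words(documents, cluster_labels, k=3):
    """Return the k most frequent words of each cluster."""
    counts = defaultdict(Counter)
    for text, label in zip(documents, cluster_labels):
        counts[label].update(text.lower().split())
    return {
        label: [word for word, _ in c.most_common(k)]
        for label, c in counts.items()
    }
```

Each cluster's word list then plays the role that a topic's top words play in a conventional topic model.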

Market              Approach      Silhouette  Topic coherence
Wall Street Market  HDP           0.4941      -0.0940
Wall Street Market  DHP           0.7361      -0.0040
Wall Street Market  Our approach  0.7668       0.0063
Empire Market       HDP           0.4749      -0.0784
Empire Market       DHP           0.6659      -0.0154
Empire Market       Our approach  0.6726       0.0026
Dream Market        HDP           0.5367       0.1101
Dream Market        DHP           0.6667       0.1202
Dream Market        Our approach  0.6735       0.1439
Table 4: Results of hidden buyer identification on transaction sequences without ground-truth

Experimental results. As shown in Table 4, our proposed approach achieves the best performance in terms of both topic coherence and Silhouette coefficient. Specifically, compared with DHP, which does not use the vendor information, our approach achieves higher topic coherence and Silhouette coefficient, which indicates the usefulness of incorporating vendor information for hidden buyer identification. HDP has the lowest topic coherence and Silhouette coefficient among the three approaches on all three datasets. This could be because HDP does not use the temporal dynamics information. Since the intensity of the transactions is critical for hidden buyer identification, ignoring the temporal information could lead to poor performance.

Figure 5: The hidden buyers identified from an anonymized ID “s**y” in Empire Market. The first and second rows show the frequent words in product titles and comments associated with each hidden buyer, respectively. The third row shows the intensity of each hidden buyer, while the bottom row shows the distribution of transactions over time.

Case study. Figure 5 shows an instance of hidden buyer identification. Given a transaction sequence with 94 transactions from an anonymized ID “s**y” in Empire Market, our proposed approach detects 22 hidden buyers. We show the top 3 hidden buyers with the most transactions, i.e., 21, 13, and 13. The first and second rows show the top words in product titles and comments from these 3 hidden buyers, respectively. The third row shows the intensity values of the 3 hidden buyers. The last row shows the distribution of transactions from “s**y” over time. We can observe that the transactions from these 3 hidden buyers are roughly spread out over time. In particular, the time ranges of transactions from Buyer I, II, and III are 2018-09-14 to 2018-11-03, 2018-10-24 to 2019-01-22, and 2018-12-13 to 2019-02-14, respectively. The intensities of the three identified hidden buyers are also concentrated in these time ranges. Meanwhile, the products bought by the three buyers are different. For example, Buyer I buys products with the frequent word “HEINEKEN” in titles, while Buyer II and III buy products with the frequent words “LSD” and “Ketamine”, respectively. Although “Ketamine” appears in the titles of products bought by both Buyer I and III, Buyer I and Buyer III have different comment styles, i.e., Buyer I prefers to write detailed comments while Buyer III usually does not comment on the products. Moreover, Buyer I and III are active in different months. Overall, the three hidden buyers detected from the anonymized ID “s**y” differ in timing, products, or comment style.
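The kind of per-buyer summary used in this case study (largest clusters, their time ranges, and their frequent title words) can be sketched as follows. The record layout (`cluster`, `date`, `title` keys), the function name, and the toy records are hypothetical and are not taken from the collected data.

```python
from collections import Counter, defaultdict
from datetime import date

def summarize_clusters(transactions, top_n=3, top_k=2):
    """Group transactions by cluster label and describe the top_n largest clusters."""
    groups = defaultdict(list)
    for tx in transactions:
        groups[tx["cluster"]].append(tx)
    # Rank clusters by number of transactions, largest first.
    largest = sorted(groups.items(), key=lambda kv: len(kv[1]), reverse=True)[:top_n]
    summary = []
    for label, txs in largest:
        words = Counter(w.lower() for tx in txs for w in tx["title"].split())
        summary.append({
            "cluster": label,
            "count": len(txs),
            "first": min(tx["date"] for tx in txs),
            "last": max(tx["date"] for tx in txs),
            "top_words": [w for w, _ in words.most_common(top_k)],
        })
    return summary

# Hypothetical clustered transactions for one anonymized ID.
toy = [
    {"cluster": "A", "date": date(2020, 1, 5), "title": "LSD tabs 100ug"},
    {"cluster": "A", "date": date(2020, 2, 9), "title": "LSD blotter"},
    {"cluster": "B", "date": date(2020, 3, 1), "title": "Ketamine 1g"},
]
summary = summarize_clusters(toy)
```

Each summary entry then gives the transaction count, the first and last transaction dates, and the most frequent title words of one identified hidden buyer.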

Conclusions

In this paper, we have proposed UNMIX for hidden buyer identification in darknet markets. Because the number of hidden buyers is not known in advance, UNMIX adopts the Dirichlet process to group the transactions from one hidden buyer into a cluster. To capture the behavior of different hidden buyers, UNMIX uses the Hawkes process to model the transaction time information, the multinomial distribution to model the text information in product titles and comments, and the categorical distribution to model the vendors involved in the transactions. Experimental results on three real-world darknet markets show that UNMIX achieves the best performance for hidden buyer identification. The case studies also indicate that different hidden buyers identified by UNMIX behave differently. In the future, we plan to study how to incorporate buyer ratings into our current framework to further improve the performance of hidden buyer identification. We also plan to investigate linking hidden buyers across different darknet markets.

Acknowledgments

This work was supported in part by NSF (1564250, 1937010) and the Department of Energy (DE-OE0000779).

References

  • [Ahmed et al.2011] Ahmed, A.; Ho, Q.; Teo, C. H.; Eisenstein, J.; Smola, A.; and Xing, E. 2011. Online inference for the infinite topic-cluster model: Storylines from streaming text. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, 101–109.
  • [Blei, Ng, and Jordan2003] Blei, D. M.; Ng, A. Y.; and Jordan, M. I. 2003. Latent dirichlet allocation. Journal of Machine Learning Research 3(Jan):993–1022.
  • [Broséus et al.2016] Broséus, J.; Rhumorbarbe, D.; Mireault, C.; Ouellette, V.; Crispino, F.; and Décary-Hétu, D. 2016. Studying illicit drug trafficking on darknet markets: structure and organisation from a canadian perspective. Forensic science international 264:7–14.
  • [Cappe, Godsill, and Moulines2007] Cappe, O.; Godsill, S. J.; and Moulines, E. 2007. An overview of existing methods and recent advances in sequential monte carlo. Proceedings of the IEEE 95(5):899–924.
  • [Carvalho et al.2010] Carvalho, C. M.; Johannes, M. S.; Lopes, H. F.; and Polson, N. G. 2010. Particle learning and smoothing. Statistical Science 25(1):88–106.
  • [Christin2013] Christin, N. 2013. Traveling the silk road: A measurement analysis of a large anonymous online marketplace. In Proceedings of the 22nd international conference on World Wide Web, 213–224. ACM.
  • [Dittus, Wright, and Graham2018] Dittus, M.; Wright, J.; and Graham, M. 2018. Platform criminalism: The 'last-mile' geography of the darknet market supply chain. In Proceedings of the 2018 World Wide Web Conference, 277–286. International World Wide Web Conferences Steering Committee.
  • [Du et al.2015] Du, N.; Farajtabar, M.; Ahmed, A.; Smola, A. J.; and Song, L. 2015. Dirichlet-hawkes processes with applications to clustering continuous-time document streams. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 219–228.
  • [Farajtabar2018] Farajtabar, M. 2018. Point process modeling and optimization of social networks. Ph.D. Dissertation.
  • [Hawkes1971] Hawkes, A. G. 1971. Spectra of some self-exciting and mutually exciting point processes. Biometrika 58(1):83–90.
  • [Liang, Yilmaz, and Kanoulas2016] Liang, S.; Yilmaz, E.; and Kanoulas, E. 2016. Dynamic clustering of streaming short documents. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 995–1004.
  • [Mavroforakis, Valera, and Gomez-Rodriguez2017] Mavroforakis, C.; Valera, I.; and Gomez-Rodriguez, M. 2017. Modeling the dynamics of learning activity on the web. In Proceedings of the 26th International Conference on World Wide Web, 1421–1430.
  • [Reinhart2017] Reinhart, A. 2017. A review of self-exciting spatio-temporal point processes and their applications. arXiv:1708.02647 [stat].
  • [Röder, Both, and Hinneburg2015] Röder, M.; Both, A.; and Hinneburg, A. 2015. Exploring the space of topic coherence measures. In Proceedings of the Eighth ACM International Conference on Web Search and Data Mining, 399–408.
  • [Rousseeuw1987] Rousseeuw, P. J. 1987. Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. Journal of computational and applied mathematics 20:53–65.
  • [Seonwoo, Oh, and Park2018] Seonwoo, Y.; Oh, A.; and Park, S. 2018. Hierarchical dirichlet gaussian marked hawkes process for narrative reconstruction in continuous time domain. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, 3316–3325.
  • [Soska and Christin2015] Soska, K., and Christin, N. 2015. Measuring the longitudinal evolution of the online anonymous marketplace ecosystem. In 24th USENIX Security Symposium (USENIX Security 15), 33–48.
  • [Tai, Soska, and Christin2019] Tai, X. H.; Soska, K.; and Christin, N. 2019. Adversarial matching of dark net market vendor accounts. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 1871–1880.
  • [Teh et al.2005] Teh, Y. W.; Jordan, M. I.; Beal, M. J.; and Blei, D. M. 2005. Sharing clusters among related groups: Hierarchical dirichlet processes. In Advances in neural information processing systems, 1385–1392.
  • [Van Wegberg et al.2018] Van Wegberg, R.; Tajalizadehkhoob, S.; Soska, K.; Akyazi, U.; Ganan, C. H.; Klievink, B.; Christin, N.; and Van Eeten, M. 2018. Plug and prey? measuring the commoditization of cybercrime via online anonymous markets. In 27th USENIX Security Symposium (USENIX Security 18), 1009–1026.
  • [Wang et al.2018] Wang, X.; Peng, P.; Wang, C.; and Wang, G. 2018. You are your photographs: Detecting multiple identities of vendors in the darknet marketplaces. In Proceedings of the 2018 on Asia Conference on Computer and Communications Security, 431–442.
  • [Wang, Blei, and Heckerman2008] Wang, C.; Blei, D.; and Heckerman, D. 2008. Continuous time dynamic topic models. In Proceedings of the Twenty-Fourth Conference on Uncertainty in Artificial Intelligence, 579–586.
  • [Xu and Zha2017] Xu, H., and Zha, H. 2017. A dirichlet mixture model of hawkes processes for event sequence clustering. In Advances in Neural Information Processing Systems, 1354–1363.
  • [Zhang et al.2019] Zhang, Y.; Fan, Y.; Song, W.; Hou, S.; Ye, Y.; Li, X.; Zhao, L.; Shi, C.; Wang, J.; and Xiong, Q. 2019. Your style your identity: Leveraging writing and photography styles for drug trafficker identification in darknet markets over attributed heterogeneous information network. 3448–3454.
  • [Zhao et al.2015] Zhao, Q.; Erdogdu, M. A.; He, H. Y.; Rajaraman, A.; and Leskovec, J. 2015. Seismic: A self-exciting point process model for predicting tweet popularity. In KDD.