Under the Underground: Predicting Private Interactions in Underground Forums

by Rebekah Overdorf, et al.
Drexel University
NYU college

Underground forums where users discuss, buy, and sell illicit services and goods facilitate a better understanding of the economy and organization of cybercriminals. Prior work has shown that private interactions in particular provide a wealth of information about the cybercriminal ecosystem. Yet those messages are seldom available to analysts, except when there is a leak. To address this problem we propose a supervised machine learning based method able to predict which public posts will generate private messages, after a partial leak of such messages has occurred. To the best of our knowledge, we are the first to develop a solution to overcome the barrier posed by limited to no information on private activity for underground forum analysis. Additionally, we propose an automated method for labeling posts, significantly reducing the cost of our approach in the presence of real unlabeled data. This method can be tuned to focus on the likelihood of users receiving private messages, or on posts triggering private interactions. We evaluate the performance of our methods using data from three real forum leaks. Our results show that public information can indeed be used to predict private activity, although prediction models do not transfer well between forums. We also find that neither the length of the leak period nor the time between the leak and the prediction has a significant impact on our technique's performance, and that NLP features dominate the prediction power.




1. Introduction

Underground forums facilitate connections between criminals, providing them with a platform to buy, sell, and trade illicit goods and services, such as stolen information, account hacking services, and hacking tools. These forums also enable cybercriminals to discuss illicit topics and share information, providing an ecosystem in which criminals can decompose large tasks, such as profiting from spam (Levchenko et al., 2011), into smaller subtasks that individuals or organizations can focus on solving (Thomas et al., 2015). Much like in the legitimate business world, this subdividing of responsibilities results in increased efficacy and innovation.

Understanding the structure of these forums is essential to comprehending the cybercriminal pipeline and how these organizations and individuals behave (Thomas et al., 2015). This has been the focus of significant research (Franklin et al., 2007; Thomas et al., 2015; Motoyama et al., 2011; Holt et al., 2012) as well as of several industrial “threat intelligence” firms, which study cybercriminal forums in order to sell information about emerging attacks and threats. For instance, monitoring of these forums has revealed massive data leaks from Target (Krebs, 2013) and Yahoo (Perlroth, 2016). Analysis of private parts of the forums has led to connections that exposed groups of colluding underground forum members (Krebs, 2017).

Previous studies of these forums have shown that while public messages are primarily used to advertise goods and services, private messages are a stronger indicator of potential transactions (Motoyama et al., 2011). Thus, studying the private interactions is crucial to fully understand the underlying economy. However, such study has so far been possible only for the few cybercriminal forums whose private messages have been publicly leaked. These leaks are rare and only provide a snapshot of the forum. Thus, they cannot be considered a continuous source of information about private communications.

To address the problem that the absence of private message leaks poses to forum analysts, in this paper we explore methods to infer the existence of private interactions based on information that is publicly available. We first show that the graph of people communicating privately has little overlap with the graph derived from public interactions. Thus, the public interactions cannot be used to directly infer details about private messages. Instead, we propose a method using supervised learning to predict whether a public Post will result in a private interaction, using as features the content of the Post and metadata about both the Post itself and the user who created it.

We evaluate our approach on three underground forums for which we have the leaked private message data as well as public Posts for the same time period. Our results show that our method performs significantly better than random. We find that individual metadata features carry considerable predictive power; in aggregate, however, the features related to a Post’s content are more influential in classification. Our results show that neither the length of the leak period nor the time between the leak and the prediction has a significant impact on our technique’s performance, suggesting that activity in the forum is stable over time. Finally, we also find that for the prediction of private interactions to be useful, some training data from the target forum is required. Our methods are useful for analysts to triage a forum and focus additional investigations on public Posts that are likely to be interesting.

Determining the ground truth of which private messages are related to a public Post is not only very costly, but also difficult to conclude even when performed manually. We address this challenge by developing a method of automatically determining the likelihood that a private message is related to a public Post. For each Post we then sum these likelihoods and use this sum as the likelihood that the Post has an associated private reply during the leak, for which we possess both private and public messages. This method is not only useful for reducing the amount of effort required to train our technique on new forums, but also mitigates the difficulty of hand labeling which private messages are replies to public Posts. Our automated labeling takes as input a parameter that sets the threshold for determining whether a Post is labeled as positive or negative. This inherently affects which features are prioritized: those related to the Post initiator (a threshold of zero), or characteristics of the Post itself (a high threshold). Our manual evaluation confirms that high values of the threshold focus the labeler on private messages sent close to the Post publication time, and thus highly likely to be related to it, whereas when the threshold is low (i.e., the labeler uses any message received by the user), the labeler detects whether the user is likely to receive private messages in general, independent of any particular Post.

In summary, our contributions are as follows:

We provide the first analysis of the relation between public and private communications in underground forums. We show that there is little overlap between the graph of people communicating in private and the public graph derived from public Posts. Less than 3% of the users that communicate in public exchange private messages, and roughly 50% of the users that communicate in private ever commented together on a public Post. We conclude that there is no straightforward relation between public and private communications in underground forums.

We develop a supervised machine learning based method to predict which public Posts are likely to trigger private messages.

Our experiments show that the AUC of our classifier ranges from 0.65 to 0.77, depending on the parameters, when training and testing are performed on the same forum. The AUC decreases to 0.60 when testing on another forum. Also, our feature analysis concludes that the content of the Post has better predictive power than its metadata, even though the latter still encodes a lot of information regarding the likelihood of receiving a response.

We model the delay distribution of private messages sent to a user after they create a public Post.

We observe a spike in the volume of private messages sent to a public Post creator that degrades at an exponential rate within a few hours of the publishing of the Post. We model the distribution of this delay as a mixture of exponentials and estimate its parameters, fitting this function to the delays of the incoming messages for each leak on each forum. This distribution allows us to determine the likelihood that a private message is related to a public post.

We develop a method for automatically labeling Posts that are likely to trigger private messages. This method assigns weights to Posts to express the uncertainty of the relation between Posts and private messages observed in manual labeling. We perform some manual validation by labeling pairs of Posts and private messages sent to the Post creator. We find that, among Posts labeled negative by our labeler, 83% of the sampled private messages were not related to the Post.

2. Related Work

We provide a brief overview of related work on forum analysis and predictive machine learning.

2.1. Underground Forum Analysis

Initial studies of underground forums showed that cybercriminals with specialized skills cooperate and trade different products and services (Thomas and Martin, 2006; Fallmann et al., 2010; Franklin et al., 2007; Motoyama et al., 2011). Thus, analyses of these forums can be used to understand how these criminals interact with each other and what goods and services are exchanged (Thomas et al., 2015). Prior research has aimed to understand why cybercriminals organize by either analyzing private messages (Afroz et al., 2014; Afroz et al., 2013; Yip et al., 2013), or evaluating self-reported studies of members of these communities (Holt et al., 2012). Most of this prior research based parts of its analysis or evaluation on private information from forum leaks.

Additional research on private messages focuses on how these forums are able to remain operational. Afroz et al. (Afroz et al., 2013) argued that trust management on underground forums is similar to sustaining a common pool resource. However, the success of markets does not indicate the success of individual criminals or vice versa. Individual criminals have a higher probability of pay-off contingent on their ability to interpret market signals of quality (of both goods/services and the individuals selling/buying them) (Décary-Hétu and Leppänen, 2013). As a result, a cybercriminal’s ability to succeed or make profits may depend on their location in the network, which is measured by centrality. Individuals with high betweenness centrality have access to more information both quantitatively and in terms of diversity (Décary-Hétu, 2014), while individuals with higher degree centrality (for private messages) received more responses for public posts on Carders (Motoyama et al., 2011). For example, an examination of Russian malware writers noted that individuals with higher technical skills were more centrally located (Holt et al., 2012). At the same time, Dupont examined a co-offending network of 10 cybercriminals and noted that the more popular criminal did not control the most botnets (Dupont, 2014). From an enforcement perspective, focusing on degree-central criminals is efficient in the former case but not in the latter. Examining the correlation between various centrality measures on underground forums would illuminate the structural properties of the market and thereby inform deterrence measures (Xu and Chen, 2005). Thus, the centrality of a cybercriminal may influence their ability to succeed or make profits. This type of analysis, however, cannot be completed without a leak of the private messages. Our proposed method could potentially enable some of this analysis for longer periods of time based on training data from earlier forum leaks.

2.2. Predictive Machine Learning

Several prior studies have used machine learning to predict a future event. The EMBERS system forecasts civil unrest using open source indicators (et al., 2014). Soska and Christin (2014) predict whether a vulnerable website will be compromised and turn malicious based on whether the vulnerability is being commonly exploited and on the profile of the website. Other work predicts whether an organization will suffer a data breach based on its externally observable security profile and on properties of the organization that might make it a more attractive target (et al., 2015). Our method similarly uses publicly observable information to infer and predict private interactions that are useful for understanding cybercriminal forums.

Machine learning approaches have also been explored, among others, to predict reversions, promotions, and downvotes of user-generated content in web-based communities such as Slashdot (Brennan and Greenstadt, 2010) and Wikipedia (Segall and Greenstadt, 2013; Javanmardi et al., 2009). These approaches have shown that user reputation features and contextual metadata are useful in these systems. To the best of our knowledge, our method is the first attempt to tackle the difficult problem of predicting which public Posts will generate private interactions.

3. Predicting private interactions

3.1. Problem statement

Let us consider a forum in which users communicate with each other. A user can perform the following actions: publish a Post, reply to a Post, or send a private message to another user. The first two are public, i.e., observable by anyone that has access to the forum, while the last is private, i.e., only observable to the sender and receiver of the private message.

While private messages are normally hidden from outsiders, they are sometimes made public via so-called leaks of the forum database (for, 2017). Such leaks consist of information about the forum users as well as Posts, replies, and private messages during a period of time. We refer to the time when the leak starts as the leak start, and to the length of the leak period as the leak duration.

Our goal is to understand whether the leaked information, i.e., users, Posts, replies, and private messages between the start and the end of the leak period, can be used to predict future private interactions in a forum. More formally, given a Post published by a user at a given time, we aim to predict whether there will be a future private interaction, i.e., a private message sent after that time, with that user as receiver.

3.2. Approach

Figure 1. Similarity of private and public connections. In this analysis a public connection is made from user A to user B when A comments on a Post created by B, and a private connection is made from A to B when A sends a private message to B.

A straightforward approach to answer the above question would be to use information from the public graph, i.e., who communicates with whom in public, to infer which users communicate in private. However, a preliminary analysis of the available underground forum leaks immediately shows that private and public graphs are far from being related.

Less than 5% of users that participate in the same public thread ever communicate in private. Figure 1 shows the overlap of connections made in private versus public for every connection observed during the forum leaks. For all forums, at most 3% of connections made from one user to another in private also occur in public. On Carders, the forum with the highest number of private messages relative to public interactions, if we only look at the users that do communicate in private, we see that only roughly 50% of them ever commented on the same public thread at some point over the entire length of the leak.
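The overlap measurement described above can be sketched as a set intersection over directed connections. This is only an illustration under assumptions: the function name `edge_overlap`, the tuple representation of connections, and the toy user names are hypothetical and do not come from the forum data.

```python
def edge_overlap(private_edges, public_edges):
    """Fraction of directed private connections that also occur in public.

    Each edge is a (source_user, destination_user) tuple. For private edges
    the source is the message sender; for public edges the source is the
    commenter and the destination is the Post creator.
    """
    private = set(private_edges)
    public = set(public_edges)
    if not private:
        return 0.0
    return len(private & public) / len(private)


# Hypothetical toy data, not taken from any forum leak.
private_edges = [("alice", "bob"), ("carol", "bob"),
                 ("dave", "alice"), ("erin", "carol")]
public_edges = [("alice", "bob"), ("bob", "carol"), ("frank", "alice")]

overlap = edge_overlap(private_edges, public_edges)
print(f"{overlap:.0%} of private connections also occur in public")
```

Treating connections as directed pairs means that a public comment from B to A does not count as overlap with a private message from A to B; an undirected variant would sort each pair before building the sets.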

These results suggest that the relation between public and private interactions is much more complex than a direct overlap. To address this complexity, we frame the question of inferring private communication as a supervised machine learning problem. Assume there is a leak with both public and private interactions covering a leak period of known duration, and that the goal is to predict private interactions during a target period, also of known duration and strictly distinct from the leak period, for which we only have access to public information. Figure 2 illustrates this timeline. A further parameter, the gap, denotes the time passed between the end of the leak period and the start of the target period.

Figure 2. Timeline of the leak period and the target period, each with its own duration, separated by a time gap.

The public and private data available from the leak period can be used as a training set to build a model that represents the likelihood that a user receives a private message after publishing a Post. This model uses as features public information about a Post published in the target period (e.g., characteristics of the sender, number of replies to the Post, Post content). The prediction performance greatly depends on the features used to train the model. We provide examples of feature sets, and evaluate their performance, in Section 4.
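To make the timeline concrete, the following sketch splits Posts into a training set (leak period) and a test set (target period). The function name, the dictionary layout of a Post, and the dates are assumptions made for illustration; only the notions of leak period, gap, and target period come from the text.

```python
from datetime import datetime, timedelta


def split_by_period(posts, leak_start, leak_duration, gap, target_duration):
    """Split Posts into leak-period (training) and target-period (testing) sets.

    The target period starts `gap` after the leak period ends, so the two
    periods never overlap.
    """
    leak_end = leak_start + leak_duration
    target_start = leak_end + gap
    target_end = target_start + target_duration
    train = [p for p in posts if leak_start <= p["time"] < leak_end]
    test = [p for p in posts if target_start <= p["time"] < target_end]
    return train, test


# Hypothetical Posts; only the publication time matters for the split.
posts = [
    {"id": 1, "time": datetime(2009, 1, 10)},
    {"id": 2, "time": datetime(2009, 2, 20)},
    {"id": 3, "time": datetime(2009, 3, 1)},
    {"id": 4, "time": datetime(2009, 6, 1)},
]
train, test = split_by_period(
    posts,
    leak_start=datetime(2009, 1, 1),
    leak_duration=timedelta(weeks=7),
    gap=timedelta(weeks=0),
    target_duration=timedelta(weeks=6),
)
print([p["id"] for p in train], [p["id"] for p in test])
```

Only the training set would carry labels derived from private messages; the test set is restricted to public information, as in the problem statement.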

3.3. Automating labeling

Note that the available leaks do not contain explicit information about the correspondence between public and private messages. Thus, one challenge in this problem is that there is no ground truth linking private messages to public threads. Even with all of the leaked information available, it is unclear which private messages are related to which public Posts, or even whether a public Post received any private replies.

One option to solve this problem would be to manually label the data. This would mean going through each of the Posts published in the leak period and searching through the private messages for related conversations. This process is not only prohibitively expensive, but also difficult. Manual linking must be evaluated in terms of content and in many cases, even for a human, the relationship between public Posts and private messages is not clear, hindering the labeling task. As an example, we manually labeled a small set of 330 Post and message pairs with the goal of determining whether they were related. We ended up labeling almost half, 47%, as unclear. Section 4 further details this analysis.

Thus, we develop a method to automatically label Posts in the training set. Acknowledging that the relation between Posts and private messages is fuzzy, and that as a result labels will be noisy, our method does not use binary yes/no labels but assigns a weight to each Post. This weight effectively models the likelihood that the Post triggered a private message.

As a first step, we aim to model the relationship between the time when a user publishes a public Post, $t_p$, and the time when she receives a private message, $t_{m_i}$. We denote this time difference as $\Delta t_i$ and compute it as $\Delta t_i = t_{m_i} - t_p$, where $i$ indicates the index of subsequent messages received by the Post initiator in increasing order (see Figure 3).

Figure 3. Timing relationship between the publication of a Post and three subsequent private messages received by the Post creator.

To this end we compute the distribution of these intervals for all Posts and private messages in the leak period. The results are shown in Figure 4 for the three forums we use in our experiments. In this figure each bar represents the volume of incoming messages in intervals of 15 minutes across the entirety of a leak period of 6 weeks. We observe a large spike shortly after a Post is published, which we conjecture is generated by a large number of messages triggered by that Post. Note that users also receive messages when they have not recently posted, as indicated by the long tails of the distribution. Negative interval values are a result of private messages received prior to a given Post publication, which can be either responses to previous Posts or spontaneous messages between users. We recall that in this paper we are only interested in predicting messages sent after a Post, and therefore in the remainder of this document we disregard negative interval values.

Figure 4. Distribution of intervals between Post publication and reception of private messages for each forum over a six-week leak.
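A histogram like the one in Figure 4 can be built by computing, for each Post, the interval between its publication and every private message its creator receives, and then counting intervals per fixed-width bin. The sketch below assumes times expressed in minutes as plain numbers; the function names and the toy values are hypothetical.

```python
from collections import Counter


def message_intervals(post_time, message_times):
    """Intervals between a Post and each message received by its creator.

    Negative values correspond to messages received before the Post.
    """
    return [t - post_time for t in sorted(message_times)]


def histogram(intervals, bin_width):
    """Count non-negative intervals per bin of `bin_width` time units."""
    counts = Counter()
    for dt in intervals:
        if dt < 0:
            continue  # messages that precede the Post are disregarded
        counts[int(dt // bin_width)] += 1
    return dict(counts)


# Hypothetical example, times in minutes since an arbitrary origin.
post_time = 100
message_times = [40, 105, 112, 130, 400]

intervals = message_intervals(post_time, message_times)
bins = histogram(intervals, bin_width=15)
print(intervals)  # one entry per message, in time order
print(bins)       # bin index -> number of messages
```

In practice the intervals of all Posts in the leak period would be pooled before binning, so that each bar reflects the total volume of messages at that delay.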

To represent both behaviors, the spike after publication and the low influx that follows, we model the likelihood of a given arrival time $\Delta t$ of incoming private messages to a user after a Post as a mixture of two exponentials:

$$f(\Delta t) = a\,e^{-b\,\Delta t} + c\,e^{-d\,\Delta t}$$

The first exponential, defined by coefficients $a$ and $b$, aims at capturing the initial steep decline, while the one defined by $c$ and $d$ captures the slow decay in the number of messages over time.

The values of these coefficients can be estimated by interpolating the mixture function over the data points from the training data obtained in the leak period. Note, however, that Figure 4 also shows the existence of a continuous flow of messages independent of the existence of the Post (the distribution tail). We model this Post-independent message flow as a constant. Then, instead of directly interpolating the mixture, we interpolate the mixture plus this constant, from which the mixture can easily be obtained by subtracting the constant. We measure how well the interpolation fits the data by computing the coefficient of determination $R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}$, where each $y_i$ is the observed volume of incoming messages at the $i$-th time point, $\hat{y}_i$ is the corresponding interpolated value, and $\bar{y}$ is the mean of the observed volumes.
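As an illustration of how a candidate set of coefficients could be scored, the sketch below evaluates a two-exponential mixture plus a constant and computes the coefficient of determination against observed volumes. The functional form with negative exponents, the coefficient names `a`, `b`, `c`, `d`, `k`, and all numeric values are assumptions; the search for the coefficients themselves is left out.

```python
import math


def mixture(dt, a, b, c, d, k=0.0):
    """Two decaying exponentials plus a constant background flow k (assumed form)."""
    return a * math.exp(-b * dt) + c * math.exp(-d * dt) + k


def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(observed) / len(observed)
    ss_res = sum((y - yh) ** 2 for y, yh in zip(observed, predicted))
    ss_tot = sum((y - mean) ** 2 for y in observed)
    return 1.0 - ss_res / ss_tot


# Synthetic volumes generated from hypothetical coefficients (time in hours).
times = [0.25 * i for i in range(40)]
observed = [mixture(t, a=50.0, b=2.0, c=5.0, d=0.1, k=3.0) for t in times]

# A model that reproduces the data exactly scores 1; a flat line scores 0.
r2_perfect = r_squared(observed, observed)
flat = [sum(observed) / len(observed)] * len(observed)
r2_flat = r_squared(observed, flat)
print(r2_perfect, r2_flat)
```

A fitting routine would search over the coefficients to maximize this score (or, equivalently, minimize the residual sum of squares), and the constant `k` would be subtracted afterwards to recover the Post-dependent part.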

The values of the coefficients obtained via interpolation depend on three parameters. First, they depend on the duration of the leak period, which determines the number of Posts and private messages available to construct the histogram that we use as input to the interpolation. Second, they depend on the maximum time after a Post for which a private message is considered to be possibly related to this Post. This not only affects the number of private messages available, but also the weight of the distribution tail in the interpolation problem. To avoid setting this value arbitrarily, we infer it empirically by testing increasing values of this maximum time and choosing the one for which the values of the coefficients do not change by more than a fixed tolerance when it is increased further. We stress that, though we cap this maximum time for the interpolation, the interpolated function itself is not truncated, so as to indicate that any message in the future could be related to a Post, though the likelihood is very small.

Finally, the values of the coefficients depend on the granularity used to compute the histogram representing the distribution. In Figure 4 time is arbitrarily divided into 15-minute bins. For our experiments, we explored a few methods for determining the optimal granularity of the distribution. The naïve method is to fix the bin size, as in Figure 4, for any length of the leak period. This method is harmful for smaller leaks, since the bins are too small and many end up empty, providing a poor input to the interpolation. Instead, we developed a balanced method that determines the bin granularity based on the average number of items desired in each bucket. This method ensures that one has enough quality data points for interpolation, i.e., all bins have a significant number of points, regardless of the duration of the leak and the level of activity in the forum.
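One possible reading of the balanced method is to pick the bin width so that the average number of intervals per bin matches a target. The sketch below follows that reading; it is an interpretation rather than the exact procedure used in the experiments, and the function names and toy data are hypothetical.

```python
import math


def balanced_bin_width(intervals, max_time, items_per_bin):
    """Bin width such that bins hold `items_per_bin` intervals on average."""
    kept = [dt for dt in intervals if 0 <= dt < max_time]
    if not kept:
        raise ValueError("no intervals inside [0, max_time)")
    n_bins = max(1, len(kept) // items_per_bin)
    return max_time / n_bins


def bin_counts(intervals, width, max_time):
    """Number of intervals falling into each bin of the given width."""
    counts = [0] * math.ceil(max_time / width)
    for dt in intervals:
        if 0 <= dt < max_time:
            counts[int(dt // width)] += 1
    return counts


# Hypothetical intervals (minutes): 20 values spread evenly over 100 minutes.
intervals = list(range(0, 100, 5))
width = balanced_bin_width(intervals, max_time=100, items_per_bin=5)
counts = bin_counts(intervals, width, max_time=100)
print(width, counts)
```

With sparse data this yields few wide bins, and with dense data many narrow ones, which avoids the empty bins that a fixed width produces on short leaks.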

We tested both of these methods with a number of inputs, and compared the obtained coefficients of determination. As expected, the smaller leaks saw the greatest improvement with the balanced method: for a leak duration of one week, the balanced method with an average of five items per bucket improved on the fit obtained with both one-minute and five-minute bins. We show in Figure 5 the result of the interpolation for L33tCrew.

Figure 5. Distribution of interval values and the interpolated function for L33tCrew.

Since the interpolated function, which we write $f$, models the likelihood of a private message being sent to a Post creator after the thread has been created, it can be used to infer a label that represents the core question in this paper: will a user who publishes a public Post receive a private message? For this purpose, we define the aggregated likelihood that a given Post $p$ published by user $u$ receives a response as:

$$L(p) = \sum_i f(\Delta t_i)$$

such that $\Delta t_i > 0$ and there exists a private message to $u$ at that time.

The above definition of the likelihood closely resembles that of the joint probability. However, we note that the interpolated function is not a probability distribution, and thus neither is the aggregated likelihood. First, the function is fitted to the volume of messages observed in the training dataset, so it is possible that the area under the curve does not add up to 1. Normalizing after the interpolation to transform the function into a distribution is also not possible given that the support for the exponential goes to infinity.

Once the aggregated likelihood is computed, we assign a binary label to Posts using a threshold to decide whether the Post has generated messages or not. When the threshold is zero, i.e., when we only check that the aggregated likelihood is positive, Posts initiated by a user who ever receives a message during the leak period are labeled positively. When the threshold is large, positive labels are only assigned to those Posts for which the initiator receives a message close to the time of the Post publication. Thus, this threshold effectively balances whether the labeling focuses on the overall likelihood of the user receiving a message or on the likelihood of the particular Post evaluated having triggered the private interaction, as we demonstrate in Section 4.2.
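The labeling step can be sketched as follows, assuming the aggregated likelihood is the sum of a fitted likelihood function over all private messages received after the Post. The coefficients inside `likelihood`, the function names, and the time unit (hours) are hypothetical choices for illustration.

```python
import math


def likelihood(dt, a=1.0, b=0.5, c=0.1, d=0.01):
    """Hypothetical fitted mixture of two exponentials (time in hours)."""
    return a * math.exp(-b * dt) + c * math.exp(-d * dt)


def aggregated_likelihood(post_time, message_times, f=likelihood):
    """Sum f over the private messages received after the Post."""
    return sum(f(t - post_time) for t in message_times if t > post_time)


def label_post(post_time, message_times, threshold, f=likelihood):
    """Positive label when the aggregated likelihood exceeds the threshold."""
    return aggregated_likelihood(post_time, message_times, f) > threshold


# A message five hours after the Post: positive only for a zero threshold.
late_zero = label_post(0.0, [5.0], threshold=0.0)
late_high = label_post(0.0, [5.0], threshold=0.5)
# A message six minutes after the Post: positive even for a high threshold.
early_high = label_post(0.0, [0.1], threshold=0.5)
# No message after the Post: always negative.
none_zero = label_post(0.0, [-3.0], threshold=0.0)
print(late_zero, late_high, early_high, none_zero)
```

The example shows the two regimes discussed above: with a zero threshold any later message makes the Post positive, while a high threshold keeps only Posts followed closely by a message.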

4. Evaluation

4.1. Experimental setup

In this section we describe the conditions in which we carry out the evaluation of our prediction method. We introduce the datasets we use and describe the preprocessing we perform prior to the experiments, as well as the machine learning tools that we use to implement the predictor.

Datasets. We evaluate this method on three leaked underground forums: Carders (CC), L33tCrew (LC), and Nulled (NL). All three forums were leaked anonymously, are publicly available, and have been used in several academic (Motoyama et al., 2011; Afroz et al., 2013; Afroz et al., 2014; Portnoff et al., 2017) as well as non-academic (http://krebsonsecurity.com/tag/carders-cc/) studies. We describe each forum below and we summarize their characteristics in Table 1.

L33tCrew is a German language carding forum specializing in trading stolen credit and debit cards. It was started in May 2007 and was leaked and closed in November 2009. At the time of the leak, L33tCrew had 18,834 total members, of whom 7,687 participated in private message interactions. After the leak many members of L33tCrew joined Carders.

Carders is a similar German language carding forum. Carders was established in February 2009 and was leaked and closed in December 2010 (details of the Carders leak: http://www.exploit-db.com/papers/15823/). At the time of the leak, Carders had 8,425 total members, of whom 4,290 participated in private message interactions.

Nulled is a large English language forum covering a large variety of topics and is currently still active. At the time of the leak Nulled had 599,085 members. However, only 6.11% (36,606) of the users in the leak sent or received private messages.

Forum   Language   Dates (MM/YY)   Users     Users w/ PMs
LC      German     05/07-11/09     18,834    7,687
CC      German     02/09-12/10     8,425     4,290
NL      English    01/15-05/16     599,085   36,606
Table 1. Summary of the leaked forum data.

Though all of these forums are generally similar in content and structure, the way that users interact with each of them varies. In every forum, users can create public threads, submit a comment to an existing thread, or send a private message to another user. Figure 6 shows the distribution of these interactions over time for all forums. Nulled has the largest activity volume as well as a higher percentage of comments compared to Posts and messages, while L33tCrew has a relatively higher proportion of private messages. Carders has much less activity than the other two, but a high volume of private interactions.

Figure 6. Type of activity over the time of the leaks for all forums.

Data Pre-Processing. First we select Posts that are created in isolation from other Posts by the same user. That is, if a user creates two Posts within 12 hours of each other, both are discarded. This ensures that in training we properly label the replies that correspond to that thread, and that in testing we can evaluate Post classification unequivocally. Since the Post publication times are known to the analyst, this is a plausible assumption to make. For training we also remove Posts that are within 24 hours of the end of the leak period. The rationale is that most private replies to these Posts may come after the leak ends and are therefore unknown by definition, causing the Post to be mislabeled and increasing the false negative rate. After the Post filtering step, we select chunks of varying length to simulate forum leaks that we use for training, as well as non-overlapping chunks of fixed length that we use for testing, as discussed below.
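The two filtering rules can be sketched as below. The representation of a Post as a `(user, time_in_hours)` tuple and the function name are assumptions, and "within 12 hours" is read here as a strict inequality; the quadratic scan is kept for clarity rather than speed.

```python
def filter_posts(posts, leak_end, isolation=12.0, tail=24.0):
    """Keep Posts that are isolated and far enough from the end of the leak.

    posts: list of (user, time_in_hours) tuples.
    A Post is dropped if the same user has another Post less than
    `isolation` hours away, or if it falls within `tail` hours of `leak_end`.
    """
    kept = []
    for i, (user, t) in enumerate(posts):
        clash = any(
            j != i and other_user == user and abs(other_t - t) < isolation
            for j, (other_user, other_t) in enumerate(posts)
        )
        too_late = leak_end - t < tail
        if not clash and not too_late:
            kept.append((user, t))
    return kept


# Hypothetical Posts: u1 posts twice within 5 hours, u2 posts near the end.
posts = [("u1", 0.0), ("u1", 5.0), ("u1", 40.0), ("u2", 3.0), ("u2", 90.0)]
kept = filter_posts(posts, leak_end=100.0)
print(kept)
```

For testing, only the isolation rule would apply, so `tail` could be set to zero when filtering Posts from the target period.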

Features & Classifier. We show in Table 2 the features we use to create the models we use for classification. We include both natural language and Post context features. The Post context features include timing, tags, and public replies, as well as user-related features. Among the latter we include centrality metrics from the public interaction graph, i.e., the graph that links Post creators to Post respondents. These centrality features have been shown in previous work (Garg et al., 2015) to be a measure of users’ popularity and influence on these forums. We run experiments with just the natural language features, just the context features, and all of the features. Note that for the natural language features we remove function words and use the frequency of the stems of the remaining words.

Feature Type       Description
Natural Language   title bag of words, thread bag of words, and subforum name bag of words
Post Context       tagged sell Post, tagged buy Post, time, time on forum, reply count, user reputation, views, and graph centrality metrics for the Post creator: clustering, degree, eigenvector, betweenness

Table 2. Summary of the feature set used for classification.
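As a rough illustration of the natural language features, the sketch below removes function words and counts the stems of the remaining words. The function word list and the suffix-stripping rule are deliberately tiny stand-ins for a real stop word list and stemmer (and are English-only, whereas two of the forums are German); all names are hypothetical.

```python
import re
from collections import Counter

# Hypothetical, deliberately small function word list.
FUNCTION_WORDS = {"the", "a", "an", "of", "for", "and", "to", "in", "is"}
# Crude suffix stripping standing in for a real stemming algorithm.
SUFFIXES = ("ing", "ed", "s")


def crude_stem(word):
    """Strip the first matching suffix if at least three letters remain."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def bag_of_stems(text):
    """Frequency of stems after lowercasing and removing function words."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return Counter(crude_stem(t) for t in tokens if t not in FUNCTION_WORDS)


features = bag_of_stems("Selling fresh cards and dumps, cards for the win")
print(features)
```

The same function would be applied separately to the title, the thread body, and the subforum name to obtain the three bags of words listed in Table 2.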

We train a Random Forest Classifier on each of the feature groups. We choose random forest for classification for three primary reasons. First, random forests cope well with outliers, which are common in our data set and in data of this type generally. Second, the results are easily interpreted. That is, we can easily discover which features the decisions are made on. Finally, random forest classifiers are not as susceptible to overfitting as other similar classification techniques, which is particularly important in this problem as the training data and testing data come from different time slices of the forum. Throughout our experiments we vary the classification threshold of the random forest classifier in order to create ROC curves that represent the performance of our prediction.
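Varying the classification threshold to trace a ROC curve can be sketched independently of the classifier. The code below assumes each test Post comes with a score (for a random forest, for instance, the fraction of trees voting positive) and a binary label; the scores, labels, and function names are hypothetical.

```python
def roc_points(labels, scores):
    """(false positive rate, true positive rate) for each score threshold."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under the ROC curve using the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Hypothetical classifier scores for six test Posts.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

points = roc_points(labels, scores)
area = auc(points)
print(points[-1], round(area, 4))
```

An area of 0.5 corresponds to random guessing, which is the baseline against which the reported AUC values should be read.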

Simulation parameters. To explore different scenarios, in our simulations we vary several parameters from our model (see Section 3). First, we vary the size of the leak, i.e., the duration of the leak period. Smaller leaks have fewer Posts to train the model, and also fewer known private messages from which to infer the labeling function. For simplicity and comparability we fix the span of the testing period. We empirically determined that a span of six weeks contained enough data to obtain smooth ROC curves resistant to noise. Additionally, fixing this span allows us to accurately test the gap, i.e., the time between the leak and the testing periods. Otherwise, varying the span would affect the average time between the training and the testing Posts across experiments. We also vary the gap to evaluate the effect of time passing between the leak and testing periods. In the bulk of our experiments we use gaps on the scale of weeks. Since understanding how changes in these forums over time affect performance is vital to evaluating how well our method scales, we use additional gap values for L33tCrew, the forum with the longest leak.

Finally, we also vary the value of to evaluate the performance of our method when it focuses on user-only features – , i.e., Posts receive positive labels if the user ever receives a private message after posting; and when more weight is given to -oriented features – high , i.e., Posts receive positive labels when the initiator receives private messages close in time to the thread.
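The two labeling regimes can be illustrated with a toy scoring rule. The labeling function itself is defined in Section 3 and is not reproduced here; the weight used below, (1 - delay/window) raised to a power, is purely an assumption chosen because it shows both behaviors, and the names `label_post`, `sharpness`, `window_weeks`, and `threshold` are hypothetical.

```python
from typing import Sequence


def label_post(
    message_delays_weeks: Sequence[float],
    window_weeks: float,
    sharpness: float,
    threshold: float = 0.5,
) -> bool:
    """Label a Post positive when its time-weighted message score reaches the threshold.

    ASSUMPTION: each private message received `delay` weeks after the Post
    contributes (1 - delay / window_weeks) ** sharpness. This is an
    illustrative weight, not the function defined in Section 3.
    """
    score = 0.0
    for delay in message_delays_weeks:
        if 0 <= delay < window_weeks:
            score += (1.0 - delay / window_weeks) ** sharpness
    return score >= threshold


# With sharpness 0 every message in the window counts fully, so any message
# yields a positive label regardless of when it arrives.
late_message_flat = label_post([5.0], window_weeks=7.0, sharpness=0)
# With a high sharpness only messages close in time to the Post matter.
late_message_sharp = label_post([5.0], window_weeks=7.0, sharpness=90)
early_message_sharp = label_post([0.01], window_weeks=7.0, sharpness=90)
```

Under this assumed weight, a single late message is enough for a positive label in the flat regime but not in the sharp regime, while a message sent almost immediately after the Post is positive in both.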

4.2. Results

Figure 7. Classification accuracy for different values of and using all features.

Random Forest Classifier results.

First, we study the influence of and on the accuracy of our prediction methods. We show the results in Figure 7 where for each combination of and we plot the results for the largest possible for which the percentage of positive labels is greater than 10%. This ensures that: i) we do not focus on user-based features that are stable over time, and ii) there is enough data in the positive class for the classification performance to be significant. We observe that varying , or has no notable effect on the accuracy. Even with the larger values of tested on the L33tCrew forum we see no decline in accuracy: and yielded accuracies of 87.17% and 81.41%, respectively. For the rest of the experiments in this section we fix to 7 weeks and to 0.

We then study the effect of on performance. We plot in Figure 8 ROC curves for varying . We see that across all three forums, and in particular for Carders, smaller values of , where more posts are labeled as positive, perform best, i.e., they result in a higher True Positive Rate and a lower False Positive Rate. We conjecture that part of this success is due to small yielding a large percentage of positive labels, thus resulting in a more balanced training distribution.

Figure 8. ROC Curves for different values ( weeks and ).

Finally, we compare the different feature sets, see Figure 9. All of the features together perform the best, followed closely by only NLP. Both of them clearly outperform the context features alone. The context features are most useful on Carders, and on both Nulled and Carders they do better with a smaller than a larger one. This supports our claim that, when is set to 0, what is being predicted is closer to which users are more likely to receive messages than to which Posts are likely to trigger private replies.

Figure 9. ROC Curves for different feature sets ( weeks and )

Manual validation of labeling. We manually labeled 330 public Post and private message pairs. We sampled these Posts and then we selected uniformly at random one message that was sent to the Post creator at any time in weeks. We then labeled whether the private message appeared to be in response to the Post. Of the 330 pairs we were able to conclusively label 175. Note that, as opposed to our automated labeling method, this manual labeling does not aim at establishing if a Post received any related private message, but whether a Post created by and a concrete private message sent to were related. Therefore, we could not expect our manual labels to fully line up with the automated labels even if both methods output perfect labels. We also label Posts in the manually labeled pairs using our automated labeler for and .

Table 3 displays the results of both labeling processes. Note that when , i.e., when automated labeling assigns a positive label to each Post creator who receives a private message in , there are no negatively labeled Posts for which we can label a public Post and private message pair, because there are no negatively labeled Posts with private messages. That is, if we are labeling a pair with a private message to , there is at least one message in , thus the label is positive by default.

Labeling parameter   Automated   Manual   Percent
0                    +           +        32.93%
0                    +           -        67.07%
90                   +           +        46.15%
90                   +           -        53.85%
90                   -           +        17.07%
90                   -           -        82.93%
Table 3. Results of manual and automated labeling. Automated labeling configured with and .

We see that for Posts labeled positively by the automated method with about a third of the sampled private messages are manually labeled as related. The fact that manual labeling finds many of the Posts with positive labels to have unrelated messages indicates that, indeed, the automated labeler is not focusing on the Post itself, but on the user. Also, we note that the Posts that were manually labeled as unrelated may have a private message different from the sampled one that is actually related to the Post, so the automated label may still be correct.

When we increase to 90, the number of Posts labeled as negative increases: users who receive their messages long after the Post will be labeled as negative, as these messages are unlikely to be related to the Post and thus contribute little to , see 2, which does not reach the threshold . On Posts that the automated method labeled as negative, we manually label 17% of our pairs as related. That is, the automated labeler mislabeled these Posts according to our manual labeling.

For the Posts that our automated method labels as positive, there is a difference between the Posts labeled with and . The higher percentage of pairs being manually labeled as positive for a larger implies that a Post labeled positive under a larger is more likely to have a related private message than one labeled positive under a smaller . Thus, tunes whether the model focuses on the Post itself or on whether a user generally receives many messages.

Figure 10, where we plot the distribution of values for positively labeled Posts, supports this claim. We observe that for the volume of messages at small is smaller than for other , but larger for large . In other words, large considers a much larger percentage of private messages sent closer to the publication time, and therefore likely to be related to it. On the other hand, small considers messages all across , i.e., it considers only whether the message has been sent to the target receiver regardless of the time.

Figure 10. Distributions of values for Positive Labels on L33tCrew. ( weeks, )

Feature Analysis. Next we aim to determine which features are the best predictors, for which we calculate the information gain in each experiment. Surprisingly, we find that the highest ranked features in most experiments are the centrality metrics, number of public replies, number of views, and how long the user has been on the forum. We observe no notable difference between the top ranked features for different values of . This seems to contradict the results of the Random Forest Classifier, which show that performance is better using the NLP features. This discrepancy is due to the volume of these features. Though context features contain a lot of information and individually are very predictive, they are only a handful. On the other hand, we have thousands of NLP features, thus they are more predictive as an ensemble.
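For reference, information gain for a single feature is the entropy of the labels minus the expected entropy of the labels after splitting on that feature. The sketch below assumes the feature has already been discretized into a small number of values (continuous features such as view counts would first need to be binned); the function names and the toy data are illustrative only.

```python
import math
from collections import Counter
from typing import Hashable, Sequence


def entropy(labels: Sequence[int]) -> float:
    """Shannon entropy (in bits) of a sequence of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())


def information_gain(feature_values: Sequence[Hashable], labels: Sequence[int]) -> float:
    """Entropy of the labels minus their expected entropy given the feature."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    conditional = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - conditional


# Toy example: one feature separates the classes perfectly, the other not at all.
toy_labels = [1, 1, 0, 0]
gain_informative = information_gain(["high", "high", "low", "low"], toy_labels)
gain_uninformative = information_gain(["high", "low", "high", "low"], toy_labels)
```

Because the score is computed one feature at a time, it ranks individually strong features highly and does not capture the combined effect of many weak features, which is consistent with the discrepancy discussed above.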

Cross-Forum Classification. Here we consider the case where the analyst wants to predict private activity on a forum for which she has no leaked data, but a similar forum exists with leaked private message data. We take as examples L33tCrew and Carders since they are the most similar forums considered in this work, both being German-language carding forums. We train the classifier on data from Carders and test it on data from L33tCrew. The results are displayed in Figure 11. While the classifier does outperform random chance, the results are, predictably, worse than the intra-forum results presented above.

Figure 11. Cross Forum Results for L33tCrew and Carders.

5. Discussion

Limitations. The accuracy of our method is not exceptionally high, meaning that its results cannot be used to directly act upon a Post or user. However, we have found that predicting private activity is a difficult problem and even manual analysis often cannot ascertain if a private message was sent in response to public Posts. Thus, providing analysts with automated tools to point out where to concentrate their efforts is of value.

Another key limitation of our method is that, though it is quite effective at predicting which Posts will be followed by private interactions, it cannot predict who will participate in the private communication. A graph of these private interactions would be useful for understanding collaborations among members and their relations. However, our analysis revealed little overlap between members who post in the same public Post and those who communicate privately, which hints that inferring such a graph is a very complex problem.

Also, our analysis has focused on ‘isolated’ Posts, i.e., Posts for which the poster does not start any other Post within 12 hours. While this simplification still allows us to show that private interactions can indeed be predicted, it limits the applicability of our approach. More research is needed to develop means to jointly infer the likelihood of responses when a series of Posts are published, and to determine whether it is possible to decide which of these public communications has, or have, actually triggered the private interaction.

Finally, we observe that, although our technique outperforms random guessing when trained on one forum and tested on another forum, it currently performs better when there is training data for the target forum. This lays out an interesting problem to be addressed in future work, focused on how to identify stable features that can be used to bridge the difference in domains posed by the diversity between different forums’ users.

Use Cases. Even though our approach has several limitations, there are useful and realistic use cases. The first envisioned use case is for forums such as Nulled that continue to operate after a leak. Researchers or private companies could continue collating the public messages from such a forum, and researchers or analysts could then use our approach to identify public Posts that likely generated private interactions. This could enable measures of forum member centrality based on private interaction, or identify members and public Posts that should be examined further because likely private interactions might indicate completed sales.

Ethics. All of the data used in our study was publicly leaked and is believed to be authentic. Our IRB deemed our study exempt based on the fact that the data was publicly leaked and that, the data being from underground criminal forums, it is unlikely anyone used their real name or other Personally Identifiable Information (PII). Our analysis of private messages was focused on identifying links between private messages and Posts, and not on analyzing users' identities or message content. If we had discovered PII in these messages we would have notified our IRB and submitted an amended IRB application. However, we did not find any PII during our manual analysis.

6. Conclusions

Private activity in underground forums has been shown to be key for understanding the underlying cybercriminal ecosystem. However, analysts rarely have access to private messages and when they do these private interactions are limited to a snapshot that only represents forum activity during a bounded time interval.

In this work we propose and evaluate a method that provides analysts with the means to predict such private interactions from the information that is available to them. We have presented a supervised machine learning based method able to predict which public Posts will generate private messages, after a partial leak of such messages has occurred. Additionally, we have proposed a method for automated labeling that reduces the cost of our analysis, thus increasing its potential to be deployed to analyze large forums. This automatic labeling method has a parameter that tunes it to focus either on the likelihood of users receiving private messages or on Posts triggering private interactions. Automated labeling is especially useful as manual labeling turned out to be even more difficult than expected.

To the best of our knowledge, our approach is the first proposed to tackle the important problem of having limited to no information about private interactions in most underground forums. Our work is an initial step toward providing forum analysts with tools to understand which Posts are likely to result in sales or follow-up discussion and are therefore more likely to merit further investigation. Understanding how this learning problem transfers across forums is a crucial part of future work in this area.

