1 Introduction
With the increasing amount of data collected from everywhere, such as computer systems, transaction activities, and social networks, it becomes more and more important to understand the underlying regularity of the data and to spot unexpected or abnormal instances [Chandola et al.2009]. Centered around this goal, anomaly detection plays a very important role in many security-related applications, such as securing an enterprise network by detecting abnormal connections.
However, the problem has not yet been satisfactorily addressed. Many traditional anomaly detection methods focus on either numerical data or supervised settings [Chandola et al.2009]. When it comes to unsupervised anomaly detection in heterogeneous categorical event data, i.e., events containing a collection of categorical values that are considered as entities of different types, there is less existing work [Das and Schneider2007, Das et al.2008, Tong et al.2008, Akoglu et al.2012].
The heterogeneous categorical event data are ubiquitous, such as events of process interactions in computer systems, where each data point is an event that involves heterogeneous types of attributes/entities: time, user, source process, destination process, and so on. In order to detect abnormal events that deviate from the regular patterns, a common approach is to build a model that can capture the underlying factors/regularities of data. However, events with multiple heterogeneous entities are difficult to model in a systematic and unified framework due to two major challenges: (1) the lack of intrinsic distance measures among entities and events, and (2) the exponentially large event space.
Consider that in real computer systems, given two users with ids 1 and 10, we know almost nothing about their distance/similarity without other information. In addition to the lack of an intrinsic distance measure, the exponentially large event space is also an issue. For example, a heterogeneous categorical event in real systems can involve more than ten types of entities. If each entity type has more than one hundred possible entities, the overall event space will be as large as $100^{10} = 10^{20}$, which is prohibitively large and makes it challenging to model regularities.
Due to these two difficulties, most existing work relies heavily on heuristics to quantify the normal/abnormal scores for events [Das and Schneider2007, Das et al.2008, Tong et al.2008, Akoglu et al.2012]. However, a more systematic and accurate model is in demand, given the rapid emergence of large and complicated data in important applications.
To tackle the aforementioned challenges, we propose a probabilistic model that directly models the event likelihood. We first embed entities into a common latent space where the distance among entities can be naturally defined. Then, to assess the compatibility of the entities in an event, we quantify their pairwise interactions by the dot product of the embedding vectors. Finally, the weighted sum of interactions is used to define the probability of the event.
Compared to traditional methods, the proposed method has several advantages: (1) by modeling the likelihood of an event based on entity embeddings, the proposed model can produce normal/abnormal scores in a principled and unified framework; (2) by modeling weighted pairwise interactions instead of all possible interactions, the model is less susceptible to overfitting and can provide better interpretability; and (3) the proposed model can be learned efficiently by Noise-Contrastive Estimation with a “context-dependent” noise distribution regardless of the large event space. Empirical studies on real-world enterprise surveillance data show that by applying our method we can detect unknown abnormal events accurately.
2 Problem Statement
Here we introduce some notations and define the problem.
Heterogeneous Categorical Event. A heterogeneous categorical event $e = (a_1, \dots, a_m)$ is a record that contains $m$ different categorical attributes, where the $i$-th attribute value $a_i$ denotes an entity of type $A_i$. In the computer process interaction network, an event is a record involving entities of types such as the user, time, source/destination process, and folder. Although time is a continuous value, it can be chunked into segments of different granularities, such as day and hour, which can then be viewed as entities. In the following, we will call it an event for short.
By treating the categorical attributes of an event as entities/nodes, we can also view categorical events as a heterogeneous network of multiple node types [Sun and Han2012]. In the computer process interaction example, the network schema is shown in Figure 1, where the event acts as a super node connecting other nodes of different types.
Problem: abnormal event detection. Given a set of training events $\mathcal{D}$, and assuming that most events in $\mathcal{D}$ are normal, the problem is to learn a model $\mathcal{M}$ so that when a new event $e$ comes, the model can accurately predict whether the event is abnormal or not.
3 The Proposed Model
In this section, we introduce the motivation and technical details about the proposed model.
3.1 Motivations
We directly model the event likelihood, as it indicates how likely an event is to occur according to the data. An event with an unusually low likelihood is naturally abnormal. To achieve this, we need to deal with the two main challenges mentioned before: (1) the lack of intrinsic distance measures among entities and events, and (2) the exponentially large event space.
To overcome the lack of intrinsic distance measures among entities, we embed entities into a common latent space where their semantics can be preserved. More specifically, each entity, such as a user or a process in computer systems, is represented as a $d$-dimensional vector that is automatically learned from the data. In the embedding space, the distance between entities can be naturally computed by distance/similarity measures in the space, such as Euclidean distance, vector dot product, and so on. Compared with other distance/similarity metrics defined on sets, such as Jaccard similarity, the embedding method is more flexible and has nice properties such as transitivity [Zhang et al.2015].
To alleviate the large event space issue and enable efficient model learning, we come up with two strategies: (1) at the model level, instead of modeling all possible interactions among entities, we only consider pairwise interactions that reflect the strength of co-occurrences of entities [Rendle2010]; and (2) at the learning level, we propose using noise-contrastive estimation [Gutmann and Hyvärinen2010] with a “context-dependent” noise distribution.
The pairwise interaction is intuitive/interpretable, efficient to compute, and less susceptible to overfitting. Consider the following anomaly example we may encounter in real scenarios:

A maintenance program is usually triggered at midnight, but suddenly it is triggered during the day.

A user usually connects to servers with low privilege, but suddenly tries to access a server with high privilege.
In these examples, abnormal behaviors occur as a result of the unusual pairwise interaction among entities (process and time in the first example, and user and machine in the second example).
3.2 The probabilistic model for event
We model the probability of a single event $e = (a_1, \dots, a_m)$ in event space $\Omega$ using the following parametric form:
$$P_\theta(e) = \frac{\exp\big(S_\theta(e)\big)}{\sum_{e' \in \Omega} \exp\big(S_\theta(e')\big)} \quad (1)$$
where $\theta$ is the set of parameters and $S_\theta(e)$ is the scoring function for a given event $e$ that quantifies its normality. We instantiate the scoring function by pairwise interactions among embedded entities:
$$S_\theta(e) = \sum_{1 \le i < j \le m} w_{ij}\,\big(v_{a_i} \cdot v_{a_j}\big) \quad (2)$$
where $v_{a_i}$ is the embedding vector for entity $a_i$, and the dot product between a pair of entity embeddings $v_{a_i}$ and $v_{a_j}$ encodes the compatibility of the two entities co-occurring in a single event. $w_{ij}$ is the weight for the pairwise interaction between entity types $A_i$ and $A_j$, and it is constrained to be non-negative, i.e. $w_{ij} \ge 0$ for all $i, j$. Different pairs of entity types can have different importance: interactions among some pairs of entity types are very regular and important, e.g. user and machine, while others may be less regular and important, e.g. day and hour. Using $w_{ij}$, the model can automatically learn the importance of different pairwise interactions. Finally, $\theta$ denotes all parameters used in the model, i.e. the embedding vectors and the pairwise weights.
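The scoring function can be illustrated with a minimal pure-Python sketch. The data layout (`embeddings` as one dictionary per entity type, `weights` keyed by type-index pairs) and all numeric values are assumptions made for illustration; they are not taken from the authors' implementation.

```python
from itertools import combinations

def dot(u, v):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(u, v))

def score(event, embeddings, weights):
    """Weighted sum of pairwise embedding dot products for one event.

    event      -- tuple of entity ids, one per entity type (position i = type i)
    embeddings -- embeddings[i][entity_id] is the vector of that entity of type i
    weights    -- weights[(i, j)] >= 0 for every pair of types with i < j
    """
    total = 0.0
    for i, j in combinations(range(len(event)), 2):
        v_i = embeddings[i][event[i]]
        v_j = embeddings[j][event[j]]
        total += weights[(i, j)] * dot(v_i, v_j)
    return total

# Toy example: three entity types, 2-dimensional embeddings (made-up values).
embeddings = [
    {"alice": [1.0, 0.0], "bob": [0.0, 1.0]},    # type 0: user
    {"bash": [1.0, 1.0], "ssh": [-1.0, 0.5]},    # type 1: process
    {"day": [0.5, 0.5], "night": [-0.5, 1.0]},   # type 2: time
]
weights = {(0, 1): 1.0, (0, 2): 0.5, (1, 2): 2.0}

s = score(("alice", "bash", "day"), embeddings, weights)
print(s)
```

Only pairs of different entity types contribute, so an event with $m$ entity types needs $m(m-1)/2$ dot products.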
Our model APE, short for Anomaly detection via Probabilistic pairwise interaction and Entity embedding, is summarized in Figure 2.
The learning problem is to optimize the following maximum likelihood objective over events in the training data $\mathcal{D}$:
$$\arg\max_\theta \sum_{e \in \mathcal{D}} \log P_\theta(e) \quad (3)$$
To solve the optimization problem, the major challenge is that the denominator in Eq. 1 sums over all possible event configurations, whose number is prohibitively large (exponential in the number of entity types). To address this issue, we propose using Noise-Contrastive Estimation.
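To make the cost of exact normalization concrete, the following sketch enumerates a tiny event space and computes the log-partition function by brute force. The scorer `toy_score` and the domain sizes are hypothetical and exist only to keep the example small.

```python
import math
from itertools import product

def exact_log_partition(score_fn, domains):
    """log of the sum of exp(score) over every event in the Cartesian product."""
    scores = [score_fn(e) for e in product(*domains)]
    top = max(scores)  # subtract the maximum for numerical stability
    return top + math.log(sum(math.exp(s - top) for s in scores))

def event_space_size(domains):
    """Number of distinct events: the product of the per-type arities."""
    return math.prod(len(d) for d in domains)

# Hypothetical toy scorer: rewards events whose entity ids are all equal.
def toy_score(event):
    return 1.0 if len(set(event)) == 1 else 0.0

domains = [range(3), range(3), range(3)]  # 3 types, 3 entities each
log_z = exact_log_partition(toy_score, domains)
prob_match = math.exp(toy_score((0, 0, 0)) - log_z)
print(event_space_size(domains), prob_match)
```

The enumeration touches every event once, so it is only feasible for toy domains; with ten entity types of one hundred entities each, the same loop would visit $100^{10}$ events.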
3.3 Learning via noise-contrastive estimation
Noise-Contrastive Estimation (NCE) was introduced in [Gutmann and Hyvärinen2010] for density estimation, and has been applied to estimate language models [Mnih and Teh2012] and word embeddings [Mnih and Kavukcuoglu2013, Mikolov et al.2013a, Mikolov et al.2013b]. The basic idea of NCE is to reduce the problem of density estimation to binary classification, which is to discriminate samples from the data distribution $P_d(e)$ from samples of some artificial known noise distribution $P_n(e)$ (the selection of $P_n$ will be explained later). In other words, the samples fed to the APE model can come from the real training data set or be generated artificially, and the model is trained to classify them a posteriori.
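Assuming generated noise/negative samples are $k$ times more frequent than observed data samples, the posterior probability that an event $e$ came from the data distribution is $P(D=1 \mid e, \theta) = \frac{P_\theta(e)}{P_\theta(e) + kP_n(e)}$. To fit the objective in Eq. 3, we maximize the expectation of $\log P(D \mid e, \theta)$ under the mixture of data and noise/negative samples [Gutmann and Hyvärinen2010, Mnih and Teh2012]. This leads to the following new objective function:
$$J(\theta) = \mathbb{E}_{e \sim P_d}\left[\log \frac{P_\theta(e)}{P_\theta(e) + kP_n(e)}\right] + k\,\mathbb{E}_{e' \sim P_n}\left[\log \frac{kP_n(e')}{P_\theta(e') + kP_n(e')}\right] \quad (4)$$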
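As a short worked step, the two posterior terms of the NCE objective can be rewritten with the sigmoid function $\sigma(x) = 1/(1+e^{-x})$. The notation here ($P_\theta$ for the model, $P_n$ for the noise distribution, $k$ for the noise-to-data ratio) is assumed from the standard NCE formulation:

```latex
\begin{align*}
\frac{P_\theta(e)}{P_\theta(e) + kP_n(e)}
  &= \frac{1}{1 + kP_n(e)/P_\theta(e)}
   = \frac{1}{1 + \exp\!\big(-(\log P_\theta(e) - \log kP_n(e))\big)} \\
  &= \sigma\big(\log P_\theta(e) - \log kP_n(e)\big), \\[4pt]
\frac{kP_n(e')}{P_\theta(e') + kP_n(e')}
  &= 1 - \sigma\big(\log P_\theta(e') - \log kP_n(e')\big)
   = \sigma\big(\log kP_n(e') - \log P_\theta(e')\big).
\end{align*}
```

Both terms therefore depend on the model and the noise distribution only through the difference $\log P_\theta - \log kP_n$, which is why a constant shift in the noise term can be absorbed by a constant shift in the model's log-probability.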
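However, in this new objective function, the model distribution $P_\theta(e)$ is still too expensive to evaluate. NCE sidesteps this difficulty by avoiding explicit normalization and treating the normalization constant as a parameter. This leads to $P_\theta(e) = P^0_{\theta^0}(e)\exp(c)$, where $P^0_{\theta^0}(e) = \exp\big(S_{\theta^0}(e)\big)$ is the unnormalized model, $\theta = \{\theta^0, c\}$, and $c$ replaces the original log-partition function with a single parameter that is learned to normalize the whole distribution. Now we can rewrite the event probability function in Eq. 1 as follows:
$$P_\theta(e) = \exp\Big(\sum_{1 \le i < j \le m} w_{ij}\,\big(v_{a_i} \cdot v_{a_j}\big) + c\Big) \quad (5)$$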
To optimize the objective in Eq. 4 given the training data $\mathcal{D}$, we replace $P_d$ with $\hat{P}_d$ (the empirical data distribution), and since the APE model is differentiable, stochastic gradient descent is used: for each observed training event $e$, we first sample $k$ noise/negative samples $e'$ according to the known noise distribution $P_n$, and then update the parameters according to the gradients of the following objective function (which is derived from Eq. 4 on the given samples):
$$\log \sigma\big(\log P_\theta(e) - \log kP_n(e)\big) + \sum_{e'} \log \sigma\big(-\log P_\theta(e') + \log kP_n(e')\big) \quad (6)$$
Here $\sigma(x) = 1/(1+\exp(-x))$ is the sigmoid function.
The complexity of our algorithm is $O(Nkm^2d)$, where $N$ is the total number of observed events it is trained on, $k$ is the number of negative examples drawn for each observed event, $m$ is the number of entity types, and $d$ is the embedding dimension. The complexity indicates that the APE model can be learned efficiently regardless of the large event space.
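The per-event objective used in the stochastic updates can be sketched as follows. The function takes precomputed log-probabilities as inputs; the argument names and the toy numbers are assumptions for illustration, and gradient computation is left out.

```python
import math

def log_sigmoid(x):
    """Numerically stable log(1 / (1 + exp(-x)))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))

def nce_objective(log_p_data, log_kpn_data, negatives):
    """Per-event NCE objective (to be maximized).

    log_p_data   -- model log-probability (score plus constant c) of the observed event
    log_kpn_data -- log(k * P_n(e)) for the observed event
    negatives    -- list of (log_p, log_kpn) pairs, one per sampled negative event
    """
    value = log_sigmoid(log_p_data - log_kpn_data)
    for log_p_neg, log_kpn_neg in negatives:
        value += log_sigmoid(log_kpn_neg - log_p_neg)
    return value

# A model that ranks the observed event above its negatives scores higher
# than one that ranks it below them (toy numbers).
good = nce_objective(2.0, -1.0, [(-3.0, -1.0), (-2.5, -1.0)])
bad = nce_objective(-3.0, -1.0, [(2.0, -1.0), (1.5, -1.0)])
print(good, bad)
```

Each term is a log-probability of a correct binary classification, so the objective is never positive and approaches zero as the model separates data from noise.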
3.4 “Context-dependent” noise distribution
To apply NCE, as shown in Eq. 6, we need to draw negative samples from some known noise distribution $P_n$. Intuitively, the noise distribution should be close to the data distribution; otherwise, the discriminating task would be too easy and the model could not learn much structure from the data [Gutmann and Hyvärinen2010]. Note that, different from previous work that utilizes NCE (such as language modeling or word embedding [Mnih and Teh2012, Mikolov et al.2013a]), where each negative sample only involves one word/entity, each event in our case involves multiple entities of different types.
One straightforward choice of noise distribution is the “context-independent” noise distribution, where a negative event is drawn independently and does not depend on the observed event. One can sample a negative event according to some factorized distribution on the event space, i.e. $P_n(e) = p_{A_1}(a_1) \cdots p_{A_m}(a_m)$. Here $p_{A_i}(a_i)$ is the probability of choosing entity $a_i$ of type $A_i$, which can be specified uniformly or computed by counting unigrams in the data. In this work we stick to unigrams, as they are reported to work better [Mnih and Teh2012, Mikolov et al.2013a].
Although the “context-independent” noise distribution is easy to evaluate, due to the large event space it would be very different from the data distribution, which leads to poor model learning.
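A minimal sketch of the context-independent scheme, assuming events are stored as tuples with one entity per type. The helper names (`unigram_tables`, `sample_independent`, `log_prob_independent`) and the toy training set are hypothetical.

```python
import math
import random
from collections import Counter

def unigram_tables(events):
    """Per-type empirical entity frequencies estimated from training events."""
    n_types = len(events[0])
    tables = []
    for i in range(n_types):
        counts = Counter(e[i] for e in events)
        total = sum(counts.values())
        tables.append({a: c / total for a, c in counts.items()})
    return tables

def sample_independent(tables, rng):
    """Draw every entity independently from its type's unigram distribution."""
    event = []
    for table in tables:
        entities = list(table)
        probs = [table[a] for a in entities]
        event.append(rng.choices(entities, weights=probs, k=1)[0])
    return tuple(event)

def log_prob_independent(event, tables):
    """log of the factorized noise probability of an event."""
    return sum(math.log(tables[i][a]) for i, a in enumerate(event))

train = [("u1", "bash"), ("u1", "ssh"), ("u2", "bash"), ("u1", "bash")]
tables = unigram_tables(train)
neg = sample_independent(tables, random.Random(0))
print(neg, log_prob_independent(neg, tables))
```

Because every position is resampled, a negative drawn this way rarely resembles any observed event once the number of entity types grows.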
Here we propose a new “context-dependent” noise distribution where negative sampling depends on its context (i.e. the observed event). The procedure is as follows: for each observed event $e$, we first uniformly sample an entity type $A_i$, and then sample a new entity $a'_i$ of that type to replace $a_i$ and form a new negative sample $e'$. As we only modify one entity in the observed event, the noise distribution will be close to the data distribution, which can lead to better model learning. However, with the new “context-dependent” noise generation, it becomes very hard to compute the exact noise probability $P_n(e')$. Therefore, we use the following approximation instead.
For a given observed “context” event $e$, we define the “context-dependent” noise distribution for a sampled event $e'$ as $P^c_n(e' \mid e)$. Since $e'$ is sampled by randomly replacing one of the entities $a_i$ with $a'_i$ of the same type, the conditional probability is $P^c_n(e' \mid e) = p_{A_i}(a'_i)/m$ (here we assume $A_i$ is chosen uniformly). Considering the large event space, it is unlikely that event $e'$ is generated from observed events other than $e$, so we can approximate the noise distribution with $P^c_n(e') \approx P^c_n(e' \mid e)\,P_d(e)$. Furthermore, as $P_d(e)$ is usually small for most events, we simply set it to some constant $l$, which leads to the final noise distribution term (which is used in Eq. 6):
$$\log kP^c_n(e') \approx \log p_{A_i}(a'_i) + z$$
where $z = \log(kl/m)$ is a constant term. Although we do not know the exact value of $z$, we let $z = 0$ when plugging the approximated $\log kP^c_n(e')$ into Eq. 6. We find that ignoring $z$ only leads to a constant shift of the learned parameter $c$. Since $c$ is just the global normalization term, it does not affect the relative normal/abnormal scores of different events.
To compute $\log kP^c_n(e)$ for an observed event $e$, since we do not know which entity is replaced as in the negative event case, we use the expectation as follows:
$$\log kP^c_n(e) \approx \frac{1}{m}\sum_{i=1}^{m} \log p_{A_i}(a_i) + z$$
Again, $z$ is ignored when plugging into Eq. 6.
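The context-dependent procedure can be sketched as below. It assumes that the replacement entity is drawn from the per-type unigram distribution and that the constant term is dropped; the unigram tables and function names are hypothetical.

```python
import math
import random

def sample_context_dependent(event, tables, rng):
    """Replace one uniformly chosen entity of `event` with a unigram draw.

    Returns the negative event and the approximated log noise term, i.e. the
    log unigram probability of the replacement (the constant is dropped).
    """
    i = rng.randrange(len(event))
    entities = list(tables[i])
    probs = [tables[i][a] for a in entities]
    new_entity = rng.choices(entities, weights=probs, k=1)[0]
    negative = event[:i] + (new_entity,) + event[i + 1:]
    return negative, math.log(tables[i][new_entity])

def log_noise_observed(event, tables):
    """Approximated noise term of an observed event: mean log unigram probability."""
    return sum(math.log(tables[i][a]) for i, a in enumerate(event)) / len(event)

# Hypothetical per-type unigram tables.
tables = [{"u1": 0.75, "u2": 0.25}, {"bash": 0.5, "ssh": 0.5}]
observed = ("u1", "bash")
negative, log_term = sample_context_dependent(observed, tables, random.Random(1))
print(negative, log_term, log_noise_observed(observed, tables))
```

This sketch does not reject the case where the drawn replacement equals the original entity, so a negative can occasionally coincide with the observed event; whether to resample in that case is a design choice not specified here.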
4 Experiments
In this section, we evaluate the proposed method using real surveillance data collected in an enterprise system during a two-week period.
4.1 Data Sets
One of the main application scenarios of anomaly detection is to detect abnormal activity in surveillance data collected from computer systems. Hence, in our experiments, two weeks of activity data from an enterprise computer system are used. The collected surveillance data include two types of events, which are viewed as two separate data sets.
P2P. Process-to-process event data set. Each event of this type contains the system activity of a process interacting with another process; the time and user id of the event are also recorded. P2P events are among the most important system activities since modern operating systems are based on processes.
P2I. Process-to-Internet-socket event data set. Each event of this type contains the system activity of a process sending or receiving Internet connections to/from other machines at destination ports; the time and user id of the event are recorded as well. We only consider the P2I events within the enterprise system since we focus on activities inside the enterprise.
The entity types and their number of entities for both data sets are summarized in Table 1.
Table 1: Entity types and their arities for both data sets.

| Data | Types of entity and their arities |
|---|---|
| P2P | day (7), hour (24), uid (361), src proc (778), dst proc (1752), src folder (255), dst folder (415) |
| P2I | day (7), hour (24), src ip (59), dst ip (184), dst port (283), proc (91), proc folder (70), uid (162), connect type (3) |
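As a quick arithmetic check on the size of the event space, the arities listed in Table 1 can be multiplied out and compared with the number of week-1 events reported in Table 2. This is only an illustrative calculation; the observed fraction is an upper bound because the event counts may include repeated events.

```python
import math

# Arities copied from Table 1.
arities = {
    "P2P": [7, 24, 361, 778, 1752, 255, 415],
    "P2I": [7, 24, 59, 184, 283, 91, 70, 162, 3],
}
# Week-1 event counts copied from Table 2.
week1_events = {"P2P": 95_434, "P2I": 1_316_357}

for name, sizes in arities.items():
    space = math.prod(sizes)            # number of possible events
    pairs = math.comb(len(sizes), 2)    # number of entity-type pairs in the model
    fraction = week1_events[name] / space
    print(f"{name}: event space = {space:.3e}, "
          f"observed fraction <= {fraction:.3e}, type pairs = {pairs}")
```

The number of type pairs stays small (quadratic in the number of entity types), which is what keeps the pairwise model tractable even though the event space itself is enormous.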
We do not have ground-truth labels for the collected events; however, it is assumed that the majority of events are normal. In order to evaluate the anomaly detection task, similar to [Das and Schneider2007, Das et al.2008, Akoglu et al.2012], we create some artificial anomalies and ask the algorithms to detect them. The artificial anomaly events are generated as follows: for each event in the test data, we select $c$ of its entities (we consider $c = 1, 2, 3$ in the following experiments), randomly replace them with other entities of the same type, and make sure the newly generated events do not occur in either the training or the test data set, so that they can be considered more abnormal than observed events.
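The generation of artificial anomalies described above can be sketched as follows. The function name, the retry limit `max_tries`, and the toy domains are assumptions for illustration; the sketch also assumes every entity type has at least two entities.

```python
import random

def make_anomaly(event, domains, c, seen, rng, max_tries=1000):
    """Replace c entities of `event` with different same-type entities.

    domains -- domains[i] is the list of all entities of type i
    seen    -- set of events observed in the training or test data
    Returns a new event not in `seen`, or None if none is found in max_tries.
    """
    for _ in range(max_tries):
        candidate = list(event)
        for i in rng.sample(range(len(event)), c):
            others = [a for a in domains[i] if a != event[i]]
            candidate[i] = rng.choice(others)
        candidate = tuple(candidate)
        if candidate not in seen:
            return candidate
    return None

domains = [["u1", "u2", "u3"], ["bash", "ssh", "cron"], [0, 1, 2, 3]]
seen = {("u1", "bash", 0), ("u2", "ssh", 1)}
fake = make_anomaly(("u1", "bash", 0), domains, c=2, seen=seen, rng=random.Random(0))
print(fake)
```

Rejecting candidates that appear in `seen` mirrors the requirement that generated events occur in neither the training nor the test data.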
We split the two-week data into two one-week periods. The events in the first week are used as the training set (with a randomly selected portion used as a validation set for the selection of hyperparameters), and new events that only appeared in the second week are used as test sets. The statistics of observed events are summarized in Table 2.
Table 2: Statistics of observed events.

| Data | # week 1 | # week 2 | # week 2 new |
|---|---|---|---|
| P2P | 95,434 | 107,619 | 53,478 (49.69%) |
| P2I | 1,316,357 | 1,330,376 | 498,029 (37.44%) |
4.2 Comparing methods and settings
We compare the following state-of-the-art methods for abnormal event detection.
Condition. This method is proposed in [Das and Schneider2007]. For each test event, it computes the conditional scores for all pairs of dependent and mutually exclusive subsets having up to a fixed number of attributes, and combines the scores with a heuristic algorithm. The conditional score is calculated based on statistics of events in the training set, and reflects dependencies between two given attribute sets of an event.
CompreX. This method is proposed in [Akoglu et al.2012]. It utilizes a compression technique to encode the training data and learns a set of code tables that summarize patterns. When a new event comes, it first encodes the event using the existing code tables, and then the number of bits used in the encoding is treated as the abnormal score for the event.
APE. This is the proposed method. Note that we use the negative of its likelihood output as the abnormal score.
APE (no weight). This method is the same as APE, except that instead of learning $w_{ij}$, we simply fix all $w_{ij}$ to the same value, i.e. it is APE without automatic weight learning on pairwise interactions. All types of interactions are weighted equally.
For the (hyper)parameter settings, we use part of the training data as a validation set to tune (hyper)parameters. For Condition, we set . For CompreX, we adopt the authors' implementation, and since it is parameter-free, we do not need to tune any parameters. For both APE and APE (no weight), the following setting is used: the embeddings are randomly initialized, and the dimension is set to 10; for each observed training event, we draw 3 negative samples for each entity type, which accounts for a total of $3m$ negative samples per training instance; we also use mini-batches of size 128 to speed up stochastic gradient descent, and 5 to 10 epochs are generally enough for convergence.
4.3 Evaluation Metrics
Since all methods listed above produce abnormal scores instead of binary labels, and there is no fixed threshold, metrics for binary labels such as accuracy are not suitable for measuring performance. Similar to [Das and Schneider2007, Akoglu et al.2012], we adopt ROC curves (Receiver Operating Characteristic curves) and PRC (precision-recall curves) for evaluation. Both curves reflect the quality of the predicted scores according to their true labels at different threshold levels. A detailed discussion of the two metrics can be found in [Davis and Goadrich2006]. To obtain quantitative measurements, the AUC (area under curve) of both ROC and PRC is used.
4.4 Results for abnormal event detection
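Table 3: AUC of ROC and PRC (two values per cell) of different methods on the P2P and P2I data sets. The last two rows are mean scores averaged over three sampled smaller test sets.

| Models | P2P, c=1 | P2P, c=2 | P2P, c=3 | P2I, c=1 | P2I, c=2 | P2I, c=3 |
|---|---|---|---|---|---|---|
| Condition | 0.6296 / 0.6777 | 0.6795 / 0.7321 | 0.7137 / 0.7672 | 0.7733 / 0.7127 | 0.8300 / 0.7688 | 0.8699 / 0.8165 |
| APE (no weight) | 0.8797 / 0.8404 | 0.9377 / 0.9072 | 0.9688 / 0.9449 | 0.8912 / 0.8784 | 0.9412 / 0.9398 | 0.9665 / 0.9671 |
| APE | 0.8995 / 0.8845 | 0.9540 / 0.9378 | 0.9779 / 0.9639 | 0.9267 / 0.9383 | 0.9669 / 0.9717 | 0.9838 / 0.9861 |
| CompreX | 0.8230 / 0.7683 | 0.8208 / 0.7566 | 0.8390 / 0.7978 | 0.7749 / 0.8391 | 0.7834 / 0.8525 | 0.7832 / 0.8497 |
| APE | 0.9003 / 0.8892 | 0.9589 / 0.9394 | 0.9732 / 0.9616 | 0.9291 / 0.9411 | 0.9656 / 0.9729 | 0.9829 / 0.9854 |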
Table 3 shows the AUC of ROC and PRC of different methods on the P2P and P2I data sets. Note that the last two rows in Table 3 are mean scores averaged over three sampled smaller test sets, due to the slowness of CompreX at test time (it can take hundreds of hours to finish on the half-million P2I events). Figure 3 shows both ROC curves and PR curves for all methods using the test set for one setting of the number of replaced entities $c$ (results for the other settings are similar and thus not shown).
From the results we can see that, across different numbers of entity replacements, our method consistently and significantly outperforms both Condition and CompreX. When comparing APE with APE (no weight), we see that by considering weights and learning them automatically, the detection results can be further improved.
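For reference, the ROC AUC of a set of abnormal scores can be computed directly from its pairwise-ranking interpretation. This is a minimal sketch with made-up scores and labels, not the evaluation code used for the reported numbers; it is quadratic in the number of events and meant only for small examples.

```python
def roc_auc(scores, labels):
    """Probability that a random anomaly outscores a random normal event.

    scores -- abnormal scores (higher = more abnormal)
    labels -- 1 for anomalies, 0 for normal events
    Ties between an anomaly and a normal event count as half a win.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

scores = [0.9, 0.8, 0.35, 0.3, 0.1]
labels = [1, 0, 1, 0, 0]
auc = roc_auc(scores, labels)
print(auc)
```

Because the value depends only on the ranking of scores, no threshold has to be chosen, which matches the motivation for using AUC in the evaluation.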
Figure 3: ROC curves and PR curves for all methods. (a) P2P abnormal event detection. (b) P2I abnormal event detection.
The learned weight matrices for P2P and P2I events can be found in Figures 4 and 5, respectively. Each matrix is upper triangular since the pairwise interaction is symmetric and is modeled only between different types of entities. From the weights, we can see the importance of different types of interactions in the data sets. For example, in P2P events, the weight for the interaction between day and hour is insignificant, while the weight for the interaction between source process and destination process is large, indicating that they are highly dependent and capture the regularity of P2P events.
Table 4 shows some detected abnormal events (we only highlight the pairs of entities that have particularly low compatibility scores). In the first event, the interaction between the process bash and its folder is irregular and results in a small likelihood; in the second event, the abnormality is caused by a main user (who is usually active during working hours) being involved in an event at 1 a.m.; in the third event, the process ssh connects to an unexpected port 80, thus raising the alarm.
Table 4: Detected abnormal events.

| Data | Abnormal event |
|---|---|
| P2P | …, src proc: bash, src folder: /home/, … |
| P2P | …, uid: 9 (some main user), hour: 1, … |
| P2I | …, proc: ssh, dst port: 80, … |
4.5 Results for different noise distributions
Table 5 shows the performance under different choices of noise distribution. The results shown are collected from the test set with $c = 1$ (for $c = 2, 3$, the results are similar and thus not shown), using the same number of training events.
First we compare the “context-independent” noise distribution (first row) and the proposed “context-dependent” noise distribution (third row); clearly, the “context-dependent” one performs significantly better. This confirms that by using the proposed “context-dependent” noise distribution, the APE model can learn much more effectively given the same amount of resources.
We also examine the importance of the approximated noise probability term in Eq. 6. Simply ignoring the term by setting it to zero (second row), as is similarly done in [Mikolov et al.2013a, Mikolov et al.2013b], results in much worse performance compared to our proposed approximation.
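Table 5: Performance under different choices of noise distribution.

| Noise distribution | P2P | P2I |
|---|---|---|
| Context-independent | 0.8463 | 0.7534 |
| Context-dependent, noise term set to zero | 0.8176 | 0.7868 |
| Context-dependent, approximated noise term | 0.8845 | 0.9383 |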
Figure 6 shows the detection performance versus the number of negative samples drawn per entity type. As we can see, it only requires a reasonable number of negative samples to learn well, though adding more negative samples may marginally improve performances.
4.6 A case study for entity embedding
In order to see whether the learned embedding is meaningful, we use t-SNE [Van der Maaten and Hinton2008] to find 2-d coordinates of the original entity embeddings. Figure 7(a) shows the embedding of users in the P2P data. We color each user according to the user type. We find that, in the embedding space, similar types of users are clustered together, indicating that they play the same role [Chen et al.2016]; in particular, root users are grouped together and far away from other types of users, reflecting that root users behave very differently from other users. Figure 7(b) shows the embedding of hours in the P2I data. Although this is not known a priori, the APE model clearly learns the separation of working hours and non-working hours.
Knowing the types of users and differences among hours can be important for detecting abnormal events. The entity embedding learned by the APE model suggests it can distinguish the semantics/similarities of different entities, thus can help better detect anomalies.
5 Related Work
5.1 Anomaly Detection
There is a large body of literature on anomaly detection; a good summary of anomaly detection methods can be found in [Chandola et al.2009]. However, most of that work focuses on either numerical data or supervised settings.
As for unsupervised categorical anomaly detection, recent work includes [Das and Schneider2007, Das et al.2008, Akoglu et al.2012]. Most of these methods try to model the regular patterns behind the data and produce abnormal scores according to some heuristics, such as the number of compression bits for an event [Akoglu et al.2012].
There is some work on applying graph mining methods to anomaly detection in graphs [Tong et al.2008, Akoglu et al.2014]. However, our setting is different in the sense that, as shown in Section 2, when treating categorical events as a network, it is a heterogeneous network [Sun and Han2013].
There is also some work on anomaly detection for heterogeneous data [Ren et al.2009, Das et al.2010]. However, most of these methods are not suitable for event data due to the lack of a distance measure among data points. For example, [Das et al.2010] uses LCS to measure the distance between two sequences, which does not work for two events.
5.2 Embedding Methods
Embedding methods are widely studied in the graph/network setting [Belkin and Niyogi2001, Tang et al.2015]. More recently, there is some work [Bengio et al.2003, Mikolov et al.2013a, Mikolov et al.2013b] on natural language processing, which tries to embed words into some high dimensional space.
Our work also explores embedding methods; however, there are some fundamental differences between our method and other embedding methods. Firstly, many of those embedding methods aim to embed pairwise interactions, but they only consider one type of entity. For pairwise interactions between different types of entities, we provide a weighting scheme for distinguishing their importance. Secondly, existing embedding methods cannot be directly applied to predicting abnormal scores.
There is some work [Agovic et al.2009] applying graph embedding methods to anomaly detection in numerical data, where the distance among data points is easy to compute. However, as far as we know, embedding methods have not been explored in anomaly detection applications on categorical event data.
6 Conclusions
In this paper, we tackle the challenging problem of anomaly detection on heterogeneous categorical event data. Different from previous work that relies heavily on heuristics, we propose a principled and unified model that directly learns the likelihood of events. The model is instantiated by weighted pairwise interactions among entities that are quantified based on entity embeddings. Using Noise-Contrastive Estimation with a “context-dependent” noise distribution, our model can be learned efficiently regardless of the exponentially large event space. Experimental results on real enterprise surveillance data show that our method detects abnormal events more accurately than other state-of-the-art anomaly detection techniques.
As for future work, it would be interesting to consider the temporal correlations among multiple events instead of treating them independently, as many intrusions/attacks can involve a series of events.
Acknowledgement
We would like to thank Zhichun Li, Haifeng Chen, Guofei Jiang, and Jiawei Han for some helpful discussions. This work is partially supported by NSF CAREER #1453800, Northeastern TIER 1, and Yahoo! ACE Award.
References
 [Agovic et al.2009] Amrudin Agovic, Arindam Banerjee, Auroop Ganguly, and Vladimir Protopopescu. Anomaly detection using manifold embedding and its applications in transportation corridors. Intelligent Data Analysis, 13(3):435–455, 2009.
 [Akoglu et al.2012] Leman Akoglu, Hanghang Tong, Jilles Vreeken, and Christos Faloutsos. Fast and reliable anomaly detection in categorical data. In Proceedings of the 21st ACM international conference on Information and knowledge management, pages 415–424. ACM, 2012.
 [Akoglu et al.2014] Leman Akoglu, Hanghang Tong, and Danai Koutra. Graph based anomaly detection and description: a survey. Data Mining and Knowledge Discovery, 29(3):626–688, 2014.
 [Belkin and Niyogi2001] Mikhail Belkin and Partha Niyogi. Laplacian eigenmaps and spectral techniques for embedding and clustering. In Advances in Neural Information Processing Systems, volume 14, pages 585–591, 2001.

 [Bengio et al.2003] Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Janvin. A neural probabilistic language model. The Journal of Machine Learning Research, 3:1137–1155, 2003.
 [Chandola et al.2009] Varun Chandola, Arindam Banerjee, and Vipin Kumar. Anomaly detection: A survey. ACM computing surveys, 41(3):15, 2009.
 [Chen et al.2016] Ting Chen, LuAn Tang, Yizhou Sun, Zhengzhang Chen, Haifeng Chen, and Guofei Jiang. Integrating community and role detection in information networks. SIAM International Conference on Data Mining, 2016.
 [Das and Schneider2007] Kaustav Das and Jeff Schneider. Detecting anomalous records in categorical datasets. In Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 220–229. ACM, 2007.
 [Das et al.2008] Kaustav Das, Jeff Schneider, and Daniel B Neill. Anomaly pattern detection in categorical datasets. In Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 169–176. ACM, 2008.
 [Das et al.2010] Santanu Das, Bryan L Matthews, Ashok N Srivastava, and Nikunj C Oza. Multiple kernel learning for heterogeneous anomaly detection: algorithm and aviation safety case study. In Proceedings of the 16th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 47–56. ACM, 2010.
 [Davis and Goadrich2006] Jesse Davis and Mark Goadrich. The relationship between precision-recall and ROC curves. In Proceedings of the 23rd international conference on Machine learning, pages 233–240. ACM, 2006.

 [Gutmann and Hyvärinen2010] Michael Gutmann and Aapo Hyvärinen. Noise-contrastive estimation: A new estimation principle for unnormalized statistical models. In International Conference on Artificial Intelligence and Statistics, pages 297–304, 2010.
 [Mikolov et al.2013a] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781, 2013.
 [Mikolov et al.2013b] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems, pages 3111–3119, 2013.
 [Mnih and Kavukcuoglu2013] Andriy Mnih and Koray Kavukcuoglu. Learning word embeddings efficiently with noise-contrastive estimation. In Advances in Neural Information Processing Systems, pages 2265–2273, 2013.
 [Mnih and Teh2012] Andriy Mnih and Yee Whye Teh. A fast and simple algorithm for training neural probabilistic language models. arXiv preprint arXiv:1206.6426, 2012.

 [Ren et al.2009] Jiadong Ren, Qunhui Wu, Jia Zhang, and Changzhen Hu. Efficient outlier detection algorithm for heterogeneous data streams. In Fuzzy Systems and Knowledge Discovery, 2009. FSKD’09. Sixth International Conference on, volume 5, pages 259–264. IEEE, 2009.
 [Rendle2010] Steffen Rendle. Factorization machines. In 2010 IEEE 10th International Conference on Data Mining, pages 995–1000. IEEE, 2010.
 [Sun and Han2012] Yizhou Sun and Jiawei Han. Mining heterogeneous information networks: principles and methodologies. Synthesis Lectures on Data Mining and Knowledge Discovery, 3(2):1–159, 2012.
 [Sun and Han2013] Yizhou Sun and Jiawei Han. Mining heterogeneous information networks: a structural analysis approach. ACM SIGKDD Explorations Newsletter, 14(2):20–28, 2013.
 [Tang et al.2015] Jian Tang, Meng Qu, Mingzhe Wang, Ming Zhang, Jun Yan, and Qiaozhu Mei. LINE: Large-scale information network embedding. 2015.
 [Tong et al.2008] Hanghang Tong, Yasushi Sakurai, Tina Eliassi-Rad, and Christos Faloutsos. Fast mining of complex time-stamped events. In Proceedings of the 17th ACM conference on Information and knowledge management, pages 759–768. ACM, 2008.
 [Van der Maaten and Hinton2008] Laurens Van der Maaten and Geoffrey Hinton. Visualizing data using t-SNE. Journal of Machine Learning Research, 9:2579–2605, 2008.
 [Zhang et al.2015] Kai Zhang, Qiaojun Wang, Zhengzhang Chen, Ivan Marsic, Vipin Kumar, Guofei Jiang, and Jie Zhang. From categorical to numerical: Multiple transitive distance learning and embedding. SIAM International Conference on Data Mining, 2015.