Event sequence metric learning

02/19/2020 ∙ by Dmitrii Babaev, et al.

In this paper we consider the challenging problem of learning discriminative vector representations for event sequences generated by real-world users. Vector representations map raw behavioral client data to low-dimensional fixed-length vectors in a latent space. We propose a novel method of learning such vector embeddings based on a metric learning approach, together with a strategy for generating sub-sequences from the raw data so that metric learning can be applied in a fully self-supervised way. We evaluated the method on several public bank transaction datasets and showed that self-supervised embeddings outperform other methods when applied to downstream classification tasks. Moreover, the embeddings are compact and provide additional user privacy protection.




1. Introduction

We address the problem of learning representations for event sequences generated by real-world users, which we call lifestream data or lifestreams. Event sequence data is produced in many business applications, such as credit card transactions and click-stream data of internet site visits, and event sequence analysis is a very common machine learning problem (Laxman et al., 2008; Wiese and Omlin, 2009; Zhang et al., 2017; Bigon et al., 2019). A lifestream is an event sequence that is attributed to a person and captures their regular and routine actions of a certain type, e.g., transactions, search queries, phone calls and messages.

In this paper, we present a novel Metric Learning for Event Sequences (MeLES) method for learning low-dimensional representations of event sequences, which copes with specific properties of lifestreams such as their discrete nature. In a broad sense, the MeLES method adopts metric learning techniques (Xing et al., 2003; Hadsell et al., 2006). Metric learning is often used in a supervised manner for mapping high-dimensional objects to a low-dimensional embedding space. The aim of metric learning is to place semantically similar objects (images, video, audio, etc.) closer to each other and dissimilar ones farther apart. Most metric learning methods are used in applications such as speech recognition (Wan et al., 2017), computer vision (Schroff et al., 2015; Mao et al., 2019) and text analysis (Reimers and Gurevych, 2019). In these domains, metric learning is successfully applied in a supervised manner to datasets where pairs of high-dimensional instances are labeled as the same object or as different ones. Unlike these previous metric learning methods, MeLES is fully self-supervised and does not require any labels. It is based on the observation that lifestream data exhibits periodicity and repeatability of events in a sequence. Therefore, one can consider sub-sequences of the same lifestream as auxiliary high-dimensional representations of the same person. The idea of MeLES is that low-dimensional embeddings of such sub-sequences should be close to each other.

The self-supervised learning approach allows us to train rich models using the internal structure of large unlabeled or partially labeled training datasets. Self-supervised learning has demonstrated effectiveness in different machine learning domains, such as Natural Language Processing (e.g. ELMo (Peters et al., 2018), BERT (Devlin et al., 2019)) and computer vision (Doersch et al., 2015).

A MeLES model trained in a self-supervised manner can be used in two ways. Representations produced by the model can be used directly as a fixed vector of features in a supervised downstream task (e.g. a classification task), similarly to (Mikolov et al., 2013). Alternatively, the trained model can be fine-tuned (Devlin et al., 2019) for the specific downstream task.

We conducted experiments on two public bank transaction datasets and evaluated the performance of the method on downstream classification tasks. When MeLES representations are used directly as features, the method achieves strong performance, comparable to the baseline methods. The fine-tuned representations achieve state-of-the-art performance on downstream classification tasks, outperforming several other supervised methods and methods with unsupervised pre-training by a significant margin.

Moreover, we show that MeLES embeddings are superior to the supervised approach when only part of the raw data is labeled, since the available target labels are then insufficient to train a sufficiently complex model from scratch.

Embedding generation is a one-way transformation, hence it is impossible to restore the exact event sequence from its representation. Therefore, the usage of embeddings leads to better privacy and data security for the end users than working directly with the raw event data, and this is achieved without sacrificing valuable modelling power.

In this paper, we make the following contributions. We

  1. adapted the ideas of metric learning to the analysis of lifestream data in a novel, self-supervised manner;

  2. proposed a specific method, called Metric Learning for Event Sequences (MeLES), to accomplish this task;

  3. demonstrated that the proposed MeLES method significantly outperforms other baselines for both the supervised and the semi-supervised learning scenarios on lifestream data.

2. Related work

The metric learning approach, which underlies our MeLES method, has been widely used in different domains, including computer vision, NLP and audio domains.

In particular, a metric learning approach for face recognition was initially proposed in (Chopra et al., 2005), where a contrastive loss function was used to learn a mapping of the input data to a low-dimensional manifold using prior knowledge of neighborhood relationships between training samples or manual labeling. Further, in (Schroff et al., 2015), the authors introduced FaceNet, a method which learns a mapping from face images to 128-dimensional embeddings using a triplet loss function based on large margin nearest neighbor classification (LMNN) (Weinberger et al., 2006). In FaceNet, the authors also introduced online triplet selection and a hard-positive and hard-negative mining technique for the training procedure.

Metric learning has also been used for the speaker verification task (Wan et al., 2017), where the contrast loss requires the embedding of each utterance to be similar to the centroid of all of that speaker's embeddings (positive pair) and far from the centroid of the other speaker with the highest similarity among all false speakers (hard negative pair).

Finally, in (Reimers and Gurevych, 2019), the authors proposed a fine-tuned BERT model (Devlin et al., 2019) that uses metric learning in the form of siamese and triplet networks to train sentence embeddings for semantic textual similarity tasks using semantic proximity annotations of sentence pairs.

Although metric learning has been used in all these domains, it has not been applied to problems involving transactional, click-stream and other types of lifestream data, which is the focus of this paper.

Importantly, previous literature applied metric learning to their domains in a supervised manner, while our MeLES method adopts the ideas of metric learning in a novel fully self-supervised manner to the event sequence domain.

Another idea for applying self-supervised learning to sequential data was previously proposed in the Contrastive Predictive Coding (CPC) method (van den Oord et al., 2018), where meaningful representations are extracted by predicting the future in the latent space using autoregressive methods. CPC representations demonstrated strong performance in four distinct domains: audio, computer vision, natural language and reinforcement learning.

In the computer vision domain, there are many other approaches to self-supervised learning, which are summarized in (Jing and Tian, 2019). There are several ways to define a self-supervision task (pretext task) for an image. One option is to alter an image in some way and then try to restore the original image. Examples of this approach are super-resolution, image colorization and corrupted image restoration. Another option is to predict context information from local features, e.g. to predict the position of an image patch within an image with several missing patches.

Note that almost every self-supervised learning approach can be reused for the representation learning in the form of embeddings. There are several examples of using a single set of embeddings for several downstream tasks (Song et al., 2017), (Zhai et al., 2019).

One of the common approaches to learning self-supervised representations is either a traditional autoencoder (Rumelhart et al., 1985) or a variational autoencoder (Kingma and Welling, 2013). Autoencoders are widely used for images, text and audio, as well as for aggregated lifestream data (Mancisidor et al., 2019). Although autoencoders have been successfully used in the domains listed above, they have not been applied to raw lifestream data in the form of event sequences, mainly due to the challenges of defining distances between the input and the reconstructed input sequences.

In the next section, we describe how the ideas of metric learning are applied to the event sequences in a self-supervised manner.

3. The Method

3.1. Lifestream data

We designed the method specifically for lifestream data. Lifestream data consists of discrete events per person in continuous time, for example, behavior on websites, credit card transactions, etc.

Considering credit card transactions, each transaction has a set of attributes, either categorical or numerical, including the timestamp of the transaction. An example of a sequence of three transactions with their attributes is presented in Table 1. The merchant type field represents the category of a merchant, such as "airline", "hotel", "restaurant", etc.

Amount         230         5               40
Currency       EUR         USD             USD
Country        France      US              US
Time           16:40       20:15           09:30
Date           Jun 21      Jun 21          Jun 22
Merchant Type  Restaurant  Transportation  Household Appliance
Table 1. Data structure for a single credit card
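
As an illustration of this data structure, the three transactions of Table 1 can be held as a list of typed records. This is only a sketch: the class name `Transaction`, its field names and the helper `categorical_vocab` are hypothetical and are not taken from the paper or its companion code.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Transaction:
    """One event of a lifestream (field names are illustrative assumptions)."""
    amount: float        # numerical attribute
    currency: str        # categorical attribute
    country: str         # categorical attribute
    time: str            # kept as text, exactly as printed in Table 1
    date: str            # kept as text, exactly as printed in Table 1
    merchant_type: str   # categorical attribute


# The lifestream of a single card: events kept in chronological order.
lifestream: List[Transaction] = [
    Transaction(230.0, "EUR", "France", "16:40", "Jun 21", "Restaurant"),
    Transaction(5.0, "USD", "US", "20:15", "Jun 21", "Transportation"),
    Transaction(40.0, "USD", "US", "09:30", "Jun 22", "Household Appliance"),
]


def categorical_vocab(events: List[Transaction], field: str) -> Dict[str, int]:
    """Map each distinct value of a categorical field to an integer id.

    Id 0 is left free for unknown values (a common convention, assumed here).
    """
    values = sorted({getattr(event, field) for event in events})
    return {value: index + 1 for index, value in enumerate(values)}


currency_ids = categorical_vocab(lifestream, "currency")
```

Integer ids of this kind are what an embedding layer would consume for each categorical attribute.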

Another example of lifestream data is a click-stream: the log of web page visits. An example of a click-stream log of a single user is presented in Table 2.

Time   Date    Domain            Referrer Domain
17:40  Jun 21  amazon.com        google.com
17:41  Jun 21  amazon.com        amazon.com
17:45  Jun 21  en.wikipedia.org  google.com
Table 2. Click-stream structure for a single user

3.2. General framework

Figure 1. General framework

An overview of the method is presented in Figure 1. Given a sequence of discrete events $\{x_t\}_{t=1}^{T}$ in a given observation interval [1, T], the ultimate goal is to obtain a sequence embedding $c_T$ for the timestamp $T$ in the latent space $\mathbb{R}^d$. To train the encoder to generate a meaningful embedding $c_T$ from $\{x_t\}_{t=1}^{T}$, we apply a metric learning approach such that the distance between embeddings of the same person (positive pairs) is small, whereas the distance between embeddings of different persons (negative pairs) is large.

One difficulty in applying a metric learning approach to lifestream data is that the notions of semantic similarity and dissimilarity require underlying domain knowledge and a labor-intensive human labeling process to define positive and negative examples. The key property of the lifestream data domain is the periodicity and repeatability of events in a sequence, which allows us to reformulate the metric learning task in a self-supervised manner. MeLES learns low-dimensional embeddings from a person's sequential data, sampling positive pairs as sub-sequences of the same person's sequence and negative pairs as sub-sequences from different persons' sequences. See section 3.6 for details of positive pair generation.

The embedding is generated by an encoder neural network, which is described in section 3.3. Metric learning losses are described in section 3.4. Negative pair sampling strategies are described in section 3.5. Positive pair generation strategies are described in section 3.6.

The sequence embedding obtained by the metric learning approach is then used as a feature vector in various downstream machine learning tasks. Another possible way to improve downstream task performance is to feed a pre-trained embedding (e.g. the last layer of the RNN) to a task-specific classification subnetwork and then jointly fine-tune the parameters of the encoder and classifier subnetworks.

3.3. Encoder architecture

To embed a sequence of events into a fixed-size vector, we use an approach similar to the E.T.-RNN card transaction encoder proposed in (Babaev et al., 2019). The whole encoder network consists of two conceptual parts: the event encoder and the sequence encoder subnetworks.

The event encoder $e$ takes the set of attributes $x_t$ of a single event and outputs its representation in the latent space $\mathbb{R}^m$: $z_t = e(x_t)$. The sequence encoder $s$ takes the latent representations of the sequence of events $z_{1:T} = z_1, z_2, \dots, z_T$ and outputs the representation $c_t$ of the whole sequence at time-step $t$: $c_t = s(z_{1:t})$.

The event encoder consists of several embedding layers and a batch normalization (Ioffe and Szegedy, 2015) layer. One embedding layer is used to encode each categorical attribute of the event. Batch normalization is applied to the numerical attributes of the event. Finally, the outputs of every embedding layer and of the batch normalization layer are concatenated to produce the latent representation of a single event.
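
The event encoder described above can be sketched in plain Python. This is a simplified illustration under stated assumptions, not the authors' implementation: embedding tables are random rather than trained, batch normalization is reduced to standardizing each numerical column over the batch (no learned scale or shift), and all function names are hypothetical.

```python
import random
from statistics import mean, pvariance


def make_embedding_table(vocab_size, dim, rng):
    """One row of `dim` random weights per category id (untrained, for illustration)."""
    return [[rng.gauss(0.0, 0.1) for _ in range(dim)] for _ in range(vocab_size)]


def standardize(column, eps=1e-5):
    """Batch-norm-like step: zero mean and unit variance over the batch."""
    mu = mean(column)
    var = pvariance(column, mu)
    return [(x - mu) / (var + eps) ** 0.5 for x in column]


def encode_events(cat_ids, numeric, tables):
    """Concatenate embeddings of categorical ids with normalized numerical values.

    cat_ids: one list of category ids per event (one id per categorical attribute)
    numeric: one list of numerical attributes per event
    tables:  one embedding table per categorical attribute
    """
    n_numeric = len(numeric[0])
    norm_cols = [standardize([row[j] for row in numeric]) for j in range(n_numeric)]
    encoded = []
    for i, ids in enumerate(cat_ids):
        z = []
        for table, idx in zip(tables, ids):
            z.extend(table[idx])                 # embedding lookup
        z.extend(col[i] for col in norm_cols)    # normalized numerical attributes
        encoded.append(z)
    return encoded


rng = random.Random(0)
tables = [make_embedding_table(3, 4, rng), make_embedding_table(4, 2, rng)]
cat_ids = [[1, 1], [2, 2], [2, 3]]      # e.g. (currency id, merchant type id)
numeric = [[230.0], [5.0], [40.0]]      # e.g. transaction amount
z = encode_events(cat_ids, numeric, tables)
```

Each event vector here has 4 + 2 + 1 = 7 components: the two embedding widths plus one normalized numerical attribute.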

The sequence of latent event representations is passed to the sequence encoder to obtain a fixed-size vector $c_t$. Several approaches can be used to encode a sequence. One possible approach is to use a recurrent neural network (RNN) as in (Sutskever et al., 2014). Another is to use the encoder part of the Transformer architecture presented in (Vaswani et al., 2017). In both cases, the output produced for the last event can be used to represent the whole sequence of events. In the case of an RNN, this last output is the representation of the sequence.

An encoder based on an RNN-type architecture such as GRU (Cho et al., 2014) allows the embedding $c_{t+k}$ to be computed by updating the embedding $c_t$ instead of recomputing it from the whole sequence of past events $z_{1:t}$: $c_{t+k} = \mathrm{rnn}(c_t, z_{t+1:t+k})$. This reduces the inference time needed to update already existing person embeddings with new events that occurred after the embeddings were calculated. This is possible due to the recurrent nature of RNN-like networks.
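
The incremental update property can be checked with a toy recurrent cell. The sketch below deliberately swaps the GRU for a plain tanh recurrent cell with random weights, which is enough to show that continuing from a stored state gives the same result as re-encoding the full history; the names `make_cell`, `step` and `encode` are hypothetical.

```python
import math
import random


def make_cell(input_dim, hidden_dim, seed=0):
    """Random weights of a plain tanh recurrent cell (a stand-in for a GRU)."""
    rng = random.Random(seed)
    w_in = [[rng.uniform(-0.5, 0.5) for _ in range(input_dim)] for _ in range(hidden_dim)]
    w_h = [[rng.uniform(-0.5, 0.5) for _ in range(hidden_dim)] for _ in range(hidden_dim)]
    return w_in, w_h


def step(cell, c, z):
    """One recurrent update: new state from the previous state c and event vector z."""
    w_in, w_h = cell
    return [
        math.tanh(
            sum(w * x for w, x in zip(w_in[j], z))
            + sum(w * h for w, h in zip(w_h[j], c))
        )
        for j in range(len(c))
    ]


def encode(cell, events, c0):
    """Run the cell over a sequence of event vectors, starting from state c0."""
    c = list(c0)
    for z in events:
        c = step(cell, c, z)
    return c


cell = make_cell(input_dim=2, hidden_dim=3)
events = [[0.1, 0.2], [0.3, -0.1], [0.0, 0.5], [-0.2, 0.4]]
c0 = [0.0, 0.0, 0.0]

full = encode(cell, events, c0)          # embedding from the whole history
c_t = encode(cell, events[:2], c0)       # embedding stored after the first two events
updated = encode(cell, events[2:], c_t)  # update with only the two new events
```

Here `updated` matches `full`, so only the stored state and the new events are needed at inference time.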

3.4. Metric learning losses

A metric learning loss discriminates embeddings so that embeddings from the same class are moved closer together and embeddings from different classes are moved further apart. Several metric learning losses have been considered: contrastive loss (Hadsell et al., 2006), binomial deviance loss (Yi et al., 2014), triplet loss (Hoffer and Ailon, 2014), histogram loss (Ustinova and Lempitsky, 2016) and margin loss (Wu et al., 2017). All of these losses address the following challenge of the metric learning approach: using all pairs of samples is inefficient because, for example, some of the negative pairs are already distant enough, so these pairs are not valuable for training (Simo-Serra et al., 2015; Wu et al., 2017; Schroff et al., 2015).

In the next paragraphs we will consider two kinds of losses, which are conceptually simple, and yet demonstrated strong performance on validation set in our experiments (see Table 5), namely contrastive loss and margin loss.

Contrastive loss has a contrastive term for a negative pair of embeddings, which penalizes the model only if the negative pair is not distant enough, i.e. the distance between the embeddings is less than a margin $m$:

$$L = \sum_{i=1}^{P} \left[ (1 - Y_i)\,\tfrac{1}{2}\,(D_W^i)^2 + Y_i\,\tfrac{1}{2}\,\max(0,\, m - D_W^i)^2 \right]$$

where $P$ is the count of all pairs in a batch, $D_W^i$ is the distance between the $i$-th labeled pair of embeddings $X_1$ and $X_2$, $Y_i$ is a binary label assigned to the pair ($Y_i = 0$ means a similar pair, $Y_i = 1$ means a dissimilar pair), and $m > 0$ is the margin. As proposed in (Hadsell et al., 2006), we use the Euclidean distance as the distance function: $D_W^i = \lVert X_1 - X_2 \rVert_2$.
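
A minimal sketch of the contrastive loss follows, written in the form given by Hadsell et al. (2006). The label convention (0 for a similar pair, 1 for a dissimilar pair), the summation over pairs without averaging, and the default margin value are assumptions made for this illustration.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def contrastive_loss(pairs, labels, margin=1.0):
    """Sum of per-pair contrastive terms.

    pairs:  list of (embedding_1, embedding_2) tuples
    labels: 0 for a similar (positive) pair, 1 for a dissimilar (negative) pair
    """
    total = 0.0
    for (a, b), y in zip(pairs, labels):
        d = euclidean(a, b)
        if y == 0:
            total += 0.5 * d ** 2                        # pull positives together
        else:
            total += 0.5 * max(0.0, margin - d) ** 2     # push negatives beyond the margin
    return total


pairs = [
    ([0.0, 0.0], [3.0, 4.0]),   # positive pair at distance 5
    ([0.0, 0.0], [0.6, 0.8]),   # negative pair at distance 1, inside the margin
    ([0.0, 0.0], [3.0, 4.0]),   # negative pair at distance 5, already far enough
]
loss = contrastive_loss(pairs, labels=[0, 1, 1], margin=2.0)
```

The third pair contributes nothing to the loss, which is the behavior the text describes for negative pairs that are already distant enough.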

Margin loss is similar to the contrastive loss; the main difference is that margin loss does not penalize positive pairs that are closer than the threshold $b - m$:

$$L = \sum_{i=1}^{P} \left[ (1 - Y_i)\,\max(0,\, D_W^i - b + m) + Y_i\,\max(0,\, b - D_W^i + m) \right]$$

where $P$ is the count of all pairs in a batch, $D_W^i$ is the distance between the $i$-th labeled pair of embeddings $X_1$ and $X_2$, $Y_i$ is a binary label assigned to the pair ($Y_i = 0$ means a similar pair, $Y_i = 1$ means a dissimilar pair), and $b + m$ and $b - m$ define the bounds of the margin.
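
The margin loss can be sketched in the same style, following the form in Wu et al. (2017). The parameter names `b` (boundary) and `m` (half-width of the margin), the label convention (0 similar, 1 dissimilar) and the numeric values are assumptions for illustration only.

```python
def margin_loss(distances, labels, b=1.0, m=0.2):
    """Sum of per-pair margin terms.

    distances: precomputed distance for each pair
    labels:    0 for a similar (positive) pair, 1 for a dissimilar (negative) pair
    Positive pairs are penalized only when farther apart than b - m,
    negative pairs only when closer than b + m.
    """
    total = 0.0
    for d, y in zip(distances, labels):
        if y == 0:
            total += max(0.0, d - b + m)
        else:
            total += max(0.0, b - d + m)
    return total


# Two positive pairs (distances 0.5 and 1.5) and two negative pairs (0.9 and 2.0).
loss = margin_loss([0.5, 1.5, 0.9, 2.0], [0, 0, 1, 1], b=1.0, m=0.2)
```

Only the positive pair at distance 1.5 and the negative pair at distance 0.9 are penalized; the close positive pair and the distant negative pair contribute zero.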

3.5. Negative sampling

Negative sampling is another way to address the issue that some negative pairs are already distant enough and thus are not valuable for training (Simo-Serra et al., 2015; Wu et al., 2017; Schroff et al., 2015). Hence, only a subset of the possible negative pairs is considered during loss calculation. Note that only samples from the current batch are considered. There are several possible strategies for selecting the most relevant negative pairs.

  1. Random sample of negative pairs

  2. Hard negative mining: generate k hardest negative pairs for each positive pair.

  3. Distance weighted sampling, where negative samples are drawn uniformly according to their relative distance from the anchor. (Wu et al., 2017)

  4. Semi-hard sampling, where we choose the negative example nearest to the anchor among the samples that are further away from the anchor than the positive example (Schroff et al., 2015).
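
Of the strategies listed above, hard negative mining is the simplest to sketch. The version below selects the k nearest negatives per anchor rather than per positive pair, which yields the same set of negatives for every positive pair sharing that anchor; the function name and the `owners` list of person ids are hypothetical.

```python
import math


def hard_negative_pairs(embeddings, owners, k):
    """For each anchor, pick the k closest samples that belong to a different person.

    embeddings: list of embedding vectors in the current batch
    owners:     person id of each embedding
    Returns a list of (anchor_index, negative_index) tuples.
    """
    pairs = []
    for i, anchor in enumerate(embeddings):
        negatives = [j for j in range(len(embeddings)) if owners[j] != owners[i]]
        # The hardest negatives are the ones closest to the anchor.
        negatives.sort(key=lambda j: math.dist(anchor, embeddings[j]))
        pairs.extend((i, j) for j in negatives[:k])
    return pairs


embeddings = [[0.0, 0.0], [0.1, 0.0], [1.0, 0.0], [5.0, 0.0]]
owners = [0, 0, 1, 2]   # the first two sub-sequences come from the same person
hard_pairs = hard_negative_pairs(embeddings, owners, k=1)
```

Only samples from the current batch are searched, in line with the note above.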

In order to select negative samples, we need to compute the pair-wise distances between all possible pairs of embedding vectors in a batch. To make this procedure more computationally efficient, we normalize the embedding vectors, i.e. project them onto a hyper-sphere of unit radius. Since $D(A, B)^2 = \lVert A \rVert^2 + \lVert B \rVert^2 - 2 (A \cdot B)$ and $\lVert A \rVert = \lVert B \rVert = 1$, to compute the Euclidean distance we only need to compute $\sqrt{2 - 2 (A \cdot B)}$.

To compute the dot product between all pairs in a batch, we just need to multiply the matrix of all embedding vectors of the batch by its transpose, which is a highly optimized procedure in most modern deep learning frameworks. Hence, the computational complexity of the negative pair selection is $O(n^2 h)$, where $h$ is the size of the output embeddings and $n$ is the size of the batch.
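
The distance computation through dot products can be written out as follows. This is a plain-Python sketch with nested loops standing in for the matrix product that a deep learning framework would perform; the function names are hypothetical.

```python
import math


def l2_normalize(v):
    """Project a vector onto the unit hyper-sphere."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def pairwise_distances(embeddings):
    """Euclidean distances between all pairs of L2-normalized embeddings.

    Uses D(A, B) = sqrt(2 - 2 * (A . B)), valid when both vectors have unit norm.
    """
    unit = [l2_normalize(v) for v in embeddings]
    # Gram matrix: the batch matrix multiplied by its transpose.
    gram = [[sum(a * b for a, b in zip(u, w)) for w in unit] for u in unit]
    # max(0, .) guards against tiny negative values caused by rounding.
    return [[math.sqrt(max(0.0, 2.0 - 2.0 * g)) for g in row] for row in gram]


distances = pairwise_distances([[1.0, 0.0], [0.0, 2.0], [-3.0, 0.0]])
```

For a batch of n vectors of size h, the Gram matrix costs n * n dot products of length h, which is where the quadratic dependence on the batch size comes from.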

3.6. Positive pairs generation

The following procedure is used to create a batch during MeLES training. $N$ initial sequences are taken for batch generation. Then, $K$ sub-sequences are produced for each initial sequence.

Pairs of sub-sequences produced from the same sequence are considered positive samples, and pairs from different sequences are considered negative samples. Hence, after positive pair generation each batch contains $N \times K$ sub-sequences used as training samples. There are $K - 1$ positive pairs and $(N - 1) \times K$ negative pairs per sample in a batch.
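
The pair counts per sample can be verified by direct enumeration. The sketch below uses 64 sequences and 5 sub-sequences per sequence, the values reported for the age task in Table 3; the function names are hypothetical.

```python
def batch_owner_labels(n_sequences, k_subsequences):
    """Person id of every sub-sequence in a batch, grouped by person."""
    return [person for person in range(n_sequences) for _ in range(k_subsequences)]


def pair_counts(owners, index):
    """Number of positive and negative partners of one sample in the batch."""
    positives = sum(
        1 for j, owner in enumerate(owners) if j != index and owner == owners[index]
    )
    negatives = sum(1 for owner in owners if owner != owners[index])
    return positives, negatives


owners = batch_owner_labels(n_sequences=64, k_subsequences=5)
positives, negatives = pair_counts(owners, index=0)
```

With these values the batch holds 320 sub-sequences, and each of them has 4 positive and 315 negative partners.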

There are several possible strategies for sub-sequence generation. The simplest strategy is random sampling of events without replacement. Another strategy is to randomly split the sequence into several sub-sequences with no intersection between them (see Algorithm 1). The third option is to use randomly selected slices of events, with possible intersection between slices (see Algorithm 2).

Note that the order of events in the generated sub-sequences is always preserved.

hyperparameters: $k$ - number of sub-sequences to be produced.
input: A sequence $S$ of length $l$.
output: $S_1, \dots, S_k$ - sub-sequences from $S$.
Generate a vector $inds$ of length $l$ with random integers from $[1, k]$.
for $i \leftarrow 1$ to $k$ do
       $S_i \leftarrow S[inds = i]$
end for
Algorithm 1 Disjoint sub-sequences generation strategy

hyperparameters: $m$, $M$ - minimal and maximal possible length of a sub-sequence. $k$ - number of sub-sequences to be produced.
input: A sequence $S$ of length $l$.
output: $S_1, \dots, S_k$ - sub-sequences from $S$.
for $i \leftarrow 1$ to $k$ do
       Generate random integer $l_i$, $m \le l_i \le M$
       Generate random integer $s$, $0 \le s \le l - l_i$
       $S_i \leftarrow S[s : s + l_i]$
end for
Algorithm 2 Random slices sub-sample generation strategy
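
The three sub-sequence generation strategies can be sketched in Python as follows. This is one reading of the procedures described in this section, not the authors' code: function and parameter names are hypothetical, slice lengths are clipped to the sequence length, and `sample_len` for the first strategy is an assumed parameter.

```python
import random


def random_samples(seq, k, sample_len, rng):
    """Random sampling of events without replacement, keeping the original order."""
    result = []
    for _ in range(k):
        indices = sorted(rng.sample(range(len(seq)), sample_len))
        result.append([seq[i] for i in indices])
    return result


def disjoint_subsequences(seq, k, rng):
    """Assign every event to one of k sub-sequences at random (no intersection)."""
    inds = [rng.randint(1, k) for _ in seq]
    return [[event for event, ind in zip(seq, inds) if ind == i] for i in range(1, k + 1)]


def random_slices(seq, k, min_len, max_len, rng):
    """Draw k contiguous slices of random length; slices may overlap."""
    slices = []
    for _ in range(k):
        length = rng.randint(min_len, min(max_len, len(seq)))
        start = rng.randint(0, len(seq) - length)
        slices.append(seq[start:start + length])
    return slices


rng = random.Random(0)
sequence = list(range(20))   # stand-in for 20 chronologically ordered events
parts = disjoint_subsequences(sequence, k=3, rng=rng)
slices = random_slices(sequence, k=5, min_len=5, max_len=10, rng=rng)
samples = random_samples(sequence, k=5, sample_len=8, rng=rng)
```

All three functions keep events in their original order, and `disjoint_subsequences` may return an empty sub-sequence when a label is never drawn, which a training pipeline would need to handle.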

4. Experiments

4.1. Datasets

In our research we used several publicly available datasets of bank transactions.

  1. Age group prediction competition (https://onti.ai-academy.ru/competition) - the task is to predict the age group of a person among 4 target classes, and accuracy is used as the performance metric. The dataset consists of 44M anonymized transactions representing 50k persons, with a target labeled for only 30k of them (27M out of 44M transactions); for the other 20k persons (17M out of 44M transactions) the label is unknown. Each transaction includes date, type (for example, grocery store, clothes, gas station, children's goods, etc.) and amount. We use all available 44M transactions for metric learning, excluding 10% held out as the test part of the dataset and 5% used for metric learning validation.

  2. Gender prediction competition (https://www.kaggle.com/c/python-and-analyze-data-final-project/) - the task is a binary classification problem of predicting the gender of a person, and the ROC-AUC metric is used. The dataset consists of 6.8M anonymized transactions representing 15k persons, of whom only 8.4k are labeled. Each transaction is characterized by date, type (e.g. "ATM cash deposit"), amount and Merchant Category Code (also known as MCC).

4.2. Experiment setup

For each dataset, we set apart 10% of the persons from the labeled part of the data as the test set, which we used for comparing different models.

Unless an alternative is explicitly mentioned, our experiments use contrastive loss and the random slices pair generation strategy.

For all methods, random search with 5-fold cross-validation over the train set is used for hyper-parameter selection. The hyper-parameters with the best out-of-fold performance on the train set are then chosen.

The final set of hyper-parameters used for MeLES is shown in Table 3.

                                                   Age task  Gender task
Learning rate                                      0.002     0.002
Number of samples in batch                         64        128
Number of epochs                                   100       150
Number of generated sub-samples (see Section 3.6)  5         5
Table 3. Hyper-parameters for MeLES training

For the evaluation of semi-supervised and self-supervised techniques (including MeLES), we used all transactions, including unlabeled data but excluding the test set, since those methods are suitable for partially labeled datasets or do not require labels at all.

4.2.1. Performance

Neural network training was performed on a single Tesla P100 GPU card. For the training part of MeLES, a single training batch is processed in 142 milliseconds. For the age prediction dataset, a single training batch contains 64 unique persons with 5 sub-samples per person, i.e. 320 training samples in total. The mean number of transactions per sample is 90, hence each batch contains around 28,800 transactions.

4.2.2. Baselines

We compare our MeLES method with the following two baselines. First, we consider the Gradient Boosting Machine (GBM) method (Friedman, 2001) on hand-crafted features. GBM can be considered a strong baseline for tabular data with heterogeneous features. In particular, GBM-based approaches achieve state-of-the-art results in a variety of practical tasks including web search, weather forecasting, fraud detection, and many others (Wu et al., 2010; Vorobev et al., 2019; Zhang and Haghani, 2015; Niu et al., 2019).

Second, we apply the recently proposed Contrastive Predictive Coding (CPC) (van den Oord et al., 2018), a self-supervised learning method which has shown excellent performance for sequential data in such traditional domains as audio, computer vision, natural language, and reinforcement learning.

The GBM-based model requires a large number of hand-crafted aggregate features produced from the raw transactional data. An example of an aggregate feature is the average spending amount in some category of merchants, such as hotels, over the entire transaction history. We used the LightGBM (Ke et al., 2017) implementation of the GBM algorithm with nearly 1k hand-crafted features. Please see the companion code for the details of producing the hand-crafted features.
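
An aggregate feature of the kind mentioned above can be computed as shown below. This is an illustrative sketch only; the function name, the input layout and the amounts are made up for the example and are not taken from the companion code.

```python
from collections import defaultdict


def mean_amount_by_category(transactions):
    """Average spending amount per merchant category over the whole history.

    transactions: iterable of (merchant_category, amount) tuples for one person
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for category, amount in transactions:
        totals[category] += amount
        counts[category] += 1
    return {category: totals[category] / counts[category] for category in totals}


history = [("hotel", 100.0), ("hotel", 50.0), ("restaurant", 30.0)]
features = mean_amount_by_category(history)
```

Each resulting value becomes one column of the tabular feature vector that a gradient boosting model is trained on.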

In addition to the baselines mentioned above, we compare our method with a supervised learning approach where the encoder sub-network and the classification sub-network are jointly trained on the downstream task target. Note that no pre-training is used in this case.

4.2.3. Design choices

In Table 4, Table 5, Table 6 and Table 7 we present the results of experiments on different design choices of our method.

As shown in Table 4, different choices of encoder architecture show comparable performance on the downstream tasks.

It is interesting to observe that even contrastive loss, which can be considered the basic variant of a metric learning loss, yields strong results on the downstream tasks (see Table 5). Our hypothesis is that an increase in model performance on the metric learning task does not always lead to an increase in performance on downstream tasks.

Also observe that hard negative mining leads to a significant increase in quality on downstream tasks in comparison to random negative sampling (see Table 7).

Another observation is that a more complex sub-sequence generation strategy (e.g. random slices) shows slightly lower performance on the downstream tasks in comparison to random sampling of events (see Table 6).

Encoder type
Table 4. Comparison of encoder types

Loss type
Contrastive loss
Binomial deviance loss
Histogram loss
Margin loss
Triplet loss
Table 5. Comparison of metric learning losses

Pair generation method
Random samples
Random disjoint samples
Random slices
Table 6. Comparison of pair generation strategies

Negative sampling strategy
Hard negative mining
Random negative sampling
Distance weighted sampling
Table 7. Comparison of negative sampling strategies

Figure 2. Embedding dimensionality vs. quality for age prediction task

Figure 3. Embedding dimensionality vs. quality for gender prediction task

Figure 2 shows that the quality on the downstream task increases with the dimensionality of the embedding. The best quality is achieved at size 800. A further increase in the dimensionality of the embedding reduces quality. The results can be interpreted as a bias-variance trade-off. When the embedding dimensionality is too small, too much information can be discarded (high bias). On the other hand, when the embedding dimensionality is too large, too much noise is added (high variance).

In Figure 3 we see a similar dependency. There is a plateau between 256 and 2048, where quality on the downstream task does not increase. The final embedding size used in the other experiments is 256.

Note that increasing the embedding size also linearly increases the training time and the volume of consumed memory on the GPU.

4.2.4. Embedding visualization

In order to visualize MeLES embeddings in 2-dimensional space, we applied the tSNE transformation (Maaten and Hinton, 2008) to them. tSNE maps a high-dimensional space to a low-dimensional one based on local relationships between points, so neighbouring vectors in the high-dimensional embedding space are pushed to be close in the 2-dimensional space. We colored the 2-dimensional vectors using the target values of the datasets.

Note that the embeddings were learned in a fully self-supervised way from raw user transactions, without any target information. A sequence of transactions represents a user's behavior, so the MeLES model captures behavioral patterns and places the embeddings of users with similar patterns close together.

tSNE vectors from the age prediction dataset are presented in Figure 4. We can observe 4 clusters: clusters for groups '1' and '2' are on the opposite sides of the cloud, clusters for groups '2' and '3' are in the middle.

Figure 4. 2D tSNE mapping of MeLES embeddings trained on age prediction task dataset, colored by age group labels

4.3. Results

4.3.1. Comparison with baselines

As shown in Table 8, our method generates sequence embeddings of lifestream data that achieve strong performance on downstream tasks, comparable to the performance of manually crafted features. Moreover, fine-tuned representations obtained by our method achieve state-of-the-art performance on both bank transaction datasets, outperforming all previous learning methods by a significant margin.

Furthermore, note that using sequence embeddings together with hand-crafted aggregate features leads to better performance than using only hand-crafted features or only sequence embeddings, i.e. it is possible to combine different approaches to get an even better model.

LightGBM on hand-crafted features
LightGBM on MeLES embeddings
LightGBM on both hand-crafted features and MeLES embeddings
Supervised learning
MeLES fine-tuning
LightGBM on CPC embeddings
Fine-tuned Contrastive Predictive Coding
Table 8. Final results on the downstream tasks

4.3.2. Semi-supervised setup

To evaluate our method under a restricted amount of labeled data, we use only part of the available target labels in the semi-supervised experiment. As in the supervised setup, we compare the proposed method with LightGBM over hand-crafted features and with Contrastive Predictive Coding (see Section 4.2.2). For both embedding generation methods (MeLES and CPC) we evaluate both the performance of LightGBM on the embeddings and the performance of fine-tuned models. In addition to these baselines, we compare our method with supervised learning on the available part of the data.

In Figures 5 and 6 we compare the quality of hand-crafted features and embeddings by training LightGBM on top of them. Moreover, Figures 7 and 8 compare single models trained on the downstream tasks considered in the paper. As the figures show, if labeled data is limited, MeLES significantly outperforms the supervised and other approaches. MeLES also consistently outperforms CPC for different volumes of labeled data.

Figure 5. Age group prediction task quality on features for different dataset sizes

Figure 6. Gender prediction task quality on features for different dataset sizes

Figure 7. Age group prediction task quality of single model for different dataset sizes

Figure 8. Gender prediction task quality of single model for different dataset sizes

In Figures 5 to 8, the rightmost point corresponds to all labels and the supervised setup. The x-axis is shown on a logarithmic scale.

5. Conclusions

In this paper, we adapted the ideas of metric learning to the analysis of lifestream data in a novel, self-supervised manner. As a part of this proposal, we developed the Metric Learning for Event Sequences (MeLES) method, which is based on self-supervised learning. In particular, the MeLES method can be used to produce embeddings of complex event sequences that can be effectively used in various downstream tasks. Our method can also be used for pre-training in semi-supervised settings.

We also empirically demonstrate that our approach achieves strong performance on several downstream tasks, significantly (see Section 4.3) outperforming both classical machine learning baselines on hand-crafted features and neural network based approaches. In the semi-supervised setting, where the amount of labeled data is limited, our method demonstrates even stronger results: it outperforms supervised methods by significant margins.

The proposed method of generating embeddings is convenient for production usage since almost no pre-processing is needed for complex event streams to get their compact embeddings. The pre-calculated embeddings can be easily used for different downstream tasks without performing complex and time-consuming computations on the raw event data. For some encoder architectures, such as those presented in Section 3.3, it is even possible to incrementally update the already calculated embeddings when additional new lifestream data arrives.

Another advantage of using event sequence based embeddings, instead of the raw explicit event data, is that it is impossible to restore the exact input sequence from its embeddings. Therefore, the usage of embeddings leads to better privacy and data security for the end users than when working directly with the raw event data, and all this is achieved without sacrificing valuable information for downstream tasks.

