
Generating multi-type sequences of temporal events to improve fraud detection in game advertising

04/07/2021
by   Lun Jiang, et al.

Fraudulent activities related to online advertising can potentially harm the trust advertisers put in advertising networks and sour the gaming experience for users. Pay-Per-Click/Install (PPC/I) advertising is one of the main revenue models in game monetization. Widespread use of the PPC/I model has led to a rise in click/install fraud events in games. The majority of traffic in ad networks is non-fraudulent, which makes it difficult for machine-learning-based fraud detection systems to deal with highly skewed labels. From the ad network standpoint, user activities are multi-type sequences of temporal events consisting of event types and corresponding time intervals. Time Long Short-Term Memory (Time-LSTM) network cells have proven effective in modeling intrinsic hidden patterns with non-uniform time intervals. In this study, we propose using a variant of Time-LSTM cells in combination with a modified version of the Sequence Generative Adversarial Network (SeqGAN) to generate artificial sequences that mimic the fraudulent user patterns in ad traffic. We also propose using a Critic network instead of Monte-Carlo (MC) roll-out in training SeqGAN to reduce computational costs. The GAN-generated sequences can be used to enhance the classification ability of event-based fraud detection classifiers. Our extensive experiments on synthetic data show that the trained generator can generate sequences with the desired properties, as measured by multiple criteria.


1. Introduction

Game developers can monetize their games by selling in-game ad placements to advertisers. In-game ads can be integrated into the game either as a banner in the background or as commercials during breaks (when a certain part of the game is completed). There are four main elements in the game advertising ecosystem: publishers or developers, advertisers (demand; advertisers can themselves be publishers), the advertising network, and users (supply) (Mouawi et al., 2019). Game advertising networks connect advertisers with game developers and serve billions of ads to user devices, triggering an enormous amount of ad events. For example, Unity Ads reports 22.9B+ monthly global ad impressions, reaching 2B+ monthly active end-users worldwide (https://www.businesswire.com/news/home/20201013005191/en/).

There are multiple types of ad events in the real world, e.g., request, start, view, click, install, etc. Each type stands for one specific kind of ad-related user action happening at a specific time. A complete ad life cycle can be depicted as a temporal sequence of ad events, each of which is a tuple of an event type and its corresponding time interval. Click and install are two kinds of ad events commonly associated with ad revenue. Pay-Per-Click (Kapoor et al., 2016) and Pay-Per-Install (Thomas et al., 2016) are the most widely used pricing models in advertising.

Naturally, as advertisers allocate more of their budgets into this ecosystem, more fraudsters tend to abuse the advertising networks and defraud advertisers of their money (Nagaraja and Shah, 2019). Fraudulent ad activities aimed at generating illegitimate ad revenues or unearned benefits are one of the major threats to these advertising models. Common types of fraudulent activities include fake impressions (Haider et al., 2018), click bots (Haddadi, 2010; Kudugunta and Ferrara, 2018), click farms (Oentaryo et al., 2014), etc. (Zhu et al., 2017a). Fraud in the advertising ecosystem is top of mind for advertisers, developers, and advertising networks. With their reputation and integrity on the line, advertising networks have focused a huge amount of effort on fraud detection (Jianyu et al., 2017; Dong et al., 2018; Mouawi et al., 2019; Nagaraja and Shah, 2019).

Studying the temporal behavior of ad event sequences helps identify the intrinsic hidden patterns produced by fake or malicious users in advertising networks. Given the massive ad activity data in game advertising networks, machine learning-based approaches have become popular in the industry.

However, it is not a straightforward task to train machine learning models directly on fraudulent and benign sequences collected from ad activities (Choi and Lim, 2020). The vast majority of ad traffic is non-fraudulent, and data labeling by human experts is time-consuming, which results in low availability of labeled fraud sequences and a high class imbalance between the fraud/non-fraud training data. Simply oversampling the minority fraud class can cause significant overfitting, while undersampling the majority non-fraud class may lead to information loss and yield a tiny training dataset (Ba, 2019). To mitigate this data availability problem, in this study we present a novel data generator that learns the intrinsic hidden patterns in sequential training data and generates high-quality emulated sequences.

The main contributions of our work can be summarized as follows:

  1. We build a data generator which is able to generate multi-type temporal sequences with non-uniform time intervals.

  2. We present a new application for event-based sequence GAN for fraud detection in game advertising.

  3. We propose a new way of sequence GAN training by employing a Critic network.

2. Related Work

Generative Adversarial Networks (GANs) (Goodfellow et al., 2014) have drawn significant attention as a framework for training generative models capable of producing synthetic data with desired structures and properties (Killoran et al., 2017). Ba (2019) proposed using GANs to generate data that mimics the training data, as an augmented oversampling method; the generated data is then used to assist the classification of credit card fraud (Ba, 2019).

Despite the remarkable success of GANs in generating realistic data, few studies focus on generating sequence data, because exploiting GANs to generate temporal sequences with intrinsic hidden patterns is more challenging. Recurrent Neural Network (RNN) solutions have become the state-of-the-art methods for modeling sequential data. Hyland et al. (2017) developed a Recurrent Conditional GAN (RCGAN) to generate real-valued multi-dimensional time series, and then used the generated series for supervised training (Esteban et al., 2017). The time series in their study were physiological signals sampled at specific fixed frequencies, whereas ad event data have a higher complexity in terms of non-uniform time intervals and discrete event types, and thus cannot be modeled as wave signals. In ad event sequences, two events with a short time interval tend to be more correlated than events with larger time intervals.

Killoran et al. (2017) proposed a GAN-based generative model for DNA, along with an activation-maximization technique for DNA sequence data. Their experiments show that these generative techniques can learn important structure from DNA sequences and can be used to design new DNA sequences with desired properties (Killoran et al., 2017). Similar to the previous study, their focus is on fixed-interval sequences. Zheng et al. (2019) adopted an LSTM-Autoencoder to encode benign users into a hidden space and proposed a One-Class Adversarial Network (OCAN) for the GAN training process. In their training framework, the discriminator is trained as a classifier to distinguish benign users, and the generator produces samples that are complementary to the representations of benign users (Zheng et al., 2019). However, since OCAN is not trained on a dataset of malicious users, it is hard to measure the quality of the generated sequences or to understand the pattern they follow.

2.1. Time-LSTM

Common LSTM cells have shown remarkable success in generating complex sequences with long-range structure in numerous domains (Graves, 2013). Recently, a combination of an LSTM cell with a dimension-reducing symbolic representation was proposed to forecast time series (Elsworth and Güttel, 2020). However, RNN models usually consider only the order of events and ignore their time intervals, so they are not suitable for processing non-uniformly distributed events generated in continuous time. This major drawback of traditional recurrent models led to the development of Phased LSTM (Neil et al., 2016), an LSTM variant for modeling event-based sequences. Neil et al. (2016) proposed adding a new time gate to the traditional LSTM cell. The time gate is controlled by a parametrized oscillation with three phases: it rises from 0 to 1 in the first phase, drops from 1 to 0 in the second phase, and remains inactive in the third phase. Xiao et al. (2017) proposed using an intensity function modulated synergistically by an RNN (Xiao et al., 2017). Further, the Time-Aware LSTM (T-LSTM) cell was proposed in (Baytas et al., 2017) to handle non-uniform time intervals in longitudinal patient records; they used the T-LSTM cell in an auto-encoder to learn a single representation for sequential patient records.

Zhu et al. (2017) proposed an LSTM variant named Time-LSTM to model users' sequential actions, in which LSTM cells are equipped with time gates to model time intervals (Zhu et al., 2017b). In their paper, Time-LSTM was used for recommending items to users. We find Time-LSTM cells useful for modeling event types and time intervals because of their ability to capture the intrinsic patterns of fraudulent sequences, and thus to understand the internal mechanisms by which fraudulent activities are generated. We implemented our own version of Time-LSTM cells in Keras and used it in the architecture of the generators and discriminators of our GAN models.

2.2. GAN for Sequence Data

When generating continuous outputs, gradient updates can be passed from the discriminator to the generator. For discrete outputs, however, backpropagation does not work properly due to the lack of differentiability. Yu et al. (2017) addressed the issue of training GAN models to generate sequences of discrete tokens. They proposed a sequence generation framework, called SeqGAN, that models the data generator as a stochastic parametrized policy in Reinforcement Learning (RL) (Yu et al., 2017). Their policy gradient employs MC search to approximate the state values, which is a computationally expensive step in the training loop. Moreover, SeqGAN is limited to discrete token generation; in our work, we propose a modified version of SeqGAN, in combination with Time-LSTM cells, that can generate both discrete tokens and continuous time intervals. To train the policy network efficiently, we employ a Critic network to approximate the return of a partially generated sequence, which speeds up the training process. This approach also opens the possibility of using a trained Critic network for early fraud detection from partial sequences.

Zhao et al. (2020) present an application of SeqGAN in recommendation systems. The paper addresses slow convergence and unstable RL training by using an Actor-Critic algorithm instead of MC roll-outs (Zhao et al., 2020). Their generator produces entire recommended sequences given the interaction history, while the discriminator learns to maximize the score of ground-truth sequences and minimize the score of generated sequences. At each step, their generator produces a token by top-k beam search based on the model distribution, whereas we directly sample from the distribution of the output token probabilities. While our methodologies are close, we aim at different goals: we optimize the generated data to solve the sample imbalance problem, while they optimize for better recommendations; therefore, different evaluation metrics are needed. Our training strategies also differ. For example, we use a Critic network as the baseline, whereas they use Temporal-Difference bootstrap targets. They pre-train the discriminator on generated data to reduce exposure bias, while we pre-train the discriminator on the actual training data to improve the metrics used in our experiments. More importantly, they do not include time intervals as an attribute in their model, while we do.

Recently, Smith and Smith (2020) proposed using two generators: a convolutional generator that transforms a single random vector into an RGB spectrogram image, and a second generator that receives the 2D spectrogram images and outputs a time series (Smith and Smith, 2020). In our work, we find the RL training process a more natural way to address the issue of generating discrete outputs.

3. Methodology

Notation. In this paper, all sequences and sets are denoted by bold letters such as $\mathbf{x}$. We use $|\mathbf{x}|$ to refer to the size/length of a sequence or set.

In this section, we introduce a new methodology to generate multi-type sequences using SeqGAN and Time-LSTM cells.

3.1. Definitions

An original ad event sequence of length $T$ is composed of two subsequences: the subsequence of event types $(e_1, \ldots, e_T)$ and the subsequence of time stamps $(t_1, \ldots, t_T)$. First, we transform the time stamps into time intervals, with $\tau_1 = 0$ and $\tau_k = t_k - t_{k-1}$ for $k > 1$. Then, we combine the event types and time intervals into a joint multi-type sequence $\mathbf{x}_{1:T}$:

(1)  $\mathbf{x}_{1:T} = \left((e_1, \tau_1), (e_2, \tau_2), \ldots, (e_T, \tau_T)\right)$

where $\mathbf{x}_{i:j}$ denotes a partial sequence from time step $i$ to time step $j$.
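As a minimal illustration of Eq. (1), the following Python sketch (with hypothetical event names and timestamps) converts raw timestamps into time intervals and pairs them with event types:

```python
# A minimal sketch of Eq. (1): turn (event types, timestamps) into the
# joint multi-type sequence ((e_1, tau_1), ..., (e_T, tau_T)).
def to_joint_sequence(event_types, timestamps):
    # tau_1 = 0; tau_k = t_k - t_{k-1} for k > 1
    intervals = [0.0] + [t - s for s, t in zip(timestamps, timestamps[1:])]
    return list(zip(event_types, intervals))

# Hypothetical event types and timestamps (in seconds):
events = ["INI", "start", "view", "click", "install"]
stamps = [0.0, 3.2, 15.8, 16.4, 60.1]
print(to_joint_sequence(events, stamps))
# -> [('INI', 0.0), ('start', 3.2), ('view', 12.6), ('click', 0.6),
#     ('install', 43.7)]  (up to floating-point rounding)
```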

3.2. Time-LSTM

In this paper, we adopt the Time-LSTM 1 cell from (Zhu et al., 2017b). The update equations of this Time-LSTM cell are as follows:

(2)  $i_m = \sigma\left(x_m W_{xi} + h_{m-1} W_{hi} + w_{ci} \odot c_{m-1} + b_i\right)$

(3)  $f_m = \sigma\left(x_m W_{xf} + h_{m-1} W_{hf} + w_{cf} \odot c_{m-1} + b_f\right)$

(4)  $T_m = \sigma\left(x_m W_{xt} + \sigma(\Delta t_m W_{tt}) + b_t\right)$

(5)  $c_m = f_m \odot c_{m-1} + i_m \odot T_m \odot \sigma_c\left(x_m W_{xc} + h_{m-1} W_{hc} + b_c\right)$

(6)  $o_m = \sigma\left(x_m W_{xo} + \Delta t_m W_{to} + h_{m-1} W_{ho} + w_{co} \odot c_m + b_o\right)$

(7)  $h_m = o_m \odot \sigma_h(c_m)$

$x_m$ is the input feature vector at time step $m$, which in our case is the embedding of the event type. $\Delta t_m$ is the input time interval at time step $m$. $i_m$, $f_m$, $T_m$, $o_m$ are the activations of the input, forget, time, and output gates, respectively. $c_m$, $h_m$ are the cell activation and hidden state, and $c_{m-1}$ is the cell state of the previous time step $m-1$. $\sigma$ is the sigmoid function, and $\sigma_c$, $\sigma_h$ are the $\tanh$ function. $W_{xi}$, $W_{hi}$, $W_{xf}$, $W_{hf}$, $W_{xt}$, $W_{tt}$, $W_{xc}$, $W_{hc}$, $W_{xo}$, $W_{to}$, $W_{ho}$ and $b_i$, $b_f$, $b_t$, $b_c$, $b_o$ are the weight parameters of the cell. $w_{ci}$, $w_{cf}$, $w_{co}$ are peephole parameters.

3.3. RL and Policy improvement to train GAN

We implement a modified version of the SeqGAN model to generate multi-type temporal sequences. Time-LSTM cells are utilized in our implementations of both the generator $G_\theta$ and the discriminator $D_\phi$.

The sequence generation process of our generator $G_\theta$ can be modeled as a sequential decision process in RL. The state at time step $m$ is defined as:

(8)  $S_m = \left(T_m, h_m\right) = f_\theta\left(\mathbf{x}_{1:m}\right)$

where $T_m$ is the time gate activation in (4) and $h_m$ is the hidden state in (7); $\mathbf{x}_{1:m}$ is the partial sequence at time step $m$, and $\theta$ denotes the trainable parameters of the Time-LSTM cell.

The action at time step $m$ is a combination of two parts, $a_m = (a_m^e, a_m^\tau)$, where $a_m^e$ selects the next event type $e_{m+1}$ and $a_m^\tau$ selects the next time interval $\tau_{m+1}$. A new partial sequence $\mathbf{x}_{1:m+1}$ is thus formed step by step, until a complete sequence of length $T$, as described in (1), is constructed.

To make decisions in this sequence generation process, we employ a hybrid policy to represent action spaces with both continuous and discrete dimensions (similar to the idea in (Neunert et al., 2020)). This policy chooses discrete event types and continuous time intervals, assuming their action spaces are independent. We use a categorical distribution and a Gaussian distribution to model the policy distributions of the event types and the time intervals, respectively, so the hybrid generator policy can be defined as:

(9)  $\pi_\theta(a_m \mid S_m) = \mathrm{Cat}\left(e_{m+1} \mid p_\theta(S_m)\right) \cdot \mathcal{N}\left(\tau_{m+1} \mid \mu_\theta(S_m), \sigma_\theta^2(S_m)\right)$

where $e_{m+1} \in \mathcal{E}$ and $\mathcal{E}$ is the set of all possible event types.

When generating a new event type and time interval at each step, we follow the generator policy: we sample independently from the categorical and normal distributions, concatenate the samples to obtain the action vector $a_m$, and append them to the current partial sequence to obtain a new partial sequence $\mathbf{x}_{1:m+1}$. Once a complete sequence $\mathbf{x}_{1:T}$ of length $T$ has been generated, we pass it to the discriminator $D_\phi$, which predicts the probability that the sequence is real rather than generated:

(10)  $D_\phi(\mathbf{x}_{1:T}) = P\left(\text{real} \mid \mathbf{x}_{1:T}\right)$

The feedback from $D_\phi$ is used in training so that $G_\theta$ can better learn to generate sequences similar to the real training data and thereby deceive $D_\phi$. Because the discrete data are not differentiable, gradients cannot be passed back to the generator as in image-based GANs.
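A sketch of a single generation step under this hybrid policy, assuming hypothetical policy heads `logits` (event types), `mu`, and `log_sigma` (time interval); clipping negative intervals to zero is our assumption:

```python
import tensorflow as tf

def sample_action(logits, mu, log_sigma):
    """logits: [batch, n_event_types]; mu, log_sigma: [batch, 1]."""
    event = tf.random.categorical(logits, num_samples=1)  # next event type
    eps = tf.random.normal(tf.shape(mu))
    tau = mu + tf.exp(log_sigma) * eps                    # next time interval
    tau = tf.nn.relu(tau)  # assumption: clip negative intervals to zero
    return event, tau
```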

The original SeqGAN training uses the Policy Gradient method with Monte-Carlo (MC) roll-outs to optimize the policy (Yu et al., 2017). To reduce variance in the optimization process, SeqGAN runs the roll-out policy from the current state to the end of the sequence multiple times and averages the returns. Here, we instead use an Actor-Critic method (Bhatnagar et al., 2007) with a Critic network in place of MC roll-outs to estimate the value of any state, which is computationally more efficient.

The Critic network models a state-dependent value $V_\psi(S_m)$ for a partially generated sequence under policy $\pi_\theta$, defined as the expected future return for the complete sequence as provided by the discriminator $D_\phi$:

(11)  $V_\psi(S_m) = \mathbb{E}_{\pi_\theta}\left[D_\phi(\mathbf{x}_{1:T}) \mid S_m\right]$

The value function parameters $\psi$ are updated during training by minimizing the mean squared error between the true return $D_\phi(\mathbf{x}_{1:T})$ and the value estimate $V_\psi(S_m)$:

(12)  $L(\psi) = \mathbb{E}\left[\left(D_\phi(\mathbf{x}_{1:T}) - V_\psi(S_m)\right)^2\right]$

The difference between them, $A_m = D_\phi(\mathbf{x}_{1:T}) - V_\psi(S_m)$, is called the advantage function; it is used in training the generator and helps to reduce variance.

The goal of training $G_\theta$ is to choose actions according to a policy that maximizes the expected return. The objective function of $G_\theta$ follows the Policy Gradient method (Sutton and Barto, 2018), whose gradient can be derived as:

(13)  $\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\left[\sum_{m} A_m \, \nabla_\theta \log \pi_\theta(a_m \mid S_m)\right]$

Because of the independence assumption we made, the policy term can be broken down into a categorical cross-entropy and a Gaussian log-likelihood as follows:

(14)  $\log \pi_\theta(a_m \mid S_m) = \log \mathrm{Cat}\left(e_{m+1} \mid p_\theta(S_m)\right) + \log \mathcal{N}\left(\tau_{m+1} \mid \mu_\theta(S_m), \sigma_\theta^2(S_m)\right)$
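A sketch of the corresponding losses of Eqs. (12)-(14), assuming all per-step tensors have been flattened to a common batch dimension and the sequence-level return $D_\phi(\mathbf{x}_{1:T})$ has been broadcast to every step; names are ours:

```python
import math
import tensorflow as tf

def actor_critic_losses(logits, mu, log_sigma, events, taus, returns, values):
    # Critic loss, Eq. (12): MSE between the true return D(x) and V(S_m).
    critic_loss = tf.reduce_mean(tf.square(returns - values))
    # Advantage A_m = D(x) - V(S_m); no gradient flows into the critic here.
    adv = tf.stop_gradient(returns - values)
    # Eq. (14): log pi = log Cat(e_{m+1}) + log N(tau_{m+1}; mu, sigma^2).
    log_cat = -tf.nn.sparse_softmax_cross_entropy_with_logits(
        labels=events, logits=logits)
    log_gauss = (-0.5 * tf.square((taus - mu) / tf.exp(log_sigma))
                 - log_sigma - 0.5 * math.log(2.0 * math.pi))
    # Eq. (13): ascend the advantage-weighted log-likelihood.
    actor_loss = -tf.reduce_mean(adv * (log_cat + log_gauss))
    return actor_loss, critic_loss
```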

The goal of training $D_\phi$ is to distinguish generated sequences from true sequences drawn from the training data (definitions of the training data and of the positive and negative datasets are given in section 4.1). $D_\phi$ is updated by minimizing a binary cross-entropy loss.

We train $G_\theta$ and $D_\phi$ alternately. Pseudocode for the entire process is shown in Algorithm 1.

Require: generator policy $G_\theta$; Critic $C_\psi$; discriminator $D_\phi$; positive dataset $\mathcal{S}^+$; negative dataset $\mathcal{S}^-$

1: Initialize $G_\theta$, $C_\psi$, $D_\phi$ with random weights $\theta$, $\psi$, $\phi$
2: Pre-train $G_\theta$ using MLE on $\mathcal{S}^+$
3: Pre-train $D_\phi$ via minimizing binary cross-entropy on $\mathcal{S}^+$ and $\mathcal{S}^-$
4: repeat
5:     for g-steps do
6:         Generate a batch of complete sequences $\mathbf{x}_{1:T} \sim G_\theta$
7:         Get total rewards $D_\phi(\mathbf{x}_{1:T})$ from the discriminator
8:         $x_1 \leftarrow$ initial token $(\mathrm{INI}, 0)$
9:         for $m = 1$ to $T - 1$ do
10:            Calculate the current state $S_m$ via Eq. (8)
11:            Sample $e_{m+1} \sim \mathrm{Cat}\left(p_\theta(S_m)\right)$
12:            Sample $\tau_{m+1} \sim \mathcal{N}\left(\mu_\theta(S_m), \sigma_\theta^2(S_m)\right)$
13:            Compute the value estimate $V_\psi(S_m)$ from the Critic
14:            Compute the advantage $A_m = D_\phi(\mathbf{x}_{1:T}) - V_\psi(S_m)$
15:        end for
16:        Update Critic parameters $\psi$ by minimizing Eq. (12)
17:        Update generator parameters $\theta$ via the policy gradient, Eq. (13)
18:    end for
19:    for d-steps do
20:        Generate a batch of sequences with $G_\theta$
21:        Sample a batch of sequences from $\mathcal{S}^+$
22:        Train the discriminator on both batches, updating $\phi$ by minimizing binary cross-entropy
23:    end for
24: until terminate condition satisfied
Algorithm 1. Sequence Generative Adversarial Nets

4. Data Experiments

Due to concerns about data privacy laws (e.g., GDPR, the General Data Protection Regulation, and CCPA, the California Consumer Privacy Act), and to protect confidential details of the Unity Fraud Detection service, we decided not to use real-world ad event data or patterns for this study, both to avoid data privacy issues and to prevent fraudsters from reverse-engineering the presented algorithms and rules to circumvent fraud detection systems. Instead, we conduct our experiments on a synthetic dataset emulating real-world ad events.

4.1. Synthetic Dataset

We define the synthetic dataset as $\mathcal{S}$, with event types drawn from $\mathcal{E} \cup \{\mathrm{PAD}, \mathrm{INI}\}$, where $\mathcal{E}$ is the set of hypothetical ad event types; PAD is reserved for padding and the end token; and INI is the dummy initial token marking the beginning of a sequence, which always comes with a zero initial time interval.

Each sequence in the synthetic dataset has a uniform length $T$, including the dummy initial step $(\mathrm{INI}, 0)$. For each of the following steps $k$, the event type $e_k$ is randomly sampled from $\mathcal{E}$ with equal probability, and the time interval $\tau_k$ is sampled from a Chi-Square distribution whose degrees of freedom $\nu_{e_k}$ are conditioned on $e_k$, i.e.:

(15)  $e_k \sim \mathrm{Uniform}(\mathcal{E})$

(16)  $\tau_k \sim \chi^2(\nu_{e_k})$

One example of a complete synthetic sequence can be generated as in the sketch below.
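A sketch of this sampling scheme, with a hypothetical event-type set and hypothetical degrees of freedom standing in for the values used in the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
EVENT_TYPES = ["A", "B", "C", "D"]       # hypothetical types in E
DOF = {"A": 1, "B": 2, "C": 3, "D": 4}   # hypothetical df per event type

def make_sequence(T=16):
    seq = [("INI", 0.0)]                 # dummy initial step
    for _ in range(T - 1):
        e = rng.choice(EVENT_TYPES)      # Eq. (15): uniform over E
        tau = rng.chisquare(DOF[e])      # Eq. (16): chi-square interval
        seq.append((e, float(tau)))
    return seq

print(make_sequence())
```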

Then, we split the synthetic dataset into a positive dataset $\mathcal{S}^+$ and a negative dataset $\mathcal{S}^-$ using a set of human-defined rules. These rules are variants of real-world rules observed in ad activities; in this study, we intentionally avoid using real patterns or rules that appear in actual fraud detection work, in order to prevent potential information leakage to fraudsters.

The rules we defined are as follows:

  1. A sequence starts with an event of a designated type.

  2. There are more than three distinct types of events after the initial token, and at least one of them is of a designated type.

  3. Each event of one designated type is paired with one and only one previous event of another designated type, and each such earlier event can be paired with at most one later event.

  4. For three designated ordered pairs of event types, the total number of events of the first type is greater than or equal to that of the second type.

  5. The time delay between any two consecutive events of the same type is no smaller than 10.

  6. The time delay between any two paired events (rule 3) is no greater than 50.

If a sequence follows more than three of the six rules, it is classified as a positive sequence $s^+ \in \mathcal{S}^+$; otherwise, it is classified as a negative sequence $s^- \in \mathcal{S}^-$ (see the labeling sketch after this paragraph).
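A minimal labeling sketch, treating each rule as a predicate over a sequence; the placeholder predicates below only approximate the rules above, whose exact event types are not reproduced here:

```python
# Split sketch: a sequence passing more than three of the six rules is
# labeled positive. Rules are predicates over [(event_type, interval), ...].
def label_sequence(seq, rules):
    passed = sum(1 for rule in rules if rule(seq))
    return "positive" if passed > 3 else "negative"

# Placeholder predicates (hypothetical stand-ins for the six rules):
rules = [
    lambda s: s[1][0] == "A",                  # rule 1: designated first type
    lambda s: len({e for e, _ in s[1:]}) > 3,  # rule 2: >3 distinct types
    lambda s: True, lambda s: True,            # rules 3-4: pairing/count checks
    lambda s: all(t >= 10 for _, t in s[1:]),  # rule 5: minimum delay
    lambda s: True,                            # rule 6: maximum paired delay
]
seq = [("INI", 0.0), ("A", 3.0), ("B", 12.0), ("C", 15.0), ("D", 11.0)]
print(label_sequence(seq, rules))  # -> "positive" (5 of 6 rules pass)
```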

The goal of GAN training is to teach the generator to learn the intrinsic human-defined patterns in the synthetic dataset $\mathcal{S}$ and to generate sequences satisfying as many of the above rules as possible. In a real-world application, such patterns can be hidden or unknown to human experts, but a GAN is expected to learn and reproduce patterns that are not intuitive to humans.

4.2. Evaluation Metric

In the last few years, several evaluation metrics for GANs have been introduced in the literature. Among them, the Fréchet Inception Distance (FID) (Heusel et al., 2017) has been used extensively (DeVries et al., 2019). However, a single metric is not enough to show the effectiveness of our training on multi-type sequences, because our sequences consist of a discrete categorical part (event types) and a continuous numerical part (time intervals). We therefore propose using multiple metrics: a Rule-Based Quality (RBQ) score, which checks whether sequences follow our validity rules; a Mean Absolute Deviation (MAD) metric, which checks whether event types are diverse; and Maximum Mean Discrepancy (MMD) (Fortet and Mourier, 1953), which measures dissimilarity between distributions of event types or time intervals; in addition to FID for time intervals. The arrows (↑/↓) after each metric name below show the direction of improvement.

RBQ (↑). The quality of a generated sequence is measured by a metric derived from the six rules defined in section 4.1. The intuition behind the RBQ score is that it is less probable for a generated sequence to follow multiple human-defined rules, so a sequence with more desired patterns deserves a higher quality score. Accordingly, RBQ weights rule combinations by their length, where an individual rule is considered a combination of length 1. The six rules are considered equally important in the calculation, and we employ a geometric series with common ratio $r$ for weighting. The RBQ score for a sequence $\mathbf{x}$ is defined as:

(17)  $\mathrm{RBQ}(\mathbf{x}) = \sum_{c \in \mathcal{C}(\mathbf{x})} r^{|c|}$

where $\mathcal{C}(\mathbf{x})$ is the set of all combinations of rules that sequence $\mathbf{x}$ follows, $c$ is one rule combination, and $|c|$ is its length. For example, if a sequence follows rules $i$ and $j$ from section 4.1, it contains the rule combinations $\{i\}$, $\{j\}$, and $\{i, j\}$, and thus yields an RBQ score of $2r + r^2$.
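A sketch of the RBQ computation in Eq. (17); the common ratio $r = 2$ below is an assumption, since the paper's concrete value is not reproduced here:

```python
from itertools import combinations

def rbq(satisfied_rules, r=2.0):
    """Sum r**len(c) over every combination c of satisfied rules."""
    score = 0.0
    for length in range(1, len(satisfied_rules) + 1):
        for combo in combinations(satisfied_rules, length):
            score += r ** len(combo)
    return score

# A sequence satisfying rules {1, 4}: combinations {1}, {4}, {1, 4}.
print(rbq({1, 4}))  # 2r + r^2 = 8.0 for r = 2
```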

MAD (↑). We propose using MAD to measure the statistical dispersion of the categorical part of the multi-type sequences, i.e., the event types, with dispersion serving as a proxy for the diversity of the generated event types. We one-hot encode the event types and compute the mean absolute deviation of each sequence from the median of all sequences; the median is more robust to noise than the mean and better suits categorical values. As a diversity oracle, we compare the MAD score of any batch of sequences against the MAD score of a batch sampled from our dataset, which serves as the comparison base. MAD can be computed as:

(18)  $\mathrm{MAD}(B) = \frac{1}{N T} \sum_{i=1}^{N} \sum_{k=1}^{T} \left\lVert \mathrm{onehot}\left(e_k^{(i)}\right) - \tilde{e}_k \right\rVert_1$

where $B$ is a batch of sequences, $N$ is the batch size, each sequence in $B$ has length $T$, $e_k^{(i)}$ is the event type at step $k$ of the $i$-th sequence in $B$, and $\tilde{e}_k$ is the elementwise median of the one-hot encoded event types at step $k$ across the batch $B$.
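A NumPy sketch of Eq. (18), with names of our choosing:

```python
import numpy as np

def mad(event_type_ids, n_types):
    """event_type_ids: int array of shape [batch, T]."""
    one_hot = np.eye(n_types)[event_type_ids]           # [batch, T, n_types]
    median = np.median(one_hot, axis=0, keepdims=True)  # per-step batch median
    return np.abs(one_hot - median).mean()              # mean absolute deviation

batch = np.random.default_rng(0).integers(0, 4, size=(64, 16))
print(mad(batch, n_types=4))
```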

FID (↓). We use FID to evaluate the numerical part of the multi-type sequences, i.e., the time intervals. It captures desirable properties, including the quality and diversity of the generated sequences, and performs well in terms of robustness and computational efficiency (Borji, 2019). The Fréchet distance between two Gaussians is defined as:

(19)  $d^2 = \left\lVert \mu_r - \mu_g \right\rVert_2^2 + \mathrm{Tr}\left(\Sigma_r + \Sigma_g - 2\left(\Sigma_r \Sigma_g\right)^{1/2}\right)$

where $(\mu_r, \Sigma_r)$ and $(\mu_g, \Sigma_g)$ are the means and covariances of the samples from the real data distribution and the model distribution, respectively.
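Eq. (19) can be estimated from two sample sets as in the following sketch, using SciPy's matrix square root:

```python
import numpy as np
from scipy.linalg import sqrtm

def fid(real, gen):
    """real, gen: arrays of shape [n_samples, dim]."""
    mu_r, mu_g = real.mean(axis=0), gen.mean(axis=0)
    cov_r = np.cov(real, rowvar=False)
    cov_g = np.cov(gen, rowvar=False)
    covmean = sqrtm(cov_r @ cov_g)
    if np.iscomplexobj(covmean):   # numerical noise can introduce tiny
        covmean = covmean.real     # imaginary parts; drop them
    return float(np.sum((mu_r - mu_g) ** 2)
                 + np.trace(cov_r + cov_g - 2 * covmean))
```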

MMD (↓). We also employ MMD to evaluate the time intervals. This measure computes the dissimilarity between two probability distributions $P$ and $Q$ using samples drawn independently from each. We use an unbiased estimator with the Radial Basis Function (RBF) kernel $k(x, y) = \exp\left(-\lVert x - y \rVert^2 / 2\sigma^2\right)$, which is:

(20)  $\widehat{\mathrm{MMD}}^2(P, Q) = \frac{1}{n(n-1)} \sum_{i \neq j} k(x_i, x_j) + \frac{1}{m(m-1)} \sum_{i \neq j} k(y_i, y_j) - \frac{2}{nm} \sum_{i,j} k(x_i, y_j)$

where $x_1, \ldots, x_n$ are samples from $P$ and $y_1, \ldots, y_m$ are samples from $Q$.
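A sketch of this estimator; the kernel bandwidth $\sigma$ is left as a free parameter:

```python
import numpy as np

def mmd2_rbf(x, y, sigma=1.0):
    """x: [n, d], y: [m, d]; unbiased squared-MMD with an RBF kernel."""
    def k(a, b):
        d2 = (np.sum(a ** 2, 1)[:, None] + np.sum(b ** 2, 1)[None, :]
              - 2 * a @ b.T)
        return np.exp(-d2 / (2 * sigma ** 2))
    kxx, kyy, kxy = k(x, x), k(y, y), k(x, y)
    n, m = len(x), len(y)
    return ((kxx.sum() - np.trace(kxx)) / (n * (n - 1))
            + (kyy.sum() - np.trace(kyy)) / (m * (m - 1))
            - 2 * kxy.mean())
```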

FIDH (↓). This metric is a variant of FID that treats the Time-LSTM hidden states in $D_\phi$ as samples from a continuous multivariate Gaussian distribution; the FID score is then computed between two sets of hidden states using Eq. (19). Hidden states can be viewed as representations of the input sequence, so FIDH uses information from both the discrete part and the continuous part of our multi-type sequences.

4.3. Experiment Setup

We use the $\mathcal{S}^+$ and $\mathcal{S}^-$ datasets defined in section 4.1 for model training and evaluation; the two datasets are of comparable size. As described in Algorithm 1, we first pre-train $G_\theta$ and $D_\phi$ until convergence, and then start RL training from the pre-trained models. In this section, we compare the generated sequences from the following models:

  • $G_0$: generator with randomly initialized model parameters.

  • $G_1$: generator pre-trained using MLE.

  • $G_2$: generator trained by Algorithm 1.

We monitor the training process and use the metrics defined in section 4.2 to evaluate model performance during training; these metrics are plotted in Figure 1.

Figure 1. Performance metrics calculated during the training process.

The ratio between g-steps and d-steps is fixed during training. $G_\theta$ and $D_\phi$ share the same batch size and are optimized with SGD at the same learning rate. We save models during the training process and present the best-performing models for evaluation.

As shown in Figure 1, $G_0$ is the randomly initialized model (blue line), $G_1$ is the model pre-trained by MLE (orange line), and $G_2$ is the generator trained with Algorithm 1 (green line). During pre-training, RBQ and MAD increase while FID and MMD decrease. After pre-training finishes and RL training starts, RBQ keeps increasing, and FID surges dramatically after a certain point (visible in Figure 1), where we stop training because the surge indicates a mode-collapsed generator. Notably, the FID score measured during training is calculated between a training batch sampled from $\mathcal{S}^+$ and an evaluation batch generated by $G_\theta$, both of the same batch size, which differs from the FID and FIDH scores reported in Table 1.

4.4. Experiment Results

To evaluate the performance of a generator after training, we use the trained generator to create test datasets; each model generates a fixed number of equally sized batches. We then perform a two-sample t-test to compare each test dataset with samples of the same size drawn from $\mathcal{S}^+$ and $\mathcal{S}^-$.

Oracle scores. Table 1 reports the mean values of the different metrics over the generated batches. In particular, the MAD score, the FID score, and the FID score on hidden units (FIDH) are each calculated using data sampled from $\mathcal{S}^+$ as the base for comparison.

Samp.  RBQ       MAD     FID        MMD     FIDH
G0     14.7290   1.1114  9719.9854  0.1563  5.4784
G1     81.0501   1.3208  103.7343   0.0002  3.3819
G2     123.5881  1.2503  187.6541   0.0002  2.6775
S+     122.1438  1.2880  0.0000     0.0000  —
S−     12.2944   1.4534  64.6673    0.0002  —
Table 1. Oracle metrics calculated using $\mathcal{S}^+$ as base

The results demonstrate that the sequences generated by $G_2$ have a significantly higher RBQ score than those generated by the MLE pre-trained generator $G_1$ and the randomly initialized generator $G_0$. The RBQ score of $G_2$ is close to that of samples drawn from $\mathcal{S}^+$, which shows the high quality of these sequences. It indicates that the generator learns the intrinsic patterns and rules in $\mathcal{S}^+$ during RL training and then generates sequences that mimic these patterns to deceive the discriminator.

From the perspective of FID and MAD, $G_2$ scores worse than the MLE pre-trained generator $G_1$. As discussed in section 4.2, FID evaluates the continuous distribution of the time intervals, and MAD measures the dispersion of the discrete event types in a sequence. In each sequence from the training dataset, the two features are intrinsically correlated; however, FID evaluates only the continuous part of the sequences and MAD only the discrete part, treating them as two independent distributions without attention to their internal correlation. As a result, although $G_1$ outperforms $G_2$ on FID for time intervals and on MAD for event types, it has a lower RBQ score than $G_2$, because the RBQ score is calculated from the joint distribution of time intervals and event types. The same logic applies to the FIDH score: $G_2$ has a lower FIDH score than $G_1$, because FIDH is calculated between the hidden-state representations of sequences, which use information from both time intervals and event types. Moreover, MMD converges quickly during training, and $G_1$ and $G_2$ perform almost identically with regard to MMD.

4.5. Discussion

Most metrics, such as FID, yield only a one-dimensional score and fail to distinguish between different failure cases. Given that $G_1$ and $G_2$ show similar performance on several metrics in Table 1, we propose using Precision and Recall for Distributions (PRD) (Sajjadi et al., 2018) to explain their differences. PRD compares a distribution $Q$ to a reference distribution $P$; the intuition is that precision measures how much of $Q$ lies within $P$, while recall measures how much of $P$ is covered by $Q$. Figure 2 compares the PRD curves of $G_0$, $G_1$, and $G_2$:

Figure 2. Comparison of the PRD curves of $G_0$, $G_1$, and $G_2$

From Figure 2, we can interpret the differences between $G_1$ and $G_2$: below a certain recall level, one of the two generators produces sequences closer to the training data, while for higher recall levels the other enjoys higher precision.

We next examine the representations learned by the discriminator that underpin its successful performance. We use t-SNE (Maaten and Hinton, 2008) to visualize the outputs of the Time-LSTM cell; nearby points in the representation space receive similar rewards from the discriminator. We generate sequences from each of $G_0$, $G_1$, and $G_2$ and pass them to the discriminators $D_0$, $D_1$, and $D_2$ trained with the corresponding generators.

(a) D0
(b) D1
(c) D2
Figure 3. Two-dimensional t-SNE embedding of the representations of the outputs of Time-LSTM cells

Figure 3 depicts the two-dimensional t-SNE embedding of the representations of the outputs of the Time-LSTM cells for each discriminator. The points are colored according to the predicted rewards from the discriminator; darker colors mean higher reward. We can clearly see in Figure 3 that discriminator $D_2$ has more points with relatively darker colors than discriminators $D_0$ and $D_1$, which means the discriminator trained with $G_2$ returns higher rewards for the generated sequences. On the other hand, discriminator $D_0$ appears to return roughly the same value for all sequences (a single color), which is expected because $D_0$ is a randomly initialized discriminator. The t-SNE embeddings can also be used for feature extraction on labeled data.
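A sketch of how such a panel could be produced, with random stand-in arrays in place of the actual Time-LSTM hidden states and discriminator rewards:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

hidden_states = np.random.randn(500, 64)  # stand-in for Time-LSTM outputs
rewards = np.random.rand(500)             # stand-in for D's predicted rewards

# Embed hidden states in 2D and color points by predicted reward.
emb = TSNE(n_components=2, random_state=0).fit_transform(hidden_states)
plt.scatter(emb[:, 0], emb[:, 1], c=rewards, cmap="viridis", s=8)
plt.colorbar(label="predicted reward")
plt.show()
```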

Radford et al. (2015) showed a way to build high-quality representations by training a GAN model and reusing parts of the generator and discriminator networks as feature extractors for other supervised tasks (Radford et al., 2015). This is a potentially promising future direction for this work.

5. Conclusions

In this paper, we have described, trained, and evaluated a SeqGAN methodology for generating artificial sequences that mimic fraudulent user patterns in ad traffic. We additionally employed a variant of the Time-LSTM cell to generate synthetic ad events with non-uniform time intervals between events. As this task poses new challenges, we presented a new solution that trains SeqGAN using a combination of MLE pre-training and a Critic network. The generator proposed in this paper can generate multi-type temporal sequences with non-uniform time intervals, which is one of the novelties of our methodology. We also proposed multiple criteria to measure the quality and diversity of the generated sequences, and our extensive experiments show that the generated multi-type sequences have the desired properties.

Furthermore, we compared the performance of our generator under different settings against randomly sampled data from our training datasets. We conclude that the SeqGAN-trained generator outperforms the MLE pre-trained generator, as measured by multiple criteria, including the RBQ and FIDH scores, which are appropriate for evaluating multi-type sequences.

6. Acknowledgments

The authors would like to thank Unity for giving the opportunity to work on this project during Unity’s HackWeek 2020.

References

  • H. Ba (2019) Improving detection of credit card fraudulent transactions using generative adversarial networks. arXiv preprint arXiv:1907.03355. Cited by: §1, §2.
  • I. M. Baytas, C. Xiao, X. Zhang, F. Wang, A. K. Jain, and J. Zhou (2017) Patient subtyping via time-aware lstm networks. In Proceedings of the 23rd ACM SIGKDD international conference on knowledge discovery and data mining, pp. 65–74. Cited by: §2.1.
  • S. Bhatnagar, R. S. Sutton, M. Ghavamzadeh, and M. Lee (2007) Natural gradient actor-critic algorithms. Automatica. Cited by: §3.3.
  • A. Borji (2019) Pros and cons of gan evaluation measures. Computer Vision and Image Understanding 179, pp. 41–65. Cited by: §4.2.
  • J. Choi and K. Lim (2020) Identifying machine learning techniques for classification of target advertising. ICT Express. Cited by: §1.
  • T. DeVries, A. Romero, L. Pineda, G. W. Taylor, and M. Drozdzal (2019) On the evaluation of conditional gans. arXiv preprint arXiv:1907.08175. Cited by: §4.2.
  • F. Dong, H. Wang, L. Li, Y. Guo, T. F. Bissyandé, T. Liu, G. Xu, and J. Klein (2018) Frauddroid: automated ad fraud detection for android apps. In Proceedings of the 2018 26th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, pp. 257–268. Cited by: §1.
  • S. Elsworth and S. Güttel (2020) Time series forecasting using lstm networks: a symbolic approach. arXiv preprint arXiv:2003.05672. Cited by: §2.1.
  • C. Esteban, S. L. Hyland, and G. Rätsch (2017) Real-valued (medical) time series generation with recurrent conditional gans. arXiv preprint arXiv:1706.02633. Cited by: §2.
  • R. Fortet and E. Mourier (1953) Convergence de la répartition empirique vers la répartition théorique. In Annales scientifiques de l’École Normale Supérieure, Vol. 70, pp. 267–285. Cited by: §4.2.
  • I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio (2014) Generative adversarial nets. In Advances in neural information processing systems, pp. 2672–2680. Cited by: §2.
  • A. Graves (2013) Generating sequences with recurrent neural networks. arXiv preprint arXiv:1308.0850. Cited by: §2.1.
  • H. Haddadi (2010) Fighting online click-fraud using bluff ads. ACM SIGCOMM Computer Communication Review 40 (2), pp. 21–25. Cited by: §1.
  • C. M. R. Haider, A. Iqbal, A. H. Rahman, and M. S. Rahman (2018) An ensemble learning based approach for impression fraud detection in mobile advertising. Journal of Network and Computer Applications 112, pp. 126–141. Cited by: §1.
  • M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter (2017) Gans trained by a two time-scale update rule converge to a local nash equilibrium. In Advances in neural information processing systems, pp. 6626–6637. Cited by: §4.2.
  • W. Jianyu, W. Chunming, J. Shouling, G. Qinchen, and L. Zhao (2017) Fraud detection via coding nominal attributes. In Proceedings of the 2017 2nd International Conference on Multimedia Systems and Signal Processing, pp. 42–45. Cited by: §1.
  • K. K. Kapoor, Y. K. Dwivedi, and N. C. Piercy (2016) Pay-per-click advertising: a literature review. The Marketing Review 16 (2), pp. 183–202. Cited by: §1.
  • N. Killoran, L. J. Lee, A. Delong, D. Duvenaud, and B. J. Frey (2017) Generating and designing dna with deep generative models. arXiv preprint arXiv:1712.06148. Cited by: §2, §2.
  • S. Kudugunta and E. Ferrara (2018) Deep neural networks for bot detection. Information Sciences 467, pp. 312–322. Cited by: §1.
  • L. v. d. Maaten and G. Hinton (2008) Visualizing data using t-sne. Journal of machine learning research 9 (Nov), pp. 2579–2605. Cited by: §4.5.
  • R. Mouawi, I. H. Elhajj, A. Chehab, and A. Kayssi (2019) Crowdsourcing for click fraud detection. EURASIP Journal on Information Security 2019 (1), pp. 11. Cited by: §1, §1.
  • S. Nagaraja and R. Shah (2019) Clicktok: click fraud detection using traffic analysis. In Proceedings of the 12th Conference on Security and Privacy in Wireless and Mobile Networks, pp. 105–116. Cited by: §1.
  • D. Neil, M. Pfeiffer, and S. Liu (2016) Phased lstm: accelerating recurrent network training for long or event-based sequences. In Advances in neural information processing systems, pp. 3882–3890. Cited by: §2.1.
  • M. Neunert, A. Abdolmaleki, M. Wulfmeier, T. Lampe, J. T. Springenberg, R. Hafner, F. Romano, J. Buchli, N. Heess, and M. Riedmiller (2020) Continuous-discrete reinforcement learning for hybrid control in robotics. arXiv preprint arXiv:2001.00449. Cited by: §3.3.
  • R. Oentaryo, E. Lim, M. Finegold, D. Lo, F. Zhu, C. Phua, E. Cheu, G. Yap, K. Sim, M. N. Nguyen, et al. (2014) Detecting click fraud in online advertising: a data mining approach. The Journal of Machine Learning Research 15 (1), pp. 99–140. Cited by: §1.
  • A. Radford, L. Metz, and S. Chintala (2015) Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434. Cited by: §4.5.
  • M. S. Sajjadi, O. Bachem, M. Lucic, O. Bousquet, and S. Gelly (2018) Assessing generative models via precision and recall. In Advances in Neural Information Processing Systems, pp. 5228–5237. Cited by: §4.5.
  • K. E. Smith and A. O. Smith (2020) Conditional gan for timeseries generation. arXiv preprint arXiv:2006.16477. Cited by: §2.2.
  • R. S. Sutton and A. G. Barto (2018) Reinforcement learning: an introduction. MIT press. Cited by: §3.3.
  • K. Thomas, J. A. E. Crespo, R. Rasti, J. Picod, C. Phillips, M. Decoste, C. Sharp, F. Tirelo, A. Tofigh, M. Courteau, et al. (2016) Investigating commercial pay-per-install and the distribution of unwanted software. In 25th USENIX Security Symposium (USENIX Security 16), pp. 721–739. Cited by: §1.
  • S. Xiao, J. Yan, M. Farajtabar, L. Song, X. Yang, and H. Zha (2017) Joint modeling of event sequence and time series with attentional twin recurrent neural networks. arXiv preprint arXiv:1703.08524. Cited by: §2.1.
  • L. Yu, W. Zhang, J. Wang, and Y. Yu (2017) SeqGAN: sequence generative adversarial nets with policy gradient. In Thirty-First AAAI Conference on Artificial Intelligence. Cited by: §2.2, §3.3.
  • P. Zhao, T. Shui, Y. Zhang, K. Xiao, and K. Bian (2020) Adversarial oracular seq2seq learning for sequential recommendation. In Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence, IJCAI, pp. 1905–1911. Cited by: §2.2.
  • P. Zheng, S. Yuan, X. Wu, J. Li, and A. Lu (2019) One-class adversarial nets for fraud detection. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, pp. 1286–1293. Cited by: §2.
  • X. Zhu, H. Tao, Z. Wu, J. Cao, K. Kalish, and J. Kayne (2017a) Fraud prevention in online digital advertising. Springer. Cited by: §1.
  • Y. Zhu, H. Li, Y. Liao, B. Wang, Z. Guan, H. Liu, and D. Cai (2017b) What to do next: modeling user behaviors by time-lstm.. In IJCAI, Vol. 17, pp. 3602–3608. Cited by: §2.1, §3.2.