1 Introduction
1.1 Background
Network infrastructure design and evaluation often relies on software simulation and hybrid softwarehardware approaches using testbeds. The chief use of these tools is to see how the network responds or supports certain patterns of user behavior. Naturally, the best way to do so is to collect traces of such user behavior and run them through the network simulator/emulator. However, this approach is not always feasible, because collecting representative traces of every possible user behavior is not only very expensive and timeconsuming but very likely impossible given the long tail of user behavior even when measured in terms of app usage, for example. Moreover, privacy concerns may lead to fewer users consenting to the use of their personal usage data in this way. In practice, there are likely to be relatively few datasets of highquality curated usage data available for software and hardware tools to employ in order to evaluate network performance.
When validating network infrastructure innovations in simulations and testbeds, it is therefore common to use one of two approaches: (1) replay traces from prerecorded telemetry of real usage, or (2) apply statistical models of interpacket arrival time (IAT), traffic load, etc. The first approach suffers from the following problems:

privacy liabilities

large amounts of data needed to simulate many users

number of users that can be simulated are capped by measurements

discrepancy between measurement and test environment leading to unrealistic replay load

trace replay does not react to environment.
It however tends to provide more accurate, and user (reality)based load dynamics compared to the second approach of modelbased trace generation that is just based on distributional properties. Modelbased trace generation, however, also suffers from the lack of responsiveness to environment changes. So, the question is how we can combine the best of both worlds and achieve:

no risk of exposure of private data

realistic trace replay

deployment with limited data

ability to scale dynamically to any number of distinct users

reactivity to the environment.
Furthermore, the solution needs to be easy to deploy in: (1) software simulators such as NS3; (2) real hardware such as routers, and (3) mobile phones as a frontends to traffic replay tools, such as iPerf.
The approach in the present work proposes an autonomous agent that can both replay realistic workloads as well as react
to conditions appearing in the environment. In order to do either, this agent needs to be able to approximate with high accuracy the probability distribution governing the behavior and usage pattern of arbitrary users of the communication network. In other words, the autonomous agent must incorporate a generative model.
The power of deep learning models in various applications has, in recent years, prompted a lot of research into their use for generative models.
1.2 Deep Generating Models and GANs
A deep generative model (DGM) is a deep learning model trained to approximate a complicated highdimensional probability distribution about which too little is known to allow us to define a parametric family of distributions containing it. The authors of
[14] call such a distribution intractableand provide a unified overview of three kinds of DGMs: normalizing flows (NFs), variational autoencoders (VAEs), and generalized adversarial networks (GANs). All of these approaches assume that we can approximate the intractable distribution by transforming a known, simpler probability distribution (usually a Gaussian) in a latent space of known dimension. The choice of this latent space dimension is both difficult and important.
NFs apply only to the small set of problems where the latent space dimension equals the intrinsic dimension of the data. VAEs overcome this limitation by using a probabilistic model to infer the latent variable, but this inference is not straightforward because the generator is nonlinear. GANs further avoid the challenges of estimating the latent variable and sample directly from the latent distribution. However, quantifying the similarity between the generated samples and the training data from the true intractable distribution is highly nontrivial. The GAN model addresses this by training another deep learning model, namely the discriminator, in tandem with the deep learning generator model.
1.3 Our proposed GAN model
In the present work, we propose a generative adversarial network (GAN) [5] generator based on the CRNNGAN work on novel music score generation in [10]
. Our approach extends the generator loss function to fit the data to various statistical properties in the original trace, adds a conditional gradientdescent optimization heuristic, and allows for contextdependent training and generation. Similar time series GANs have been proposed for music score generation in
[18]. A survey of spatiotemporal GAN research is available in [4].The GAN model (including the generator and discriminator networks) is first trained on traces of user traffic measured in a real system. The generator component of the trained GAN model can then be deployed in any simulation or experimental environment. This not only preserves privacy (because the traces used to train the GAN are never available in the deployment environment), but also reduces the volume of data that needs to be shipped for deployment (because the deployed model arrives pretrained). Training typically requires GPU power but the trained generator model can be run on a CPU. We also expose a REST API for generating traces more easily from constrained environments, without the need to deploy a full neural network software stack.
We also note that, to achieve even greater privacy protection and robustness in the generated data, we may train the GAN model using Federated Learning [7] with the local learners (clients) training their copies of the model on locally generated and recorded data. The advantage of Federated Learning is that this local data is never exchanged with any other learner. Only GAN model parameter updates during training are sent to a central controller that aggregates them to obtain the overall model parameter updates, which are then broadcast to all local learners. Note that this aggregation could be done securely in a way that further protects the privacy of the local data at the learners, as described in [16]. A survey of GANs focusing on privacy and security including under distributed learning is given in [1].
In terms of contextual awareness and reaction to changing conditions we are mainly interested in task and RF capacity awareness. Task awareness is the knowledge of which app is running, as it would impact the network demand. Similarly, RF capacity awareness can be inferred from the signal strength, RSSI or similar measures capping the achievable throughput.
Our basic approach is to begin by generating different time series of network usage based on different conditions, i.e. apps running and RSSI values. Then we apply tasktransition probabilities to “jump” between the time series on parallel timelines. In an experiment, these jumps could also be triggered by the experimental conditions, e.g. when a station moves closer to an access point or base station, it jumps to the time series corresponding to the “strongestsignal” timeline.
1.4 Intended Use and Contribution
We call our new GAN model and the system to provision it MASS, Mobile Autonomous Station Simulation. We anticipate that the set of tools that comprises the system, as well as the general approach may be valuable to any entrants into a new market, where they could have developed some great ML algorithm, but do not have enough data to train their model or fully verify it to put it into production. One example would be Cable operators entering the mobile operator market. We also anticipate that the separation of model training and trace replay would help individual researchers, and practitioners evaluate their innovations without direct access to network operator data. Even for organizations with access to rich data it would limit the need to use up bandwidth to move around collected telemetry data that instead could be used for customer traffic.
Our key contribution is threefold:

First, we propose a novel extension to GAN models to retain statistical properties from the original data (correlations and moments).

Second, we implement a mechanism to train and replay traces produced by multiple GAN models based on environmental contexts (e.g., RF signal and task).

Third, we design a system and a set of tools that make it easy to replay traces on demand from experiments and simulations in a wide range of environments (e.g., Android, NS3, OpenWrt).
The rest of this paper is organized as follows. We discuss related work and the foundation behind Generative Adversarial Networks in Section 2 and Section 3, respectively. In Section 4, we present metrics used to evaluate generated traces. Section 5 details our model, including the extensions to preexisting GAN models. Then we present the data collected from real systems and used in our study in Section 6. In Section 7, we evaluate the traces our model generates using the metrics defined. Section 8 gives an overview of the system design and describes how the MASS tools we developed can be used in three different settings, Unix Shell, Android, and NS3. In Section 9 we present a WiFi experiment use case utilizing our tools, and finally in Section 10 we provide concluding remarks.
2 Related Work
GANs were first proposed in [5]. As remarked in the Introduction, the GAN uses one deep learning model for the generator (thereby making a GAN a kind of DGM) and another deep learning model (called the discriminator) to quantify the difference or similarity between the samples generated by the generator and the training data samples obtained from the realworld intractable distribution that the generator is trying to approximate. The generator and discriminator are set up in an adversarial framework, thus giving rise to the name GAN for the combination of the generator and discriminator.
General properties of the adversarial framework in general, and GANs in particular, for implicitly learning probability distributions from a statistical point of view, are discussed and studied in [8]. A survey of GAN models for generation of spatio temporal data is available in [4], where the authors provide examples of GAN models to generate time series, spatiotemporal events, spatiotemporal graphs, and trajectory data.
For our purposes, the most relevant application is time series generation. One of the earliest GAN models for time series generation was the socalled continuous RNN GAN (CRNNGAN) model proposed in [10]
for music score generation. The generator and discriminator of this model are not feedforward but recurrent neural networks (RNNs), hence the name. Since it can generate continuousvalued sequential data of any kind, it is a good candidate for adapting to generate app usage traces for our scenarios of interest. As we will describe in Sec.
5, the key changes we make to the CRNNGAN model are to make it contextaware, and to change the definitions of the loss functions at the generator and discriminator from the conventional definitions used by CRNNGAN.A more recent GAN model for generating time series data is the TimeGAN model proposed in [18]
. The explicit design goal of TimeGAN is to preserve temporal autocorrelations in the generated sequences. For this purpose, the authors propose a supervised loss function (in addition to the conventional loss functions) as well as an embedding network to provide a reversible mapping between the latent space and the generated feature space. The embedding and generator networks are jointly trained to minimize the supervised loss. The authors claim that TimeGAN combines the flexibility of a GAN model with the control over the correlations of the generated sequences that is possible with a classical autoregressive model. However, we see that in the scenario of interest to us, the requirement that the embedding network implement a
reversible mapping implies that TimeGAN can only generate traces for the same number of users as in the training data, whereas our adaptation of CRNNGAN does not suffer from this limitation. We evaluate and compare our CRNNGAN versus TimeGAN in Sec. 7.Yet another recent GAN model for generating time series is RGAN [3], proposed for generating electricity consumption traces. Like CRNNGAN, RGAN also uses RNN architectures in the generator and discriminator networks. However, [3] imposes a preprocessing requirement, namely manual extraction of timeseries features from the raw data by first fitting a classical autoregressive integrated moving average (ARIMA) model to it. We will not compare our results against [3]
since we are interested in a machine learning workflow that does not rely on manual feature extraction at any stage.
3 Generative Adversarial Networks
In this section, we will briefly review the adversarial framework and specifically the GAN architecture before providing an overview of the CRNNGAN model that we adapt and employ in the present work. For purposes of comparison, we will also provide a brief description of the TimeGAN model.
3.1 Adversarial Framework and GANs
We shall adopt the formalism of [8] in this section. The general formulation of the adversarial framework is as follows: we are given a target probability distribution that we need to approximate with a simulated probability distribution obtained from a class of generator distributions . The approximation should be so as to minimize the loss incurred over a family of functions inside a discriminator class as follows:
(1) 
where we see that the discriminator class of functions induces the socalled integral probability metric (IPM) which quantifies the closeness between the generated distribution and the actual target distribution . Different choices for the function class describe the various GAN models in use today, including Wasserstein GAN, Maximum Mean Discrepancy GAN, and Sobolev GAN among others.
In a GAN model, both the generator distribution class and the discriminator class are parameterized by deep neural networks. So, instead of working with the discriminator class of test functions , we shall work directly with the test functions themselves. Since these functions are implemented by neural networks, we will refer to a function in by instead of , where represents the parameters of the neural network architecture that implements .
Moreover, is the class of implicit distributions realized by neural network transformations of a simple lowerdimensional latentrandom variable, for instance a multidimensional Gaussian distribution. Instead of working with the class of distributions , we will now work directly with the neural networkimplemented functions that transform the latent input to a generated sample , where represents the parameters of the functions of the generator neural network architecture . The relationship between the generator function class and the generator distribution class is given by
Note that the IPM is not known analytically because neither the target distribution nor the generated distribution is known analytically. However, we have training samples from the target distribution , and we generate, say, samples from the generator distribution . In other words, the GAN generator network should be trained (i.e., its parameters selected) so as to
(2) 
where the two expectations in (1) are now replaced by empirical estimates and based on the generated and observed samples respectively. In place of , [8] derives bounds on the difference between the implicit distribution estimator, i.e., the distribution of , and the target under various metrics.
3.2 The CRNNGAN Model
The CRNNGAN [10]
employs recurrent neural networks (RNNs) implemented by the Long ShortTerm Memory (LSTM) architecture for both the generator and the discriminator, with the additional feature that the discriminator’s RNN uses a bidirectional LSTM architecture so as to enhance its ability to discriminate between generated samples and samples from the true distribution.
Although (2) represents the optimization problem for joint training of the generator network and the discriminator network , these two networks are not trained in practice by directly trying to solve (2). Instead, the generator network is trained to generate samples that fool the discriminator network , which in turn is trained to be able to discriminate between generated samples and samples from the real target distribution.
In the notation of the original GAN model [5]
, the output of the discriminator network is the probability that the input to the discriminator network is classified as a sample coming from the true distribution. Thus, the training objective of the generator network is to maximize this probability for generated samples, or equivalently to minimize the complement of this probability. Similarly, the training objective of the discriminator network is to maximize the complement of this probability for generated samples while also maximizing this probability for training samples. Thus, the loss functions for training the generator and discriminator networks can be defined as follows
[5, Alg. 1]:(3)  
(4) 
where represents a minibatch of samples from the latent distribution and is a minibatch of training samples from the target distribution.
The CRNNGAN retains the of (4) but modifies the loss function for the generator to be the sumsquared difference (over the minibatch) between the representations of the true samples and the generated samples , where the representations are defined by the logits of the discriminator neural network, i.e., the last layer before the softmax output layer of the discriminator [10, Sec. 3] – see also Sec. 5. This is done in order to induce feature matching and reduce the possibility of overfitting of generator network output to the discriminator network. As will be seen in Sec. 5, in the present work we modify the CRNNGAN loss functions for training the generator and discriminator networks in several ways from the choices in [10].
3.3 The TimeGAN model
Unlike other GAN models including the CRNNGAN model, in the TimeGAN model [18] the output of the generator network is in the latent variable space instead of in the simulated sample space. Mappings from the simulated sample space to the latent space (called embedding) and vice versa are done by two other new neural networks, called the embedding and recovery networks respectively. Given the recovery network function, the generator and discriminator are trained with the losses (3) and (4) respectively. However, the binary adversarial feedback from the discriminator network may not itself constitute sufficient information for the generator network to capture stepwise conditional distributions in the data. Thus [18] proposes an additional supervised loss function for training the generator function, where the labeled target for the output of the generator is the next latent variable given the previous values of the sequence of latent variables. The generator and discriminator neural networks are trained on a weighted sum of this supervised training loss and the sum of the usual GAN loss functions (3) and (4), whereas the embedding and recovery neural networks are trained on a weighted sum of this supervised training loss and the meansquared reconstruction loss in going from the simulated sample space to the latent space and back again.
4 Preliminaries: Evaluation Metrics
Recall that in Sec. 3.2, we said that we would need to make considerable modifications to the CRNNGAN loss functions for the generator and discriminator networks in order to adapt the CRNNGAN architecture for our intended purposes of trace generation. We will describe in detail our choices for these loss functions in Sec. 5
. However, even before designing and evaluating the model, we need to decide on evaluation metrics that can be used to both validate generated traces during training and benchmark against alternative models.
The details of the Mobile Phone Use Dataset [12] of user traces on which we develop and evaluate our model may be found in Sec. 6. For now it suffices to mention that although our general approach is data agnostic, we have observed from the dataset that there tends to be a strong correlation between download and upload volumes in the same time period. Hence, we cannot simply evaluate how well we fit statistical properties of individual upload and download traces. Instead, we need to fit the dynamics between features as well.
Note that the GAN discriminator loss function (4) does not explicitly incorporate correlations in generated sample sequences. Moreover, the discriminator network just acts on single traces as inputs, classifying them as either generated by the generator network, or coming from the true distribution. In particular, the discriminator network does not have at its input a pair of traces, one generated and one from real data, whose statistics it can compare. Thus the GAN generator cannot expect to receive any feedback from the discriminator network regarding the match between the statistical properties of the generated traces to the traces from the true distribution. It follows that the GAN generator can only be made to fit these statistical properties in its generated traces if the generator training loss function incorporates such properties. Below, we describe three metrics that we use to measure the goodness of fit between generated samples and samples from the training data, i.e., the true distribution. In Sec. 5, we show how to define a training loss for the generator network of the GAN using these metrics.
4.1 Correlation distance metric
Our first metric is the Pearson crosscorrelation coefficient between features. In contrast to traditional approaches where correlations are either minimized to pick optimal predictive features or maximized to find interesting proxy predictors, here we want to fit the level of correlation between the original trace and the generated trace.
To capture the typical user behavior, we proceed as follows:

For each pair of features (e.g., download and upload), compute the Pearson correlation coefficient of the two sequences of samples corresponding to these two features in the trace data for every user in turn. Thus, if there are users whose traces are recorded in the training data, then we have values of the Pearson correlation coefficient for each pair of features.

Compute the mean of the above values of the Pearson correlation coefficient for every pair of features. This averaging across users is important since some users may have longer traces captured than others and the longer traces may dominate the shorter ones if this averaging is not done.

If there are features, then the values of the Pearson correlation coefficient computed in the above step fill an symmetric matrix, with distinct (offdiagonal) entries. These distinct entries (each entry being the average across users of a Pearson correlation coefficient between a pair of features) may be obtained by simply extracting the (offdiagonal entries from the) uppertriangular part of the above matrix.

Write the above
extracted entries into a vector by following some fixed order. Denote by
this dimensional vector of averaged (across users) Pearson correlation coefficients for all pairs of features (computed from the trace data). 
Now repeat all the above steps for the set of traces generated for each user by the generator network, and denote by the resulting dimensional vector of averaged (across users) Pearson correlation coefficients for all pairs of features (computed from the generated traces).

The correlation distance is defined as the Euclidean distance between the two vectors of Pearson correlation coefficients:
(5) where
For just features such as download and upload, the vectors and reduce to scalars and respectively, where is the correlation coefficient between uploads and downloads for the source traces and is the corresponding statistic for the generated data. In this case, the correlation distance is just
(6) 
Note that this metric is easily differentiable, which is a requirement for including it in a loss function for training the generator by gradient descent, as we will see in Sec. 5.
4.2 Moments distance metric
In addition to a good crossfeature correlation fit, we also want our generated traces to match distributional properties of the original traces.
In the present work, we propose to use metrics that measure the discrepancies between statistical moments computed on the original trace data and on the generated traces. Our motivation for this choice of metric is partly because it yields easily differentiable metrics, and partly because our experiments show that it suffers less from overfitting than direct measures of distribution fitting like KL divergence.
Our second metric moments distance is the squared Euclidean distance between two threedimensional vectors, one vector computed on source trace data, the other computed on generated traces. Each entry in one of these vectors corresponds to a certain moment computed on the appropriate traces (either source data or generated)^{1}^{1}1Users with very short traces may not produce reliable values, of higher moments in particular, and we hence compute this metric across all user trace steps lumped together.. The moments that define the three entries of each vector are respectively the mean
(i.e., the first moment), the standard deviation
(i.e., the square root of the second central moment) and skewness
(i.e., the third standardized moment) of the sequence of samples (across all users) that constitute the corresponding trace data (source data or generated).The moments distance metric is thus given by:
(7) 
Note that in contrast to , the moments distance metric is the square of a Euclidean distance and not a Euclidean distance itself. We can work with the squared Euclidean distance, which is easier to differentiate, for the moments distance metric because the moments themselves are not normalized. In fact, we observed that the generator models trained using the moments distance metric with normalized moments by the algorithm discussed in Sec. 5 have worse performance than if we leave the moments unnormalized.
4.3 Novelty metric
The final metric we define is the novelty. It is a measure of the novelty of the set of generated traces across all users. Hence it is not a measurement between the source and generated traces. Rather, it is an internal crossuser metric computed across a set of traces (one trace per user).
The novelty metric is based on the cross correlations (for a selected feature, say download) between the two sequences of samples corresponding to the traces for a pair of users. For each user , where is the total number of users, compute the crosscorrelation between the trace for this user and the trace for user at lags , where is the number of samples in each trace. The novelty metric is defined as:
(8) 
Note that here we are not interested in matching the novelty of the source trace and the generated trace, but in simply maximizing the value of the novelty metric of the set of generated traces. A larger value of means that the traces generated for the different users are distinct from one another.
It is common with misconfigured and overfitted GANs to generate traces for different users that are virtually identical^{2}^{2}2This is a special case of mode collapse, where the generator managed to produce a trace that successfully fooled the discriminator and thereafter simply repeats that single trace every time it is asked to generate one., and they would hence have a novelty metric close to 0 as for each . The key purpose of this metric is to enable us to avoid such a scenario and to give an indication of the uniqueness of the generated user traces. The metric is not part of the GAN training and hence does not need to be differentiable. In our experiments, for batches of 100 users we computed on the original data traces.
4.4 Validation of generated traces
We also note that as part of the postvalidation of generated traces, we split and hold out a test data set of users that we then compare the above three metrics to, in order to evaluate: (a) the stability across training and test data, and (b) the ability of benchmarks to predict, versus just fit, these three metrics.
Lastly, we remark that we have also tested a large number of other metrics, such as partial autocorrelation functions, Hurst exponents, and KLdivergence, but found the above three metrics to be the most stable and indicative of desirable properties in the generated traces.
5 MassGAN
In this section, we describe in detail the design of the specific GAN model proposed in the present work, which we call MassGAN. Recall from Sec. 3 that while all GAN models have the same adversarial architecture of a generator network and a discriminator network, the details of the loss functions used for training the generator and discriminator are what distinguish one GAN model from another.
5.1 Adversarial architecture of MassGAN
Like all GAN models, MassGAN also has a generator network and a discriminator network. One salient feature of MassGAN is that, like the CRNNGAN models proposed in [10] on which MassGAN is based, both the generator and the discriminator are recurrent neural networks with long shortterm memory (LSTM). Moreover, the discriminator network is a bidirectional LSTM in order to enhance its ability to detect differences between samples from the true distribution and samples simulated by the generator network.
5.2 MassGAN discriminator loss function for training
The discriminator loss function (4) is unchanged from that of CRNNGAN [10, Sec. 3], which in turn is the same as for the original GAN model [5, Alg. 1]. This loss function accounts for the two objectives of training the discriminator, namely maximizing both the probability of correctly identifying samples from the true distribution and correctly rejecting samples from the generated distribution.
Just as the CRNNGAN modified the generator loss function from (3) in the original GAN model, we shall also modify the generator loss function in several ways, as discussed in the next three sections.
5.3 MassGAN generator loss function for training
Just as in the CRNNGAN, the MassGAN generator’s objective is to fool the discriminator into classifying the samples simulated by the generator as having come from the true distribution instead. In the notation of Sec. 3, the generator network (with parameters ) operates on a sequence of latent vectors which are independent identically distributed (i.i.d.) uniformly in .
5.3.1 Discrimination loss
The discriminator network (with parameters ) has as its output the probability that the input is classified as coming from the true distribution. Since the input to the discriminator is the output of the generator, i.e., , the generator tries to maximize the probability , or equivalently, the generator network is trained to minimize what we call the discrimination loss^{3}^{3}3Note that although the network being trained is the generator network, this component of its training loss function is called the discrimination loss, which is not to be confused with the discriminator loss function (4) used to train the discriminator network. function (3)
(9) 
where is the batch size of sequences used during each step of the optimization, which in our case also corresponds to the number of users in the training data. In other words, the loss function measures the ability of the generator to induce classification errors in the discriminator. Note that for brevity we have dropped the parameter subscripts and for the discriminator function and the generator function respectively.
Recall that the objective of MassGAN is to generate time series with temporal correlations matching those from the true distribution. Although the CRNNGAN works off musical score features such as pitch and volume it is easy to map out upload and download features to the model. In our investigation, we found that the desired temporal correlations are lost, and the novelty of the generated traces is often poor, if we simply use the discrimination loss (9) alone to train the generator. For example, when replaying network traffic we would expect a high download period to be matched by a higher than usual upload period too, but a generator trained by minimizing (9) alone does not have this property.
We define the new components of the overall generator loss function as follows: we first compute the desired statistic from the original data (the true distribution). Then we define the generator training loss as the Euclidean distance between the statistic computed from the true (training) data and the statistic computed from the generated trace. This in turn is done in two parts, as discussed in the next two sections.
5.3.2 Correlation distance loss
Without loss of generality in the sequel, we will refer to traces with two features only, namely download and upload. It should be noted, however, that the proposed approach has been formulated for, and applies to, a trace with features (see Sec. 4.1).
We will use the Pearson correlation coefficient between two sequences of samples of the features and from a trace, corresponding to the download and upload samples respectively, as the metric that we want to preserve. Following [10], we define the sequences and themselves by their representation in the logits
(i.e., the outputs of the last layer before the softmax layer) of the discriminator network when the input is a generated trace. In other words, the download sample sequence
for the th trace in the batch (i.e., the trace of the th user) is given by , where is the representation in the logits layer of the discriminator network, and is the part of the generated trace that corresponds to the download activity. Similarly, the upload sample sequence for the th trace in the batch is given by , where is the part of the generated trace that corresponds to the upload activity.We define the correlation distance loss function as the correlationdistance metric for the single pair of features (namely the download and upload). From (6), we have:
(10) 
where is the target Pearson correlation coefficient in the training data. In other words, the motivation behind the definition of is to train the generator network to minimize any mismatch between the Pearson correlation coefficient of the generated traces and the Pearson correlation coefficient of the true data.
5.3.3 Moments distance loss
We also want to preserve the first three moments of the true data in the generated traces – specifically, the mean (i.e., the first moment), standard deviation (i.e., the square root of the second central moment), and skewness (i.e., the third standardized moment). To this end, we define the moments distance loss function as the moments distance metric , i.e., the sum of the squared errors between each of these moments (, , ) of the true data and the estimate of this moment (, , ) computed from the generated traces, but using the representations of the traces from the logits layer of the discriminator network, as in the definition of the correlation distance loss function above. From (7), we have:
(11)  
5.3.4 Defining an overall generator training loss
We observe that if either or is treated as the only loss function for training the generator network, then training converges rapidly. However, the discrimination loss (9) cannot be ignored when training the generator, since we have seen that including it in the training loss function improves the novelty of the generated traces.
Since we want to match both the moments and the Pearson correlation coefficient of the generated traces with those from the true distribution, we need to define a suitable overall training loss function for the generator network that incorporates , , and . Our investigation of generator training loss functions defined as simple linear combinations of , , and did not yield good matches for moments or correlations, and often got stuck at false minima.
5.4 Generator network training method
Instead of searching for a definition of an overall loss function for training the generator, we propose a training method that we call conditional gradient descent. This method has proven effective in yielding good matches between generated traces and training data across both correlation and moments statistics.
In conditional gradient descent, the generator loss function is defined as
(12) 
where we initialize . In other words, we initialize the training loss function for the generator network as a linear combination of the discrimination loss, correlation distance loss, and moments distance loss.
We then follow the standard stochastic gradient descent (SGD) method of drawing random minibatches of training examples from the original traces and descending along the gradient of the sum of the three loss functions. Note that this intuitively corresponds to descending along the gradients of all the three loss functions simultaneously. During this initial stage of training, we continue to keep
. Also initialize to a value larger than the largest value that or can reach (see below for why this is needed).Next, we describe when and how and are updated. Either at periodic intervals, or when the total loss (12) is lower than a previous minimum, we run a benchmark validation on the current generator model with parameters as follows:

Generate a new batch of traces with the current generator model, and also generate a batch of traces (of the same size) with the uniformfit model described in Sec. 7.

Compute the losses and from the two new batches of traces generated in the previous step, and update .

If then:

Save the generator model with the current parameters as a candidate generator model;

Update ;

For a fixed preset number of SGD training epochs starting from the present one, set
and as follows:(13) (14) where is the indicator function, and choose to be either or with equal probability (by tossing a fair coin, say) when .

Note that this is equivalent to dropping the statistic that is currently performing better during training (i.e., has lower training loss) from the overall training loss function (12) for a certain number of SGD training epochs, and descending along the gradient of the remaining statistic (the one whose equals unity). Thus the two quantities, exactly one of which can be nonzero, act as indicator functions selecting the component loss function whose gradient is to be descended.


Training is complete either after a certain maximum number of training epochs, or once the condition is no longer satisfied.

The candidate generator network is the trained final generator network.
The above approach can be generalized to account for other, or additional, statistics measuring matches between generated and training traces beyond correlations and moments. The only requirement is that the appropriate loss function for that statistic be efficiently differentiable so that SGD training on it is possible.
An important aspect of the above training process is that it is done in batches and the statistical properties of the original trace are only maintained in subsamples of the same number of users and with the same trace lengths. In our case, we run with 100 batches and sequences of 12 steps. Although the same generator can generate arbitrarysized batches and trace steps, in all the numerical results reported in the present work, the statistics are always validated using these same batch and sequence values.
5.5 Transfer Learning for ContextAware GAN
We want to generate traces that are contextaware, where for the specific application of user activity trace generation, the context is a categorical parameter describing an attribute or property of all or part of the user activity recorded in the trace, and which cannot be modified by the network or the trace measurement procedure. Although users’ activity (as recorded in traces) may vary greatly depending on a variety of factors, without loss of generality in the present work we focus on two of these factors, namely the signal strength and the type of application, as they significantly affect user data usage.
In other words, in the present work we define the context to be either the signal strength (HIGH or LOW) or the type of application being run (STREAM or INTERACT), or a combination of signal strength and application type (STREAM_HIGH, STREAM_LOW, INTERACT_HIGH, or INTERACT_LOW).
The training dataset we use, namely the Telefonica Mobile Phone Use Dataset (see Sec. 6) has no contextual labels, which requires us to infer context from the data itself. To create a contextaware GAN we split the training data traces into contextual cohorts corresponding to the 8 context labels defined above (HIGH, LOW, STREAM, INTERACT, STREAM_HIGH, STREAM_LOW, INTERACT_HIGH, and INTERACT_LOW), where we assign a context label to all or part of a trace based on received signal strength and application type. We also check that each collection of traces has enough entries in terms of users and sequence length of traces^{4}^{4}4In the present work, in order for us to train a model there need to be at least 5 users with at least the configured sequence length number of data points (12 in our case).. More precisely, we cluster the split traces into 8 clusters, one for each context label, based on dissimilarity metrics which we define to be the relative difference in means between the collections of traces assigned to the different clusters, and the relative difference between the mean of the traces in a given cluster and the mean of the full training data. We expect this context labeling to yield collections of traces with qualitatively different characteristics.
Next, we use techniques of transfer learning
[17] to train a GAN model for each of the 8 contexts defined above, by starting with the “global” GAN model trained on the entire training data using the algorithm described in Sec. 5.4 and then “finetuning” that model by training on the collection of traces with that context label for a certain number of training epochs (but a fewer number of training epochs than employed when training the global model). In other words, we now have not one generator network (corresponding to the generator of the global GAN model) but a whole family of generator networks, comprising the global GAN model’s generator and one generator for each context defined above.Note that our approach above created a total of 8 contexts from two context categories (signal strength and application type) with two groups in each category (HIGH/LOW for signal strength, STREAM/INTERACT for application type). Although this approach generalizes to quantizing into more than two groups in two contexts, it quickly becomes inefficient if we have too many contexts stemming from too many combinations of groups and/or context categories.
In the latter case, more aggressive filtering of the traces in the ensuing collections may have to be performed, possibly involving not just the mean , but also the standard deviation and/or the skewness . Alternatively, the threshold for difference between the means computed over the traces assigned each context label may be set to be higher than the value of 10% that we set in the present work^{5}^{5}5The threshold may be treated as a hyperparameter that is optimized through an outer iterative loop as part of an AutoML workflow [6], but we do not elaborate on this here.. More discriminative splits of the training data across the different context labels will also lead to many context label trace collections failing the threshold of the minimum users with the minimum length sequences assigned that context label.
If a sufficient (in size and quality) training dataset does not exist for a particular context, then we cannot finetune the global GAN model for this particular context by training it on traces that are assigned this context. In that case, when the family of generators is deployed for contextaware trace generation purposes and a context is specified for which the corresponding specialized generator has not been trained, then we just run the global generator.
6 Network Traffic Data
Our approach of realistic user traffic trace replay is rooted in mimicking (without copying exactly) original traces measured with real user traffic.
Any trace that has a timestamped series of upload, download volumes and signal strength values grouped by users could be applied.
Here, we use a trace collected by Telefonica in Spain in 2016, called the Mobile Phone Use Dataset [12]. The dataset contains traces from more than 300 mobile phone users over several weeks. Every 10 minutes a sample is taken where the transmitted and received data over the last 10 seconds are recorded with the current WiFi and LTE signal strengths. We notice that the WiFi data transmissions are more prevalent and thus focused on those in order to capture interesting behavior between signal strength and data volume. The traces also capture the application that is at the top of the stack, i.e. currently or latest used.
We use WiFi signal strength to partition data where all time steps with RSSI get labeled with the signal strength LOW context and the rest are labeled HIGH.
For applications we first map each application into its official app store category, and then we map all applications that fall into one of the categories: MUSIC_AND_AUDIO, MAPS_AND_NAVIGATION, SPORTS, or VIDEO_PLAYERS into the STREAM context. All other apps are mapped to the INTERACT context.
Given that the 10 s sample every 10 minutes is noisy we average 6 values to produce hourly samples. For applications we create a memory where the last known application that could be successfully mapped to a category like above becomes the app for that hour.
We look at sequences of 12h, and any users whose traces are less than 10h after a context split are discarded from the 50/50 user split training and test data sets. For the global context we gather 100 users in each data set, a context split that has less than 5 users is also discarded.
The context splits are as follows. Hourly values with application STREAM vs INTERACT, and signal LOW vs HIGH, and any combination of the two contexts are checked whether they pass this cutoff. If they do we compare the difference in mean download and upload rates between the global context (all values) and the context split. If it is more than we call that context significant and keep it, otherwise the context is discarded.
The resulting significant contexts are trained separately and obtain separate GAN generator models. The insignificant contexts that were discarded can still be used when requesting a context trace through an API that we will discuss later, but they are generated from the default global context.
For comparison and illustration of generality of the model training process we also use a second data set collected from 200 cable modems in a residential area over the period of a month. The advantage of this data is that it is very fine grained and time synchronized. We collect 6minute upload and download volumes averaged over minutebyminute samples on each modem separately ^{6}^{6}6This is the same level of aggregation (6values) as with the Mobile Phone trace.. More information about the data can be found in [15]. However, since we do not have app or signal information in the data, we can only use the data to train a contextunaware model.
The pretraining data transformation pipeline is depicted in Figure 1.
7 Model Evaluation
We use the three previously mentioned metrics, correlation, moments and novelty to evaluate our GAN model against three benchmarks, Uniform Fit (Uni) , Generic Distribution Fit (Dist) and timeseries GAN (TimeGAN), discussed next.
Uni
. The uniform fit model simply computes the min and max of a time series to reproduce and then generates a new time series with a uniform distribution within that range.
Dist. The generic distribution fit model evaluates over 100 distributions of different characteristics and picks the best fit with the Generalized Additive Models for Location, Scale and Shape (GAMLSS) proposed in [13] and implemented in the gamlss R package ^{7}^{7}7https://www.gamlss.com/.
TimeGAN. The TimeGAN benchmark uses the method proposed in [18]. To use the model we map each user trace of download and upload samples to a couple of new features, and ensure all user traces have the same sequence length. Note this only allows us to generate the same number of users as in the input, which for the following evaluation is done for all benchmarks.
Here we only compare the global, contextunaware models. We split the data in 50/50 between train and test users with 100 users in each group. In the evaluation we compare the generated traces both to the original training data and the test data that was hidden during training.
Benchmark  Data  Correlation  Moments  Novelty 
Distance  Distance  
Uni  train  .48  6.0  .55 
test  .49  5.6  
Dist  train  .42  3.9  .28 
test  .44  3.5  
TimeGAN  train  .08  4.8  .39 
test  .10  4.3  
MASS  train  .04  .21  .42 
test  .06  .62 
Benchmark  Data  Correlation  Moments  Novelty 
Distance  Distance  
Uni  train  .24  8.1  .56 
test  .22  8.3  
Dist  train  .22  5.3  .16 
test  .21  5.4  
TimeGAN  train  .01  7.5  .52 
test  .03  7.7  
MASS  train  .003  .63  .53 
test  .01  .79 
We can see from the results in Tables 1 and 2 that MASS dominates the other benchmarks in terms of correlation and moment fits across both the original data and the test data prediction. TimeGAN also reproduces correlations well, but has the limitation of fixing the number of users that can be generated to the original set, and does not reproduce moments well. Interestingly we also see that novelty suffers if we try to make the best possible distribution fit. Recall that a high novelty score allows us to expand to more generated users beyond the original data set without ending up regenerating the same traces. Novelty is defined in terms of crosscorrelations within the generated trace so it will be computed the same way regardless of whether you compare it to the training user data or the test users, which is why it only has a single value per benchmark in the table.
The lower correlation distances in the cable modem data set can be explained by there in general being a lower correlation between download and upload traffic volume compared to in the mobile user data set^{8}^{8}8Although lower, the correlations are still significant and around compared to about for the Moble Phone data.. Another difference between the data sets is that the moment fit is worse across the board for the cable modem trace, which could indicate that it is more noisy data, due to the shorter time frame it was aggregated over (6 minutes versus an hour).
We note here that there are analytical models for distribution fits with preserved correlations between time series known for Gaussian and generalized beta distributions
[11, 9], but not for general distribution fitting. The general distribution fitting did surprisingly poorly in reproducing moments compared to MASS.All these test were run with 2000 epochs. We will evaluate the impact the epochs have on the results later on.
7.1 Context Evaluation
Next, we compare the metrics performance of contextual models versus global models to determine how valuable contextbased generation is. We compare the original contextual split to the contextual generator as well as a noncontextual generator and compute as the proportional change, where is the global value and the context aware metric. For the distance metrics (correlation, moments) a positive value means a reduction in distance, which is expected and good. However, a higher novelty value is good, so a negative value for novelty in this comparison implies an improvement. The novelty metrics is not expected to improve as a specialization into context could potentially reduce the novelty of the generated trace, in particular if the number of users matching the context is low.
The results are shown in Table 3. Note that we only run these evaluations for the contexts that were deemed significant, as they are the only ones with a trained generator.
Context  Correlation  Moments  Novelty 
Distance  Distance  
LOW  .43  .82  .31 
INTERACT_LOW  .33  .81  .52 
STREAM  .47  .97  .09 
STREAM_HIGH  .42  .96  .05 
STREAM_LOW  .82  .82  .85 
It is clear that there is a tradeoff between better moment and correlation matching versus reduced novelty in some cases. Therefore, based on this analysis further contexts may be pruned, like STREAM_LOW, and INTERACT_LOW, which have very high degradations in novelty.
7.2 Learning Evaluation
Next, we study the time it takes to train the different metrics. Note here that we only have loss functions that try to minimize correlation distance and moments distance and that novelty is only indirectly kept in check. From Figure 2 we can see that the moments distance reaches its minimum after about 1800 epochs, whereas the correlation distance reaches its minimum after about 2800 epochs. In training time that is about 3 minutes versus 5 minutes ^{9}^{9}9The training was run using an NVIDIA GeForce RTX 3060 6GB RAM, 1.785GHz GPU with CUDA on an 8 quadcore CPU 11.4Ghz Linux Ubuntu 20 desktop PC.. Note that these two metrics are jointly optimized with the GAN discriminator loss. Training each metric separately converges much faster, but is less interesting for our purpose. Novelty as expected has very minimal variation regardless of training time. There seems to be a marginal downwards trend in novelty with increased training time, but the range of novelty values produced (yaxis) is very small (). A downwards trend in the epoch range of in correlation distance seems to coincide with an upwards trend in moments distance, highlighting that minimizing for both of these metrics at the same time is a challenge and results in various tradeoffs having to be made, in terms of which of the metrics is the most important to get under a given level. Currently we optimize both without any bias or weights. Another observation is that there seems to be a plateau around epochs for the correlation metric, which we eventually overcome. These kind of plateaus is the reason we employ our conditional gradient descent mechanism.
The graphs are using rolling means of 10 periods to smoothen the curves, and make trends more apparent. We also ran each training epoch length in five repeats and then took the average value for all metrics. The standard error regions are calculated from the same 10period windows as the means.
8 System Design
Next, we discuss how training and replay was implemented and integrated into various networking tools.
The training and replay component interactions are shown in Figure 3.
8.1 GAN Training
The GAN implementation is built on top of a PyTorch port of CRNNGAN
^{10}^{10}10https://github.com/cjbayron/crnngan.pytorch. The discriminator model is used as is, but the generator was modified with new loss functions. The training process was rewritten as well as all data loading and validation functions.The validation process as well as the basic distributional benchmarks are implemented in R. The TimeGAN benchmark was used without code modifications, again using a PyTorch port ^{11}^{11}11https://github.com/d9n13lt4n/timeganpytorch.
Training is run on a GPU via CUDA drivers to speed up the process.
Although Python and PyTorch can be installed on most platforms without GPU support, we also wanted to target embedded platforms with limited resources. Even if the tools are available, making sure the right versions are installed and being abe to import the saved models on many different platforms is a cumbersome undertaking. Hence, we implement a simple REST API with both JSON and clear text output to generate traces on demand from a pretrained set of models.
Although the models take a long time to train on a CPU as opposed to a GPU they may easily be used for execution anywhere. The saved model files ( per generator) can be transferred to the web server hosting the REST API if it is a different machine than where training took place.
8.2 Shell Replay
To replay traces we, optionally as a dropin replacement for iPerf3, implement a custom UDP and TCP socket clientserver tool that behaves similar to iPerf3 in that it allows for transmission of randomly generated data at given rates with specified message sizes on both upload and download directions across a pair of endpoints. The performance servers also log their performance history, which in turn is picked up by the REST server and served using html5 canvas charts for easy realtime monitoring. The performance servers may be deployed anywhere in your network and do not have to be collocated with the REST server generating traces.
The reason for not mandating use of iPerf3 is to simplify use from platforms where iPerf clients are difficult to use, e.g. Android. The client is written in Java without any dependencies beyond the JRE and can run both inside an Android app and standalone where a JRE is available. The server is written in Python. A simple protocol was designed to let clients communicate: 1) direction, 2) duration, 3) rate and 4) message size of the replay. Separate endpoints are also given for UDP vs TCP. Endpoints like iPerf3 can only serve a single stream at a time, so separate endpoints are used to replay upload and download traffic concurrently.
On some constrained devices, e.g. OpenWRT routers, it may however be easier to install iPerf3 than a Java runtime. So we also implement a hook to replay traffic through iPerf3 if available. Apart from iPerf or a JRE to replay traces and curl to generate traces via the REST API, the script has no other dependencies other than Bash, so it should be easy to run from a multitude of platforms.
The replay script can be controlled with environment variables that are sourced from a file before a replay starts. Variables that can be configured include MASS server and performance server endpoints, maximum upload and download rates ^{12}^{12}12The trace generator produces normalized rates that need to be multiplied with the desired maximum rates., duration of each replay time step called epoch, and app context transition probabilities. The current WiFi signal strength may be used to drive the context trace selection as well. If the currently connected WiFi network has an RSSI less than 75 a trace with the LOW signal context will be replayed. The probability of a replay step being replayed with UDP as opposed to TCP traffic may also be set.
A replay run is defined by how many steps it replays, and the same number of steps are requested from the trace generation API for each known context that is supported by the client. The sequence could be replayed multiple times or whenever a sequence has finished replaying a new trace sequence may be downloaded. During replay the current app and signal contexts are obtained and the corresponding context trace is then picked to replay the upload and download rates of the current epoch. This precaching of context traces allows for quick jumping between parallel traces without having to interact with the trace REST API for each step, and thus avoiding any impact to the performance of the replay.
8.3 Android Replay
We developed an Android app that allows for configuration and replay similar to the shell replay script described above. The signal context is inferred from the current cellular provider or the currently connected WiFi network. The Java rate replay client is embedded in the app through a native Java API and can run both in the emulator during development and on real devices. The idea here is to allow a single app to mimic the network behavior of many apps while still reacting to changes in RF bandwidth e.g. due to mobility or obstruction. This design avoids having to script or automate launching of real apps during an experiment to produce realistic network traffic. The backend servers of an Android replay are identical to the shell replay so the same monitoring tools may be used, and real UDP and TCP traffic will traverse the WiFi or cellular network. The app may be downloaded from an app store or be sideloaded, and does not require any custom deployment or root access. In terms of dependencies it only relies on the Volley and JSONObject libraries for the REST interaction with the trace generation API.
8.4 NS3 Simulated Replay
Finally, we also implemented an integration with NS3 to allow you to replay MASS traces inside your simulated NS3 networks. We have tested the integration with WiFi and LTE network simulations, but it should work with any network where you have at least a pair of endpoints reachable by IPv4 addresses.
We have developed a custom UDP and TCP client NS3 Application that sends variable rate data based on MASS traces and contexts such as signal strength and app type (STREAM or INTERACT). Transmissions are done epoch by epoch with upload and download replays in parallel, just like in the real replay clients described above.
Traces are pregenerated for each supported context before a simulation is started and then pulled in based on the current context. We developed a C++ API to install a set of apps on a pair of endpoints from a Node container and IPv4 interface container inside the NS3 simulation code. Each endpoint deployed may be configured separately with parameters identical to the Android and shell clients, such as UDP or TCP traffic, max download and upload rates etc.
We use the trace source callback mechanism to detect changes in signal strength from WiFi and LTE PHY stacks that are detected automatically. The sources we listen to include MonitorSnifferRx and ReportUeMeasurements for WiFi and LTE PHYs respectively. RSSI less than or RSRQ less than are mapped to the LOW signal strength context.
9 Use Case: WiFi Contention Window Control
To illustrate how to use MASS, we now describe an experiment that evaluates an MLbased WiFi Contention Window Backoff algorithm, called MLBA [15], designed to set contention window ranges according to traffic load.
In this particular example we want to investigate the fairness between upload streams and download streams under contention across interfering APs in a WiFi 5 versus a WiFi 6 setup, with our custom contention window algorithm versus the default backoff algorithm on the chipset under investigation, called here simply BA.
The contextaware MASS GANs for the Mobile Phone Use data set were used in this case and deployed behind a Web server in the WiFi test bed.
Trace replay is based on the live sensed signal (RSSI) and for simplicity we only make use of the STREAM application context in this example ^{13}^{13}13If we expect different traffic behaviors like interactive and batch apps to impact the results we could add that as a condition too..
We have 4 STAs each connected to a dedicated AP. The APs run iPerf servers to allow traffic generation. All APs are set up on the same channel and both STAs and APs are positioned in close proximity to each other to ensure there is interference.
The MASS shell script is deployed on all STAs along with an iPerf client.
Each STA generates a trace independently, and traces are reused across experiment condition, i.e. with and without our controller.
For each iteration we first replay the traces against the default backoff algorithm, and measure upload and download throughput achieved and compare them to the intended throughputs from the trace to measure bias.
Next we replay the traces during a calibration phase where we train our ML algorithm, and then finally we run our ML algorithm in prediction or execution mode, measuring the throughput and bias again.
Both APs and STAs have identical hardware and software stacks (OpenWrt 21.02), and there is no bias in positioning between APs and STAs in the testbed.
We use the HE80 (WiFi 6 80MHz band) and VHT80 (WiFi 5 80MHz band) modes and set both max upload and download rates to 300Mbps. The single STA to AP and AP to STA stream throughput without interference was measured to be between 300 and 400 Mbps.
The range of contention windows (CWMIN,CWMAX) we train with is between sizes 7 and 127.
We measure bias and aggregate throughput in epochs of 5, with 30 unique traces generated by each STA (total of 120 users simulated).
Bias is defined here as:
(15) 
where is the throughput measured in the given direction that is attempted based on the trace generated and the measured throughputs.
Given this definition we observe that a bias that is positive means download traffic is getting a higher share of bandwidth than intended, we call this a download bias. Conversely, if the bias is negative it means that upload traffic is prioritized and we call this an upload bias. Clearly the closer the bias is to zero the more fair the bandwidth allocation is.
Figure 5 shows the results for VHT80 and HE80. Throughput values are aggregate Mbps achieved across all STAs within a sequence replayed. Each sequence is 5 epochs. The graphs show rolling averages of 10 sequence replays. The dotted lines show throughput on the yaxis on the right. The filled circles on the MLBA line are proportional in size to the contention window selected for the sequence. In general, a lower CW is selected with a lower load. This is most prominently on display towards the end of the 30 sequences run in the HE80 experiment, where the load goes down, and a lower CW is also selected (smaller circles)^{14}^{14}14We note here that there is no expectation that the learning will improve over the 30 iterations as each iteration has its own independent training and prediction phases..
Looking at pairwise ttests we observe that bias was reduced and the reduction is statistical significant on a 95% significance level. The difference in throughput, was however not significant. The means are shown in Table
4.We also observed that the download bias is stronger in HE80 compared to VHT80 with the default backoff algorithm (% vs %), and to compensate for that we give our ML algorithm a lower cutoff in contention range probed (50 instead of 127). The throughput, as expected is also about 25% higher with HE compared to VHT.
Benchmark  Protocol  Bias  Throughput 

BA  VHT80  .23  3273 
HE80  .36  4091  
MLBA  VHT80  .027  3273 
HE80  .052  4108 
One could argue that changing the EDCF transmission queue parameters on the APs also could have achieved increased fairness, but clients may come and go and emit variable load, making it nontrivial to determine what is optimal. The point of these type of ML algorithms in general is to avoid having to change configuration settings manually when load conditions change.
Learning the load and evaluating a learning algorithm is a challenge without realistic user traces. Here, MASS allowed us to investigate the difference between the demand (original trace values) and achieved value (throughput/goodput).
10 Discussion and Conclusion
Traditional user load simulation techniques based on simple statistical models are increasingly inadequate for evaluating and testing complex models to optimize network operations. This is especially true for machine learning models like deep neural networks.
An inability to introduce a machine learning model to sufficiently realistic scenarios during the training phase may lead to a false sense of confidence in the model’s performance on the part of researchers. This may result in disappointment and a feeling of wasted time and effort when the trained model shows surprisingly poor results in a production deployment.
GANs using RNNs have been proposed in several fields (starting with medical applications, see [2]) as a means to generate realistic time series training data for machine learning model training for applications where realworld data is scarce or expensive to collect. The machine learning workflow is called TSTR, for “train on synthetic, test on real” (data) [2].
The present work proposes a specific GAN model, which we call MASS, to address the machine learning model training needs of network operators. We believe the ability to validate an increasingly automated, machinelearning driven network infrastructure calls for more sophisticated ways of simulating real user traffic, and that GANs in general and MASS in particular provide a good starting point.
MASS was designed with user privacy in mind, as opposed to being an anonymization afterthought or by obscuring true behavior with noise, and hence it is ideally suited for sharing models trained on private data in various wireless networking scenarios, such as residential MDU WiFi or Enterprise 5G or CBRS networks to name a few. In addition it was designed to be cognizant of the environment the trace is deployed in by autonomously reacting to environmental signals such as RF quality and the task at hand, for example the application running on the UE or mobile station.
We envision the use of MASS in situations where some party who traditionally has not been willing to share raw data because of privacy concerns may be more willing to share GAN models of the data. The level of privacy protection for the original users then depends on the number of users producing the original data employed to train the GANs. It is clear that only having a handful of users train the GANs may lead to GANs that mimic the individual behavior of these few users to the point where the generated traces may reveal more private behavior than desired. It is thus important from a privacy perspective that data be collected from a sufficiently large number of users to train the GANs.
We also note that our GAN models do not produce absolute time series of user load, but normalized load. Each user’s load is normalized for both uploads and downloads in the range , and the GAN generated trace would also then produce a value in the same range. When an experimenter uses a GAN, max upload and download rates have to be specified to scale the replay to the testbed used. This normalization helps with training the GAN more efficiently and makes it easy to scale the traces to the target system, but it also provides another level of privacy protection.
In conclusion, we have seen that our trace generation method is able to produce novel, realistic network traffic traces that can easily be integrated in simulations as well as experiments in testbeds.
Acknowledgments
We would like to thank Irene Macaluso, Lili Hervieu, Bernardo Huberman, and Aaron Quinto for their feedback and reviews of different sections of this paper.
References
 [1] (202107) Generative adversarial networks: a survey toward private and secure applications. ACM Comput. Surv. 54 (6). External Links: ISSN 03600300, Link, Document Cited by: §1.3.
 [2] (2017) Realvalued (Medical) Time Series Generation with Recurrent Conditional GANs. Cited by: §10.
 [3] (2020) Generating energy data for machine learning with recurrent generative adversarial networks. Energies 13 (1). External Links: Document, ISSN 19961073, Link Cited by: §2.
 [4] (2020) Generative adversarial networks for spatiotemporal data: a survey. arXiv preprint arXiv:2008.08903. Cited by: §1.3, §2.
 [5] (2014) Generative adversarial nets. In Advances in Neural Information Processing Systems, Z. Ghahramani, M. Welling, C. Cortes, N. Lawrence, and K. Q. Weinberger (Eds.), Vol. 27. Cited by: §1.3, §2, §3.2, §5.2.
 [6] F. Hutter, L. Kotthoff, and J. Vanschoren (Eds.) (2018) Automated machine learning: methods, systems, challenges. Springer. External Links: Link Cited by: footnote 5.
 [7] (2021) Advances and open problems in federated learning. Foundations and Trends® in Machine Learning 14 (1–2), pp. 1–210. Cited by: §1.3.
 [8] (2021) How well generative adversarial networks learn distributions. Journal of Machine Learning Research 22 (228), pp. 1–41. External Links: Link Cited by: §2, §3.1, §3.1.
 [9] (2016) On The Dirichlet Distribution. Master’s Thesis, Queen’s University Kingston, Ontario, Canada. Cited by: §7.
 [10] (2016) CRNNGAN: continuous recurrent neural networks with adversarial training. arXiv preprint arXiv:1611.09904. Cited by: §1.3, §2, §3.2, §3.2, §5.1, §5.2, §5.3.2.
 [11] (2015) Constructions for a bivariate beta distribution. Statistics & Probability Letters 96, pp. 54–60. Cited by: §7.
 [12] (2017) Beyond interruptibility: predicting opportune moments to engage mobile phone users. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies 1 (3), pp. 1–25. Cited by: §4, §6.
 [13] (2005) Generalized additive models for location, scale and shape. Journal of the Royal Statistical Society: Series C (Applied Statistics) 54 (3), pp. 507–554. Cited by: §7.
 [14] (2021) An introduction to deep generative modeling. arXiv preprint arXiv:2103.05180. Cited by: §1.2.
 [15] (2019) Learning to wait: wifi contention control using loadbased predictions. arXiv preprint arXiv:1912.06747. Cited by: §6, §9.
 [16] (2021) SAFE: Secure Aggregation with Failover and Encryption. arXiv preprint arXiv:2108.05475. Cited by: §1.3.
 [17] (2018) A survey on deep transfer learning. In Artificial Neural Networks and Machine Learning – ICANN 2018, V. Kůrková, Y. Manolopoulos, B. Hammer, L. Iliadis, and I. Maglogiannis (Eds.), Cham, pp. 270–279. External Links: ISBN 9783030014247 Cited by: §5.5.
 [18] (2019) Timeseries generative adversarial networks. In Advances in Neural Information Processing Systems, H. Wallach, H. Larochelle, A. Beygelzimer, F. d'AlchéBuc, E. Fox, and R. Garnett (Eds.), Vol. 32. Cited by: §1.3, §2, §3.3, §7.
Appendix A Shell Replay Options
Appendix B Rest Api
Url
Method
URL Params
Payload
All JSON fields are optional. Context defaults to global context, users to 1, seq_len to 100, normalize to ”pos” ^{15}^{15}15all values are positive as opposed to forced into 01 range as with minmax, shuffle ^{16}^{16}16shift values a random timestep forward with values falling off the end inserted at the beginning to false.
Response
If format=json (default). Last dimension in the array is the 2tuple .
If format=text
Note, an empty line separates two user traces.