1 Introduction
The behaviors of many of the world’s inhabitants are fundamentally bound by the cycle of the sun and the moon, which creates day and night. As a result, the days of an average person often exhibit periodical patterns in mobility or, more generally, in behavior [26, 27]. Utilizing such reoccurring patterns could substantially benefit various modern ubiquitous applications. For example, the ability to predict at midday a full day’s power consumption for many individual houses would help the smart grid manage its power supply resources dynamically. In the scenario of smart location tracking [14, 34] with a replenishable energy budget, the system either aims to minimize the energy consumption of location tracking, or attempts to maximize the tracking accuracy given a fixed energy budget. A crucial challenge in such a smart tracking system is to estimate, at any time of day, how much further the moving entities will move for the remainder of the day. Ideally, with a greater estimated total travel distance, the system will employ a more conservative sampling strategy (lower sampling frequencies) to cover as much of the whole trip as possible using the restricted energy budget, whereas a more aggressive strategy (higher sampling frequencies) will be favored in the presence of a smaller estimated total travel distance, so that better tracking precision is achieved. Clearly the estimation of the entity’s daily travel distance from partial information is a challenging yet crucial ingredient for the system’s success.
Approaches have been proposed to predict generic timeseries, and many of them capitalize on the phenomenon that for each individual there often exist reoccurring small fragments of time (which we call “snippets”) in their histories. By detecting and reusing such snippets, we are able to reconstruct a day with the elements from previous relevant days. Using a commuter’s daily routines, we show an example of snippet learning for daily traveling time prediction and the difficulties it faces. It is worth noting that throughout the entire paper, we assume that besides the timeseries itself, no other supporting information such as locations is available to the prediction algorithm. For example, to predict a day’s travel distance, the algorithm’s only input is a partial timeseries of the distances traveled in each interval. With a 30-minute interval, the whole day will have 48 timeseries entries, and we aim to use the first half of them to predict the accumulated travel distance for the whole day.
Imagine that a person in our example has two usual routines: 1) on workdays the person goes to work by a particular bus line that stops outside the apartment at 8 a.m., and arrives at the workplace around 9 a.m. The person gets lunch around 12 p.m. somewhere near the workplace every day, and finishes work around 5 p.m.; 2) on weekends the person prefers going to the beach in the morning and coming home in the evening. In the ideal case, the person begins and finishes the same activity at the exact same time on every workday, and the resulting timeseries for travel distances would be identical across days. With snippets, a timeseries for a workday would then be transformed into a series of snippets like , , , , …. Now to predict, at a certain time of the day (e.g. midday), how much further the moving object will move for the remainder of the day, we are left with a simple task. For every interval of the snippet sequence in the example, if the current day shows an identical partial timeseries for that interval, the person is likely to be working that day and is likely to yield the same total travel distance as any other workday. The same method works for the weekends too.
In reality, such patterns do repeat themselves, though not in such a perfectly aligned way; they often appear on a shifted timeline and at a differing pace. Instead of being highly coherent at all times, two working days of a person are often partially similar and partially divergent, posing a serious challenge for the aforementioned prediction method. There are many possible causes which prevent a snippet sequence from being reproduced exactly. For example, the bus in the morning may be 20 minutes late, or the person may wait for a coffee and miss the usual bus. Then, the person may have a later-than-usual lunch at work. Finally, the person may decide one day to do usual items A and B in the order B, A. Coupled with the huge number of non-work-related locations a person could go to and the numerous possible sequences of visiting them, the resulting timeseries could show a huge variety of distortions of the regular timeseries. In such cases, how to learn representative snippets and how to use them effectively remains a major challenge.
To solve this complex problem, we adopt the concept of snippets but take a step forward and propose a robust learning and timeseries prediction model to systematically reduce the effect of such distortions. Specifically, we make the following contributions in this paper:

We propose a novel regression model, which is based on convolutional neural networks, to solve the robust snippets learning and periodical timeseries prediction problem.

We propose a novel technique called temporal embedding to improve the classical convolutional neural networks’ capability for learning robust snippets and for predicting accurately. We design a network layer based on this concept, devise a complete four layer network (TeNet) for regression, and solve the corresponding backpropagation problem. We also offer a detailed case study to illustrate the effect of temporal embedding.

We conduct extensive experiments on 15 individual datasets representing three data modalities and one synthetic dataset to evaluate the advantages and characteristics of the proposed model.
The rest of the paper is organized as follows. Next in Section 2 we present the background and relevant literature of the problem studied. In Section 3 we give the intuition behind TeNet, describe in detail the technique of temporal embedding and other layers of TeNet, and offer solutions to the backpropagation of TeNet. We then enter Section 4 and evaluate the proposed model. Finally we conclude our work in Section 5.
2 Background and Related Work
Learning abstract features (with neural networks in many cases) has been extensively studied in recent years and has proved effective in many applications. For instance, numerous studies [3, 2, 15, 18, 9] have shown that deep neural networks perform well for complex computer vision classification tasks, while many demonstrate that success can be achieved with deep learning architectures for audio classification tasks as well [19, 22]. These well-performing deep neural networks have a variety of core ideas, ranging from restricted Boltzmann machines that utilize an energy model [13, 17], to sparse autoencoders that introduce an unsupervised “denoising” mechanism to remove insignificant, noisy signals from data [29, 3, 30], to using convolution as an effective way to learn representative features robust to geometric locations in images [18, 5]. The main advantage of such methods is that they have a strong capability of unravelling the hidden hierarchical structure of data to derive representative features. Moving from a shallower architecture to a deeper architecture, these models progressively detect essential components of the data, from local parts like strokes in human handwriting to global compositions such as digits or objects. Among the variations of neural networks, convolutional networks, inspired by biological processes [20], particularly excel in finding such abstract features that are robust to geometric variations in images [18]. Interestingly, such advantages of convolutional neural networks are present not only in vision tasks, but also in speech recognition [1, 8, 12] and natural language processing [6, 7].

Now we consider the periodical timeseries prediction problem for data such as daily traveling distances or daily household power consumption. To tackle this problem, statistical models such as autoregression and its variants are conventionally favored. In the past decade, however, realizing that there is abstract and structural information beneath the raw numeric values in the timeseries, researchers have attempted to discover such patterns by clustering or “motif” discovery [23, 26, 27]. Though conceptually similar, these “motifs” are usually concrete subsequences restricted by specific mathematical definitions, which distinguishes them from the concept of abstract, representative snippets in our paper. However, how to design a method that can find abstract patterns as well as predict future values, while remaining robust to various temporal distortions and misalignment, is yet to be answered. Inspired by the success of convolutional neural networks, we investigate using convolution-based neural networks to address this problem.
3 The Model
3.1 Intuition
The two main challenges for the periodical timeseries prediction are: 1) how to find representative snippets for the prediction of future changes; and 2) how to minimize the effect of distortions in the temporal domain and get accurate regression results. Here we examine the two challenges separately and propose solutions to them from a neural networks perspective.
The first challenge, i.e. snippet learning, involves finding abstract sequences in the training timeseries. Naturally there is an assumption that the snippets should only be of moderate length. For example, if we were to predict daily human mobility, a time window from half an hour to a few hours would be a reasonable setting, as intuitively such a period of time should be enough to cover most of the common trips in daily life. Hence in the prediction model, we examine such periods of time using a convolutional approach. We create randomly initialized filters whose given, moderate length is the length of the target snippets. In 2D image classification tasks, filters in convolutional neural networks are often used as edge detectors, while in ours, the filters will act as “snippet detectors”. In the training phase, the weights of the filters are adjusted during backpropagation so that they respond maximally to the reoccurring and significant components in the training data.
We then solve the second challenge by adding a “temporal embedding” operation in the neural network. The temporal embedding process provides a supervised way of denoising subspace learning. When dealing with timeseries, a naïve technique is to “shift” the training data forward and backward along the timeline. For example, a shifting routine with window size 1 would transform a training sample into three training samples . Though sometimes useful, this naïve approach introduces heavy noise by including artificial training samples that may never actually happen in the real world. It also does not help in cases where the order of the subsequences is changed. We argue that the naïve technique can evolve into a much more effective approach called temporal embedding, which integrates mechanisms for removing distortions into the learning process. With temporal embedding, two temporally shifted copies are created for each sample during the learning process, and then the original sample and the two shifted copies are encoded into a single sample, so that the processed sample will not only carry its own information, but also bear a piece of information for each of its shifted neighbors. Again, the weights for the encoding are learned in a supervised way during backpropagation.
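The naïve shifting routine described above can be sketched as follows. This is only an illustration under stated assumptions: the function name `shift_augment`, the sample values, and the choice to pad vacated positions with zeros are hypothetical and are not taken from the paper.

```python
import numpy as np

def shift_augment(x, window=1):
    """Return the original sample plus copies shifted by 1..window steps
    in each direction. Vacated positions are zero-padded (an assumption)."""
    x = np.asarray(x, dtype=float)
    samples = [x]
    for s in range(1, window + 1):
        earlier = np.concatenate([x[s:], np.zeros(s)])   # every event occurs s steps earlier
        later = np.concatenate([np.zeros(s), x[:-s]])    # every event occurs s steps later
        samples.extend([earlier, later])
    return np.stack(samples)

x = np.array([0.0, 2.0, 5.0, 1.0, 0.0, 0.0])  # hypothetical 6-interval sample
augmented = shift_augment(x, window=1)         # three samples of length 6
```

Each shifted copy is treated as an independent training sample, which is why this approach can inject artificial days that never occurred; temporal embedding instead folds the shifted copies into one sample with learned weights.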
Next we present an overview of the TeNet model.
3.2 Model Overview
We propose a convolutional neural network to learn the snippets from the periodic timeseries, as illustrated in Figure 1. The model has three hidden layers, namely the temporal embedding layer, the convolution/max-pooling layer, and the sigmoid layer. The output layer is an l1-regularized least-squares regression layer. The illustrated model is an example instantiation of the proposed model, with the input size, embedding window size, number of snippets, snippet size, max-pooling size and sigmoid layer size set to 6, 1, 2, (1,3), (1,2) and 3 respectively. The model implements the following workflow:

It takes an input sample, and applies the temporal embedding. This layer transforms the sample into a denser representation with not only the sample itself but also information of its potential temporal neighbors. The weights of the transformation are iteratively updated during the training process.

The embedded input is sent into a convolution layer where a set of filters, or snippet detectors, scan through the sample using the convolution operator. Each snippet will be convolved against the sample, resulting in a feature map considered as the snippet’s response to that sample.

The snippets’ responses to the sample, being supposedly sparse and representative, are input into a sigmoid layer to combine some of the responses into higherlevel and more abstract representations in lower dimensions. This transformation also involves a set of weights that is learned over the training process.

Finally the abstract representation of the sample is used to perform an l1-regularized least-squares regression to obtain the predicted value. The intuition behind the l1 regularization is that if we consider the previous layer’s output, i.e. the high-level neurons’ responses to the sample, as the responses of high-level pattern recognizers to the signal, a sparse solution will utilize the most significant responses and hence will be less sensitive to noise [21, 25].
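The four-step workflow above can be sketched as a single forward pass using the example sizes quoted for Figure 1 (input 6, embedding window 1, two filters of length 3, pooling size 2, sigmoid layer of size 3). All parameter names and the random initial values are hypothetical, and the use of a "full" convolution is an assumption chosen because it yields 8 outputs per filter for an input of length 6.

```python
import numpy as np

n, window, n_filters, filt_len, pool, n_hidden = 6, 1, 2, 3, 2, 3
rng = np.random.default_rng(0)
offsets = range(-window, window + 1)
params = {
    "W_te": {d: np.eye(n, k=d) for d in offsets},        # embedding weights, one matrix per offset
    "filters": rng.normal(size=(n_filters, filt_len)),    # snippet detectors
    "W_sig": rng.normal(size=(n_hidden, n_filters * (n + filt_len - 1) // pool)),
    "b_sig": np.zeros(n_hidden),
    "w_out": rng.normal(size=n_hidden),
    "b_out": 0.0,
}

def forward(x, p):
    # 1) temporal embedding: only neighbor connections survive the 0/1 mask
    z = sum((p["W_te"][d] * np.eye(n, k=d)).T @ x for d in offsets)
    # 2) convolution with every snippet detector, then max-pooling and tanh
    maps = np.stack([np.convolve(z, f, mode="full") for f in p["filters"]])
    pooled = np.tanh(maps.reshape(n_filters, -1, pool).max(axis=2))
    # 3) sigmoid layer combining the responses into fewer, higher-level features
    h = 1.0 / (1.0 + np.exp(-(p["W_sig"] @ pooled.ravel() + p["b_sig"])))
    # 4) linear regression output (the l1 penalty only enters the training cost)
    y_hat = float(p["w_out"] @ h + p["b_out"])
    return y_hat, (z.shape, maps.shape, pooled.shape, h.shape)

y_hat, shapes = forward(np.array([0.0, 2.0, 5.0, 1.0, 0.0, 0.0]), params)
```

The returned shapes trace how the sample shrinks from 6 inputs to 3 high-level features before the regression step.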
In the following subsections we discuss the layers separately in detail. In the rest of the paper, the technical details of the neural network will be described mostly in vector forms, and we will use the assumptions and notations listed in Table 1.
Notation  Description 

the input timeseries of length  
the layer number  
the weights for the layer  
the bias for the layer  
the input of neurons in the layer  
the intermediate values for the layer  
the activation function for the layer 

the intermediate error (cost) of the layer  
the network’s cost given  
the transpose operator  
the dot product operator  
the elementwise product operator  
the convolution operator  
the derivative of function 
3.3 Temporal Embedding
The temporal embedding layer aims to align less dominant samples to the dominant patterns by reducing temporal distortions and misalignment (e.g. shifted or reordered sequences of events). These correspond to two cases in our previous example: 1) the commuter starts the day 30 minutes earlier than usual, so every event in the morning rush hour is shifted ahead equally by 30 minutes; 2) for some reason the commuter does not take the usual bus line which stops directly at the workplace, and instead takes a train and walks 1 km to work from the station. In the resulting timeseries we will see two distinct effects as a result of 1) and 2). For example, assume that on a normal day the travel distance timeseries segment in the morning will be , then for case 1 we will have , and in case 2 it will be . Now we assume both cases happen on the same day, giving us , which is heavily distorted from . It is a significant challenge for a prediction algorithm to realize that the travel distances for the two days should be very similar even though the sequences and the values of their timeseries are so different.

Temporal embedding addresses this issue by optimally embedding a value’s temporal neighbors into itself, so that for the whole dataset the dominant pattern remains unchanged but the distorted patterns are realigned. The layer is configured by one hyperparameter that controls how many neighbors of an element in each direction should be embedded into the element itself (the embedding size). This layer has sets of parameters, represented by matrices and , and the same number of constant sparse matrices and . The subscripts and represent the direction of the neighbors on the timeline, and here means the weights for the neighbor in the final embedding. In the case of , there are three matrices and three matrices in this layer. The six matrices together implement the embedding operators. Here we use the input dimensions in Figure 1 (where ) as an example for how this layer works.

The constant matrices are defined as:
(1)  
(2) 
Weights in , that correspond to the s in , and represent the weights for the embedding of the sample’s left neighbor (forward), the sample itself and its right neighbor (backward) respectively, and they are initialized with corresponding constant matrices respectively. The layer’s output is subsequently defined as follows:
(3)  
(4) 
enforces a constraint that the connections between this layer and its input are restricted, and only the weights at the desired neighboring positions for each element are used in the final embedding for that element. The layer yields the temporal embedded output , or in this example, as the output of the layer. One can also use the sigmoid function as the activation function in the temporal embedding layer, though our experiments show that the difference it makes to the prediction accuracy is insignificant (most of the time adding the sigmoid activation slightly decreases the prediction accuracy).
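A minimal sketch of the masked embedding described above is given below. The notation is assumed, not recovered from the paper: each offset `d` in {-1, 0, +1} has a constant 0/1 matrix `masks[d]` with ones on the d-th diagonal and a weight matrix `weights[d]` initialized to that mask, and the output is taken to be the sum over offsets of `(weights[d] * masks[d]).T @ x` with an identity activation.

```python
import numpy as np

def shift_masks(n, window=1):
    """Constant sparse 0/1 matrices, one per neighbor offset (assumed layout)."""
    return {d: np.eye(n, k=d) for d in range(-window, window + 1)}

def temporal_embedding(x, weights, masks):
    """Embed each element's neighbors into it. The elementwise product with the
    mask keeps only the weights at the permitted neighbor positions."""
    return sum((weights[d] * masks[d]).T @ x for d in masks)

n = 6
x = np.array([0.0, 2.0, 5.0, 1.0, 0.0, 0.0])           # hypothetical sample
masks = shift_masks(n, window=1)
weights = {d: m.copy() for d, m in masks.items()}      # initialized with the constant matrices
z_init = temporal_embedding(x, weights, masks)         # same length as the input

# Dense weights make no difference outside the masked positions.
dense = {d: np.ones((n, n)) for d in masks}
z_dense = temporal_embedding(x, dense, masks)
```

Because every output position has its own three weights, training can learn a different realignment at each time of day, which is the flexibility the text attributes to this layer.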
The layer’s output is a vector of the same size as the input; however, the embedded sample is now significantly more robust to temporal distortions. With temporal embedding, the model detects dominant patterns in the training timeseries, and tries to correct the systematic distortions within the specified time window. Using the commuter example, the model will find the person’s regular time for the bus to work, and will try to realign the systematic misalignment on those unusual days. Some readers may argue that a simple moving average algorithm might be able to solve the distortion problem; however, temporal embedding is far more effective, as the concrete example below shows.
Discussion and Case Study
Recall our example with and , where represents the dominant pattern in the dataset, while represents a day that in fact will yield a similar endofday result but shows very distorted patterns in its timeseries. Now given the parameter matrices and the constant matrices initialized as in Equation 2, our objective is to realign with by eliminating the distortion, and meanwhile keeping as unchanged as possible, which is effectively equivalent to solving the following minimization problem in Equation 7:
(5)  
(6)  
(7) 
where and are the embedded new timeseries. By solving the optimization, the nonzero weights in , and are determined as , and respectively. Now and can be calculated according to Equations 5 and 6, and we subsequently investigate how temporal embedding performs in terms of preserving and realigning to , compared with the moving average approach, with and being the output of and of a moving average of window size 3 ().
Squared Error  Intersection  Pearson’s  
4.5  4  0.11  
0  8  1  
0  8  1  
2  6.3  0.87  
3.1  5.2  0.02 
Table 3 measures the relations between the vectors before and after the transformations with three metrics, namely squared error, intersection and Pearson’s correlation. First we note that is so distorted that the correlation between and is merely 0.11, which can be considered “uncorrelated”. Now we examine the differences between the effects of temporal embedding and moving average.
Ideally, the transformation should show the following properties: 1) since represents the reoccurring pattern in the training set, we want to be as unchanged as possible after the transformation; 2) after the transformation, should be as similar to as possible, indicating that the misalignments in have been minimized and is realigned to the representative sample . We verify the two aspects by examining the relation between and , and that between and , and observe that temporal embedding has achieved both goals.
First we observe that is identical to (with 0 squared error), while has been transformed to a form that is perfectly identical to and now, with the dominant values at the second and third positions swapped and realigned to the third and fourth positions to be more in line with . In contrast, moving average resulted in a squared error of 2 between and , showing that has not been preserved successfully in the transformation. Second, though moving average does strengthen the relation between and by reducing the squared error (from 4.5 to 3.1) and by increasing the similarity by intersection, it has even resulted in a drop in the correlation (0.02 compared with 0.11 for the original and ). We conclude that its result is clearly less successful than that of temporal embedding (0 in squared error, 8 in intersection, and a correlation of 1).
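The three metrics and the moving-average baseline used in this comparison can be sketched as below. Several points are assumptions: "intersection" is taken to be the histogram intersection (sum of elementwise minima), the moving average is centered with zero padding at the ends, and the vector `dominant` holds made-up values rather than the vectors of the case study.

```python
import numpy as np

def squared_error(a, b):
    return float(np.sum((a - b) ** 2))

def intersection(a, b):
    # Assumed definition: histogram intersection, i.e. the sum of elementwise minima.
    return float(np.sum(np.minimum(a, b)))

def pearson(a, b):
    return float(np.corrcoef(a, b)[0, 1])

def moving_average(x, window=3):
    # Centered moving average, zero-padded at both ends (an assumption).
    return np.convolve(x, np.ones(window) / window, mode="same")

dominant = np.array([0.0, 0.0, 3.0, 6.0, 0.0, 0.0])   # hypothetical dominant pattern
smoothed = moving_average(dominant)                     # smears the peak over its neighbors
change = squared_error(smoothed, dominant)              # nonzero: the pattern is not preserved
```

Even on this toy vector the moving average alters the dominant pattern, which mirrors the first of the two properties checked in the case study.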
It is worth noting that although the temporal embedding layer in the proposed neural network is not exactly the same as in Equation 7 as it does not have knowledge initially about which samples hold the representative patterns, as the training proceeds, the weights will progressively favor the reoccurring patterns, and eventually approach the solution of Equation 7. Next we describe the convolution, the maxpooling and the sigmoid layers.
3.4 Convolution, Maxpooling and Sigmoid
The convolution/pooling layer performs a series of discrete 1-d convolutions with a specified number of filters of a specified length . Each of the filters “sweeps” through the entire input signal and takes the input signal segment at the corresponding position as input. With a filter kernel (taking the convention of reversely ordered weights for convolution kernels and outputs), the filter’s output has the element:
(8) 
In the example in Figure 1 we have set two filters with size 1x3, hence in the convolution layer, each neuron will only be connected to three neurons from the temporal embedding layer. Such sparse connectivity between the filters to their inputs enforces that the convolution layer will be focusing on finding the local snippets with moderate lengths.
Though the convolution traverses the entire timeseries in a sliding-window style and seemingly has a positive effect in reducing the temporal distortions, it is very different from temporal embedding. The main factor differentiating them is the weight-sharing scheme (see Figure 1). A filter in the convolution layer has its weights shared among all its output neurons (meaning a filter slides through the data, trying to match the same particular pattern), while in temporal embedding each neuron has individualized weights to enable optimal local embedding for each position. Such flexibility enables temporal embedding to identify and realign much more complex distortions and misalignments. For example, given , convolution will not be able to recognize the close relation between and because of the heavy distortions in both the positions and the sequences. In the experiments we will also show that without the temporal embedding layer, a convolutional neural network does not work well on such timeseries.
The output of the convolution will be of the size . In Figure 1’s example where , we have 8 neurons in the convolution layer. The output is then received by the max-pooling layer, where only the maximal value is kept from any pool of . The filter’s output will hence be downsampled and transformed by an elementwise hyperbolic tangent function, reducing the output to 4 dimensions. Then, as the last hidden layer, the sigmoid layer performs a projection from the convolution/pooling output to a further reduced dimension, as a means of both learning nonlinear features and reducing dimensionality. Finally, the input is transformed into a dense, robust and representative feature representation of . Intuitively we can consider the sigmoid layer as a higher-level feature learner, applied after the convolution layer has discovered the relatively more “local” snippets.
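The sizes quoted above (8 convolution outputs, reduced to 4 after pooling) are consistent with a "full" discrete convolution of a length-6 input with a length-3 filter, since 6 + 3 - 1 = 8. The sketch below spells this out; the full-convolution reading is an inference from those sizes, and the input and filter values are hypothetical.

```python
import numpy as np

def full_conv1d(x, w):
    """Discrete 1-d convolution y[i] = sum_j w[j] * x[i - j], with x treated
    as zero outside its range. The output has len(x) + len(w) - 1 elements."""
    n, m = len(x), len(w)
    y = np.zeros(n + m - 1)
    for i in range(n + m - 1):
        for j in range(m):
            if 0 <= i - j < n:
                y[i] += w[j] * x[i - j]
    return y

def max_pool1d(y, pool):
    """Keep the maximum of each non-overlapping pool of `pool` values."""
    return y.reshape(-1, pool).max(axis=1)

x = np.array([0.0, 2.0, 5.0, 1.0, 0.0, 0.0])   # hypothetical embedded input, length 6
w = np.array([1.0, 0.5, -1.0])                  # hypothetical snippet detector, length 3
y = full_conv1d(x, w)                           # 8 responses
pooled = max_pool1d(y, 2)                       # 4 values
features = np.tanh(pooled)                      # elementwise hyperbolic tangent
```

The same three weights in `w` are reused at every position, in contrast to the position-specific weights of the temporal embedding layer.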
3.5 l1-regularized Least-squares
The output layer of the proposed model is an l1-regularized least-squares regression layer, defined as:
(9) 
with the cost function in the form of:
(10) 
where is a hyperparameter for the weight of the regularization term.
The advantage of using the l1 regularizer over l2 is that the l1 regularizer forces the optimization to find a sparse solution that only uses the most distinctive high-level features to produce the final prediction [21, 25]. With the l2 regularizer the weights tend to have smaller variance, often making the model spread the energy thinly across all features, hence making the model less distinctive and less accurate.
3.6 Backpropagation
The parameters in the network are updated by stochastic gradient descent. In particular, can be learned by:
where is the sign of a vector. One can speed up this optimization process using the methods proposed in [28].
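To illustrate how the sign of the weight vector enters a stochastic gradient step for an l1-regularized least-squares layer, here is a small sketch. The cost form `0.5 * (w.h + b - y)**2 + lam * ||w||_1`, the symbol names and the step sizes are assumptions for illustration and may differ from the paper's exact formulation.

```python
import numpy as np

def l1_sgd_step(w, b, h, y, lam=0.01, lr=0.1):
    """One SGD step on J = 0.5 * (w.h + b - y)^2 + lam * ||w||_1 (assumed form)."""
    err = w @ h + b - y
    # np.sign returns 0 at exactly zero, which acts as the subgradient choice there.
    grad_w = err * h + lam * np.sign(w)
    return w - lr * grad_w, b - lr * err

w0 = np.array([1.0, -1.0])
h = np.array([1.0, 0.0])          # hypothetical sigmoid-layer output
w1, b1 = l1_sgd_step(w0, 0.0, h, y=1.0)
```

In this toy step the prediction error is zero, so the only movement comes from the penalty term, which pulls both weights toward zero by the same amount regardless of their magnitude; this is the mechanism that favors sparse solutions.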
To update the parameters in the temporal embedding layer, taking as an example, we apply the chain rule and arrive at:
(11) 
Since the elementwise product has the property:
(12) 
we have the partial derivative of w.r.t. as:
(13) 
We calculate the error propagated from layer 2 to layer 1 as:
(14) 
where returns the input vector in reversed order. With the convolution layer’s back propagated error being (which can be calculated by the method described in [16]), can therefore be updated with the gradient:
(15) 
and can be updated using similar procedures. Meanwhile, is updated with the gradient:
(16) 
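The role of the elementwise product in these gradients can be checked numerically on a simplified stand-in. The sketch assumes a single offset with embedding output `z = (W * C).T @ x` and replaces the error arriving from the convolution layer with that of a plain quadratic cost; under these assumptions the gradient with respect to `W` is the mask `C` times the outer product of the input and the error. All names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 6
C = np.eye(n, k=1)                      # constant mask for one neighbor direction
W = rng.normal(size=(n, n))
x, t = rng.normal(size=n), rng.normal(size=n)

def cost(W):
    z = (W * C).T @ x
    return 0.5 * np.sum((z - t) ** 2)   # stand-in for the network cost

delta = (W * C).T @ x - t               # error arriving at the embedding output
analytic = C * np.outer(x, delta)       # masked exactly like the weights

# Central finite differences as an independent check.
numeric = np.zeros_like(W)
eps = 1e-6
for i in range(n):
    for j in range(n):
        E = np.zeros_like(W)
        E[i, j] = eps
        numeric[i, j] = (cost(W + E) - cost(W - E)) / (2 * eps)
```

Weights outside the masked positions receive exactly zero gradient, so they never influence training, which matches the restricted connectivity described for this layer.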
Next we present the experimental results and offer indepth analysis and discussion.
ID  n  [~]  HitRate (%, @20%@30%)  Error (MRE/MSE)  
TeNet  SVLN  SVSIG  SVPOLY  MKR  TeNet  SVLN  SVSIG  SVPOLY  MKR  
Household Power Consumption  Australia (HPCAU)  
8  874  3.136.4  5.6  7189  5574  5983  6082  4875  0.167.7  0.2414.4  0.2310.5  0.239.5  0.2917.9 
15  870  1.530.9  5  6584  5878  5679  5980  5375  0.211.2  0.2514.2  0.2615  0.2413.6  0.2618.5 
14  670  1.759.2  7.2  6579  4264  4060  5072  6377  0.2110  0.2613.7  0.3115.3  0.2411.1  0.3110 
7  665  8.438.4  5.0  7592  6788  7390  7591  7289  0.1415.9  0.1617.1  0.1515.8  0.1414.4  0.1615.7 
5  661  0.28.0  1.6  7182  5876  5776  5876  7082  0.841  1.171.82  1.11.7  1.11.7  0.390.9 
12  243  7.627.2  2.8  9097  8896  9096  8592  7891  0.094.6  0.14.6  0.094.8  0.128.2  0.178 
10  242  4.342.6  7.7  8496  7492  7894  7090  7183  0.128  0.177.5  0.139.2  0.1210.6  0.167.15 
1  241  8.937.9  4.6  8496  8095  7592  7691  7693  0.1212  0.1211.2  0.1414  0.1522  0.139.23 
13  241  4.446.7  6.3  6582  6278  5979  6380  5980  0.1922  0.222.3  0.226  0.2123  0.2418.7 
29  233  17.373.5  11.3  7793  7592  5980  7793  7692  0.1443  0.1331  0.285.0  0.1334  0.1537.6 
Household Power Consumption  France (HPCFR)  
1  161  1079.5  10.3  6483  5675  5373  6075  6381  0.1874.6  0.23111  0.26 110  0.22100  0.267 
Human Mobility  Traveling Distance (HMD)  
8  206  899  15.2  4663  4862  3550  4863  3345  0.28169  0.29170  0.35198  0.31300  0.4323 
12  156  9.560  11  5983  5677  4365  5470  4567  0.2074.5  0.23101  0.2896  0.27285  0.27103 
Human Mobility  Traveling Time (HMT)  
8  193  55345  47  5170  4764  4258  4057  4566  0.2332.1  0.2436.6  0.2859  0.35179  0.2543 
12  243  37280  32.4  6174  5870  4867  4968  5774  0.2124.0  0.2329.4  0.2529  0.360  0.2225 
4 Experiments
In the experiments, we conduct extensive tests on the proposed model, with 15 individual datasets and 4 competitive methods. The goals of the experimental studies are fourfold: 1) to evaluate the prediction performance of the proposed model, in terms of prediction accuracy, and compare it with the competitive models; 2) to evaluate the model’s behavior and sensitivity to features of diverse datasets; 3) to investigate the isolated effects of temporal embedding; and 4) to visualize the snippets and show how they work with intermediate values from the learning process.
4.1 Datasets
To support the comprehensive evaluation, we use a variety of univariate, periodical timeseries datasets that represent three modalities, ranging from human mobility patterns to household power consumption. The reason we choose these modalities is that the behaviors they represent are expected to exhibit complex periodical patterns in daily cycles, which is an ideal testbed for the proposed model to demonstrate its capability of discovering and capturing such abstract features and to test its robustness to various factors.
The first modality is Human Mobility - daily traveling Distance (HMD) in kilometers, and the second is Human Mobility - daily traveling Time (HMT) in minutes. Both modalities are extracted from the LifeMap dataset [4], which contains human mobility traces collected from eight individuals, spanning from a few months to around two years. In total there are 52,819 position fixes, most of which are from regular sampling every two to five minutes. HMD is the total displacement for an individual in a day, and HMT is accumulated from short-term movements calculated as follows: for each five-minute interval, if the individual’s displacement is higher than 500 meters (median errors of localization with assisted GPS, WiFi positioning and cellular network positioning are reported to be 8, 74 and 600 m [33]), then the five-minute period is counted as “traveling” and is added to the daily total traveling time.
The third modality is daily Household Power Consumption (HPC). Two datasets are used for this modality, i.e. household power consumption datasets from France (https://archive.ics.uci.edu/ml/datasets/Individual+household+electric+power+consumption) and Australia (http://data.gov.au/dataset/samplehouseholdelectricitytimeofusedata), denoted HPCFR and HPCAU respectively. HPCFR consists of 2,075,259 active power consumption readings in watts sampled every minute for 48 months from a single household. HPCAU consists of household power meter readings in kilowatt-hours sampled every 30 minutes from households for up to 29 months.
To prepare the data, we developed a program to extract only the samples that have complete (or nearly complete) day cycles, meaning that every data sample used must have regular readings in each period of time in a complete day. To obtain meaningful results, only individuals with more than days of records are used in the experiments.
For the human mobility datasets, we use the two individuals’ datasets with the highest quality of data in terms of timespan (>150 days) and sampling frequency. We extract the traveling distances and traveling times for each interval (e.g. a 30-minute interval creates a 48-dimensional timeseries for a day), and use the resulting timeseries for the experiments. Similar preprocessing is applied to the power consumption datasets. After preprocessing, each timeseries sample has elements as , each of which is the value that occurred in the corresponding time interval (non-cumulative).
For each individual dataset, we randomly divide the samples equally into three folds: the training set, the validation set and the test set. The model is trained using the training set, and is then evaluated on the validation set. Such cross-validation is performed on the same individual dataset five times with random splits, and the reported performance is the average across the five iterations. The hyperparameter settings with the best validation performance are kept as the hyperparameters of the model. Finally we test the model on the test set and report the performance.
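The splitting protocol can be sketched as follows. The function name, the random seed and the dataset size of 243 daily samples are hypothetical choices for illustration; the sketch covers only the index bookkeeping, not model training.

```python
import numpy as np

def three_way_split(n_samples, rng):
    """Randomly divide sample indices into three near-equal folds:
    training, validation and test."""
    train, val, test = np.array_split(rng.permutation(n_samples), 3)
    return train, val, test

rng = np.random.default_rng(0)
n_samples = 243                                   # hypothetical number of daily samples
splits = [three_way_split(n_samples, rng) for _ in range(5)]  # five random repetitions

# A per-repetition score would be averaged like this (placeholder values only).
scores = np.zeros(len(splits))
mean_score = scores.mean()
```

Every repetition partitions the same set of days, so each sample appears in exactly one of the three folds per repetition.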
4.2 Evaluation Settings
For evaluation we consider the periodical accumulation prediction problem, where each input is a head segment of a complete and corresponds to a target value representing the periodical accumulation. Clearly the model can also be used to perform other types of prediction such as timeseries forecast or ahead prediction. Due to space limits, here we use periodical accumulation prediction as a showcase for TeNet’s performance advantages.
TeNet is implemented using Python with the Theano framework (http://deeplearning.net/software/theano/). For comparison, we consider four competitive methods, namely Support Vector regression with Linear kernel (SVLN), Support Vector regression with Radial Basis kernel (SVSIG), Support Vector regression with Polynomial kernel (SVPOLY), and Multiple Kernel Regression (MKR) [24]. For the SV family, we carefully tune three parameters: the error margin, the degree of the kernel function, and the kernel coefficient. Each parameter’s value is selected from its own candidate set, giving 363 combinations in total for each model. For each test run, during training we iterate through every combination of the candidate values and keep the values that yield the highest accuracy on the validation set, then use these parameters on the test set and report the results. For a comparable evaluation against MKR, we use an offline implementation in which test samples are not used to update the parameters, and the number of support vectors is set to 120 to match the parameter size of TeNet. The hyperparameter selection of TeNet follows the same procedure. We provide more details in Section 4.6.2.
For most of the experiments the input dimensionality is set to 28, meaning that for each day the timeseries up to 2pm is known to the model. We select this number because humans rarely remain active from 12am to 4am, so the values in that period are almost all zeros; the first 28 entries therefore carry information from exactly half of the active period, which runs from 4am to 12am of the next day. This setting is challenging in the sense that the gap between 2pm and 12am of the next day is substantial and leaves numerous possible outcomes for the daily accumulation. The complexity involved hence provides insight into how well the proposed and the competitive models can capture an individual’s daily patterns and make predictions from limited information.
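The arithmetic behind this setting, and the way an input/target pair is formed, can be checked with a small sketch. The helper names `cutoff_time` and `make_example` and the toy day are hypothetical and serve only as illustration.

```python
INTERVAL_MINUTES = 30
INTERVALS_PER_DAY = 24 * 60 // INTERVAL_MINUTES  # 48 entries per day

def cutoff_time(d):
    """Clock time (hour, minute) at which the first d intervals end."""
    return divmod(d * INTERVAL_MINUTES, 60)

def make_example(day, d=28):
    """Split one noncumulative day into (known head segment, daily total)."""
    if len(day) != INTERVALS_PER_DAY:
        raise ValueError("expected a complete day")
    return day[:d], sum(day)

# Toy day: no activity from 12am to 4am, one unit per interval afterwards.
toy_day = [0.0] * 8 + [1.0] * 40
head, target = make_example(toy_day)
```

With 30-minute intervals, 28 entries end at 14:00. In the toy day the 8 inactive entries plus 20 active ones make up the head, so the model sees half of the 40 active intervals.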
Next we present the experimental results for the proposed method and the competitive methods, and offer an in-depth discussion of hyperparameter tuning and of the effect of temporal embedding.
4.3 Prediction for Periodical Accumulation
Table 3 reports the prediction performance of the proposed method and the four competitive methods on 15 individual datasets of three different modalities, evaluated by average HitRate (HR)@20% and HR@30%, Mean Squared Error (MSE) and Mean absolute Relative Error (MRE). We use four metrics because, for datasets with long-tailed values (as human behaviors can often be characterized [11]), MSE as an absolute measurement is not by itself an ideal metric for evaluating a regression method: it is heavily biased by samples in the long tail [31, 32]. We therefore mainly use relative measures for the evaluation while keeping MSE as a reference.
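For concreteness, the four metrics can be sketched as below. The HitRate definition used here, the fraction of predictions whose absolute relative error is within the tolerance, is an assumption made for illustration, as are the function names and the toy values.

```python
def relative_errors(y_true, y_pred):
    """Absolute relative error per sample (assumes nonzero true values)."""
    return [abs(p - t) / abs(t) for t, p in zip(y_true, y_pred)]

def hit_rate(y_true, y_pred, tolerance):
    """Fraction of predictions within `tolerance` relative error (assumed definition)."""
    errs = relative_errors(y_true, y_pred)
    return sum(e <= tolerance for e in errs) / len(errs)

def mre(y_true, y_pred):
    """Mean absolute relative error."""
    errs = relative_errors(y_true, y_pred)
    return sum(errs) / len(errs)

def mse(y_true, y_pred):
    """Mean squared error."""
    return sum((p - t) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

# Toy values, not taken from the experiments.
y_true = [10.0, 20.0, 40.0, 100.0]
y_pred = [11.0, 25.0, 30.0, 100.0]
```

In this toy case the single largest miss (40 vs. 30) contributes most of the MSE, while the relative measures weigh it the same as the miss on 20 vs. 25. This is the bias the paragraph above refers to.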
The numbers highlighted in magenta, blue, red and black indicate the winning performance on that dataset under the corresponding metric (magenta: HR@20%, blue: HR@30%, red: MRE, black: MSE). Multiple highlighted numbers with the same color in a row indicate multiple winners under that metric on that dataset. We also report some properties of each individual dataset, i.e. the total number of samples n, the numeric range, and the standard deviation. A closer look at these statistics shows large variety in the number of samples (from 156 to 874), the numerical ranges (0.2 to 345) and the variances (from 2.8 to 47). To present the reader with more intuitive and meaningful results, the numbers shown are unnormalized.
Generally, the distribution of the highlighted winning performances shows that TeNet achieved the best results in most cases, with a few non-systematic exceptions spread across the competitive methods. Out of the 15 individual datasets, TeNet has won 14 entries in HR@20%, 15 entries in HR@30%, 13 entries in MRE, and 7 entries in MSE, showing superior performance among the evaluated models. SVLN and SVSIG show the least competitive results, with 1, 0, 2, 1 and 1, 0, 1, 0 winning performances respectively. SVPOLY obtains slightly better results with 3, 2, 3, 0 wins. MKR, on the other hand, shows comparable results in MSE but far less competitive results in the other metrics, with 0, 0, 1, 8 wins. In addition, we find that MKR is less robust to larger numerical ranges such as those in HMD8, HMD12, HMT8, and HMT12, while TeNet performs consistently across all datasets.
To compare the methods quantitatively, Figure 2 shows each method’s average scores across all individual datasets (MSE is normalized by the maximum MSE among the methods in each entry). On the 15 individual datasets, TeNet achieved the best average performance under all four metrics. Taking a TeNet vs. all approach, TeNet’s performance and the average of the other methods’ performance under HR@20%, HR@30%, MRE and MSE are 69 vs. 60, 84 vs. 78, 0.22 vs. 0.27 and 34 vs. 51 respectively, a relative improvement of 15%, 8%, 19% and 33% under the corresponding metrics. Comparing TeNet against the best of the rest, with HR@20% of 69 and HR@30% of 84, TeNet beats the second-best HR@20% of 61 (SVLN, SVPOLY) by 8 and the second-best HR@30% of 78 (SVLN, SVPOLY) by 6; on MRE and MSE, TeNet’s average errors are 0.22 and 34, while the second-best values are 0.24 and 40 (MKR). Hence, across all 15 individual datasets, on average TeNet marks a 13% increase in HR@20%, an 8% increase in HR@30%, a 9.1% decrease in MRE and a 15% decrease in MSE relative to the second-best method under each corresponding metric. We also observe that although TeNet obtained the best performance under HR@30% on all 15 individual datasets, its average winning margin there is smaller than under the other metrics. This is because HR@30% is a relatively looser measurement than the other metrics, so less accurate predictions tend to score similarly. However, the consistent advantage of TeNet in all four metrics, not only HR@30%, still suggests that it has the best prediction accuracy. We hence conclude that TeNet has shown consistent advantages that are robust to variations in the data modality as well as in the statistical characteristics of the data.
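The TeNet vs. all percentages can be reproduced from the averages quoted above. The sketch below assumes that each improvement is expressed as a fraction of the other methods' average; the function name `relative_improvement` is hypothetical.

```python
def relative_improvement(tenet, others, higher_is_better):
    """Improvement of TeNet over the baseline, as a fraction of the baseline."""
    delta = tenet - others if higher_is_better else others - tenet
    return delta / others

# (TeNet average, average of the other methods, whether higher is better),
# copied from the text above.
averages = {
    "HR@20%": (69, 60, True),
    "HR@30%": (84, 78, True),
    "MRE": (0.22, 0.27, False),
    "MSE": (34, 51, False),
}

improvements = {
    name: round(100 * relative_improvement(*values))
    for name, values in averages.items()
}
```

Under that assumption the rounded values come out as 15, 8, 19 and 33 percent, matching the figures reported for the four metrics.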
We further examine TeNet’s ability to scale up its learning effectiveness with a growing sample size or an increasing complexity of the data. Taking MRE as an example, we measure two correlations for the individual datasets using Pearson’s correlation coefficient: 1) the correlation between the averaged performance advantage and the sample size, and 2) the correlation between the averaged performance advantage and the entropy. The measurements yield correlation coefficients of 0.7 and 0.79 respectively, suggesting a strong correlation within each pair of variables. Such patterns mean that as the sample size or the complexity of the data grows, TeNet is able to learn more effectively than the other methods and achieve better performance. The correlations are also visually identifiable in the performance advantage ratios plotted in Figure 3.
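Pearson's correlation coefficient, as used in this analysis, can be computed with a few lines of code. The per-dataset numbers below are placeholders made up for illustration; they are not the values behind the reported coefficients.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

# Placeholder values: sample size and TeNet's MRE advantage per dataset.
sample_sizes = [156, 300, 520, 874]
mre_advantage = [0.02, 0.05, 0.04, 0.09]
r = pearson(sample_sizes, mre_advantage)
```

A coefficient close to 1 indicates that the advantage grows roughly linearly with the sample size; the same function applies to the entropy comparison.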
4.4 The Effect of Feature Dimensionality
Figure 4 illustrates the effect of the feature dimensionality on the prediction accuracy, using HPCAU8 as a case study. Figure 4(a) shows how MRE and normalized MSE change as the dimensionality grows. Unsurprisingly, both errors decrease monotonically as it increases, from 1 and 0.35 at the smallest dimensionality tested to 0.08 and 0.07 at the largest. Figure 4(b) depicts how the HR responds to a growing dimensionality. Again, we see almost monotonic growth, apart from an exception, in HR@20% and HR@30%. These results confirm that TeNet can effectively use the additional information while being little affected by the noise in the additional dimensions.
4.5 The Effect of Temporal Embedding
In Section 3.3 we discussed how hypothetically temporal embedding would boost the performance of the model by automatically realigning the distorted timeseries to the dominating patterns in a dataset, and verified it with a case study on a synthetic example. To further validate this hypothesis on real data, we create a designated dataset from HPCAU8 by performing the following procedure:

We run a clustering with the affinity propagation method in [10], and find the top 10 exemplars.

We take the exemplars and generate 300 synthetic samples (30 for each exemplar) by distorting the exemplars with randomly selected operations such as swapping two neighboring segments or shifting the data forward and backward. They are equally split into training, validation and test set.

We train a model with a modified classical convolutional neural network for regression (CNN: input → convolution/pooling → sigmoid → l1-linear regression) without temporal embedding, and a model with TeNet, and examine the performance differences.
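The distortion step of this procedure can be sketched as follows. The exact operations and their parameters are not specified in the text, so the segment length, the maximum shift, the zero padding and all function names below are assumptions made for illustration.

```python
import random

def swap_neighboring_segments(series, start, length):
    """Swap series[start:start+length] with the segment right after it."""
    first = series[start:start + length]
    second = series[start + length:start + 2 * length]
    return series[:start] + second + first + series[start + 2 * length:]

def shift(series, offset):
    """Shift forward (offset > 0) or backward (offset < 0), padding with zeros."""
    n = len(series)
    if offset >= 0:
        return ([0.0] * offset + series)[:n]
    return (series + [0.0] * (-offset))[-n:]

def distort(exemplar, rng, max_shift=2, segment_length=2):
    """Apply one randomly selected distortion to a copy of the exemplar."""
    if rng.random() < 0.5:
        start = rng.randrange(0, len(exemplar) - 2 * segment_length + 1)
        return swap_neighboring_segments(exemplar, start, segment_length)
    offsets = [o for o in range(-max_shift, max_shift + 1) if o != 0]
    return shift(exemplar, rng.choice(offsets))

def synthesize(exemplars, per_exemplar=30, seed=0):
    """Generate `per_exemplar` distorted copies of every exemplar."""
    rng = random.Random(seed)
    return [distort(e, rng) for e in exemplars for _ in range(per_exemplar)]

# Toy exemplars standing in for the 10 found by affinity propagation.
toy_exemplars = [[float((i + j) % 5) for j in range(48)] for i in range(10)]
synthetic = synthesize(toy_exemplars)
```

With 10 exemplars and 30 copies each this yields the 300 synthetic samples, which would then be split equally into training, validation and test sets.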
The results are reported in Table 4. We observe that with the temporal embedding layer the errors are reduced by more than half (MSE from 15.5 to 6.4, MRE from 0.34 to 0.12), and HitRate@20% and HitRate@30% improve by about 100% and 40% respectively. This shows that temporal embedding is able to learn weights that are conceptually equivalent to a reverse operation for the distortions and misalignments.
Model | HR@20% | HR@30% | MRE | MSE
CNN | 38 | 66 | 0.34 | 15.5
TeNet | 75 | 93 | 0.12 | 6.4
4.6 Discussion
4.6.1 Distinctive Snippets
We present a visualization of the random snippets and the learned snippets for the first cross-validation iteration on HPCAU8 in Figure 5. Each cell is a snippet, a segment of timeseries the model deems representative. The figures show some noteworthy properties. Firstly, the random snippets are fairly dense, while the learned ones are much sparser, meaning that in most cases there are only a small number of spikes and valleys in each learned snippet. Secondly, the sparsity of the learned snippets is accompanied by a visually identifiable high distinctiveness: the learned snippets tend to differ from one another because they effectively capture different patterns in the training data. Both properties suggest that the snippets are truly learned from the patterns in the dataset, and both have a positive effect on the model’s prediction accuracy.
4.6.2 Selection of Hyperparameters
How to select the hyperparameters is an issue often faced by complex learning models, including neural networks, and remains an open question studied by many [15]. There are six hyperparameters in the proposed model:
Description | Candidates
filter size | {3, 5, 7}
no. of kernels in conv. layer | {20, 30, 40, 60}
learning rate | {0.01, 0.02}
temporal embedding step | {1, 2}
no. of outputs in sigmoid layer | {12, 16}
weight for the l1 term | {0.1, 0.01, 0.001}
In this paper, since the sizes of the datasets are moderate, we use an intuitive approach to find the hyperparameters for testing. The selection and testing process follows that described in the third paragraph of Section 4.2. One can also use the greedy hyperparameter selection process described in [15]. We also used two optional data preprocessing techniques, i.e. high-pass filtering to denoise, and data shifting to synthesize more training data. The activation of each technique is subject to a control parameter, which is tuned using the same process.
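The exhaustive search over the candidate values listed above can be sketched as follows. The dictionary keys are hypothetical names for the six hyperparameters, and `validation_score` stands for a user-supplied function that trains a model with the given setting and returns its validation accuracy; neither comes from the original implementation.

```python
from itertools import product

# Candidate values copied from the hyperparameter table; key names are hypothetical.
CANDIDATES = {
    "filter_size": [3, 5, 7],
    "num_kernels": [20, 30, 40, 60],
    "learning_rate": [0.01, 0.02],
    "embedding_step": [1, 2],
    "sigmoid_outputs": [12, 16],
    "l1_weight": [0.1, 0.01, 0.001],
}

def grid(candidates):
    """Yield every combination of candidate values as a dict."""
    names = list(candidates)
    for values in product(*(candidates[name] for name in names)):
        yield dict(zip(names, values))

def select(candidates, validation_score):
    """Return the setting with the highest validation score."""
    return max(grid(candidates), key=validation_score)

n_settings = sum(1 for _ in grid(CANDIDATES))
```

The six candidate sets give 3 × 4 × 2 × 2 × 2 × 3 = 288 settings, which is small enough to enumerate for datasets of moderate size.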
Note that since all the hidden nodes in layers 2, 3 output small values only, with the settings we used for experiments, the regression layer’s ability to predict larger numbers (e.g. >1000) is limited. To predict larger numbers, one can consider either rescaling the data or setting smaller to adjust to the numerical range of the specific dataset.
4.6.3 Network Depth and Number of Parameters
The proposed model has a moderate number of layers (four if we count the convolution/pooling as one), and hence a moderate number of parameters to estimate. For example, with , (one and one ), and set to 20 and 5, , we have:
(17)  
It is possible to add more layers to construct a deeper architecture based on temporal embedding and convolution. However, the data itself must be complex enough to provide more potential for the model to exploit. Given the granularity of daily human behaviors, for the task of predicting modalities such as traveling distance/time and power consumption, a deeper architecture has only limited effect.
5 Conclusion
Motivated by the observation that regularities in periodical timeseries sometimes manifest at different moments and at varied paces, in this paper we propose a technique called temporal embedding and devise a convolutional neural network-based learning model called TeNet, which is robust to temporal distortions and misalignments, to learn abstract features. We first present TeNet and discuss the intuition behind it using a case study, then describe the technical details of the whole network architecture and solve the backpropagation problem for the proposed model. In the experiments we use an extensive range of real-life periodical data covering three modalities to compare the performance of the proposed model against competitive methods. We find that on average TeNet achieves an 8% to 33% advantage over the other methods under different metrics, and that the advantage scales up with a growing sample size used in training. We also find that the accuracy of TeNet increases almost monotonically with a growing feature dimensionality, indicating that the model is effective in utilizing more information while remaining robust to noise. We also create a set of synthetic data from the real-life data to demonstrate the effect of temporal embedding and successfully show its capability of realigning distorted and misaligned data. At the end of the experiments we offer an in-depth discussion of hyperparameter selection, data preprocessing, network depth and number of parameters, and present a visualization of the learned snippets. Beyond the periodical accumulation prediction problem, we expect TeNet to be useful for general timeseries prediction tasks ranging from forecasts to k-ahead prediction.
References
 [1] O. Abdel-Hamid, A. Mohamed, H. Jiang, L. Deng, G. Penn, and D. Yu. Convolutional neural networks for speech recognition. IEEE/ACM Transactions on Audio, Speech & Language Processing, 22(10):1533–1545, 2014.
 [2] Y. Bengio. Learning deep architectures for AI. Foundations and Trends in Machine Learning, 2(1):1–127, 2009.
 [3] Y. Bengio, P. Lamblin, D. Popovici, and H. Larochelle. Greedy layer-wise training of deep networks. NIPS, 19:153, 2007.
 [4] Y. Chon, E. Talipov, H. Shin, and H. Cha. CRAWDAD data set yonsei/lifemap (v. 20120103). Downloaded from http://crawdad.org/yonsei/lifemap/, Jan. 2012.
 [5] D. C. Ciresan, U. Meier, J. Masci, L. M. Gambardella, and J. Schmidhuber. Flexible, high performance convolutional neural networks for image classification. In IJCAI, pages 1237–1242, 2011.
 [6] R. Collobert and J. Weston. A unified architecture for natural language processing: Deep neural networks with multitask learning. In ICML, pages 160–167. ACM, 2008.
 [7] R. Collobert, J. Weston, L. Bottou, M. Karlen, K. Kavukcuoglu, and P. P. Kuksa. Natural language processing (almost) from scratch. Journal of Machine Learning Research, 12:2493–2537, 2011.
 [8] L. Deng, J. Li, J. Huang, K. Yao, D. Yu, F. Seide, M. L. Seltzer, G. Zweig, X. He, J. Williams, Y. Gong, and A. Acero. Recent advances in deep learning for speech research at microsoft. In ICASSP, pages 8604–8608, 2013.
 [9] A. Fischer and C. Igel. Training restricted boltzmann machines: An introduction. Pattern Recognition, 47(1):25–39, 2014.
 [10] B. J. Frey and D. Dueck. Clustering by passing messages between data points. Science, 315(5814):972–976, February 2007.
 [11] M. C. Gonzalez, C. A. Hidalgo, and A.L. Barabasi. Understanding individual human mobility patterns. Nature, 453(7196):779–782, 2008.
 [12] A. Y. Hannun, C. Case, J. Casper, B. C. Catanzaro, G. Diamos, E. Elsen, R. Prenger, S. Satheesh, S. Sengupta, A. Coates, and A. Y. Ng. Deep speech: Scaling up endtoend speech recognition. CoRR, abs/1412.5567, 2014.
 [13] G. E. Hinton. A practical guide to training restricted boltzmann machines. In G. Montavon, G. B. Orr, and K. Müller, editors, Neural Networks: Tricks of the Trade  Second Edition, volume 7700 of Lecture Notes in Computer Science, pages 599–619. Springer, 2012.
 [14] R. Jurdak, P. Sommer, B. Kusy, N. Kottege, C. Crossman, A. Mckeown, and D. Westcott. Camazotz: Multimodal activitybased gps sampling. In IPSN, 2013.
 [15] H. Larochelle, Y. Bengio, J. Louradour, and P. Lamblin. Exploring strategies for training deep neural networks. Journal of Machine Learning Research, 10:1–40, 2009.
 [16] Y. Lecun, L. Bottou, Y. Bengio, and P. Haffner. Gradientbased learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, Nov 1998.
 [17] H. Lee, C. Ekanadham, and A. Y. Ng. Sparse deep belief net model for visual area V2. In NIPS, pages 873–880, 2007.
 [18] H. Lee, R. Grosse, R. Ranganath, and A. Y. Ng. Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations. In ICML, pages 609–616, 2009.
 [19] H. Lee, P. Pham, Y. Largman, and A. Y. Ng. Unsupervised feature learning for audio classification using convolutional deep belief networks. In NIPS, pages 1096–1104, 2009.
 [20] M. Matsugu, K. Mori, Y. Mitari, and Y. Kaneda. Subject independent facial expression recognition with robust face detection using a convolutional neural network. Neural Networks, 16(5-6):555–559, 2003.
 [21] A. Y. Ng. Feature selection, L1 vs. L2 regularization, and rotational invariance. In ICML, page 78, 2004.
 [22] J. Ngiam, A. Khosla, M. Kim, J. Nam, H. Lee, and A. Y. Ng. Multimodal deep learning. In ICML, pages 689–696, 2011.
 [23] T. Rakthanmanon, B. J. L. Campana, A. Mueen, G. E. A. P. A. Batista, M. B. Westover, Q. Zhu, J. Zakaria, and E. J. Keogh. Searching and mining trillions of time series subsequences under dynamic time warping. In SIGKDD, pages 262–270, 2012.
 [24] D. Sahoo, S. C. H. Hoi, and B. Li. Online multiple kernel regression. In SIGKDD, pages 293–302, 2014.
 [25] M. Schmidt. Least squares optimization with l1norm regularization. CS542B Project Report, 2005.
 [26] C. M. Schneider, V. Belik, T. Couronné, Z. Smoreda, and M. C. González. Unravelling daily human mobility motifs. Journal of The Royal Society Interface, 10(84):20130246, 2013.
 [27] C. M. Schneider, C. Rudloff, D. Bauer, and M. C. González. Daily travel behavior: Lessons from a weeklong survey for the extraction of human mobility motifs related information. In ACM SIGKDD International Workshop on Urban Computing, page 3, 2013.
 [28] Y. Tsuruoka, J. Tsujii, and S. Ananiadou. Stochastic gradient descent training for l1-regularized log-linear models with cumulative penalty. In ACL 2009, pages 477–485, 2009.
 [29] P. Vincent, H. Larochelle, Y. Bengio, and P. Manzagol. Extracting and composing robust features with denoising autoencoders. In ICML, pages 1096–1103, 2008.
 [30] P. Vincent, H. Larochelle, I. Lajoie, Y. Bengio, and P. Manzagol. Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion. Journal of Machine Learning Research, 11:3371–3408, 2010.
 [31] D. Wang, C. Song, and A.L. Barabasi. Quantifying long-term scientific impact. Science, 342(6154):127–132, October 2013.
 [32] D. Wang, C. Song, H.W. Shen, and A.L. Barabasi. Response to comment on "quantifying long-term scientific impact". Science, 345(6193), July 2014.
 [33] P. A. Zandbergen. Accuracy of iphone locations: A comparison of assisted gps, wifi and cellular positioning. Transactions in GIS, 13(s1):5–25, 2009.
 [34] K. Zhao, R. Jurdak, J. Liu, D. Westcott, B. Kusy, H. Parry, P. Sommer, and A. McKeown. Optimal Lévy-flight foraging in a finite landscape. Journal of The Royal Society Interface, 12(104), 2015.