I Introduction and Related Work
A typical architecture of Internet VOD is depicted in Figure 1(a). Given knowledge of the relation between users and videos as well as the relation between users and CDNs, one key issue, illustrated in Figure 1(b), is how to dispatch videos to CDNs. Improper dispatches cause poor video service quality. Five main challenges arise with the growing popularity of VOD: (1) a large number of videos (currently 50 million) compared with limited storage capacity (800,000 videos per CDN on average) and dispatch bandwidth (30,000 videos per day on average); (2) ever-increasing uploads (60,000 videos per day on average); (3) notable differences between peak (1,667,000 qps) and off-peak (235,000 qps) requests and video preferences; (4) erratic variations in user requests; (5) implicit and intrinsic request relations among users due to different populations, economies and cultures. Bearing these challenges in mind, we aim at a scalable video dispatch system that works at off-peak time to enhance the peak service quality on the next day.
Our present baseline method triggers the dispatch of a video to a CDN when the requests for that video from the users served by that CDN exceed a threshold h for a period. This implies that one successful service corresponds to h CDN misses. Besides, dispatching at peak time occupies network bandwidth and resources, which affects the peak service quality. We apply two ratios as evaluation metrics: the first is the number of videos requested within a whole day after being dispatched over the total dispatch number, and the second is the number of videos requested at peak time that were dispatched from several hours ago until recently over the total dispatch number. While the first ratio can be optimized by fine-tuning the threshold and the period, the second stays low, meaning that the baseline method performs poorly for peak service quality. This is due to challenges (3) and (4): the baseline method fails to explore the relation between peak and off-peak requests, and exploits only a very short memory of requests without knowing the longer changing pattern.
I-A Time Series Forecasting
Recurrent neural networks (RNNs) memorize state transitions efficiently thanks to their recursive structures with hidden state storage. RNNs have been applied to predict noisy and nonstationary time series, where the series are first learned in an unsupervised manner before being sent to the RNN, because vanishing gradients make RNNs favor short-term dependencies over long-term ones. Long short-term memories (LSTMs) and gated recurrent units (GRUs) were proposed with learnable gates that control the information flow in order to alleviate this problem. Both LSTM and GRU have achieved state-of-the-art performance in sequential applications. A common point of the above sequential models is that every single point in the series is weighted equally, and therefore they are vulnerable to noise and prone to overfitting to specific series patterns.
Convolutional neural networks (CNNs), which are typically implemented with globally shared nonlinear filters followed by pooling operations, strengthen and extract local features. CNNs have formed the crux of computer vision applications. A CNN is usually followed by max pooling to extract the most representative features of each local region.
Time series analysis with CNNs (Borovykh et al. 2017) has also emerged in recent years. In this paper, we put a top-k pool ahead of the CNN to extract the peak-time pattern, place three max pools behind the CNN to extract the interday pattern, and concatenate the CNN with an RNN (Jozefowicz et al. 2016) to learn the dynamics with low complexity while avoiding the overfitting of a pure RNN. In WaveNet, causal CNNs without pools (reference ?) were proposed to generate predicted 1D signals, forcing the filters always to calculate on the previous points.
It is noticeable that in the above research, all input and output elements are weighted equally. However, the peak-time-oriented dispatch problem focuses more on the peak elements than on the off-peak ones. Besides, the predicted peak requests need not be precise at an exact time, since the peak time lasts for a period. This motivates us to design our algorithm to differentiate peak and off-peak request patterns.
I-B Ranking Problems
At first glance, video dispatch can be solved as a variation of the 0/1 knapsack problem. Metaheuristic techniques guarantee neither an optimal solution nor polynomial prediction time complexity. Neural network methods, such as Hopfield Net [Hopfield and Tank, 1985] and the recently proposed Pointer Net [Vinyals et al., 2015], though successfully verified in similar combinatorial optimization applications, suffer from either scalability or low-dimension restrictions.
Alternatively, the dispatch problem is to pick the top-k videos by the probabilities of future requests from the users served by certain CDN servers. This process resembles a ranking problem. Learning to rank (L2R) has produced much better performance than traditional methods. The L2R algorithms are categorized [Liu, 2009] as pointwise [Li et al., 2008], pairwise, listwise, and other techniques. Pointwise ranking focuses on total and absolute orders. This works for the dispatch problem because the strategies cannot be made until the total video requests have been seen, while pairwise ranking aims at partial orders and listwise ranking is too computationally complex for large-scale data sets.
I-C Clustering and Manifold Learning
In ranking or time series forecasting problems, the set of ranked items or series is predetermined. The VOD provider, however, faces new videos all the time, which is nontrivial in dispatch problems. Clustering methods categorize data with a distance-based metric without human labelling, so that new videos as well as present ones are represented by their nearest centroid. To cope with the “curse of dimensionality”, manifold learning projects features into a low-dimensional subspace, uncovering the intrinsic distribution of the input data in a visible and “compact” way. The autoencoder [Hinton and Salakhutdinov, 2006], a nonlinear dimensionality reduction method, was proposed to pretrain deep neural networks. It and its variations [Kingma and Welling, 2013, Makhzani et al., 2015] have been applied as preprocessing for clustering [Baldi, 2012, Xie et al., 2016].
Distributed clustering algorithms [Tasoulis and Vrahatis, 2004] were proposed for large data sets. Unfortunately, they do not maintain consistent clustering indices across training and predicting iterations. We propose a supervised-like clustering modification to solve this problem.
I-D End-to-end Learning
Recently it has become popular to train multitask neural networks with an end-to-end learning mechanism for better performance, rather than to optimize them independently. In the ideal end-to-end setting, the ultimate task's losses (in supervised learning) or rewards (in reinforcement learning) are differentiable with respect to all learnable parameters. One strategy is to carefully design the structure of these networks so that they are connected by shared components. For example, Faster R-CNN [Ren et al., 2015] combined region proposal with object recognition through a shared CNN part as the feature extractor, and led to globally optimized object detection results; value iteration networks [Tamar et al., 2016] introduced a fully differentiable CNN planning module to approximate the value iteration algorithm, simultaneously approximating the Markov decision process and optimizing the control.
I-E Our Contributions

An end-to-end learning mechanism which fulfills video clustering and dispatch tasks.

A novel structure for time series forecasting which utilizes both peak-time and whole-day temporal features.

A supervised-like clustering method fit for mini-batch gradient descent (MBGD) training and the distributed environment, and therefore scalable for big data.

State-of-the-art performance under the real-world VOD system.
I-F Notations
The symbols in this paper are defined as follows:
denotes the constant as the number of consecutive days when training or predicting a batch of data.
denotes the constant as the number of time intervals within a day.
denotes the constant as the number of peak time intervals within a day.
denotes the set of the users.
denotes the set of CDNs, and the set of the CDNs that serve user .
denotes the set of the videos, and the set of the videos dispatched to CDN
denotes the set of the requested videos within a day, the set of the requested videos from the user within a day, the set of the requested videos on day , and the set of the requested videos by the user u on day . denotes the set of the requested videos at peak time, and similarly with , , and .
denotes the mapping from the video to its cluster, and denotes the cluster of the video . denotes the probabilities of the clustered videos dispatched to the CDNs, and the probability of the cluster of the requested video dispatched to IDC . denotes the upper bound of the sum of clustered dispatch probabilities to the CDNs.
denotes the cardinality of the set .
denotes the cardinality of the range of the mapping .
denotes the Cartesian product of two sets and .
represents convolution operator.
represents extracting the largest elements out of according to the ordered set given by .
represents sorting in descending order.
represents to extract the max element of each submatrix of the matrix , and reshape the output to 1D.
, , and
are the nonlinear activation functions, referring to rectifier, sigmoid, and hyperbolic tangent functions.
denotes the L2-norm of .
II Problem Formulation
III The Structure of Dispatch Neural Networks
We set up two neural networks, a clustering network and a policy network, as depicted in Figure 2. The clustering network is composed of temporal layers and clustering layers. The policy network comprises accumulation layers, temporal layers and policy layers. The two networks are coupled through the temporal layers, which have identical structures and shared weights. The outputs of the clustering network and the primitive inputs constitute the inputs of the policy network. The outputs of the policy network are the probabilities of the videos being dispatched to users rather than to CDNs. The probabilities of the videos being dispatched to CDNs are calculated in a subsequent “post-processing” procedure.
III-A Primitive Inputs
The primitive inputs for one video are its request sequences. Each sequence represents the requests from a user on intervals of days. Therefore, the primitive inputs have a user dimension and a time dimension with intraday and interday parts. The inputs of video is denoted as follows, where denotes the requested number of from the user on interval on day .
We define for convenience.
The primitive inputs exclude information other than video requests, such as video metadata, weekends and holidays. There are three reasons: (1) the mapping from video content to user requests is not a bijection due to redundant uploads; (2) labelling the videos with tags may be subjective and ambiguous; (3) the video requests are highly nonstationary processes, and modeling with other observations is vulnerable to overfitting.
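As a rough illustration of the layout described in this subsection, the following sketch builds the request tensor of a single video from a list of request log entries. The log format, the function names and the index order (user, day, interval) are assumptions made for illustration only; the sizes (177 users, 15 days, and 288 five-minute intervals per day) follow the data set description in Section IV.

```python
from collections import Counter

N_USERS, N_DAYS, N_INTERVALS = 177, 15, 288  # 288 = 24 hours x 12 five-minute slots


def build_inputs(log, n_users=N_USERS, n_days=N_DAYS, n_intervals=N_INTERVALS):
    """Return x[user][day][interval] = request count for one video.

    `log` is an iterable of (user, day, interval) tuples, one per request
    (a hypothetical log format, not the one used in the paper).
    """
    counts = Counter(log)
    return [[[counts[(u, d, t)] for t in range(n_intervals)]
             for d in range(n_days)]
            for u in range(n_users)]


def feature_dimension(x):
    """Total number of scalar features for one video."""
    return sum(len(day) for user in x for day in user)
```

With the default sizes, `feature_dimension` returns 177 × 15 × 288 = 764,640 entries per video.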
III-B Temporal Layers
The temporal layers process the time dimension of the inputs. There are two components, a CNN module for the intraday part and an RNN module for the interday part, as depicted in Figure 3. The temporal layers are designed to predict the future requests with daily peak-time features and whole-day trajectories.
The CNN module has two parallel groups of convolutional filters with pools, which we call peak convolutional part (PCP) and mean convolutional part (MCP) respectively. PCP and MCP take as a batch.
In PCP, the inputs are processed with a global top-K pool sorted by the total requests at each time. Then 1D convolutions are calculated.
and are learnable parameters.
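The PCP step can be sketched roughly as follows. This is a minimal illustration rather than the paper's implementation: it assumes that the top-K pool ranks the time intervals by their total requests over the batch, keeps the same K intervals for every sequence in rank order, and that the convolution is a single valid 1D filter followed by a rectifier. The function names and the kernel values are hypothetical.

```python
def top_k_pool(batch, k):
    """Keep the k time intervals with the largest total requests.

    `batch` is a list of equal-length sequences. Intervals are ranked by their
    column sums in descending order, and every sequence is reduced to the same
    k intervals, in rank order (an assumption).
    """
    n = len(batch[0])
    totals = [sum(seq[t] for seq in batch) for t in range(n)]
    ranked = sorted(range(n), key=lambda t: totals[t], reverse=True)[:k]
    return [[seq[t] for t in ranked] for seq in batch]


def conv1d_relu(seq, kernel, bias=0.0):
    """Valid 1D convolution (cross-correlation form) followed by a rectifier."""
    m = len(kernel)
    return [max(0.0, bias + sum(kernel[j] * seq[i + j] for j in range(m)))
            for i in range(len(seq) - m + 1)]
```

For example, `top_k_pool([[1, 5, 0, 2], [0, 4, 1, 3]], 2)` keeps the second and fourth intervals of both sequences, because their column totals (9 and 5) are the largest.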
In MCP, the inputs are passed via 1D convolution filters.
and are learnable parameters.
Then is reshaped to three matrices , , : , , and , where , and The rows are closely related to hours, and the columns to minutes.
These matrices are processed with 2D max pools and combined in a weighted sum:
, and are learnable parameters.
The outputs of the CNN module, , are the concatenation of and .
The RNN module adopts GRU, and takes as the th input for user . The th output is computed as :
, , , , , , , , and are learnable parameters.
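For reference, a standard GRU update with nine parameter groups (an input weight, a recurrent weight and a bias for each of the update gate, the reset gate and the candidate state) can be sketched as below. The parameter names, the gate convention (new state = (1 − z) ⊙ h + z ⊙ candidate) and the plain-list representation are assumptions; the paper's exact formulation may differ.

```python
import math


def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gru_step(x, h, p):
    """One GRU update. `p` maps the nine parameter names to weights/biases."""
    def affine(w, u, b, state):
        wx, uh = matvec(p[w], x), matvec(p[u], state)
        return [a + c + d for a, c, d in zip(wx, uh, p[b])]

    z = [sigmoid(v) for v in affine("Wz", "Uz", "bz", h)]  # update gate
    r = [sigmoid(v) for v in affine("Wr", "Ur", "br", h)]  # reset gate
    gated = [ri * hi for ri, hi in zip(r, h)]
    cand = [math.tanh(v) for v in affine("Wh", "Uh", "bh", gated)]
    return [(1.0 - zi) * hi + zi * ci for zi, hi, ci in zip(z, h, cand)]


def gru_run(inputs, h0, p):
    """Feed one input per day and return the hidden state after each day."""
    outputs, h = [], h0
    for x in inputs:
        h = gru_step(x, h, p)
        outputs.append(h)
    return outputs
```

In this sketch, each element of `inputs` would be the CNN module output for one day, so the hidden state carries the interday dynamics.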
The outputs of the temporal layers from user are . We denote the outputs for video from all users as .
III-C Clustering Layers
The clustering layers process the user dimension of the inputs. There are four components, normalization module, autoencoder module, clustering module and loss module.
The normalization module scales the inputs according to the L2-norm, as .
The autoencoder module adopts a feedforward structure, which encodes to on the 2D plane , and then decodes back to .
In the clustering module, we divide into hierarchical blocks . We construct as Figure 4 and follows:

Let , and the number of divisions per rightside hierarchy. Let .

Divide uniformly into open subintervals . Let . Insert into .

Let . Repeat 2 until .

Let

Let .
Comparison with K-means results in Figure 4 shows that our supervised-like clustering method maintains the consistency of cluster indices in MBGD training and in the distributed environment, so that the cluster of a video can be determined without knowledge of the other videos.
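The consistency property can be illustrated with a simplified sketch in which the block edges are fixed in advance, so that a point's block index depends on that point alone. The construction below is an assumption and not the paper's exact procedure: on each axis of the unit square, the right half of the remaining interval is cut into equal parts and the left half is refined again at the next level. The function names and the unit-square range are hypothetical.

```python
import bisect


def axis_boundaries(levels, divisions):
    """Bin edges on [0, 1): at each level the right half of the remaining
    interval is cut into `divisions` equal parts, and the left half is
    refined again at the next level."""
    edges, hi = [], 1.0
    for _ in range(levels):
        lo = hi / 2.0
        step = (hi - lo) / divisions
        edges.extend(lo + i * step for i in range(divisions))
        hi = lo
    return sorted(edges)


def block_index(point, edges):
    """Map a 2D encoder output to a block id using the fixed edges only."""
    n_bins = len(edges) + 1
    ix = bisect.bisect_right(edges, point[0])
    iy = bisect.bisect_right(edges, point[1])
    return ix * n_bins + iy
```

Because `block_index` never looks at other points, the same video receives the same index in every mini-batch and on every machine, whereas K-means indices depend on the current centroids.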
The loss module adds the loss from the encoder-decoder (autoencoder) module to the loss from the clustering module, weighted by a hyperparameter; the loss refers to the coordinate of the center of the block b.
We apply unsupervised learning to explore the user dimension because there are few labeled tags and little accurate prior knowledge. Besides, we separate the temporal layers from the autoencoder module for two reasons: (1) a fully connected structure is not an optimal choice for time series analysis; (2) decoders with RNN and CNN suffer from the lack of a finite-sized dictionary.
III-D Accumulation Layers
The accumulation layers group the primitive inputs by their clusters.
Then follows normalization, and gets After the temporal layers, the outputs are .
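A minimal sketch of the grouping and normalization steps is given below. It assumes that "grouping" means summing the request sequences of all videos in the same cluster element-wise, and that the normalization is the L2 scaling used in the clustering layers; both are assumptions, and the function names are hypothetical.

```python
import math


def accumulate_by_cluster(sequences, cluster_ids):
    """Sum the request sequences of all videos that share a cluster id."""
    grouped = {}
    for seq, cid in zip(sequences, cluster_ids):
        acc = grouped.setdefault(cid, [0.0] * len(seq))
        for t, value in enumerate(seq):
            acc[t] += value
    return grouped


def l2_normalize(seq):
    """Scale a sequence to unit L2-norm; all-zero sequences are left unchanged."""
    norm = math.sqrt(sum(v * v for v in seq))
    return list(seq) if norm == 0.0 else [v / norm for v in seq]
```

The normalized per-cluster sequences would then be passed to the shared temporal layers in place of the per-video sequences.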
III-E Policy Layers
The policy layers estimate the probabilities for each video cluster to be dispatched to all users on day D+1. There are two components, a full connection module and a loss module. The full connection module transforms to on with learnable parameters. The loss module compares with the real request data on day D+1.
IV Implementation and Results
IV-A Data Set
We prepare the data set from real-world video requests of each user, accumulated every five minutes. Historical requests are used for training, and the 15 consecutive days of requests up to the present are used for predicting. There are 177 users, so the feature space of one video has dimension 12 five-minute intervals per hour × 24 hours × 15 days × 177 users = 764,640.
We analyze the video requests from two aspects. Firstly, we consider each video request sequence from one user as an independent stochastic process and observe its nonstationarity. We sample 5,000 sequences and calculate the difference of the means within a one-day sliding window. After taking the first-order difference of the sequences, the difference of means decreases, as in Table 1, which means that the nonstationarity is reduced. Next we run the Kwiatkowski-Phillips-Schmidt-Shin (KPSS) test on the sampled sequences with a one-day lag (288). The results also show the nonstationarity.
Table 1
Difference of means
  Original                       0.00616
  After first-order differences  0.00158
KPSS test
  KPSS statistic                 5.38
  P-value                        0.000003
  Critical value (1%)            3.43
  Critical value (5%)            2.86
  Critical value (10%)           2.57
Secondly, we analyze the peak request frequencies. The total request number is 82,700,000 for two hours, with 8,930,000 videos, so the average request frequency per video is about 9. We sort the videos in descending order of request frequency, and show the request distribution after excluding current videos. Figure 5 shows a typical long-tail shape.
IV-B Environment
We run on multiple machines, each with a 48-core CPU (Intel Xeon 2.20GHz) and 128 GB of memory, based on TensorFlow 1.3 and coordinated with Apache ZooKeeper 3.4.
IV-C Trainers and Predictors
Training the clustering network requires a clustering trainer, and training the policy network requires a clustering predictor and a policy trainer. Predicting involves clustering and policy predictors.
The trainers comprise all of the modules in the respective networks. The clustering predictor is composed of all layers but the decoder and loss modules in the clustering network, and the policy predictor is composed of all layers but the loss module in the policy network.
IV-D Asynchronous Training Mechanism
Three models are generated through training the temporal, clustering, and policy layers. The clustering model is updated in the clustering network. The temporal and policy models are updated in the policy network.
We train the two networks asynchronously as in Figure 6. We start the policy trainer first with randomized clustering indices. Then we start the clustering predictor and trainer with the identical temporal model trained for several iterations in the policy trainer. The clustering trainer fetches the temporal model from the policy trainer every 200 iterations. The clustering predictor fetches the clustering model from the clustering trainer at each iteration.
The video dispatch scenario differs from regular L2R in that the items are assumed to be independent in the latter case, while popular videos often enjoy continual requests. Therefore, the data set for the policy network needs shuffling to break the chronological order, so as to remove autocorrelation between neighboring samples. This procedure is inspired by experience replay [Lin, 1992, Mnih et al., 2013] in reinforcement learning.
All of the learnable parameters are initialized with Xavier initialization [Glorot and Bengio, 2010]. The MBGD optimizer is used, and the learning rates are tuned layer-wise with decay settings.
IV-E Prediction Mechanism with Post-processing

Run the clustering predictor for .

Run the policy predictor for .

Calculate .

For , sort in the descending order of . The videos in the same cluster are sorted in the descending order of uploaded time. Dispatch the first 10,000 videos to .
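The final sorting and selection step can be sketched as follows for a single CDN. The per-cluster dispatch probability for that CDN is taken as given; the tuple layout, the function name, numeric upload times and the use of the cluster id as a tie-breaker between clusters with equal probability are assumptions made for illustration.

```python
def select_dispatch(videos, cluster_score, limit):
    """Pick the videos to dispatch to one CDN.

    videos: list of (video_id, cluster_id, upload_time) tuples.
    cluster_score: dict cluster_id -> dispatch probability for this CDN.
    Videos are ordered by cluster score (descending); videos of the same
    cluster are ordered by upload time (newest first). The first `limit`
    video ids are returned.
    """
    ranked = sorted(
        videos,
        key=lambda v: (-cluster_score[v[1]], v[1], -v[2]),
    )
    return [video_id for video_id, _, _ in ranked[:limit]]
```

In the setting described above, `limit` would be 10,000 for each CDN.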
IV-F Training Performances
Firstly, Figure 7 shows how the normalization module helps the clustering trainer to converge, where the normalization module is removed for A and B. Loss curve A represents training with a data set of 20 batches and unchanged temporal models. The loss jumps beyond the initial value immediately after 3 iterations, and then descends slowly. Loss curve B represents training with a data set of only one batch and a temporal model changed periodically (every 500 iterations). The loss shoots up every time the temporal model is updated to newly randomized values. Loss curve C represents training with a data set of 20 batches and a changing temporal model as in B. It descends by 30% in the first 500 iterations, and continues to descend until iteration 10,000.
Secondly, Figure 8 shows how a shuffled data set helps the policy trainer to converge. The data set in A is ordered by ascending date and divided into two groups in the middle. The policy trainer fetches the two groups of data alternately, training on each group for 200 iterations. The data set in B is finely shuffled. Loss A decreases more rapidly than loss B in the beginning, but shoots up after moving to the other half of the data, making it difficult to converge.
Thirdly, Figure 9 shows the training results based on the asynchronous training mechanism described above. The two networks converge without interfering with each other.
IV-G Assessment for Video Clustering
Firstly, Table 2 compares the inner products between 10,000 video requests in the same clusters (excluding the inner products of the video requests with themselves) with those in different clusters. The mean for the same clusters is much larger than that for different clusters, and the coefficient of variation (CV) is smaller. This implies that video requests from the same cluster are much more similar than those from different clusters.
Table 2
        Same    Different
Mean    14.0    0.117
CV      1.41    4.59
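The statistics in Table 2 can be computed along the following lines. This is a sketch under assumptions: each video is represented by one request vector, all unordered pairs of distinct videos are compared, and the CV is the population standard deviation divided by the mean (which requires a nonzero mean). The function names are hypothetical.

```python
import statistics
from itertools import combinations


def inner(a, b):
    """Inner product of two equal-length request vectors."""
    return sum(x * y for x, y in zip(a, b))


def similarity_stats(sequences, cluster_ids):
    """Mean and coefficient of variation of pairwise inner products,
    split into same-cluster and different-cluster pairs."""
    same, diff = [], []
    for i, j in combinations(range(len(sequences)), 2):
        bucket = same if cluster_ids[i] == cluster_ids[j] else diff
        bucket.append(inner(sequences[i], sequences[j]))

    def summarize(values):
        mean = statistics.fmean(values)
        return mean, statistics.pstdev(values) / mean  # assumes mean != 0

    return summarize(same), summarize(diff)
```

A larger mean with a smaller CV for the same-cluster pairs indicates that requests within a cluster are both more similar and more uniformly similar.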
Secondly, Figure 10 compares the video request densities in different clusters, including non-sparsities (NS), L1-norms and L2-norms (whose means are multiplied by 20). NS is defined as the ratio of the number of nonzeros to the length. The means of the three are positively related, and the CVs are close. This implies that videos with similar request densities are more likely to be in the same cluster.
Thirdly, we count the number of videos (NV) in each cluster, and calculate the average distances (AD) from the encoder outputs to the cluster centroids. Table 3 applies the Pearson correlation coefficient to measure the correlation of NV with AD and with the area of the cluster. This implies that a larger cluster area for larger values of the encoder outputs is space-efficient with sufficient accuracy.
Table 3 (Pearson correlation coefficients)
NV and area    0.6194
NV and AD      0.4956
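For completeness, the Pearson correlation coefficient used in Table 3 can be computed as in the short sketch below, where `xs` would hold the NV value of each cluster and `ys` the matching area or AD value. The function name is hypothetical, and both samples are assumed to be non-constant.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length samples."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)
```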
IV-H Prediction and Controlled Experiment Results
We dispatch 10,000 videos out of 50 million to each of 50 CDNs at daily off-peak time for the next peak-time requests. Our video dispatch prediction takes 4 hours and 100 machines to compute. We denote our proposed algorithm as A.
We previously dispatched videos based on a threshold on their request frequencies during a period of time, denoted as B. Since B dispatches videos at peak time, the extra costs are not negligible. The prediction accuracy is represented as the ratio of the peak-time request number (revenues) to the peak-time dispatch number (costs).
For comparison, six other methods are listed. Firstly, we compute dispatch policies via classification without video clustering. This restricts us to building one request model per user in order to lower the input dimensions. The outputs are either “to dispatch” or “not to dispatch”, while the ground truth is the average requests over the peak time on the next day. Method C applies logistic regression and Method D applies a gradient boosting decision tree.
Secondly, we remove the end-to-end learning mechanism, and let the clustering and policy tasks run independently. This restricts us from using temporal layers in the clustering network. Method E substitutes the clustering network with principal component analysis to reduce the input dimension to 4, followed by K-means clustering. Method F removes the temporal layers from the clustering network.
Thirdly, method G substitutes our supervised-like blocks with K-means clustering.
Fourthly, method H removes the CNN module from the temporal layers.
The input sequences in C, D, E, F and H are sampled per hour. The prediction data set in E and G is limited to the latest 10,000 videos, and the dispatch number is 100. The prediction results are shown in Figure 11, which summarizes A and B for 90 days, and C to H for 30 days. The results include the means and standard deviations of the prediction accuracies, but exclude recalls because the CDN storage capacities are much smaller than the total size of the requested videos.
The overall performance of the end-to-end learning methods, namely A, G, H, D and C, is better than that of the others. Although K-means in G and E provides more accurate clustering results, it suffers from scaling issues and limits the dispatch candidates. The CNN module in method A filters the sequences while retaining the important information, which gives it an advantage over the fully connected structure in F, and also in D and C. Besides, the CNN module with learned parameters performs better than the trivial downsampling method in H. The RNN module in A, G and H provides better sequence forecasting accuracy, mainly because an RNN is able to explore complex nonlinear state transitions. Lastly, our previous method B suffers from the dilemma between request frequencies and request amounts shown in Figure 5. Although a higher threshold in B implies higher dispatch accuracy, the number of dispatchable videos decreases.
V Conclusion
In this paper, we propose a video dispatch algorithm for VOD to enhance peak-time service quality. We cluster the videos with one network by their request patterns before dispatching them from the large candidate set. We develop dispatch policies for the clustered videos with the other network, inspired by pointwise L2R algorithms. The two networks are coupled through shared temporal-feature-extracting layers, which comprise CNN and RNN modules. After training them asynchronously, the average prediction accuracy on real-world video requests is 5 times as high as that of our previous methods.
Future work will be devoted to risk controls to guarantee robustness regardless of rapidly changing situations.
References
[Baldi, 2012] Baldi, P. (2012). Autoencoders, unsupervised learning, and deep architectures. In Proceedings of ICML Workshop on Unsupervised and Transfer Learning, pages 37–49.
[Glorot and Bengio, 2010] Glorot, X. and Bengio, Y. (2010). Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pages 249–256.
[Hinton and Salakhutdinov, 2006] Hinton, G. E. and Salakhutdinov, R. R. (2006). Reducing the dimensionality of data with neural networks. Science, 313(5786):504–507.
[Hopfield and Tank, 1985] Hopfield, J. J. and Tank, D. W. (1985). “Neural” computation of decisions in optimization problems. Biological Cybernetics, 52(3):141–152.
[Kingma and Welling, 2013] Kingma, D. P. and Welling, M. (2013). Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114.
[Li et al., 2008] Li, P., Wu, Q., and Burges, C. J. (2008). McRank: Learning to rank using multiple classification and gradient boosting. In Advances in Neural Information Processing Systems, pages 897–904.
[Lin, 1992] Lin, L.-J. (1992). Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine Learning, 8(3-4):293–321.
[Liu, 2009] Liu, T.-Y. (2009). Learning to rank for information retrieval. Foundations and Trends® in Information Retrieval, 3(3):225–331.
[Makhzani et al., 2015] Makhzani, A., Shlens, J., Jaitly, N., and Goodfellow, I. J. (2015). Adversarial autoencoders. CoRR, abs/1511.05644.
[Mnih et al., 2013] Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D., and Riedmiller, M. (2013). Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602.
[Ren et al., 2015] Ren, S., He, K., Girshick, R., and Sun, J. (2015). Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems, pages 91–99.
[Tamar et al., 2016] Tamar, A., Wu, Y., Thomas, G., Levine, S., and Abbeel, P. (2016). Value iteration networks. In Advances in Neural Information Processing Systems, pages 2154–2162.
[Tasoulis and Vrahatis, 2004] Tasoulis, D. K. and Vrahatis, M. N. (2004). Unsupervised distributed clustering. In Parallel and Distributed Computing and Networks, pages 347–351.
[Vinyals et al., 2015] Vinyals, O., Fortunato, M., and Jaitly, N. (2015). Pointer networks. In Cortes, C., Lawrence, N. D., Lee, D. D., Sugiyama, M., and Garnett, R., editors, Advances in Neural Information Processing Systems 28, pages 2692–2700. Curran Associates, Inc.
[Xie et al., 2016] Xie, J., Girshick, R., and Farhadi, A. (2016). Unsupervised deep embedding for clustering analysis. In International Conference on Machine Learning, pages 478–487.