LSTM-RPA: A Simple but Effective Long Sequence Prediction Algorithm for Music Popularity Prediction

by Kun Li, et al.
NetEase, Inc

The big data of music history contains information about time and users' behavior. By analyzing these data, researchers can predict the trend of popular songs accurately. Traditional trend prediction models predict short trends better than long trends. In this paper, we propose the improved LSTM Rolling Prediction Algorithm (LSTM-RPA), which combines LSTM historical input with current prediction results as the model input for the next time prediction. Meanwhile, this algorithm converts the long trend prediction task into multiple short trend prediction tasks. The evaluation results show that the LSTM-RPA model increases the F score by 13.03%, 16.74%, 11.91%, and 18.52% compared with LSTM, BiLSTM, GRU, and RNN. Our method also outperforms the traditional sequence models ARIMA and SMA by 10.67% and 3.43%.




1 Introduction

With the development of network technology, music, as one form of streaming media, has also developed rapidly on platforms such as QQ Music, Spotify, Ali Music, and NetEase Cloud Music. People can listen to, share, download, and collect music, and exchange their ideas about music online. At the same time, a large amount of user behavior data is generated. The data from different users have implicit relationships, even though they seem irrelevant, and they have attracted the attention of many researchers and Internet companies. With the help of music popularity prediction, enterprises can formulate accurate promotion strategies. Researchers recommend songs to users accurately by analyzing active users' music play sequences [1], musical complexity (chroma, rhythm, timbre, and arousal) [2], and Twitter users' music-listening behaviors [3]. Meanwhile, some studies analyzed music semantic constructs (genre, mood, instrumental, theme) [4], music frequency spectrogram images [5], and raw music audio signals [5] to predict the popularity of songs. Thus, the trend of music popularity can be predicted precisely by analyzing different aspects of music.

In this paper, we investigate the relationship between the historical information of audiences' behavior and the trend of music popularity, and propose an algorithm to improve the accuracy of long trend prediction. This historical information comes from the "Popular Music Prediction" competition dataset released by Tsinghua University and the Alicloud Tianchi Big Data Platform. The goal of the contest is to predict music trends for the next month by analyzing these historical data. The competition attracted 5,475 teams from all over the world, so the topic is of great research significance. We propose the LSTM Rolling Prediction Algorithm (LSTM-RPA) model with different features, designed by analyzing the time series of the data. This research can play an important role in business decision making. We believe that using the historical behavioral information of audiences can improve the performance of trend prediction.

2 Related Work

From movie popularity prediction [6] and stock market trend prediction [7] to short-term traffic prediction [8], trend prediction and time series modeling have a long research history. Research [9] used the Auto-Regressive Integrated Moving Average (ARIMA) model to predict the monthly inflow of the Dez dam reservoir, and their experimental results show that the Mean Square Error (MSE) of ARIMA is smaller than that of the Auto-Regressive Moving Average (ARMA). In [10], the researchers found that the ARIMA model cannot predict non-linear time series accurately, so they proposed a hybrid model based on an Artificial Neural Network (ANN) and Genetic Programming (GP), which predicted non-linear time series more accurately than ARIMA.

Research [11] built an artificial "music market" and found that a song's success was related not only to the quality of the song itself but also to social preferences, which increases the difficulty of prediction. In [5], the researchers used frequency spectrogram images of raw music audio signals as the input features of a CNN model to predict music popularity; the accuracy of this model was 61% on the test set. Research [12] established HitMusicNet, a multimodal end-to-end deep learning (DL) architecture. HitMusicNet uses an autoencoder to compress high-dimensional features consisting of audio, lyrics text, and metadata to improve prediction accuracy. Research [13] divided the time features of music play-count into a basic trend and an incremental trend, then proposed Time Series based Music Prediction (TSMP), which is based on category optimal value selection. When a song's play-count grows exponentially, the prediction accuracy of the TSMP model drops. Therefore, they proposed the Extended Time Series Music Prediction (E-TSMP) algorithm, which combines the Sub-Sequence Pattern Matching Method (SSPMM) and Additional Processing (AD) to improve prediction accuracy.

In DL, the Recurrent Neural Network (RNN) can learn key information from sequences very well [14]. The Long Short-Term Memory (LSTM) network, an improvement on the RNN, effectively alleviates the RNN's gradient explosion and vanishing-gradient problems on long sequences and can transmit key information to later cells [15]. Thus, we use an LSTM to predict songs' play-count, and propose a Rolling Prediction Algorithm (RPA) to address the low accuracy of long-trend prediction. In this paper, we build two kinds of models: traditional sequence models (ARIMA and Simple Moving Average (SMA) [16]) and DL sequence models (LSTM, Bidirectional Long Short-Term Memory (BiLSTM) [17], Gated Recurrent Unit (GRU) [18], and RNN).

3 Methodology

3.1 LSTM model

Assume a time series x = (x_1, x_2, \ldots, x_T), where each point in time contains the feature information. Equations (1)-(6) give the LSTM formulas [19]:

f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f)    (1)

i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i)    (2)

\tilde{c}_t = \tanh(W_c \cdot [h_{t-1}, x_t] + b_c)    (3)

c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t    (4)

o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o)    (5)

h_t = o_t \odot \tanh(c_t)    (6)

In the LSTM block, \sigma is the sigmoid function; f_t, i_t, and o_t are the forget, input, and output gates; \tilde{c}_t is the new candidate memory of the current layer. At time t, c_t is the memory cell state and h_t is the hidden state.
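As a concrete sketch, the six equations above map directly onto a few lines of NumPy. The hidden size, weight initialisation, and toy input series below are illustrative assumptions, not values from the paper.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM cell step following Eqs. (1)-(6).
    Each W[k] maps the concatenated [h_prev, x_t] to one gate."""
    z = np.concatenate([h_prev, x_t])
    f_t = sigmoid(W["f"] @ z + b["f"])       # forget gate, Eq. (1)
    i_t = sigmoid(W["i"] @ z + b["i"])       # input gate, Eq. (2)
    c_hat = np.tanh(W["c"] @ z + b["c"])     # candidate memory, Eq. (3)
    c_t = f_t * c_prev + i_t * c_hat         # memory cell update, Eq. (4)
    o_t = sigmoid(W["o"] @ z + b["o"])       # output gate, Eq. (5)
    h_t = o_t * np.tanh(c_t)                 # hidden state, Eq. (6)
    return h_t, c_t

# Tiny example: 1 input feature, hidden size 2, small random weights.
rng = np.random.default_rng(0)
W = {k: rng.normal(size=(2, 3)) * 0.1 for k in "fico"}
b = {k: np.zeros(2) for k in "fico"}
h, c = np.zeros(2), np.zeros(2)
for x in [0.5, 0.8, 0.3]:                    # a short play-count-like series
    h, c = lstm_step(np.array([x]), h, c, W, b)
```

Because h_t is an output gate in (0, 1) times a tanh in (-1, 1), every component of the hidden state stays strictly inside (-1, 1).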

3.2 Single Feature LSTM Model

In the music popularity prediction task, one of the most important features is a song's historical play-count. For this single feature, we built the Single Feature LSTM Model (SF-LSTM), shown in Fig. 1. The model takes time series data with different time steps as input; these data pass through two LSTM layers of 64 and 32 neurons respectively. We use ReLU as the activation function to reduce information loss. The SF-LSTM then uses a fully connected layer to reduce the dimension and outputs the prediction results. Here p is the input time step, q is the output time step, and h_0 is the initial hidden state.

Figure 1: Single Feature LSTM Model.

3.3 Multiple Features LSTM Model

Besides a song's historical play-count, there are many other important features, such as weekly play-count, monthly play-count, download-count, and collect-count. We build the Multiple Features LSTM Model (MF-LSTM), shown in Fig. 2, which consists of three LSTM layers. The first two layers are LSTM layers of 64 and 32 neurons respectively, with ReLU as the activation function. The last layer is an LSTM layer of q neurons, matching the output step, with sigmoid as the activation function. A particularity of this model is that when the input step and the output step are equal, the input and output dimensions are the same.

Figure 2: Multiple Features LSTM Model.

3.4 Rolling Prediction Algorithm (RPA)

Most trend prediction algorithms predict only once [20]. That is, to predict the next five days, one simply changes the output layer to five neurons. With mini-batch data sets or a lack of features, such a model cannot achieve high accuracy in the long-trend prediction task. So, we propose the Rolling Prediction Algorithm (RPA), based on the moving-average and shortcut ideas [21], to solve this problem. The algorithm optimizes long trend prediction for sequence models with mini-batch data sets or lacking features, and turns the long trend prediction task into multiple short trend prediction tasks.

The RPA transforms one long prediction over q time steps into multiple short prediction tasks over inputs of length p. When the RPA constructs the model input for predicting time t+1, it brings in the information of times t-1 and t together, so that the model can consider the information of x_t and \hat{y}_t when predicting time t+1. Equations (7) and (8) give the algorithm:

\hat{X}_{t+1} = (x_{t-p+1}, \ldots, x_{t-r}) \cup (\hat{y}_{t-r+1}, \ldots, \hat{y}_t)    (7)

\hat{Y}_{t+1} = G(\hat{X}_{t+1})    (8)

Let the model input sequence be X with length p, the output sequence be Y with length q, and let r be the rolling step, so r \le q. In (7), \hat{X}_{t+1} is the input sequence for the model prediction at time t+1: we select the most recent p-r input values in reverse order, x_t-r, \ldots, x_{t-p+1}, together with the r most recent prediction values in order, \hat{y}_{t-r+1}, \ldots, \hat{y}_t. Thus, at prediction time, the model input sequence consists of historical values and previous predictions. In (8), G is the trained sequence neural network, and the combined sequence \hat{X}_{t+1} yields the prediction result \hat{Y}_{t+1} through G. We build the LSTM-RPA with the network structure shown in Fig. 3.

Figure 3: LSTM-RPA model. We use the SF-LSTM and MF-LSTM as G to make up the LSTM-RPA model. x_t is the input for time t, and \hat{y}_t is the prediction result for time t. At time t+1, the model prediction input consists of the x and \hat{y} subsequences, and the predicted result is obtained from the trained LSTM model.

4 Experiments

4.1 Data preparation and analysis

The dataset covers 183 days, 50 artists, 349,946 users, and 10,842 songs, collected from March 1 to August 30, 2015. Sensitive information (Song_id, Artist_id, and User_id) has been replaced by fixed-length strings, and each song belongs to exactly one artist. The data format is shown in Tables 1 and 2.

Name Type Description
Song_id String Track name
Artist_id String Artists’ name
Ds String Date of data recording
Gmt_create String Time of user’s playing
Action_type String Type of song action: 1. play; 2. download; 3. collect
Table 1: User action
Name Type Description
Song_id String Track name
Artist_id String Artists’ name
Publish_time String Publish time
Gender String Gender: 1. Male; 2. Female; 3. Combination
Language String Language: 1. Chinese; 2. French; 3. English
Table 2: Relations between artists and songs

Table 1 shows users' behavior, such as the track name, the artist, the recording date, and the action type. Table 2 shows the relationship between artists and songs, such as which artist a song belongs to, the song's publish time, and its language. After preprocessing the dataset, we found that one artist's data was seriously incomplete, so we selected the remaining 49 artists' data as the experimental data.

4.2 Feature analysis and data slicing

Considering the influence of different features on model accuracy, we select the daily song play-count as the input feature in the single feature experiment, and the daily artist-level play-count, download-count, and collect-count as the input features in the multiple features experiment.

At the same time, we chose the March-July data as the train and dev sets, divided 122:31 by days, and the August data as the test set. The goal of the experiments is to use the data from the first five months to predict the daily song play-count in August.
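The 122:31 split and the windowed slicing of a daily series into training pairs can be sketched as follows; the window lengths p and q and the integer stand-in series are illustrative assumptions, not the paper's settings.

```python
import numpy as np

def make_windows(series, p, q):
    """Slice a daily play-count series into (input, target) pairs:
    p past days as input, the following q days as target."""
    X, Y = [], []
    for i in range(len(series) - p - q + 1):
        X.append(series[i:i + p])
        Y.append(series[i + p:i + p + q])
    return np.array(X), np.array(Y)

# March-July is 153 days; the paper splits them 122:31 into train and dev.
series = np.arange(153, dtype=float)       # stand-in for one artist's series
train, dev = series[:122], series[122:]
X, Y = make_windows(train, p=5, q=1)
```

With p = 5 and q = 1 this yields 117 training pairs from the 122 train days, while the 31 dev days are held out for validation.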

Figure 4: Play-count trends of artists A, B, C, and D from March to July. The x-axis denotes the date (range 1-122); the y-axis denotes the play-count.

4.3 Music popularity trend analysis

Fig. 4 shows the daily play-count trends of 4 artists selected randomly from the 49 artists, from March to July. The x-axis denotes the date (range 1-122); the y-axis denotes the play-count. Artist A's play-count fluctuated a lot in days 1 to 40, but after day 40 it gradually stabilized at around 1,000 plays daily. We speculate that the fluctuation before day 40 may correspond to the recommendation period after the artist released a new album, or to the season's top TV series or movie songs produced by the artist. Artist B's total play-count showed a jittery upward trend, with a rapid growth period of nearly 10 days starting around day 30, possibly influenced by external platform recommendations or similar-song recommendations; after day 80, the play-count fluctuated slowly. Artist C had a steady trend in days 1 to 70, but around day 75 there were huge fluctuations; we conjecture (1) erroneous data, (2) the impact of special holidays, or (3) the influence of sensational entertainment news. Artist D's daily play-count is extremely unstable.

Each artist's historical play-count data contains unique temporal information, and there are large gaps between different artists' data. It is impractical to use one model to predict all artists' play-count trends, so we build an LSTM-RPA for each of the 49 artists independently.

4.4 Evaluation

The competition organizer provides the evaluation function, the F score, which measures the difference between the predicted and actual values. Let a denote an artist and A the artist set. T_{a,d} is the actual play-count of artist a on day d, and S_{a,d} is the model's predicted value for artist a on day d.

\sigma_a is the normalized variance:

\sigma_a = \sqrt{\frac{1}{N} \sum_{d=1}^{N} \left( \frac{S_{a,d} - T_{a,d}}{T_{a,d}} \right)^2}    (9)

In (9), N is the total number of predicted days. \sigma_a indicates the gap between S_{a,d} and T_{a,d}: if \sigma_a is small, 1 - \sigma_a is large, which means the model's prediction is very accurate. \Phi_a is the artist weight:

\Phi_a = \sqrt{\sum_{d=1}^{N} T_{a,d}}    (10)

Finally, the F score is defined as:

F = \sum_{a \in A} (1 - \sigma_a) \cdot \Phi_a    (11)

In this paper, we use the F score as the measure of prediction accuracy: the larger the F score, the more accurate the prediction.
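The metric as described above is small enough to implement directly; the following is a minimal NumPy sketch, where the artists-by-days matrix layout and the toy values are our assumptions.

```python
import numpy as np

def f_score(T, S):
    """Competition F score.  T[a, d] is the actual play-count of artist a
    on day d; S[a, d] is the predicted value.  sigma is the per-artist
    normalized error, phi the per-artist weight, and the score sums
    (1 - sigma_a) * phi_a over artists."""
    sigma = np.sqrt(np.mean(((S - T) / T) ** 2, axis=1))  # per-artist error
    phi = np.sqrt(T.sum(axis=1))                          # per-artist weight
    return float(np.sum((1.0 - sigma) * phi))

# Two artists over three days; a perfect prediction scores the sum of weights.
T = np.array([[100.0, 100.0, 100.0],
              [400.0, 400.0, 400.0]])
perfect = f_score(T, T)   # sqrt(300) + sqrt(1200)
```

A uniform 10% over-prediction gives sigma_a = 0.1 for every artist and therefore exactly 90% of the perfect score, which shows how the weight term rewards accuracy on high-play-count artists.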

4.5 Experimental result

Music trend prediction is influenced not only by different features but also by different rolling steps and time steps. With single or multiple input features, we study the effect of different rolling steps and time steps on the accuracy of long trend prediction. All experiments are implemented with the Keras framework.

4.6 Single feature LSTM-RPA experiment

This experiment established and compared four kinds of neural networks: LSTM, BiLSTM, GRU, and RNN. The GRU and RNN network structures are the same as the SF-LSTM model. The BiLSTM model replaces the first two LSTM layers with bidirectional LSTM layers, which means data flows forward and backward only within the current bidirectional LSTM layer. The baseline result (black line) is the prediction score of the corresponding model without the RPA. In this experiment, we only consider positive scores and set negative scores to 0.

Figure 5: F scores of different models with different time steps and rolling steps in the single feature experiment. The x-axis denotes the F score; the y-axis denotes the time step.

Fig. 5 shows the predicted F scores for the 49 artists from the different rolling prediction models under different time and rolling steps with the single feature. When the time step is less than 3, the LSTM-RPA's prediction accuracy is better than that of the LSTM model (baseline). However, when the time step is greater than 3, the LSTM-RPA's prediction accuracy drops greatly due to overfitting. In the first four time steps, the BiLSTM-RPA model underfits at time step 2 (F score 276.9514), while its other prediction scores are higher than the baseline. At time step 4, different rolling steps affect the model's prediction accuracy differently: as the rolling step increases, the model's accuracy also improves. The GRU-RPA model likewise achieves high prediction accuracy when the time step is less than 3; however, unlike the LSTM-RPA model, the GRU-RPA model retains its prediction accuracy when the time step is greater than 3. For more than half of the time steps, the RNN-RPA model's prediction score exceeds the baseline.

4.7 Multiple features LSTM-RPA experiment

A song's play-count trend is influenced not only by its historical play-count but also by its download-count and collect-count. Based on the single feature LSTM-RPA experiment, we modified the four neural network structures. The GRU and RNN network structures are the same as the MF-LSTM. In the BiLSTM model, the output layer is an LSTM layer and the other layers are bidirectional LSTM layers. The black line is the baseline score.

Figure 6: F scores of different models with different time steps and rolling steps in the multi-feature experiment. The x-axis denotes the F score; the y-axis denotes the time step.

Fig. 6 shows the F scores of the different models with different time and rolling steps in the multiple features experiment. With more features, the LSTM-RPA model's prediction accuracy is better than that of the LSTM model without RPA at all time steps. When the time step is greater than 5, the BiLSTM-RPA model's prediction accuracy decreases as the time step increases. This is because the amount of training data grows with the time step; in the bidirectional structure, the amounts of forward and backward training data both increase, which strengthens the model's memory of unnecessary data and thus lowers the prediction accuracy. In the GRU-RPA experiment, the RPA can still improve on the GRU baseline, which is the best among all the models' baselines. The prediction accuracy of the RNN-RPA model is greatly affected by the rolling step for a given time step, but some of its F scores are still higher than the baseline.

4.8 Experimental analysis

The experiments explored the effects of single and multiple features on the RPA; we then selected and compared each RPA model's best F score and corresponding baseline under the different features. Table 3 shows the best F score and baseline of each RPA model in the single feature experiment. The LSTM-RPA model has the highest F score and the lowest average error among the four models, and the F scores of the GRU-RPA and LSTM-RPA models are similar. The RPA improves the prediction accuracy of the RNN model more than that of the other models. This experiment shows that the RPA yields more than a 10% improvement for these sequence models in the single feature experiment.

Model using RPA Best F score Baseline score Optimum ratio Average error
LSTM 4366.81 3863.12 13.03% 26.34
BiLSTM 4326.58 3740.34 15.65% 27.19
GRU 4360.07 3902.00 11.73% 26.49
RNN 4287.89 3684.44 16.37% 27.96
Table 3: The F score and baseline of each rolling prediction model in the single feature experiment
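The "Optimum ratio" column in Table 3 appears consistent with the relative improvement (best - baseline) / baseline; this interpretation is our inference, and a quick check against two rows of the table reproduces it:

```python
def optimum_ratio(best, baseline):
    """Relative F-score improvement (%) of an RPA model over its baseline."""
    return 100.0 * (best - baseline) / baseline

# Table 3, single feature experiment.
lstm = optimum_ratio(4366.81, 3863.12)   # consistent with the 13.03% row
rnn = optimum_ratio(4287.89, 3684.44)    # consistent with the 16.37% row
```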

Table 4 shows the best F scores and baselines of the different rolling prediction models in the multi-feature experiment. The LSTM-RPA, BiLSTM-RPA, and GRU-RPA differ little in best F score. Compared with their baselines, the RPA improves the LSTM-RPA model the most among these three, though this improvement is smaller than that of the RNN-RPA model. Among all models, the GRU-RPA has the highest F score and the lowest average error. These results show that the RPA still yields more than a 4% improvement in the multi-feature experiment.

Model using RPA Best F score Baseline score Optimum ratio Average error
LSTM 4273.16 3813.11 12.06% 28.26
BiLSTM 4263.43 4027.17 5.89% 28.46
GRU 4291.26 4117.23 4.22% 27.89
RNN 4198.39 3685.90 12.81% 30.60
Table 4: The F score and baseline of each rolling prediction model in the multi-feature experiment

Based on the single-feature and multi-feature experiments, we selected the rolling prediction model with the highest predictive accuracy, the single-feature LSTM-RPA, and compared it with the baselines of the different models and with the traditional trend prediction algorithms ARIMA and SMA. Table 5 shows the best F scores and average errors of the traditional and sequence models. The results show that the LSTM-RPA model achieves a better F score than the other models. Compared with ARIMA and SMA, the LSTM-RPA's F score increases by 10.67% and 3.43%, its average error decreases by 32.64% and 11.23%, and its average error variance decreases by 368.73% and 35.29%. Compared with LSTM, BiLSTM, GRU, and RNN, its F score increases by 13.03%, 16.74%, 11.91%, and 18.52%, its average error decreases by 39.02%, 48.55%, 36.02%, and 52.88%, and its average error variance decreases by 183.69%, 252.83%, 175.23%, and 225.79%.

Model ARIMA SMA LSTM-RPA LSTM BiLSTM GRU RNN
F score 3945.53 4221.99 4366.81 3863.12 3740.34 3902.00 3684.44
Average error 34.94 29.30 26.34 36.62 39.13 35.83 40.27
Mean error variance 45.42 13.11 9.69 27.49 34.19 26.67 31.57
Table 5: The F scores and mean errors of the best models

In summary, models using the RPA achieve higher prediction accuracy and lower average error than their baselines in both the single-feature and multi-feature experiments. In terms of relative improvement, the RPA's optimization effect in the multi-feature experiments is smaller than in the single-feature experiments. Moreover, the LSTM-RPA model has higher prediction accuracy than the traditional trend prediction models, and a lower average error variance than the LSTM model. All in all, the RPA provides a clear optimization effect in the long trend prediction task.

5 Conclusion

In this paper, we analyzed data about artists and users' behavior and selected different features (play-count, download-count, and collect-count) for the experiments. The goal of the experiments was to predict each artist's daily play-count for 30 days. We proposed the LSTM Rolling Prediction Algorithm (LSTM-RPA), based on the moving-average and shortcut ideas, to improve model accuracy in the long trend prediction task by exploiting the relationship between the model's historical input and its current prediction results. At time t+1, the LSTM-RPA model can consider the information of times t-1 and t, and it turns the long trend prediction task into multiple short trend prediction tasks. The experimental results show that the prediction accuracy of the single-feature LSTM-RPA model is better than that of the traditional models and the DL sequence models. Compared with ARIMA and SMA, the LSTM-RPA's F score increases by 10.67% and 3.43%, and its average error decreases by 32.64% and 11.23%. Compared with LSTM, BiLSTM, GRU, and RNN, the LSTM-RPA increases the F score by 13.03%, 16.74%, 11.91%, and 18.52%, and decreases the average error by 39.02%, 48.55%, 36.02%, and 52.88%. Thus, our method improves the prediction accuracy of sequence models in the long trend prediction task.

In the future, we will introduce the self-attention mechanism to increase the model's accuracy in the long trend prediction task.


  • [1] Cheng Z, Shen J, Zhu L, et al. Exploiting Music Play Sequence for Music Recommendation[C]//IJCAI. 2017, 17: 3654-3660.
  • [2] Lee J, Lee J S. Predicting music popularity patterns based on musical complexity and early stage popularity[C]//Proceedings of the Third Edition Workshop on Speech, Language & Audio in Multimedia. 2015: 3-6.
  • [3] Kim Y, Suh B, Lee K. # nowplaying the future Billboard: mining music listening behaviors of Twitter users for hit song prediction[C]//Proceedings of the first international workshop on Social media retrieval and analysis. 2014: 51-56.
  • [4] Ren J, KAUFFMAN R J. Understanding music track popularity in a social network[C]. AIS, 2017.
  • [5] Pleus B R M. Music Popularity Prediction via Techniques in Deep Supervised Learning[J].
  • [6] Latif M H, Afzal H. Prediction of movies popularity using machine learning techniques[J]. International Journal of Computer Science and Network Security (IJCSNS), 2016, 16(8): 127-131.
  • [7] Li Zhenzhen, Wu Qun. Research on stock forecasting algorithm based on LSTM neural network [J]. Fujian computer, 2019,35 (07): 41-43.
  • [8] Song Yujia, Zhang Jian, Xing Bin. Building short-term traffic prediction model based on long-term and short-term memory network [J]. Highway, 2019,64(07): 224-229.
  • [9] Valipour M, Banihabib M E, Behbahani S M R. Parameters Estimate of Autoregressive Moving Average and Autoregressive Integrated Moving Average Models and Compare Their Ability for Inflow Forecasting[J]. Journal of Mathematics and Statistics, 2012, 8(3).
  • [10] Yi-Shian Lee, Lee-Ing Tong. Forecasting time series using a methodology based on autoregressive integrated moving average and genetic programming[J]. Knowledge-Based Systems,2010,24(1).
  • [11] Salganik M J, Dodds P S, Watts D J. Experimental study of inequality and unpredictability in an artificial cultural market[J]. Science, 2006, 311(5762): 854-856.
  • [12] Martín-Gutiérrez D, Peñaloza G H, Belmonte-Hernández A, et al. A Multimodal End-to-End Deep Learning Architecture for Music Popularity Prediction[J]. IEEE Access, 2020, 8: 39361-39374.
  • [13] Yu Weisheng, Deng Weisheng, Zhang Yao, Li Shuyu. Prediction Study of Popular Trend of Music Based on Time Series [J]. Computer Engineering and Science, 2018,40 (09): 1703-1709.
  • [14] Mishra S, Rizoiu M A, Xie L. Modeling popularity in asynchronous social media streams with recurrent neural networks[J]. arXiv preprint arXiv:1804.02101, 2018.
  • [15] Youru Li, Zhenfeng Zhu, Deqiang Kong, Hua Han,Yao Zhao. EA-LSTM: Evolutionary attention-based LSTM for time series prediction[J]. Knowledge-Based Systems,2019,181.
  • [16] Lauren S , Harlili S D . Stock trend prediction using simple moving average supported by news classification[C]// International Conference of Advanced Informatics: Concept. IEEE, 2014.
  • [17] Xu Xianfeng, Liu ahui, Chen Yulu, Cai lulu. Short term power prediction of bilstm photovoltaic power generation based on meteorological factors [J]. Computer system application, 2020,29 (07): 205-211.
  • [18] Zhang Jinlei, Luo Yuling, Fu Qiang. Financial time series prediction based on gated cyclic unit neural network [J]. Journal of Guangxi Normal University (NATURAL SCIENCE EDITION), 2019,37 (02): 82-89.
  • [19] Hochreiter S, Schmidhuber J. Long short-term memory [J]. Neural Computation, 1997, 9(8): 1735-1780.
  • [20] Yu H, Li Y, Zhang S, et al. Popularity Prediction for Artists Based on User Songs Dataset[C]//Proceedings of the 2019 5th International Conference on Computing and Artificial Intelligence. 2019: 17-24.
  • [21] Wang Yuandong. Image Classification Method and Application Based on ResNet Model [D]. East China Jiaotong University, 2019.