German philosopher Friedrich Nietzsche famously said, “Without music, life would be a mistake.” In this digital age, we have access to a large and rapidly growing collection of music. The iTunes music store alone offers 37 million songs and has sold more than 25 billion songs worldwide.
Every society has its own music and its own popular songs, and some songs transcend societies and even continents. The 1990s era of pop and rock music was dominated by artists such as Michael Jackson, Sting, U2 and many others. A whole generation of 1990s youth can immediately identify “Beat It”, a top song during that period.
What makes a song catchy? Song lyrics contain words that arouse emotions such as anger and love, which tend to play an important role in whether people like a song. How well songs are liked has not only a human emotional aspect but also a direct economic impact on the $130 billion music industry.
The sales and evaluation of songs directly affect music companies, so a computational model that predicts the popularity of a song is of great value to the music industry. Identifying the potential of a song early gives companies an edge in purchasing songs at a lower cost. Also, an artist usually composes the music for a song after the lyrics are written. For an organization investing in a music album, there is a great financial incentive to know whether a song will catch the pulse of the audience based on the lyrics alone, before the music is composed, as composing music requires considerable resources.
Since songs are composed of several components such as lyrics, instrumental music, and vocal and visual renditions, the nature of a song is highly complex. Lyrics are the language component that ties together the vocal, musical, and visual components. There needs to be harmony between the components to produce a song. Songs have the potential to lift our moods, make us shake a leg or move us to tears. They also help us relate to our experiences by triggering several emotional responses.
There has been a lot of work on genre classification using machine learning. Researchers identify the category of the songs based on the emotions such as sad, happy and party. All the songs tend to have an emotional component, but we see very few songs that catch the people’s pulse and become a hit.
The research question addressed in this paper is as follows:
Can machine learning models be trained on lyrics for predicting the top and bottom ranked songs?
In the current paper, we look at language features that help predict whether a song belongs to a top or a bottom ranked category. To the best of our knowledge, this is the first study addressing this problem.
Language is a strong indicator of the stresses and mood of a person. Identifying these features has helped computational linguists as well as computer scientists to correlate language features with several complex problems arising in tutoring systems (Rus et al., 2013; Graesser et al., 2005), affect recognition (D'Mello et al., 2008), sentiment mining (Hu and Liu, 2004), opinion mining, and many others.
Su, Fung, and Auguin (2013) implemented multimodal music emotion classification (MEC) for classifying 14 kinds of emotions from the music and song lyrics of the western music genre. Their dataset consisted of approximately 3,500 songs with emotions/moods such as sad, high, groovy, happy, lonely, sexy, energetic, romantic, angry, sleepy, nostalgic, funny, jazzy, and calm. They used AdaBoost with decision stumps to classify the music and language features of the lyrics into their respective emotion categories. They report an accuracy of 0.78 using language features as well as surface features of the audio. The authors claim that the language features played a more important role than the music features in classification.
Laurier, Grivolla, and Herrera (2008) also indicated that language features outperformed audio features for music mood classification. They showed that language features extracted from the songs fit well with Russell's valence (negative-positive) and arousal (inactive-active) model (Russell, 1980). Several cross-cultural studies show evidence for universal emotional cues in music and language across different cultures and traditions (McKay, 2002).
Significant advances have been made in the area of emotion detection and mood classification based on music and lyrics analysis, through large-scale machine learning operating on vast feature sets, sometimes spanning multiple domains, applied to relatively short musical selections (Kim et al., 2010). However, these approaches often help in identifying the genre and mood but do not reveal much about why a song is popular, or which features of the song made it catch the pulse of the audience.
Mihalcea and Strapparava (2012) used LIWC and surface music components of all the phrases present in a small collection of songs as a dataset for identifying the emotions in each phrase. Each phrase was annotated for emotions. Using an SVM classifier, they obtained an accuracy of 0.87 with the language features alone. They observed that the language components gave a higher accuracy than the music features in predicting emotions. The accuracy is high partly because they look at emotions in a single phrase, where the chance of finding multiple emotions in such a short text is very low.
When we look at a collection of popular songs, they belong to several emotional categories. It is clear from previous research that language is a strong indicator of emotions, but it is not clear if the language is an indicator of a song becoming a commercial success.
We used the language features extracted from the lyrics to train an SVM classifier to identify the top and bottom category of songs. Below is the description of both approaches:
A machine learning approach: We extracted the language features and performed dimensionality reduction using principal component analysis (PCA) in order to reduce the noise in the data. We trained and tested an SVM classifier on the new features to identify the songs that belonged to the top and bottom of the Billboard rankings.
Billboard magazine (Billboard, 2015) has been a premier music publication worldwide since 1984. Billboard's music charts have evolved into the primary source of information on trends and innovation in the music industry. With more than 10 million users, its ranking is considered a standard in the music industry. Billboard releases a weekly ranking of top songs in several categories such as rock, pop, and hip-hop. For this study, we used top hot-hits of every week from . We collected the lyrics of the songs from www.lyrics.com. Since the rankings are published every week, the same song often appears in multiple weeks. To simplify the problem, we took the best (peak) rank a song achieved during the year as the rank of that song.
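The peak-rank rule described above can be sketched as follows. This is a minimal illustration, not the authors' code: the record layout `(week, title, artist, rank)`, the function name `peak_ranks`, and the toy entries are all assumptions made for the example.

```python
def peak_ranks(weekly_entries):
    """Return the best (numerically lowest) chart position per song.

    weekly_entries: iterable of (week, title, artist, rank) tuples.
    This record layout is hypothetical and chosen only for illustration.
    """
    best = {}
    for _week, title, artist, rank in weekly_entries:
        key = (title, artist)
        # Keep the smallest rank seen so far; rank 1 is the top of the chart.
        if key not in best or rank < best[key]:
            best[key] = rank
    return best


# Toy data, not taken from the actual charts.
entries = [
    ("week-01", "Song A", "Artist X", 12),
    ("week-02", "Song A", "Artist X", 3),
    ("week-03", "Song A", "Artist X", 7),
    ("week-01", "Song B", "Artist Y", 95),
    ("week-02", "Song B", "Artist Y", 88),
]
ranks = peak_ranks(entries)
print(ranks)
```

Keying on the (title, artist) pair rather than the title alone avoids merging different songs that happen to share a name.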
After cleaning the lyrics of hypertext annotations and punctuation, we had a total of songs from artists. The histogram of the peak rank of the songs in the dataset is shown in Figure 1. For our analysis, we built a model to identify the songs that belonged to the top 30 and bottom 30 ranks. There are a total of songs of which belonged to top 30, and the rest belonged to bottom 30 ranks.
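As a rough illustration of how the two classes could be derived from peak ranks, the sketch below labels a song as top (1), bottom (0), or neither. The chart size of 100 positions is an assumption here, and `label_song` is a hypothetical helper rather than part of the original pipeline.

```python
def label_song(peak_rank, chart_size=100, band=30):
    """Return 1 for the top band, 0 for the bottom band, None otherwise.

    chart_size=100 is an assumption (a 100-position chart);
    band=30 follows the top 30 / bottom 30 split described in the text.
    """
    if 1 <= peak_rank <= band:
        return 1
    if chart_size - band < peak_rank <= chart_size:
        return 0
    # Songs in the middle of the chart are excluded from the analysis.
    return None


labels = {rank: label_song(rank) for rank in (1, 30, 31, 70, 71, 100)}
print(labels)
```

Under these assumptions, ranks 1 to 30 form the positive class, ranks 71 to 100 the negative class, and everything in between is dropped.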
Few analyses apply a whole battery of linguistic algorithms that look at the syntax, semantics, emotions, and affect contribution of the words present in lyrics. These algorithms can be classified into general structural (e.g., word count), syntactic (e.g., connectives) and semantic (e.g., word choice) dimensions of language. Some use a bag-of-words approach (e.g., LIWC), others use a probability approach (e.g., MRC), and yet others rely on the computation of different factors (e.g., type-token ratio). We used eight computational linguistic algorithms to analyze the language features in the lyrics of the songs.
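To make the "computed factor" family concrete, the type-token ratio mentioned above is the number of distinct words divided by the total number of words. The sketch below is a minimal version under an assumed tokenization (lowercased runs of letters and apostrophes); the tokenizer actually used in the study is not specified.

```python
import re


def type_token_ratio(lyrics):
    """Distinct word types divided by total word tokens (0.0 for empty text)."""
    # Assumed tokenization: lowercase runs of letters and apostrophes.
    tokens = re.findall(r"[a-z']+", lyrics.lower())
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)


# Toy lyric line, invented for the example: 8 tokens, 4 distinct types.
ttr = type_token_ratio("la la la love me love me do")
print(ttr)
```

A highly repetitive chorus drives the ratio down, which is why it can serve as a simple structural feature for lyrics.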
For general linguistic features, we used the frequency of the linguistic features described by Biber (1991). These features primarily operate at the word level (e.g., parts of speech) and can be categorized as tense and aspect markers, place and time adverbials, pronouns and pro-verbs, questions, nominal forms, passives, stative forms, subordination features, prepositional phrases, adjectives and adverbs, lexical specificity, lexical classes, modals, specialized verb classes, reduced forms and dis-preferred structures, and co-ordinations and negations (Luno, Beck, and Louwerse, 2013).
For semantic categories of the words, we used WordNet (Miller et al., 1998). WordNet has words in base types including primitive groups for nouns (e.g., time, location, person), for verbs (e.g., communication, cognition), groups of adjectives and groups of adverbs. We also collected all the English words from Google unigrams (Brants and Franz, 2006) and binned each word into a category if one of its synonyms belonged to that category. These words represent categories such as communication nouns, social nouns, and many others.
The linguistic category model (LCM) gives insight into interpersonal language use. The model consists of a classification of interpersonal (transitive) verbs that are used to describe actions or psychological states, and of adjectives that are employed to characterize persons. To capture the various emotions expressed in a statement, we used the emotion words given by Tausczik and Pennebaker (2010), broadly classified into two classes: basic emotions (anger, fear, disgust, happiness, etc.) and complex emotions (guilt, pity, tenderness, etc.).
The basic emotions indicate no cognitive load and are hence also called raw emotions, whereas the complex emotions indicate cognitive load. Inter-clausal relationships were captured using parameterization, including positive additive (also, moreover), negative additive (however, but), positive temporal (after, before), negative temporal (until), and causal (because, so) connectives. To get the frequencies of the words, we used the CELEX database (Baayen, Piepenbrock, and Gulikers, 1995). The CELEX database consists of million words taken from both spoken (news wire and telephone conversations) and written (newspapers and books) corpora. We also used the MRC Psycholinguistic Database (Johnson-Laird and Oatley, 1989) to get linguistic measures such as familiarity, concreteness, and meaningfulness.
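As an illustration of how connective features of this kind can be turned into numbers, the sketch below counts the example connectives listed above and normalizes by lyric length. The word lists contain only the examples given in the text, not the full parameterization, and the single-word matching and the `connective_rates` helper are simplifying assumptions.

```python
import re
from collections import Counter

# Only the example connectives named in the text; the full lists are longer.
CONNECTIVES = {
    "positive_additive": {"also", "moreover"},
    "negative_additive": {"however", "but"},
    "positive_temporal": {"after", "before"},
    "negative_temporal": {"until"},
    "causal": {"because", "so"},
}


def connective_rates(lyrics):
    """Relative frequency of each connective category per token."""
    tokens = re.findall(r"[a-z']+", lyrics.lower())
    counts = Counter()
    for token in tokens:
        for category, words in CONNECTIVES.items():
            if token in words:
                counts[category] += 1
    n_tokens = len(tokens) or 1  # avoid division by zero on empty lyrics
    return {category: counts[category] / n_tokens for category in CONNECTIVES}


# Invented line with 10 tokens: one causal, one negative additive,
# and one positive temporal connective.
rates = connective_rates("I stayed because you called, but I left before dawn")
print(rates)
```

Normalizing by token count keeps the feature comparable between short and long lyrics. Word-level matching is crude (for instance, "so" is not always causal), so a real system would need disambiguation.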
|Classifier|Precision|Recall|Cohen's kappa|
|---|---|---|---|
|SVM, radial (exponential) kernel|0.76|0.76|0.51|
|SVM, polynomial kernel|0.68|0.68|0.36|
|SVM, linear kernel|0.53|0.53|0.05|
After the linguistic analysis, we approached the problem as a classification problem. As discussed earlier, we extracted the language features from the lyrics using the computational linguistic algorithms shown in Figure 2. We extracted 261 features from each of the 2616 songs. The goal is to build a classifier that predicts the top and bottom ranked songs of the Billboard charts. Since there are many features relative to the number of songs, we reduced the noise contributed by the features using principal component analysis (PCA). We retained the principal components that together explained 0.6 of the variance, which reduced the dimensionality from 261 features to 39 components.
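The component-selection rule can be sketched as follows: given the variance captured by each principal component, keep the smallest number of leading components whose cumulative share reaches the threshold. The eigenvalues below are invented for illustration, and `n_components_for_variance` is a hypothetical helper, not the authors' code.

```python
def n_components_for_variance(eigenvalues, threshold=0.6):
    """Smallest k such that the top-k components explain >= threshold of variance.

    eigenvalues: variance captured by each principal component (any order).
    """
    total = sum(eigenvalues)
    cumulative = 0.0
    for k, value in enumerate(sorted(eigenvalues, reverse=True), start=1):
        cumulative += value
        if cumulative / total >= threshold:
            return k
    return len(eigenvalues)


# Invented eigenvalues: shares are 0.50, 0.30, 0.10, 0.05, 0.05.
eigenvalues = [5.0, 3.0, 1.0, 0.5, 0.5]
k = n_components_for_variance(eigenvalues, threshold=0.6)
print(k)
```

In scikit-learn, passing a fraction such as `PCA(n_components=0.6)` applies a comparable cumulative-variance rule. Whether the study used that library is not stated.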
It is important to note that the major advantages of PCA are noise reduction and the identification of the directions that capture most of the variance in the data. The disadvantage is that the resulting components lose the semantic meaning of the raw features.
The positive and negative classes, i.e. the top 30 and bottom 30 songs, were in a ratio of 1.5 to 1. To balance the classes, we applied the synthetic minority over-sampling technique (SMOTE) (Chawla et al., 2002). SMOTE creates new synthetic minority-class samples by interpolating between an original minority sample and one of its nearest minority-class neighbors.
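The interpolation idea behind SMOTE can be sketched in a few lines of plain Python. This is a simplified illustration under stated assumptions (Euclidean distance, a fixed random seed, toy two-dimensional points, a hypothetical `smote_sketch` helper), not the implementation used in the study.

```python
import random


def smote_sketch(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between minority samples.

    minority: list of feature vectors (lists of floats), at least two points.
    k: number of nearest minority neighbours considered for each base point.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the base point (excluding itself),
        # using squared Euclidean distance.
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)),
        )[:k]
        mate = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> mate
        synthetic.append([a + gap * (b - a) for a, b in zip(base, mate)])
    return synthetic


# Toy minority class: the four corners of the unit square.
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
new_points = smote_sketch(minority, n_new=3)
print(new_points)
```

Because every synthetic point lies on a line segment between two existing minority samples, the new data stays inside the region already occupied by the minority class.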
After balancing the classes, we performed classification using support vector machines (SVM) with radial (exponential), polynomial, and linear kernel functions. The classification was evaluated using 10-fold cross-validation.
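For reference, the three kernel families named above have the following standard forms. The hyperparameters (`gamma`, `degree`, `coef0`) are illustrative assumptions, since the values used in the experiments are not reported.

```python
import math


def linear_kernel(x, y):
    """k(x, y) = <x, y>"""
    return sum(a * b for a, b in zip(x, y))


def polynomial_kernel(x, y, degree=3, coef0=1.0):
    """k(x, y) = (<x, y> + coef0) ** degree"""
    return (linear_kernel(x, y) + coef0) ** degree


def rbf_kernel(x, y, gamma=0.5):
    """k(x, y) = exp(-gamma * ||x - y||^2), the radial (exponential) kernel."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)


x, y = [1.0, 2.0], [2.0, 0.0]
print(linear_kernel(x, y), polynomial_kernel(x, y), rbf_kernel(x, y))
```

The radial kernel depends only on the distance between two points, so it can model decision boundaries that are not linear in the original feature space.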
An SVM uses the implicit mapping defined by its kernel function to map the input data into a very high-dimensional feature space, and then learns the separating hyperplane between the two classes in that space. For the classification of top and bottom ranked songs, we observe that the radial (exponential) kernel performs best, with a precision of 0.76, a recall of 0.76, and a Cohen's kappa of 0.51. The kappa score indicates that the classifier's agreement with the true labels is well above what would be expected by chance.
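To show how the three reported measures relate to a confusion matrix, the sketch below computes precision, recall, and Cohen's kappa for a binary problem. The labels are toy values chosen for the example and `binary_metrics` is a hypothetical helper; the paper may report averaged per-class values, which this sketch does not attempt to reproduce.

```python
def binary_metrics(y_true, y_pred):
    """Precision, recall, and Cohen's kappa for labels in {0, 1}."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    n = tp + tn + fp + fn

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0

    observed = (tp + tn) / n
    # Agreement expected by chance, from the marginal label frequencies.
    expected = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / n ** 2
    kappa = (observed - expected) / (1 - expected) if expected != 1 else 0.0
    return precision, recall, kappa


# Toy labels: 4 positives and 4 negatives, with one error in each class.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 1]
precision, recall, kappa = binary_metrics(y_true, y_pred)
print(precision, recall, kappa)
```

With balanced classes the chance agreement is 0.5, so kappa equals twice the amount by which accuracy exceeds 0.5.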
There are several studies (Mihalcea and Strapparava, 2012; Su, Fung, and Auguin, 2013; Laurier, Grivolla, and Herrera, 2008; Kim et al., 2010) that have looked into emotions in music based on language as well as a few audio features. All the studies explicitly indicated that language features were more useful than surface music features in identifying the emotion present in the songs.
Songs contain both music and lyrics. In this work, we used only the lyrics as our data. Lyrics are more readily available to the public than the music itself. Since previous studies have shown the importance of language in music for identifying emotions, we extended the investigation to identify the language features that help differentiate the top and bottom rated songs on the Billboard charts. To the best of our knowledge, this is the first study that uses computational linguistic algorithms and machine learning models to predict whether a song belongs to the top or bottom of the Billboard rankings.
We used the language features extracted using the computational linguistic algorithms to train SVM classifiers with different kernel functions to identify whether a song belongs to the top or bottom of the Billboard chart. The radial kernel function gives a precision of 0.76 with a kappa of 0.51, which indicates agreement with the true labels well above chance.
Although the audio features of a song play an important role, they are expensive and not publicly available for download. In this paper, we focused only on the language features, and the results from both the studies indicate that we can robustly identify whether a song goes to the top or bottom of the Billboard charts based on the language features alone. Although the precision is only 0.76 (chance is 0.5), this is notable given that we are working in the very dense space of the top 100 songs from Billboard, where all the songs are the best of the best among all the music uploaded to social media (YouTube, Facebook, Twitter, etc.).
Overall the take-home message of this paper is that language features can be exploited by the machine learning algorithms to predict whether a song reaches the top or bottom of the Billboard rankings.
Conclusion and Future Work
The music industry is a vibrant business community, with many artists publishing their work in the form of albums, individual songs, and performances. There is a huge financial incentive for the businesses to identify the songs that are most likely to be a hit.
Our results indicate that one can train machine learning models on several language features to predict whether a song belongs to the top 30 or bottom 30 of the Billboard rankings.
In the future, we would like to expand our research question to predict whether or not a song reaches the Billboard top 100 list.
- Baayen, Piepenbrock, and Gulikers (1995) Baayen, H. R.; Piepenbrock, R.; and Gulikers, L. 1995. The CELEX lexical database. release 2 (CD-ROM). Philadelphia, Pennsylvania: Linguistic Data Consortium, University of Pennsylvania.
- Biber (1991) Biber, D. 1991. Variation across speech and writing. Cambridge University Press.
- Billboard (2015) Billboard. 2015. Billboard magazine. Online.
- Brants and Franz (2006) Brants, T., and Franz, A. 2006. Web 1T 5-gram Version 1. Philadelphia: Linguistic Data Consortium.
- Chawla et al. (2002) Chawla, N. V.; Bowyer, K. W.; Hall, L. O.; and Kegelmeyer, W. P. 2002. SMOTE: Synthetic minority over-sampling technique. Journal of Artificial Intelligence Research 16:321–357.
- Coltheart (1981) Coltheart, M. 1981. The MRC psycholinguistic database. The Quarterly Journal of Experimental Psychology 33(4):497–505.
- D'Mello et al. (2008) D'Mello, S.; Craig, S.; Witherspoon, A.; McDaniel, B.; and Graesser, A. 2008. Automatic detection of learner's affect from conversational cues. User Modeling and User-Adapted Interaction 18(1-2):45–80.
- Graesser et al. (2005) Graesser, A. C.; Chipman, P.; Haynes, B. C.; and Olney, A. 2005. AutoTutor: An intelligent tutoring system with mixed-initiative dialogue. Education, IEEE Transactions on 48(4):612–618.
- Hu and Liu (2004) Hu, M., and Liu, B. 2004. Mining and summarizing customer reviews. In Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’04, 168–177. New York, NY, USA: ACM.
- Johnson-Laird and Oatley (1989) Johnson-Laird, P. N., and Oatley, K. 1989. The language of emotions: An analysis of a semantic field. Cognition and Emotion 3(2):81–123.
- Kim et al. (2010) Kim, Y. E.; Schmidt, E. M.; Migneco, R.; Morton, B. G.; Richardson, P.; Scott, J.; Speck, J. A.; and Turnbull, D. 2010. Music emotion recognition: A state of the art review. In Proc. ISMIR, 255–266. Citeseer.
- Laurier, Grivolla, and Herrera (2008) Laurier, C.; Grivolla, J.; and Herrera, P. 2008. Multimodal music mood classification using audio and lyrics. In Machine Learning and Applications, 2008. ICMLA’08. Seventh International Conference on, 688–693. IEEE.
- Louwerse (2001) Louwerse, M. 2001. An analytic and cognitive parametrization of coherence relations. Cognitive Linguistics 12(3):291–316.
- Luno, Beck, and Louwerse (2013) Luno, J. A.; Beck, J. G.; and Louwerse, M. 2013. Tell us your story: Investigating the linguistic features of trauma narrative. The Cognitive Science Society.
- McKay (2002) McKay, C. 2002. Emotion and music: Inherent responses and the importance of empirical cross-cultural research. Course Paper, McGill University, Canada.
- Mihalcea and Strapparava (2012) Mihalcea, R., and Strapparava, C. 2012. Lyrics, music, and emotions. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, 590–599. Association for Computational Linguistics.
- Miller et al. (1998) Miller, G. A.; Beckwith, R.; Fellbaum, C.; Gross, D.; and Miller, K. 1998. Five Papers on WordNet. In Fellbaum, C., ed., WordNet: An Electronic Lexical Database. MIT Press.
- Rus et al. (2013) Rus, V.; Niraula, N.; Lintean, M.; Banjade, R.; Stefanescu, D.; and Baggett, W. 2013. Recommendations for the generalized intelligent framework for tutoring based on the development of the deeptutor tutoring service. In AIED 2013 Workshops Proceedings Volume 7, 116.
- Russell (1980) Russell, J. A. 1980. A circumplex model of affect. Journal of personality and social psychology 39(6):1161.
- Semin and Fiedler (1988) Semin, G. R., and Fiedler, K. 1988. The cognitive functions of linguistic categories in describing persons: Social cognition and language. Journal of Personality and Social Psychology 54(4):558–568.
- Semin and Fiedler (1991) Semin, G. R., and Fiedler, K. 1991. The linguistic category model, its bases, applications and range. European Review of Social Psychology 2(1):1–30.
- Su, Fung, and Auguin (2013) Su, D.; Fung, P.; and Auguin, N. 2013. Multimodal music emotion classification using adaboost with decision stumps. In Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on, 3447–3451.
- Tausczik and Pennebaker (2010) Tausczik, Y. R., and Pennebaker, J. W. 2010. The psychological meaning of words: LIWC and computerized text analysis methods. Journal of Language and Social Psychology 29(1):24–54.