Neural Melody Composition from Lyrics

09/12/2018 ∙ by Hangbo Bao, et al. ∙ Beihang University ∙ Microsoft ∙ Harbin Institute of Technology

In this paper, we study a novel task that learns to compose music from natural language. Given the lyrics as input, we propose a melody composition model that generates lyrics-conditional melody as well as the exact alignment between the generated melody and the given lyrics simultaneously. More specifically, we develop the melody composition model based on the sequence-to-sequence framework. It consists of two neural encoders to encode the current lyrics and the context melody respectively, and a hierarchical decoder to jointly produce musical notes and the corresponding alignment. Experimental results on lyrics-melody pairs of 18,451 pop songs demonstrate the effectiveness of our proposed methods. In addition, we apply singing voice synthesis software to synthesize the "singing" of the lyrics and melodies for human evaluation. Results indicate that our generated melodies are more melodious and tuneful compared with the baseline method.

Introduction

We study the task of melody composition from lyrics, which consumes a piece of text as input and aims to compose the corresponding melody as well as the exact alignment between the generated melody and the given lyrics. Specifically, the output consists of two sequences of musical notes and lyric syllables (a syllable is a word or part of a word which contains a single vowel sound and is pronounced as a unit; Chinese is a monosyllabic language, which means that words, i.e. Chinese characters, predominantly consist of a single syllable, https://en.wikipedia.org/wiki/Monosyllabic_language) with two constraints. First, each syllable in the lyrics corresponds to at least one musical note in the melody. Second, a syllable in the lyrics may correspond to a sequence of notes, which increases the difficulty of this task. Figure 1 shows a fragment of a Chinese song. For instance, the last Chinese character '恋' (love) aligns with the two notes 'C5' and 'A4' in the melody.

Figure 1: A fragment of the Chinese song "Drunken Concubine (new version)". The blue rectangles indicate rests, i.e., intervals of silence in a piece of melody. The red rectangles indicate the alignment between the lyrics and the melody, i.e., a mapping from syllables of the lyrics to musical notes. Pinyin indicates the syllables of each Chinese character. We can observe that the second Chinese character '恨' (hate) aligns with one note 'E5' and the last Chinese character '恋' (love) aligns with two notes 'C5' and 'A4' in the melody, which illustrates the "one-to-many" relationship in the alignment between the lyrics and melody.

There are several existing research works on generating a lyrics-conditional melody [Ackerman and Loker2017, Scirea et al.2015, Monteith, Martinez, and Ventura2012, Fukayama et al.2010]. These works usually treat the melody composition task as a classification or sequence labeling problem. They first determine the number of musical notes by counting the syllables in the lyrics, and then predict the musical notes one after another by considering previously generated notes and the corresponding lyrics. However, these works only consider the "one-to-one" alignment between the melody and the lyrics. According to our statistics on 18,451 Chinese songs, many songs contain at least one syllable that corresponds to multiple musical notes (i.e., a "one-to-many" alignment), so this simplification may introduce bias into the task of melody composition.

In this paper, we propose a novel melody composition model which can generate a melody from lyrics and properly handle the "one-to-many" alignment between the generated melody and the given lyrics. Given lyrics as input, we first divide the input lyrics into sentences and then use our model to compose a piece of melody from each sentence one by one. Finally, we merge these pieces into a complete melody for the given lyrics. More specifically, the model consists of two encoders and one hierarchical decoder. The first encoder encodes the syllables in the current lyrics into an array of hidden vectors with a bi-directional recurrent neural network (RNN), and the second encoder leverages an attention mechanism to convert the context melody into a dynamic context vector with a two-layer bi-directional RNN. In the decoder, we employ a three-layer RNN to produce the musical notes and the alignment jointly, where the first two layers generate the pitch and duration of each musical note and the last layer predicts a label for each generated musical note to indicate the alignment.

We collect 18,451 Chinese pop songs and generate the lyrics-melody pairs with precise syllable-note alignment to conduct experiments on our methods and the baselines. Automatic evaluation results show that our model outperforms the baseline methods on all metrics. In addition, we leverage singing voice synthesis software to synthesize the "singing" of the lyrics and melodies and ask human annotators to manually judge the quality of the generated pop songs. The human evaluation results further indicate that the generated lyrics-conditional melodies from our method are more melodious and tuneful compared with the baseline methods.

The contributions of our work in this paper are summarized as follows.

  • To the best of our knowledge, this paper is the first work to use an end-to-end neural network model to compose a melody from lyrics.

  • We construct a large-scale lyrics-melody dataset with 18,451 Chinese pop songs and 644,472 lyrics-context-melody triples, making neural network based approaches possible for this task.

  • Compared with traditional sequence-to-sequence models, our proposed method can generate the exact alignment, including the "one-to-many" alignment, between the melody and the lyrics.

  • The human evaluation verifies that the synthesized pop songs of the generated melody and input lyrics are melodious and meaningful.

Preliminary

We first introduce some basic definitions from music theory and then give a brief introduction to our lyrics-melody parallel corpus. Table 1 lists some mathematical notations used in this paper.

Concepts from Music Theory

Melody can be regarded as an ordered sequence of musical notes. The basic unit of melody is the musical note, which mainly consists of two attributes: pitch and duration. The pitch is a perceptual property of sounds that allows their ordering on a frequency-related scale; more commonly, the pitch is the quality that makes it possible to judge sounds as "higher" and "lower" in the sense associated with musical melodies (https://en.wikipedia.org/wiki/Pitch_(music)). Therefore, we use a sequence of numbers to represent the pitch. For example, we represent 'C5' and 'Eb6' as 72 and 87 respectively based on the MIDI standard (https://newt.phys.unsw.edu.au/jw/notes.html). A rest is an interval of silence in a piece of music; we use 'R' to represent it and treat it as a special pitch. Duration is a particular time interval describing the length of time that a pitch or tone sounds (https://en.wikipedia.org/wiki/Duration_(music)), i.e., how long or short a musical note lasts.
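
As a concrete illustration of this pitch representation, the following Python sketch (ours, not part of the paper's toolchain) maps note names to MIDI numbers under the convention in which 'C5' is 72, and passes the rest symbol 'R' through as a special pitch; the function and dictionary names are illustrative.

```python
# Map note names such as 'C5' or 'Eb6' to MIDI numbers (C5 = 72 convention); 'R' is a rest.
PITCH_CLASSES = {'C': 0, 'C#': 1, 'Db': 1, 'D': 2, 'D#': 3, 'Eb': 3, 'E': 4, 'F': 5,
                 'F#': 6, 'Gb': 6, 'G': 7, 'G#': 8, 'Ab': 8, 'A': 9, 'A#': 10, 'Bb': 10, 'B': 11}
REST = 'R'

def note_to_midi(name):
    """'C5' -> 72, 'Eb6' -> 87; the rest symbol is returned unchanged."""
    if name == REST:
        return REST
    pitch_class, octave = name[:-1], int(name[-1])
    return (octave + 1) * 12 + PITCH_CLASSES[pitch_class]   # MIDI convention with C-1 = 0

assert note_to_midi('C5') == 72 and note_to_midi('Eb6') == 87
```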

Lyrics-Melody Parallel Corpus

Figure 2 shows an example of a lyrics-melody aligned pair with precise syllable-note alignment, where each Chinese character of the lyrics aligns with one or more notes in the melody.

$X = (x_1, \ldots, x_m)$: the sequence of syllables in the given lyrics
$x_j$: the $j$-th syllable in $X$
$\tilde{Y} = (\tilde{y}_1, \ldots, \tilde{y}_K)$: the sequence of musical notes in the context melody
$\tilde{y}_k$: the $k$-th musical note in $\tilde{Y}$
$\tilde{p}_k$, $\tilde{d}_k$: the pitch and duration of $\tilde{y}_k$, respectively
$Y = (y_1, \ldots, y_n)$: the sequence of musical notes in the predicted melody
$y_t$: the $t$-th musical note in $Y$
$y_{<t}$: the previously predicted musical notes in $Y$
$p_t$, $d_t$, $l_t$: the pitch, duration and label of $y_t$, respectively
$P$: the pitch sequence comprised of each $p_t$ in $Y$
$D$: the duration sequence comprised of each $d_t$ in $Y$
$L$: the label sequence comprised of each $l_t$ in $Y$
$h_j$: the $j$-th hidden state in the output of the lyrics encoder
$g_k$: the $k$-th hidden state in the output of the context melody encoder
$c_t$: the dynamic context vector at time step $t$
$a_t$: the $t$-th melody context vector from the context melody encoder
$R$: indicates the rest, specially

Table 1: Notations used in this paper

An example of sheet music with its lyrics-melody aligned data:

Pinyin:  ài  hèn  liǎng  máng  máng  wèn  jūn  shí  liàn
Pitch:   R   A4   E5   D5   B4   A4   C5   A4   G4   E4   G4   R   E4   D5   C5   A4   G4   C5   C5   A4
Label:   0   1    1    0    1    0    0    1    0    0    1    0   1    1    0    1    0    1    0    1
Figure 2: An illustration of lyrics-melody aligned data. The pitch sequence and the duration sequence respectively represent the pitch and duration of each musical note. In addition, the label sequence provides the information on the alignment between the lyrics and the melody. To be specific, a musical note is assigned label 1 to denote that it is the boundary of the musical note sub-sequence aligned to the corresponding syllable; otherwise, it is assigned label 0. Additionally, we always align the rests with the syllables that follow them.
Figure 3: An illustration of Songwriter. The lyrics encoder and the context melody encoder encode the syllables of the given lyrics and the context melody into two arrays of hidden vectors, respectively. For decoding the $t$-th musical note $y_t$, Songwriter uses an attention mechanism to obtain a context vector from the context melody encoder (green arrows), and counts how many boundary labels (label 1) have been produced among the previously generated musical notes to select the hidden vector representing the current syllable corresponding to $y_t$ from the lyrics encoder (red arrows) for the melody decoder. In the melody decoder, the pitch layer and duration layer first predict the pitch and duration of $y_t$, then the label layer predicts a label for $y_t$ to indicate the alignment.

The generated melody consists of three sequences $P$, $D$ and $L$, where the sequence $L$ represents the alignment between the melody and the lyrics. We are able to rebuild the sheet music with them. The $P$ sequence represents the pitch of each musical note in the melody, and $R$ specifically represents the rest in the $P$ sequence. Similarly, the $D$ sequence represents the duration of each musical note in the melody. $P$ and $D$ constitute a complete melody but do not include information on the alignment between the given lyrics and the corresponding melody.

$L$ contains the information of the alignment. Each item of $L$ is labeled as 0 or 1 to indicate the alignment between the musical note and the corresponding syllable in the lyrics. To be specific, a musical note is assigned label 1 to denote that it is the boundary of the musical note sub-sequence aligned to the corresponding syllable; otherwise, it is assigned label 0. We can split the musical notes into $m$ parts by label 1, where $m$ is the number of syllables of the lyrics, and each part is a musical note sub-sequence. Then we can align the musical note sub-sequences to their corresponding syllables sequentially. Additionally, we always align the rests to the syllables that follow them. For instance, we can observe that the second rest aligns to the Chinese character '问' (ask).
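
To make the label semantics concrete, here is a small illustrative Python helper (ours, not the authors' code) that recovers the syllable-to-note alignment from the label sequence, under the reading that label 1 marks the last note of a syllable's sub-sequence; the example data are taken from Figure 2.

```python
# Split the note sequence at every label 1 and pair the sub-sequences with the syllables.
def align_notes_to_syllables(syllables, pitches, labels):
    assert len(pitches) == len(labels)
    assert sum(labels) == len(syllables)   # the restriction in Eq. (1)
    alignment, group = [], []
    for pitch, label in zip(pitches, labels):
        group.append(pitch)
        if label == 1:                     # boundary: close this syllable's note sub-sequence
            alignment.append(group)
            group = []
    return list(zip(syllables, alignment))

syllables = ['ài', 'hèn', 'liǎng', 'máng', 'máng']
pitches = ['R', 'A4', 'E5', 'D5', 'B4', 'A4', 'C5', 'A4', 'G4', 'E4', 'G4']
labels = [0, 1, 1, 0, 1, 0, 0, 1, 0, 0, 1]
print(align_notes_to_syllables(syllables, pitches, labels))
# [('ài', ['R', 'A4']), ('hèn', ['E5']), ('liǎng', ['D5', 'B4']),
#  ('máng', ['A4', 'C5', 'A4']), ('máng', ['G4', 'E4', 'G4'])]
```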

Task Definition

Given lyrics as the input, our task is to generate the melody and alignment that make up a song with the lyrics. We can formally define this task as below:

The input is a sequence $X = (x_1, \ldots, x_m)$ representing the syllables of the lyrics. The output is a sequence $Y = (y_1, \ldots, y_n)$ representing the predicted musical notes for the corresponding lyrics, where each note $y_t = (p_t, d_t, l_t)$ consists of a pitch, a duration and an alignment label. In addition, the output sequence should satisfy the following restriction:

(1)  $\sum_{t=1}^{n} l_t = m$

which requires that the generated melody can be exactly aligned with the given lyrics.

Approach

In this section, we present the end-to-end neural networks model, termed as Songwriter, to compose a melody which aligns exactly to the given input lyrics. Figure 3 provides an illustration of Songwriter.

Overview

Given lyrics as the input, we first divide the lyrics into sentences and then use Songwriter to compose the melody sentence by sentence. For each sentence in the lyrics, Songwriter takes the syllables in the sentence and the context melody, which consists of some previously predicted musical notes, as input and then predicts a piece of melody. When the last piece of melody has been predicted, we merge these pieces of melody to make a complete song with the given lyrics. This procedure can be considered as a sequence generation problem with two sequences as input: the syllables of the current lyrics $X$ and the context melody $\tilde{Y}$. We develop our melody composition model based on a modified RNN encoder-decoder [Cho et al.2014a] to support multiple sequences as input.

Songwriter employs two neural encoders, a lyrics encoder and a context melody encoder, to respectively encode the syllables of the current lyrics $X$ and the context melody $\tilde{Y}$, and it leverages a hierarchical melody decoder to produce the musical notes and the alignment. To be specific, the lyrics encoder and the context melody encoder encode $X$ and $\tilde{Y}$ into two arrays of hidden vectors, respectively. At time step $t$, the melody decoder obtains a context vector from the context melody encoder and a hidden vector from the lyrics encoder to produce the $t$-th musical note $y_t$. The former is computed dynamically by the attention mechanism from the output of the context melody encoder; the latter is one of the output hidden vectors of the lyrics encoder, which represents the syllable in the current lyrics that $y_t$ should be aligned to. In the melody decoder, which is a three-layer RNN, the pitch layer and duration layer first predict the pitch $p_t$ and duration $d_t$, then the label layer predicts a label $l_t$ for $y_t$ to indicate the alignment.

Gated Recurrent Units

We use the Gated Recurrent Unit (GRU) [Cho et al.2014b] instead of the basic RNN. We describe the mathematical model of the GRU as follows:

(2)  $z_t = \sigma(W_z x_t + U_z h_{t-1} + b_z)$
(3)  $r_t = \sigma(W_r x_t + U_r h_{t-1} + b_r)$
(4)  $\tilde{h}_t = \tanh(W_h x_t + U_h (r_t \odot h_{t-1}) + b_h)$
(5)  $h_t = z_t \odot h_{t-1} + (1 - z_t) \odot \tilde{h}_t$

where $W_z$, $U_z$, $W_r$, $U_r$, $W_h$, $U_h$, $b_z$, $b_r$ and $b_h$ are parameters to be learned in the GRU, $\odot$ is an element-wise multiplication, $\sigma$ is a logistic sigmoid function, $z_t$ and $r_t$ are the gates and $h_t$ is the hidden state at time step $t$.
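
The following NumPy sketch mirrors the GRU update in Eqs. (2)-(5); it is an illustration of the standard cell rather than the authors' implementation, and the way the parameters are packed is an assumption.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x_t, h_prev, params):
    """One GRU step; params packs (W_z, U_z, b_z, W_r, U_r, b_r, W_h, U_h, b_h)."""
    W_z, U_z, b_z, W_r, U_r, b_r, W_h, U_h, b_h = params
    z_t = sigmoid(W_z @ x_t + U_z @ h_prev + b_z)               # update gate, Eq. (2)
    r_t = sigmoid(W_r @ x_t + U_r @ h_prev + b_r)               # reset gate, Eq. (3)
    h_tilde = np.tanh(W_h @ x_t + U_h @ (r_t * h_prev) + b_h)   # candidate state, Eq. (4)
    return z_t * h_prev + (1.0 - z_t) * h_tilde                 # new hidden state, Eq. (5)
```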

Lyrics Encoder

We use a bi-directional RNN [Schuster and Paliwal1997] built by two GRUs to encode the syllables of the lyrics; it concatenates the syllable feature embedding and the word embedding of each syllable as input to the GRU encoders:

(6)  $\overrightarrow{h}_j = \overrightarrow{\mathrm{GRU}}(\overrightarrow{h}_{j-1}, [e^w(x_j) ; e^s(x_j)])$
(7)  $\overleftarrow{h}_j = \overleftarrow{\mathrm{GRU}}(\overleftarrow{h}_{j+1}, [e^w(x_j) ; e^s(x_j)])$
(8)  $h_j = [\overrightarrow{h}_j ; \overleftarrow{h}_j]$

where $e^w(x_j)$ and $e^s(x_j)$ are the word embedding and the syllable feature embedding of $x_j$, and $[\cdot\,;\cdot]$ denotes vector concatenation.

Then, the lyrics encoder outputs an array of hidden vectors $(h_1, h_2, \ldots, h_m)$ to represent the information of each syllable in the lyrics.
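
A hedged PyTorch sketch of such a lyrics encoder is given below; the class and argument names are assumptions for illustration, and the embedding dimensions follow the Implementation section.

```python
import torch
import torch.nn as nn

class LyricsEncoder(nn.Module):
    def __init__(self, n_words, n_syllable_feats, word_dim=256, syl_dim=128, hidden=256):
        super().__init__()
        self.word_emb = nn.Embedding(n_words, word_dim)          # char-level word embedding
        self.syl_emb = nn.Embedding(n_syllable_feats, syl_dim)   # Pinyin syllable feature
        self.rnn = nn.GRU(word_dim + syl_dim, hidden, bidirectional=True, batch_first=True)

    def forward(self, word_ids, syl_ids):
        # word_ids, syl_ids: (batch, m) index tensors over the syllables of one lyric sentence
        x = torch.cat([self.word_emb(word_ids), self.syl_emb(syl_ids)], dim=-1)
        outputs, _ = self.rnn(x)   # (batch, m, 2 * hidden): one hidden vector per syllable
        return outputs
```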

Context Melody Encoder

We use the context melody encoder to encode the context melody $\tilde{Y}$. The encoder is a two-layer RNN that encodes the pitch, duration and label of a musical note at each time step. Each layer is a bi-directional RNN built by two GRUs. For the first layer, we describe the forward directional GRU and the backward directional GRU at time step $k$ as follows:

(9)  $\overrightarrow{u}_k = \overrightarrow{\mathrm{GRU}}(\overrightarrow{u}_{k-1}, e(\tilde{p}_k))$
(10)  $\overleftarrow{u}_k = \overleftarrow{\mathrm{GRU}}(\overleftarrow{u}_{k+1}, e(\tilde{p}_k))$

where $\tilde{p}_k$ is the pitch attribute of the $k$-th note $\tilde{y}_k$ and $e(\cdot)$ denotes its embedding. Then, we concatenate them into one vector:

(11)  $u_k = [\overrightarrow{u}_k ; \overleftarrow{u}_k]$

The second layer encodes the output of the first layer together with the duration attribute of the melody. The employment can be described as follows:

(12)  $\overrightarrow{v}_k = \overrightarrow{\mathrm{GRU}}(\overrightarrow{v}_{k-1}, [u_k ; e(\tilde{d}_k)])$
(13)  $\overleftarrow{v}_k = \overleftarrow{\mathrm{GRU}}(\overleftarrow{v}_{k+1}, [u_k ; e(\tilde{d}_k)])$
(14)  $v_k = [\overrightarrow{v}_k ; \overleftarrow{v}_k]$

We concatenate the two output arrays of vectors into one array of vectors representing the context melody sequence:

(15)  $g_k = [u_k ; v_k]$
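
Below is an illustrative PyTorch sketch of this two-layer encoder; the class and argument names are assumptions, and the exact way the duration embedding enters the second layer follows our reading of Eqs. (12)-(14).

```python
import torch
import torch.nn as nn

class ContextMelodyEncoder(nn.Module):
    def __init__(self, n_pitches, n_durations, pitch_dim=128, dur_dim=128, hidden=256):
        super().__init__()
        self.pitch_emb = nn.Embedding(n_pitches, pitch_dim)
        self.dur_emb = nn.Embedding(n_durations, dur_dim)
        self.layer1 = nn.GRU(pitch_dim, hidden, bidirectional=True, batch_first=True)
        self.layer2 = nn.GRU(2 * hidden + dur_dim, hidden, bidirectional=True, batch_first=True)

    def forward(self, pitch_ids, dur_ids):
        u, _ = self.layer1(self.pitch_emb(pitch_ids))                       # Eqs. (9)-(11)
        v, _ = self.layer2(torch.cat([u, self.dur_emb(dur_ids)], dim=-1))   # Eqs. (12)-(14)
        return torch.cat([u, v], dim=-1)                                    # Eq. (15)
```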

Melody Decoder

The decoder predicts the next note $y_t$ from all previously predicted notes ($y_{<t}$, for short), the context musical notes $\tilde{Y}$ and the syllables $X$ of the given lyrics. We define the conditional probability when decoding the $t$-th note as follows:

(16)  $P(y_t \mid y_{<t}, \tilde{Y}, X)$

To model the three attributes of $y_t$, where we use $p_t$, $d_t$ and $l_t$ to respectively represent the pitch, duration and label, we decompose Eq. (16) into Eq. (17):

(17)  $P(y_t \mid y_{<t}, \tilde{Y}, X) = P(p_t \mid y_{<t}, \tilde{Y}, X)\, P(d_t \mid p_t, y_{<t}, \tilde{Y}, X)\, P(l_t \mid p_t, d_t, y_{<t}, \tilde{Y}, X)$

We use a three-layer RNN as decoder to respectively decode the pitch, duration and label of a musical note at each time step. We define the conditional probabilities of each layer in the decoder:

(18)  $P(p_t \mid y_{<t}, \tilde{Y}, X) = f_P(s^P_t, c_t)$
(19)  $P(d_t \mid p_t, y_{<t}, \tilde{Y}, X) = f_D(s^D_t, p_t, c_t)$
(20)  $P(l_t \mid p_t, d_t, y_{<t}, \tilde{Y}, X) = f_L(s^L_t, d_t, c_t)$

where $f_P$, $f_D$ and $f_L$ are nonlinear functions that output the probabilities of $p_t$, $d_t$ and $l_t$ respectively, and $s^P_t$, $s^D_t$ and $s^L_t$ are respectively the corresponding hidden states of each layer. $c_t$ is a dynamic context vector representing the context melody and the current syllable. We introduce the employment of $c_t$ before $s^P_t$, $s^D_t$ and $s^L_t$:

(21)  $c_t = [a_t ; h_{q_t}]$

where $a_t$ is a context vector from the context melody encoder and $h_{q_t}$ is one of the output hidden vectors of the lyrics encoder, which represents the syllable $x_{q_t}$ that should be aligned to the currently predicted $y_t$. In particular, we set $a_t$ to a zero vector if there is no context melody as input. From our representation method for lyrics-melody aligned pairs, it is not difficult to obtain the index $q_t$ of the syllable that $y_t$ should be aligned to:

(22)  $q_t = 1 + \sum_{i=1}^{t-1} l_i$
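
As a minimal sketch of this indexing rule (Eq. (22)), the helper below, whose name is ours, counts the boundary labels produced so far to locate the syllable the next note should align to.

```python
def current_syllable_index(previous_labels):
    """0-based index of the syllable the next note should align to."""
    return sum(previous_labels)   # one syllable is completed per produced label 1

# e.g. after labels [0, 1, 1, 0] the decoder is working on the third syllable (index 2)
assert current_syllable_index([0, 1, 1, 0]) == 2
```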

$a_t$ is recomputed at each step by the alignment model [Bahdanau, Cho, and Bengio2014] as follows:

(23)  $a_t = \sum_{k=1}^{K} \alpha_{tk}\, g_k$

where $g_k$ is one hidden vector from the output of the context melody encoder and the weight $\alpha_{tk}$ is computed by:

(24)  $e_{tk} = v_a^{\top} \tanh(W_a s_{t-1} + U_a g_k)$
(25)  $\alpha_{tk} = \dfrac{\exp(e_{tk})}{\sum_{k'=1}^{K} \exp(e_{tk'})}$

where $W_a$, $U_a$ and $v_a$ are learnable parameters and $s_{t-1}$ is the previous hidden state of the decoder. Finally, we obtain the dynamic context vector $c_t$ and then employ it to update $s^P_t$, $s^D_t$ and $s^L_t$ as follows:

(26)
(27)
(28)
(29)
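
For Eqs. (23)-(25), an illustrative additive (Bahdanau-style) attention module in PyTorch is sketched below; the parameter names and the choice of query state are assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn as nn

class AdditiveAttention(nn.Module):
    def __init__(self, dec_dim, enc_dim, attn_dim=256):
        super().__init__()
        self.W_a = nn.Linear(dec_dim, attn_dim, bias=False)
        self.U_a = nn.Linear(enc_dim, attn_dim, bias=False)
        self.v_a = nn.Linear(attn_dim, 1, bias=False)

    def forward(self, dec_state, enc_outputs):
        # dec_state: (batch, dec_dim); enc_outputs: (batch, K, enc_dim)
        scores = self.v_a(torch.tanh(self.W_a(dec_state).unsqueeze(1) + self.U_a(enc_outputs)))
        weights = torch.softmax(scores.squeeze(-1), dim=-1)              # alpha_tk, Eq. (25)
        return torch.bmm(weights.unsqueeze(1), enc_outputs).squeeze(1)   # a_t, Eq. (23)
```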

Objective Function

Given a training dataset $\mathcal{D}$ with lyrics-context-melody triples $(X, \tilde{Y}, Y)$, our training objective is to minimize the negative log likelihood loss with respect to the learnable model parameters $\theta$:

(30)  $\mathcal{L}(\theta) = -\sum_{(X, \tilde{Y}, Y) \in \mathcal{D}} \log P(Y \mid \tilde{Y}, X; \theta)$

where $P(Y \mid \tilde{Y}, X; \theta)$ is short for $\prod_{t=1}^{n} P(y_t \mid y_{<t}, \tilde{Y}, X; \theta)$.
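
A hedged sketch of this objective is shown below, assuming the negative log likelihood factorizes into cross-entropy terms over the pitch, duration and label predictions of each note; the function name and the unweighted sum are assumptions.

```python
import torch.nn.functional as F

def melody_nll(pitch_logits, dur_logits, label_logits, pitch_tgt, dur_tgt, label_tgt):
    """Each *_logits tensor is (num_notes, num_classes); each *_tgt tensor is (num_notes,)."""
    return (F.cross_entropy(pitch_logits, pitch_tgt)
            + F.cross_entropy(dur_logits, dur_tgt)
            + F.cross_entropy(label_logits, label_tgt))
```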

Under teacher-forcing we report PPL and the (weighted) Precision / Recall / F1 of the pitch, duration and label predictions; under sampling we report BLEU and DW.

Model       PPL    Pitch (P / R / F1)      Duration (P / R / F1)   Label (P / R / F1)      BLEU   DW
CRF         /      41.23 / 42.02 / 40.98   49.82 / 53.12 / 50.84   / / /                   2.02   25.53
Seq2seq     2.21   54.76 / 55.01 / 54.56   64.66 / 67.88 / 65.33   93.14 / 93.06 / 92.60   3.96   37.04
Songwriter  2.01   63.23 / 63.24 / 62.90   69.18 / 71.28 / 69.69   93.54 / 93.61 / 93.31   6.63   38.31

Table 2: Automatic evaluation results

Experiments

Dataset

We crawled 18,451 Chinese pop songs, whose melodies have a total duration of over 800 hours, from an online Karaoke app. We then preprocess the dataset with the rules described in [Zhu et al.2018] to guarantee the reliability of the melodies. For each song, we convert the melody to C major or A minor, which keeps all melodies in the same key, and we set the BPM (Beats Per Minute) to 60 to calculate the duration of each musical note in the melody. We further divide the lyrics into sentences with their corresponding musical notes as lyrics-melody pairs. Besides, we set a window size of 40 for the context melody and use the previous musical notes within this window as the context melody for each lyrics-melody pair to make up lyrics-context-melody triples. Finally, we obtain 644,472 triples to conduct our experiments. We randomly choose a set of songs for validation, another set for testing, and use the rest for training.
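
The following sketch, with hypothetical names, illustrates one way such lyrics-context-melody triples could be assembled from a song that is already split into per-sentence lyrics-melody pairs, using the 40-note context window mentioned above.

```python
def build_triples(sentence_pairs, window=40):
    """sentence_pairs: list of (syllables, notes) for the consecutive sentences of one song."""
    triples, previous_notes = [], []
    for syllables, notes in sentence_pairs:
        context = previous_notes[-window:]   # up to 40 musical notes preceding this sentence
        triples.append((syllables, context, notes))
        previous_notes.extend(notes)
    return triples
```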

Baselines

As the melody composition task can generally be regarded as either a sequence labeling problem or a machine translation problem, we select two state-of-the-art models as baselines.

  • CRF A modified sequence labeling model based on CRF [Lafferty, McCallum, and Pereira2001], which contains two layers for predicting the pitch and the duration, respectively. For "one-to-many" relationships, this model uses special tags to represent a series of original tags. For instance, if a syllable aligns with the two notes 'C5' and 'A4', we use a tag 'C5A4' to represent them.

  • Seq2seq A modified attention-based sequence-to-sequence model which contains two encoders and one decoder. Compared with Songwriter, Seq2seq uses the attention mechanism [Bahdanau, Cho, and Bengio2014] to capture information from the given lyrics. Seq2seq may not guarantee the alignment between the generated melody and the syllables in the given lyrics. To avoid this problem, the Seq2seq model stops predicting when the number of label-1 (boundary) notes in the predicted musical notes equals the number of syllables in the given lyrics.
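
A minimal sketch of this stopping rule, under our reading that label 1 marks a syllable boundary, is given below; the function name is ours.

```python
def should_stop(predicted_labels, num_syllables):
    """Stop decoding once every syllable in the lyrics has received its notes."""
    return sum(1 for l in predicted_labels if l == 1) >= num_syllables
```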

Implementation

Model Size

For all the models used in this paper, the number of recurrent hidden units is set to 256. In the context melody encoder and melody decoder, we treat the pitch, duration and label as tokens and use embeddings to represent them with 128, 128, and 64 dimensions, respectively. In the lyrics encoder, we use GloVe [Pennington, Socher, and Manning2014] to pre-train a char-level word embedding with 256 dimensions on a large Chinese lyrics corpus and use Pinyin (https://en.wikipedia.org/wiki/Pinyin) as the syllable features with 128 dimensions.

Parameter Initialization

We use two linear layers with the last backward hidden states of the context melody encoder to respectively initialize the hidden states of the pitch layer and duration layer in the melody decoder in Songwriter and Seq2seq. We use zero vectors to initialize the hidden states in the lyrics encoder and context melody encoder.

Training

We use Adam [Diederik P. Kingma2015] with an initial learning rate and an exponential decay rate as the optimizer to train our models in mini-batches, and we use cross entropy as the loss function.

Automatic evaluation

We use two modes to evaluate our model and baselines.

  • Teacher-forcing: As in [Roberts et al.2018], models use the ground truth as input for predicting the next step at each time step.

  • Sampling: Models predict the melody from the given lyrics without any ground truth.

Metrics

For the automatic evaluation, we follow the setup of [Roberts et al.2018]. Additionally, we select the following automatic metrics for our evaluation.

  • Perplexity (PPL) This metric is a standard evaluation measure for language models and can measure how well a probability model predicts samples. Lower PPL score is better.

  • (weighted) Precision, Recall and F1 These metrics measure the performance of predicting the different attributes of the musical notes. We calculate them with scikit-learn, setting the parameter average to 'weighted' (http://scikit-learn.org/stable/modules/classes.html#module-sklearn.metrics).

  • BLEU This metric [Papineni et al.2002] is widely used in machine translation. We use it to evaluate our predicted pitch. Higher BLEU score is better.

  • Duration of Word (DW) This metric checks whether the sum of the durations of all notes aligned to one word equals that of the ground truth. Higher DW score is better.
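
An illustrative sketch of how such a DW score could be computed is given below; reporting it as the fraction of words whose total note duration matches the ground truth is our assumed normalization.

```python
def duration_of_word_score(pred_alignment, gold_alignment):
    """Each argument: per-word lists of the durations of the notes aligned to that word."""
    assert len(pred_alignment) == len(gold_alignment)
    hits = sum(1 for p, g in zip(pred_alignment, gold_alignment) if sum(p) == sum(g))
    return hits / len(gold_alignment)
```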

Results

The results of the automatic evaluation are shown in Table 2. We can see that our proposed method outperforms all baseline models on all metrics. As Songwriter performs better than Seq2seq, this shows that the exact information of the syllables (Eq. (22)) can enhance the quality of predicting the corresponding musical notes relative to the attention mechanism in traditional Seq2seq models. In addition, the CRF model demonstrates lower performance on all metrics. In the CRF model, we use a special tag to represent multiple musical notes if a syllable aligns with more than one musical note, which produces a large number of different kinds of tags and makes it difficult for the CRF model to learn from the sparse data.

Human evaluation

Similar to text generation and dialogue response generation [Zhang and Lapata2014, Schatzmann, Georgila, and Young2005], it is challenging to accurately evaluate the quality of music composition results with automatic metrics. To this end, we invite 3 participants as human annotators to evaluate the generated melodies from our models and the ground truth melodies of human creation. We randomly select 20 lyrics-melody pairs, with the average duration of each melody approximately 30 seconds, from our testing set. For each selected pair, we prepare three melodies: the ground truth of human creation and the generated results from Songwriter and Seq2seq. Then, we synthesize all melodies with the lyrics using NiaoNiao (a singing voice synthesizer that can synthesize Chinese songs, http://www.dsoundsoft.com/product/niaoeditor/) with default settings for both the generated songs and the ground truth, which eliminates the influence of other factors of singing. As a result, we obtain 3 (annotators) × 3 (melodies) × 20 (lyrics) = 180 samples in total. The human annotations are conducted in a blind-review mode, which means that the human annotators do not know the source of the melodies during the experiments.

Metrics

We use the metrics from previous work on human evaluation for music composition, as shown below. We also include an emotion score to measure the relationship between the generated melody and the given lyrics. The human annotators are asked to rate a score from 1 to 5 after listening to the songs. Larger scores indicate better quality on all three metrics.

Model       Overall   Emotion   Rhythm
Seq2seq     3.28      3.52      2.66
Songwriter  3.83      3.98      3.52
Human       4.57      4.50      4.17

Table 3: Human evaluation results in blind-review mode

Results

Table 3 shows the human evaluation results. According to the results, Songwriter outperforms Seq2seq on all metrics, which indicates its effectiveness over the Seq2seq baseline. On the "Rhythm" metric, human annotators give significantly lower scores to Seq2seq than to Songwriter, which shows that the melodies generated by Songwriter are more natural in the pauses and durations of words than the ones generated by Seq2seq. The results further suggest that using the exact information of syllables (Eq. (22)) is more effective than the soft attention mechanism in traditional Seq2seq models for the melody composition task. We can also observe from Table 3 that the gaps between the system-generated melodies and the ones created by humans are still large on all three metrics. It remains an open challenge for future research to develop better algorithms and models to generate melodies with higher quality.

Related Work

A variety of music composition works have been done over the last decades. Most of the traditional methods compose music based on music theory and expert domain knowledge. [Chan, Potter, and Schubert2006] design rules from music theory to stitch music clips together in a reasonable way. With the development of machine learning and the increase of public music data, data-driven methods such as the Markov chain model [Pachet and Roy2011] and the graphical model [Pachet, Papadopoulos, and Roy2017] have been introduced to compose music.

Recently, deep learning has revealed its potential for musical creation. Most of these deep learning approaches use the recurrent neural network (RNN) to compose music by regarding it as a sequence. The MelodyRNN [Waite2016] model, proposed by the Google Brain Team, uses a lookback RNN and an attention RNN to capture the long-term dependency of melody. [Chu, Urtasun, and Fidler2016] propose a hierarchical RNN based model which additionally incorporates knowledge from music theory into the representation to compose not only the melody but also the drums and chords. Some recent works have also started exploring various generative adversarial network (GAN) models to compose music [Mogren2016, Yang, Chou, and Yang2017, Dong et al.2017]. [Brunner et al.2018] design recurrent variational autoencoders (VAEs) with a hierarchical decoder to reproduce short musical sequences.

Generating a lyrics-conditional melody is a subset of music composition but under more restrictions. Early works first determine the number of musical notes by counting the syllables in the lyrics and then predict the musical notes one after another by considering previously generated notes and the corresponding lyrics. [Fukayama et al.2010] use dynamic programming to compute a melody from Japanese lyrics; the calculation needs three well-designed, human-crafted constraints. [Monteith, Martinez, and Ventura2012] propose a melody composition pipeline for given lyrics: for each given lyrics, it first generates hundreds of different possibilities for rhythms and pitches, and then ranks these possibilities with a number of different metrics in order to select a final output. [Scirea et al.2015] employ Hidden Markov Models (HMM) to generate rhythm based on the phonetics of the lyrics already written; then a harmonic structure is generated, followed by the generation of a melody matching the underlying harmony. [Ackerman and Loker2017] design a co-creative automatic songwriting system, ALYSIA, based on a machine learning model using random forests, which analyzes the lyrics features to generate one note at a time for each syllable.

Conclusion and Future Work

In this paper, we propose a lyrics-conditional melody composition model which can generate a melody and the exact alignment between the generated melody and the given lyrics. We develop the melody composition model under the encoder-decoder framework; it consists of two RNN encoders, a lyrics encoder and a context melody encoder, and a hierarchical RNN decoder. The lyrics encoder encodes the syllables of the current lyrics into a sequence of hidden vectors. The context melody encoder leverages an attention mechanism to encode the context melody into a dynamic context vector. The decoder uses two layers to produce the musical notes and another layer to produce the alignment jointly. Experimental results on our dataset, which contains 18,451 Chinese pop songs, demonstrate that our model outperforms the baseline models. Furthermore, we leverage singing voice synthesis software to synthesize the "singing" of the lyrics and generated melodies for human evaluation. The results indicate that our generated melodies are more melodious and tuneful. For future work, we plan to incorporate the emotion and the style of lyrics to compose the melody.

References

  • [Ackerman and Loker2017] Ackerman, M., and Loker, D. 2017. Algorithmic songwriting with alysia. In International Conference on Evolutionary and Biologically Inspired Music and Art, 1–16. Springer.
  • [Bahdanau, Cho, and Bengio2014] Bahdanau, D.; Cho, K.; and Bengio, Y. 2014. Neural machine translation by jointly learning to align and translate. CoRR abs/1409.0473.
  • [Brunner et al.2018] Brunner, G.; Konrad, A.; Wang, Y.; and Wattenhofer, R. 2018. Midi-vae: Modeling dynamics and instrumentation of music with applications to style transfer. In Proc. Int. Society for Music Information Retrieval Conf.
  • [Chan, Potter, and Schubert2006] Chan, M.; Potter, J.; and Schubert, E. 2006. Improving algorithmic music composition with machine learning. In Proceedings of the 9th International Conference on Music Perception and Cognition, ICMPC.
  • [Cho et al.2014a] Cho, K.; van Merrienboer, B.; Gulcehre, C.; Bahdanau, D.; Bougares, F.; Schwenk, H.; and Bengio, Y. 2014a. Learning phrase representations using RNN encoder-decoder for statistical machine translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 1724–1734. Doha, Qatar: Association for Computational Linguistics.
  • [Cho et al.2014b] Cho, K.; van Merrienboer, B.; Gülçehre, Ç.; Bahdanau, D.; Bougares, F.; Schwenk, H.; and Bengio, Y. 2014b. Learning phrase representations using RNN encoder-decoder for statistical machine translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, EMNLP 2014, October 25-29, 2014, Doha, Qatar, A meeting of SIGDAT, a Special Interest Group of the ACL, 1724–1734.
  • [Chu, Urtasun, and Fidler2016] Chu, H.; Urtasun, R.; and Fidler, S. 2016. Song from pi: A musically plausible network for pop music generation. arXiv preprint arXiv:1611.03477.
  • [Diederik P. Kingma2015] Kingma, D. P., and Ba, J. 2015. Adam: A method for stochastic optimization. In Proceedings of the International Conference on Learning Representations (ICLR).
  • [Dong et al.2017] Dong, H.-W.; Hsiao, W.-Y.; Yang, L.-C.; and Yang, Y.-H. 2017. Musegan: Symbolic-domain music generation and accompaniment with multi-track sequential generative adversarial networks. arXiv preprint arXiv:1709.06298.
  • [Fukayama et al.2010] Fukayama, S.; Nakatsuma, K.; Sako, S.; Nishimoto, T.; and Sagayama, S. 2010. Automatic song composition from the lyrics exploiting prosody of the japanese language. In Proc. 7th Sound and Music Computing Conference (SMC), 299–302.
  • [Lafferty, McCallum, and Pereira2001] Lafferty, J.; McCallum, A.; and Pereira, F. C. 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data.
  • [Mogren2016] Mogren, O. 2016. C-rnn-gan: Continuous recurrent neural networks with adversarial training. arXiv preprint arXiv:1611.09904.
  • [Monteith, Martinez, and Ventura2012] Monteith, K.; Martinez, T. R.; and Ventura, D. 2012. Automatic generation of melodic accompaniments for lyrics. In ICCC, 87–94.
  • [Pachet and Roy2011] Pachet, F., and Roy, P. 2011. Markov constraints: steerable generation of markov sequences. Constraints 16(2):148–172.
  • [Pachet, Papadopoulos, and Roy2017] Pachet, F.; Papadopoulos, A.; and Roy, P. 2017. Sampling variations of sequences for structured music generation. In Proceedings of the 18th International Society for Music Information Retrieval Conference (ISMIR’2017), Suzhou, China, 167–173.
  • [Papineni et al.2002] Papineni, K.; Roukos, S.; Ward, T.; and Zhu, W.-J. 2002. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting on association for computational linguistics, 311–318. Association for Computational Linguistics.
  • [Pennington, Socher, and Manning2014] Pennington, J.; Socher, R.; and Manning, C. 2014. Glove: Global vectors for word representation. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), 1532–1543.
  • [Roberts et al.2018] Roberts, A.; Engel, J.; Raffel, C.; Hawthorne, C.; and Eck, D. 2018. A hierarchical latent vector model for learning long-term structure in music. arXiv preprint arXiv:1803.05428.
  • [Schatzmann, Georgila, and Young2005] Schatzmann, J.; Georgila, K.; and Young, S. 2005. Quantitative evaluation of user simulation techniques for spoken dialogue systems. In 6th SIGdial Workshop on DISCOURSE and DIALOGUE.
  • [Schuster and Paliwal1997] Schuster, M., and Paliwal, K. 1997. Bidirectional recurrent neural networks. Trans. Sig. Proc. 45(11):2673–2681.
  • [Scirea et al.2015] Scirea, M.; Barros, G. A.; Shaker, N.; and Togelius, J. 2015. Smug: Scientific music generator. In ICCC, 204–211.
  • [Waite2016] Waite, E. 2016. Generating long-term structure in songs and stories. Magenta Blog.
  • [Watanabe et al.2018] Watanabe, K.; Matsubayashi, Y.; Fukayama, S.; Goto, M.; Inui, K.; and Nakano, T. 2018. A melody-conditioned lyrics language model. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), volume 1, 163–172.
  • [Yang, Chou, and Yang2017] Yang, L.-C.; Chou, S.-Y.; and Yang, Y.-H. 2017. Midinet: A convolutional generative adversarial network for symbolic-domain music generation. arXiv preprint arXiv:1703.10847.
  • [Zhang and Lapata2014] Zhang, X., and Lapata, M. 2014. Chinese poetry generation with recurrent neural networks. In EMNLP, 670–680.
  • [Zhu et al.2018] Zhu, H.; Liu, Q.; Yuan, N. J.; Qin, C.; Li, J.; Zhang, K.; Zhou, G.; Wei, F.; Xu, Y.; and Chen, E. 2018. Xiaoice band: A melody and arrangement generation framework for pop music. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2837–2846. ACM.