Autoencoders are neural networks with two constituent parts (Figure 1). An encoder recognizes important information in the input and compresses it into an internal representation. A decoder reconstructs the most likely original input from a given compressed representation. Encoder and decoder are trained together, with the goal of compressing and reconstructing the training inputs. Variational Autoencoders (VAEs) Kingma and Welling, Higgins et al. further assume a continuous representation space with a Gaussian prior. Therefore, VAEs can also generate new samples by remixing features they encountered in training samples. This generative power makes VAEs particularly useful for both image and music generation Roche et al., Broad and Grierson, Roberts et al.
The image VAE was trained on a set comprising over 4000 abstract artworks from several abstract styles in the WikiArt database WikiArt, plus 10000 images generated algorithmically from music samples (see Online Supplement Wienand and Heckl). All images were downsampled to 64×64 pixels for computational manageability. To represent and generate music, we use Google's pre-trained MusicVAE Roberts et al. Specifically, we use the models trained on 2- and 16-bar, single-voice MIDI melodies. Though MIDI is musically limiting, it allows us to work with melodies and build on clearer, existing translation processes (see below), which would be harder with natural sound.
Learning synesthetic associations
The translation between images and music has at its core a set of synesthetic associations between visual and musical information. Instead of leaving this key component entirely to the idiosyncrasies of the neural networks (as in Müller-Eberstein and van Noord ), we decided to ground it in a simple algorithmic conversion, based on Atomare Klangwelte Heckl . Inspired by scanning microscopes, the algorithm reads the image row by row, turning each pixel into a note. Conversely, it also reads music note by note and renders it in pixel sequences, with longer notes becoming longer strings of pixels.
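The scanning conversion can be sketched in a few lines. The following is an illustrative reimplementation, not the authors' exact code; `pixel_to_note` and `note_to_pixel` stand in for the note-color map described in the next section:

```python
def image_to_notes(img, pixel_to_note):
    """Read an image row by row, like a scanning microscope, turning each
    pixel into a note. Consecutive pixels mapping to the same note merge
    into one longer note (duration counted in pixels)."""
    notes = []  # list of [note, duration] pairs
    for row in img:
        for px in row:
            note = pixel_to_note(px)
            if notes and notes[-1][0] == note:
                notes[-1][1] += 1  # extend the previous note
            else:
                notes.append([note, 1])
    return notes

def notes_to_pixels(notes, note_to_pixel):
    """Inverse direction: each note becomes a run of identical pixels,
    longer notes becoming longer strings of pixels."""
    pixels = []
    for note, duration in notes:
        pixels.extend([note_to_pixel(note)] * duration)
    return pixels
```

With identity mappings, a round trip reproduces the flattened pixel sequence, which is the sense in which the conversion is reversible.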
Technical limitations restrict our choice of translation rules. MusicVAE, for one, works within strictly limited musical parameters: it requires single-voice MIDI samples, all with the same tempo and length, and notes of constant velocity (i.e., volume). Therefore, notes are reduced to pitch and duration. To enrich the information carried by each note, we express each pitch as an octave plus a note on the chromatic scale.
For pixels, we limit information loss by working in color (instead of grayscale as in Heckl). We express each pixel's color in the Hue-Chroma-Value space (Figure 2(a)) with Chroma (similar to saturation) fixed. The note-color mapping combines physics-inspired and shared human synesthetic correspondences Parise and Spence. Specifically, we associate the color's Hue with a note on the chromatic scale based on wave frequencies (Figure 2(b)): from a red C (lowest sound frequency on the chromatic scale and lowest frequency of visible light) to a violet B (highest sound and visible light frequency). Though not immediate for humans, we imagine this numerical link can be an intuitive audiovisual correspondence for the machine. We also map the color's Value (essentially, the luminosity) to the note's octave (between C2 and B5, Figure 2(c)), following the correspondence between low pitches and darkness, common to humans too Parise and Spence. This map (incidentally similar to Corra, Castro) forms the basis of the perceptual associations of our networks, learned through repeated exposure and ingrained in the translation network.
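One way to realize this map in code is sketched below. The exact hue boundaries and rounding are our assumptions (the paper specifies only the endpoints, red C to violet B, and the C2-B5 octave range), so this is a plausible reading of Figure 2, not the authors' implementation:

```python
NOTE_NAMES = ['C', 'C#', 'D', 'D#', 'E', 'F',
              'F#', 'G', 'G#', 'A', 'A#', 'B']

def hue_to_chroma_step(hue_deg):
    """Map Hue (degrees, red = 0) to a chromatic-scale step:
    red -> C (lowest), violet -> B (highest). Illustrative linear
    split of the hue circle into 12 equal steps."""
    return int(hue_deg / 360.0 * 12) % 12

def value_to_octave(value):
    """Map Value in [0, 1] to an octave from 2 (dark) to 5 (bright)."""
    return 2 + min(int(value * 4), 3)

def pixel_to_midi(hue_deg, value):
    """Combine both maps into a MIDI note number, using the common
    convention where middle C (C4) is 60, so C2 = 36 and B5 = 83."""
    return 12 * (value_to_octave(value) + 1) + hue_to_chroma_step(hue_deg)
```

A dark red pixel then lands on C2 and a bright violet one on B5, spanning the full four-octave range.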
Translation network training
We use the note-color map to turn 10000 MusicVAE-generated melodies into images (Figure 2(d); these images were also added to the training set of the image VAE). The respective encoders compress each melody and image to their representations. These music-image pairs in representation space serve as ground truth to train simple multilayer perceptron networks connecting the two representation spaces (Figures 1 and 2(e)). Thus the translation networks learn the synesthetic correspondences implicitly, connecting representation features rather than pixels and notes. Furthermore, the correspondences originate from connections between neural networks and repeated joint exposure, reflecting some hypotheses on the origin of human synesthesia Parise and Spence, Ramachandran and Hubbard.
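In outline, training the translation network is supervised regression between the two latent spaces. A minimal single-hidden-layer perceptron in plain NumPy conveys the idea; the layer size, learning rate, and squared-error loss are illustrative assumptions, since the paper does not specify the architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

def train_translator(z_img, z_music, hidden=64, lr=1e-2, epochs=500):
    """Fit a one-hidden-layer MLP mapping image latents to music latents
    by full-batch gradient descent on mean squared error.
    z_img: (n_pairs, d_img); z_music: (n_pairs, d_music)."""
    d_in, d_out = z_img.shape[1], z_music.shape[1]
    W1 = rng.normal(0, 0.1, (d_in, hidden)); b1 = np.zeros(hidden)
    W2 = rng.normal(0, 0.1, (hidden, d_out)); b2 = np.zeros(d_out)
    n = len(z_img)
    for _ in range(epochs):
        h = np.tanh(z_img @ W1 + b1)            # hidden activations
        pred = h @ W2 + b2                       # predicted music latents
        err = pred - z_music                     # gradient of MSE (up to 2/n)
        gW2 = h.T @ err / n; gb2 = err.mean(axis=0)
        dh = (err @ W2.T) * (1 - h ** 2)         # backprop through tanh
        gW1 = z_img.T @ dh / n; gb1 = dh.mean(axis=0)
        W1 -= lr * gW1; b1 -= lr * gb1
        W2 -= lr * gW2; b2 -= lr * gb2
    return lambda z: np.tanh(z @ W1 + b1) @ W2 + b2
```

The returned function is the trained image-to-music direction; the music-to-image network is trained symmetrically on the same pairs.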
To reinforce the correspondences, we use simplified images to further train the translation network. From each of WikiArt's "Color Field Painting" and "Hard-Edge Painting" styles, 50 works were selected at random and broken into 64×64 tiles (almost 10000 tiles in total). Each tile is a segment of its original, unmodified picture; putting all tiles from the same picture side by side would reconstruct the original. These tiles are thus reduced-information versions of real-world examples (instead of the synthetic images used previously). The translation network converts them to melodies. Encoding tiles and melodies produces new music-image pairs, which serve as ground truth to train the translation network again (Figure 2(f)). This reinforces the correspondences and expands the range of representations for which the translation network has learned definite associations. The entire training process was carried out twice: once using the 16-bar MusicVAE, once using the 2-bar version.
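The tiling step itself is straightforward; a sketch follows. Discarding any remainder at the right and bottom edges is our assumption for paintings whose sides are not multiples of 64:

```python
import numpy as np

def tile_image(img, tile=64):
    """Cut an image array of shape (H, W, C) into non-overlapping
    tile x tile segments, read row by row; each tile is an unmodified
    crop of the original picture."""
    h, w = img.shape[:2]
    return [img[r:r + tile, c:c + tile]
            for r in range(0, h - tile + 1, tile)
            for c in range(0, w - tile + 1, tile)]
```

Laying the returned tiles back out in row-major order reconstructs the (cropped) original, as described above.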
From this process emerges an "artificial synesthete" that is inspired by the color-note associations rather than bound to them.
Playing images, painting music
Figure 3 shows some examples of the artificial synesthete's work. Since MusicVAE only generates single-voice MIDI, the music samples focus entirely on the central melodic motif. Image generation is the opposite: in its pictures, the synesthete conveys blurry impressions, with few details to focus on.
We made the synesthete generate 16-bar melodies from the images in the top row of Figure 3(a) (MIDI samples and sheet music in the Online Supplement Wienand and Heckl). The samples clearly show that the synesthete bends the MusicVAE generator to fit its experience: melodies are more dissonant and have fewer rests and less rhythmic variation than typical MusicVAE samples (see Online Supplement). The melody composed from the first sketch on the left obsessively repeats the same few notes, reflecting the few colors in the image. However, these are very short notes (mostly 16ths), so the networks seem to recognize the elementary composition of the sketch, but see small chunks of different shades instead of a large, flat field. Indeed, neural networks are known to perceive structures and patterns in the pixels that are invisible to humans Goodfellow et al. The second sketch has a clearer structure (two equal rectangles), which translates to a rhythmic structure (roughly every third bar is entirely quarter notes). Similarly, the third sketch inspires a melody clearly divided in three parts: beginning with short notes and gradually slowing to all quarter notes. Translating the fourth, darker sketch, the synesthete clearly perceives lower shades and pitches, and even rests. Finally, consider Lipide. Much of the detail is lost to the networks (see the autoencoder's reconstruction in the middle row), which see a grey-blue haze with few details emerging. Accordingly, the synesthete generates a melody anchored to an A (blue in the map). Small musical figures briefly appear before melting back, like objects in the fog, or the details in the reconstructed picture. The inverse process, translating these melodies back to images, produces the pictures at the bottom of Figure 3(a). These are visually similar to those obtained from a simple pass through the image autoencoder (middle row), despite some information loss. In other words, the translation is consistently reversible.
The synesthete was also made to translate samples of well-known classical music into images, generating series with clearly different flavors for different pieces, depicted in Figure 3(b) (MIDI samples and sheet music in the Online Supplement Wienand and Heckl). Altogether, lighter images mirror higher-pitched samples, and one can recognize some composition elements. Bach's Prelude No. 1 in C Major BWV 846 (top row) has mostly yellow-orange tints with green features, structurally quite similar to each other. Correspondingly, the music samples all present the same repeating structure. Beethoven's Waldstein Sonata Op. 53 (bottom row) skews blue. Brown or reddish features and diversified structures mirror the diverse melodies (in particular the second and fourth sample).
Figure 3(c) quantifies the diversity of music and images (calculation details in the Online Supplement). The bars indicate the average normalized distance in representation space between the samples in Figure 3(b) and the average of the corresponding series (orange for Bach's Prelude, blue for Beethoven's Waldstein). The Waldstein samples show higher heterogeneity than those from the Prelude (which have, for example, a strict rhythmic structure), and so do the translated picture series. This means that MusicVAE sees the Prelude samples as more similar to each other than the Waldstein ones, and therefore encodes them into a smaller region of the representation space. Analogously, according to the image VAE, the picture series obtained from the Prelude is less heterogeneous than that from the Waldstein.
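A heterogeneity measure of this kind can be sketched as follows. The normalization by latent dimension is our assumption for comparability across the two representation spaces; the paper's exact formula is in its Online Supplement:

```python
import numpy as np

def heterogeneity(latents, normalize=True):
    """Mean Euclidean distance of each encoded sample from the centroid
    of its series. Dividing by sqrt(dimension) makes values comparable
    across latent spaces of different size (an assumption; the paper's
    exact normalization is given in its Online Supplement)."""
    z = np.asarray(latents, dtype=float)
    d = np.linalg.norm(z - z.mean(axis=0), axis=1).mean()
    return d / np.sqrt(z.shape[1]) if normalize else d
```

A series whose samples cluster tightly in representation space scores low; a spread-out series like the Waldstein samples scores higher.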
Variational autoencoders can also interpolate between samples, generating intermediate images and music Wienand and Heckl, Roberts et al. We leverage this capability to produce suggestive video sequences. After picking two images, we had the synesthete translate each into a 2-bar melody. Using a spherical interpolation between the encoded representations of the two melodies, we generated 7 intermediate music samples (enough to perceive a gradual change, while keeping a limited cumulative duration). The concatenation of these melodies gives the audio track. Analogously, we obtain the video track from the series of images (24 samples per second of music) obtained by interpolating between the starting and ending pictures. Figure 4 shows a sketch of the process, as well as example bars and frames from the video in the Online Supplement Wienand and Heckl. Visually, the top and bottom of Nassos Daphnis' 11-68 slowly morph into two separate circles. Simultaneously, we hear the rapid 16th notes become longer and more staccato. Gradually, the pitch shifts down, until the last two bars, which represent a detail from Heckl's Coronen Moleküle.
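Spherical interpolation (slerp) is the usual choice for Gaussian VAE latents, since straight-line mixes of two high-dimensional samples fall toward the origin, outside the prior's typical set. A sketch of the step used above:

```python
import numpy as np

def slerp(z0, z1, t):
    """Spherically interpolate between latent vectors z0 and z1.
    t = 0 returns z0, t = 1 returns z1; intermediate t values follow
    the great-circle arc between the two directions."""
    z0, z1 = np.asarray(z0, float), np.asarray(z1, float)
    cos_omega = np.dot(z0, z1) / (np.linalg.norm(z0) * np.linalg.norm(z1))
    omega = np.arccos(np.clip(cos_omega, -1.0, 1.0))
    if np.isclose(omega, 0.0):
        return (1 - t) * z0 + t * z1  # nearly parallel: fall back to lerp
    return (np.sin((1 - t) * omega) * z0 + np.sin(t * omega) * z1) / np.sin(omega)
```

The 7 intermediate melodies then come from decoding `slerp(z_start, z_end, t)` at 7 evenly spaced `t` values strictly between 0 and 1.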
Conclusion: Towards a more creative synesthete
We developed an "artificial synesthete" that translates between pictures and music. Using variational autoencoders (VAEs), a type of neural network, the machine learns to read and organize visual and musical information. Its synesthetic ability is rooted in a set of learned correspondences between images and melodies. These correspondences are the synesthete's interpretation of an underlying note-color map. Similar learning experiences are thought to be the basis of some widely shared cross-modal correspondences in humans Parise and Spence, Ramachandran and Hubbard, Maurer et al. The resulting translations are novel works, not transposed data.
The learned correspondences set our artificial synesthete apart from algorithmic data transposition Castro, Heckl and from idiosyncratic translation networks Müller-Eberstein and van Noord, as they give us insight into what the synesthete sees. Moreover, they increase the creativity of the machine, at least in the framework of Colton's simple "creativity tripod" Colton (made of appreciation, imagination, and skill). In fact, these correspondences give the machine better appreciation. Based on these guidelines, the synesthete decides what is more or less important to translate, and which translation choices are better or worse. The weakest leg of the tripod for our synesthete is certainly imagination. By construction, VAEs remix features they have seen: they cannot generate fully novel works. Training the image VAE on a larger and more diverse set of images would give the synesthete a broader representation space. This would at least increase the perception of imagination, although the network would remain a remixer. Finally, our system still has limited generation skill: the images are still very blurry, and the music feels flat. Larger training samples could improve this aspect as well. Alternatively, generative adversarial networks have shown potential for integration with VAEs for improved image generation Broad and Grierson, Larsen et al. Music skill could be improved similarly, by layering further networks that elaborate on the MusicVAE-generated samples, e.g., varying tempo and intensity, or adding voices and harmony.
Finally, emotions play an important role in color-music correspondences for humans (synesthetes and non-synesthetes alike Curwen, Palmer et al.) but not in our neural networks. Affective tagging, already in use to analyze and generate images and music Alvarez-Melis and Amores, You et al., Alameda-Pineda et al., Mohammad and Kiritchenko, Ehrlich et al., could bridge this gap. However, the unemotional experience of the artificial synesthete can, by contrast, highlight our own emotions, as well as elicit new ones when it presents translations we find unexpected or discordant. This also spotlights the deeply personal nature of all perception, synesthetic and not. Comparing and contrasting with the machine's synthetic perception helps us explore our own, organic experience of music, image, and their pairing.
- Alameda-Pineda et al.  Xavier Alameda-Pineda, Elisa Ricci, Yan Yan, and Nicu Sebe. Recognizing Emotions From Abstract Paintings Using Non-Linear Matrix Completion. pages 5240–5248, 2016. URL http://openaccess.thecvf.com/content_cvpr_2016/html/Alameda-Pineda_Recognizing_Emotions_From_CVPR_2016_paper.html.
- Alvarez-Melis and Amores  David Alvarez-Melis and Judith Amores. The Emotional GAN: Priming Adversarial Generation of Art with Emotion. page 4, Long Beach, CA, USA, 2017.
- Barrass and Kramer  Stephen Barrass and Gregory Kramer. Using sonification. Multimedia Systems, 7(1):23–31, January 1999. ISSN 1432-1882. doi: 10.1007/s005300050108. URL https://doi.org/10.1007/s005300050108.
- Berg  Paul Berg. Composing Sound Structures with Rules. Contemporary Music Review, 28(1):75–87, February 2009. ISSN 0749-4467. doi: 10.1080/07494460802664049. URL https://doi.org/10.1080/07494460802664049.
- Briot et al.  Jean-Pierre Briot, Gaëtan Hadjeres, and François-David Pachet. Deep Learning Techniques for Music Generation – A Survey. arXiv:1709.01620 [cs], August 2019. URL http://arxiv.org/abs/1709.01620.
- Broad and Grierson  Terence Broad and Mick Grierson. Autoencoding Blade Runner: reconstructing films with artificial neural networks. Leonardo, 50(4):376–383, 2017. doi: 10.1162/LEON_a_01088.
- Carnovalini and Rodá  Filippo Carnovalini and Antonio Rodá. Computational creativity and music generation systems: An introduction to the state of the art. Front. Artif. Intell., 3, 2020. ISSN 2624-8212. doi: 10.3389/frai.2020.00014. URL https://www.frontiersin.org/articles/10.3389/frai.2020.00014/full#B61.
- Castro  Pablo Samuel Castro. JiDiJi: Music Colours. URL https://jidiji.glitch.me/about.html.
- Colton  Simon Colton. Creativity Versus the Perception of Creativity in Computational Systems. page 7, 2008.
- Corra  Bruno Corra. Abstract Cinema, 1912. URL http://www.ubu.com/papers/corra_abstract-cinema.html.
- Curwen  Caroline Curwen. Music-colour synaesthesia: Concept, context and qualia. Consciousness and Cognition, 61:94–106, May 2018. ISSN 1053-8100. doi: 10.1016/j.concog.2018.04.005. URL http://www.sciencedirect.com/science/article/pii/S1053810017305883.
- Daudrich  Anna Daudrich. Algorithmic art and its art-historical relationships. Journal of Science and Technology of the Arts, pages 37–44, January 2016. doi: 10.7559/CITARJ.V8I1.220. URL https://revistas.ucp.pt/index.php/jsta/article/view/7259.
- Diaz-Jerez  Gustavo Diaz-Jerez. Composing with Melomics: Delving into the Computational World for Musical Inspiration. Leonardo Music Journal, 21:13–14, December 2011. ISSN 0961-1215, 1531-4812. doi: 10.1162/LMJ_a_00053. URL https://direct.mit.edu/lmj/article/63657.
- Dubus and Bresin  Gaël Dubus and Roberto Bresin. A Systematic Review of Mapping Strategies for the Sonification of Physical Quantities. PLOS ONE, 8(12):e82491, December 2013. ISSN 1932-6203. doi: 10.1371/journal.pone.0082491. URL https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0082491.
- Ehrlich et al.  Stefan K. Ehrlich, Kat R. Agres, Cuntai Guan, and Gordon Cheng. A closed-loop, music-based brain-computer interface for emotion mediation. PLOS ONE, 14(3):e0213516, March 2019. ISSN 1932-6203. doi: 10.1371/journal.pone.0213516. URL https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0213516.
- Fernandez and Vico  J. D. Fernandez and F. Vico. AI methods in algorithmic composition: A comprehensive survey. Journal of Artificial Intelligence Research, 48:513–582, November 2013. ISSN 1076-9757. doi: 10.1613/jair.3908. URL https://www.jair.org/index.php/jair/article/view/10845.
- Goodfellow et al.  Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative Adversarial Nets. page 9, 2014.
- Goodfellow et al.  Ian J. Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and Harnessing Adversarial Examples. arXiv:1412.6572 [cs, stat], March 2015. URL http://arxiv.org/abs/1412.6572.
- Heckl  Wolfgang M. Heckl. Atomare Klangwelten. Technical Report 1/2006, Andrea von Braun Stiftung, 2006. URL http://www.avbstiftung.de/fileadmin/uploads/media/AVB_LP_01_WolfgangHeckl.pdf.
- Higgins et al.  Irina Higgins, Loic Matthey, Arka Pal, Christopher Burgess, Xavier Glorot, Matthew Botvinick, Shakir Mohamed, and Alexander Lerchner. beta-VAE: Learning Basic Visual Concepts With a Constrained Variational Framework. page 13, 2017.
- Hiller and Isaacson  L. A. Hiller Jr. and L. M. Isaacson. Musical Composition with a High-Speed Digital Computer. JAES, 6(3):154–160, July 1958. URL https://www.aes.org/e-lib/browse.cfm?elib=231.
- Jordanous  Anna Jordanous. A Standardised Procedure for Evaluating Creative Systems: Computational Creativity Evaluation Based on What it is to be Creative. Cogn Comput, 4(3):246–279, September 2012. ISSN 1866-9956, 1866-9964. doi: 10.1007/s12559-012-9156-1. URL http://link.springer.com/10.1007/s12559-012-9156-1.
- Kandinsky  Wassily Kandinsky. Über das Geistige in der Kunst. Piper, München, 1912.
- Kingma and Welling  Diederik P. Kingma and Max Welling. Auto-Encoding Variational Bayes. arXiv:1312.6114 [cs, stat], December 2013. URL http://arxiv.org/abs/1312.6114.
- Kramer et al.  Gregory Kramer, Bruce Walker, Terri Bonebright, Perry Cook, John H Flowers, Nadine Miner, and John Neuhoff. Sonification Report: Status of the Field and Research Agenda. Faculty Publications, Department of Psychology, page 31, 2010.
- Larsen et al.  Anders Boesen Lindbo Larsen, Søren Kaae Sønderby, Hugo Larochelle, and Ole Winther. Autoencoding beyond pixels using a learned similarity metric. page 9, 2016.
- Maurer et al.  Daphne Maurer, Thanujeni Pathman, and Catherine J. Mondloch. The shape of boubas: sound-shape correspondences in toddlers and adults. Developmental Sci, 9(3):316–322, May 2006. ISSN 1363-755X, 1467-7687. doi: 10.1111/j.1467-7687.2006.00495.x. URL http://doi.wiley.com/10.1111/j.1467-7687.2006.00495.x.
- Mohammad and Kiritchenko  Saif M Mohammad and Svetlana Kiritchenko. WikiArt Emotions: An Annotated Dataset of Emotions Evoked by Art. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), page 14, 2018.
- Moritz  William Moritz. Optical Poetry: The Life and Work of Oskar Fischinger. Indiana University Press, 2004. ISBN 978-0-253-34348-2.
- Müller-Eberstein and van Noord  Maximilian Müller-Eberstein and Nanne van Noord. Translating Visual Art into Music. arXiv:1909.01218 [cs, eess], September 2019. URL http://arxiv.org/abs/1909.01218.
- Noll  A. M. Noll. Patterns by 7090. Technical report, Bell Telephone Laboratories, 1962.
- Ox and Keefer  Jack Ox and Cindy Keefer. On Curating Recent Digital Abstract Visual Music, 2006. URL http://www.centerforvisualmusic.org/Ox_Keefer_VM.htm.
- Palmer et al.  Stephen E. Palmer, Karen B. Schloss, Zoe Xu, and Lilia R. Prado-León. Music-color associations are mediated by emotion. PNAS, 110(22):8836–8841, May 2013. ISSN 0027-8424, 1091-6490. doi: 10.1073/pnas.1212562110. URL https://www.pnas.org/content/110/22/8836.
- Palmer et al.  Stephen E. Palmer, Thomas A. Langlois, and Karen B. Schloss. Music-to-color associations of single-line piano melodies in non-synesthetes. Multisensory Research, 29(1-3):157–193, January 2016. ISSN 2213-4808, 2213-4794. doi: 10.1163/22134808-00002486. URL https://brill.com/view/journals/msr/29/1-3/article-p157_8.xml.
- Parise and Spence  Cesare Parise and Charles Spence. Audiovisual cross-modal correspondences in the general population. In Oxford handbook of synesthesia, page 27. 2013.
- Ramachandran and Hubbard  V S Ramachandran and E M Hubbard. Synaesthesia – a window into perception, thought and language. page 33, 2001.
- Roberts et al. [2018a] Adam Roberts, Jesse Engel, Colin Raffel, Curtis Hawthorne, and Douglas Eck. A Hierarchical Latent Vector Model for Learning Long-Term Structure in Music. arXiv:1803.05428 [cs, eess, stat], March 2018a. URL http://arxiv.org/abs/1803.05428.
- Roberts et al. [2018b] Adam Roberts, Jesse Engel, Colin Raffel, Curtis Hawthorne, and Douglas Eck. MusicVAE: Creating a palette for musical scores with machine learning., 2018b. URL https://magenta.tensorflow.org/music-vae.
- Roche et al.  Fanny Roche, Thomas Hueber, Samuel Limier, and Laurent Girin. Autoencoders for music sound synthesis: a comparison of linear, shallow, deep and variational models. arXiv:1806.04096 [cs, eess], June 2018. URL http://arxiv.org/abs/1806.04096.
- Rus  Jacob Rus. HCL-HCV Models, via Wikimedia Commons, February 2010. URL https://commons.wikimedia.org/w/index.php?curid=9760394.
- Saunders  Rob Saunders. Towards Autonomous Creative Systems: A Computational Approach. Cogn Comput, 4(3):216–225, September 2012. ISSN 1866-9956, 1866-9964. doi: 10.1007/s12559-012-9131-x. URL http://link.springer.com/10.1007/s12559-012-9131-x.
- Taylor  Grant D. Taylor. When the Machine Made Art. Bloomsbury Academic, 1st edition, 2014. URL https://www.bloomsbury.com/us/when-the-machine-made-art-9781623562724/.
- Walker and Nees  Bruce N. Walker and Michael A. Nees. Theory of Sonification. In Thomas Hermann, Andy Hunt, and John G. Neuhoff, editors, The sonification handbook. Logos Verlag, Berlin, 2011. ISBN 978-3-8325-2819-5.
- Wienand and Heckl  Karl Wienand and Wolfgang M. Heckl. Online Supplement. URL https://doi.org/10.6084/m9.figshare.11394219.
- WikiArt  WikiArt.org - Visual Art Encyclopedia. URL https://www.wikiart.org/.
- Xenakis  Iannis Xenakis. Formalized Music: Thought and Mathematics in Composition. Pendragon Press, 1992. ISBN 978-1-57647-079-4.
- You et al.  Quanzeng You, Jiebo Luo, Hailin Jin, and Jianchao Yang. Building a Large Scale Dataset for Image Emotion Recognition: The Fine Print and The Benchmark. In Thirtieth AAAI Conference on Artificial Intelligence, February 2016. URL https://www.aaai.org/ocs/index.php/AAAI/AAAI16/paper/view/12272.