Low-dimensional Embodied Semantics for Music and Language

by Francisco Afonso Raposo, et al.

Embodied cognition states that semantics is encoded in the brain as firing patterns of neural circuits, which are learned according to the statistical structure of human multimodal experience. However, each human brain is idiosyncratically biased, according to its subjective experience history, making this biological semantic machinery noisy with respect to the overall semantics inherent to media artifacts, such as music and language excerpts. We propose to represent shared semantics using low-dimensional vector embeddings by jointly modeling several brains from human subjects. We show these unsupervised efficient representations outperform the original high-dimensional fMRI voxel spaces in proxy music genre and language topic classification tasks. We further show that joint modeling of several subjects increases the semantic richness of the learned latent vector spaces.


1 Introduction

The current consensus in cognitive science holds that conceptual knowledge is encoded in the brain as firing patterns of neural populations (Kiefer and Pulvermüller, 2012). Correlated patterns of experience trigger the firing of neural populations, creating neural circuits that wire them together. Neural circuits represent semantic inferences involving the encoded concepts and connect neurons responsible for encoding different modalities of experience, such as emotional, linguistic, visual, auditory, somatosensory, and motor (Kiefer and Pulvermüller, 2012; Lakoff, 2014; Pulvermüller, 2018; Ralph et al., 2017). Thus, cognition implies transmodal and distributed aspects that allow embodied meaning to be grounded, via semantic inferences which can be triggered by any concept, regardless of its level of abstraction. This multimodal semantics, which draws on supporting evidence from several disciplines, such as psychology and neuroscience (Desai et al., 2011; Lakoff, 2012; Thibodeau and Boroditsky, 2013), is appropriately termed “embodied cognition”.

Embodied cognition holds that semantics is biologically implemented by neural circuitry which captures the statistical structure of human experience. This motivates a statistical approach, based on brain activity, to computationally model semantics. That is, since patterns of neural activity reflect conceptual encoding of experience, statistical models of neural activity (e.g., measured with functional Magnetic Resonance Imaging (fMRI)) will capture conceptual knowledge, i.e., semantics. While each individual brain is a rich source of semantic information, it is also biased according to the life experience of that individual, meaning that idiosyncratic semantic inferences are always triggered for any subject, which implies a “noisy” semantic system. We claim that richer shared semantic descriptions can be captured by jointly modeling the statistics of neural firing of several similarly stimulated brains.

In this work, we propose modeling shared semantics as low-dimensional encodings via statistical inter-subject brain agreement using Generalized Canonical Correlation Analysis (GCCA). We model both music and natural language semantics based on fMRI recordings of subjects listening to music and being presented with linguistic concepts. By correlating the brain responses of several subjects to the same stimuli, we intend to uncover the latent shared semantics that are encoded in the fMRI data representing human (brain) interpretation of media. Shared semantics refers to the shared statistical structure of the neural firing, i.e., the shared meaning of media artifacts, across a population of individuals. This is a useful concept for automatic inference of semantics from brain data that is meaningful for most people. We evaluate how semantic richness evolves with the number of subjects involved in the modeling process, via across-subject retrieval of fMRI responses and proxy, downstream, semantic tasks for each modality: music genre and language topic classifications. Results show that these low-dimensional embeddings not only outperform the original fMRI voxel space of several thousand dimensions, but also that their semantic richness improves as the number of modeled subjects increases.

The rest of this article is organized as follows: Section 2 reviews related work on embodied cognition, both in terms of music and natural language, as well as computational approaches leveraging some of its aspects; Section 3 describes our experimental setup; Section 4 presents the results; Section 5 discusses the results; and Section 6 draws conclusions and considers future work.

2 Related work

Embodied cognition has been accumulating increasing amounts of supporting evidence in cognitive science (Kiefer and Pulvermüller, 2012; Lakoff, 2012). This research direction was initially motivated by the fact that embodied cognition is able to account for the semantic grounding of natural language. Grounding is rooted in embodied primitives, such as somatosensory and sensorimotor concepts (Lakoff, 2014), which implies humans ultimately understand high-level (linguistic) concepts in terms of what their mediating physical bodies afford in their physical environment. The Neural Theory of Thought and Language (NTTL) is an embodied cognition framework proposed by Lakoff (2012), which casts semantic inference as “conceptual metaphor”. This is based on the observation that metaphorical thought and understanding are independent of language. Linguistic and other kinds of metaphors (e.g., gestural and visual) are seen as surface realizations of conceptual metaphors, which are realized in the brain via neural circuits learned according to repeated patterns of experience. For instance, the abstract concept of “communicating ideas via language” is understood via a conceptual metaphor: ideas are objects; language is a container for idea-objects; and communication is sending idea-objects in language-containers. This metaphor maps a source domain of sending objects in containers to a target domain of communicating ideas via language (Lakoff, 2014). Ralph et al. (2017) proposed the Controlled Semantic Cognition (CSC) framework which, much like NTTL, features multimodal and distributed aspects of cognition, i.e., semantics is also defined by transmodal circuits connecting different modality-specific neuron populations distributed across the whole brain. It contrasts with NTTL, however, by proposing the existence of a transmodal hub in the Anterior Temporal Lobes. Pulvermüller (2018) introduced the concepts of semantic “kernel”, “halo”, and “periphery” in order to address different levels of generality of conceptual semantic features. The periphery contains information about very idiosyncratic features and referents, whereas the kernel contains the most generic features of a concept. The generality level of the halo features falls in between the levels of the kernel and periphery. These correspond to larger (periphery) to smaller (kernel) sets of neurons encoding semantic information about concepts and are merely approximations of what is likely to be a smooth continuum of increasingly generic (and increasingly small) sets of neurons. Hebbian (and anti-Hebbian) learning is what allows “semantic feature extraction” (conceptual metaphor) to happen. Conceptual flexibility is proposed to be implemented via neurobiological mechanisms such as priming and gain control modulation (Pulvermüller, 2018). A review of several families of conceptual knowledge encoding theories is presented by Kiefer and Pulvermüller (2012), where the authors conclude by arguing in favor of the embodiment family.

Even though embodied cognition was initially motivated by the acquisition of a deeper understanding of natural language semantics, it is inherently multimodal and generic. Therefore, it also accounts for music semantics and, indeed, some work has already been published following this cognitive perspective. Korsakova-Kreyn (2018) emphasizes the role of modulations of physical tension in music semantics. These modulations are conveyed by tonal and temporal relationships in music artifacts, consisting of manipulations of tonal tension in a tonal framework (musical scale). This embodied perspective is reinforced by the fact that tonal perception seems to be biologically driven, since one-day-old babies physically perceive tonal tension (Virtala et al., 2013). This is thought to be a consequence of the “principle of least effort”, where consonant sounds, which consist of more harmonic overtones, are more easily, i.e., efficiently, processed and compressed by the brain than dissonant sounds, creating a more pleasant experience (Bidelman and Krishnan, 2009). Leman (2010) also claims music semantics is defined by the mediation process of the listener, i.e., the human body and brain are responsible for mapping from the physical modality (audio) to the experienced modality. His theory, Embodied Music Cognition (EMC), also supports the idea that semantics is motivated by affordances, i.e., music is understood in a way that is relevant for functioning in a physical environment. Koelsch et al. (2019) proposed the Predictive Coding (PC) framework, also pointing to the involvement of transmodal neural circuits in both prediction and prediction error resolution of musical content. This process of active inference makes use of “mental action” while listening to music, thus also pointing to the role of both action and perception in semantics.

Given the growing consensus that brain activity is a manifestation of semantic inferences, it is only natural to model it computationally. In particular, statistical modeling seems appropriate, given the seemingly stochastic nature of human brain dynamics. The most straightforward way to capture embodied semantics, which implies that conceptual knowledge is spatially encoded across the brain, is to model fMRI data, which consist of volume readings of brain activity, thus providing a direct reading from a specific spatial location. This makes the decoding of such data easier than, for instance, decoding Electroencephalograms, which are scalp measurements of electrical activity, i.e., the readings of each scalp sampling location result from aggregated activity originating from many distributed spatial locations in the brain. Accordingly, Pereira et al. (2018) proposed a method for decoding linguistic meaning from fMRI, using ridge regression (Hoerl and Kennard, 1970) to learn a mapping from fMRI to Global Vectors for Word Representation (GloVe) embeddings (Pennington et al., 2014), which are statistically learned to describe textual meaning. Within-subject models generalized to new concepts to a statistically significant degree. More importantly, those models were able to generalize to whole sentences, even though they were only trained with single concepts. The authors found that the most informative voxels were widely distributed across the brain, as predicted by embodied cognition. Casey (2017) performed analogous experiments with fMRI responses to short music clips spanning five music genres. Support Vector Machine (SVM) (Cortes and Vapnik, 1995) classification of both stimuli and genres significantly outperformed the random baselines. Furthermore, melodic features were extracted from the stimuli and used for ridge regression of voxel sphere responses, also achieving significant results for some brain regions.

Our approach differs from previous computational approaches in the following ways: we never use vector space descriptions extracted from stimuli data in our models, as opposed to, for instance, fMRI-based GloVe regression (Pereira et al., 2018) or melody-based voxel sphere regression (Casey, 2017); and we do not constrain our semantic vector space learning to any taxonomy, as opposed to, for instance, fMRI-based genre classification (Casey, 2017). These differences allow for learning a generic semantic vector space, which can later be tested by semantic proxy tasks to assess its semantic richness.

3 Experimental setup

In this section, we describe the data that was used in our experiments (Section 3.1), the features and preprocessing (Section 3.2), the GCCA model (Section 3.3), and the evaluation setups: across-subject retrieval (Section 3.4.1), and classification (Section 3.4.2).

3.1 Datasets

We experiment with three datasets, drawn from two published sources (Hanke et al., 2015; Pereira et al., 2018). Hanke et al. (2015) had 20 human subjects listen to 25 short music clips (6 s each) spanning 5 music genres: ambient, country, metal, rocknroll, and symphonic. We select the data for every subject that listened to each clip 8 times (all but one), yielding a final dataset of 19 subjects and 75 clip segments (fMRI data is sampled every 2 s, giving three segments per clip), hereafter referred to as Music Genres (MG). Pereira et al. (2018) presented 180 linguistic concepts to 16 subjects using 3 different presentation paradigms, 8 of whom were presented with an additional 243 sentences spanning 24 topics, and 6 of whom were presented with an additional 384 sentences spanning another 24 topics. We select the two sentence subsets for evaluation (even though we also use the average concepts data for training the embedders) since they provide topic labels for proxy task evaluation; hereafter, we refer to them as Language Topics (243 Sentences) (LT243) and Language Topics (384 Sentences) (LT384). Table 1 lists the topics.

LT243             | LT384
------------------+--------------------
Astronaut         | Animal
Beekeeping        | Appliance
Blindness         | Bird
Bone Fracture     | Body Part
Castle            | Building Part
Computer Graphics | Clothing
Dreams            | Crime
Gambling          | Disaster
Hurricane         | Drink Non Alcoholic
Ice Cream         | Dwelling
Infection         | Fish
Law School        | Fruit
Lawn Mower        | Furniture
Opera             | Human
Owl               | Insect
Painter           | Kitchen Utensil
Pharmacist        | Landscape
Polar Bear        | Music
Pyramid           | Place
Rock Climbing     | Profession
Skiing            | Tool
Stress            | Vegetable
Taste             | Vehicles Transport
Tuxedo            | Weapon

Table 1: Language topics

We randomly partition each dataset into folds based on stimuli, for use in cross-validation evaluation setups, in a way that guarantees at least one instance of each class in the test set and such that all folds have the same size. It is of the utmost importance to guarantee that train/test splits of recorded brain data are not biased in terms of local temporal coherence: preliminary experiments revealed that recorded brain activity exhibits a great degree of local temporal similarity that is not the result of similar stimuli processing and integration but rather reflects an overall short-term, temporally coherent brain state. For instance, if a sequence of sentences is presented to a subject (even if there are short time intervals between them), the fMRI data corresponding to sentences that were shown consecutively will be among the most similar pairs of fMRI vectors, regardless of the content of the sentences. Consequently, in order to evaluate fMRI-based classification, consecutive same-class stimuli instances cannot be in both the train and test splits, so as to guarantee the model learns to extract stimuli-related semantic content instead of general brain state content. Since the presentation of music clips in MG is appropriately randomized, we need only guarantee that all three segments from each clip are either in the train or the test set. For LT243 and LT384, the presentation of passages was also randomized but that of the sentences within each passage was not, so we allocate every sentence from each passage either to the train or the test set. Given the aforementioned constraints, MG, LT243, and LT384 are partitioned into five, three, and four folds, respectively.
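The partitioning constraints above (whole clips or passages kept together, every class present in every test fold, equally sized folds) can be illustrated with a small sketch. This is not the authors' code: the function name `stratified_group_folds`, the round-robin dealing strategy, and the toy layout are assumptions made for illustration, and equal fold sizes rely on every group having the same number of samples.

```python
import random
from collections import defaultdict

def stratified_group_folds(groups, labels, n_folds, seed=0):
    """Deal whole groups (clips or passages) into folds, class by class.

    groups[i] is the group id of sample i and labels[i] is its class.
    Assumes every sample of a group shares one class label.
    Returns a list of n_folds lists of sample indices.
    """
    rng = random.Random(seed)
    members = defaultdict(list)   # group id -> sample indices
    group_class = {}              # group id -> class label
    for idx, (g, y) in enumerate(zip(groups, labels)):
        members[g].append(idx)
        group_class[g] = y

    per_class = defaultdict(list)  # class label -> group ids
    for g, y in group_class.items():
        per_class[y].append(g)

    # Shuffle groups within each class, then concatenate class by class.
    ordered = []
    for y in sorted(per_class):
        class_groups = sorted(per_class[y])
        rng.shuffle(class_groups)
        ordered.extend(class_groups)

    # Round-robin dealing keeps a group intact and spreads each class
    # over the folds, so no group appears in both train and test.
    folds = [[] for _ in range(n_folds)]
    for position, g in enumerate(ordered):
        folds[position % n_folds].extend(members[g])
    return folds

# Toy MG-like layout (hypothetical ids): 25 clips x 3 segments, 5 genres.
groups = [clip for clip in range(25) for _ in range(3)]
labels = [clip // 5 for clip in groups]
folds = stratified_group_folds(groups, labels, n_folds=5)
```

In this toy layout each fold receives one clip per genre, so any fold can serve as a test set that contains all five classes while none of its clips leaks into the training folds.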

3.2 Features and preprocessing

Even though the original fMRI datasets were recorded using different equipment (e.g., recordings by Hanke et al. (2015) have a greater spatial resolution than the recordings by Pereira et al. (2018)), they all roughly cover the whole brain, i.e., there is no informed brain area voxel selection. As embodied cognition posits that the semantic network spans the whole brain, this is interesting as it allows us to evaluate whether GCCA is able to filter relevant activity in an end-to-end fashion. Since MG has very high spatial resolution, the number of voxels is prohibitively large (both in terms of memory usage and processing time). Therefore, we downsample MG, using the Nilearn library (Abraham et al., 2014), by a factor of 6, resulting in 5488 voxels. Furthermore, we also average all runs in MG in order to produce a single fMRI per subject and stimulus.

3.3 Generalized Canonical Correlation Analysis (GCCA)

Canonical Correlation Analysis (CCA) (Hotelling, 1936) is a method that finds pairs of maximally correlated linear projections $w_x$ and $w_y$ of two input views $X$ and $Y$, such that the canonical dimensions are uncorrelated with each other. Formally:

$$(w_x^*, w_y^*) = \underset{w_x, w_y}{\arg\max}\; \mathrm{corr}(X w_x, Y w_y) = \underset{w_x, w_y}{\arg\max}\; \frac{w_x^\top \Sigma_{xy} w_y}{\sqrt{w_x^\top \Sigma_{xx} w_x \; w_y^\top \Sigma_{yy} w_y}}$$

where $\Sigma_{xx}$ and $\Sigma_{yy}$ are the covariances of $X$ and $Y$, respectively, and $\Sigma_{xy}$ is the cross-covariance. GCCA (Horst, 1961; Kettenring, 1971) is an extension of CCA for an arbitrary number of views $V$. Formally, it minimizes:

$$\sum_{v=1}^{V} \lVert G - X_v W_v \rVert_F^2$$

where $G \in \mathbb{R}^{N \times K}$ satisfies $G^\top G = I$ (with $K$ canonical dimensions), $W_v$ is the projection matrix for view $v$ (stacking linear projections), $X_v$ is the data matrix for view $v$, and $N$ is the number of samples. $G$ is the canonical vector space shared by all views. This is the space that captures the latent semantics that are uncovered via correlating brain (fMRI) views. CCA-based models have previously been used for learning cross-modal semantic spaces which essentially leveraged the transmodal aspect of human semantic cognition as manifested in multimodal artifacts. For instance, Raposo et al. (2019) modeled correlations between music audio and dance video and Yu et al. (2019) modeled correlations between music audio and lyrics text.
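To make the GCCA objective concrete, the following NumPy sketch fits a shared space G with orthonormal columns and one projection matrix per view by minimizing the summed squared distance between G and each projected view. It uses a ridge-regularized eigendecomposition, which is one common way of solving this problem and not necessarily the solver used in this work. The function names, the regularization value `reg`, and the synthetic data are all assumptions for illustration.

```python
import numpy as np

def fit_gcca(views, n_components, reg=1e-3):
    """Sketch of regularized GCCA.

    views: list of (N, d_v) arrays whose rows are aligned by stimulus.
    Returns the shared space G (N, n_components), one projection matrix
    per view, and the per-view feature means used for centering.
    """
    n_samples = views[0].shape[0]
    means, factors = [], []
    M = np.zeros((n_samples, n_samples))
    for X in views:
        mu = X.mean(axis=0)
        U, s, Vt = np.linalg.svd(X - mu, full_matrices=False)
        # Ridge-shrunk projector onto the column space of the centered view.
        M += (U * (s**2 / (s**2 + reg))) @ U.T
        means.append(mu)
        factors.append((U, s, Vt))
    # G holds the leading eigenvectors of the summed projectors (G^T G = I).
    _, eigvecs = np.linalg.eigh(M)
    G = eigvecs[:, ::-1][:, :n_components]
    # W_v = (X_v^T X_v + reg I)^{-1} X_v^T G, computed through the SVD.
    weights = [Vt.T @ ((s / (s**2 + reg))[:, None] * (U.T @ G))
               for U, s, Vt in factors]
    return G, weights, means

def project(X, mean, W):
    """Map new responses of one view into the shared space."""
    return (X - mean) @ W

# Synthetic stand-in for fMRI views: 3 "subjects" sharing a 2-D latent signal.
rng = np.random.default_rng(0)
n_train, n_test = 200, 100
Z = rng.standard_normal((n_train + n_test, 2))
views = [Z @ rng.standard_normal((2, 30))
         + 0.5 * rng.standard_normal((n_train + n_test, 30))
         for _ in range(3)]
train = [X[:n_train] for X in views]
test = [X[n_train:] for X in views]

G, weights, means = fit_gcca(train, n_components=2)
emb = [project(X, mu, W) for X, mu, W in zip(test, means, weights)]
# Held-out projections of two different views should agree closely.
held_out_corr = np.corrcoef(emb[0][:, 0], emb[1][:, 0])[0, 1]
```

The toy data has far fewer features than samples. With several thousand voxels and only a few hundred stimuli, as in the experiments described here, the regularization strength would matter much more than in this sketch.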

3.4 Evaluation

We start by determining the appropriate number of semantic dimensions (canonical components) for each dataset. For each tested number of components (from 2 to 25), we train the GCCA models in a cross-validation setup and evaluate their generalizability by computing the Mean Average Precision (MAP) on the test set in an across-subject fMRI retrieval task. Then, we select the number of components that maximizes this metric (averaged over all folds). After deciding on the appropriate number of semantic dimensions, we evaluate the semantic richness of this latent space via appropriate proxy tasks: music genre classification and language topic classification. These tasks are evaluated for a variable number of subjects modeled by GCCA, in order to study how semantic richness evolves with an increasing number of subjective brain responses. Since there is more than one way of choosing a given number of subjects to model out of all subjects, we randomly sample up to 50 combinations, run the experiments for each combination, and report the average results.
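The subject-sampling step can be sketched with the standard library alone. The function below is a hypothetical helper, not the authors' code. Enumerating every combination when fewer than 50 exist is an assumption about how "up to 50" is handled.

```python
import itertools
import math
import random

def sample_subject_combinations(subjects, size, max_combinations=50, seed=0):
    """Return up to max_combinations distinct subject subsets of a given size."""
    total = math.comb(len(subjects), size)
    if total <= max_combinations:
        # Few enough subsets: use all of them (assumed behaviour).
        return list(itertools.combinations(subjects, size))
    rng = random.Random(seed)
    chosen = set()
    while len(chosen) < max_combinations:
        # Sorting makes each subset order-independent before deduplication.
        chosen.add(tuple(sorted(rng.sample(subjects, size))))
    return sorted(chosen)

subjects = list(range(19))  # e.g., the 19 MG subjects
combos_2 = sample_subject_combinations(subjects, 2)    # C(19, 2) = 171 > 50
combos_18 = sample_subject_combinations(subjects, 18)  # C(19, 18) = 19
combos_19 = sample_subject_combinations(subjects, 19)  # only one subset
```

Results for each subset size would then be averaged over the returned combinations.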

3.4.1 Across-subject retrieval

This task consists of retrieving a sorted list of fMRI database items, given an fMRI query, based on the cosine similarity between semantic vectors. The database contains all test set fMRI responses from all subjects besides the query itself. We compute MAP, which is appropriate to assess ranking quality in sorted lists of database items which contain more than a single relevant item. The relevance criterion considers any fMRI response from other subjects, for the same stimulus as the fMRI query, to be relevant. We also compute random performance baselines.
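The retrieval metric can be written out in a few lines. The sketch below is pure Python and assumes each item is a hypothetical `(subject_id, stimulus_id, embedding)` tuple with exactly one embedding per subject and stimulus, so that removing the query from the database leaves only other subjects' responses as relevant items.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def average_precision(ranked_relevance):
    """Average of precision@rank taken at the rank of each relevant item."""
    hits, total = 0, 0.0
    for rank, relevant in enumerate(ranked_relevance, start=1):
        if relevant:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0

def mean_average_precision(items):
    """items: list of (subject_id, stimulus_id, embedding) tuples."""
    scores = []
    for qi, (_, q_stim, q_vec) in enumerate(items):
        # The database holds every test item except the query itself.
        database = [it for j, it in enumerate(items) if j != qi]
        ranked = sorted(database, key=lambda it: cosine(q_vec, it[2]),
                        reverse=True)
        scores.append(average_precision([it[1] == q_stim for it in ranked]))
    return sum(scores) / len(scores)

# Toy example: two subjects, two stimuli, well-aligned embeddings.
items = [
    ("s0", "A", [1.0, 0.0]), ("s1", "A", [0.9, 0.1]),
    ("s0", "B", [0.0, 1.0]), ("s1", "B", [0.1, 0.9]),
]
toy_map = mean_average_precision(items)
```

In the toy example every query ranks the other subject's response to the same stimulus first, so the MAP reaches its maximum of 1.0.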

3.4.2 Music genre and language topic classifications

Both music genre and language topic classification tasks consist of classifying query items into one of the classes modeled by a classifier. We use SVMs for classifying music clips into 1 of 5 music genres and language sentences into 1 of 24 topics for both language datasets. The data instances used for training and testing the SVMs are the average (over the corresponding modeled subjects) of the GCCA projections of the corresponding fMRIs, which represent the semantic embeddings as computed by the GCCA model. We also compute two baselines: the average classification performance over all subjects using the original high-dimensional (i.e., several thousands of voxels) fMRI data; and random performance. We perform classification in a grid search setup over two SVM hyperparameters and report the results for the combination that maximizes the average classification accuracy across folds.

4 Results

The optimal numbers of dimensions (which are used for the proxy classification tasks) are 9, 11, and 6 for MG, LT243, and LT384, respectively, as determined via the across-subject retrieval task, which yielded MAP scores of 0.272, 0.070, and 0.044 for MG, LT243, and LT384, respectively (all statistically significant results according to randomization tests against the random baseline performances (Bestgen, 2015) of 0.081, 0.022, and 0.017, respectively).

Figure 1(a) shows the evolution of music genre classification performance for an increasing number of modeled subjects in MG (red, green, and blue lines relative to the left y-axis). The best accuracy score is 0.360. Figures 1(b) and 1(c) show the evolution of language topic classification performance for an increasing number of modeled subjects in LT243 and LT384 (red, green, and blue lines relative to the left y-axes), respectively. The best accuracy scores are 0.132 and 0.127 for LT243 and LT384, respectively.

Figure 1: Classification accuracies (red, green, and blue lines relative to left y-axes) and voxel score distribution variances (orange dashed lines relative to right y-axes) vs. number of modeled subjects. Panels: (a) MG, (b) LT243, (c) LT384.

5 Discussion

We assessed statistical significance on the across-subject retrieval task MAP scores via randomization tests with 32768 randomizations; the scores for MG, LT243, and LT384 were all statistically significant. We also ran binomial statistical significance tests for the classification performances. fMRI-based classification in MG did not significantly differ from random performance, whereas GCCA embeddings are significantly better than random performance starting at 4 subjects. fMRI-based classification in LT243 and LT384 is significantly better than random performance. GCCA-based classification in LT243 and LT384 is significantly better than random performance for any number of subjects and is also significantly better than fMRI-based classification starting from 3 and 5 subjects, respectively.
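As an illustration of a binomial test against chance-level classification, the sketch below computes a one-sided tail probability with the standard library. The one-sided form and the example counts are assumptions: 27 correct out of 75 corresponds to an accuracy of 0.360 only if accuracy is pooled over all 75 MG segments, which is not stated here. The resulting value is therefore illustrative and is not a p-value reported in this work.

```python
import math

def binomial_p_value(n_correct, n_trials, chance):
    """One-sided P(X >= n_correct) for X ~ Binomial(n_trials, chance)."""
    return sum(
        math.comb(n_trials, k) * chance**k * (1 - chance) ** (n_trials - k)
        for k in range(n_correct, n_trials + 1)
    )

# Hypothetical counts: 27 of 75 segments correct, 5 genres (chance = 0.2).
p_example = binomial_p_value(n_correct=27, n_trials=75, chance=0.2)
```

A small tail probability indicates that the observed number of correct predictions would be unlikely under a classifier that guesses uniformly at random among the classes.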

The fact that the retrieval tasks showed generalizability of the model is already evidence that it captures some semantic aspects of the stimuli, since the only supervision during training is (musical and linguistic) stimulus-based matching between fMRI data from different subjects. Proxy task results further validate this claim by showing the predictive power of the shared semantics in terms of specific semantic concepts (music genres and language topics). These tasks are appropriate for evaluating shared semantics because they involve inference of semantic concepts which are stable across individuals. Recall that the learned embeddings are generic, i.e., not specifically optimized for these concepts, yet they performed remarkably well in these tasks, meaning that these shared concepts were uncovered by jointly modeling inter-subject fMRI data in an unsupervised fashion. Moreover, these low-dimensional embeddings significantly outperformed the fMRI spaces using hundreds to thousands of times fewer dimensions, showing their efficiency.

These semantics were shown to be improved by adding subjects to the model, suggesting that the semantics of the stimuli is indeed latent in the fMRI data, but also that it is polluted by noisy inferences which can be better filtered out by GCCA when more views are taken into account. It is likely that modeling fewer subjective semantic systems, i.e., brains, is not enough to filter out the conceptual semantic peripheries (using the terminology of Pulvermüller (2018)) and other unrelated but co-occurring (during fMRI data acquisition) brain activity, which are both non-robust semantic descriptions of the concepts encoded in the media artifacts used as stimuli. It is equally possible that modeling more brains helps the model in finding semantically robust voxels, i.e., voxels belonging to the semantic kernel and halo, that would not be found otherwise. In order to get some insight into which of these effects is actually happening in the context of these experiments, we ran an additional analysis that measures how the GCCA scores are distributed across voxels, regardless of where those voxels are located. For each number of modeled subjects, we compute the distribution of voxel scores, normalize them so that they sum to 1, and compute the average variance of these distributions across folds and subject combinations. A higher value means that the voxel scores are concentrated in a smaller number of voxels, whereas a lower value means that the voxel scores are spread across a larger number of voxels. Voxel scores are computed from quantities indexed by the subject (view), the voxel, and the canonical dimension. The orange dashed lines (relative to the right y-axes) in Figures 1(a), 1(b), and 1(c) illustrate the evolution of voxel score distribution variance for MG, LT243, and LT384, respectively.
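The variance-based concentration measure can be illustrated as follows. The exact voxel score formula is not recoverable from the text here, so this sketch assumes, purely for illustration, that a voxel's score is the sum of the absolute values of its projection weights over canonical dimensions for one subject. The function names and the toy weights are likewise hypothetical.

```python
from statistics import pvariance

def voxel_scores(weights):
    """Normalized voxel scores for one subject (view).

    weights: one row per voxel, one column per canonical dimension.
    ASSUMPTION: score = sum of absolute weights over canonical dimensions.
    """
    raw = [sum(abs(w) for w in row) for row in weights]
    total = sum(raw)
    return [r / total for r in raw]  # scores sum to 1

def score_variance(weights):
    """Variance of the normalized voxel score distribution."""
    return pvariance(voxel_scores(weights))

# Four toy voxels, two canonical dimensions.
concentrated = [[1.0, 1.0], [0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]
spread = [[1.0, 1.0], [1.0, 1.0], [1.0, 1.0], [1.0, 1.0]]
var_concentrated = score_variance(concentrated)
var_spread = score_variance(spread)
```

Under this assumed score, weights carried by a single voxel give a higher variance than weights shared equally by all voxels, which matches the interpretation of the variance values given above.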

The voxel score distribution variance plots suggest that the model is able to find more semantically rich voxels as the number of modeled subjects increases. Note, however, that these experiments were performed with a relatively small number of samples (a few hundred) compared to the number of input dimensions (several thousand) being modeled. This means that there is an upper bound on the number of latent dimensions that GCCA can model for each view. Therefore, adding subjective views, in these conditions, seems to help in finding the semantic kernel and halo, as if these views were artificially boosting the sample size (number of stimuli). It is still possible that the inverse effect (filtering out of the semantic periphery) can happen with a larger sample size, and that is a direction for further investigation. Finally, it is interesting that all these effects are observed for both music and language modalities. These results not only show the promise of CCA-based methods to model semantics via brain dynamics but also suggest that meaning in music and language is inferred in a similar way, thus providing further evidence for embodied cognition in general.

6 Conclusions and future work

We showed that music and language semantics can be learned and studied via canonically correlating fMRI responses from different subjects. GCCA models were able to generalize how the brains from all subjects covary with respect to musical and linguistic stimuli as well as to produce unsupervised semantic embeddings for classification of music genres and language topics. Moreover, we showed that the semantic space learned by GCCA is more powerful as the number of modeled subjects increases. Future work includes experimenting with larger and more varied datasets, using GCCA to study which brain areas covary together for both music and language domains, as well as extending this study to other modalities.

References


  • Kiefer and Pulvermüller (2012) M. Kiefer, F. Pulvermüller, Conceptual Representations in Mind and Brain: Theoretical Developments, Current Evidence and Future Directions, Cortex 48 (2012) 805–825. doi:10.1016/j.cortex.2011.04.006.
  • Lakoff (2014) G. Lakoff, Mapping the Brain’s Metaphor Circuitry: Metaphorical Thought in Everyday Reason, Frontiers in Human Neuroscience 8 (2014) 958. doi:10.3389/fnhum.2014.00958.
  • Pulvermüller (2018) F. Pulvermüller, Neurobiological Mechanisms for Semantic Feature Extraction and Conceptual Flexibility, Topics in Cognitive Science 10 (2018) 590–620. doi:10.1111/tops.12367.
  • Ralph et al. (2017) M. A. L. Ralph, E. Jefferies, K. Patterson, T. T. Rogers, The Neural and Computational Bases of Semantic Cognition, Nature Reviews Neuroscience 18 (2017) 42–55. doi:10.1038/nrn.2016.150.
  • Desai et al. (2011) R. H. Desai, J. R. Binder, L. L. Conant, Q. R. Mano, M. S. Seidenberg, The Neural Career of Sensory-motor Metaphors, Journal of Cognitive Neuroscience 23 (2011) 2376–2386. doi:10.1162/jocn.2010.21596.
  • Lakoff (2012) G. Lakoff, Explaining Embodied Cognition Results, Topics in Cognitive Science 4 (2012) 773–785. doi:10.1111/j.1756-8765.2012.01222.x.
  • Thibodeau and Boroditsky (2013) P. H. Thibodeau, L. Boroditsky, Natural Language Metaphors Covertly Influence Reasoning, PLOS ONE 8 (2013) e52961. doi:10.1371/journal.pone.0052961.
  • Korsakova-Kreyn (2018) M. Korsakova-Kreyn, Two-level Model of Embodied Cognition in Music, Psychomusicology: Music, Mind, and Brain 28 (2018) 240–259. doi:10.1037/pmu0000228.
  • Virtala et al. (2013) P. Virtala, M. Huotilainen, E. Partanen, V. Fellman, M. Tervaniemi, Newborn Infants’ Auditory System is Sensitive to Western Music Chord Categories, Frontiers in Psychology 4 (2013) 492. doi:10.3389/fpsyg.2013.00492.
  • Bidelman and Krishnan (2009) G. M. Bidelman, A. Krishnan, Neural Correlates of Consonance, Dissonance, and the Hierarchy of Musical Pitch in the Human Brainstem, Journal of Neuroscience 29 (2009) 13165–13171. doi:10.1523/jneurosci.3900-09.2009.
  • Leman (2010) M. Leman, An Embodied Approach to Music Semantics, Musicae Scientiae 14 (2010) 43–67. doi:10.1177/10298649100140S104.
  • Koelsch et al. (2019) S. Koelsch, P. Vuust, K. Friston, Predictive Processes and the Peculiar Case of Music, Trends in Cognitive Sciences 23 (2019) 63–77. doi:10.1016/j.tics.2018.10.006.
  • Pereira et al. (2018) F. Pereira, B. Lou, B. Pritchett, S. Ritter, S. J. Gershman, N. Kanwisher, M. Botvinick, E. Fedorenko, Toward a Universal Decoder of Linguistic Meaning from Brain Activation, Nature Communications 9 (2018) 963. doi:10.1038/s41467-018-03068-4.
  • Hoerl and Kennard (1970) A. E. Hoerl, R. W. Kennard, Ridge Regression: Biased Estimation for Nonorthogonal Problems, Technometrics 12 (1970) 55–67. doi:10.1080/00401706.1970.10488634.
  • Pennington et al. (2014) J. Pennington, R. Socher, C. D. Manning, GloVe: Global Vectors for Word Representation, in: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, 2014, pp. 1532–1543.
  • Casey (2017) M. A. Casey, Music of the 7Ts: Predicting and Decoding Multivoxel fMRI Responses with Acoustic, Schematic, and Categorical Music Features, Frontiers in Psychology 8 (2017) 1179. doi:10.3389/fpsyg.2017.01179.
  • Cortes and Vapnik (1995) C. Cortes, V. Vapnik, Support-vector Networks, Machine Learning 20 (1995) 273–297. doi:10.1007/BF00994018.
  • Hanke et al. (2015) M. Hanke, R. Dinga, C. Häusler, J. S. Guntupalli, M. Casey, F. R. Kaule, J. Stadler, High-resolution 7-Tesla fMRI Data on the Perception of Musical Genres - An Extension to the Studyforrest Dataset, F1000Research 4 (2015) 174. doi:10.12688/f1000research.6679.1.
  • Abraham et al. (2014) A. Abraham, F. Pedregosa, M. Eickenberg, P. Gervais, A. Mueller, J. Kossaifi, A. Gramfort, B. Thirion, G. Varoquaux, Machine learning for neuroimaging with scikit-learn, Frontiers in Neuroinformatics 8 (2014) 14. doi:10.3389/fninf.2014.00014.
  • Hotelling (1936) H. Hotelling, Relations Between Two Sets of Variates, Biometrika 28 (1936) 321–377. doi:10.2307/2333955.
  • Horst (1961) P. Horst, Generalized Canonical Correlations and their Applications to Experimental Data, Journal of Clinical Psychology 17 (1961) 331–347. doi:10.1002/1097-4679(196110)17:4<331::aid-jclp2270170402>3.0.co;2-d.
  • Kettenring (1971) J. R. Kettenring, Canonical Analysis of Several Sets of Variables, Biometrika 58 (1971) 433–451. doi:10.1093/biomet/58.3.433.
  • Raposo et al. (2019) F. A. Raposo, D. M. de Matos, R. Ribeiro, Learning Embodied Semantics via Music and Dance Semiotic Correlations, CoRR abs/1903.10534 (2019).
  • Yu et al. (2019) Y. Yu, S. Tang, F. Raposo, L. Chen, Deep Cross-modal Correlation Learning for Audio and Lyrics in Music Retrieval, ACM Transactions on Multimedia Computing, Communications, and Applications 15 (2019) 20. doi:10.1145/3281746.
  • Bestgen (2015) Y. Bestgen, Exact Expected Average Precision of the Random Baseline for System Evaluation, The Prague Bulletin of Mathematical Linguistics 103 (2015) 131–138. doi:10.1515/pralin-2015-0007.