Comparing observed brain activity with the statistics generated by artificial intelligence systems is a useful way to probe brain functional organization under ecological conditions. Here we study fMRI activity in ten subjects watching color natural movies and compute deep representations of these movies with an architecture that relies on optical flow and image content. Associating activity in visual areas with the different layers of the deep architecture displays complexity-related contrasts across visual areas and reveals a striking foveal/peripheral dichotomy.
Keywords: deep learning; video encoding; brain mapping
The understanding of brain functional architecture has long been driven by subtractive reasoning approaches, in which the activation patterns associated with different experimental conditions, presented in event-related or block designs, are contrasted in order to yield condition-specific maps poline2012. A more ecological way of stimulating subjects consists in presenting complex continuous stimuli that are much more similar to everyday cognitive experiences.
The analysis of the ensuing complex stimulation streams proceeds by extracting relevant features from the stimuli and correlating the occurrence of these features with brain activity recorded simultaneously with the presentation of the stimuli. The analysis of video streams has been carried out in eickenberg2017seeing and gucclu2015deep using a deep convolutional network trained for image classification. More recently, gucclu2017increasingly used a deep neural network trained for action recognition to analyze video streams.
Like gucclu2017increasingly, we use a deep neural network trained for action recognition to extract video features and train a linear model to predict brain activity from these features. In contrast, our study is not restricted to dorsal stream visual areas but involves the whole brain, and the deep neural network we use is pretrained on the largest action recognition dataset available kay2017kinetics.
From the different layers of the deep neural network, we build video representations that allow us to segregate (1) occipital and lateral areas of the visual cortex (reproducing the results of gucclu2015deep) and (2) foveal and peripheral areas of the visual cortex. We also introduce an efficient spatial compression scheme for deep video features that allows us to speed up the training of our predictive algorithm. We show that our compression scheme outperforms PCA by a large margin.
3.1 Deep video representation
We use a deep neural network trained for action recognition to build deep representations of the Berkeley Video stimuli nishimoto2011reconstructing. This material consists of more than four hours of color natural movies built by mixing video blocks of - seconds in a random fashion.
The deep network we use is called Temporal Segment Network (TSN) wang2016temporal. Following an idea introduced in 2014 simonyan2014two, it was intended to mimic the dorsal and ventral streams by separately processing raw frames and optical flow fields. We chose TSN for our experiments because it uses a much larger number of layers than the original network, which results in higher accuracy in action recognition, and because a version of TSN pretrained on Kinetics – a massive video dataset (300,000 unique video clips) with 400 different classes, each describing a human action, and at least 400 videos per class – is publicly available. The network is trained to recognize human actions such as slack-lining, skateboarding, massaging feet, dancing zumba and dining.
The version of TSN we use in our experiments is based on Inception v3 szegedy2016rethinking for both streams, where small networks are used as building blocks of the main large network lin2013network. Each stream of the TSN is composed of more than 40 convolution layers and a fully connected layer. The activations after the last layer represent the probability of belonging to each action class.
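To make the readout of intermediate layers concrete, the sketch below shows the standard forward-hook mechanism used to collect convolutional activations from a PyTorch network. The tiny two-layer network is a stand-in for one TSN stream (the real Inception v3 backbone has 40+ conv layers), but the hook logic is identical; all names here are illustrative.

```python
import torch
import torch.nn as nn

# Tiny stand-in for one TSN stream; the real backbone (Inception v3)
# has 40+ conv layers, but the hook mechanism is the same.
net = nn.Sequential(
    nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(),
    nn.Conv2d(8, 16, 3, padding=1), nn.ReLU(),
)

activations = {}

def save_activation(name):
    # Returns a hook that stores the layer's output under `name`.
    def hook(module, inputs, output):
        activations[name] = output.detach()
    return hook

# Register a hook on each conv layer whose activity we want to read out.
for idx, layer in enumerate(net):
    if isinstance(layer, nn.Conv2d):
        layer.register_forward_hook(save_activation(f"conv{idx}"))

frame = torch.randn(1, 3, 224, 224)   # one RGB frame
with torch.no_grad():
    net(frame)

print(sorted(activations))            # ['conv0', 'conv2']
print(activations["conv2"].shape)     # torch.Size([1, 16, 224, 224])
```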
3.2 Feature extraction
The raw frames encode information about pixels, and flow fields encode information about pixels displacements. Although flow fields and raw frames streams do not precisely disentangle spatial content and motion information in videos, we may expect that the raw frames stream better represent local spatial features while the flow fields stream more efficiently convey dynamic information. Following eickenberg2017seeing we consider that the activation statistics in the first layers (the ones closer to those of the input) have a low level of abstraction, whereas the last layers (closer to the labels) represent high-level information. Therefore each activity in both streams can be considered as specific features or representations of the video.
If we were to extract all network activities for the Berkeley Video Dataset, we would need to store more than 6 million floating-point values per frame. Such a representation would be highly redundant. In order to keep the volume of data reasonable, in each stream we focus on only four convolutional layers, ranked by complexity. We further compress the data using spatial smoothing, and use temporal smoothing so that we obtain one representation every two seconds of video, matching the acquisition rate of the fMRI scanner.
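The temporal and spatial smoothing described above can be sketched as follows; the frame counts and pooling sizes are illustrative, not the paper's exact settings.

```python
import numpy as np

def downsample_features(acts, frames_per_tr, pool):
    """acts: (T, C, H, W) layer activations, one slice per video frame.
    Temporal mean over each fMRI repetition time (one representation
    every `frames_per_tr` frames), then spatial average pooling inside
    each channel (no mixing across channels)."""
    T, C, H, W = acts.shape
    n_tr = T // frames_per_tr
    # Average frames within each TR window.
    acts = acts[:n_tr * frames_per_tr]
    acts = acts.reshape(n_tr, frames_per_tr, C, H, W).mean(axis=1)
    # Average pool x pool spatial blocks, separately per channel.
    acts = acts.reshape(n_tr, C, H // pool, pool, W // pool, pool)
    acts = acts.mean(axis=(3, 5))
    return acts.reshape(n_tr, -1)     # one feature vector per TR

acts = np.random.randn(120, 16, 8, 8)   # 120 frames, 16 channels of 8x8 maps
X = downsample_features(acts, frames_per_tr=60, pool=4)
print(X.shape)                           # (2, 64)
```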
Ten subjects were scanned while watching the color natural movies of the Berkeley Video Dataset. The fMRI images were acquired at high spatial resolution (1.5 mm) on a Prisma scanner, using multi-band and iPAT accelerations (MB factor = 3, iPAT = 2). These data are part of a large-scale mapping project on a limited number of participants, called Human Brain Charting. Data acquisition procedures and initial experiments run in this project are described in ibc. In order to link the extracted deep video features to the internal representation of videos in each subject, we use a simple linear model to fit their brain activity in each voxel.
The use of a very simple model allows us to posit that the performance of the predictive model from a particular video representation is mostly linked to the suitability of the video representation. Hence the performance of the algorithm can be seen as a measure of the biological suitability of the video representation.
We use a kernel ridge regression with a hyper-parameter setting the magnitude of the l2-penalization on the weights. The resulting prediction is obtained using a cross-validation procedure (11 sessions are used for training, 1 for testing, and at least 5 different splits are considered). To set the value of the hyper-parameter, we use a 5-fold cross-validation on the training set and consider 20 different values. During hyper-parameter selection, we only consider the visual cortex to keep this computation efficient.
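A minimal sketch of this fitting scheme, using scikit-learn's KernelRidge with a 20-value penalty grid selected by 5-fold cross-validation, is given below. The data are synthetic (one simulated voxel); in the study one such model is fit per voxel, with session-wise splits rather than the simple holdout used here.

```python
import numpy as np
from sklearn.kernel_ridge import KernelRidge
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 50))                    # deep features per TR
w = rng.standard_normal(50)
y = X @ w + 0.1 * rng.standard_normal(200)            # one simulated voxel

# 20 candidate l2-penalties, chosen by 5-fold CV on the training set.
grid = {"alpha": np.logspace(-3, 3, 20)}
model = GridSearchCV(KernelRidge(kernel="linear"), grid, cv=5)
model.fit(X[:150], y[:150])                           # train on 150 TRs

test_r2 = model.score(X[150:], y[150:])               # R^2 on held-out TRs
print(round(test_r2, 2))
```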
The chosen measure of performance of our prediction algorithm is the coefficient of determination $R^2$. Let $\hat{y}$ and $y$ be respectively the predicted and the measured activity of a voxel. Then
$$R^2 = 1 - \frac{\sum_t \left(y_t - \hat{y}_t\right)^2}{\sum_t \left(y_t - \bar{y}\right)^2}$$
The metric used to select the best parameter is the number of voxels with a coefficient of determination greater than a given threshold. This procedure leads to different parameter values depending on the chosen layer activities.
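This voxel-count metric can be computed directly from per-voxel $R^2$ scores; the sketch below uses random data and a placeholder threshold, since the paper's exact threshold is not stated here.

```python
import numpy as np
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)
# y_true, y_pred: (n_TRs, n_voxels) measured vs predicted activity.
y_true = rng.standard_normal((100, 500))
y_pred = y_true + 0.5 * rng.standard_normal((100, 500))  # noisy predictions

# One R^2 value per voxel (column), not a single pooled score.
r2 = r2_score(y_true, y_pred, multioutput="raw_values")

threshold = 0.1            # hypothetical value, for illustration only
n_good = int((r2 > threshold).sum())
print(r2.shape, n_good)
```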
The extracted deep network features lead to different prediction performance depending on the down-sampling procedure, the stream used and the localization of target voxels.
4.1 An efficient spatial compression scheme
We show that preserving the channel structure of the network during the spatial compression procedure is key to an efficient compression scheme.
We compare three spatial compression schemes for network activities: (1) standard principal component analysis (PCA) with a fixed number of components, where the transformation is learned on training sessions before it is applied to all sessions; (2) average pooling inside channels (APIC), which computes local means of activities located in the same channel; (3) average pooling inside and between convolution layers (APBIC), which is used to obtain the same number of output features for all layers while minimizing the number of convolutions between channels. The latter allows us to check that the performance of the predictive algorithm is not merely driven by the number of features.
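The key difference between scheme (1) and scheme (2) is whether channels are mixed. A minimal sketch, with illustrative array sizes:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
acts = rng.standard_normal((300, 16, 8, 8))   # (n_samples, C, H, W)

# (1) PCA: flattens everything, so components mix activities
# across channels as well as across spatial positions.
X_pca = PCA(n_components=64).fit_transform(acts.reshape(300, -1))

# (2) APIC: average 4x4 spatial blocks inside each channel;
# channels are kept separate throughout.
X_apic = acts.reshape(300, 16, 2, 4, 2, 4).mean(axis=(3, 5)).reshape(300, -1)

print(X_pca.shape, X_apic.shape)   # both (300, 64)
```

Both schemes produce the same number of features here, so any downstream difference in prediction score reflects what the compression preserves, not the feature count.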
The procedure for activity extraction, temporal down-sampling and brain-activity prediction is left unchanged while the spatial compression scheme varies. The benchmark is performed using a leave-one-out cross-validation procedure with two splits in three subjects.
Figure 2 shows that both approaches preserving channel organization structure outperform PCA by a large margin.
These results suggest that data stored in the same channel are similar and that mixing data between channels tends to destroy valuable information. In our pipeline, we average only inside channels (APIC) because it yields the best performance. Choosing APBIC would trade performance for computation speed, since its higher compression rate enables much faster training of the prediction algorithm.
4.2 Data based parcellation of the brain using deep video representation
Depending on the considered region of the brain, the best fitting representation varies. We show that the compressed activities of different layers show contrasts between low-level (retinotopic) versus high-level (object-responsive) areas, but also between foveal and peripheral areas.
The difference between the prediction scores obtained from high-level and low-level layer activity, in both streams, yields a clear contrast between occipital (low-level) and lateral (high-level) areas (see Fig. 3). This highlights a gradient of complexity in neural representation along the ventral stream, which was also found in gucclu2015deep.
The difference between the prediction scores from low-level layer activity of the flow-fields stream and high-level layer activity of the raw-frames stream yields a contrast that does not match boundaries between visual areas; instead, it coincides with the retinotopic map of preferred eccentricity (see Figure 4). Intuitively, this means that regions whose activity is better predicted from the lowest layers of the flow-fields stream than from the highest layers of the raw-frames stream are involved in peripheral vision, whereas regions showing the opposite pattern are mainly foveal.
We use the contrasts between high-level and low-level layers, together with the eccentricity-related contrast, to construct a parcellation of the brain (see Figure 5). Among the 8 possible resulting profiles, three major clusters stand out, showing that contrasts derived from deep representations of the stimuli yield a meaningful clustering of voxels.
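The 8 profiles arise from the signs of the three contrasts ($2^3 = 8$ combinations). A minimal sketch of this sign-based labeling, with hypothetical contrast names and random values standing in for the real voxelwise maps:

```python
import numpy as np

rng = np.random.default_rng(0)
# Three voxelwise contrast maps (names are hypothetical):
# c_raw  : high-level minus low-level layers, raw-frames stream
# c_flow : high-level minus low-level layers, flow-fields stream
# c_ecc  : low-level flow minus high-level raw (eccentricity contrast)
c_raw, c_flow, c_ecc = rng.standard_normal((3, 1000))  # 1000 voxels

# Each voxel gets one of 2**3 = 8 sign profiles, encoded as 0..7.
profile = (
    (c_raw > 0).astype(int) * 4
    + (c_flow > 0).astype(int) * 2
    + (c_ecc > 0).astype(int)
)
print(profile.shape)   # (1000,)
```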
Reproducing the results of gucclu2015deep, we have shown that lateral areas are best predicted by the last layers of both streams, whereas occipital areas are best predicted by the first layers of both streams. We have also shown that foveal areas are best predicted by the last layers of the raw-frames stream and that peripheral areas are best predicted by the first layers of the flow-fields stream. We have introduced a compression procedure for video representations that largely preserves the channel structure of the network, yielding large performance gains compared to PCA.
The linear prediction from deep video features yields prediction scores that are far better than chance. However, the TV-L1 algorithm zach2007duality used in the TSN does not produce high-quality flow fields. Using a more recent optical-flow algorithm such as FlowNet 2 ilg2017flownet could further improve our performance, though the TSN would have to be retrained.
In contrast to gucclu2017increasingly, the data used to train the network are not the same as the data presented to the subjects. We rely in fact on transfer between computer-vision datasets and the visual content used for stimulation. This transfer is imperfect: the Berkeley Video Dataset contains videos of landscapes and animated pictures that are absent from the Kinetics dataset, which introduces some noise.
In conclusion, our study provides evidence that the role of visual areas during action recognition is linked to their retinotopic representation. Future studies should refine this result by using networks tuned for other tasks.
This project has received funding from the European Union’s Horizon 2020 Research and Innovation Programme under Grant Agreement No. 720270 (HBP SGA1).