
Overcoming the Domain Gap in Contrastive Learning of Neural Action Representations

by Semih Günel, et al.

A fundamental goal in neuroscience is to understand the relationship between neural activity and behavior. For example, the ability to extract behavioral intentions from neural data, or neural decoding, is critical for developing effective brain machine interfaces. Although simple linear models have been applied to this challenge, they cannot identify important non-linear relationships. Thus, a self-supervised means of identifying non-linear relationships between neural dynamics and behavior, in order to compute neural representations, remains an important open problem. To address this challenge, we generated a new multimodal dataset consisting of the spontaneous behaviors generated by fruit flies, Drosophila melanogaster – a popular model organism in neuroscience research. The dataset includes 3D markerless motion capture data from six camera views of the animal generating spontaneous actions, as well as synchronously acquired two-photon microscope images capturing the activity of descending neuron populations that are thought to drive actions. Standard contrastive learning and unsupervised domain adaptation techniques struggle to learn neural action representations (embeddings computed from the neural data describing action labels) due to large inter-animal differences in both neural and behavioral modalities. To overcome this deficiency, we developed simple yet effective augmentations that close the inter-animal domain gap, allowing us to extract behaviorally relevant, yet domain agnostic, information from neural data. This multimodal dataset and our new set of augmentations promise to accelerate the application of self-supervised learning methods in neuroscience.





1 Introduction

Recent technological advances have enabled large-scale simultaneous recordings of neural activity and behavior in animals including rodents, macaques, humans and the vinegar fly, Drosophila melanogaster dombeck ; Seelig ; chen2018imaging ; lfads ; Ecker2010c ; wirelesshuman . In parallel, recent efforts have made it possible to perform markerless predictions of 2D and 3D animal poses leap ; Mathis ; Gunel ; Bala ; newell ; couzin ; fang2017rmpe ; wei2016cpm ; cao2017realtime ; lp3d ; li2020deformation . Video and pose data have been used to segment and cluster temporally related behavioral information task_programming ; Segalin2020 ; berman21 ; quantify ; robertZebraFish20 . To capture a similarly low-dimensional representation of neural activity, most efforts have focused on the application of recurrent state space models Nassar2019TreeStructuredRS ; Linderman621540 ; pmlr-v54-linderman17a , or variational autoencoders Gao2016LinearDN ; lfads . By contrast, there has been relatively limited work aimed at extracting behavioral information from neural data behavenet ; subspace ; MLfordecoding , and most efforts have focused on identifying linear relationships between these two modalities using simple correlation analysis or generalized linear models subtrate ; musall19 ; stringer19 . However, neural action representations (the mapping of behavioral information within neural data), which are particularly crucial for brain-machine interfaces and closed-loop experimentation bmi ; closed-loop , are highly nonlinear. Therefore, devising a systematic approach for uncovering complex non-linear relationships between behavioral and neural modalities remains an important challenge.

Contrastive learning is one promising approach to address this gap. It has been used to extract information from multimodal datasets in a self-supervised way, for modalities including audio, speech, and optical flow munro20multi ; Han20 ; alwassel_2020_xdc ; asano2020self ; relja18 ; asano2020labelling . Contrastive learning also has been applied to unimodal datasets, including the study of human motion sequences liu2020snce ; su2020predict ; Lin_2020 , medical imaging chaitanya2020contrastive ; zhang2020contrastive , video understanding pan2021videomoco ; Dave2021TCLRTC , and pose estimation honar21SSL ; mitra2020multiview . Thus, contrastive learning holds great promise for application in neuroscience.

One of the largest barriers to applying contrastive learning to behavioral-neural multimodal datasets is the fact that their statistics (e.g., neuron locations and sizes, body part lengths and ranges of motion) often differ dramatically across animals. This makes it difficult to train models that can generalize across subjects. We confront this domain gap when comparing neural imaging datasets from two different flies (Supplementary Fig. S3; Supplementary Videos 1-2). Although multimodal domain adaptation methods for downstream tasks such as action recognition exist munro20multi , they assume supervision in the form of labeled source data. However, labeling behavioral-neural datasets requires expensive and arduous manual labor by trained scientists, which often leaves the vast majority of data unlabeled. Similarly, it is non-trivial to generalize few-shot domain adaptation methods to multimodal tasks kangcontrastive ; wang2021crossdomain . Thus, the field of neuroscience needs new computational approaches that can extract information from ever-increasing amounts of unlabeled multimodal data that also suffer from extensive domain gaps across subjects.

Here, we address this challenge by extracting domain-agnostic action representations from neural data. We measure representation quality using an action recognition task, in which we apply a linear classification head and transfer our pretrained weights to classify action labels. Therefore, we call our representations neural action representations. To best reflect real-world conditions, during the unsupervised pre-training phase, we assume access to the paired behavioral-neural data for all domains but without any action labels. We then show that a strong domain gap exists across data taken from different animals, rendering standard contrastive methods ineffective. To address this challenge, we propose a set of simple augmentations that can perform domain adaptation and extract useful representations. We find that the resulting model outperforms baseline approaches, including linear models, previous neural representation learning approaches and common domain adaptation techniques. Finally, to accelerate the uptake and development of these and other self-supervised methods in neuroscience, we will release our new multimodal Drosophila behavioral-neural dataset along with associated dense action labels for spontaneously generated behaviors.

2 Methods

2.1 Problem Definition

We assume a paired set of data $\mathcal{D}_s = \{(\mathbf{b}_i^s, \mathbf{n}_i^s)\}_{i=1}^{n_s}$, where $\mathbf{b}_i^s$ and $\mathbf{n}_i^s$ represent the behavioral and neural information respectively, with $n_s$ being the number of samples for animal $s \in \mathcal{S}$. We quantify behavioral information $\mathbf{b}_i^s$ as a set of 3D poses $\mathbf{b}_k^s$ corresponding to a set of frames $k \in K_i$ from animal $s$, and neural information $\mathbf{n}_i^s$ as a set of two-photon microscope images $\mathbf{n}_k^s$, for the same frames $k \in K_i$, capturing the activity of neurons. We assume that the two modalities are always synchronized (paired), and therefore describe the same set of events. Our goal is to learn a parameterized image encoder function $f_n$, which maps a set of neural images $\mathbf{n}_i^s$ to a low-dimensional representation. We aim for our learned representation to be representative of the underlying behavioral label, while being modality-agnostic and not representative of the underlying animal identity $s$, thereby effectively removing the domain gap across animals and modalities. We assume that we are not given behavioral labels during unsupervised training.

2.2 Contrastive Representation Learning

For each input pair $(\mathbf{b}_i^s, \mathbf{n}_i^s)$, we first draw a random view $(\tilde{\mathbf{b}}_i^s, \tilde{\mathbf{n}}_i^s)$ with sampled transformation functions $t_b \sim \mathcal{T}_b$ and $t_n \sim \mathcal{T}_n$, where $\mathcal{T}_b$ and $\mathcal{T}_n$ represent families of stochastic image transformation functions for behavioral and neural data, respectively. Next, the encoder functions $f_b$ and $f_n$ transform input data into low-dimensional vectors $\mathbf{h}_b$ and $\mathbf{h}_n$, followed by non-linear projection functions $g_b$ and $g_n$, which further transform data into the vectors $\mathbf{z}_b$ and $\mathbf{z}_n$. During training, we sample a minibatch of $N$ input pairs $(\mathbf{b}_i^s, \mathbf{n}_i^s)$ and train with a symmetric contrastive loss function, in which the similarity between the behavioral and neural projections $\mathbf{z}_b$ and $\mathbf{z}_n$ is measured by cosine similarity and scaled by a temperature parameter $\tau$. The loss function maximizes a lower bound on the mutual information between the two modalities oord2019representation . The symmetric version of the contrastive loss function was previously used in multimodal self-supervised learning zhang2020contrastive ; Yuan_2021_CVPR . An overview of our method for learning $f_n$ is shown in Supplementary Fig. S2. Although standard contrastive learning bridges the gap between different modalities, it does not bridge the gap between different animals, a fundamental challenge that we address in this work through the augmentations described in the following section.
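As an illustration of the kind of objective described above, the following is a minimal sketch of a symmetric InfoNCE-style loss over a minibatch of paired projections. It uses cosine similarity and a temperature, as in the text, but the averaging over the batch, the equal weighting of the two directions, the default temperature and all function names are assumptions made for this sketch, not details confirmed by the paper.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def log_softmax_at(logits, index):
    """Numerically stable log-softmax of `logits`, evaluated at one index."""
    m = max(logits)
    log_norm = m + math.log(sum(math.exp(x - m) for x in logits))
    return logits[index] - log_norm

def symmetric_info_nce(z_b, z_n, temperature=0.1):
    """Symmetric contrastive loss over N paired projections.

    z_b[i] and z_n[i] are the behavioral and neural projections of pair i.
    Batch averaging and equal weighting of both directions are assumptions.
    """
    n = len(z_b)
    sim = [[cosine_similarity(z_b[i], z_n[k]) / temperature for k in range(n)]
           for i in range(n)]
    # Behavior -> neural: row i should select column i among all neural samples.
    loss_b2n = -sum(log_softmax_at(sim[i], i) for i in range(n)) / n
    # Neural -> behavior: column i should select row i among all behavioral samples.
    loss_n2b = -sum(log_softmax_at([sim[k][i] for k in range(n)], i)
                    for i in range(n)) / n
    return loss_b2n + loss_n2b
```

With correctly aligned pairs the loss approaches zero, while mismatched pairs are penalized heavily; when all projections are identical, each direction reduces to the log of the batch size.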

Swapping Augmentation:

Given a set of consecutive 3D poses $\mathbf{b}_i^s$, for each frame $k$ of the sequence we stochastically replace the pose $\mathbf{b}_k^s$ with one of its nearest pose neighbors in the set of domains $\mathcal{D}_{\mathcal{S}}$, where $\mathcal{S}$ is the set of all animals. To do so, we first randomly select a domain $\hat{s} \in \mathcal{S}$ and define a probability distribution over the poses in domain $\mathcal{D}_{\hat{s}}$ with respect to $\mathbf{b}_k^s$. We then replace each 3D pose $\mathbf{b}_k^s$ by first uniformly sampling a new domain $\hat{s}$, and then sampling a replacement pose from this distribution. In practice, we calculate the distribution only over the first $N$ nearest neighbors of $\mathbf{b}_k^s$, in order to sample from a distribution of the most similar poses; we set $N$ empirically. Swapping augmentation reduces the identity information in the behavioral data without perturbing it to the extent that semantic action information is lost. Because each transformed behavioral sample is composed of poses from multiple domains, the behavioral encoding function $f_b$ is forced to leave identity information out, thereby merging multiple domains in the latent space. Swapping augmentation is similar to the synonym replacement augmentation used in natural language processing wei-zou-2019-eda , where randomly selected words in a sentence are replaced by their synonyms. To the best of our knowledge, we are the first to use swapping augmentation in the context of time-series analysis or for domain adaptation.
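The swapping procedure can be sketched as follows. This is an illustrative sketch under stated assumptions rather than the authors' implementation: the distribution is taken to be a softmax over negative Euclidean distance restricted to the nearest neighbors, and the names `neighbor_distribution`, `swap_augment`, `num_neighbors` and `swap_prob` are hypothetical.

```python
import math
import random

def neighbor_distribution(pose, candidates, num_neighbors):
    """Probabilities over the nearest candidate poses.

    Assumed form: softmax over negative Euclidean distance, restricted to
    the `num_neighbors` closest candidates. Returns (indices, probabilities).
    """
    dists = [math.dist(pose, c) for c in candidates]
    nearest = sorted(range(len(candidates)), key=lambda j: dists[j])[:num_neighbors]
    d_min = dists[nearest[0]]  # subtract the minimum for numerical stability
    weights = [math.exp(-(dists[j] - d_min)) for j in nearest]
    total = sum(weights)
    return nearest, [w / total for w in weights]

def swap_augment(poses, domains, num_neighbors=3, swap_prob=0.5, rng=random):
    """Stochastically replace each pose with a similar pose from a random animal.

    poses: list of pose vectors from consecutive frames.
    domains: dict mapping animal id -> list of pose vectors from that animal.
    swap_prob: per-frame replacement probability (hypothetical parameter).
    """
    animal_ids = sorted(domains)
    augmented = []
    for pose in poses:
        if rng.random() >= swap_prob:
            augmented.append(pose)  # keep the original pose for this frame
            continue
        target = rng.choice(animal_ids)  # uniformly sample a domain
        idx, probs = neighbor_distribution(pose, domains[target], num_neighbors)
        chosen = rng.choices(idx, weights=probs, k=1)[0]
        augmented.append(domains[target][chosen])
    return augmented
```

Because every frame is resampled independently, a single augmented sequence typically mixes poses from several animals, which is the property the text relies on to suppress identity information.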

Neural Calcium Imaging Data Augmentation:

Our neural data was obtained using two-photon microscopy and calcium imaging. The resulting images are only a function of the underlying neural activity, and have temporal properties that differ from the true neural activity. For example, calcium signals from a neuron change much more slowly than the neuron's actual firing rate. Consequently, a single neural image includes decaying information concerning neural activity from the recent past, and thus carries information about previous behaviors. This makes it harder to decode the current behavioral state. We aimed to prevent this overlap of ongoing and previous actions. Specifically, we wanted to teach our network to be invariant with respect to past behavioral information by augmenting the set of possible past actions. To do this, we generated new data $\tilde{\mathbf{n}}_i^s$ that included previous neural activity $\mathbf{n}_k^s$. To mimic calcium indicator decay dynamics, given a neural data sample $\mathbf{n}_i^s$ of multiple frames, we sample a new neural frame $\mathbf{n}_k^s$ from the same domain. We then convolve $\mathbf{n}_k^s$ with the temporally decaying calcium convolutional kernel $\mathcal{K}$, thereby creating a set of images from the single frame $\mathbf{n}_k^s$, which we then add back to the original data sample $\mathbf{n}_i^s$. This results in $\tilde{\mathbf{n}}_i^s = \mathbf{n}_i^s + \mathcal{K} * \mathbf{n}_k^s$, where $*$ denotes the convolutional operation. In the Appendix, we explain calcium dynamics and our calculation of the kernel in more detail.
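The augmentation described above can be illustrated with a short sketch. It treats the sampled frame as an impulse at the start of the sample, so that convolving it with a temporal kernel yields one scaled copy of the frame per time step. The exponential kernel form, its parameters (`frame_rate_hz`, `decay_time_s`) and the function names are assumptions for illustration; the kernel actually used is the one described in the Appendix.

```python
import math

def decay_kernel(num_frames, frame_rate_hz, decay_time_s):
    """Exponentially decaying temporal kernel (assumed form), one weight per frame."""
    return [math.exp(-t / (frame_rate_hz * decay_time_s)) for t in range(num_frames)]

def calcium_augment(sample, past_frame, kernel):
    """Add a temporally decaying copy of `past_frame` onto every frame of `sample`.

    sample: list of T frames, each a flat list of pixel intensities.
    past_frame: a single frame drawn from the same animal.
    kernel: list of T temporal weights; kernel[t] scales past_frame at step t.
    """
    return [[pixel + weight * past for pixel, past in zip(frame, past_frame)]
            for frame, weight in zip(sample, kernel)]
```

The contribution of the injected frame is strongest in the first frame of the sample and fades over time, which imitates residual fluorescence from an earlier, unrelated action.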

3 Experiments

In this section we introduce a new dataset consisting of Drosophila melanogaster neural and behavioral recordings as well as the set of downstream evaluation metrics.

3.1 Dataset

Motion Capture and Two-photon Dataset (MC2P):

We acquired data from tethered adult female flies (Drosophila melanogaster). This dataset consists of neural activity recorded using a two-photon microscope chen18 from the axons of descending neurons passing through the animal's cervical connective. It also includes behavioral data recorded using multi-view infrared cameras (Supplementary Fig. S1; Supplementary Videos 1-2). Specifically, behavioral video data were acquired using a network of six cameras arranged in a circle with the animal at its center. Eight animals and 133 trials were recorded, resulting in 8.2 hours of recordings with 2,975,000 behavioral and 476,000 neural frames. The dataset includes manual and dense action labels of eight behaviors: forward walking, pushing, hindleg grooming, abdominal grooming, rest, foreleg grooming, antennal grooming, and eye grooming. We report the statistics of our dataset in Supplementary Fig. S5. See the Appendix for more details.
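As a quick consistency check on the reported totals, the average frame rates implied by the recording duration and frame counts can be computed directly. These are derived averages over the whole dataset, not stated acquisition settings, and the variable names are illustrative.

```python
# Totals reported for the MC2P dataset.
recording_hours = 8.2
behavioral_frames = 2_975_000
neural_frames = 476_000

seconds = recording_hours * 3600

# Average frames per second implied by the totals.
behavioral_fps = behavioral_frames / seconds
neural_fps = neural_frames / seconds

# Number of behavioral frames per neural frame, relevant when pairing modalities.
frames_ratio = behavioral_frames / neural_frames

print(f"behavioral: {behavioral_fps:.1f} fps, neural: {neural_fps:.1f} fps, "
      f"ratio: {frames_ratio:.2f}")
```

The totals imply roughly 100 behavioral frames and 16 neural frames per second, so each neural frame spans several behavioral frames when the two modalities are paired.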

3.2 Evaluation

To evaluate our unsupervised pretrained neural encoder, we froze its parameters and trained a randomly initialized linear classification layer with SGD. To compare data efficiency, for each setting we evaluated image encoders with 50% and 100% of the data (the 0.5 and 1.0 columns of Table 1). We report results averaged over cross-validation folds for each task. We evaluated models on the following tasks:

Single-Animal Action Recognition:

We performed action recognition on a single domain by training and testing on the same animal. We repeated the same experiment on each of four animals, and report the mean accuracy.
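The linear evaluation protocol used in these tasks (a frozen encoder followed by a linear classifier trained with SGD) can be sketched as below. The sketch assumes the encoder outputs are already computed and passed in as fixed feature vectors; the learning rate, epoch count, initialization scale and function names are hypothetical choices, not values from the paper.

```python
import math
import random

def softmax(logits):
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def train_linear_probe(features, labels, num_classes, lr=0.1, epochs=100, seed=0):
    """Train a linear softmax classifier with per-sample SGD on fixed features.

    `features` are the frozen encoder outputs; only the linear layer is updated.
    """
    rng = random.Random(seed)
    dim = len(features[0])
    weights = [[rng.gauss(0.0, 0.01) for _ in range(dim)] for _ in range(num_classes)]
    biases = [0.0] * num_classes
    order = list(range(len(features)))
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            x, y = features[i], labels[i]
            probs = softmax([sum(w * v for w, v in zip(weights[c], x)) + biases[c]
                             for c in range(num_classes)])
            for c in range(num_classes):
                # Gradient of the cross-entropy loss with respect to logit c.
                grad = probs[c] - (1.0 if c == y else 0.0)
                biases[c] -= lr * grad
                for d in range(dim):
                    weights[c][d] -= lr * grad * x[d]
    return weights, biases

def predict(weights, biases, x):
    """Return the class index with the highest linear score."""
    scores = [sum(w * v for w, v in zip(weights[c], x)) + biases[c]
              for c in range(len(weights))]
    return max(range(len(scores)), key=lambda c: scores[c])
```

Because the classifier is linear, its accuracy reflects how linearly separable the action classes are in the frozen representation rather than the capacity of the classifier itself.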

Multi-Animal Action Recognition:

We evaluated models on their ability to reduce the domain gap. We trained the linear classifier on N-1 animals and tested on the left-out one, leaving each animal out one at a time.
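The leave-one-animal-out protocol can be sketched as a simple split generator. The data layout (a dictionary from animal identifier to a list of samples) and the function name are assumptions for illustration.

```python
def leave_one_animal_out(samples_by_animal):
    """Yield (held_out_id, train_samples, test_samples) for each animal in turn.

    The classifier is trained on all animals except `held_out_id` and tested
    on the held-out animal only.
    """
    for held_out in sorted(samples_by_animal):
        train = [sample
                 for animal, samples in sorted(samples_by_animal.items())
                 if animal != held_out
                 for sample in samples]
        test = list(samples_by_animal[held_out])
        yield held_out, train, test
```

Each animal appears in the test set exactly once, so averaging accuracy over the splits measures how well the representation transfers to an animal never seen by the classifier.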

Identity Recognition:

We classified animal identity from among the eight animals. We sampled 1000 random data-points uniformly across animals and applied 4-fold cross validation. In the case that the learned representations are domain (subject) invariant, we expect that the linear classifier will not be able to detect the domain of the representations, resulting in a lower identity recognition accuracy.

4 Results

We present action recognition results from neural imaging data in Table 1 and identity recognition task results in Table S2. For the supervised baseline, we trained an MLP with manually annotated action labels using cross-entropy loss, with the raw neural data as input, and show the results in the "Raw" section of Table 1. In the "Self-Supervised" section, without the proposed augmentations, the contrastive method SimCLR performed worse than convolutional and recurrent regression-based methods, including the current state-of-the-art BehaveNet behavenet . Although the domain adaptation methods MMD (Maximum Mean Discrepancy) and GRL (Gradient Reversal Layer) close the domain gap and lower identity recognition accuracy, they do not position semantically similar points near one another (Supplementary Fig. S4). As a result, domain adaptation-based methods do not yield significant improvements in the action recognition task. Although regression-based methods suffer less from the domain gap problem, they do not produce representations that are as discriminative as those of contrastive learning-based methods. The same trend is observed in Table S2. Our proposed set of augmentations closes the domain gap while significantly improving on the action recognition baseline for self-supervised methods, for both single-animal and multi-animal tasks. We include detailed information about the baselines in the Appendix.
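To make the MMD baseline mentioned above concrete, the following is a minimal sketch of a biased estimate of the squared Maximum Mean Discrepancy between embeddings from two animals. The Gaussian (RBF) kernel, the bandwidth value and the function names are assumptions; the configuration used for the baseline is the one described in the Appendix.

```python
import math

def rbf_kernel(x, y, bandwidth):
    """Gaussian (RBF) kernel between two equal-length vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))

def mmd_squared(xs, ys, bandwidth=1.0):
    """Biased estimate of squared MMD between two samples of embeddings.

    xs and ys are lists of embedding vectors, e.g. from two different animals.
    """
    def mean_kernel(a, b):
        return sum(rbf_kernel(p, q, bandwidth) for p in a for q in b) / (len(a) * len(b))

    return mean_kernel(xs, xs) + mean_kernel(ys, ys) - 2.0 * mean_kernel(xs, ys)
```

Used as a training penalty, this quantity pulls the embedding distributions of different animals together, but it does not by itself require that samples of the same action end up near one another.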

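| Group | Method | Single-Animal (0.5 of data) | Single-Animal (1.0 of data) | Multi-Animal (0.5 of data) | Multi-Animal (1.0 of data) |
| --- | --- | --- | --- | --- | --- |
| | Random Guess | 16.6 | 16.6 | 16.6 | 16.6 |
| Raw | Neural (Linear) | 29.3 | 32.5 | 18.4 | 18.4 |
| Raw | Neural (MLP) | - | - | 18.4 | 18.4 |
| Self-Supervised | SimCLR simclr | 54.3 | 57.6 | 46.9 | 50.6 |
| Self-Supervised | Regression (Recurr.) | 53.6 | 59.7 | 49.4 | 51.2 |
| Self-Supervised | Regression (Conv.) | 52.6 | 59.6 | 50.6 | 55.8 |
| Self-Supervised | BehaveNet behavenet | 54.6 | 60.2 | 50.5 | 56.8 |
| Self-Supervised | Ours | 57.9 | 63.3 | 54.8 | 61.9 |
| Domain Ada. | SimCLR simclr + MMD | 53.6 | 57.8 | 50.1 | 53.1 |
| Domain Ada. | SimCLR simclr + GRL | 53.5 | 56.3 | 49.9 | 52.3 |
| Domain Ada. | Regression (Conv.) + MMD | 54.5 | 60.7 | 52.6 | 55.4 |
| Domain Ada. | Regression (Conv.) + GRL | 55.5 | 60.2 | 51.8 | 55.7 |

Table 1: Action Recognition Accuracy. Single- and multi-animal action recognition results on the MC2P dataset. Behavioral and Neural MLP results for the single-animal task are removed because single animals often do not have enough labels for every action.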

5 Conclusion

We introduced an unsupervised neural action representation framework. We extended previous methods by establishing a set of augmentations that we show overcome the multimodal domain gap in our Drosophila behavioral-neural dataset. Finally, we will share our dataset in order to accelerate the application of self-supervised learning methods in neuroscience. In future work, we aim to extend our approach to domain generalization.


  • (1) Daniel A. Dombeck, Anton N. Khabbaz, Forrest Collman, Thomas L. Adelman, and David W. Tank. Imaging large-scale neural activity with cellular resolution in awake, mobile mice. Neuron, 56(1):43 – 57, 2007.
  • (2) Johannes D Seelig, M Eugenia Chiappe, Gus K Lott, Anirban Dutta, Jason E Osborne, Michael B Reiser, and Vivek Jayaraman. Two-photon calcium imaging from head-fixed Drosophila during optomotor walking behavior. Nature Methods, 7(7):535–540, 2010.
  • (3) Chin-Lin Chen, Laura Hermans, Meera C Viswanathan, Denis Fortun, Florian Aymanns, Michael Unser, Anthony Cammarato, Michael H Dickinson, and Pavan Ramdya. Imaging neural activity in the ventral nerve cord of behaving adult drosophila. Nature Communications, 9(1):4390, 2018.
  • (4) Chethan Pandarinath, Daniel J. O’Shea, Jasmine Collins, Rafal Jozefowicz, Sergey D. Stavisky, Jonathan C. Kao, Eric M. Trautmann, Matthew T. Kaufman, Stephen I. Ryu, Leigh R. Hochberg, Jaimie M. Henderson, Krishna V. Shenoy, L. F. Abbott, and David Sussillo. Inferring single-trial neural population dynamics using sequential auto-encoders. Nature Methods, 15(10):805–815, 2018.
  • (5) A. S. Ecker, P. Berens, G. A. Keliris, M. Bethge, N. K. Logothetis, and A. S. Tolias. Decorrelated neuronal firing in cortical microcircuits. Science, 327(5965):584–587, 2010.
  • (6) Uros Topalovic, Zahra M. Aghajan, Diane Villaroman, Sonja Hiller, Leonardo Christov-Moore, Tyler J. Wishard, Matthias Stangl, Nicholas R. Hasulak, Cory S. Inman, Tony A. Fields, Vikram R. Rao, Dawn Eliashiv, Itzhak Fried, and Nanthia Suthana. Wireless programmable recording and stimulation of deep brain activity in freely moving humans. Neuron, 108(2):322–334.e9, 2020.
  • (7) Talmo D Pereira, Diego E Aldarondo, Lindsay Willmore, Mikhail Kislin, Samuel S H Wang, Mala Murthy, and Joshua W Shaevitz. Fast animal pose estimation using deep neural networks. Nature Methods, 16(1):117–125, 2019.
  • (8) Alexander Mathis, Pranav Mamidanna, Kevin M Cury, Taiga Abe, Venkatesh N Murthy, Mackenzie Weygandt Mathis, and Matthias Bethge. DeepLabCut: markerless pose estimation of user-defined body parts with deep learning. Nature Neuroscience, 21(9):1281–1289, 2018.
  • (9) Semih Günel, Helge Rhodin, Daniel Morales, João Campagnolo, Pavan Ramdya, and Pascal Fua. DeepFly3D, a deep learning-based approach for 3D limb and appendage tracking in tethered, adult Drosophila. eLife, 8:3686, 2019.
  • (10) Praneet C. Bala, Benjamin R. Eisenreich, Seng Bum Michael Yoo, Benjamin Y. Hayden, Hyun Soo Park, and Jan Zimmermann. Automated markerless pose estimation in freely moving macaques with openmonkeystudio. Nature Communications, 11(1):4560, 2020.
  • (11) Alejandro Newell, Kaiyu Yang, and Jia Deng. Stacked hourglass networks for human pose estimation. In Proceedings of the European Conference on Computer Vision (ECCV), 2016.
  • (12) Jacob M Graving, Daniel Chae, Hemal Naik, Liang Li, Benjamin Koger, Blair R Costelloe, and Iain D Couzin. Deepposekit, a software toolkit for fast and robust animal pose estimation using deep learning. eLife, 8:e47994, 2019.
  • (13) Hao-Shu Fang, Shuqin Xie, Yu-Wing Tai, and Cewu Lu. RMPE: Regional multi-person pose estimation. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2017.
  • (14) Shih-En Wei, Varun Ramakrishna, Takeo Kanade, and Yaser Sheikh. Convolutional pose machines. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
  • (15) Zhe Cao, Tomas Simon, Shih-En Wei, and Yaser Sheikh. Realtime multi-person 2d pose estimation using part affinity fields. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
  • (16) Adam Gosztolai, Semih Günel, Victor Lobato-Ríos, Marco Pietro Abrate, Daniel Morales, Helge Rhodin, Pascal Fua, and Pavan Ramdya. Liftpose3d, a deep learning-based approach for transforming two-dimensional to three-dimensional poses in laboratory animals. Nature Methods, 18(8):975–981, 2021.
  • (17) Siyuan Li, Semih Günel, Mirela Ostrek, Pavan Ramdya, Pascal Fua, and Helge Rhodin. Deformation-aware unpaired image translation for pose estimation on laboratory animals. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020.
  • (18) Jennifer J Sun, Ann Kennedy, Eric Zhan, David J Anderson, Yisong Yue, and Pietro Perona. Task programming: Learning data efficient behavior representations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021.
  • (19) Cristina Segalin, Jalani Williams, Tomomi Karigo, May Hui, Moriel Zelikowsky, Jennifer J. Sun, Pietro Perona, David J. Anderson, and Ann Kennedy. The mouse action recognition system (mars): a software pipeline for automated analysis of social behaviors in mice. bioRxiv, 2020.
  • (20) Katherine Overman, Daniel Choi, Kawai Leung, Joshua Shaevitz, and Gordon Berman. Measuring the repertoire of age-related behavioral changes in drosophila melanogaster. bioRxiv, 2021.
  • (21) Talmo D. Pereira, Joshua W. Shaevitz, and Mala Murthy. Quantifying behavior to understand the brain. Nature Neuroscience, 23(12):1537–1549, 2020.
  • (22) Robert Evan Johnson, Scott Linderman, Thomas Panier, Caroline Lei Wee, Erin Song, Kristian Joseph Herrera, Andrew Miller, and Florian Engert. Probabilistic models of larval zebrafish behavior reveal structure on many scales. Current Biology, 30(1):70–82.e4, 2020.
  • (23) Josue Nassar, Scott W. Linderman, M. Bugallo, and Il-Su Park. Tree-structured recurrent switching linear dynamical systems for multi-scale modeling. arXiv, 2019.
  • (24) Scott Linderman, Annika Nichols, David Blei, Manuel Zimmer, and Liam Paninski. Hierarchical recurrent state space models reveal discrete and continuous dynamics of neural activity in c. elegans. bioRxiv, 2019.
  • (25) Scott Linderman, Matthew Johnson, Andrew Miller, Ryan Adams, David Blei, and Liam Paninski. Bayesian Learning and Inference in Recurrent Switching Linear Dynamical Systems. In Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS), 2017.
  • (26) Yuanjun Gao, Evan Archer, L. Paninski, and J. Cunningham. Linear dynamical neural population models through nonlinear embeddings. In Advances in Neural Information Processing Systems (NeurIPS), 2016.
  • (27) Eleanor Batty, Matthew Whiteway, Shreya Saxena, Dan Biderman, Taiga Abe, Simon Musall, Winthrop Gillis, Jeffrey Markowitz, Anne Churchland, John P Cunningham, et al. Behavenet: nonlinear embedding and bayesian neural decoding of behavioral videos. In Advances in Neural Information Processing Systems (NeurIPS), 2019.
  • (28) Omid G. Sani, Hamidreza Abbaspourazad, Yan T. Wong, Bijan Pesaran, and Maryam M. Shanechi. Modeling behaviorally relevant neural dynamics enabled by preferential subspace identification. Nature Neuroscience, 24(1):140–149, 2021.
  • (29) Joshua I. Glaser, Ari S. Benjamin, Raeed H. Chowdhury, Matthew G. Perich, Lee E. Miller, and Konrad P. Kording. Machine learning for neural decoding. eNeuro, 7(4), 2020.
  • (30) Alice A. Robie, Jonathan Hirokawa, Austin W. Edwards, Lowell A. Umayam, Allen Lee, Mary L. Phillips, Gwyneth M. Card, Wyatt Korff, Gerald M. Rubin, Julie H. Simpson, Michael B. Reiser, and Kristin Branson. Mapping the neural substrates of behavior. Cell, 170(2):393–406.e28, 2017.
  • (31) Simon Musall, Matthew T. Kaufman, Ashley L. Juavinett, Steven Gluf, and Anne K. Churchland. Single-trial neural dynamics are dominated by richly varied movements. Nature Neuroscience, 22(10):1677–1686, 2019.
  • (32) Carsen Stringer, Marius Pachitariu, Nicholas Steinmetz, Charu Bai Reddy, Matteo Carandini, and Kenneth D Harris. Spontaneous behaviors drive multidimensional, brainwide activity. Science, 364(6437):255–255, 2019.
  • (33) Shixian Wen, Allen Yin, Po-He Tseng, Laurent Itti, Mikhail A. Lebedev, and Miguel Nicolelis. Capturing spike train temporal pattern with wavelet average coefficient for brain machine interface. Scientific Reports, 11(1):19020, 2021.
  • (34) Celia K S Lau, Meghan Jelen, and Michael D Gordon. A closed-loop optogenetic screen for neurons controlling feeding in drosophila. G3 (Bethesda), 11(5), 05 2021.
  • (35) Jonathan Munro and Dima Damen. Multi-modal Domain Adaptation for Fine-grained Action Recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020.
  • (36) Tengda Han, Weidi Xie, and Andrew Zisserman. Self-supervised co-training for video representation learning. In Advances in Neural Information Processing Systems (NeurIPS), 2020.
  • (37) Humam Alwassel, Dhruv Mahajan, Bruno Korbar, Lorenzo Torresani, Bernard Ghanem, and Du Tran. Self-supervised learning by cross-modal audio-video clustering. In Advances in Neural Information Processing Systems (NeurIPS), 2020.
  • (38) Yuki M. Asano, Christian Rupprecht, and Andrea Vedaldi. Self-labelling via simultaneous clustering and representation learning. In International Conference on Learning Representations (ICLR), 2020.
  • (39) Relja Arandjelović and Andrew Zisserman. Objects that sound. In Proceedings of the European Conference on Computer Vision (ECCV), 2018.
  • (40) Yuki M. Asano, Mandela Patrick, Christian Rupprecht, and Andrea Vedaldi. Labelling unlabelled videos from scratch with multi-modal self-supervision. In Proceedings of Advances in Neural Information Processing Systems (NeurIPS), 2020.
  • (41) Yuejiang Liu, Qi Yan, and Alexandre Alahi. Social nce: Contrastive learning of socially-aware motion representations. arXiv, 2020.
  • (42) Kun Su, Xiulong Liu, and Eli Shlizerman. Predict & cluster: Unsupervised skeleton based action recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020.
  • (43) Lilang Lin, Sijie Song, Wenhan Yang, and Jiaying Liu. MS2L: Multi-task self-supervised learning for skeleton based action recognition. In Proceedings of the ACM International Conference on Multimedia, 2020.
  • (44) Krishna Chaitanya, Ertunc Erdil, Neerav Karani, and Ender Konukoglu. Contrastive learning of global and local features for medical image segmentation with limited annotations. In Proceedings of Advances in Neural Information Processing Systems (NeurIPS), 2020.
  • (45) Yuhao Zhang, Hang Jiang, Yasuhide Miura, Christopher D. Manning, and Curtis P. Langlotz. Contrastive learning of medical visual representations from paired images and text. arXiv, 2020.
  • (46) Tian Pan, Yibing Song, Tianyu Yang, Wenhao Jiang, and Wei Liu. Videomoco: Contrastive video representation learning with temporally adversarial examples. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021.
  • (47) I. Dave, Rohit Gupta, M. N. Rizve, and M. Shah. TCLR: Temporal contrastive learning for video representation. arXiv, 2021.
  • (48) Sina Honari, Victor Constantin, Helge Rhodin, Mathieu Salzmann, and Pascal Fua. Unsupervised learning on monocular videos for 3d human pose estimation. arXiv, 2020.
  • (49) Rahul Mitra, Nitesh B Gundavarapu, Abhishek Sharma, and Arjun Jain. Multiview-consistent semi-supervised learning for 3d human pose estimation. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2020.
  • (50) Guoliang Kang, Lu Jiang, Yunchao Wei, Yi Yang, and Alexander G Hauptmann. Contrastive adaptation network for single-and multi-source domain adaptation. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2020.
  • (51) Rui Wang, Zuxuan Wu, Zejia Weng, Jingjing Chen, Guo-Jun Qi, and Yu-Gang Jiang. Cross-domain contrastive learning for unsupervised domain adaptation. arXiv, 2021.
  • (52) Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive predictive coding. arXiv, 2019.
  • (53) Xin Yuan, Zhe Lin, Jason Kuen, Jianming Zhang, Yilin Wang, Michael Maire, Ajinkya Kale, and Baldo Faieta. Multimodal contrastive training for visual representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021.
  • (54) Jason Wei and Kai Zou. EDA: Easy data augmentation techniques for boosting performance on text classification tasks. In Proceedings of the Conference on Empirical Methods in Natural Language Processing and the International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 2019.
  • (55) Chin-Lin Chen, Laura Hermans, Meera C. Viswanathan, Denis Fortun, Florian Aymanns, Michael Unser, Anthony Cammarato, Michael H. Dickinson, and Pavan Ramdya. Imaging neural activity in the ventral nerve cord of behaving adult drosophila. Nature Communications, 9(1):4390, 2018.
  • (56) Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework for contrastive learning of visual representations. In Proceedings of the International Conference on Machine Learning (ICML), 2020.
  • (57) Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations, (ICLR), 2015.
  • (58) Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proceedings of the International Conference on Machine Learning (ICML), 2015.
  • (59) Dzmitry Bahdanau, Kyung Hyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. In Proceedings of the International Conference on Machine Learning (ICML), 2015.
  • (60) Jessica Cande, Shigehiro Namiki, Jirui Qiu, Wyatt Korff, Gwyneth M Card, Joshua W Shaevitz, David L Stern, and Gordon J Berman. Optogenetic dissection of descending behavioral control in Drosophila. eLife, 7:970, 2018.
  • (61) Florian Aymanns. utils2p, Sep 2021.
  • (62) Florian Aymanns. ofco: optical flow motion correction, Sep 2021.
  • (63) Jérôme Lecoq, Michael Oliver, Joshua H. Siegle, Natalia Orlova, and Christof Koch. Removing independent noise in systems neuroscience data using DeepInterpolation. bioRxiv, 2020.
  • (64) Victor Lobato-Rios, Pembe Gizem Özdil, Shravan Tata Ramalingasetty, Jonathan Arreguit, Auke Jan Ijspeert, and Pavan Ramdya. NeuroMechFly, a neuromechanical model of adult Drosophila melanogaster. bioRxiv, 2021.
  • (65) Eftychios A. Pnevmatikakis, Josh Merel, Ari Pakman, and Liam Paninski. Bayesian spike inference from calcium imaging data. arXiv, 2013.
  • (66) Peter Rupprecht, Stefano Carta, Adrian Hoffmann, Mayumi Echizen, Antonin Blot, Alex C. Kwan, Yang Dan, Sonja B. Hofer, Kazuo Kitamura, Fritjof Helmchen, and Rainer W. Friedrich. A database and deep learning toolbox for noise-optimized, generalized spike inference from calcium imaging. Nature Neuroscience, 24(9):1324–1337, 2021.
  • (67) Arthur Gretton, Karsten M. Borgwardt, Malte J. Rasch, Bernhard Schölkopf, and Alexander J. Smola. A kernel method for the two-sample-problem. In Proceedings of Advances in Neural Information Processing Systems (NeurIPS), 2006.
  • (68) Yaroslav Ganin and Victor Lempitsky. Unsupervised domain adaptation by backpropagation. In Proceedings of the International Conference on Machine Learning (ICML), 2015.


  1. For all authors…

    1. Do the main claims made in the abstract and introduction accurately reflect the paper’s contributions and scope?

    2. Did you describe the limitations of your work? Please see the Conclusion section.

    3. Did you discuss any potential negative societal impacts of your work? Please see the Broader Impact Statement Section.

    4. Have you read the ethics review guidelines and ensured that your paper conforms to them?

  2. If you are including theoretical results…

    1. Did you state the full set of assumptions of all theoretical results?

    2. Did you include complete proofs of all theoretical results?

  3. If you ran experiments…

    1. Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? We include instructions to download and use our dataset in the supplementary materials.

    2. Did you specify all the training details (e.g., data splits, hyperparameters, how they were chosen)? Please see the appendix, particularly the implementation details section.

    3. Did you report error bars (e.g., with respect to the random seed after running experiments multiple times)? We use cross-validation and report the mean accuracy. Please see the appendix, the implementation details section.

    4. Did you include the total amount of compute and the type of resources used (e.g., type of GPUs, internal cluster, or cloud provider)? Please see the appendix, the implementation details section.

  4. If you are using existing assets (e.g., code, data, models) or curating/releasing new assets…

    1. If your work uses existing assets, did you cite the creators?

    2. Did you mention the license of the assets? We include the license of our dataset in the supplementary material.

    3. Did you include any new assets either in the supplemental material or as a URL? We include instructions to download and use our dataset in the supplementary materials.

    4. Did you discuss whether and how consent was obtained from people whose data you’re using/curating?

    5. Did you discuss whether the data you are using/curating contains personally identifiable information or offensive content?

  5. If you used crowdsourcing or conducted research with human subjects…

    1. Did you include the full text of instructions given to participants and screenshots, if applicable?

    2. Did you describe any potential participant risks, with links to Institutional Review Board (IRB) approvals, if applicable?

    3. Did you include the estimated hourly wage paid to participants and the total amount spent on participant compensation?

Appendix for Overcoming the Domain Gap in
Contrastive Learning of Neural Action Representations

Appendix A Implementation Details

Aside from the augmentations mentioned in the main text, for the image transformation family, we used a sequential application of Poisson noise, Gaussian blur, and color jittering. In contrast with recent work on contrastive visual representation learning, we only applied brightness and contrast adjustments in color jittering because neural images have a single channel that measures calcium indicator fluorescence intensity. We did not apply any cropping augmentation, such as cutout, because action representation is often highly localized and non-redundant (e.g., grooming is associated with the activity of a small set of neurons and thus with only a small number of pixels). We did not apply affine transformations because they remove absolute location information, which is essential for associating neural data with behavioral information (e.g., left-turning is associated with higher activity of descending neurons on the right side of the connective). We applied the same augmentations to each frame in a single sample of neural data.

For the behavior transformation family, we used a sequential application of scaling, shear, and random temporal and spatial dropping. We did not apply rotation and translation augmentations because the animals were tethered (i.e., restrained from moving freely), and their direction and absolute location were kept fixed throughout the experiment. We did not use time warping because neural and behavioral information are temporally linked (e.g., fast walking has a different neural representation than slow walking).
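The behavioral augmentations can be illustrated with a minimal sketch. It is not the authors' implementation: the function name, the parameter values, and the choice to implement "dropping" by zeroing entries are all assumptions made for illustration, and shear is omitted for brevity.

```python
import random

def augment_pose_sequence(poses, scale_range=(0.9, 1.1),
                          p_time=0.1, p_joint=0.1, rng=random):
    """Hypothetical behavior augmentation (illustrative parameters).

    poses : list of T frames, each a list of joint values.
    One random scale is drawn for the whole sequence, so the temporal
    structure is left untouched (no time warping). Temporal dropping zeroes
    whole frames; spatial dropping zeroes the same joints in every frame.
    """
    n_joints = len(poses[0])
    scale = rng.uniform(*scale_range)
    dropped_joints = {j for j in range(n_joints) if rng.random() < p_joint}
    augmented = []
    for frame in poses:
        if rng.random() < p_time:
            # Temporal dropping: discard this frame entirely.
            augmented.append([0.0] * n_joints)
            continue
        augmented.append([0.0 if j in dropped_joints else scale * v
                          for j, v in enumerate(frame)])
    return augmented
```

Rotation and translation are deliberately absent from the sketch, in line with the reasoning given for tethered animals.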

For all methods, we initialized the weights of the networks randomly unless otherwise specified. To keep the experiments consistent, we always paired frames of neural data with frames of behavioral data. For the neural data, we used a larger time window because the timescale during which dynamic changes occur is smaller. For the paired modalities, we considered data synchronized if their center frames had the same timestamp. We trained contrastive methods for epochs and set the temperature value to . We set the output dimension of and to . We used a cosine training schedule with three epochs of warm-up. For non-contrastive methods, we trained for epochs with a learning rate of , and a weight decay of , using the Adam optimizer adam . We ran all experiments using an Intel Core i9-7900X CPU, 32 GB of DDR4 RAM, and a GeForce GTX 1080. Training a single SimCLR network for 200 epochs took 12 hours. To create train and test splits, we removed two trials from each animal and used them only for testing. For the domain adaptation methods GRL and MMD, we reformulated the denominator of the contrastive loss function. Given a domain function which gives the domain of the data sample, we replaced one side of in Eq. 1 with,


where selective negative sampling prevents forming trivial negative pairs across domains, therefore making it easier to merge multiple domains. Negative pairs formed during contrastive learning try to push away inter-domain pairs, whereas domain adaptation methods try to merge multiple domains to close the domain gap. We found that the training of contrastive and domain adaptation losses together could be quite unstable, unless the above changes were made to the contrastive loss function.
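The idea of selective negative sampling can be sketched as follows. This is a minimal illustration, not the paper's exact expression: it assumes that the denominator of one direction of the InfoNCE loss is restricted to samples from the same domain (animal) as the anchor, and the function name, the temperature value, and the data layout are hypothetical.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def masked_info_nce(z_b, z_n, domains, tau=0.1):
    """One direction (behavior -> neural) of a domain-masked InfoNCE loss.

    z_b, z_n : lists of L2-normalized embeddings, paired by index.
    domains  : domain (animal) identifier of each pair.
    tau      : temperature (illustrative value, not the paper's).
    """
    n = len(z_b)
    total = 0.0
    for i in range(n):
        positive = math.exp(dot(z_b[i], z_n[i]) / tau)
        # Selective negative sampling: only same-domain samples enter the
        # denominator, so no negative pairs are formed across domains.
        denominator = sum(
            math.exp(dot(z_b[i], z_n[k]) / tau)
            for k in range(n)
            if domains[k] == domains[i]
        )
        total += -math.log(positive / denominator)
    return total / n
```

With this masking, the contrastive term no longer pushes apart samples from different animals, so it does not directly oppose a domain-merging objective such as GRL or MMD.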

We used the architecture shown in Supplementary Table S1 for the neural image and behavioral pose encoders. Each layer except the final fully-connected layer was followed by Batch Normalization and a ReLU activation function batchnorm . For the self-attention mechanism in the behavioral encoder (Supplementary Table S1), we implemented Bahdanau attention bahdanau . Given the set of intermediate behavioral representations , we first calculated,

where and are a set of matrices of shape and respectively. is the score assigned to the i-th pose in the sequence of motion. Then the final representation is given by .
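Additive (Bahdanau-style) attention pooling over a pose sequence can be sketched as below. This is a minimal illustration under stated assumptions: the names `S`, `W1`, and `W2`, the hidden size, and the use of a softmax to normalize the scores are assumptions, not the paper's exact formulation.

```python
import math

def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def attention_pool(S, W1, W2):
    """Additive attention pooling over a sequence of feature vectors.

    S  : list of T intermediate representations, each of length D.
    W1 : H x D matrix given as a list of rows (hypothetical shape).
    W2 : vector of length H mapping the hidden layer to a scalar score.
    Returns (weights, pooled); the weights are non-negative and sum to 1.
    """
    scores = []
    for s in S:
        hidden = [math.tanh(h) for h in matvec(W1, s)]
        scores.append(sum(w * h for w, h in zip(W2, hidden)))
    # Numerically stable softmax over the T scores.
    peak = max(scores)
    exps = [math.exp(r - peak) for r in scores]
    norm = sum(exps)
    weights = [e / norm for e in exps]
    dim = len(S[0])
    pooled = [sum(a * s[d] for a, s in zip(weights, S)) for d in range(dim)]
    return weights, pooled
```

The pooled vector is a weighted average of the per-pose representations, so poses with higher scores contribute more to the final behavioral representation.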

(a) First part of the Neural Encoder
Layer        # filters   Kernel (K)   Stride (S)
input        1           -            -
conv1        2           (3,3)        (1,1)
mp2          -           (2,2)        (2,2)
conv3        4           (3,3)        (1,1)
mp4          -           (2,2)        (2,2)
conv5        8           (3,3)        (1,1)
mp6          -           (2,2)        (2,2)
conv7        16          (3,3)        (1,1)
mp8          -           (2,2)        (2,2)
conv9        32          (3,3)        (1,1)
mp10         -           (2,2)        (2,2)
conv11       64          (3,3)        (1,1)
mp12         -           (2,2)        (2,2)
fc13         128         (1,1)        (1,1)
fc14         128         (1,1)        (1,1)

(b) Behavior Encoder
Layer        # filters   Kernel (K)   Stride (S)
input        60          -            -
conv1        64          (3)          (1)
conv2        80          (3)          (1)
mp2          -           (2)          (2)
conv2        96          (3)          (1)
conv2        112         (3)          (1)
conv2        128         (3)          (1)
attention6   -           (1)          (1)
fc7          128         (1)          (1)

Table S1: Architecture details. Shown are (a) the first half of the neural encoder and (b) the behavior encoder. How these encoders are used is shown in Supplementary Figure S2. The first half of the neural encoder is followed by 1D convolutions similar to those of the behavioral encoder, with the number of filters replaced. Both encoders produce outputs of the same dimensionality, while the first half of the neural encoder does not downsample along the temporal axis. mp denotes a max-pooling layer. Batch Normalization and ReLU activation are added after every convolutional layer.

Appendix B Dataset Details

Here we provide a more detailed technical explanation of the experimental dataset. Transgenic female Drosophila melanogaster flies aged 2-4 days post eclosion were selected for experiments. They were raised on a 12 h:12 h day:night light cycle and recorded in either the morning or late afternoon Zeitgeber time. Flies expressed both GCaMP6s and tdTomato in all brain neurons targeted by otd-Gal4. The fluorescence of GCaMP6s proteins within a neuron increases when they bind calcium. Intracellular calcium increases when neurons are active and fire action potentials. Due to the relatively slow release (as opposed to binding) of calcium by GCaMP6s molecules, the signal decays exponentially. We also expressed the red fluorescent protein, tdTomato, in the same neurons as an anatomical fiducial to be used for neural data registration that compensates for image deformations and translations during animal movements. We recorded neural data using a two-photon microscope (ThorLabs, Germany; Bergamo2) by scanning the cervical connective. This neural tissue serves as a conduit between the brain and ventral nerve cord (VNC) chen2018imaging . The brain-only GCaMP6s expression pattern, in combination with the restriction of recording to the cervical connective, allowed us to record a large population of descending neuron axons while also being certain that none of the axons arose from ascending neurons in the VNC. Because descending neurons are expected to drive ongoing actions Cande , this imaging approach has the added benefit of ensuring that the imaged cells could, in principle, relate to paired behavioral data.

For neural data processing, raw microscope files were first converted into *.tiff files. These data were then synchronized using a custom Python package aymanns21utils2p . We then estimated the motion of the neurons using images acquired on the red (tdTomato) PMT channel. The first image of the first trial was selected as a reference frame to which all other frames were registered. For image registration, we estimated the vector field describing the motion between two frames. To do this, we numerically solved the optimization problem in Eq. 4, where is the motion field, is the image being transformed, is the reference image, and is the set of all pixel coordinates chen2018imaging ; aymanns21ofco .


A smoothness-promoting parameter was empirically set to 800. We then applied to the green PMT channel (GCaMP6s). To denoise the motion-corrected green signal, we trained a DeepInterpolation network deepinterpolation for nine epochs for each fly and applied it to the rest of the frames. We only used the first 100 frames of each trial and used the first and last trials as validation data. The batch size was set to 20 and we used 30 frames before and after the current frame as input. To obtain a direct correlation between pixel intensity and neuronal activity, we applied the following transformation to all neural images , where is the baseline fluorescence in the absence of neural activity. To estimate , we used the pixel-wise minimum of a moving average of 15 frames.
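The baseline estimation can be sketched for a single pixel trace. This is a minimal illustration that assumes the transformation is the conventional relative fluorescence change (F - F0) / F0; the function names are hypothetical, and any additional scaling used in the paper (e.g., to percent) is not reproduced.

```python
def moving_average(trace, window=15):
    """Moving average over all full windows of `window` consecutive frames."""
    if len(trace) < window:
        raise ValueError("trace is shorter than the averaging window")
    running = sum(trace[:window])
    averages = [running / window]
    for t in range(window, len(trace)):
        running += trace[t] - trace[t - window]
        averages.append(running / window)
    return averages

def delta_f_over_f(trace, window=15):
    """Relative fluorescence change of one pixel over time.

    The baseline F0 is the minimum of the moving average, as described in
    the text; the (F - F0) / F0 form is an assumption.
    """
    f0 = min(moving_average(trace, window))
    if f0 == 0:
        raise ValueError("baseline fluorescence is zero")
    return [(f - f0) / f0 for f in trace]
```

Applied independently to every pixel, this yields images whose values are zero at baseline and grow with calcium-dependent fluorescence.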

We calibrated the camera rig and extracted 3D poses including 38 landmarks from each animal from RGB video data using DeepFly3D Gunel . We then further preprocessed the 3D data by extracting the anchor (thorax-coxa) joints from each leg and normalizing the range of the data between . Finally, we applied inverse kinematics to convert 3D poses into Euler angles using Lobato-Rios2021.04.17.440214 . Unlike human action datasets with scripted actions and a uniform distribution over time, our MC2P dataset is more challenging to analyze because it includes spontaneous and unscripted animal actions with heavy-tailed time and action distributions (Supplementary Fig. S5).

Calcium Dynamics:

The relationship between the calcium signal and neural activity can be modeled as a first-order autoregressive process


is a binary variable indicating an event at time (e.g., the neuron firing an action potential). The amplitudes and determine the rate at which the signal decays and the initial response to an event, respectively. In general, , therefore resulting in an exponential decay of information pertaining to to be inside of . A single neural image includes decaying information from previous neural activity, and hence carries information from past behaviors. For more detailed information on calcium dynamics, see pnevmatikakis2013bayesian ; Rupprecht21 . Assuming no neural firings, , is given by . Therefore, we define the calcium kernel as .
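A first-order autoregressive calcium model of this kind can be simulated in a few lines. The sketch below assumes the common form c_t = gamma * c_{t-1} + alpha * s_t with 0 < gamma < 1; the symbol names and the parameter values are assumptions chosen for illustration.

```python
def simulate_calcium(events, gamma=0.9, alpha=1.0, c0=0.0):
    """Simulate c_t = gamma * c_{t-1} + alpha * s_t.

    events : sequence of binary values s_t (1 when the neuron fires).
    gamma  : decay rate per frame, assumed to lie in (0, 1).
    alpha  : amplitude of the initial response to an event.
    """
    c, trace = c0, []
    for s in events:
        c = gamma * c + alpha * s
        trace.append(c)
    return trace

def calcium_kernel(gamma, length):
    """Decay kernel gamma**t for t = 0 .. length-1 (no further events)."""
    return [gamma ** t for t in range(length)]
```

A single event followed by silence reproduces the kernel scaled by alpha, which shows how one neural image still carries a decaying trace of earlier activity.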

Appendix C Baseline Methods

We compare our method with a set of baseline methods previously applied to behavioral-neural datasets.

Method                      Identity Recog.    Identity Recog.
                            (0.5, Accuracy)    (1.0, Accuracy)
Random Guess                12.5               12.5
Behavior (Linear)           88.6               89.7
Neural (Linear)             100.0              100.0
SimCLR simclr               69.9               80.3
Regression (Recurrent)      89.5               91.8
Regression (Convolution)    88.7               92.5
BehaveNet behavenet         80.2               83.4
Ours                        12.5               12.5
SimCLR + MMD MMD            18.4               21.2
SimCLR + GRL GRL            16.7               19.1
Table S2: Identity Recognition task. Comparison of neural representation learning approaches on an Identity Recognition task. Smaller values reflect better representations.


We trained a single feedforward network with manually annotated action labels using cross-entropy loss, with the raw data as input. We initialized the network weights randomly. We discarded datapoints that did not have associated behavioral labels. For the MLP baseline, we trained a simple three-layer MLP with a hidden layer size of 128 neurons with ReLU activation and without Batch Normalization.

Regression (Convolutional):

We trained a single fully-convolutional feedforward network for a behavioral reconstruction task, given the set of neural images. We trained with a simple MSE loss. To keep the architectures consistent, the average pooling was followed by a projection layer. We took the input to the projection layer as the final representation.

Regression (Recurrent):

This baseline is similar to convolutional regression, except that the last projection network was replaced with a two-layer GRU module. The GRU module takes as input the fixed representation of the neural images. At each time step, the GRU module predicts a single 3D pose, with a total of eight steps to predict the eight poses associated with an input neural image. We trained this model with a simple MSE loss. We took the input of the GRU module as the final representation of the neural encoder.


BehaveNet:

BehaveNet uses a discrete autoregressive hidden Markov model (ARHMM) module to decompose 3D motion information into discrete “behavioral syllables.” As in the regression baselines, the neural information is used to predict the posterior probability of observing each discrete syllable behavenet . Unlike the original method, we used 3D poses instead of RGB videos. We skipped compressing the behavioral data using a convolutional autoencoder because, unlike RGB videos, 3D poses are already low-dimensional.


SimCLR:

We trained the original SimCLR module without the calcium imaging data and swapping augmentations. As in our method, we took the features before the projection layer as the final representation simclr .

Gradient Reversal Layer (GRL):

Together with the contrastive loss, we trained a two-layer MLP domain discriminator per modality, and , which estimated the domain of the neural and behavioral representations GRL . Discriminators were trained with the loss function

where is the one-hot identity vector. A Gradient Reversal Layer was inserted before the projection layer. Given the reversed gradients, the neural and behavioral encoders and learn to fool the discriminator and output representations that are invariant across domains. We kept the hyperparameters of the discriminator the same as in previous work munro20multi . We froze the weights of the discriminator for the first 10 epochs, and trained only the . We trained the network using both loss functions, , for the remainder of training. We set the hyperparameters to empirically.
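The effect of a gradient reversal layer can be shown with a toy example that needs no autograd framework. The layer is the identity in the forward pass and multiplies the incoming gradient by a negative factor in the backward pass. Everything else here is an assumption for illustration: the discriminator is reduced to the identity map on a two-dimensional feature, its loss is taken to be cross-entropy against the one-hot identity vector, and the class and function names are hypothetical.

```python
import math

class GradReverse:
    """Identity forward; gradient scaled by -lam in the backward pass."""
    def __init__(self, lam=1.0):
        self.lam = lam

    def forward(self, x):
        return list(x)

    def backward(self, grad_out):
        return [-self.lam * g for g in grad_out]

def softmax(logits):
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    norm = sum(exps)
    return [e / norm for e in exps]

def domain_loss_and_grad(logits, onehot):
    """Cross-entropy of a domain classifier and its gradient w.r.t. logits."""
    probs = softmax(logits)
    loss = -sum(y * math.log(p) for y, p in zip(onehot, probs))
    grad = [p - y for p, y in zip(probs, onehot)]
    return loss, grad

# Toy encoder output and the one-hot identity (domain) of the sample.
features = [2.0, 0.0]
identity = [1.0, 0.0]

grl = GradReverse(lam=1.0)
loss_before, grad_logits = domain_loss_and_grad(grl.forward(features), identity)
grad_features = grl.backward(grad_logits)  # reversed gradient reaches the encoder

# A gradient-descent step on the features using the reversed gradient.
step = 1.0
updated = [f - step * g for f, g in zip(features, grad_features)]
loss_after, _ = domain_loss_and_grad(updated, identity)
```

Because the encoder descends along the reversed gradient, `loss_after` is larger than `loss_before`: the features become harder for the discriminator to assign to an animal. In practice the same behavior is obtained with a custom backward function in an automatic differentiation library.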

Maximum Mean Discrepancy (MMD):

We replaced the adversarial loss with a statistical test to minimize the distributional discrepancy between different domains MMD . Similar to previous work, we applied MMD only on the representations before the projection layer, independently on both modalities munro20multi ; kangcontrastive . Similar to the GRL baseline, we first trained for 10 epochs using only the contrastive loss, and then trained using the combined losses for the remainder. We set the hyperparameters as empirically.
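A squared MMD estimate between the representations of two domains can be sketched as follows. This is a minimal illustration: the Gaussian kernel, its bandwidth, and the use of the biased estimator are assumptions, and the function names are hypothetical.

```python
import math

def rbf_kernel(x, y, sigma=1.0):
    """Gaussian kernel between two vectors (bandwidth is illustrative)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

def mmd_squared(X, Y, sigma=1.0):
    """Biased estimate of the squared maximum mean discrepancy.

    X, Y : lists of representation vectors from two domains (animals).
    The value is zero when both samples coincide and grows as the two
    distributions move apart in the kernel's feature space.
    """
    k_xx = sum(rbf_kernel(a, b, sigma) for a in X for b in X) / (len(X) ** 2)
    k_yy = sum(rbf_kernel(a, b, sigma) for a in Y for b in Y) / (len(Y) ** 2)
    k_xy = sum(rbf_kernel(a, b, sigma) for a in X for b in Y) / (len(X) * len(Y))
    return k_xx + k_yy - 2.0 * k_xy
```

Minimizing this quantity over pairs of animals, alongside the contrastive loss, encourages the encoders to place representations from different domains in the same region of the embedding space.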

Appendix D Broader Impact

In this work, we have proposed a method that extracts behavioral information from two-photon neural imaging data. In the long run, our work can impact humans through the development of more effective brain machine interface neural decoding algorithms. Here we focus on animal studies because of issues related to human studies, including experimental invasiveness and infringements of personal privacy. We therefore see limited negative societal impact from our research. Notably, in the long term, by increasing the efficiency of self-supervised learning techniques, these algorithms can also reduce the amount of data needed and the number of animals required for experiments in neuroscience.

Appendix E Supplementary Figures

Figure S1: Overview of motion capture and two-photon neural imaging dataset. A tethered fly (Drosophila melanogaster) behaves spontaneously while neural and behavioral data are recorded using multi-view infrared cameras and a two-photon microscope, respectively. The dataset includes (A) 2D poses from six cameras (only three are shown), (B) 3D poses, triangulated from multiview 2D poses. Calibration parameters for the markerless motion capture system are included. (C) Synchronized, registered, and denoised calcium imaging data from coronal sections of the cervical connective. Shown are color-coded activity patterns for populations of descending neurons from the brain (red is active, blue is inactive). Data are collected from multiple animals and include action labels.
Figure S2: Overview of our contrastive learning-based neural action representation learning approach. First, we sample a synchronized set of behavioral and neural frames, . Then, we augment these data using randomly sampled augmentation functions and . Encoders and then generate intermediate representations and , which are then projected into and by two separate projection heads and . We maximize the similarity between the two projections using an InfoNCE loss.
Figure S3: Domain gap between nervous systems. Neural imaging data from four different animals. Images differ in terms of total brightness, the location of observed neurons, the number of visible neurons, and the shape of axons.
Figure S4: t-SNE plots of the neural modality. Each color denotes a different domain (animal). Two red dots are the embeddings of the same action label. (A) Raw neural data. (B) SimCLR representation. (C) Domain adaptation using a two-layer MLP discriminator and a Gradient Reversal Layer. (D) Ours, which aligns multiple domains and keeps the semantic structure.
Figure S5: Motion Capture and two-photon dataset statistics. Visualizing (A) the number of annotations per domain and (B) the distribution of the durations of each behavior across domains. Unlike scripted human behaviors, animal behaviors occur spontaneously. The total number of behaviors and their durations do not follow a uniform distribution.

Appendix F Supplementary Videos

The following videos are sample behavioral-neural recordings from two different flies. The videos show (left) raw behavioral RGB video together with (right) registered and denoised neural images in their original resolution. The behavioral video is resampled and synchronized with the neural data. A counter (top-left) shows the time in seconds. The colorbar indicates normalized relative intensity values. Calculation of values is explained in the Appendix.