Social NCE: Contrastive Learning of Socially-aware Motion Representations

12/21/2020 ∙ by Yuejiang Liu, et al. ∙ EPFL

Learning socially-aware motion representations is at the core of recent advances in human trajectory forecasting and robot navigation in crowded spaces. Yet existing methods often struggle to generalize to challenging scenarios and even output unacceptable solutions (e.g., collisions). In this work, we propose to address this issue via contrastive learning. Concretely, we introduce a social contrastive loss that encourages the encoded motion representation to preserve sufficient information for distinguishing a positive future event from a set of negative ones. We explicitly draw these negative samples based on our domain knowledge about socially unfavorable scenarios in the multi-agent context. Experimental results show that the proposed method consistently boosts the performance of previous trajectory forecasting, behavioral cloning, and reinforcement learning algorithms in various settings. Our method makes few assumptions about neural architecture design, and hence can be used as a generic way to incorporate negative data augmentation into motion representation learning.




Code Repositories


Official implementation of the "Social NCE: Contrastive Learning of Socially-aware Motion Representations" in PyTorch.


1 Introduction

Humans have an instinctive ability to anticipate the future motions of other people while navigating in crowded spaces. This ability allows us to not only keep a comfortable distance from the others but also identify potential dangers or discomforts ahead of time. However, building predictive models capable of doing so is challenging. Recent works have proposed a plethora of neural network-based models

[deo_convolutional_2018, vemula_social_2017, ivanovic_trajectron_2019, sadeghian_sophie_2019, huang_stgat_2019, kosaraju_social-bigat_2019, sun_recursive_2020, li_evolvegraph_2020] to learn socially-aware motion representations and demonstrated their potential for human trajectory forecasting [alahi_social_2016, lee_desire:_2017, gupta_social_2018, salzmann_trajectron_2020] or robot motion planning [chen_decentralized_2016, chen_socially_2017, chen_crowd-robot_2019] in crowded spaces. Yet existing methods still output unacceptable solutions (e.g., collisions), which raises significant safety concerns.

Figure 1: Our social contrastive learning method encourages the encoded motion representation to preserve sufficient information for distinguishing a positive future event from a set of synthetic negative ones that are socially unfavorable, which provides an effective way to incorporate our prior knowledge about social norms into motion representation learning.

One common challenge for learning robust neural motion models stems from covariate shift (also referred to as distributional shift) [daume_search-based_2009, ross_efficient_2010]. Very often, the collected data cannot cover the entire state space but only a subset, e.g., human trajectories observed in safe scenarios without any dangerous occurrences. The absence of challenging events poses a significant difficulty for learning-based methods to truly capture the underlying social norms and generalize to novel scenarios. As such, small prediction errors made by the learned model during inference may accumulate over time, which gradually creates a discrepancy between the training and test state distributions and eventually causes catastrophic errors [daume_search-based_2009, ross_efficient_2010, codevilla_exploring_2019]. Most previous methods attempt to mitigate this issue through interactive data collection, such as expert queries [ross_reduction_2011, laskey_dart_2017, liu_map-based_2018, sun_deeply_2017, de_haan_causal_2019] and interactions with the environment [ho_generative_2016, kostrikov_discriminator-actor-critic_2018, wang_random_2019, brantley_disagreement-regularized_2019, reddy_sqil_2019]. Unfortunately, these methods are not only costly and tedious but often infeasible for forecasting problems, where human behaviors cannot easily be intervened upon by another learning system for data collection purposes. These shortcomings motivate us to explore an alternative approach: given a fixed training dataset, can we learn a robust neural motion model by exploiting our prior knowledge about socially unfavorable events?

Figure 2:

Illustration of different learning approaches to socially-aware sequential predictions from a distributional perspective. (a) The vanilla supervised learning method often suffers from compounding errors due to the covariate shift

[daume_search-based_2009, ross_efficient_2010] between the training (blue) and test (red) distributions with respect to the separation distance (x-axis) between agents. (b) Interactive data collection methods [ross_reduction_2011, laskey_dart_2017, sun_deeply_2017, ho_generative_2016, de_haan_causal_2019, reddy_sqil_2019] expand the training distribution from the original one (dashed blue) to a wider range (green) via additional experiments, which are not only tedious but often infeasible for forecasting problems. (c) Our social contrastive learning method augments negative data based on prior knowledge, explicitly informing neural motion models about socially unfavorable states (gray) for improved robustness.

To this end, we propose a social contrastive learning method to incorporate prior knowledge into motion representation learning (Figure 1). Contrastive methods [hadsell_dimensionality_2006, gutmann_noise-contrastive_2010, sohn_improved_2016, oord_representation_2019] have recently achieved tremendous successes in learning powerful representations of complex signals, such as images and texts [mikolov_distributed_2013, goldberg_word2vec_2014, logeswaran_efficient_2018, wu_unsupervised_2018, misra_self-supervised_2019, chen_simple_2020, he_momentum_2020, park_contrastive_2020, chuang_debiased_2020, kalantidis_hard_2020, khosla_supervised_2020]. In this work, we adapt this learning approach in the multi-agent context and introduce Social-NCE as an auxiliary loss. Complementary to the conventional predictive loss, e.g., minimizing the distance between model output and the labeled behaviors, our Social-NCE encourages the extracted motion representation to preserve sufficient information for distinguishing a positive future event from a set of synthetic knowledge-driven negative events.

One crucial component of this social contrastive learning is the design of positive and negative samples, i.e., spatial locations given a future time step. Existing contrastive methods often uniformly sample a large set of negatives from training data [mikolov_distributed_2013, goldberg_word2vec_2014, peters_deep_2018, wu_unsupervised_2018, chen_simple_2020, he_momentum_2020]. However, this common choice does not provide much additional information about social norms. Instead of using random locations, we propose a safety-driven sampling strategy that explicitly draws negative samples from the regions of other agents in the future, given that it is typically forbidden or uncomfortable for multiple agents to visit the same or adjacent places simultaneously. As illustrated in Figure 2, our sampling method serves as a form of negative data augmentation [anonymous_negative_2020], intentionally exposing the model to dangerous scenarios in order to learn a more robust motion representation.

We evaluate our method on three tasks: human trajectory forecasting, behavioral cloning, and reinforcement learning for robot navigation in crowds. Experimental results show that the proposed Social-NCE consistently improves previous methods in various settings. When applied to the TrajNet++ trajectory forecasting challenge, our method ranks first in terms of both accuracy and safety at the time of publication. Our code is publicly available.

2 Related Work

Figure 3: Social contrastive learning in the multi-agent context. Given a scenario that contains a primary agent of our interest (blue) and multiple neighboring agents in its surroundings (gray), our Social-NCE loss encourages the extracted motion representation to be close to a positive future event in an embedding space and apart from some synthetic negative events that could have caused collisions or discomforts.

Socially-aware Motion Representations. Human motion in the social context has been traditionally studied based on relative distances and specific rules [helbing_social_1998, mehran_abnormal_2009, van_den_berg_reciprocal_2011, alahi_socially-aware_2014]. While these hand-crafted models have been successfully applied to various tasks [luber_people_2010, zanlungo_social_2011, ferrer_robot_2013, coscia2016point], they often struggle to capture the strong interactions among agents in complex scenes [rudenko_human_2020]. Some other methods attempt to learn the patterns of social behaviors from data. Yet early work often suffers from considerable performance drop in densely populated spaces due to limited modeling capacity [trautman_unfreezing_2010, pellegrini_improving_2010].

More recently, a variety of neural networks have been explored for learning socially-aware motion representations [alahi2017learning, kothari_human_2020]. Dedicated neural architecture designs, such as feature pooling [alahi_social_2016, gupta_social_2018, deo_convolutional_2018], attention mechanisms [vemula_social_2017, chen_crowd-robot_2019, sadeghian_sophie_2019, huang_stgat_2019], and spatial-temporal graphs [ivanovic_trajectron_2019, kosaraju_social-bigat_2019, sun_recursive_2020, li_evolvegraph_2020], have yielded promising results in crowded environments. However, the robustness of these methods remains a central concern. Our work is orthogonal to the design of neural motion models and focuses on the learning aspect to enhance the extracted motion representation.

Covariate Shift. The problem of covariate shift was observed back in [pomerleau_alvinn_1989] and has been a persistent challenge for sequential prediction problems [daume_search-based_2009, ross_efficient_2010]. One practical solution is to actively query experts [ross_reduction_2011, laskey_dart_2017, sun_deeply_2017], which has been shown effective for behavioral cloning but hardly applicable to forecasting problems [ridel_literature_2018, rudenko_human_2020, kothari_human_2020]. Inverse reinforcement learning methods [ng_algorithms_2000, abbeel_apprenticeship_2004, ziebart_maximum_2008, ho_generative_2016, kostrikov_discriminator-actor-critic_2018, wang_random_2019, brantley_disagreement-regularized_2019] jointly learn a reward function and the corresponding optimal policy to account for the sequential nature. However, they typically require extensive explorations to solve a reinforcement learning (RL) subproblem [dulac-arnold_challenges_2019].

Another line of work introduces additional loss terms penalizing the predictions that lead to undesirable events, such as collisions and off-road trajectories [bansal_chauffeurnet_2018, niedoba_improving_2019]. However, these penalties are dependent on the predicted states during training and become utterly ineffective once the model fits the dataset well in late training stages.

Closely related to our work, [luo_learning_2019, zeng_dsdnet_2020] propose to learn a robust value (or cost) function by making use of negative samples. Our method differs from theirs in two key aspects: (i) [luo_learning_2019, zeng_dsdnet_2020] change the task loss that directly affects (and potentially biases) the model output, whereas our goal is to enhance the extracted motion representation without modifying the main task; (ii) they draw negative samples randomly, whereas we design a more informed sampling strategy using prior knowledge.

Contrastive Learning. Contrastive learning was proposed in [hadsell_dimensionality_2006]

to learn an embedding space such that a simple similarity measure can approximate the preferred neighborhood relationships. This approach has recently achieved stunning results in a broad spectrum of areas, including computer vision

[chen_simple_2020, he_momentum_2020], natural language understanding [logeswaran_efficient_2018, pagliardini_unsupervised_2018, arora_theoretical_2019], image synthesis [park_contrastive_2020] and robotics [sermanet_time-contrastive_2018]. Detailed design choices, such as positive and negative sampling, often play a critical role in the practical success of contrastive methods [purushwalkam_demystifying_2020, chuang_debiased_2020, kalantidis_hard_2020, robinson_contrastive_2020]. To the best of our knowledge, we are the first to adapt contrastive learning to the multi-agent motion context and to explore the sampling strategies that are unique and critical to socially-aware motion representation learning.

3 Proposed Method

The robustness of neural motion models has been a long-standing concern for safety-critical applications like autonomous vehicles and social robots. In many practical scenarios, the training data may only contain quality examples from safe and normal states but lack dangerous occurrences. This incomplete state coverage poses a significant challenge for learning algorithms to truly capture the underlying social norms and generalize to challenging scenarios.

In this section, we present a learning method that aims to tackle this challenge by means of contrastive representation learning. We will first briefly introduce the background of contrastive learning and then present a social contrastive loss for motion representations. We will finally introduce an informative sampling strategy tailored for the multi-agent context.

3.1 Contrastive Representation Learning

Representation learning typically consists of learning a parametric function (i.e., an encoder) that maps the raw data into a feature space in order to extract abstract and useful information for downstream tasks [bengio_representation_2013]. Recent contrastive learning methods often adopt the principle of noise contrastive estimation [gutmann_noise-contrastive_2010, dyer_notes_2014] in an embedding space, namely the InfoNCE loss [oord_representation_2019], to train an encoder:


$$\mathcal{L}_{\text{NCE}} = -\log \frac{\exp(\mathrm{sim}(q, k^{+})/\tau)}{\sum_{n=0}^{N} \exp(\mathrm{sim}(q, k_{n})/\tau)}, \quad k_{0} \equiv k^{+},$$

where the encoded query $q$ is brought close to one positive key $k^{+}$ and pushed apart from $N$ negative keys $\{k_{1}, \dots, k_{N}\}$, $\tau$ is a temperature hyperparameter, and $\mathrm{sim}(u, v) = u^{\top}v / (\|u\| \|v\|)$ is the cosine similarity between two feature vectors. It has been shown that minimizing the InfoNCE loss is equivalent to maximizing a lower bound on the mutual information between the raw input and the latent representation

[oord_representation_2019]. Moreover, the representations learned by this approach have provable performance guarantees on downstream tasks [arora_theoretical_2019]. The empirical success of this approach often highly relies on the informativeness of positive and negative samples [chen_simple_2020, purushwalkam_demystifying_2020, chuang_debiased_2020, kalantidis_hard_2020, robinson_contrastive_2020, song_multi-label_2020, jabri_space-time_2020].
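To make the objective concrete, the following is a minimal sketch of the InfoNCE loss in plain Python, using the cosine similarity and temperature described above; the function names are illustrative and not taken from any released codebase:

```python
import math

def cosine_sim(u, v):
    """Cosine similarity sim(u, v) = u.v / (|u||v|) between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def info_nce(query, pos_key, neg_keys, temperature=0.1):
    """InfoNCE: cross-entropy of identifying the positive key among all keys."""
    logits = [cosine_sim(query, pos_key) / temperature]
    logits += [cosine_sim(query, k) / temperature for k in neg_keys]
    max_logit = max(logits)  # subtract the max for numerical stability
    log_sum = max_logit + math.log(sum(math.exp(l - max_logit) for l in logits))
    return log_sum - logits[0]  # negative log-softmax of the positive logit
```

A query aligned with its positive key yields a small loss, while a query closer to a negative key is penalized, which is exactly the discrimination pressure that shapes the embedding space.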

3.2 Social NCE

Consider a trajectory forecasting problem in crowded spaces as an example. Let $s^{i}_{t}$ denote the position of agent $i$ at time $t$ and $s_{t} = \{s^{1}_{t}, \dots, s^{M}_{t}\}$ denote the joint state of the $M$ agents in the scene. Given a sequence of history observations $s_{1:t}$, the task is to predict the future trajectories of all agents until time $T$. Many recent forecasting models are designed as encoder-decoder neural networks, where the motion encoder $f(\cdot)$ first extracts a compact representation $h^{i}_{t}$ with respect to agent $i$ and the decoder $g(\cdot)$ subsequently rolls out its future trajectory $\hat{s}^{i}_{t+1:T}$:

$$h^{i}_{t} = f(s_{1:t}, i), \qquad \hat{s}^{i}_{t+1:T} = g(h^{i}_{t}).$$

To model social interactions among agents, $f(\cdot)$ typically contains two sub-modules: a sequential module $f_{\text{seq}}$ that encodes each individual sequence and an interaction module $f_{\text{int}}$ that shares information among agents, e.g.,

$$z^{i}_{t} = f_{\text{seq}}(s^{i}_{1:t}), \qquad h^{i}_{t} = f_{\text{int}}(z^{1}_{t}, \dots, z^{M}_{t}, i),$$

where $z^{i}_{t}$ is the latent representation of agent $i$ given the observation of its own states up to time $t$. A variety of architectures have been explored for each module and validated on accuracy measures [alahi_social_2016, li_end--end_2020, li_evolvegraph_2020]. Nevertheless, their robustness remains an open issue. Several recent works [bansal_chauffeurnet_2018, kothari_human_2020] have shown that existing models often predict socially unacceptable trajectories (e.g., collisions), indicating a lack of common sense about social norms.

To tackle this challenge, we propose Social-NCE, a variant of InfoNCE tailored for socially-aware motion representation learning. As illustrated in Figure 3, we construct the encoded query and key vectors for the primary agent $i$ at time $t$ as follows:

  • query: embedding of history observations, $q = \psi(h^{i}_{t})$, where $\psi(\cdot)$ is an MLP projection head.

  • key: embedding of a future event, $k = \phi(s, \delta t)$, where $\phi(\cdot)$ is an event encoder modeled by an MLP, $s$ is a sampled spatial location and $\delta t$ is the sampling horizon.

By adjusting the discrete variable $\delta t$ in a range, e.g., $\delta t \in \{1, \dots, 4\}$, we can take into account future events in the next few steps simultaneously. Nevertheless, when $\delta t$ is a fixed value, $\phi(\cdot)$ can be simplified to a location encoder, i.e., $k = \phi(s)$.

In each frame, we draw one positive key and multiple negative keys based on the future trajectories in the scene, as described in Section 3.3. Following [wu_unsupervised_2018, he_momentum_2020, chen_simple_2020], we normalize the embedding vectors onto a unit sphere and train the parametric models jointly with the objective of mapping the positive pair of query and key to similar points, relative to the other negative pairs, in the embedding space:

$$\mathcal{L}_{\text{Social-NCE}} = -\log \frac{\exp(\mathrm{sim}(q, k^{+})/\tau)}{\exp(\mathrm{sim}(q, k^{+})/\tau) + \sum_{n=1}^{N} \exp(\mathrm{sim}(q, k^{-}_{n})/\tau)}.$$

The full training objective is a weighted combination of the conventional task loss, e.g., mean squared error (MSE) or negative log-likelihood (NLL) for trajectory forecasting, and the proposed social contrastive loss:

$$\mathcal{L} = \mathcal{L}_{\text{task}} + \lambda \, \mathcal{L}_{\text{Social-NCE}},$$

where $\lambda$ is a hyper-parameter controlling the emphasis on Social-NCE.
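As a small sketch (names are illustrative), the combined objective simply adds the contrastive term to the task loss, with the weight playing the role of the hyper-parameter above:

```python
def combined_loss(task_loss, social_nce_loss, contrast_weight=0.1):
    """Weighted sum of the main task loss (e.g., MSE or NLL) and the
    Social-NCE auxiliary loss; contrast_weight corresponds to lambda."""
    return task_loss + contrast_weight * social_nce_loss
```

Since the contrastive term only shapes the encoder's representation, the main prediction head is left untouched, which is why the method is largely architecture-agnostic.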

(a) random
(b) proposed
Figure 4: Different negative sampling methods in the multi-agent context. (a) The conventional random sampling method draws negative samples homogeneously scattered in the space, which does not provide much information about social norms. (b) Our safety-driven sampling strategy seeks more informative negative samples from the neighborhood of other agents in the future.

3.3 Multi-agent Contrastive Sampling

One critical design choice of the proposed social contrastive learning lies in the sampling strategy. The recent successes in contrastive learning of visual representations are heavily tied to the use of a large set of negative samples uniformly drawn from the training dataset [mikolov_distributed_2013, goldberg_word2vec_2014, peters_deep_2018, wu_unsupervised_2018, chen_simple_2020, he_momentum_2020]. Unfortunately, this common practice is not suitable for socially-aware motion representation learning. As the main predictive loss already encourages the model to replicate socially good behaviors from the training examples, adding another discrimination task between the correct solution and other randomly scattered negatives cannot provide much extra information about social norms. Worse yet, random negative sampling, as in Figure 4(a), may contradict the multimodal nature of the distribution of future trajectories and incorrectly penalize plausible solutions.

To effectively incorporate our domain knowledge about socially unfavorable events in the multi-agent context, we propose a safety-driven sampling strategy. As shown in Figure 4(b), we draw a set of negative samples from the neighborhood of the other agents in the future at time $t + \delta t$:

$$\tilde{s}^{i-}_{t+\delta t,\,(j,p)} = s^{j}_{t+\delta t} + \Delta s_{p}, \quad j \in \{1, \dots, M\} \setminus \{i\},$$

where $j$ is the index of the other agents, and $\Delta s_{p}$ is a local displacement to account for the social comfort area, e.g., $\Delta s_{p} = (\rho \cos \theta_{p}, \rho \sin \theta_{p})$ for a set of $P$ evenly spaced directions $\theta_{p}$, where $\rho$ can be the minimum physical distance between two agents $i$ and $j$. Thus, a total of $P(M-1)$ negative samples are synthesized. We also add small random perturbations $\epsilon$, bounded by a small constant, e.g., 0.05 [m], to each sampled location to prevent over-fitting. For positive sampling, we pick a location from the ground truth region of the primary agent at time $t + \delta t$:

$$\tilde{s}^{i+}_{t+\delta t} = s^{i}_{t+\delta t} + \epsilon.$$
The key intuition behind our method is that conventional predictive learning only requires the model to replicate good motion behaviors in normal states without the need to understand the consequence of undesirable outputs. In contrast, by using Social-NCE in tandem with safety-driven negative sampling, we actively enforce the extracted motion representation to contain the necessary information for identifying unfavorable events that could have led to catastrophic outcomes. This subtle but essential difference enables us to learn a significantly more robust model from the fixed training dataset.
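The sampling strategy above can be sketched in a few lines of plain Python; the number of displacement directions, the comfort radius, and the perturbation scale are illustrative assumptions:

```python
import math
import random

def negative_samples(neighbor_positions, rho=0.2, n_dirs=8, noise=0.05):
    """Safety-driven negatives: points placed on a circle of radius rho (the
    social comfort distance) around each neighbor's future position, with a
    small uniform perturbation to prevent over-fitting."""
    samples = []
    for (x, y) in neighbor_positions:
        for p in range(n_dirs):
            theta = 2.0 * math.pi * p / n_dirs
            dx = rho * math.cos(theta) + random.uniform(-noise, noise)
            dy = rho * math.sin(theta) + random.uniform(-noise, noise)
            samples.append((x + dx, y + dy))
    return samples

def positive_sample(ground_truth_position, noise=0.05):
    """Positive key: the primary agent's ground-truth future location,
    lightly perturbed."""
    x, y = ground_truth_position
    return (x + random.uniform(-noise, noise), y + random.uniform(-noise, noise))
```

Each neighbor contributes n_dirs negatives, so the number of synthesized samples grows linearly with the number of agents in the scene.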

Figure 5: Comparison between Social-NCE and the standard predictive learning for socially-aware trajectory forecasting on an interaction-centric synthetic dataset (test set) [kothari_human_2020] using D-LSTM models. Our Social-NCE consistently outperforms the baseline method across the learning process. Lower is better for all evaluation metrics.

4 Experiments

We empirically validate the proposed Social-NCE on three different tasks: (i) human trajectory forecasting, (ii) imitation learning and (iii) reinforcement learning for robot navigation in multi-agent environments.

On each task, we compare the models obtained by three training methods:

  • Vanilla: models trained without contrastive loss.

  • Random: models trained with the contrastive loss and random negative sampling (Figure 4(a)).

  • Social-NCE (ours): models trained with the contrastive loss and our safety-driven negative sampling strategy (Figure 4(b)).

4.1 Implementation Details

In our experiments, we use two different 2-layer MLPs as the projection head $\psi(\cdot)$ and the event encoder $\phi(\cdot)$, and encode the history observations and future events into 8-dimensional embedding vectors. The distance hyper-parameter $\rho$ is set to 0.2 [m] for the trajectory forecasting tasks and 0.6 [m] for robot navigation, according to the geometric sizes of agents in the environments. By default, the sampling horizon $\delta t$ is set up to 4 and the temperature $\tau$ is set to 0.1. All models are trained with the Adam optimizer [kingma_adam:_2014].

4.2 Trajectory Forecasting

We first evaluate our method on the human trajectory forecasting task. Specifically, we compare the performances of several forecasting models trained with and without the proposed Social-NCE. Following the public benchmark TrajNet++ [kothari_human_2020], each model takes as input 9-step observations and predicts the future trajectories of all agents in the scene for 12 steps. We apply our method to the following LSTM-based forecasting models with different interaction modules:

  • S-LSTM [alahi_social_2016]: grid-based interaction module over the hidden states of neighboring agents.

  • S-ATT [vemula_social_2017]: graph-based soft attention mechanism over the hidden states of neighboring agents.

  • D-LSTM [kothari_human_2020]: grid-based interaction module sharing the relative velocities of neighboring agents.

  • S-GAN [gupta_social_2018]

    : max-pooling among neighboring agents, trained with the adversarial and variety loss.

Similar to previous work [gupta_social_2018, zeng_dsdnet_2020, kothari_human_2020], we evaluate the predictor on the following metrics:

  • Average displacement error (ADE): the average Euclidean distance between the output trajectory and the ground truth over all predicted steps.

  • Final displacement error (FDE): the Euclidean distance between the predicted output and the ground truth at the last time step.

  • Collision rate (COL): the percentage of test cases where the predicted trajectories of different agents run into collisions.

To evaluate the multi-modal forecasting of S-GAN, we use the top-3 displacement error that measures the minimum distance between three randomly sampled output trajectories and the ground truth observation.
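The three metrics above can be sketched as follows; the collision threshold used here is an assumption, since the benchmark defines its own agent radii:

```python
import math

def ade(pred, gt):
    """Average displacement error over all predicted steps."""
    return sum(math.dist(p, g) for p, g in zip(pred, gt)) / len(pred)

def fde(pred, gt):
    """Final displacement error at the last predicted step."""
    return math.dist(pred[-1], gt[-1])

def has_collision(traj_a, traj_b, threshold=0.2):
    """True if two predicted trajectories come closer than the comfort
    threshold at any common time step."""
    return any(math.dist(a, b) < threshold for a, b in zip(traj_a, traj_b))
```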

4.2.1 Synthetic Data

Following the trajectory prediction benchmark [kothari_human_2020], we first evaluate our method on an interaction-centric synthetic dataset generated by ORCA [van_den_berg_reciprocal_2011], a classical collision avoidance algorithm widely used for multi-agent simulations. We collect the dataset from 1000 simulation runs of circle crossing scenarios with 5 pedestrians. This dataset serves as a noise-free testbed, which allows us to analyze the impact of our method in the presence of strong interactions among multiple agents. We train the D-LSTM [kothari_human_2020] model on 80% of the collected data with a fixed learning rate of 0.001 for 55 epochs and evaluate the models obtained at each epoch on the test set.

Figure 5 shows the performance of D-LSTM [kothari_human_2020] trained with and without Social-NCE. Compared with the results of standard supervised learning, our method brings clear performance gains, both accelerating the learning process and boosting the final performance across all the evaluation metrics. In particular, our method quickly reduces the collision rate to less than 2% within 25 epochs, as opposed to the 45 epochs required by the vanilla predictive baseline. Both methods begin to over-fit after 45 epochs.

Model                        Method   ADE   FDE   COL (%)
S-LSTM [alahi_social_2016]   Vanilla  0.55  1.18  7.57 ± 0.67
                             Random   0.55  1.19  7.46 ± 0.23
                             Ours     0.55  1.18  6.99 ± 0.25
S-ATT [vemula_social_2017]   Vanilla  0.56  1.22  10.59 ± 0.30
                             Random   0.56  1.22  10.66 ± 0.44
                             Ours     0.56  1.23  10.17 ± 0.20
D-LSTM [kothari_human_2020]  Vanilla  0.57  1.23  6.82 ± 0.23
                             Random   0.57  1.24  6.76 ± 0.29
                             Ours     0.57  1.24  6.13 ± 0.24
S-GAN [gupta_social_2018]    Vanilla  0.51  1.09  7.07 ± 0.26
                             Random   0.51  1.09  7.01 ± 0.30
                             Ours     0.51  1.09  6.65 ± 0.56
Table 1: Quantitative results of trajectory forecasting models on the real interacting dataset in TrajNet++ [kothari_human_2020]. We fine-tune four pre-trained forecasting models for 10 epochs with different methods and compare the average performance of the models saved at each epoch. The standard deviations on ADE and FDE are always smaller than 0.01 and 0.02, respectively. Our method consistently reduces the collision rates of all top-performing models on the public TrajNet++ benchmark.

4.2.2 Real Data

We further validate our method on the real dataset in the TrajNet++ benchmark [kothari_human_2020]. The whole dataset consists of several publicly available subsets, including ETH [pellegrini_improving_2010], UCY [lerner_crowds_2007], WildTrack [chavdarova_wildtrack_2018], L-CAS [sun_3dof_2018] and CFF [alahi_socially-aware_2014]. We follow the official training and test split for the interacting subcategory (Type-III) in TrajNet++ [kothari_human_2020] and pre-train each baseline model for 25 epochs. We then fine-tune these pre-trained models for 10 epochs using our Social-NCE loss and evaluate the average performance over fine-tuning.

Table 1 reports the results of our method in comparison to the other baselines. Random negative sampling does not show any clear advantage over the vanilla baseline. On the contrary, our method yields lower collision rates than its counterparts by a clear margin across all the top-ranked models on the TrajNet++ benchmark [kothari_human_2020]. Note that the models trained with different methods result in similar prediction accuracy, which is likely caused by the considerable uncertainty of human behaviors in real-world interacting scenarios. Thanks to the domain knowledge associated with the negative data augmentation, the S-LSTM model trained by our method ranks first on the TrajNet++ leaderboard at the time of publication, significantly outperforming other methods in terms of safety.

Figure 6 shows the qualitative results of different learning methods in three representative test cases in the ETH dataset [pellegrini_improving_2010] and the UCY dataset [lerner_crowds_2007]:

  • Parallel: pedestrians head in similar directions.

  • Opposite: pedestrians walk in opposite directions.

  • Orthogonal: pedestrians meet at a large angle.

While collisions occur in the vanilla baseline, our method consistently outputs socially compliant solutions. For instance, in the parallel scenario, our predicted trajectory for the primary agent stays in the middle of two other neighbors at all time steps instead of sliding towards either of them. In the opposite scenario, our method adjusts the trajectories of both the primary and the opposite agent cooperatively. Similarly, in the orthogonal scenario, our method jointly twists the trajectories of multiple interactive agents, allowing each of them to pass the crowded spot smoothly.

(a) Vanilla - Parallel
(b) Vanilla - Opposite
(c) Vanilla - Orthogonal
(d) Ours - Parallel
(e) Ours - Opposite
(f) Ours - Orthogonal
Figure 6: Qualitative results of D-LSTM [kothari_human_2020] models trained with different methods on three test cases in the ETH dataset [pellegrini_improving_2010] and UCY ZARA dataset [lerner_crowds_2007]. The vanilla method leads to collisions between the primary (black) and the nearby agent (red) at the 12th, 4th and 10th predicted step respectively, whereas our method outputs collision-free trajectories at all times.

4.3 Imitation Learning

Next, we examine the effectiveness of Social-NCE applied to imitation learning for robot navigation in dense crowds [chen_decentralized_2016, chen_socially_2017, chen_crowd-robot_2019]. We use an open-sourced crowd navigation simulator [chen_crowd-robot_2019], where the task for the robot is to navigate through 5 simulated pedestrians and arrive at the target destination with time efficiency. At each time step, the robot receives the observable states of the other agents and outputs an action. We follow the evaluation protocol in [chen_crowd-robot_2019], which quantifies the performance of a policy using three metrics: navigation time, collision rate, and the accumulated reward, defined as follows:

$$r_{t}(s_{t}, a_{t}) = \begin{cases} -0.25 & \text{if collision} \\ -0.1 + d_{t}/2 & \text{else if } d_{t} < 0.2 \\ 1 & \text{else if goal reached} \\ 0 & \text{otherwise} \end{cases}$$

where $d_{t}$ is the minimum separation distance between the robot and the humans during the time interval $[t - \Delta t, t]$.
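The per-step reward described above can be sketched as a small function; the numeric constants follow the commonly used CrowdNav defaults and should be read as assumptions rather than the exact benchmark values:

```python
def navigation_reward(d_min, reached_goal, collided, discomfort_dist=0.2):
    """CrowdNav-style reward sketch: penalize collisions, mildly penalize
    uncomfortable proximity, reward reaching the goal."""
    if collided:
        return -0.25               # collision with a pedestrian
    if reached_goal:
        return 1.0                 # arrived at the target destination
    if d_min < discomfort_dist:
        return -0.1 + d_min / 2.0  # too close: small graded penalty
    return 0.0
```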

4.3.1 Behavioral Cloning

For imitation learning, we collect a demonstration dataset that consists of 5000 simulation episodes using the pre-trained SARL policy in [chen_crowd-robot_2019] as an expert. We train an imitator on the collected data for 200 epochs and evaluate the average performance of the last 10 models saved every 5 training epochs.

As shown in Table 2, the imitation learning algorithm trained with random negative sampling fails to outperform the vanilla baseline; in fact, it even worsens the learned policy. In contrast, our method leads to a consistently higher reward and lower collision rate. Specifically, our method reduces the collision rate by approximately 69% compared with the vanilla baseline and attains an average reward of 0.32, close to the result of the demonstrator in [chen_crowd-robot_2019].

Method   Reward        Time (s)      Collision (%)
Vanilla  0.28 ± 0.01   10.31 ± 0.07  11.11 ± 1.45
Random   0.24 ± 0.02   10.32 ± 0.12  18.60 ± 4.69
Ours     0.32 ± 0.01   10.33 ± 0.07   3.40 ± 1.36
Table 2: Quantitative results of imitation learning with different methods on a 5k demonstration dataset. Higher is better for reward, and lower is better for the other metrics. Compared with the vanilla baseline, our method brings down the collision rate by approximately 69%.

4.3.2 Low-data Regime

The performance of the standard behavioral cloning approach often degrades substantially when provided with limited demonstrations. We examine the potential of the proposed Social-NCE for data-efficient imitation learning by comparing policies trained on datasets of different sizes. As shown in Figure 7, with decreasing amounts of demonstrations, the performance of the vanilla method drops sharply. Notably, the baseline model trained on 2k episodes of demonstrations causes collisions in a large fraction of test cases. In contrast, our method succeeds in retaining a much higher reward and safety in the low-data regime. For instance, the collision rate of our method with 2k demonstrations is comparable to that of the baseline with 5k demonstrations. Similarly, our method using 5k training samples obtains a higher reward than the counterpart using 10k. This result suggests that the imitator can absorb a considerable amount of useful information from the designed social contrastive task, greatly alleviating the information shortage in small training sets.

Figure 7: Social-NCE for imitation learning with different amounts of demonstrations. The conventional behavioral cloning method suffers from a significant performance drop in the low data regime, whereas our method is able to retain much better results thanks to the additional information absorbed from the social contrastive task.

4.4 Reinforcement Learning

Finally, we evaluate the proposed Social-NCE for reinforcement learning (RL) algorithms on the crowd navigation task. We adopt Rainbow DQN [hessel_rainbow:_2017], a state-of-the-art model-free RL method, as the baseline and follow the architecture of the value-based SARL policy [chen_crowd-robot_2019] to build the encoder. To apply Social-NCE to the Rainbow agent, we add a linear layer after the interaction and pooling modules. We also replace the planning module in SARL with dueling and categorical layers, as in the standard Rainbow agent [hessel_rainbow:_2017]. To isolate the impact of Social-NCE, we use the dense reward function proposed in [semnani_multi-agent_2020], which eliminates the need for the imitation pre-training used in [chen_crowd-robot_2019]. This reward shapes the agent's behavior with a term based on the Euclidean distance between the robot and its goal, scaled by a control parameter. Other simulation settings are kept the same as in Section 4.3.
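The exact reward formula from [semnani_multi-agent_2020] is not reproduced here. As a rough illustration only, a dense reward of this flavor can be sketched as below; the shaping coefficient `w_progress` and the terminal values are our own illustrative assumptions, not the constants used in the paper:

```python
# Hedged sketch of a dense, distance-based navigation reward.
# All constants (w_progress, terminal bonuses/penalties) are illustrative
# assumptions; the paper follows [semnani_multi-agent_2020], which may differ.
def dense_reward(d_goal, d_goal_prev, collided, reached, w_progress=0.1):
    if collided:
        return -0.25   # terminal penalty for a collision (assumed value)
    if reached:
        return 1.0     # terminal bonus for reaching the goal (assumed value)
    # Shaping term: reward progress toward the goal, scaled by a control parameter.
    return w_progress * (d_goal_prev - d_goal)
```

A dense shaping term of this kind gives the agent a learning signal at every step, which is what removes the need for imitation pre-training.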

Figure 8: Learning curves of Rainbow DQN [hessel_rainbow:_2017] with different methods for crowd navigation. Results are averaged across 8 random seeds. The shaded area spans one standard deviation. In contrast to the vanilla and random negative counterparts, the agent with Social-NCE is significantly more sample efficient and achieves higher final reward.
Method   | Reward w.r.t. fraction of dataset
         | 100%    50%     25%     10%
Vanilla  | 80.1%   75.2%   53.2%   14.4%
Random   | 81.3%   71.5%   51.7%    7.0%
Ours     | 91.6%   84.6%   79.2%   69.0%
Table 3: Offline RL normalized scores attained by the vanilla Rainbow and Social-NCE agents (higher is better). The normalized score is calculated as 100 × (agent score - random play score) / (optimal agent score - random play score), as in [mnih_human-level_2015]. Our Social-NCE consistently facilitates the recovery of a near-optimal policy and is particularly advantageous in the low-data regime.
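The normalization described in the caption of Table 3 is a simple linear rescaling between random play (score 0) and the optimal agent (score 100):

```python
def normalized_score(agent, random_play, optimal):
    """Normalized score as in [mnih_human-level_2015]:
    100 * (agent - random play) / (optimal - random play)."""
    return 100.0 * (agent - random_play) / (optimal - random_play)

# Hypothetical raw rewards (not taken from the paper), for illustration:
print(normalized_score(agent=0.5, random_play=0.0, optimal=1.0))  # prints 50.0
```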

4.4.1 Off-policy Reinforcement Learning

We first validate our Social-NCE method in the standard off-policy setting, where an RL agent learns from replay buffer data gathered over the course of training. The temperature and weight of the Social-NCE loss are kept fixed. Figure 8 shows the results of each method averaged over 8 random seeds.

The vanilla Rainbow agent needs more than 4000 episodes to reach a reward of 0.6. In comparison, the agent equipped with our method demonstrates much higher sample efficiency: it attains the same level of reward in fewer than 2000 episodes and quickly obtains a collision-free policy thanks to the prior knowledge injected by the social contrastive task. Our method also offers a slight improvement in final performance. By contrast, random negative sampling fails to provide any significant gain, consistent with the experimental results above.
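For reference, the contrastive term behaves like a standard InfoNCE objective. Below is a minimal pure-Python sketch; the default temperature of 0.1 is an assumed illustrative value, and scalar similarity scores stand in for the embedded-event dot products of the actual model:

```python
import math

def social_nce_loss(pos_sim, neg_sims, temperature=0.1):
    """InfoNCE-style loss over one positive and N negative similarity scores.

    pos_sim / neg_sims: similarities between the encoded history and the
    embedded positive / negative future events. The loss is the negative
    log-softmax probability assigned to the positive sample.
    """
    logits = [pos_sim / temperature] + [s / temperature for s in neg_sims]
    log_denom = math.log(sum(math.exp(l) for l in logits))
    return -(pos_sim / temperature - log_denom)
```

Minimizing this loss pushes the representation toward the positive (ground-truth) future event and away from the negatives drawn in socially unfavorable regions.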

Horizon        | Vanilla        | 1              | 2              | 3              | 4              | 1-4
Reward         | 0.283 ± 0.008  | 0.281 ± 0.019  | 0.296 ± 0.009  | 0.311 ± 0.009  | 0.307 ± 0.012  | 0.323 ± 0.005
Time (s)       | 10.306 ± 0.068 | 10.345 ± 0.065 | 10.281 ± 0.141 | 10.322 ± 0.134 | 10.348 ± 0.107 | 10.334 ± 0.072
Collision (%)  | 11.11 ± 1.45   | 11.24 ± 3.46   | 9.13 ± 2.02    | 5.83 ± 1.62    | 6.09 ± 2.26    | 3.40 ± 1.36
Table 4: Social-NCE for imitation learning with different choices of sampling horizon. Higher is better for reward, and lower is better for the other metrics. Taking multiple time steps (1-4) into account simultaneously yields better results than any single fixed horizon.

4.4.2 Offline Reinforcement Learning

Lastly, we explore the potential of our method in the offline RL setting, in which an agent learns from a static dataset of logged experiences without additional interactions with the environment [levine_offline_2020]. The offline RL setting has attracted rapidly growing attention due to its tremendous promise for making good use of immense experience datasets. Nevertheless, most deep reinforcement learning algorithms are highly vulnerable to the distribution mismatch between the policy being trained and the policies used for data collection [lange_batch_2012, fujimoto_off-policy_2019, kumar_stabilizing_2019, islam_off-policy_2019, agarwal_optimistic_2020].

To verify our method in the offline RL setting, we collect a dataset using the vanilla Rainbow agent as follows: (i) 10k episodes are collected during online RL training from scratch; (ii) multiple free explorations are carried out using online RL checkpoint models saved after different numbers of training episodes. Each free exploration contains 5k episodes, and the full dataset comprises 30k episodes of trials in total. We then train an offline Rainbow policy on a randomly-sampled fraction of the experiences (100%, 50%, 25%, or 10% of the dataset). The temperature and weight of the Social-NCE loss are unchanged.
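The fractional subsampling step can be sketched as below; the helper name and the seeding scheme are our own, as the paper does not specify the sampling code:

```python
import random

def subsample_episodes(episodes, fraction, seed=0):
    """Randomly keep `fraction` of the logged episodes (e.g. 1.0, 0.5, 0.25, 0.1).

    A fixed seed makes each data-budget experiment reproducible across runs.
    """
    rng = random.Random(seed)
    k = max(1, int(len(episodes) * fraction))
    return rng.sample(episodes, k)
```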

Table 3 reports the average performance of different offline methods across 10 random seeds. Due to the aforementioned distributional shift, no offline method attains reward scores as high as the online RL algorithm. Nevertheless, our Social-NCE substantially narrows the performance gap and consistently delivers better results than the vanilla Rainbow. Notably, the offline policy with our method achieves performance comparable to the best vanilla baseline using only 25% of the collected data.

4.5 Ablation: Event Encoder

To validate the benefits of the event encoder, we compare the performance of Social-NCE for imitation learning with different sampling horizons. When the sampling horizon is fixed, we use a simplified location encoder that only takes the location of a sample as input. Table 4 reports the results obtained by using contrastive samples either at a single fixed horizon ranging from 1 to 4 or across all four steps simultaneously. Among the single-step choices, a horizon of 3 yields significant performance gains on both the reward and collision metrics, whereas a horizon of 1 provides no improvement over the baseline due to its short-sightedness. Taking all four steps into account, our method attains the best result, underlining the importance of establishing social contrastive tasks at multiple horizons.
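A multi-horizon variant of the safety-driven negative sampling discussed above can be sketched as follows; the circle radius, number of directions, and data layout are illustrative assumptions rather than the paper's exact implementation:

```python
import math

def draw_negatives(other_agent_futures, horizons=(1, 2, 3, 4), radius=0.2, n_dirs=8):
    """Sketch of safety-driven negative sampling across multiple horizons.

    `other_agent_futures` maps each horizon t to a list of (x, y) positions of
    the other agents at that step. Negatives are placed on a small circle of
    `radius` around each such position -- territories of the other agents that
    the primary agent should avoid. Parameter values are assumptions.
    """
    negatives = []
    for t in horizons:
        for (x, y) in other_agent_futures[t]:
            for k in range(n_dirs):
                angle = 2 * math.pi * k / n_dirs
                negatives.append((t, x + radius * math.cos(angle),
                                     y + radius * math.sin(angle)))
    return negatives
```

With the 1-4 setting, negatives from all four horizons enter the contrastive loss jointly, matching the best-performing column of Table 4.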

5 Conclusion

In this work, we present a contrastive learning method for socially-aware motion representation learning in the multi-agent context. The proposed Social-NCE loss, combined with safety-driven negative sampling, consistently boosts the performance of recent human trajectory forecasting and crowd navigation algorithms in various settings. The strength of our method suggests that incorporating negative data augmentation by means of contrastive learning can be a promising alternative to conventional interactive data collection for building robust neural motion models. We hope this approach will also prove useful for other sequential decision problems that involve complex spatio-temporal dynamics but lack critical data.


Acknowledgments

This work is supported by the Swiss National Science Foundation under Grant 200021-192326. We thank Parth Kothari for helpful suggestions on human trajectory forecasting experiments. We also thank Yifan Sun, Taylor Mordan, Mohammadhossein Bahari, Lorenzo Bertoni, Sven Kreiss and other VITA members for valuable feedback on early drafts.