DANTE: Deep Affinity Network for Clustering Conversational Interactants

07/24/2019 ∙ by Mason Swofford, et al. ∙ Yale University, Stanford University

We propose a data-driven approach to visually detect conversational groups by identifying spatial arrangements typical of these focused social encounters. Our approach uses a novel Deep Affinity Network (DANTE) to predict the likelihood that two individuals in a scene are part of the same conversational group, considering contextual information like the position and orientation of other nearby individuals. The predicted pair-wise affinities are then used in a graph clustering framework to identify both small (e.g., dyads) and bigger groups. The results from our evaluation on two standard benchmarks suggest that the combination of powerful deep learning methods with classical clustering techniques can improve the detection of conversational groups in comparison to prior approaches. Our technique has a wide range of applications from visual scene understanding, e.g., for surveillance, to social robotics.




1 Introduction

Visual conversational group detection is the task of identifying in images the sets of individuals that engage in situated discussions. This problem has a wide range of applications. For example, effective detection of groups can help reason about social interactions in surveillance domains [8]. Likewise, group detection is essential for socially aware robot navigation in human environments [24]. However, visually detecting conversations is an intricate problem that requires perceiving subtle aspects of human interactions.

One way of recognizing social group interactions is by analyzing spatial patterns of human behavior [12]. In particular, Face Formations, or F-Formations in short as denoted by A. Kendon [17], have been shown to naturally emerge during situated social conversations. These formations are the result of people needing to communicate in close proximity while sustaining a shared focus of attention. They are often observed as face-to-face, side-by-side or circular spatial arrangements in open spaces. The type of arrangement that emerges during conversations depends on the number of people that interact together and their conversation topic.

Most prior work on conversational group detection based on the visual analysis of spatial human behavior has focused on modeling properties of F-Formations explicitly [16, 26, 27]. For instance, people tend to keep a social distance from one another during conversations [15] and orient their bodies towards the center of their group [17]. But these approaches do not typically account for the malleability inherent in human spatial behavior. For example, people naturally adapt to crowded environments and modify their spatial formations by interacting closer if need be. How can we design robust methods for visually detecting group conversations based on spatial human behavior?

Inspired by the recent success of Deep Learning (DL) in related combinatorial problems like multi-target visual tracking [28] and image clustering [3], this work explores the potential of leveraging the powerful approximation capabilities of DL for conversational group detection.

Figure 1: Example problem setting. Given people’s position and orientation in a scene, as the one shown on the left, the goal is to identify the individuals that are part of the same conversational group. We approach this problem from a graph clustering perspective (right) and leverage deep learning to compute pair-wise affinities among individuals (nodes in the graph). The affinities represent the likelihood that people belong to the same group.

We propose to detect group conversations by combining classical ideas from graph clustering with modern deep learning techniques. In particular, we pose the problem of detecting groups as finding sets of related nodes in a weighted graph. The nodes of the graph correspond to individuals in a scene with associated position and orientation information obtained from image processing, as illustrated in Figure 1. The graph edges connect the nodes of two nearby people. Edges have an associated affinity (weight) that encodes the likelihood that the two people belong to the same conversational group. While prior work has used simple heuristics to compute edge weights for group detection [16, 29], we propose to learn from data what constitutes appropriate affinity values. To this end, we introduce a novel Deep Affinity NeTwork for clustEring conversational interactants (DANTE). Our insight is that such a function approximator should not only depend on the spatial configuration of the two individuals for which the affinity is computed, but also on relevant social context. In our implementation, we operationalize this idea by defining the context based on the position and orientation of other nearby humans, which is especially informative in crowded social gatherings. A pair of individuals may appear to be in a group based on their positions and orientations, but if there is actually another person in-between them, then it is hard for the two people of interest to sustain a conversation [17]. Our data-driven formulation is able to cope with these kinds of situations without additional ad-hoc steps, e.g., as in [27, 31], to verify that the detected groups effectively conform with F-Formations. We experimentally evaluate the proposed approach using established datasets within the computer vision community. Our results indicate that the combination of classical clustering techniques with powerful data-driven pattern recognition based on neural networks advances the state of the art in the detection of group conversations.

2 Related Work

Early research within the computer vision community on the analysis of human spatial behavior was motivated by surveillance applications and often focused on processing video sequences of public human environments [32, 11, 6, 4]. These early approaches identified two key features for spatial analysis: human position and orientation information; which we also use in our work. Other automated methods for spatial analysis have been implemented using infrared tracking [13], first-person cameras [9, 20, 2], or other wearable sensors [19, 5]. Readers interested in a more comprehensive description of computational models of spatial human behavior are encouraged to refer to related reviews on human activity analysis [1] and interaction detection [30] (chapter 3.2). The next paragraphs focus on describing related work that is close to our effort. Most prior work on detecting F-Formations is based on mathematical models of sustained spatial arrangements [7, 26, 10, 27, 31]. These model-based approaches tend to formalize the transactional segments of individuals, which are the space that extends forward from their lower body and includes whatever they are currently engaged with. Then, these methods find the intersection of transactional spaces in a scene, which corresponds to the o-space of an F-Formation when people engage in conversations [17]. For example, [7, 26, 31] use voting schemes to find o-spaces, while Setti et al. [27] use an iterative graph-cuts approach. We consider the latter work [27] in our evaluation as it provides state of the art results on group detection from a single image, i.e., without temporal reasoning. Our approach falls within the category of group detection methods based on graph clustering [32, 16, 29]. In this setting, individuals correspond to nodes in an undirected, weighted graph. The goal is to partition the graph into groups of nodes that represent human interactions. 
Note that soft group assignments are also possible [4], but we focus on the hard assignment problem in this work. Similar to [16, 29], we use the Dominant Sets clustering algorithm to detect F-Formations in social scenes. Unlike these prior efforts, though, we do not use hand-crafted heuristics based on the position and orientation of people [16] nor a model of human attention [29] to assign weights to graph edges and perform clustering. Instead, we propose to learn in a data-driven fashion a non-linear function that predicts how likely it is that a pair of individuals is taking part in the same conversation, given their social context. A recent method by Sanghvi, Yonetani and Kitani [25] proposed to use DL for group detection in the context of learning communication policies. As part of their policy network, the authors introduce a communication gating module that automatically infers group membership. Their results for group detection are comparable to those of prior state of the art, including [27].

Figure 2: Group detection approach. Our method receives as input the position and orientation of all individuals in a scene (a). With this information, we create an interaction graph (b) and compute pair-wise affinities with DANTE (c). The affinities are used to assemble an affinity matrix (d) to cluster nodes with the Dominant Sets algorithm (e).

We evaluate our approach on standard datasets for group detection within the computer vision community [33, 7]. The datasets provide ground truth group annotations, as well as position and head orientation for the individuals in the scene (see footnote 1). The latter features were gathered with automated computer vision techniques, thus providing realistic inputs for our experimental evaluation.

Footnote 1: Even though A. Kendon [17] defined transactional segments based on people’s lower body orientation, we use head orientation as a key feature for group detection. Our rationale for this decision is evaluating our approach using established datasets [7, 33]. As in [23], though, we believe that future work should consider both body and head orientation for interaction analysis. Our method could easily be extended to this end.

3 Method: Group Detection with DANTE

Our group detection method is illustrated in Fig. 2. The input to our method is a set of visually observed individuals in the scene, where each individual is represented by a pair: a unique identifier and a feature vector of spatial information. The spatial information encoded in the feature includes the 2D position of the individual in the planar layout of the environment and the individual's orientation. The position and orientation data is provided relative to a world coordinate frame statically attached to the environment. Note that instead of using the orientation angle directly in our computation, we encode it with its sine and cosine to avoid issues with angles wrapping around. Additionally, the projection forward in the direction a person is looking is the result of a simple multiplication of the sine and cosine by some value, which is the information employed by several other group detection algorithms [7, 26, 27]. We theorize that our model can easily learn to process orientation data in a useful manner with the proposed representation.

The output of our method is another set, in this case of detected conversational groups. Each group is composed of the identifiers of the individuals that belong to it. The conversational groups are mutually exclusive: an individual can only belong to one conversational group.

Our approach represents the scene as a social interaction graph with a set of nodes (vertices), a set of edges, and non-negative affinities. As shown in Fig. 2(b), each node corresponds to an individual, and the undirected edges connect pairs of nodes. For each edge, its affinity score, or weight, indicates how likely it is that the two individuals connected to it belong to the same conversational group. The goal of our approach is to cluster individuals based on these scores. We compute affinities using a novel data-driven method (Fig. 2(c)) and organize all weights into an affinity (or similarity) matrix (Fig. 2(d)).
This matrix is then used for clustering nodes with the Dominant Sets algorithm [16] (Fig. 2(e)). The next sections further detail the two main steps of this process: computing affinities and clustering.
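The sine and cosine encoding of orientation described above can be illustrated with a minimal sketch. The function names `encode_person` and `project_forward` are hypothetical and are not taken from the authors' implementation; the sketch only shows why two nearly identical headings on either side of the wrap-around point stay close after encoding, and how a forward projection follows from scaling the cosine and sine.

```python
import math

def encode_person(x, y, theta):
    """Hypothetical feature layout: (x, y, cos(theta), sin(theta))."""
    return (x, y, math.cos(theta), math.sin(theta))

def project_forward(feature, distance):
    """Point `distance` units ahead of the person, along the encoded heading."""
    x, y, c, s = feature
    return (x + distance * c, y + distance * s)

# Two headings that differ by 0.02 rad but sit on opposite sides of the wrap.
theta_a, theta_b = 0.01, 2 * math.pi - 0.01
raw_gap = abs(theta_a - theta_b)              # large when angles are compared directly
encoded_gap = math.dist(encode_person(0.0, 0.0, theta_a),
                        encode_person(0.0, 0.0, theta_b))  # small after encoding
```

With raw angles the two headings appear almost a full turn apart, whereas the encoded features differ by roughly the true angular gap.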

Figure 3: DANTE components. The pairwise affinity of a pair of individuals is computed from two types of features: the local dyad features and the global context features. All spatial data that is input to DANTE is transformed to the canonical frame of the pair before computation.

3.1 Affinity Scoring

We propose to use deep learning to compute suitable affinity values for clustering conversational interactants. To this end, we introduce a Deep Affinity NeTwork for clustEring (DANTE). As shown in Fig. 3, DANTE is structured to reason about two types of information: the spatial configuration of the two individuals of interest for which the affinity is computed, and their social context. Computing affinities based on these two types of information within a graph clustering framework is a novel contribution of our work. Our approach does not need additional ad-hoc steps, as in [27, 31], to verify that the detected groups effectively conform with the notion of F-Formations [17].

The inputs to DANTE are the features (position and orientation) of the individuals in a scene. The original features in the world reference frame are transformed and expressed relative to a canonical frame of reference defined with respect to the pair of individuals whose affinity is being computed. The canonical frame is defined as follows. Its origin is located at the middle point between the two analyzed individuals in the global frame. To set its orientation, we align one axis of the canonical frame to point from one individual to the other, as illustrated in Fig. 4. We use this canonical frame to transform individuals’ features and compute the pair-wise affinity of the two persons. Applied to the whole social interaction graph, DANTE provides the pairwise affinities for all possible pairs of individuals in the scene.

The affinities per pair of individuals computed by DANTE are based on two types of features: the dyad features and the context features. Dyad features encode the interactions between pairs of individuals. Context features encode global information about the spatial configuration of the crowd around the individuals of interest.
The following section first details the Dyad Transform, the operation to obtain dyad features. The subsequent one describes the Context Transform, the operation to compute the context features.
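The canonical frame described above (origin at the midpoint of the pair, one axis pointing from one individual to the other) can be sketched as follows. This is an illustration under assumptions: the text does not state which axis is aligned with the pair, so the sketch assumes the x axis, and the function name and the `(x, y, theta)` tuple layout are hypothetical.

```python
import math

def to_canonical_frame(person, p_i, p_j):
    """Express `person` in the frame of the pair (p_i, p_j).

    All arguments are (x, y, theta) tuples in the world frame.
    Assumption: the frame's x axis points from p_i to p_j.
    Returns (x, y, cos(theta), sin(theta)) in the canonical frame.
    """
    # Origin: midpoint between the two individuals of interest.
    cx = (p_i[0] + p_j[0]) / 2.0
    cy = (p_i[1] + p_j[1]) / 2.0
    # Heading of the frame's x axis in world coordinates.
    phi = math.atan2(p_j[1] - p_i[1], p_j[0] - p_i[0])
    # Translate to the origin, then rotate by -phi.
    dx, dy = person[0] - cx, person[1] - cy
    c, s = math.cos(-phi), math.sin(-phi)
    x = c * dx - s * dy
    y = s * dx + c * dy
    theta = person[2] - phi
    return (x, y, math.cos(theta), math.sin(theta))
```

Under this convention the two individuals of the pair always land on the x axis, symmetric about the origin, regardless of where the pair stands in the world frame.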

3.1.1 Dyad Transform

The Dyad Transform of DANTE is in charge of computing local features for the two people for which we want to compute an affinity score. The input to the Dyad Transform is the pair of feature vectors of these two individuals. Each of these vectors contains the position and orientation of the corresponding person in the local (canonical) frame. The Dyad Transform independently applies a multi-layer perceptron (mlp) to each of the input features (top part of Fig. 3), resulting in a feature encoding of fixed dimensionality for each individual. The mlp is composed of dense layers followed by ReLU activations. The result is a matrix with one row of features per individual, which is finally flattened into a dyad feature vector.

Figure 4: We transform the features input to DANTE from the world frame to a local frame relative to the two people of interest. This transformation reduces the variability in the input space in comparison to processing features from the world frame directly.

3.1.2 Context Transform

DANTE’s Context Transform computes a global feature representation for the social context of the dyad of interest. Our design for this model component is inspired by prior work on point cloud processing [22] and visual tracking [14]. At its core, the Context Transform uses a symmetric function to handle an unordered and potentially variable number of inputs. In our case, these inputs correspond to the set of individual feature vectors with the position and orientation of people near the individuals of interest. More formally, the input to the Context Transform for a given pair is a set of feature vectors, each as in previous sections. Note that in our implementation, we represent the input set as a matrix, with each row corresponding to the data of one person. Importantly, to compute the context feature for a pair of individuals, the set of individual features is transformed to the canonical reference frame of that pair. Similarly to the Dyad Transform, the Context Transform first independently applies a multi-layer perceptron to each of the rows of the input matrix (bottom part of Fig. 3). The mlp is composed of dense layers followed by ReLU activations, resulting in a matrix of features with one row per person. The latter matrix is finally transformed by max pooling along its rows. The output is a context feature vector. Note that max pooling is the key symmetric operation that makes the Context Transform invariant to input permutation.
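The permutation invariance provided by max pooling can be checked with a small sketch. It is a simplification under assumptions: a single dense layer with hand-picked weights stands in for the multi-layer perceptron, and the names `dense_relu` and `context_transform` are hypothetical. At least one context person is assumed.

```python
def dense_relu(vec, weights, biases):
    """One dense layer followed by ReLU, shared across all rows."""
    return [max(0.0, sum(w * v for w, v in zip(row, vec)) + b)
            for row, b in zip(weights, biases)]

def context_transform(rows, weights, biases):
    """Encode each person independently, then max-pool over people."""
    encoded = [dense_relu(r, weights, biases) for r in rows]
    # zip(*encoded) iterates over feature columns; max pools across people.
    return [max(column) for column in zip(*encoded)]

# Hypothetical weights: 3 output features from 4 inputs (x, y, cos, sin).
W = [[1.0, 0.0, 0.0, 0.0],
     [0.0, 1.0, 0.0, 0.0],
     [0.0, 0.0, 1.0, 1.0]]
B = [0.0, 0.0, 0.0]

people = [(1.0, -2.0, 1.0, 0.0), (-1.0, 3.0, 0.0, 1.0)]
context = context_transform(people, W, B)
shuffled = context_transform(list(reversed(people)), W, B)
```

Because the same layer is applied to every row and the pooling takes a per-feature maximum, the output length depends only on the layer width, not on how many people are in the scene or in which order they are listed.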

3.1.3 Combining Dyad and Context Features

Finally, DANTE uses the dyad features and context features to compute an affinity score. To this end, DANTE first concatenates the two feature vectors column-wise, resulting in a single combined vector. Then, an mlp is used to transform the combined features into a joint representation. In this case, the mlp is composed of dense layers, each followed by a ReLU activation. Finally, one more dense layer projects the resulting features down to a scalar value. This layer uses a sigmoid activation function to constrain the output to the [0, 1] range.
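The final stage (concatenate, transform, project to a scalar, squash with a sigmoid) can be sketched as below. Layer counts and sizes are not given in this section, so the sketch assumes a single hidden layer with caller-supplied weights; `affinity_head` and its parameters are hypothetical names.

```python
import math

def dense(vec, weights, biases, relu=True):
    """Dense layer; applies ReLU unless relu=False."""
    out = [sum(w * v for w, v in zip(row, vec)) + b
           for row, b in zip(weights, biases)]
    return [max(0.0, o) for o in out] if relu else out

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def affinity_head(dyad_feat, context_feat, hidden_w, hidden_b, out_w, out_b):
    """Concatenate dyad and context features and map them to one affinity."""
    joint = list(dyad_feat) + list(context_feat)   # concatenation
    hidden = dense(joint, hidden_w, hidden_b)      # joint representation
    (logit,) = dense(hidden, [out_w], [out_b], relu=False)
    return sigmoid(logit)                          # affinity between 0 and 1
```

The sigmoid lets the output be read as the likelihood that the two individuals share a conversational group, which matches the use of a log loss during training.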


3.1.4 Additional Implementation Details

To train DANTE, we use the log loss between each predicted affinity and the true affinity, corresponding to whether or not two people are part of the same conversational group. We choose the hyper-parameters for the multi-layer perceptrons and their sizes experimentally, as further detailed in Sec. 4.
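As a rough illustration of the training signal, the sketch below derives binary targets from ground truth groups and evaluates the log loss on predicted affinities. The helper names and the clipping constant `eps` are assumptions for the example, and unordered pairs are used for brevity.

```python
import math
from itertools import combinations

def pair_targets(ids, groups):
    """Target is 1 if two people share a ground truth group, else 0."""
    group_of = {pid: g for g, members in enumerate(groups) for pid in members}
    return {(i, j): int(i in group_of and group_of.get(i) == group_of.get(j))
            for i, j in combinations(ids, 2)}

def log_loss(predicted, targets, eps=1e-7):
    """Mean binary cross-entropy; eps clipping (an assumption) avoids log(0)."""
    total = 0.0
    for p, t in zip(predicted, targets):
        p = min(max(p, eps), 1.0 - eps)
        total -= t * math.log(p) + (1 - t) * math.log(1.0 - p)
    return total / len(predicted)

targets = pair_targets([1, 2, 3], [[1, 2]])   # person 3 is not in any group
loss = log_loss([0.9, 0.1, 0.1], [targets[(1, 2)], targets[(1, 3)], targets[(2, 3)]])
```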

3.2 Dominant Sets Grouping

Once the affinities for each pair of individuals in the social interaction graph are computed, our approach proceeds to group people using the Dominant Sets (DS) algorithm by Hung and Kröse [16]. Dominant sets (clusters) in the context of the algorithm are a generalization of maximal cliques to edge-weighted graphs with no self-loops [21]. Our social interaction graph is one such graph. For a detailed explanation of the Dominant Sets algorithm, we refer the reader to sections 3.2 and 6 of [16]. In short, the DS algorithm iteratively finds clusters that satisfy the following property: the mutual affinity between all of the cluster members is higher than the affinity between any of its members and those outside of it. This property describes very compact structures, which are well suited to represent F-formations of any size. Although dominant sets can also be applied to social interaction graphs with asymmetric affinities, symmetric affinities have been reported to yield superior results [16, 29]. Thus, we assume symmetric affinities in our work by setting the two directed affinities of each pair equal to the average of their predicted values.
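The grouping step can be approximated with the sketch below. It symmetrizes the predicted affinities by averaging and then extracts clusters one at a time with replicator dynamics, an iteration commonly used to find dominant sets. This is not a reproduction of the procedure in [16]: the stopping rule here is a hypothetical cohesiveness threshold (`min_cohesion`), and the iteration count and support tolerance are likewise assumed values.

```python
def symmetrize(aff):
    """Average the two directed affinities of each pair; no self-loops."""
    n = len(aff)
    return [[0.0 if i == j else (aff[i][j] + aff[j][i]) / 2.0
             for j in range(n)] for i in range(n)]

def extract_dominant_set(aff, nodes, iters=500, support_tol=1e-4):
    """Run replicator dynamics on `nodes`; return (members, cohesiveness)."""
    x = {i: 1.0 / len(nodes) for i in nodes}   # start from uniform weights
    cohesion = 0.0
    for _ in range(iters):
        ax = {i: sum(aff[i][j] * x[j] for j in nodes) for i in nodes}
        cohesion = sum(x[i] * ax[i] for i in nodes)
        if cohesion <= 0.0:
            break
        x = {i: x[i] * ax[i] / cohesion for i in nodes}
    return [i for i in nodes if x[i] > support_tol], cohesion

def cluster(aff, min_cohesion=0.1):
    """Peel off dominant sets until the remaining nodes are weakly related."""
    sym = symmetrize(aff)
    remaining = list(range(len(sym)))
    groups = []
    while len(remaining) > 1:
        members, cohesion = extract_dominant_set(sym, remaining)
        if cohesion < min_cohesion:   # hypothetical stopping rule
            break
        groups.append(members)
        remaining = [i for i in remaining if i not in members]
    groups.extend([i] for i in remaining)   # leftovers become singletons
    return groups
```

For a scene with two tight dyads and one bystander, the sketch would be expected to return the two dyads followed by a singleton, since the bystander's weight decays to zero in every extraction.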

4 Evaluation

We conduct systematic evaluations of our proposed group detection approach in established benchmarks from the computer vision community. The following sections describe the datasets that we consider in our experiments, our evaluation metrics, experimental procedure, and results.

4.1 Datasets

We consider two publicly available datasets of social interactions in our evaluation:

  • Cocktail Party Dataset [33]. Contains about 30 minutes of video recordings of a cocktail party in a lab environment. The video shows 6 people conversing with one another and consuming drinks and appetizers. The party was recorded using four synchronized cameras installed in the corners of the room. Subjects’ positions were logged using a particle filter-based body tracker with head pose estimation [18]. Groups were annotated at 5 second intervals, resulting in a total of 320 frames with ground truth annotations.

  • Coffee Break Dataset [7]. Images were collected using a single camera outdoors. A total of 14 people engaged in conversations during coffee breaks. Group interactions are of small sizes. People tracking is rough, with orientations only taking values of 0, 1.57, 3.14, and 4.71. Compared to Cocktail Party, the spatial data provided by the Coffee Break dataset is far noisier and more realistic. The dataset includes frames with ground truth group annotations.

4.1.1 Data Augmentation

Due to the small size of the datasets, we augment them during training. We flip position and orientation data over the horizontal and vertical axes of the world coordinate frame, giving four times as many training examples. Since groups are defined based on person IDs, they do not need to be adjusted for the augmented data. In addition to the above transformations, we also consider augmenting the training data with synthetic examples from [7]. The synthetic data consisted of 100 different situations created by psychologists. In each situation, some simulated people took part in F-Formations and others did not. Ground truth groups as well as people's positions and orientations are provided as part of this synthetic dataset.
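The flip augmentation can be sketched as follows, assuming each frame is a dictionary from person ID to an `(x, y, theta)` tuple and that the flips are taken about the axes through the world origin. The function names are hypothetical. Group labels are untouched because they refer to person IDs only.

```python
import math

def flip_over_horizontal_axis(person):
    """Mirror across the x axis: y and the heading change sign."""
    x, y, theta = person
    return (x, -y, -theta)

def flip_over_vertical_axis(person):
    """Mirror across the y axis: x changes sign, heading becomes pi - theta."""
    x, y, theta = person
    return (-x, y, math.pi - theta)

def augment(frame):
    """Return the original frame plus its three flipped versions."""
    h = {pid: flip_over_horizontal_axis(p) for pid, p in frame.items()}
    v = {pid: flip_over_vertical_axis(p) for pid, p in frame.items()}
    hv = {pid: flip_over_vertical_axis(p) for pid, p in h.items()}
    return [frame, h, v, hv]
```

Angles are left unnormalized here, which is harmless if they are later encoded with sine and cosine.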

Model             Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Overall
GComm             -       -       -       -       -       0.60
GTCG              0.31    0.46    0.09    0.17    0.43    0.29
GCFF              0.49    0.68    0.52    0.71    0.80    0.64
DANTE             0.73    0.80    0.56    0.72    0.83    0.73
DANTE+Synthetic   0.70    0.83    0.59    0.64    0.79    0.71
DANTE-NoContext   0.60    0.68    0.33    0.75    0.83    0.64
Table 1: Results for the Cocktail Party dataset. The results for Folds 1-5 and Overall are computed with T=1.

Model             Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Overall
GComm             -       -       -       -       -       0.63
GTCG              0.48    0.32    0.51    0.48    0.58    0.48
GCFF              0.50    0.35    0.73    0.71    0.84    0.63
DANTE             0.39    0.40    0.74    0.77    0.89    0.64
DANTE+Synthetic   0.48    0.36    0.77    0.77    0.89    0.65
DANTE-NoContext   0.46    0.43    0.76    0.77    0.89    0.66
Table 2: Results for the Coffee Break dataset. The results for Folds 1-5 and Overall are computed with T=1.

4.2 Evaluation Metrics

We consider standard evaluation metrics for visual conversational group detection. A given group is said to be correctly estimated if at least $\lceil T \cdot |G| \rceil$ of its members are correctly estimated and if no more than $\lceil (1 - T) \cdot |G| \rceil$ false subjects are identified, where $|G|$ is the cardinality of the labeled group $G$ and $T \in [0, 1]$ is a defined tolerance threshold. Common values of $T$ are $2/3$ and $1$ [7, 26, 27, 29, 31, 25]. We center our attention on evaluating methods based on $T = 1$ since it is more challenging than $T = 2/3$. Let $TP$ (true positives) be the number of correctly detected groups, $FN$ (false negatives) the number of non-detected groups, and $FP$ (false positives) the number of groups that were detected but did not exist. Then, we measure our accuracy with three metrics: precision, recall, and $F_1$ score. Precision is $TP / (TP + FP)$, recall is $TP / (TP + FN)$, and the $F_1$ score is $2 \cdot \text{precision} \cdot \text{recall} / (\text{precision} + \text{recall})$.
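For the strict tolerance setting used in the tables (T=1), a detected group counts as correct only if it matches a labeled group exactly, so the three metrics can be sketched with set comparisons. The function name is hypothetical, and whether singletons are passed in as groups is left to the caller; the sketch does not cover other tolerance values.

```python
def group_metrics_exact(predicted, truth):
    """Precision, recall, and F1 when groups must match exactly (T=1)."""
    pred = {frozenset(g) for g in predicted}
    true = {frozenset(g) for g in truth}
    tp = len(pred & true)   # groups detected exactly
    fp = len(pred - true)   # detected groups that do not exist
    fn = len(true - pred)   # labeled groups that were missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

precision, recall, f1 = group_metrics_exact(
    predicted=[[1, 2], [3, 4, 5]],
    truth=[[1, 2], [3, 4], [5, 6]],
)
```

In this example one of two detections is exact and one of three labeled groups is found, which illustrates how a single extra member turns an otherwise good detection into both a false positive and a false negative.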

4.3 Baselines

We focus on comparing the proposed approach against two state-of-the-art methods. First, we compare our proposed approach against the graph-cuts method by Setti et al. [27] (GCFF) for detecting conversational groups from images, given that it outperforms prior model-based group detection approaches, e.g., [7, 26]. Second, we compare results against the game-theoretic approach of Vascon et al. [29] (GTCG) because this method relies on graph clustering like our approach and tends to give better performance than [16]. We fine-tune hyper-parameters for these methods using the same data that we use for training our approach, as further detailed in the next section. Although we did not re-run the group communication approach by Sanghvi, Yonetani, and Kitani [25] (GComm), we report results from their publication as a reference. This approach is of interest as well because it uses deep learning for conversational group detection.

4.4 Experimental Setup

Due to the small size of the datasets, we use 5-fold cross validation to measure performance and study variability in the results. Each fold is taken as a contiguous section of data due to the inherent auto-correlation of time-series images. We select data for validation from the training set such that it separates as much as possible the data that is actually used for training from the data that is used for testing. Test data is only used for computing final results after hyper-parameters are chosen based on the validation set. The hyper-parameters for DANTE include the number of layers in its multi-layer perceptrons, as well as the size of these layers. We train the method through gradient descent with the Adam optimizer, a learning rate of 0.0001 and a batch size of 32 samples. In order to fairly compare our results against previous work, we fine-tuned the state-of-the-art baselines [27, 29] using the corresponding training and validation data for each fold. The average results for the graph-cuts approach of Setti et al. were slightly improved in comparison to [27]. Note that [29] does not present results for T=1 F1.
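Splitting a time-ordered sequence into contiguous folds, rather than shuffling frames, can be sketched as below. The function names are hypothetical, and the selection of validation data inside the training portion is not reproduced.

```python
def contiguous_folds(n_frames, k=5):
    """Split frame indices 0..n_frames-1 into k contiguous blocks."""
    base, extra = divmod(n_frames, k)
    folds, start = [], 0
    for f in range(k):
        size = base + (1 if f < extra else 0)   # spread the remainder
        folds.append(list(range(start, start + size)))
        start += size
    return folds

def train_test_split(folds, test_fold):
    """Hold out one block for testing; train on all other blocks."""
    train = [i for f, block in enumerate(folds) if f != test_fold for i in block]
    return train, folds[test_fold]

folds = contiguous_folds(320)        # e.g. the 320 annotated Cocktail Party frames
train_idx, test_idx = train_test_split(folds, test_fold=0)
```

Keeping each fold contiguous limits leakage between neighboring, highly correlated frames that a random split would place on both sides of the train and test boundary.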

4.5 Results

4.5.1 Comparison Against Baselines

Tables 1 and 2 show quantitative results for the Cocktail Party and Coffee Break datasets, respectively. The first five columns show detailed results per fold, and the last one shows the overall, averaged result for the T=1 F1 metric. In general, our proposed approach outperforms the baselines in the Cocktail Party dataset (DANTE row in Table 1). The average improvement (Overall column) is 9 percentage points over GCFF, 13 over GComm, and 44 over GTCG. On a fold-by-fold basis, our approach also outperforms the baselines. These results strongly suggest that our approach improves the state of the art in conversational group detection when the input data has reasonable quality.

Figure 5: Evaluation of Conversational Grouping Algorithms. First Column: Original image from the dataset, Second Column: ground truth conversational group, Third Column results from GCFF [27], Fourth Column: our results with DANTE. The wall with the door corresponds to the top side of the diagrams. In comparison to DANTE, GCFF tends to be more inclusive and create wrong large conversational groups. This result aligns with prior findings [31].

The benefits of our approach are less noticeable in the Coffee Break dataset, although our method still outperforms the baselines in terms of Overall F1 score by a small margin. It also performs the best on all folds, with the exception of Fold 1. One reason for this discrepancy is that Fold 1 had the noisiest spatial information for the individuals in the scene. This hurt prior work, but was especially harmful to our method because of its dependency on data.

(a) Without global context, DANTE-NoContext groups two people even though they are clearly occluded by other people; in the ground truth, they are not interacting.

(b) DANTE accounts for the large group when computing pairwise affinities, while DANTE-NoContext gives Person 4 low pairwise affinities due to his large distance from the others.
Figure 6: Ablative analysis. Left: image from the Cocktail Party dataset, Middle: DANTE, Right: DANTE-NoContext. Grouped individuals are the same color.

Training DANTE with synthetic data from [7] lowered our overall group detection results by 2% in the Cocktail Party dataset. In the Coffee Break dataset, the synthetic data slightly improved the results by 1%. We attribute these mixed results to the differences in the amount of data that each benchmark provides and their quality. Figures 5 and 7 provide qualitative results in the Cocktail Party dataset. In particular, Fig. 5 compares example results between our method and the GCFF approach [27]. In rows 1, 3, and 4, GCFF chooses larger groups due to a penalty on small group sizes. This preference for larger groups often overrides information in the data, such as a person facing away from the proposed group. In row 2, Person 5 is likely excluded from GCFF’s primary group due to a heuristic which prevents grouping two people if there is someone else in between them. In comparison, our deep learning approach considers context to learn more nuanced behaviors, such as how one orients oneself when leaving a group (row 1), how one behaves when standing on the outskirts (rows 2 and 3), and how people arranged in a ring can still form smaller groups (row 4). This flexibility largely comes from not employing brittle heuristics to account for context and instead allowing the model to learn from data. Fig. 7 shows failure cases by our method. In case (a), one of the main limitations of our type of approach becomes evident: useful information (e.g., posture or gaze) to correctly assign Individual 1 to the right group is not available to DANTE. Our method only has access to 2D position and orientation, which can make interaction analysis difficult. In case (b), another limitation is apparent: DANTE lacks environmental features (e.g., table or wall locations), which could explain the large space in between the two predicted groups. Instead, DANTE infers this large empty space to signify that the two cohorts are separate groups.
These failure cases illustrate opportunities for future improvement.

4.5.2 Effect of the Context Transform in DANTE

We hypothesized that adding contextual information to the affinity computation by DANTE would improve the results of our group detection approach. To explore if this was effectively the case, we performed a small ablation study. In particular, we evaluated a version of DANTE that only reasoned about the position and orientation of the individuals of interest using the Dyad Transform. The results for the Cocktail Party dataset are presented in Table 1 in the row corresponding to DANTE-NoContext. As expected, excluding the Context Transform from DANTE resulted in 9% worse overall group detection performance. Figure 6 shows example, qualitative results. Comparing DANTE vs. DANTE-NoContext, we can observe qualitatively that the social context input to the affinity computation is highly relevant to the group detection task. In Fig. 6(a), two people that are separated by another interaction are grouped with one another, even though this would be unlikely in real situations. In real life, the two people would have trouble communicating with each other when a conversation is happening in-between them. In Fig. 6(b), a person is missed in a group interaction. These results suggest that our complete version of DANTE is able to reason about complex spatial patterns, without ad-hoc steps to verify F-Formations, while DANTE without contextual information is only able to reason about more basic interactions. Surprisingly, the group detection results were slightly better without the Context Transform in the Coffee Break dataset (Table 2). We attribute the lack of benefit of the context in this case to the more noisy data provided by the Coffee Break dataset. This idea is supported by the fact that DANTE performs comparatively worse than DANTE-NoContext on the noisiest fold, Fold 1. Worth noting, DANTE-NoContext slightly outperformed other baselines in this benchmark even without the information from the global feature. 
These results reinforce the idea that deep learning can help with the group detection task.

(a) DANTE estimates that Individual 1 does not belong to a conversational group while he is grasping an object from the coffee table.

(b) DANTE estimates two groups while all individuals are conversing together.
Figure 7: Failure cases by the proposed approach. Left: original image from the Cocktail Party dataset, Middle: estimated groups by DANTE, Right: ground truth. Grouped individuals are the same color.

5 Conclusions & Future Work

We presented a novel approach for visual conversational group detection. Our method combined graph clustering with modern deep learning techniques to identify group interactions based on visual patterns of spatial behavior. Under the challenging T=1 F1 metric, our method clearly outperformed previous work in the Cocktail Party dataset and achieved slightly better results in the Coffee Break dataset. From an algorithmic point of view, clear improvements were derived from better affinity scores used for graph clustering. Additionally, the use of data-driven methods allowed our approach to cope with complex spatial patterns of behavior without ad-hoc steps to verify group interactions. Overall, our work highlights the potential of using the powerful approximation capabilities of deep learning for human interaction analysis. Although several datasets exist for group detection within the computer vision community, the size of these datasets is generally small for deep learning, and this sometimes limited the performance of the proposed method. There are good reasons for the datasets being small, such as data collection with groups of people being time consuming and expensive. Thus, two interesting avenues for future research would be scalable data collection of spatial behavior and unsupervised detection of conversational groups.

6 Acknowledgments

This work has been partially supported by JD.com American Technologies Corporation (“JD”) under the SAIL-JD AI Research Initiative. This article solely reflects the opinions and conclusions of its authors and not JD or any entity associated with JD.com.

References


  • [1] J. K. Aggarwal and M. S. Ryoo. Human activity analysis: A review. ACM Computing Surveys (CSUR), 43(3):16, 2011.
  • [2] S. Alletto, G. Serra, S. Calderara, F. Solera, and R. Cucchiara. From ego to nos-vision: Detecting social relationships in first-person views. In Proc. of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 580–585, 2014.
  • [3] J. Chang, L. Wang, G. Meng, S. Xiang, and C. Pan. Deep adaptive image clustering. In Proceedings of the IEEE International Conference on Computer Vision, pages 5879–5887, 2017.
  • [4] M.-C. Chang, N. Krahnstoever, and W. Ge. Probabilistic group-level motion analysis and scenario recognition. In Proc. of the 2011 International Conference on Computer Vision (ICCV), pages 747–754, 2011.
  • [5] C.-W. Chen, R. C. Ugarte, C. Wu, and H. Aghajan. Discovering social interactions in real work environments. In Proc. of Face and Gesture 2011, pages 933–938, 2011.
  • [6] W. Choi, K. Shahid, and S. Savarese. What are they doing?: Collective activity classification using spatio-temporal relationship among people. In Proc. of the 2009 IEEE 12th International Conference on Computer Vision Workshops (ICCV Workshops), pages 1282–1289, 2009.
  • [7] M. Cristani, L. Bazzani, G. Paggetti, A. Fossati, D. Tosato, A. Del Bue, G. Menegaz, and V. Murino. Social interaction discovery by statistical analysis of F-formations. In BMVC, volume 2, page 4, 2011.
  • [8] M. Cristani, R. Raghavendra, A. Del Bue, and V. Murino. Human behavior analysis in video surveillance: A social signal processing perspective. Neurocomputing, 100:86–97, 2013.
  • [9] A. Fathi, J. K. Hodgins, and J. M. Rehg. Social interactions: A first-person perspective. In 2012 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1226–1233, 2012.
  • [10] T. Gan, Y. Wong, D. Zhang, and M. S. Kankanhalli. Temporal encoded F-formation system for social interaction detection. In Proceedings of the 21st ACM International Conference on Multimedia, pages 937–946. ACM, 2013.
  • [11] W. Ge, R. T. Collins, and B. Ruback. Automatically detecting the small group structure of a crowd. In Proc. of the 2009 Workshop on Applications of Computer Vision, WACV, pages 1–8. IEEE, 2009.
  • [12] E. Goffman. Behavior in public places. Simon and Schuster, 2008.
  • [13] G. Groh, A. Lehmann, J. Reimers, M. R. Frieß, and L. Schwarz. Detecting social situations from interaction geometry. In Proc. of the 2010 IEEE Second International Conference on Social Computing, pages 1–8. IEEE, 2010.
  • [14] A. Gupta, J. Johnson, L. Fei-Fei, S. Savarese, and A. Alahi. Social GAN: Socially acceptable trajectories with generative adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2255–2264, 2018.
  • [15] E. T. Hall. The hidden dimension, volume 609. Garden City, NY: Doubleday, 1966.
  • [16] H. Hung and B. Kröse. Detecting F-formations as dominant sets. In Proceedings of the 13th International Conference on Multimodal Interfaces, pages 231–238. ACM, 2011.
  • [17] A. Kendon. Conducting interaction: Patterns of behavior in focused encounters, volume 7. CUP Archive, 1990.
  • [18] O. Lanz. Approximate Bayesian multibody tracking. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(9):1436–1449, 2006.
  • [19] D. O. Olguín, B. N. Waber, T. Kim, A. Mohan, K. Ara, and A. Pentland. Sensible organizations: Technology and methodology for automatically measuring organizational behavior. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), 39(1):43–55, 2009.
  • [20] H. S. Park, E. Jain, and Y. Sheikh. 3D social saliency from head-mounted cameras. In Advances in Neural Information Processing Systems, pages 422–430, 2012.
  • [21] M. Pavan and M. Pelillo. Dominant sets and pairwise clustering. IEEE Transactions on Pattern Analysis and Machine Intelligence, 29(1):167–172, 2007.
  • [22] C. R. Qi, H. Su, K. Mo, and L. J. Guibas. PointNet: Deep learning on point sets for 3D classification and segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 652–660, 2017.
  • [23] E. Ricci, J. Varadarajan, R. Subramanian, S. Rota Bulo, N. Ahuja, and O. Lanz. Uncovering interactions and interactors: Joint estimation of head, body orientation and F-formations from surveillance videos. In Proceedings of the IEEE International Conference on Computer Vision, pages 4660–4668, 2015.
  • [24] J. Rios-Martinez, A. Spalanzani, and C. Laugier. From proxemics theory to socially-aware navigation: A survey. International Journal of Social Robotics, 7(2):137–153, 2015.
  • [25] N. Sanghvi, R. Yonetani, and K. Kitani. Learning group communication from demonstration. In Workshop on Models and Representations for Natural Human-Robot Communication at the 2018 Robotics: Science and Systems Conference (RSS), 2018.
  • [26] F. Setti, O. Lanz, R. Ferrario, V. Murino, and M. Cristani. Multi-scale F-formation discovery for group detection. In 2013 IEEE International Conference on Image Processing, pages 3547–3551. IEEE, 2013.
  • [27] F. Setti, C. Russell, C. Bassetti, and M. Cristani. F-formation detection: Individuating free-standing conversational groups in images. PLoS ONE, 10(5):e0123783, 2015.
  • [28] S. Tang, B. Andres, M. Andriluka, and B. Schiele. Multi-person tracking by multicut and deep matching. In European Conference on Computer Vision, pages 100–111. Springer, 2016.
  • [29] S. Vascon, E. Z. Mequanint, M. Cristani, H. Hung, M. Pelillo, and V. Murino. Detecting conversational groups in images and sequences: A robust game-theoretic approach. Computer Vision and Image Understanding, 143:11–24, 2016.
  • [30] M. Vázquez. Reasoning About Spatial Patterns of Human Behavior During Group Conversations with Robots. PhD thesis, The Robotics Institute, Carnegie Mellon University, Pittsburgh, PA, July 2017.
  • [31] M. Vázquez, A. Steinfeld, and S. E. Hudson. Parallel detection of conversational groups of free-standing people and tracking of their lower-body orientation. In 2015 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 3010–3017. IEEE, 2015.
  • [32] T. Yu, S.-N. Lim, K. Patwardhan, and N. Krahnstoever. Monitoring, recognizing and discovering social networks. In Proc. of the 2009 IEEE Conference on Computer Vision and Pattern Recognition, CVPR, pages 1462–1469, 2009.
  • [33] G. Zen, B. Lepri, E. Ricci, and O. Lanz. Space speaks: towards socially and personality aware visual surveillance. In Proc. of the 1st ACM International Workshop on Multimodal Pervasive Video Analysis, pages 37–42, 2010.