Social relationships between individuals have formed the basis of social structure since the earliest civilizations. Today, apart from social interactions that occur in the physical world, people also communicate through various social media platforms, such as Facebook and Instagram. A large number of images and videos that explicitly or implicitly capture people’s social relationship information have been uploaded to the internet. Humans can naturally interpret the social relationships of people in a scene. To build intelligent machines, it is therefore necessary to develop computer vision algorithms that can interpret social relationships.
Enabling computers to understand social relationships from visual data is important for many applications. First, it enables users to pose a socially meaningful query to an image retrieval system, such as ‘Grandma playing with grandson’. Second, visual privacy advisor systems (Orekondy et al, 2017) can alert users to potential privacy risks if the posted images contain sensitive social relationships. Third, robots can better interact with people in daily life by inferring people’s characteristics and possible behaviors based on their social relationships. Last but not least, surveillance systems can better analyze human behaviors with an understanding of social relationships.
In this work, we aim to build computational models that address the problem of visual social relationship recognition in images. We start by defining a set of social relationship categories. With reference to the relational models theory (Fiske, 1992) in the social psychology literature, we define a hierarchy of social relationship categories that embeds the coarse-to-fine characteristic of common social relationships (as illustrated in Fig. 1). Our definition follows a prototype-based approach: we are interested in finding exemplars that parsimoniously describe the most common situations, rather than an abstract definition that could cover all possible cases.
Social relationship recognition from images is a challenging task for several reasons. First, images have wide variations in scale, scene, human pose and appearance, as well as occlusions. Second, humans infer social relationships not only based on the physical appearance (e.g., color of clothes, gender, age, etc.), but also from subtler cues (e.g., expression, proximity, and context) (Alletto et al, 2014; Ramanathan et al, 2013; Zhang et al, 2015b). Third, a pair of people in an image might have multiple plausible social relationships, as shown in Fig. 2. While previous works on social relationship recognition only consider the majority consensus (Li et al, 2017a; Sun et al, 2017), it remains a challenging issue to make use of the ambiguity in social relationship labels.
A preliminary version of this work was published earlier (Li et al, 2017a). We have extended this work in the following manner. First, we propose a novel Adaptive Focal Loss that addresses the label ambiguity challenge and the class imbalance problem in training. Second, we improve the Dual-Glance model of Li et al (2017a) with network modifications (see Section 3.2). Third, we conduct additional experiments on two datasets (i.e. People in Social Context (Li et al, 2017a) and Social Domain and Relation (Sun et al, 2017)), and achieve significant performance improvement over previous methods.
The key contributions can be summarized as:
We propose a Dual-Glance model that mimics the human visual system to explore useful and complementary visual cues for social relationship recognition. The first glance fixates on the individual person pair of interest, and performs prediction based on its appearance and geometrical information. The second glance exploits contextual cues from regions generated by a Region Proposal Network (RPN) (Ren et al, 2015) to refine the prediction.
We propose a novel Attentive R-CNN. Given a person pair, attention is selectively assigned to the informative contextual regions. The attention mechanism is guided by both bottom-up and top-down signals.
We propose a novel Adaptive Focal Loss. It leverages the embedded ambiguity in social relationship annotations to adaptively modulate the loss and focuses training on hard examples. Performance is improved compared to using other loss functions.
To study social relationships, we collected the People in Social Context (PISC) dataset. It consists of 23,311 images and 79,244 person pairs with manually annotated social relationship labels. In addition, PISC includes 66 annotated occupation categories.
We perform experiments with ablation studies on PISC and the Social Domain and Relation (SDR) (Sun et al, 2017) dataset, where we quantitatively and qualitatively validate the proposed method.
The remainder of the paper is organized as follows. First, we review the related work in Section 2. Then we elaborate on the proposed Dual-Glance model in Section 3, and the Adaptive Focal Loss in Section 4. Section 5 details the PISC dataset, whereas the experiment details and results are delineated in Section 6. Section 7 concludes the paper.
2 Related Work
2.1 Social Relationship
The study of social relationships lies at the heart of the social sciences. Social relationships are the cognitive sources for generating social action, for understanding individuals’ social behavior, and for coordinating social interaction (Haslam and Fiske, 1992). There are two forms of representation for relational cognition. The first represents relationships with a set of theorized or empirically derived dimensions (Conte and Plutchik, 1981). The other proposes implicit categories for relational cognition (Haslam, 1994). One of the most widely accepted categorical theories is the relational models theory (Fiske, 1992). It offers a unified account of social relations by proposing four elementary prototypes, namely communal sharing, equality matching, authority ranking, and market pricing. In this work, inspired by the relational models theory, we identify 5 exemplar relationships that are common in daily life and visually distinguishable (i.e. friends, family members, couple, professional and commercial). We group them into two relation domains, namely intimate relation and non-intimate relation, as illustrated in Fig. 1.
In the computer vision literature, social information has been widely adopted as a supplementary cue in several tasks. Gallagher and Chen (2009) extract features describing group structure to aid demographic recognition. Shao et al (2013) use social context for occupation recognition in photos. Qin and Shelton (2016) exploit social grouping for multi-target tracking. For group activity recognition, social roles and relationship information have been implicitly embedded into the inference model (Choi and Savarese, 2012; Deng et al, 2016; Direkoglu and O’Connor, 2012; Lan et al, 2012a, b). Alletto et al (2014) define a ‘social pairwise feature’ based on F-formation and use it for group detection in egocentric videos. More recently, Alahi et al (2016) and Robicquet et al (2016) model social factors for human trajectory prediction.
Many studies focus on relationships among family members, such as siblings, husband-wife, parent-child and grandparent-grandchild. Such studies include kinship recognition (Wang et al, 2010; Chen et al, 2012; Guo et al, 2014; Shao et al, 2014; Xia et al, 2012) and kinship verification (Fang et al, 2010; Xia et al, 2012; Dibeklioglu et al, 2013) in group photos. Most of these works leverage facial information to infer kinship, including the location of faces, facial appearance, attributes and landmarks. Zhang et al (2015b) discover relation traits such as “warm”, “friendly” and “dominant” from face images. Another relevant topic is intimacy prediction (Yang et al, 2012; Chu et al, 2015) based on human poses.
For video based social relation analysis, Ding and Yilmaz (2014) discover social communities formed by actors in movies. Marín-Jiménez et al (2014) detect social interactions in TV shows, whereas Yun et al (2012) study human interaction in RGBD videos. Ramanathan et al (2013) study social events and discover pre-defined social roles in a weakly supervised setting (e.g. birthday child in a birthday party). Lv et al (2018) propose to use multimodal data for social relation classification in TV shows and movies. Fan et al (2018) analyze shared attention in social scene videos. Vicol et al (2018) construct graphs to understand the relationships and interactions between people in movies.
Our study also partially overlaps with the field of social signal processing (Vinciarelli et al, 2012), which aims to understand social signals and social behaviors using multiple sensors. Such works include interaction detection, role recognition, influence ranking, personality recognition, and dominance detection in group meetings (Gan et al, 2013; Hung et al, 2007; Rienks et al, 2006; Salamin et al, 2009; Alameda-Pineda et al, 2016).
Very recently, Li et al (2017a) and Sun et al (2017) studied social relationship recognition in images. We (Li et al, 2017a) proposed a Dual-Glance model with Attentive R-CNN to exploit contextual cues, whereas Sun et al (2017) leverage semantic attributes learnt from other datasets as an intermediate representation to predict social relationships. Two datasets have been collected, namely the PISC dataset (Li et al, 2017a) and the SDR dataset (Sun et al, 2017) (see the detailed comparison in Section 5). In this paper, we extend our work (Li et al, 2017a) with the Adaptive Focal Loss, an improved Dual-Glance model, and additional experiments on both datasets.
2.2 Region-based Convolutional Neural Networks
The proposed Attentive R-CNN combines the Faster R-CNN (Ren et al, 2015) pipeline with an attention mechanism to extract information from multiple contextual regions. The Faster R-CNN pipeline has been widely exploited by many researchers. Gkioxari et al (2015) propose R*CNN, which makes use of a secondary region in an image for action recognition. Johnson et al (2016) study dense image captioning that focuses on regions. Li et al (2017b) adopt the Faster R-CNN pipeline as the base framework to study the joint task of object detection, scene graph generation and region captioning.
Attention models have recently been proposed and applied to image captioning (Xu et al, 2015; You et al, 2016), visual question answering (Yang et al, 2016) and fine-grained classification (Xiao et al, 2015). In this work, we employ an attention mechanism on the contextual regions, so that each person pair can selectively focus on its informative regions to better exploit contextual cues. Our Attentive R-CNN can also be viewed as a soft Multiple-Instance Learning (MIL) approach (Maron and Lozano-Pérez, 1997), where the model receives bags of instances (contextual regions) and bag-level labels (relationship class), and learns to discover informative instances for correct prediction.
2.3 Focal Loss
The proposed Adaptive Focal Loss is inspired by the Focal Loss (Lin et al, 2017) for object detection. Focal loss is designed to address the imbalance in samples between foreground and background classes during training, where a modulating factor is introduced to down-weight the easy examples. Our Adaptive Focal Loss not only addresses class imbalance, but more importantly, takes into account the uncertainty in visually identifying social relationship labels.
3 Proposed Dual-Glance Model
Given an image $I$ and a target person pair highlighted by bounding boxes $\{b_1, b_2\}$, our goal is to infer their social relationship $r$. In this work, we propose a Dual-Glance relationship recognition model, where the first glance module fixates at $b_1$ and $b_2$, and the second glance module explores contextual cues from multiple region proposals $\mathcal{P}_I$. The final score over possible relationships, $S$, is computed via
$$S = S_1 + w \otimes S_2, \tag{1}$$
where $S_1$ and $S_2$ are the scores from the first and second glance modules, $w$ is a weight vector, and $\otimes$ is the element-wise multiplication of two vectors. We use softmax to transform the final score into a probability distribution. Specifically, the probability that a given pair of people has relationship $r$ is calculated as
$$p_r = \frac{\exp(S_r)}{\sum_{r'} \exp(S_{r'})}. \tag{2}$$
An overview of the proposed Dual-Glance model is shown in Fig. 3.
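The score fusion and softmax step described above can be illustrated with a small sketch. It assumes the final score is the first-glance score plus an element-wise weighted second-glance score; the function names and the example values are hypothetical and chosen only for illustration.

```python
import math

def fuse_scores(s1, s2, w):
    """Final score: first-glance score plus element-wise weighted second-glance score."""
    return [a + wi * b for a, b, wi in zip(s1, s2, w)]

def softmax(scores):
    """Numerically stable softmax over a list of per-class scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical scores over three relationship classes.
s1 = [2.0, 0.5, -1.0]   # first-glance scores
s2 = [0.0, 1.5, 0.5]    # second-glance scores
w = [1.0, 0.5, 0.0]     # per-class fusion weights (learned in the real model)
probs = softmax(fuse_scores(s1, s2, w))
```

A per-class weight vector lets the model trust contextual cues more for some relationships than for others, which a single scalar weight cannot express.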
3.1 First Glance Module
The first glance module takes as input the image $I$ and the two human bounding boxes. First, we crop three patches from $I$ and refer to them as $p_1$, $p_2$, and $p_\cup$. $p_1$ and $p_2$ each contain one person, and $p_\cup$ contains the union region that tightly covers both people. The three patches are resized to a common fixed size and fed into three CNNs, where the CNNs that process $p_1$ and $p_2$ share the same weights. The outputs from the last convolutional layer of the CNNs are flattened and concatenated.
We denote the geometry feature of the human bounding box $b_i$ as $b_i^{pos}$, a vector of bounding-box position and size parameters, where all the parameters are relative values, normalized to zero mean and unit variance. $b_1^{pos}$ and $b_2^{pos}$ are concatenated and processed by a fully-connected (fc) layer. We concatenate its output with the CNN features for $p_1$, $p_2$ and $p_\cup$ to form a single feature vector, which is subsequently passed through another two fc layers to produce the first glance score, $S_1$. We use $h \in \mathbb{R}^k$ to denote the output from the penultimate fc layer. $h$ serves as a top-down signal to guide the attention mechanism in the second glance module. We set $k = 4096$, the same dimension as the regional features in Attentive R-CNN.
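As a rough illustration of a relative, standardized geometry feature, the sketch below assumes the feature holds the box corners and area relative to the image size. This component list is an assumption made for the example, as are the function names; only the normalization to zero mean and unit variance is stated in the text.

```python
import math

def relative_geometry(box, image_w, image_h):
    """Box corners and area expressed relative to the image size (assumed layout)."""
    x_min, y_min, x_max, y_max = box
    rx0, ry0 = x_min / image_w, y_min / image_h
    rx1, ry1 = x_max / image_w, y_max / image_h
    return [rx0, ry0, rx1, ry1, (rx1 - rx0) * (ry1 - ry0)]

def standardize(features):
    """Normalize each dimension to zero mean and unit variance across samples."""
    n, dims = len(features), len(features[0])
    means = [sum(f[d] for f in features) / n for d in range(dims)]
    # A constant dimension has zero variance; fall back to 1.0 to avoid dividing by zero.
    stds = [math.sqrt(sum((f[d] - means[d]) ** 2 for f in features) / n) or 1.0
            for d in range(dims)]
    return [[(f[d] - means[d]) / stds[d] for d in range(dims)] for f in features]

feats = [relative_geometry((0, 0, 50, 100), 100, 100),
         relative_geometry((50, 0, 100, 100), 100, 100)]
z = standardize(feats)
```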
3.2 Attentive R-CNN for Second Glance Module
For the second glance module, we adapt Faster R-CNN (Ren et al, 2015) to make use of multiple contextual regions. Faster R-CNN processes the input image $I$ with a Region Proposal Network (RPN) to generate a set of region proposals $\mathcal{P}_I$ with high objectness. For each person pair with bounding boxes $b_1$ and $b_2$, we select the set of contextual regions $\mathcal{R}(b_1, b_2; I)$ from $\mathcal{P}_I$ as
$$\mathcal{R}(b_1, b_2; I) = \{c \in \mathcal{P}_I : \max(G(c, b_1), G(c, b_2)) < \tau_u\}, \tag{3}$$
where $G(\cdot, \cdot)$ computes the Intersection-over-Union (IoU) between two regions, and $\tau_u$ is the upper threshold for IoU overlap. The threshold encourages the second glance module to explore cues different from those of the first glance module.
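The region selection step can be sketched as follows, assuming a proposal is kept only when its IoU with each of the two person boxes stays below the upper threshold. The box format, the function names and the threshold value of 0.7 are illustrative assumptions, not values taken from the text.

```python
def iou(a, b):
    """Intersection-over-Union of two boxes given as (x_min, y_min, x_max, y_max)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def select_contextual_regions(proposals, b1, b2, tau_u):
    """Keep proposals that do not overlap too heavily with either person box."""
    return [c for c in proposals if max(iou(c, b1), iou(c, b2)) < tau_u]

b1, b2 = (0, 0, 10, 10), (20, 0, 30, 10)
proposals = [(0, 0, 10, 10), (0, 0, 10, 9), (5, 0, 15, 10), (40, 40, 50, 50)]
regions = select_contextual_regions(proposals, b1, b2, tau_u=0.7)  # 0.7 is hypothetical
```

In this toy example, the first two proposals nearly coincide with a person box and are dropped, while the partially overlapping and the distant proposals are kept as context.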
We then process $I$ with a CNN to generate a convolutional feature map conv($I$). For each contextual region $c$, ROI pooling is applied to extract a fixed-length feature vector from conv($I$), which is then processed by an fc layer to generate the regional feature $v \in \mathbb{R}^k$. We denote $V \in \mathbb{R}^{N \times k}$ as the bag of $N$ regional feature vectors for the selected contextual regions. Each regional feature $v_i$ is then fed to another fc layer to generate a score for the $i$th region proposal:
$$s_i = W_s v_i + b_s. \tag{4}$$
Not all contextual regions are informative for the target person pair’s relationship. Therefore, we assign different attention to the region scores so that more informative regions contribute more to the final prediction. In order to compute the attention, we first take each local regional feature $v_i$, and combine it with the top-down feature $h$ from the first glance module (which contains semantic information of the person pair) into a vector $h_i$ via
$$h_i = \mathrm{ReLU}(v_i \otimes h), \tag{5}$$
where $h_i \in \mathbb{R}^k$, and $\otimes$ is the element-wise multiplication. Then, we calculate the attention $a_i$ over the $i$th regional score with the sigmoid function:
$$a_i = \frac{1}{1 + \exp(-(W_a h_i + b_a))}, \tag{6}$$
where $W_a \in \mathbb{R}^{1 \times k}$ is the weight matrix, and $b_a \in \mathbb{R}$ is the bias term.
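Given the attention $a_i$ and the score $s_i$ of each of the $N$ contextual regions, the output score of the second glance module is computed as a weighted average of all regional scores:
$$S_2 = \frac{1}{N} \sum_{i=1}^{N} a_i s_i. \tag{7}$$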
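The attention-weighted aggregation can be sketched in a few lines. The sketch assumes the attention is a sigmoid over a linear projection of the ReLU-activated element-wise product of the regional and top-down features, and that the weighted scores are averaged over the number of regions. All names and toy values are hypothetical.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def attention(v, h, w_a, b_a):
    """Attention for one region from its feature v and the top-down feature h."""
    combined = [max(0.0, vi * hi) for vi, hi in zip(v, h)]  # ReLU(v * h)
    return sigmoid(sum(w * x for w, x in zip(w_a, combined)) + b_a)

def second_glance_score(region_feats, region_scores, h, w_a, b_a):
    """Average of per-region class scores, each weighted by its attention."""
    n = len(region_feats)
    out = [0.0] * len(region_scores[0])
    for v, s in zip(region_feats, region_scores):
        a = attention(v, h, w_a, b_a)
        for r, score in enumerate(s):
            out[r] += a * score / n
    return out

# Two toy regions: the first matches the top-down signal, the second does not.
feats = [[1.0, 0.0], [0.0, 0.0]]
scores = [[1.0, 0.0], [0.0, 1.0]]
s2 = second_glance_score(feats, scores, h=[1.0, 1.0], w_a=[10.0, 10.0], b_a=-5.0)
```

Because the sigmoid is applied per region rather than a softmax across regions, several regions can receive high attention at once, or none at all.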
Note that the Dual-Glance model described above has several differences compared with our previously proposed model (Li et al, 2017a): (i) We add a new fc6 layer in the Attentive R-CNN model to increase the depth of the network. (ii) We add a ReLU non-linearity when combining the regional feature with the top-down feature, which introduces a sparse representation that is more robust. (iii) We modify (1) to use element-wise weighting instead of a scalar weight, so that the network can learn to better fuse the scores. Each modification individually improves performance, and together they lead to a +0.7% improvement in mAP for relationship recognition while other settings remain the same as in Li et al (2017a).
4 Adaptive Focal Loss
Given a target person pair, our proposed Dual-Glance model outputs a probability distribution $p$ over the relationships. In order to train the model to predict a higher probability $p_t$ for the ground truth target relationship $t$, the standard loss function adopted by Li et al (2017a) and Sun et al (2017) is the cross entropy (CE) loss defined as
$$L_{CE} = -\log(p_t). \tag{8}$$
In the task of social relationship recognition, there often exists class imbalance in the training data. The classes with more samples can overwhelm the loss and lead to degenerate models. Previous work addresses this with a heuristic sampling strategy to maintain a manageable balance during training (Li et al, 2017a). Recently, in the field of object detection, focal loss (FL) has been proposed (Lin et al, 2017), where a modulating factor is added to the cross entropy loss:
$$L_{FL} = -(1 - p_t)^{\gamma} \log(p_t). \tag{9}$$
The modulating factor $(1 - p_t)^{\gamma}$ down-weights the loss contribution from the vast number of well-classified examples, and focuses on the fewer hard examples, where the focusing parameter $\gamma$ adjusts the rate at which easy examples are down-weighted.
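To make the down-weighting concrete, the sketch below compares the cross entropy loss with the focal loss of Lin et al (2017) for an easy and a hard example. The function names are illustrative.

```python
import math

def cross_entropy(p_t):
    """Cross entropy loss given the predicted probability of the true class."""
    return -math.log(p_t)

def focal_loss(p_t, gamma=2.0):
    """Focal loss: cross entropy scaled by the modulating factor (1 - p_t)^gamma."""
    return -((1.0 - p_t) ** gamma) * math.log(p_t)

# Ratio of focal loss to cross entropy for an easy and a hard example.
easy_ratio = focal_loss(0.9) / cross_entropy(0.9)   # modulating factor (0.1)^2
hard_ratio = focal_loss(0.1) / cross_entropy(0.1)   # modulating factor (0.9)^2
```

With a focusing parameter of 2, a well-classified example with probability 0.9 keeps about 1% of its cross entropy loss, while a hard example with probability 0.1 keeps about 81%.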
In a wide range of visual classification tasks (e.g. image classification (Russakovsky et al, 2015), object detection (Lin et al, 2014), visual relationship recognition (Krishna et al, 2017), etc.), the common approach to determine the ground truth class of a sample is to take the majority vote from human annotations. While this approach has been effective, we argue that social relationship recognition is different from other tasks. The annotation of social relationship has a higher level of uncertainty (as suggested by the agreement rate in Section 5.2), and the minority annotations are not necessarily wrong (as shown in Fig. 2). Therefore, taking the majority vote and ignoring other annotations has the potential disadvantage of neglecting useful information.
In this work, we propose an Adaptive Focal Loss that takes into account the ambiguity in social relationship labels. For each sample, instead of using the hard label from majority voting, we transform the annotations into a soft label $y$, which is a distribution calculated by dividing the number of annotations for each relation by the total number of annotations for that sample. Then we define the adaptive FL as
$$L_{AFL} = -\sum_{r} \max(y_r - p_r, 0)^{\gamma} \log(p_r). \tag{10}$$
The adaptive FL inherits the ability to down-weight easy examples from the FL, and extends the FL with two properties to address label ambiguity: (i) Instead of considering only the single target class, the adaptive FL takes the sum of losses from all classes, so that all annotations can contribute to training. (ii) The modulating factor is adaptively adjusted for each class based on the ground truth label distribution. The loss still demands that the model predict a high probability for the predominant class, but the constraint is relaxed if not all annotations agree. For example, if 4 out of the 5 annotators agree on friends as the label, the adaptive FL term for $r$ = friends will decrease to 0 if the output $p_r = 0.8$, hence it will push $p_r$ to 0.8 instead of 1. Note that if the ground truth annotations all agree on the same class $t$, then $y_t = 1$ and $y_r = 0$ for $r \neq t$, and the adaptive FL is the same as the FL.
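The soft label and the two properties above can be checked with a small sketch. It assumes the per-class modulating factor takes the form max(y_r - p_r, 0)^gamma, which is one form consistent with the behavior described in the text; the function names and class list are hypothetical.

```python
import math
from collections import Counter

def soft_label(annotations, classes):
    """Fraction of annotators choosing each class."""
    counts = Counter(annotations)
    return [counts[c] / len(annotations) for c in classes]

def adaptive_focal_loss(y, p, gamma=1.0):
    """Sum over classes of max(y_r - p_r, 0)^gamma * -log(p_r) (assumed form)."""
    loss = 0.0
    for y_r, p_r in zip(y, p):
        weight = max(y_r - p_r, 0.0) ** gamma
        if weight > 0.0:  # classes already at or above their target contribute nothing
            loss -= weight * math.log(max(p_r, 1e-12))
    return loss

classes = ["friends", "family", "couple"]
y = soft_label(["friends"] * 4 + ["family"], classes)        # 4 of 5 annotators agree
loss_at_target = adaptive_focal_loss(y, [0.8, 0.2, 0.0])     # prediction matches y
loss_off_target = adaptive_focal_loss(y, [0.5, 0.3, 0.2])    # friends under-predicted
one_hot = adaptive_focal_loss([1.0, 0.0, 0.0], [0.6, 0.3, 0.1], gamma=2.0)
```

When the prediction matches the annotation distribution the loss vanishes, and with a one-hot label the loss reduces to the standard focal loss on the target class.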
The same philosophy of learning from ambiguous label distributions has also been studied by Gao et al (2017), where they use the Kullback-Leibler (KL) divergence loss defined as
$$L_{KL} = \sum_{r} y_r \log\frac{y_r}{p_r} = H(y, p) - H(y), \tag{11}$$
where $H(y, p) = -\sum_{r} y_r \log(p_r)$ is the cross entropy between the output distribution and the label distribution, and $H(y) = -\sum_{r} y_r \log(y_r)$ is the entropy of the label distribution. Since $H(y)$ is independent of the parameters of the model, minimizing $L_{KL}$ is equivalent to minimizing $H(y, p)$.
The difference between the KL divergence and the proposed adaptive focal loss is the per-class modulating factor. While the KL divergence uses the ground truth label distribution $y$ to modulate the per-class loss, the adaptive focal loss uses both $y$ and the model’s output $p$ to determine the modulation, thereby down-weighting the easy examples and focusing training on the hard examples.
In practice, similar to Lin et al (2017), we use an $\alpha$-balanced variant of the adaptive FL defined as
$$L_{AFL} = -\sum_{r} \alpha_r \max(y_r - p_r, 0)^{\gamma} \log(p_r). \tag{12}$$
The per-class weight $\alpha_r$ is determined by inverse class frequency: it is computed from $n_r$, the total number of annotations for relationship $r$, with a smoothing factor $\beta$ set to 0.5. We find that the $\alpha$-balanced adaptive focal loss yields slightly better performance than the non-$\alpha$-balanced form.
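The exact weighting formula is not reproduced here, so the following is only one plausible smoothed inverse-frequency scheme: each class weight is proportional to the inverse annotation count raised to the smoothing power, then normalized to sum to one. Both the functional form and the normalization are assumptions made for illustration.

```python
def class_weights(annotation_counts, beta=0.5):
    """Smoothed inverse-frequency weights (assumed form), normalized to sum to 1."""
    raw = [(1.0 / n) ** beta for n in annotation_counts]
    total = sum(raw)
    return [r / total for r in raw]

# Hypothetical counts: the second class has four times as many annotations.
weights = class_weights([100, 400])
```

With a smoothing power of 0.5, a class that is four times rarer receives only twice the weight, which softens the correction compared with plain inverse frequency.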
5 People in Social Context Dataset
The People in Social Context (PISC) dataset is an image dataset that focuses on social relationship study (see example images in Fig. 4). In this section, we first describe the data curation pipeline. Then we analyze the dataset statistics and provide comparison with another dataset for social relationship study, following the presentation style by Goyal et al (2017); Agrawal et al (2018).
5.1 Curation Pipeline
The PISC dataset was curated through a pipeline of three stages. In the first stage, we collected around 40k images that contain people from a variety of sources, including Visual Genome (Krishna et al, 2017), MSCOCO (Lin et al, 2014), YFCC100M (Thomee et al, 2016), Flickr, Instagram, Twitter and commercial search engines (i.e. Google and Bing). We used a combination of keyword search (e.g. co-worker, people, friends, etc.) and a people detector (Faster R-CNN (Ren et al, 2015)) to collect the images. The collected images have high variation in image resolution, people’s appearance, and scene type.
In the second and third stages, we hired workers from the CrowdFlower platform to perform the labor-intensive manual annotation tasks. The second stage focused on the annotation of person bounding boxes in each image. Following Krishna et al (2017), each bounding box is required to strictly satisfy the coverage and quality requirements. To speed up the annotation process, we first deployed Faster R-CNN (Ren et al, 2015) to detect people in all images, and then asked the annotators to re-annotate the bounding boxes if the computer-generated bounding boxes were inaccurately localized. Overall, 40% of the computer-generated boxes were accepted without re-annotation. For images collected from MSCOCO and Visual Genome, we directly used the provided ground-truth bounding boxes.
Once the bounding boxes of all images had been annotated, we selected images consisting of at least two people, and avoided images that contain crowds of people where individuals cannot be distinguished. In the final stage, we requested the annotators to identify the occupation of all individuals in the image, as well as the social relationships of all person pairs. To ensure consistency in the occupation categories, the annotation is based on a list of reference occupation categories. The annotators could manually add a new occupation category if it was not in the list.
For social relationships, we formulate the annotation task as multi-level multiple choice questions based on the hierarchical structure in Fig. 1. We provide example images to help annotators understand different relationship classes. We also provide instructions to help annotators distinguish between professional relationships (the people are related based on their professions, e.g. co-worker, coach and player, boss and staff, etc.) and commercial relationships (one person is paying money to receive goods/service from the other, e.g. salesman and customer, tour guide and tourist, etc.). Annotators can choose the option ‘not sure’ at any level if they cannot confidently identify the relationship. Each image was annotated by at least five workers. Overall, 7928 unique workers contributed to the annotation.
Table 1: Comparison of the PISC and SDR datasets.
|Dataset||PISC||SDR (Sun et al, 2017)|
|Image source||Wide variety (see Section 5.1)||Flickr photo albums|
|Number of images||23,311||8,570|
|Number of person pairs||79,244||26,915|
|Person’s identity||Different images, different people||Multiple images, same person|
|Person’s bounding box||Full-body||Head only|
5.2 Dataset Statistics
In total, the PISC dataset consists of 23,311 images with 79,244 pairs of people. For each person pair, if there exists a relationship class on which at least 60% of the annotators agree, we refer to it as a ‘consistent’ example and assign the majority vote as its class label. Otherwise, we refer to it as an ‘ambiguous’ example. The top part of Fig. 5 shows the distribution of each type of relationship. We further calculate the agreement rate on the consistent set by dividing the number of agreed human annotations by the total number of annotations. As shown in the bottom part of Fig. 5, the agreement rate reflects how visually distinguishable a social relationship class is. The rate ranges from 74.1% to 92.6%, which indicates that social relationship recognition has a certain degree of ambiguity, but is a visually solvable problem nonetheless.
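The consistent/ambiguous split can be sketched as a simple majority check against the 60% agreement threshold. How ‘not sure’ answers enter the count is not specified, so the sketch simply treats every annotation as a vote; the function name is hypothetical.

```python
from collections import Counter

def label_pair(annotations, threshold=0.6):
    """Return ('consistent', majority_class) or ('ambiguous', None) for one person pair."""
    top_class, top_count = Counter(annotations).most_common(1)[0]
    if top_count / len(annotations) >= threshold:
        return "consistent", top_class
    return "ambiguous", None

consistent = label_pair(["friends", "friends", "friends", "family", "family"])
ambiguous = label_pair(["friends", "friends", "family", "family", "couple"])
```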
For occupations, 10,034 images contain people that have recognizable occupations. In total, there are 66 identified occupation categories. The occupation occurrence and the agreement rate for the 26 most frequent occupation categories are shown in Fig. 6. Since two source datasets, i.e. MSCOCO and Visual Genome, are highly biased towards ‘baseball player’ and ‘skier’, we limit the total number of instances per occupation to 2000 based on agreement rate ranking to ensure there is no bias towards any particular occupation.
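One way to read the per-occupation cap is to keep, for each occupation, only the instances with the highest agreement rate up to the limit. The sketch below follows that interpretation, which is an assumption; the tuple layout and function name are hypothetical.

```python
def cap_per_occupation(instances, limit=2000):
    """Keep at most `limit` instances per occupation, preferring high agreement.

    Each instance is a tuple (instance_id, occupation, agreement_rate).
    """
    by_occupation = {}
    for inst in instances:
        by_occupation.setdefault(inst[1], []).append(inst)
    kept = []
    for group in by_occupation.values():
        group.sort(key=lambda inst: inst[2], reverse=True)
        kept.extend(group[:limit])
    return kept

toy = [(0, "skier", 0.6), (1, "skier", 1.0), (2, "skier", 0.8), (3, "chef", 0.7)]
kept_ids = sorted(inst[0] for inst in cap_per_occupation(toy, limit=2))
```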
5.3 Comparison with SDR Dataset
The Social Domain and Relation (SDR) dataset (Sun et al, 2017) is a subset of the PIPA dataset (Zhang et al, 2015a) with social relation annotation. Table 1 provides the details of both datasets. In comparison, our PISC dataset has multiple advantages. First and foremost, the PISC dataset contains more images and more person pairs. Second, the images in the SDR dataset all come from Flickr photo albums, while our images are collected from a wide variety of sources. Therefore, the images in the PISC dataset are more diverse. Third, since the images in the SDR dataset were originally collected for the task of people identification (Zhang et al, 2015a), the same person appears in multiple images, which further reduces the diversity of the data. Last but not least, our PISC dataset provides full-body person bounding box annotation, while the SDR dataset provides the head bounding box and uses that to approximate the body bounding box.
6 Experiments
In this section, we perform experiments and ablation studies to demonstrate the efficacy of the proposed method on both the PISC and SDR datasets. We first describe the dataset and training details, followed by the experiment details and discussion.
6.1 Dataset Details
PISC. On the collected PISC dataset, we perform two tasks, namely domain recognition (i.e. Intimate and Non-Intimate) and relationship recognition (i.e. Friends, Family, Couple, Professional and Commercial). We refer to each person pair as one sample. For domain recognition, we randomly select 4000 images (15,497 samples) as the test set, 4000 images (14,536 samples) as the validation set and use the remaining images (49,017 samples) as the training set. For relationship recognition, since there exists class imbalance in the data, we sampled the test and validation splits to have balanced classes. To do that, we select 1250 images (250 per relation) with 3961 samples as the test set and 500 images (100 per relation) with 1505 samples as the validation set. The remaining images (55,400 samples) are used as the training set.
All the samples used above are selected only from the consistent samples, where each relationship label is agreed on by a majority of annotators. For the relationship recognition task, we enrich the consistent training set with ambiguous samples to create an ambiguous training set. It contains a total of 58,885 samples, or 3445 samples more than the consistent training set.
SDR. The SDR dataset is annotated with 5 domains and 16 relationships (Sun et al, 2017). However, the class imbalance is severe for the relationship classes. 7 out of the 16 classes have no more than 40 unique individuals. In the test set, 4 classes have fewer than 20 samples (person pairs). In the validation set, 6 classes have no more than 5 samples. We tried to re-partition the dataset, but the fact that the same person appears across multiple images makes it very difficult to form test and validation sets with reasonable class balance. Therefore, we only perform the domain recognition task, where the imbalance is less severe. The 5 domains include Attachment, Reciprocity, Mating, Hierarchical power and Coalitional groups. Note that the samples in the SDR dataset are all consistent samples.
|Union (Lu et al, 2016)||75.2||81.5||75.3||49.3||42.7||52.5||45.0||70.2||49.4|
|Pair (Sun et al, 2017)||76.9||82.1||76.5||54.9||58.1||58.5||47.3||72.7||52.3|
|All Attributes (Sun et al, 2017)||78.1||82.9||77.6||57.5||46.5||59.7||63.2||80.1||55.0|
|Dual-Glance + Occupation||85.8||85.8||83.5||65.9||60.1||63.6||55.2||87.9||61.1|
|Dual-Glance + All Attributes||85.5||85.2||83.6||65.4||58.9||67.8||59.4||81.5||57.7|
6.2 Training Details
In the following experiments, we experiment with both Focal Loss using hard labels as supervision and the proposed Adaptive Focal Loss using soft label distributions as supervision. We set the focusing parameter $\gamma$ to 2 in focal loss and 1 in adaptive focal loss, which yield the best performance respectively. Unless otherwise specified, Section 6.3 uses focal loss on the consistent training set, Section 6.4 experiments with various loss functions, and Sections 6.5-6.7 use adaptive focal loss on the ambiguous training set.
We employ pre-trained CNN models to initialize our Dual-Glance model. For the first glance, we fine-tune the ResNet-101 model (He et al, 2016). For the second glance, we fine-tune the Faster R-CNN model with VGG-16 as the backbone (Ren et al, 2015). We employ two-stage training, where we first train the first-glance model until the loss converges, then freeze the first-glance model and train the second-glance model. We train our model with Stochastic Gradient Descent with momentum and backpropagation, using a learning rate of 0.01 and a batch size of 32. During training, we use two data augmentation techniques: (1) horizontally flipping the image, and (2) reversing the input order of a person pair (i.e. if $p_1$ and $p_2$ are a couple, then $p_2$ and $p_1$ are also a couple).
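The two augmentations can be sketched at the level of bounding boxes and labels. The sample layout and function names are hypothetical, and the sketch only transforms the box coordinates; flipping the image pixels themselves is left to the image library in use.

```python
def hflip_box(box, image_width):
    """Mirror a (x_min, y_min, x_max, y_max) box around the vertical image axis."""
    x_min, y_min, x_max, y_max = box
    return (image_width - x_max, y_min, image_width - x_min, y_max)

def augment(sample, image_width):
    """Return the original sample plus flipped and order-reversed variants.

    A sample is (box_1, box_2, label); the label is kept unchanged in all variants.
    """
    b1, b2, label = sample
    variants = []
    for flip in (False, True):
        a = hflip_box(b1, image_width) if flip else b1
        b = hflip_box(b2, image_width) if flip else b2
        variants.append((a, b, label))   # original pair order
        variants.append((b, a, label))   # reversed pair order
    return variants

samples = augment(((10, 20, 30, 40), (50, 20, 70, 60), "couple"), image_width=100)
```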
6.3 Baselines vs. Dual-Glance
We evaluate multiple baselines and compare them to the proposed Dual-Glance model to show its efficacy. Formally, the compared methods are as follows:
Union: Following the predicate prediction model by Lu et al (2016), we use a CNN model that takes the union region of the person pair as input, and outputs their relationship.
Location: We only use the geometry feature of the two individuals’ bounding boxes to infer their relationship.
Pair+Loc.: We extend Pair by using the geometry feature of the two bounding boxes.
Pair+Loc.+Union: First-Glance model illustrated in Fig. 3, which combines Pair+Loc. with Union.
Pair+Loc.+Global: Model structure is the same as first-glance, except that we replace the union region with the entire image as global input.
R-CNN: We train an R-CNN using the region proposals in (3), and use average pooling to combine the regional scores.
All Attributes (Sun et al, 2017): We follow the method by Sun et al (2017) and extract 9 semantic attributes (age, gender, location&scale, head appearance, head pose, face emotion, clothing, proximity, activity) using models pre-trained on multiple annotated datasets. Then a linear SVM is used for classification. The SVM is calibrated to produce probabilities for calculating mAP. For attributes that require head bounding boxes (e.g. age, head pose, face emotion, etc.), we use a pre-trained head detector to find the head bounding box within each person’s ground-truth body bounding box.
Dual-Glance: Our proposed model (Section 3).
Dual-Glance+Occupation: We first train a CNN for occupation recognition using the collected occupation labels. Then during social relationship training, we concatenate the occupation score (from the last layer of the trained CNN) for each person with the human-centric feature as the new human-centric feature for the first glance.
Dual-Glance+All Attributes: We fuse the score from the All Attributes baseline with the score from the Dual-Glance model for the final prediction.
Table 2 shows the results for both the domain recognition task and the relationship recognition task on the PISC dataset. We can make several observations from the results. First, Pair+Loc. outperforms Pair, which suggests that people’s geometric location in an image contains information useful for inferring their social relationship. This is supported by the law of proxemics (Hall, 1959), which says that people’s interpersonal distance reflects their relationship. However, location information alone cannot be used to predict relationships, as shown by the results of Location. Second, adding Union to Pair+Loc. improves performance. The performance gain is smaller if we use the global context (entire image) rather than the union region. Third, using contextual regions is effective for relationship recognition. R-CNN achieves comparable performance to the first-glance model by using only contextual regions. The proposed Dual-Glance model outperforms the first-glance model by a significant margin on both domain recognition and relationship recognition.
Visual attributes also provide useful mid-level information for social relationship recognition. Combining All Attributes with Dual-Glance slightly improves performance, while Dual-Glance+Occupation achieves the best performance among all methods. However, All Attributes by itself cannot outperform the proposed first-glance method. The reason is unreliable attribute detection, caused either by the frequently occluded heads and faces in the PISC dataset or by the domain shift from the source datasets (where the attribute detectors are trained) to the target dataset (where the attribute detectors are applied, i.e. PISC).
|End-to-end Finetuned (Sun et al, 2017)||59.0|
|All Attributes (Sun et al, 2017)||67.8|
|Loss Function||Training Set||Training Supervision||First-Glance||Dual-Glance|
|Cross Entropy||Consistent||Single label||57.4||63.9|
|Focal Loss*||Consistent||Single label||58.7||65.2|
|KL divergence||Consistent||Soft label||58.7||65.1|
|Adaptive Focal Loss||Consistent||Soft label||59.7||66.4|
|KL divergence||Ambiguous||Soft label||59.1||65.8|
|Adaptive Focal Loss||Ambiguous||Soft label||61.2||68.3|
Fig. 7 shows some intuitive illustrations where the dual-glance model correctly classifies relationships that are misclassified by the first-glance model.
Fig. 8 shows the confusion matrix of relationship recognition with the proposed Dual-Glance model, where we include no relation (NOR) as the 6th class. The model tends to confuse the intimate relationships, in particular misclassifying family and couple as friends.
Table 3 shows the result of the domain recognition task on the SDR dataset. End-to-end Finetuned (Sun et al, 2017) is a double-stream CNN model that uses the person pair as input, similar to our Pair except for weight sharing. All Attributes is the best-performing method by Sun et al (2017), where a set of pre-trained models from other datasets are used to extract semantic attribute representations (e.g. age, gender, activity), and a linear SVM is trained to classify the relation using the semantic attributes as input. Compared with the results from Sun et al (2017), both our First-Glance and Dual-Glance yield better performance. While First-Glance slightly outperforms All Attributes, Dual-Glance achieves a larger improvement by utilizing contextual regions.
6.4 Efficacy of Adaptive Focal Loss
In this section, we conduct relationship recognition experiments on the PISC dataset using various loss functions and two training sets. We experiment with cross entropy loss (8), focal loss (9), KL divergence loss (11) and the proposed adaptive focal loss (10) on both the consistent training set and the ambiguous training set (see Section 6.1 for dataset details). Note that we use the α-balanced version for all losses, where α is computed as in (13).
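The baseline losses compared in this section can be sketched in their textbook forms. The snippet below is a minimal illustration rather than the implementation used for the experiments. It assumes the standard focal loss of Lin et al (2017) with a focusing parameter `gamma`, per-class balancing weights passed in as `alpha` (their computation via (13) is not reproduced here), and a plain KL divergence between a soft label distribution and the prediction. The adaptive focal loss (10) is defined earlier in the paper and is deliberately not re-derived. All function names are hypothetical.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of class logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs: Sequence[float], target: int, alpha: Sequence[float]) -> float:
    """Class-balanced cross entropy for a single hard label."""
    return -alpha[target] * math.log(probs[target])


def focal_loss(
    probs: Sequence[float], target: int, alpha: Sequence[float], gamma: float = 2.0
) -> float:
    """Standard focal loss: down-weights examples the model already gets right.

    gamma=2.0 is only a placeholder default; gamma=0 recovers cross entropy.
    """
    p_t = probs[target]
    return -alpha[target] * (1.0 - p_t) ** gamma * math.log(p_t)


def kl_divergence(soft_label: Sequence[float], probs: Sequence[float]) -> float:
    """KL(soft_label || probs); terms with zero label mass contribute nothing."""
    return sum(q * math.log(q / p) for q, p in zip(soft_label, probs) if q > 0.0)


# Toy usage: three classes, uniform balancing weights (illustrative values only).
toy_probs = softmax([2.0, 0.5, -1.0])
toy_alpha = [1.0, 1.0, 1.0]
ce_value = cross_entropy(toy_probs, 0, toy_alpha)
fl_value = focal_loss(toy_probs, 0, toy_alpha)
```

With a one-hot soft label and uniform weights, the KL term reduces to cross entropy, which is why soft labels only matter when annotators disagree.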
Table 4 shows the result. There are several observations we can make. First, comparing cross entropy loss and focal loss, which both use a single target label as training supervision, focal loss yields better performance (+1.3% mAP). Second, adaptive focal loss achieves further improvement over focal loss. With the dual-glance model, the improvement is +1.2% in mAP if we use the same consistent training set. If we train on the ambiguous set, the improvement increases to +3.1%. Third, KL-divergence loss produces performance similar to focal loss on the consistent set, and a slight improvement on the ambiguous set. On both training sets, KL-divergence gives lower mAP than adaptive focal loss. Last but not least, compared with the cross entropy loss used by Li et al (2017a), the proposed adaptive focal loss with the ambiguous training set increases mAP by +4.4% using the Dual-Glance model. The results demonstrate that the minority social relationship annotations do contain useful information, and that the proposed adaptive focal loss can effectively exploit the ambiguous annotations for more accurate relationship recognition.
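The mAP differences discussed above follow directly from the Dual-Glance column of Table 4. The short sketch below recomputes them. The dictionary keys are hypothetical labels chosen for this illustration, while the values are copied from the table.

```python
# Dual-Glance mAP (%) from Table 4, keyed by (loss, training set).
dual_glance_map = {
    ("cross_entropy", "consistent"): 63.9,
    ("focal", "consistent"): 65.2,
    ("kl", "consistent"): 65.1,
    ("adaptive_focal", "consistent"): 66.4,
    ("kl", "ambiguous"): 65.8,
    ("adaptive_focal", "ambiguous"): 68.3,
}


def gain(better: tuple, baseline: tuple) -> float:
    """Absolute mAP difference in percentage points, rounded to one decimal."""
    return round(dual_glance_map[better] - dual_glance_map[baseline], 1)


focal_vs_ce = gain(("focal", "consistent"), ("cross_entropy", "consistent"))
adaptive_vs_focal = gain(("adaptive_focal", "consistent"), ("focal", "consistent"))
ambiguous_vs_focal = gain(("adaptive_focal", "ambiguous"), ("focal", "consistent"))
ambiguous_vs_ce = gain(("adaptive_focal", "ambiguous"), ("cross_entropy", "consistent"))
```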
6.5 Variations in Contextual Regions
In order to encourage the attentive R-CNN to explore contextual cues that are not used by First-Glance, we set a threshold in (3) to suppress regions that highly overlap with the person pair. Another influential factor in the attentive R-CNN is the number of region proposals from the RPN, which can be controlled by a threshold on the objectness score. In this section, we experiment with different combinations of these two thresholds, using the dual-glance model trained with adaptive focal loss on the PISC dataset. Fig. 9 shows the combination that produces the best performance on relationship recognition.
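The two thresholds described above can be pictured as a simple filter over region proposals. The following is a minimal sketch under stated assumptions: overlap is measured as IoU between a proposal and the union box of the person pair (the exact overlap measure in (3) is not restated in this section), and `overlap_thresh`, `objectness_thresh`, `Proposal` and `select_context_regions` are hypothetical names introduced only for illustration.

```python
from typing import List, NamedTuple, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


class Proposal(NamedTuple):
    box: Box
    objectness: float


def area(box: Box) -> float:
    """Area of a box; degenerate (inverted) boxes have zero area."""
    return max(0.0, box[2] - box[0]) * max(0.0, box[3] - box[1])


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter = area((max(a[0], b[0]), max(a[1], b[1]), min(a[2], b[2]), min(a[3], b[3])))
    union = area(a) + area(b) - inter
    return inter / union if union > 0.0 else 0.0


def union_box(a: Box, b: Box) -> Box:
    """Smallest box enclosing both person boxes."""
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))


def select_context_regions(
    proposals: List[Proposal],
    person_a: Box,
    person_b: Box,
    overlap_thresh: float,
    objectness_thresh: float,
) -> List[Proposal]:
    """Keep confident proposals that do not largely coincide with the person pair."""
    pair_box = union_box(person_a, person_b)
    return [
        p
        for p in proposals
        if p.objectness >= objectness_thresh and iou(p.box, pair_box) < overlap_thresh
    ]


# Toy usage with placeholder thresholds (not the values selected in the paper).
toy_proposals = [
    Proposal((0.0, 0.0, 20.0, 10.0), 0.9),    # coincides with the pair: suppressed
    Proposal((30.0, 0.0, 40.0, 10.0), 0.8),   # confident and distinct: kept
    Proposal((50.0, 50.0, 60.0, 60.0), 0.1),  # low objectness: dropped
]
kept = select_context_regions(
    toy_proposals, (0.0, 0.0, 10.0, 10.0), (10.0, 0.0, 20.0, 10.0), 0.7, 0.5
)
```

Lowering the objectness threshold admits more proposals, so the two parameters trade off context coverage against noise.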
|Method||Ground Truth||Faster R-CNN|
6.6 Ground Truth vs. Automatic People Detection
In the previous sections, we studied the proposed method using ground-truth annotations of each person’s bounding box. In other words, we assumed access to a person detector that works as well as a human. In this section, we test the robustness of our proposed method with an automatic person detector. We employ a Faster R-CNN (Ren et al, 2015) person detector pre-trained on the MSCOCO dataset. Following Ren et al (2015), for each person in the test set, we treat all output boxes that meet the IoU overlap criterion with the ground-truth box as positives, and apply greedy non-maximum suppression to select the highest-scoring box as the final prediction. In total, 3171 out of 3961 person pairs are detected, and the average IoU overlap between the detection boxes and the ground truth is 79.7%.
Table 5 shows the relationship recognition result. Using an automatic person detector leads to a decrease in mAP for the first-glance model. The decrease is smaller for the dual-glance model, because the attentive R-CNN is less affected by the person bounding boxes. The relatively small performance decrease indicates that our proposed model is robust to person detection noise and can be applied in a fully automatic setting.
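The matching between detector outputs and annotated persons can be sketched as follows. This is an illustrative simplification, not the evaluation code: it assumes that "positives" are detections whose IoU with the ground-truth box reaches a threshold `iou_thresh` (the value used in the experiments is not given in this section, so the 0.5 below is a placeholder), and it keeps the highest-scoring positive. The function names are hypothetical.

```python
from typing import List, Optional, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)
Detection = Tuple[Box, float]            # (box, confidence score)


def area(box: Box) -> float:
    """Area of a box; degenerate (inverted) boxes have zero area."""
    return max(0.0, box[2] - box[0]) * max(0.0, box[3] - box[1])


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter = area((max(a[0], b[0]), max(a[1], b[1]), min(a[2], b[2]), min(a[3], b[3])))
    union = area(a) + area(b) - inter
    return inter / union if union > 0.0 else 0.0


def match_detection(
    gt_box: Box, detections: List[Detection], iou_thresh: float
) -> Optional[Box]:
    """Return the highest-scoring detection overlapping the ground truth, if any."""
    positives = [(score, box) for box, score in detections if iou(box, gt_box) >= iou_thresh]
    if not positives:
        return None  # the person counts as undetected
    return max(positives, key=lambda item: item[0])[1]


def pair_detected(
    gt_a: Box, gt_b: Box, detections: List[Detection], iou_thresh: float
) -> bool:
    """A person pair is usable only if both members are matched."""
    return (
        match_detection(gt_a, detections, iou_thresh) is not None
        and match_detection(gt_b, detections, iou_thresh) is not None
    )


# Toy usage (illustrative boxes and scores only).
toy_detections = [
    ((0.0, 0.0, 10.0, 9.0), 0.6),
    ((1.0, 0.0, 10.0, 10.0), 0.9),
    ((50.0, 50.0, 60.0, 60.0), 0.99),
]
best = match_detection((0.0, 0.0, 10.0, 10.0), toy_detections, iou_thresh=0.5)
```

Under this reading, the 3171 of 3961 detected pairs correspond to roughly 80% of the test pairs having both persons matched.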
6.7 Analysis on Attention Mechanism
In this section, we demonstrate the importance of the attention mechanism in the proposed Dual-Glance model. We remove the attention module and experiment with two functions to aggregate regional scores, which are avg(·) and max(·). Table 6 shows the relationship recognition result on the PISC dataset. Adding the attention mechanism leads to improvement for both avg(·) and max(·). The performance improvement is more significant for avg(·). For dual-glance without attention, max(·) performs best, while for dual-glance with attention, avg(·) performs best. This is because max(·) assumes that there exists a single contextual region that is most informative of the relationship, but sometimes there is no such region. On the other hand, avg(·) considers all regions, but could be distracted by irrelevant ones. However, with properly guided attention, avg(·) can better exploit the collaborative power of relevant regions for more accurate inference.
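The contrast between the aggregation functions can be made concrete with a small sketch. It assumes, based on the discussion of a single most informative region versus all regions, that the two functions are a per-class maximum and a per-class average over region scores, and that attention acts as a scalar weight on each region's scores before aggregation. These are illustrative assumptions, and all names are hypothetical.

```python
from typing import List

Scores = List[List[float]]  # one row of per-class scores for each contextual region


def aggregate_max(region_scores: Scores) -> List[float]:
    """Per-class maximum: trusts the single most confident region."""
    return [max(col) for col in zip(*region_scores)]


def aggregate_avg(region_scores: Scores) -> List[float]:
    """Per-class mean: every region contributes equally."""
    n = len(region_scores)
    return [sum(col) / n for col in zip(*region_scores)]


def apply_attention(region_scores: Scores, attention: List[float]) -> Scores:
    """Scale each region's scores by its attention weight (assumed to lie in [0, 1])."""
    return [[a * s for s in row] for row, a in zip(region_scores, attention)]


# Toy usage: three regions, two classes; only the first region is relevant.
toy_scores = [[0.9, 0.1], [0.2, 0.4], [0.1, 0.3]]
toy_attention = [1.0, 0.0, 0.0]

plain_avg = aggregate_avg(toy_scores)
attended_avg = aggregate_avg(apply_attention(toy_scores, toy_attention))
plain_max = aggregate_max(toy_scores)
```

In the toy example, the plain average lets the two irrelevant regions pull the second class up, whereas the attended average preserves the ranking implied by the relevant region.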
|Without Attention||With Attention|
6.8 Visualization of Examples
The attention mechanism enables different person pairs to exploit different contextual cues. Some examples are shown in Fig. 10. Taking the images in the second row as an example, the little girl in the red box is useful for inferring that the other girl on her left and the woman on her right are family, but her presence says little about the couple in black.
Fig. 11 shows examples of misclassified cases. The model fails to pick up the gender cue (misclassifying friends as couple in the image at row 3 column 3), or picks up the wrong cue (the white board instead of the vegetable in the image at row 2 column 3). Fig. 12 shows examples of correct recognition for each relationship category in the PISC test set. We can observe that the proposed model learns to recognize social relationships from a wide range of visual cues, including clothing, environment, surrounding people/animals and contextual objects. For intimate relationships, the contextual cues vary from beer (friends), gamepad (friends) and TV (family) to cake (couple) and flowers (couple). For non-intimate relationships, the contextual cues are mostly related to the occupations of the individuals. For instance, goods shelf and scale indicate a commercial relationship, while uniform and documents imply a professional relationship.
In this study, we address the problem of social relationship recognition, a key challenge in bridging the social gap towards higher-level social scene understanding. To this end, we propose a dual-glance model, which exploits useful information from the person pair of interest as well as multiple contextual regions. We incorporate an attention mechanism to assess the relevance of each region instance with respect to the person pair. We also propose an adaptive focal loss that leverages the ambiguity in social relationship labels for more effective learning. The adaptive focal loss can potentially be used in a wider range of tasks that have a certain degree of subjectivity, such as sentiment classification, aesthetic prediction and image style recognition.
In order to facilitate research in social scene understanding, we curated a large-scale PISC dataset. We conduct extensive experiments and ablation studies, and demonstrate both quantitatively and qualitatively the efficacy of the proposed method. Our code and data are available at https://doi.org/10.5281/zenodo.831940.
Our work builds a state-of-the-art computational model for social relationship recognition. We believe that our work can pave the way to more studies on social relationship understanding, and social scene understanding in general.
This research was carried out at the NUS-ZJU SeSaMe Centre. It is supported by the National Research Foundation, Prime Minister’s Office, Singapore under its International Research Centre in Singapore Funding Initiative.
- Agrawal et al (2018) Agrawal A, Batra D, Parikh D, Kembhavi A (2018) Don’t just assume; look and answer: Overcoming priors for visual question answering. In: CVPR, pp 6904–6913
- Alahi et al (2016) Alahi A, Goel K, Ramanathan V, Robicquet A, Fei-Fei L, Savarese S (2016) Social LSTM: Human trajectory prediction in crowded spaces. In: CVPR, pp 961–971
- Alameda-Pineda et al (2016) Alameda-Pineda X, Staiano J, Subramanian R, Batrinca LM, Ricci E, Lepri B, Lanz O, Sebe N (2016) SALSA: A novel dataset for multimodal group behavior analysis. IEEE Trans Pattern Anal Mach Intell 38(8):1707–1720
- Alletto et al (2014) Alletto S, Serra G, Calderara S, Solera F, Cucchiara R (2014) From ego to nos-vision: Detecting social relationships in first-person views. In: CVPR Workshops, pp 594–599
- Chen et al (2012) Chen Y, Hsu WH, Liao HM (2012) Discovering informative social subgraphs and predicting pairwise relationships from group photos. In: ACMMM, pp 669–678
- Choi and Savarese (2012) Choi W, Savarese S (2012) A unified framework for multi-target tracking and collective activity recognition. In: ECCV, Lecture Notes in Computer Science, vol 7575, pp 215–230
- Chu et al (2015) Chu X, Ouyang W, Yang W, Wang X (2015) Multi-task recurrent neural network for immediacy prediction. In: ICCV, pp 3352–3360
- Conte and Plutchik (1981) Conte HR, Plutchik R (1981) A circumplex model for interpersonal personality traits. Journal of Personality and Social Psychology 40(4):701
- Deng et al (2016) Deng Z, Vahdat A, Hu H, Mori G (2016) Structure inference machines: Recurrent neural networks for analyzing relations in group activity recognition. In: CVPR, pp 4772–4781
- Dibeklioglu et al (2013) Dibeklioglu H, Salah AA, Gevers T (2013) Like father, like son: Facial expression dynamics for kinship verification. In: ICCV, pp 1497–1504
- Ding and Yilmaz (2014) Ding L, Yilmaz A (2014) Learning social relations from videos: Features, models, and analytics. In: Human-Centered Social Media Analytics, pp 21–41
- Direkoglu and O’Connor (2012) Direkoglu C, O’Connor NE (2012) Team activity recognition in sports. In: ECCV, Lecture Notes in Computer Science, vol 7578, pp 69–83
- Fan et al (2018) Fan L, Chen Y, Wei P, Wang W, Zhu SC (2018) Inferring shared attention in social scene videos. In: CVPR, pp 6460–6468
- Fang et al (2010) Fang R, Tang KD, Snavely N, Chen T (2010) Towards computational models of kinship verification. In: ICIP, pp 1577–1580
- Fiske (1992) Fiske AP (1992) The four elementary forms of sociality: framework for a unified theory of social relations. Psychological Review 99(4):689
- Gallagher and Chen (2009) Gallagher AC, Chen T (2009) Understanding images of groups of people. In: CVPR, pp 256–263
- Gan et al (2013) Gan T, Wong Y, Zhang D, Kankanhalli MS (2013) Temporal encoded F-formation system for social interaction detection. In: ACMMM, pp 937–946
- Gao et al (2017) Gao B, Xing C, Xie C, Wu J, Geng X (2017) Deep label distribution learning with label ambiguity. IEEE Trans Image Processing 26(6):2825–2838
- Gkioxari et al (2015) Gkioxari G, Girshick RB, Malik J (2015) Contextual action recognition with R*CNN. In: ICCV, pp 1080–1088
- Goyal et al (2017) Goyal Y, Khot T, Summers-Stay D, Batra D, Parikh D (2017) Making the V in VQA matter: Elevating the role of image understanding in visual question answering. In: CVPR, pp 6325–6334
- Guo et al (2014) Guo Y, Dibeklioglu H, van der Maaten L (2014) Graph-based kinship recognition. In: ICPR, pp 4287–4292
- Hall (1959) Hall ET (1959) The silent language, vol 3. Doubleday New York
- Haslam (1994) Haslam N (1994) Categories of social relationship. Cognition 53(1):59–90
- Haslam and Fiske (1992) Haslam N, Fiske AP (1992) Implicit relationship prototypes: Investigating five theories of the cognitive organization of social relationships. Journal of Experimental Social Psychology 28(5):441–474
- He et al (2016) He K, Zhang X, Ren S, Sun J (2016) Deep residual learning for image recognition. In: CVPR, pp 770–778
- Hung et al (2007) Hung H, Jayagopi DB, Yeo C, Friedland G, Ba SO, Odobez J, Ramchandran K, Mirghafori N, Gatica-Perez D (2007) Using audio and video features to classify the most dominant person in a group meeting. In: ACMMM, pp 835–838
- Johnson et al (2016) Johnson J, Karpathy A, Fei-Fei L (2016) Densecap: Fully convolutional localization networks for dense captioning. In: CVPR, pp 4565–4574
- Krishna et al (2017) Krishna R, Zhu Y, Groth O, Johnson J, Hata K, Kravitz J, Chen S, Kalantidis Y, Li L, Shamma DA, Bernstein MS, Fei-Fei L (2017) Visual genome: Connecting language and vision using crowdsourced dense image annotations. International Journal of Computer Vision 123(1):32–73
- Lan et al (2012a) Lan T, Sigal L, Mori G (2012a) Social roles in hierarchical models for human activity recognition. In: CVPR, pp 1354–1361
- Lan et al (2012b) Lan T, Wang Y, Yang W, Robinovitch SN, Mori G (2012b) Discriminative latent models for recognizing contextual group activities. IEEE Trans Pattern Anal Mach Intell 34(8):1549–1562
- Li et al (2017a) Li J, Wong Y, Zhao Q, Kankanhalli MS (2017a) Dual-glance model for deciphering social relationships. In: ICCV, pp 2650–2659
- Li et al (2017b) Li Y, Ouyang W, Zhou B, Wang K, Wang X (2017b) Scene graph generation from objects, phrases and region captions. In: ICCV, pp 1261–1270
- Lin et al (2014) Lin T, Maire M, Belongie SJ, Hays J, Perona P, Ramanan D, Dollár P, Zitnick CL (2014) Microsoft COCO: common objects in context. In: ECCV, Lecture Notes in Computer Science, vol 8693, pp 740–755
- Lin et al (2017) Lin T, Goyal P, Girshick RB, He K, Dollár P (2017) Focal loss for dense object detection. In: ICCV, pp 2980–2988
- Lu et al (2016) Lu C, Krishna R, Bernstein MS, Fei-Fei L (2016) Visual relationship detection with language priors. In: ECCV, Lecture Notes in Computer Science, vol 9905, pp 852–869
- Lv et al (2018) Lv J, Liu W, Zhou L, Wu B, Ma H (2018) Multi-stream fusion model for social relation recognition from videos. In: MMM, pp 355–368
- Marín-Jiménez et al (2014) Marín-Jiménez MJ, Zisserman A, Eichner M, Ferrari V (2014) Detecting people looking at each other in videos. International Journal of Computer Vision 106(3):282–296
- Maron and Lozano-Pérez (1997) Maron O, Lozano-Pérez T (1997) A framework for multiple-instance learning. In: NIPS, pp 570–576
- Orekondy et al (2017) Orekondy T, Schiele B, Fritz M (2017) Towards a visual privacy advisor: Understanding and predicting privacy risks in images. In: ICCV, pp 3686–3695
- Qin and Shelton (2016) Qin Z, Shelton CR (2016) Social grouping for multi-target tracking and head pose estimation in video. IEEE Trans Pattern Anal Mach Intell 38(10):2082–2095
- Ramanathan et al (2013) Ramanathan V, Yao B, Fei-Fei L (2013) Social role discovery in human events. In: CVPR, pp 2475–2482
- Ren et al (2015) Ren S, He K, Girshick RB, Sun J (2015) Faster R-CNN: towards real-time object detection with region proposal networks. In: NIPS, pp 91–99
- Rienks et al (2006) Rienks R, Zhang D, Gatica-Perez D, Post W (2006) Detection and application of influence rankings in small group meetings. In: ICMI, pp 257–264
- Robicquet et al (2016) Robicquet A, Sadeghian A, Alahi A, Savarese S (2016) Learning social etiquette: Human trajectory understanding in crowded scenes. In: ECCV, Lecture Notes in Computer Science, vol 9912, pp 549–565
- Russakovsky et al (2015) Russakovsky O, Deng J, Su H, Krause J, Satheesh S, Ma S, Huang Z, Karpathy A, Khosla A, Bernstein MS, Berg AC, Fei-Fei L (2015) ImageNet large scale visual recognition challenge. International Journal of Computer Vision 115(3):211–252
- Salamin et al (2009) Salamin H, Favre S, Vinciarelli A (2009) Automatic role recognition in multiparty recordings: Using social affiliation networks for feature extraction. IEEE Trans Multimedia 11(7):1373–1380
- Shao et al (2013) Shao M, Li L, Fu Y (2013) What do you do? occupation recognition in a photo via social context. In: ICCV, pp 3631–3638
- Shao et al (2014) Shao M, Xia S, Fu Y (2014) Identity and kinship relations in group pictures. In: Human-Centered Social Media Analytics, pp 175–190
- Sun et al (2017) Sun Q, Schiele B, Fritz M (2017) A domain based approach to social relation recognition. In: CVPR, pp 3481–3490
- Thomee et al (2016) Thomee B, Shamma DA, Friedland G, Elizalde B, Ni K, Poland D, Borth D, Li L (2016) YFCC100M: the new data in multimedia research. Commun ACM 59(2):64–73
- Vicol et al (2018) Vicol P, Tapaswi M, Castrejon L, Fidler S (2018) Moviegraphs: Towards understanding human-centric situations from videos. In: CVPR, pp 8581–8590
- Vinciarelli et al (2012) Vinciarelli A, Pantic M, Heylen D, Pelachaud C, Poggi I, D’Errico F, Schröder M (2012) Bridging the gap between social animal and unsocial machine: A survey of social signal processing. IEEE Trans Affective Computing 3(1):69–87
- Wang et al (2010) Wang G, Gallagher AC, Luo J, Forsyth DA (2010) Seeing people in social context: Recognizing people and social relationships. In: ECCV, Lecture Notes in Computer Science, vol 6315, pp 169–182
- Xia et al (2012) Xia S, Shao M, Luo J, Fu Y (2012) Understanding kin relationships in a photo. IEEE Trans Multimedia 14(4):1046–1056
- Xiao et al (2015) Xiao T, Xu Y, Yang K, Zhang J, Peng Y, Zhang Z (2015) The application of two-level attention models in deep convolutional neural network for fine-grained image classification. In: CVPR, pp 842–850
- Xu et al (2015) Xu K, Ba J, Kiros R, Cho K, Courville AC, Salakhutdinov R, Zemel RS, Bengio Y (2015) Show, attend and tell: Neural image caption generation with visual attention. In: ICML, pp 2048–2057
- Yang et al (2012) Yang Y, Baker S, Kannan A, Ramanan D (2012) Recognizing proxemics in personal photos. In: CVPR, pp 3522–3529
- Yang et al (2016) Yang Z, He X, Gao J, Deng L, Smola A (2016) Stacked attention networks for image question answering. In: CVPR, pp 21–29
- You et al (2016) You Q, Jin H, Wang Z, Fang C, Luo J (2016) Image captioning with semantic attention. In: CVPR, pp 4651–4659
- Yun et al (2012) Yun K, Honorio J, Chattopadhyay D, Berg TL, Samaras D (2012) Two-person interaction detection using body-pose features and multiple instance learning. In: CVPR Workshops, pp 28–35
- Zhang et al (2015a) Zhang N, Paluri M, Taigman Y, Fergus R, Bourdev LD (2015a) Beyond frontal faces: Improving person recognition using multiple cues. In: CVPR, pp 4804–4813
- Zhang et al (2015b) Zhang Z, Luo P, Loy CC, Tang X (2015b) Learning social relation traits from face images. In: ICCV, pp 3631–3639