CommuNety: A Deep Learning System for the Prediction of Cohesive Social Communities

07/29/2020 ∙ by Syed Afaq Ali Shah, et al.

Effective mining of social media, which involves a large number of users, is a challenging task. Traditional approaches rely on the analysis of text data related to users to accomplish this task. However, text data lacks significant information about social users and their associated groups. In this paper, we propose CommuNety, a deep learning system for the prediction of cohesive social networks using images. The proposed deep learning model consists of a hierarchical CNN architecture that learns descriptive features related to each cohesive network. The paper also proposes a novel Face Co-occurrence Frequency algorithm to quantify the existence of people in images, and a novel photo ranking method to analyze the strength of the relationship between different individuals in a predicted social network. We extensively evaluate the proposed technique on the PIPA dataset and compare it with state-of-the-art methods. Our experimental results demonstrate the superior performance of the proposed technique for the prediction of relationships between different individuals and the cohesiveness of communities.




I Introduction

With the pervasiveness of low-cost digital cameras and advances in computer vision and machine learning, the collection and analysis of large image data has become a routine task. As the value of a photo is largely determined by who appears in it (e.g., a celebrity), labeling photos with the identities of the people in them becomes an essential task [16, 23, 25, 15].

The popularity of social applications and social networking services (SNS) such as Facebook, Twitter, LinkedIn, Weibo, MOMO and Flickr has led to the formation of online social networks of users on these sites. At present, analyzing online comments (e.g., tweets) is a popular method to determine effective communities in social networks. Pfeil et al. [18] proposed a study about the age differences of users in online social communities. They extracted information from MySpace’s user profile pages and divided the users into teenagers and older people communities. Users in the same community have common features, for example, teenagers have larger friends networks than older users. While text data contains rich information, it can be noted that the existing methods are unable to utilise the text data to get sufficient information about the social users. In addition, the social networks need to be more comprehensive and accurate [30]. With the advent of imaging technology and the availability of portable high resolution cameras such as on smartphones, users can now upload their images and profiles to social media websites and share photos with other users who are part of their social community [26, 4]. Social media users upload countless photos of social activities each day, and the relationship among those who appear in these photos cannot be mined accurately only from text data. Hence, defining online social networks with user-uploaded images, and extraction of human features, such as faces or body from photos becomes an important procedure in building social networks [7, 13, 27]. Note that the popular SNS applications have very large user bases. In 2018 alone, Facebook had 2.2 billion monthly active users. Flickr had over 90 million monthly users, and the number of monthly users of Weibo exceeded 0.44 billion [5]. Therefore, the mining of a potential relationship between social network users is a challenging problem.

To overcome the above challenges, in this paper, we propose a deep learning system, CommuNety, which uses image data for the prediction of comprehensive and cohesive online social communities. The proposed system complements existing works and is helpful in discovering communities when there are no explicit relationships (e.g., discovering communities in an image database) or when not all relationships are directly represented in the network. For example, two people may not be friends on social media and may never have interacted on the platform; however, if they appear together in some photos, they have a relationship that can only be discovered using images.

Several deep learning algorithms have been developed in recent years and have achieved significant breakthroughs in image recognition tasks. In 2014, Simonyan and Zisserman proposed a deep Convolutional Neural Network (CNN) architecture and achieved outstanding classification performance. Parkhi et al. applied the VGG (Visual Geometry Group) network structure to the face recognition task and achieved results comparable to other face recognition techniques. Razavian et al. demonstrated that the features extracted from a CNN are powerful and that models trained using CNN features have superior performance. Such features can be used for visual recognition tasks.

Inspired by prior approaches, in this paper, we propose a high-performance face recognition model, which learns distinctive image features. The proposed model is then used to predict the community network, and its hierarchy, centered at the target person. In our proposed technique, every photo in the training set is also ranked using the term frequency-inverse document frequency (TF-IDF) statistic. Then, the strength of the relationship between each pair of persons in the photos is represented by the sum of the TF-IDF values of their group photos. As a result, the social network output by the proposed prediction system contains all the persons who have a direct or indirect relationship with the target person, together with the relationship strengths among them.

Recognizing people from high-quality photos, which contain high-resolution facial images, is a trivial task for humans. However, even well-trained autonomous systems still struggle with this challenging task. This is because of the variations in natural images, such as changes in illumination, viewpoint, or head rotation. Moreover, although some progress has been made recently in recognition from a frontal face without face location, non-frontal views are more common in social media photo albums. A few face recognition techniques perform face detection as a preliminary step [21, 11, 29]. Note that face detection can be regarded as a two-class (face versus non-face) classification problem. However, these techniques cannot deal with significant variations in face images, such as head rotation and view changes, when detecting and recognising faces. Other model-based approaches require that the initial locations of faces are known in advance [10] and then perform face tracking to recognise individuals in the image data. This paper overcomes these challenges. The significance of this research is to recognise people from any viewpoint and associate them with established cohesive social communities [28].

The contributions of this paper can be summarised as follows:

  • First, we propose a deep learning model to predict cohesive social communities or networks using image data and face recognition.

  • Second, we propose a novel algorithm to calculate the relationship strength among people in the predicted social network. The improved social networks are quite informative. We also present the final social networks using data visualization techniques.

  • Third, we propose novel features for image-generated networks compared with other social communities formed in social media.

  • Fourth, we perform an extensive evaluation of the proposed technique. Our experimental results demonstrate the superior performance of the proposed system on the PIPA (“People In Photo Albums”) dataset.

The rest of this paper is organised as follows. The next section discusses the prior works related to this research. Section III describes the proposed methodology and provides information about the training of face recognition model, construction of social networks, and analysis of the predicted social networks. The PIPA dataset used for the evaluation and data pre-processing are discussed in Section IV. Section V presents our experimental results. The paper is concluded in Section VI.

II Literature Review

Chen et al. [3] proposed a technique to identify family and non-family images collected from social media, and to predict the pairwise relationships of persons who appear in the same family images. To categorize different group types or events, a bag-of-face-subgraphs (BoFG) was proposed. BoFG contained meaningful subgraphs, which represented a group photo, and the occurrence frequency of these subgraphs was adequate to identify specific image types. The authors trained an SVM classifier using BoFG features, and their technique achieved an accuracy of 89% on family image recognition. In addition, a Naive Bayesian classifier was used to predict the pairwise relationship from the frequency with which the informative subgraph appears in the image collections. Their proposed technique achieved a good improvement over prior works, especially in image categorization. However, their technique still has several limitations. For instance, the images used in the training and testing phases are frontal face images. Hence, if BoFG is applied to open-world images, which contain a large number of non-frontal faces, the performance would be significantly affected. Moreover, in their proposed method, the pairwise relationship is identified based on the age and gender gaps in a household. This feature is not applicable to other types of relationships, which do not involve age and gender gaps. This limits the application of their proposed technique to real-world social network data.

Kim et al. [12] developed an associative network structure called Face Co-occurrence Networks (FCON), which was used to recommend reliable social friends and explore relationships among people based on tagged personal photos. FCON consists of vertices (V) and edges (E), where V is a set of faces that appear in photos and E is a set of links between each pair of faces (i.e., co-occurrences of faces). All photos are converted into a global FCON, and the weights of V and E in the network are obtained by accumulating V and E over each subnetwork. Subsequently, the weights related to the target user are used to calculate a set of scores, which are compared with a pre-set threshold. Finally, the vertices with scores higher than the threshold are used to establish a target user-centric relationship network. The authors also developed a web-based system named VizFaceCo for data visualization. One aspect clearly worth improving is that their technique does not include face recognition: the photos are manually annotated with the corresponding names before building FCON. In contrast, in our proposed technique, automatic face recognition is used as a core technology.

Oh et al. [14] proposed an optimized model for person recognition, called naeil2, which is capable of handling large variations in person images. naeil2 consists of seventeen cues (five vanilla regional cues, two head cues, and ten attribute cues) and the DeepID2+ face recognition module. All the cues were obtained from the seventh layer (fc7) of AlexNet [14] and concatenated together. Finally, these cues and DeepID2+ were combined using L2 normalization to build the final naeil2 model.

naeil2 has been shown to achieve an outstanding person recognition performance. However, this model relies on multiple features such as several body cues to identify persons. In most of the social media photos, multiple body cues are hard to capture and therefore their proposed technique fails in these situations.

Dong et al. [6] proposed a method for human age identification based on a CNN-based DeepID architecture. In their method, the loss function for classification was modified and a distance term was added to the loss function to emphasize the relationships between labels. They used different parts of face images to train multiple classifiers, and by comparing the accuracy of an exact match (AEM), the eye region was found to be the most significant feature reflecting the age of the person. To further improve AEM, different models were combined, and the best model combination was shown to achieve good performance.

They also described in detail the transfer learning strategy adopted in their work that used fewer data samples to train their model to achieve good performance. Concretely, they used large-scale data sets to train a face recognition model, then transferred the parameters of convolutional layers to another network which had same architecture, but the parameters in fully connected layers were randomly initialized. This new network was fine-tuned using the small-scale dataset to get the desired age classification model.

The limitation of their technique is that the performance of the face recognition models was not outstanding compared to state-of-the-art face recognition techniques; e.g., the lowest error rate for DeepID is 0.4. In other words, the accuracy of the best model is 60% [6]. One reason is that the CNN architecture used in DeepID is relatively simple; for instance, DeepID has only four convolutional layers. Their model was not able to handle the complexity of the image data and was therefore under-fitting. The accuracy of recognition can be improved by using a more complex CNN architecture and more training data [8]. In this paper, we overcome the limitations of the prior methods and propose a novel technique for the prediction of social communities using images, and calculate the relationship strength between the social media users.

III Proposed Methodology

In this section, we describe our proposed deep learning based system, CommuNety, to predict the social community centered on the input image. We propose a novel algorithm to calculate the relationship strength between people in the predicted social community. The proposed deep learning pipeline consists of two phases: the face recognition phase, and the community prediction and formation phase. In the face recognition phase, our proposed deep learning model is trained to perform accurate face recognition. In the community prediction and formation phase, we develop a novel Face Co-occurrence Frequency algorithm and calculate the relationship strength to predict communities. In addition, we also propose novel features for the predicted communities by analyzing their properties.

III-A Face Recognition Phase

To establish accurate and cohesive social networks, the most challenging task is to accurately recognize persons in given photos. We first detect faces in the input images by using the Viola and Jones algorithm [24]. The outcome of face detection (i.e., the bounding box of faces) is validated by the annotations provided in [28]. The detected face images along with their labels are then fed to our deep learning based face recognition system. These images are used for the training and testing of our proposed deep neural network, which is discussed in the following section.

III-A1 Deep Neural Network Architecture

We propose a deep face recognition architecture to extract discriminating features for the face recognition task. The proposed deep learning architecture is composed of sixteen blocks. The first eleven blocks consist of convolutional layers. Each block is followed by one non-linear activation function (ReLU), and five max-pooling layers are interspersed between blocks to reduce the computational load. The last three blocks are Fully Connected (FC) layers. The last layer is a softmax layer for multi-class classification, and its dimension is equal to the number of class labels in the task.

III-A2 Deep Network Training and Testing

The neural network is trained as a multi-class classifier to recognize persons using their face images. The class probability is computed using the following equation, which computes probability in the range between 0 and 1 for each class:

$$p_j = \frac{e^{z_j}}{\sum_{n=1}^{N} e^{z_n}} \qquad (1)$$

where $N$ represents the number of classes and $p_j$ is the probability of class $j$. $z_j$ is the output of the $j$th neuron in the softmax layer. Its role is to increase the probability of the true class label. In addition, we use the cross-entropy loss function as in Eq. (2) for the softmax layer:

$$L = -\sum_{j=1}^{N} al_j \log p_j \qquad (2)$$

where $al$ is the actual label of the input, with $al_j = 1$ for the true class and $al_j = 0$ otherwise.

During testing, given a test face image, the network then predicts the class label for the input test image. The output of face recognition is then used in the subsequent modules and to predict the social community as discussed below.
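The softmax probability and cross-entropy loss described above can be illustrated with a small worked example. This is a minimal sketch in plain Python, not the authors' MATLAB implementation; the function names and the three example scores are assumptions made purely for illustration.

```python
import math

def softmax(scores):
    """Map raw output-layer scores to class probabilities that sum to 1."""
    shift = max(scores)  # subtracting the max avoids overflow; the result is unchanged
    exps = [math.exp(z - shift) for z in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(probs, true_class):
    """Cross-entropy for a one-hot label: minus the log of the true-class probability."""
    return -math.log(probs[true_class])

# Hypothetical scores of the last layer for N = 3 identities.
scores = [2.0, 1.0, 0.1]
probs = softmax(scores)
loss = cross_entropy(probs, true_class=0)
```

The loss shrinks as the probability assigned to the true identity grows, which is the sense in which training increases the probability of the true class label.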

III-B Community Prediction and Formation Phase

Once the faces have been successfully recognised in the input images, the next phase is to predict the social communities using facial images. We propose two algorithms to predict social communities using our face recognition system, and compute relationship strength for each pair of connected nodes in the communities.

III-B1 Recursive Face Co-occurrence Frequency

Our proposed algorithm to predict communities is similar to FCON [12], however, it is recursive in nature. A dictionary is first defined to store face co-occurrence frequencies as follows:

$$D_i = \{\, key_k : value_k \,\}, \quad k = 1, \ldots, K \qquad (3)$$

where $i$ is the target candidate, $key_k$ is the class of the $k$th person in the dataset (with $K$ classes in total), and the key value $value_k$ represents the number of times the $k$th person appears in the given album. The key values are initially zero.

Fig. 1: Social Network Construction. Left: Person A is the root, which is directly connected to persons B, C and D. Right: The final network has two layers, as only person E is on the latest layer and E does not connect to any new person who is not in the current network.

Given an input target person’s face image, our proposed face recognition system recognizes and collects all the photos of the target class. Next, the photo labels of the other classes are compared with the collected target photo labels. The images of the other classes are input to the face recognition system to predict the class label of each input image. When the input person’s class is predicted, the key value corresponding to the person’s name in the co-occurrence frequency dictionary is incremented by one. After all the matched photos have been assigned to the dictionary, persons whose key values are larger than zero are considered directly connected with the target in the social network. Below, we provide a definition of the elements contained in the predicted social network.

Definition. Root, nodes, and layers:

  1. The initial target is the root of its social network and is on layer 0.

  2. Other people in the social network are nodes. Nodes that are directly connected with the root are on layer 1; similarly, the nodes on the second layer are connected to the nodes on the first layer.

  3. Because multiple persons may occur in a group photo, nodes on the same layer may be connected.

Each person on layer 1 is treated as a new target and the same method (as stated above) is followed to build their corresponding single-layer social network. The new community network is then integrated with the previous network to build a 2-layer social network. We only add people who do not already exist in the previous network to the second layer. This process is repeated until no more new person can be found to join the network, and finally a complete social network centered on the initial target person is set up. Figure 1 shows an example of social network construction process. Person A is the root, which is directly connected to persons B, C and D. The final network has two layers, as only person E is on the latest layer and E does not connect to any new person who is not in the current network.
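The layer-by-layer construction described above can be sketched as a breadth-first expansion. This is a hedged illustration only: it assumes each photo has already been reduced to the set of identities recognised in it (standing in for the face recognition step), and the names `cooccurrence`, `build_network` and `min_freq` are hypothetical, not taken from the paper.

```python
from collections import Counter

def cooccurrence(photos, target):
    """Count how often each other person shares a photo with `target`."""
    counts = Counter()
    for people in photos:
        if target in people:
            for person in people:
                if person != target:
                    counts[person] += 1
    return counts

def build_network(photos, root, min_freq=0):
    """Expand the network layer by layer until no new person can be added."""
    layer_of = {root: 0}          # person -> layer index (the root is on layer 0)
    edges = set()                 # undirected edges stored as frozensets
    frontier, depth = [root], 0
    while frontier:
        depth += 1
        next_frontier = []
        for target in frontier:
            for person, freq in cooccurrence(photos, target).items():
                if freq > min_freq:
                    edges.add(frozenset((target, person)))
                    if person not in layer_of:   # only new people join the next layer
                        layer_of[person] = depth
                        next_frontier.append(person)
        frontier = next_frontier
    return layer_of, edges

# Toy album mirroring Fig. 1: A is the root, E is reachable only through B.
photos = [{"A", "B"}, {"A", "C", "D"}, {"B", "E"}, {"C", "D"}]
layers, edges = build_network(photos, "A")
```

In this toy album, B, C and D land on layer 1 and E on layer 2, and C and D are also linked to each other because they share a photo, matching the remark that nodes on the same layer may be connected.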

III-B2 Prediction of Relationship Strength in Social Communities

To predict the relationship strength of persons in social communities, we propose an image ranking algorithm to assign scores to the images that determine the relationship strength in a community. To achieve this, TF-IDF is used for ranking in the predicted community.

TF-IDF is a statistical weighting technique that reflects the importance of a word to the documents in a corpus. This importance is obtained by comparing the relative frequency of a word in a particular document with the inverse proportion of the word in the entire corpus [19]. In the proposed technique, the corpus consists of all the group photos, where each photo represents a document in the corpus, and the words are replaced by the persons in the group photos.

In the proposed technique, the formula for TF-IDF is defined as follows. Given the group photo set $G$, a candidate $c$, and a single group photo $g \in G$, the TF-IDF is represented as:

$$\text{TF-IDF}(c, g, G) = f_{c,g} \times \log \frac{|G|}{|\{g' \in G : c \in g'\}|} \qquad (4)$$

where $f_{c,g}$ is the number of times $c$ appears in $g$, $|G|$ represents the size of the group photo set, and $|\{g' \in G : c \in g'\}|$ equals the number of group photos in $G$ in which $c$ appears.

The TF-IDF formula can be separated into two terms, TF and IDF, as follows:

$$\text{TF}(c, g) = f_{c,g} \qquad (5)$$

$$\text{IDF}(c, G) = \log \frac{|G|}{|\{g' \in G : c \in g'\}|} \qquad (6)$$

Since a given person can only appear once in each photo, $\text{TF}(c, g) = 1$. Meanwhile, each candidate has their own fixed IDF value, as the number of times they appear in the entire photo collection is fixed. The more often a person appears in the photo collection, the smaller their IDF, and the less important they are considered in a specific photo.

The group photos are ranked by calculating the average of the TF-IDFs of all candidates in each photo:

$$\text{Score}(g) = \frac{1}{k} \sum_{j=1}^{k} \text{TF-IDF}(c_j, g, G) \qquad (7)$$

where $k$ represents the number of persons in $g$ and $c_j$ is the $j$th person in $g$.

Intuitively, the score of a photo is related to the IDF values of the persons in the photo. For example, if the persons in a photo appear only a few times in the entire collection, the score of this photo is high. On the contrary, if most people in a photo appear in the photo collection many times, then the IDFs of these people are small, and the significance of this photo is low.

Ultimately, the strength of the relationship between each pair of connected people in the social network is represented by the sum of the scores of all of their photos in which both persons appear. Each edge in the social network is assigned a weight representing the strength of the relationship (larger the better) between the two persons connected by the edge.
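The photo ranking and edge weighting described above can be sketched in a few lines. This is an illustrative sketch under stated assumptions: photos are given as a mapping from a photo name to the set of identities in it, TF is taken as 1, and the natural logarithm is used because the text does not state the base. All function names and the toy data are hypothetical.

```python
import math
from collections import defaultdict
from itertools import combinations

def idf_values(photos):
    """IDF per person: log(|G| / number of group photos containing the person)."""
    appearances = defaultdict(int)
    for people in photos.values():
        for person in people:
            appearances[person] += 1
    total = len(photos)
    return {p: math.log(total / n) for p, n in appearances.items()}  # log base is an assumption

def photo_scores(photos, idf):
    """Photo score: mean TF-IDF of its persons (TF = 1, so the mean IDF)."""
    return {name: sum(idf[p] for p in people) / len(people)
            for name, people in photos.items()}

def relationship_strength(photos, scores):
    """Edge weight: sum of the scores of every photo in which both persons appear."""
    strength = defaultdict(float)
    for name, people in photos.items():
        for a, b in combinations(sorted(people), 2):
            strength[(a, b)] += scores[name]
    return dict(strength)

photos = {"p1": {"A", "B"}, "p2": {"A", "B", "C"}, "p3": {"A", "C"}, "p4": {"D", "E"}}
idf = idf_values(photos)
scores = photo_scores(photos, idf)
strength = relationship_strength(photos, scores)
```

Here A appears in three of the four photos and so receives the smallest IDF, while the photo of D and E, who each appear only once, receives the highest score.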

IV Image Data for Evaluation

IV-A Dataset

The proposed technique is evaluated on the People In Photo Albums (PIPA) dataset. The PIPA dataset contains 37107 Flickr personal photo album images with 63188 head annotations of 2356 identities; all the images have a Creative Commons Attribution License [28]. The dataset is divided into train, val, test, and leftover sets, in proportions of 45%, 15%, 20%, and 20%, respectively. We used the same experimental protocol as in [14]; hence, the same image sets are used in this paper.

The train, val and test sets in the dataset contain distinct identities i.e., the class labels in training and test set were totally different. Therefore these image data cannot be used directly for the proposed technique. Besides, the number of photos from different identities varies significantly, e.g., the minimum number was only 5, and such a small data size is not enough to train the proposed model. Therefore, we pre-process the data for our deep learning model.

IV-B Data Pre-processing

In data pre-processing stage, data cleaning, redistribution, and data augmentation are performed.

First, we crop all the face images of identities in the train, val and test sets. All instances are then resized to 224x224 to fit the input size of the proposed model. Second, data cleaning is performed. The cropped images have different appearances, including frontal faces, side faces, and even the backs of heads. The back of the head does not contribute to recognition and could affect the training of our deep network; therefore, these images are removed from the dataset. An example of these instances is shown in Fig. 2. Third, the training and test images are randomly selected in the proportion 80% and 20%, respectively. Because the number of instances differs between classes, stratified sampling is used for data allocation to avoid significantly biased results [8]. The last step is to perform data augmentation. We set 8 as the minimum number of instances per class. For the classes with insufficient instances, we perform data augmentation by rotating their instances by different angles, flipping, and scaling, which expands the number of instances in those classes. Figure 3 presents the different steps involved in our data pre-processing.
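The stratified 80/20 split and the minimum-instances rule can be sketched as follows. This is only an illustration: the helper names are hypothetical, keeping at least one test instance per class is an assumption, and applying the minimum of 8 to the training split (rather than to the whole class) is also an assumption, since the text does not say which is meant.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_fraction=0.2, seed=0):
    """Split sample indices class by class so every class keeps roughly the same ratio."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    rng = random.Random(seed)
    train, test = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = max(1, round(len(indices) * test_fraction))  # assumption: at least one test sample
        test.extend(indices[:n_test])
        train.extend(indices[n_test:])
    return train, test

def augmentation_needed(train_labels, minimum=8):
    """Number of augmented instances each class needs to reach `minimum`."""
    counts = defaultdict(int)
    for label in train_labels:
        counts[label] += 1
    return {c: minimum - n for c, n in counts.items() if n < minimum}

labels = ["id1"] * 10 + ["id2"] * 5          # toy identities with unequal instance counts
train_idx, test_idx = stratified_split(labels)
deficit = augmentation_needed([labels[i] for i in train_idx])
```

In this toy case the second identity keeps four training instances and would need four augmented ones (for example rotated, flipped or scaled copies) to reach the minimum.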

Fig. 2: An Example of Poor Quality Instances i.e., Head Images.
Fig. 3: Data Pre-processing. (1) Personal photos in PIPA Dataset. (2) Re-sized head images cropped from original photos. (3) Good quality head images. (4) Training and test data. (5) Augmented training data.

After pre-processing and data augmentation, the dataset has 2356 classes. The training set contains 41533 face images, of which 8613 are augmented, and the test set contains 8230 face images. The distribution of images in the pre-processed dataset is shown in Table I.

| Split | All | Train | Test |
|---|---|---|---|
| Instances (augmented) | 49763 | 41533 (8613) | 8230 |
| Identities | 2356 | 2356 | 2356 |
| Average instances per identity | 21.12 | 17.63 | 3.49 |

TABLE I: Statistics of the pre-processed dataset

V Experimental Results

In the following, we first train the proposed model for the face classification task, then construct the desired social networks.

During training, we use Stochastic Gradient Descent (SGD) and backpropagation to minimise the loss function. SGD randomly chooses one training instance at each step and calculates the gradients based only on that single instance. This speeds up the algorithm, especially on huge training sets, as it only manipulates a small amount of data at each iteration. In addition, to help the optimisation converge, the learning rate is set to gradually decrease from 0.005 to 0.00001 as the number of epochs increases. The learning rate changed every 30 epochs on average.

The model is trained to solve the multi-class classification task and is assessed by its top-1 classification error rate. We compare the highest-probability class of each sample with its actual class; the top-1 error is the ratio of the number of incorrectly predicted samples to the total number of input samples.
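As a small worked example of the top-1 error, the sketch below compares the highest-probability class of each sample with its true class. The function name and the four toy probability vectors are assumptions for illustration only.

```python
def top1_error(probabilities, true_labels):
    """Fraction of samples whose highest-probability class differs from the true class."""
    wrong = 0
    for probs, truth in zip(probabilities, true_labels):
        predicted = max(range(len(probs)), key=probs.__getitem__)  # index of the largest probability
        if predicted != truth:
            wrong += 1
    return wrong / len(true_labels)

# Four hypothetical test samples over three identities.
probabilities = [[0.7, 0.2, 0.1], [0.1, 0.3, 0.6], [0.5, 0.4, 0.1], [0.2, 0.5, 0.3]]
true_labels = [0, 2, 1, 1]
error = top1_error(probabilities, true_labels)
accuracy = 1 - error
```

Only the third sample is misclassified here, so the top-1 error is 1/4 and the accuracy is its complement.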

V-1 Comparison with State-of-the-art

We compare our proposed technique with the state-of-the-art methods including the deep learning model naeil2 [14] and DeepFace [1]. Our experimental results are reported in Table II.

| Method | Accuracy (%) |
|---|---|
| naeil2 [14] | 83.88 |
| DeepFace [1] | 46.66 |
| Proposed Technique | 86.87 |

TABLE II: Comparison of the proposed technique with the state-of-the-art methods.

As can be noted, the proposed model’s classification accuracy is 86.87%, i.e., the top-1 error rate is 1 - 0.8687 = 0.1313. naeil2 [14] fine-tuned the pre-trained AlexNet model using head images in the PIPA dataset and achieved an accuracy of 83.88%. DeepFace achieved an accuracy of 46.66% on the PIPA dataset. These results demonstrate the superior performance of the proposed technique, which relies on face images to predict social communities.

V-2 Implementation details

Our technique is developed in MATLAB. All our experiments have been performed on a machine with an Intel Core i5 CPU and 16GB RAM.

V-A Community Prediction and Formation

For the social network construction task, a complete social network prediction system is built on top of our face recognition model. The proposed system is evaluated on the PIPA dataset. The input to the social network prediction system is a face image of the target person, and the predicted social network graph starts with the target person as its root node, as shown in Figure 4. Figure 4 is divided into three parts by the dotted lines, where each part represents a layer. The numbers beside the nodes are people’s identities; e.g., person 137 is the target and is also the root of this network. Moreover, to enhance visibility, the edges emitted from the same node have the same color.

Fig. 4: An Example of a Predicted Social Network.
Fig. 5: Precision and Recall versus the Minimum Frequency Threshold.
Fig. 6: Social Network Centered at Person ID 1.
Fig. 7: Social Network Centered at Person ID 137.

V-A1 Performance of Community Formation Task

The precision and recall criteria are used for evaluating all predicted networks. For each predicted label (predicted person name) in a network, we can only judge whether it is consistent with the true label, regardless of which class it belongs to. Thus, the multi-class classification task is converted into a binary one. Every predicted label is recorded as a true negative (TN), false positive (FP), false negative (FN), or true positive (TP) based on the classification result.

Precision is the accuracy of the positive predictions [8], defined as:

$$\text{Precision} = \frac{TP}{TP + FP} \qquad (8)$$

where TP represents the number of people who exist in the networks and are correctly predicted, whereas FP is the number of persons wrongly classified into the networks. However, precision is deceptive in some cases. For example, if only one person is predicted and that prediction is correct, the precision is equal to 100%, but a network cannot be constructed from a single person. Hence, precision must be used along with recall, a.k.a. the true positive rate (TPR). As Eq. (9) shows, FN represents the number of persons who should be in the social networks but are not there.

$$\text{Recall} = \frac{TP}{TP + FN} \qquad (9)$$
The precision may be improved by setting a minimum face frequency: only those individuals whose face frequency is greater than this threshold are classified into the corresponding social network. However, this negatively affects recall. In Figure 5, we study this trade-off and observe that precision does not improve much whereas recall is severely affected. Thus, we set the minimum face frequency to zero for the rest of the experiments.

Figure 5 shows the precision and recall for different frequency thresholds. When the threshold increases, the precision is not significantly improved, whereas the recall is greatly reduced. We therefore empirically set the threshold to 0. The achieved precision with this threshold is 75.84%, and the recall is 76.13%.
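The effect of the minimum face frequency on precision and recall can be reproduced on a toy network. The sketch below is illustrative only: the function name, the frequency dictionary and the ground-truth set are hypothetical, and the numbers bear no relation to the results reported on PIPA.

```python
def precision_recall(face_frequency, true_members, min_frequency=0):
    """Precision and recall of a predicted network against its true members."""
    # A person joins the network only if their face frequency exceeds the threshold.
    predicted = {p for p, f in face_frequency.items() if f > min_frequency}
    tp = len(predicted & true_members)   # correctly included persons
    fp = len(predicted - true_members)   # persons wrongly included
    fn = len(true_members - predicted)   # persons who should be included but are missing
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

face_frequency = {"B": 3, "C": 1, "D": 2, "X": 1}   # X stands for a misrecognised face
true_members = {"B", "C", "D", "E"}
p0, r0 = precision_recall(face_frequency, true_members, min_frequency=0)
p1, r1 = precision_recall(face_frequency, true_members, min_frequency=1)
```

Raising the threshold from 0 to 1 removes the misrecognised X but also drops the genuine member C, so recall falls; this is the kind of trade-off studied in Figure 5.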

V-B Prediction of relationship strength between candidates and analysis of social network properties

To explore the relationship strength between candidates, we first calculate the IDF of each person using Eq. (6). The score of a photo is represented by the average of the IDFs of all candidates in the photo. Once all the IDFs have been computed, the number of persons in each photo is calculated. The statistics of the IDF and photo score are shown in Table III.

| Type | Min | Max | Average |
|---|---|---|---|
| IDF | 2.45 | 4.35 | 3.31 |
| Photo Score | 2.45 | 4.35 | 3.08 |

TABLE III: Statistics of IDF and photo score

As discussed in Section III-B2, the relationship strength between two candidates is obtained by summing the scores of the photos in which both appear together. Hence, we improve our social network prediction system by enabling it to record the face co-occurrence frequencies and the corresponding photo names simultaneously. The scores of those photos are then looked up in the previously computed photo score library to calculate the relationship strength.

All scores that reflect the relationship strengths are then displayed in the final social network graph. Example plots are shown in Figure 6 and Figure 7. These social networks are built with two different target persons, respectively. The nodes are replaced by the candidates’ face images for visualisation. In addition to the scores on the edges, the thickness of the edges also reflects the different relationship strengths.

V-C Analysis of Predicted Social Community

In this section, we analyze the complete set of social communities obtained by feeding all the test data to our proposed system and keeping all distinct communities. The cohesiveness of the communities predicted from image data is assessed through their size distribution and their density. Community size and density are two of the primary network properties. Community size represents the number of nodes in a community. Community density refers to the ratio of actual connections to maximum possible connections in a community, as in Eq. (10):

$$ D(G) = \frac{2\,|E|}{|V|\,(|V| - 1)} \qquad (10) $$

where $|E|$ is the number of edges in social network $G$ and $|V|$ represents the number of nodes in the network. The more edges a community has, the denser it is. A community is more cohesive when it has larger density and smaller size [1, 2].
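As a worked illustration of the density measure, the sketch below takes density as the ratio of actual edges to the maximum possible number of edges, assuming an undirected network without self-loops, so that a network with n nodes has at most n(n-1)/2 edges. The function name `community_density` is hypothetical.

```python
def community_density(num_edges, num_nodes):
    """Actual edges divided by the maximum possible edges, n * (n - 1) / 2."""
    if num_nodes < 2:
        return 0.0  # A single node cannot have any connections.
    max_edges = num_nodes * (num_nodes - 1) / 2
    return num_edges / max_edges


# A fully connected 10-person community has 45 edges.
print(community_density(45, 10))
# A hypothetical 4-person community with 5 of its 6 possible edges.
print(community_density(5, 4))
```

Multiplying the result by 100 gives the density as a percentage.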

To further explore the social networks, all the candidates are integrated into communities of different sizes. Figure 8 shows the size distribution of all communities. Although the largest network size is 167, most of the social networks predicted using image data in this paper contain fewer than 10 persons. Therefore, the analysis of network density focuses on network sizes of 3 to 10. The average network density for each size is shown in Figure 9. Although the network density gradually decreases as the network size grows, the minimum network density is still 57.14.
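The two summaries discussed here, the size distribution and the average density per size, can be sketched as below. The sketch assumes each community is given as a pair of a node set and an undirected edge set, and it reuses the actual-to-maximum edge ratio as density. The function names and the toy communities are hypothetical.

```python
from collections import Counter, defaultdict


def size_distribution(communities):
    """Count how many communities exist for each community size."""
    return Counter(len(nodes) for nodes, _ in communities)


def mean_density_by_size(communities, min_size=3, max_size=10):
    """Average density of the communities of each size within [min_size, max_size]."""
    densities = defaultdict(list)
    for nodes, edges in communities:
        n = len(nodes)
        if min_size <= n <= max_size:
            densities[n].append(len(edges) / (n * (n - 1) / 2))
    return {n: sum(vals) / len(vals) for n, vals in sorted(densities.items())}


# Hypothetical toy communities: (nodes, undirected edges).
communities = [
    ({"a", "b", "c"}, {("a", "b"), ("b", "c"), ("a", "c")}),
    ({"d", "e", "f"}, {("d", "e"), ("e", "f")}),
    ({"g", "h", "i", "j"}, {("g", "h"), ("h", "i"), ("i", "j"), ("g", "j")}),
]

print(size_distribution(communities))
print(mean_density_by_size(communities))
```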

To explore the features of image-generated communities, we compare the communities predicted by the proposed technique with the social network communities built in [9]. These include the Twitter Friendship Network, Epinions Social Network, Wikipedia Vote Network and EU Email Communication Network. These four networks were constructed using textual data, such as user profiles, emails, and questionnaires.

Table IV shows the statistics of network properties calculated on our image-generated community set and the four text-generated networks. The communities predicted using the PIPA dataset have smaller sizes and higher densities than the other four networks. Because small network size and high network density lead to cohesive communities, this indicates that the predicted communities are cohesive.

Fig. 8: Size Distribution of Communities
Fig. 9: Average Network Density
Property  Twitter  Epinions  Wikipedia  Email  Image-generated
Size      500      500       500        500    3-10
Edges     3099     13739     11672      2396   2-45
Density   6.18     27.47     23.34      4.79   88.51-57.14
TABLE IV: Statistics for Comparison of Image-generated Network and Text-generated Network

VI Conclusion and Future Work

In this paper, we propose a deep learning based social network prediction system, CommuNety. The input to our deep neural network is an image of a target person, and the output is a target-centered predicted social network, which also presents the relationship strength of persons in the predicted social network.

Due to the lack of labeled image data and hardware limitations, data augmentation is used to train the proposed face recognition model. The training data is augmented by image rotation, and the CNN features of all images are extracted from the fc6 layer of the deep learning model. These features are fed to three fully connected layers to train the face recognition system. To predict and build social networks, a face co-occurrence frequency technique is proposed to identify people in the dataset who are directly or indirectly related to the target, while the face recognition model classifies each person’s identity. As the deep neural network is the core of the social network prediction system, its classification accuracy limits the performance of the system. We also propose a photo ranking algorithm to rank photos in the dataset based on the TF-IDFs of persons in the same photos. Consequently, the relationship strength of identities in a social network depends not only on the number of group photos, but also on the scores of these photos. This information is more valuable than simply constructing a social network. In addition, the social networks predicted using image data are smaller and more cohesive than other social networks.

In our future work, we aim to optimize and further improve our social network prediction system. We will consider using more complex deep learning model architectures and generating more training examples. Moreover, we intend to explore other valuable information from social networks, such as text and the location of individuals, and use it as additional features to improve the prediction of our proposed CommuNety system.


This research is supported by Murdoch University, Australia. The authors would like to thank Dr Ammar Mahmood for useful discussion regarding face recognition and deep learning.


  • [1] R. Brunelli and D. Falavigna (1995) Person identification using multiple cues. IEEE transactions on pattern analysis and machine intelligence 17 (10), pp. 955–966. Cited by: §V-1, §V-C, TABLE II.
  • [2] S. Candan, L. Chen, T. B. Pedersen, L. Chang, and W. Hua (2017) Database systems for advanced applications. Springer International Publishing. External Links: Document Cited by: §V-C.
  • [3] Y. Y. Chen, W. H. Hsu, and H. Y. M. Liao (2012) Discovering informative social subgraphs and predicting pairwise relationships from group photos. Proceedings of the 20th ACM international conference on Multimedia, pp. 669–678. External Links: Document Cited by: §II.
  • [4] M. Cheung and J. She (2019) Detecting social signals in user-shared images for connection discovery using deep learning. IEEE Transactions on Multimedia. Cited by: §I.
  • [5] DMR (2018) Digital company statistics. Cited by: §I.
  • [6] Y. Dong, Y. Liu, and S. Lian (2016) Automatic age estimation based on deep learning algorithm. Neurocomputing, pp. 4–10. External Links: Document Cited by: §II, §II.
  • [7] S. Garg, K. Kaur, N. Kumar, and J. J. Rodrigues (2019) Hybrid deep-learning-based anomaly detection scheme for suspicious flow detection in sdn: a social multimedia perspective. IEEE Transactions on Multimedia 21 (3), pp. 566–578. Cited by: §I.
  • [8] A. Geron (2017) Hands-on machine learning with scikit-learn and tensorflow. O’Reilly Media, Inc. Cited by: §II, §IV-B, §V-A1, §V.
  • [9] A. Hashmi, F. Zaidi, A. Sallaberry, and T. Mehmood (2013) Are all social networks structurally similar? a comparative study using network statistics and metrics. CoRR. Cited by: §V-C.
  • [10] R. Hsu, M. Abdel-Mottaleb, and A. K. Jain (2002) Face detection in color images. IEEE Transactions on Pattern Analysis and Machine Intelligence, pp. 696–706. External Links: Document Cited by: §I.
  • [11] W. Hu and H. Hu (2019) Disentangled spectrum variations networks for nir-vis face recognition. IEEE Transactions on Multimedia. Cited by: §I.
  • [12] H. N. Kim, A. E. Saddik, and J. G. Jung (2012) Leveraging personal photos to inferring friendships in social network services. Expert Systems with Applications, pp. 6955–6966. Cited by: §II, §III-B1.
  • [13] D. Lu, X. Liu, and X. Qian (2016) Tag-based image search by social re-ranking. IEEE Transactions on Multimedia 18 (8), pp. 1628–1639. Cited by: §I.
  • [14] S. J. Oh, R. Benenson, M. Fritz, and B. Schiele (2017) Person recognition in social media photos. CoRR. External Links: Document Cited by: §II, §IV-A, §V-1, §V-1, TABLE II.
  • [15] E. Oro, C. Pizzuti, N. Procopio, and M. Ruffolo (2017) Detecting topic authoritative social media users: a multilayer network approach. IEEE Transactions on Multimedia 20 (5), pp. 1195–1208. Cited by: §I.
  • [16] E. G. Ortiz and B. C. Becker (2014) Face recognition for web-scale datasets. ELSEVIER Computer Vision and Image Understanding 118, pp. 153–170. External Links: Document Cited by: §I.
  • [17] O. M. Parkhi, A. Vedaldi, and A. Zisserman (2015) Deep face recognition. British Machine Vision Conference. Cited by: §I.
  • [18] U. Pfeil, R. Arjan, and P. Zaphiris (2009) Age differences in online social networking – a study of user profiles and the social capital divide among teenagers and older users in myspace. Computers in Human Behavior, pp. 643–654. External Links: Document Cited by: §I.
  • [19] J. Ramos (2003) Using tf-idf to determine word relevance in document queries. Note: mlittman/courses/ml03/iCML03/papers/ramos.pdf Cited by: §III-B2.
  • [20] A. S. Razavian, H. Azizpour, J. Sullivan, and S. Carlsson (2014) CNN features off-the-shelf: an astounding baseline for recognition. CoRR. Cited by: §I.
  • [21] S. A. Shah, U. Nadeem, M. Bennamoun, F. Sohel, and R. Togneri (2017) Efficient image set classification using linear regression based image reconstruction. Proceedings of the IEEE conference on computer vision and pattern recognition workshops, pp. 99–108. Cited by: §I.
  • [22] K. Simonyan and A. Zisserman (2014) Very deep convolutional networks for large-scale image recognition. CoRR. Cited by: §I.
  • [23] L. Sun, X. Wang, Z. Wang, H. Zhao, and W. Zhu (2016) Social-aware video recommendation for online social groups. IEEE Transactions on Multimedia 19 (3), pp. 609–618. Cited by: §I.
  • [24] P. Viola and M. J. Jones (2004) Robust real-time face detection. International journal of computer vision 57 (2), pp. 137–154. Cited by: §III-A.
  • [25] C. Weng, W. Chu, and J. Wu (2009) Rolenet: movie analysis from the perspective of social networks. IEEE Transactions on Multimedia 11 (2), pp. 256–271. Cited by: §I.
  • [26] L. Xu, T. Bao, L. Zhu, and Y. Zhang (2018) Trust-based privacy-preserving photo sharing in online social networks. IEEE Transactions on Multimedia 21 (3), pp. 591–602. Cited by: §I.
  • [27] J. Zhang, Y. Yang, L. Zhuo, Q. Tian, and X. Liang (2019) Personalized recommendation of social images by constructing a user interest tree with deep features and tag trees. IEEE Transactions on Multimedia 21 (11), pp. 2762–2775. Cited by: §I.
  • [28] N. Zhang, M. Paluri, Y. Taigman, R. Fergus, and L. Bourdev (2015) Beyond frontal faces: improving person recognition using multiple cues. CoRR. Cited by: §I, §III-A, §IV-A.
  • [29] Z. Zhang, J. Han, E. Coutinho, and B. Schuller (2018) Dynamic difficulty awareness training for continuous emotion prediction. IEEE Transactions on Multimedia 21 (5), pp. 1289–1301. Cited by: §I.
  • [30] Z. Zhao, Q. Yang, H. Lu, T. Weninger, D. Cai, X. He, and Y. Zhuang (2017) Social-aware movie recommendation via multimodal network learning. IEEE Transactions on Multimedia 20 (2), pp. 430–440. Cited by: §I.