Sentiment analysis based on images has an inherent subjectivity since it involves the visual recognition of objects, scenes, actions, and events. Therefore, the algorithms need to be robust, make use of different types of features and have a generalization capability to cover different domains. Even so, the problem is still challenging because distinct people may have different opinions (in terms of sentiment polarity) about the same image.
Nowadays, a considerable number of people share their experiences and opinions on the most diverse subjects on online social networks. This generates a vast amount of data, and its proper analysis plays an essential role in several segments, ranging from the prediction of spatial events (Zhao et al., 2016) and rumor analysis (Middleton and Krivcovs, 2016) to the study of urban social behavior (Mueller et al., 2017; Cranshaw et al., 2012). Specifically, analyzing the sentiment expressed by users in social networks has several applications, since a better understanding of their opinions about specific products, brands, places, or events can be useful in decision making.
Although sentiment analysis of textual content has already been developed considerably, its application to visual content is an active research topic (Campos et al., 2017; Chen et al., 2014; Dhall et al., 2015), inspired by the fact that image sharing has become prevalent (You et al., 2015; Parrett, 2016). The development of novel techniques for this purpose may complement text-based approaches (Ribeiro et al., 2016; Benevenuto et al., 2015; Zhou and Huang, 2017; Zhan et al., 2019), even those that use deep learning (Huang et al., 2017; Baly et al., 2016), as well as enable new services on platforms where the shared content is predominantly visual, such as Instagram, Snapchat, and Flickr (https://www.instagram.com, https://www.snapchat.com, and https://www.flickr.com). For the images illustrated in Figure 1, shared on Flickr, if one considers just the associated textual content, important characteristics of the sentiment visually expressed by users are not captured correctly.
The present study focuses on understanding the sentiment of urban outdoor images shared by users on social networks. Such images can carry valuable information about urban areas, whereas indoor images tend not to reflect specific characteristics of these scenarios (establishments in the same area probably have very different internal appearances). Therefore, it is important to explore the performance of sentiment analysis techniques in the specific context of outdoor areas, to further allow high-level tasks such as the semantic classification of urban areas.
Based on the remarkable advances that deep features from convolutional neural networks (ConvNets) are providing in several areas (LeCun et al., 2015), they are used in this work for sentiment analysis in urban outdoor images. Four different architectures were compared, three widely used in machine learning and one specifically designed for sentiment analysis (You et al., 2015). We extended these architectures to consider three polarities (positive, negative, and neutral) and also considered the combination of a set of classifiers, a strategy known as ensemble. In addition, to analyze the impact of scene attributes on sentiment analysis, we created a variation of each architecture by merging, in a fully connected layer, the convolutional features and the semantic attributes extracted from the SUN (Scene UNderstanding) attribute dataset (Patterson and Hays, 2012; Patterson et al., 2014). We carried out experiments on two different datasets and concluded that the incorporation of semantic features in the models tends to improve on the accuracy of previous initiatives for the context studied, especially for less complex ConvNets.
In summary, the main contributions of this work are four-fold:
The proposal of a novel dataset of geolocalized urban outdoor images, extracted from Instagram, and labeled as positive, negative, or neutral by five different volunteers. Most of the previous initiatives do not consider the neutral polarity, but we believe its introduction leads to results closer to reality, since the volunteers of our research classified a significant number of images as neutral.
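A comparison of four ConvNet architectures: VGG-16 (Simonyan and Zisserman, 2014), ResNet (He et al., 2016), and Inception (Szegedy et al., 2016), all fine-tuned with pre-trained ImageNet weights (Deng et al., 2009), and the state-of-the-art architecture of You et al. (You et al., 2015), designed specifically for sentiment analysis.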
The merging of deep features with knowledge related to semantics derived from the SUN attributes, initially designed for high-level scene understanding. They represent the main categories used by people in describing scenes. We observed that the use of such attributes tends to improve the accuracy of low-complexity ConvNet architectures.
The demonstration of the feasibility of our approach in a real-world scenario. We analyzed the sentiment in outdoor images shared in Chicago through Flickr, using new images obtained for this analysis. This evaluation shows that our approach could be useful to understand the subjective characteristics of areas and their inhabitants, for example. The results suggest that particular areas of the city tend to concentrate more images of a specific class of sentiment, having predominant inherent characteristics.
The remainder of this paper is organized as follows. Section 2 reviews the literature on sentiment analysis. Section 3 details the methodology used in this study. Section 4 discusses the results obtained. Section 5 presents an evaluation of our results in a real-world scenario. Section 6 presents potential implications and limitations of the study. Finally, Section 7 concludes the study.
2. Related Work
Automated sentiment analysis is essential for several tasks, including those related to the understanding of human behavior and decision-making support. For instance, Kramer et al. (Kramer et al., 2014) suggested that emotions expressed on Facebook can be transferred to other people without their awareness, which could cause large-scale emotional contagion in virtual communities. De Choudhury et al. (De Choudhury et al., 2013) estimated the risk of depression by analyzing behavioral attributes such as social engagement, emotion, language, and linguistic styles in tweets from users diagnosed with clinical depression. Golder and Macy (Golder and Macy, 2011) found that positive and negative emotions expressed on Twitter match well-known daytime and seasonal behavioral patterns in different cultures.
Algorithms for this active field of research are mostly concentrated on textual content and are broadly divided into two classes (Medhat et al., 2014): lexicon-based, which compute the text polarity by using a dictionary of words with associated semantic information; and learning-based, which use supervised learning to train a classifier to derive semantic relationships between text and sentiments. The combination of both approaches is also possible (Medhat et al., 2014). However, as pointed out in the survey of Zimbra et al. (Zimbra et al., 2018), state-of-the-art algorithms from both categories are still very far from achieving a satisfactory performance, with a classification accuracy often below 70% on Twitter benchmarks.
Comparatively little has been published on sentiment analysis of visual content, such as images and videos. Traditionally, algorithms for this task were based on low-level visual attributes such as colors (Siersdorfer et al., 2010), texture (Jia et al., 2012), image gradients (Li et al., 2012), metadata and speech transcripts (Hanjalic et al., 2012), and descriptors inspired by psychology and art theory (Machajdik and Hanbury, 2010). To assess the impact of high-level abstraction on sentiment analysis, Yuan et al. (Yuan et al., 2013) developed Sentribute, a framework that uses mid-level attributes such as material (metal, vegetation, …), functional (playing, cooking, …), surface (rusty, glossy, …), spatial (natural, man-made, …), and facial expressions. Extending the idea of Sentribute, Borth et al. (Borth et al., 2013) developed SentiBank, a detector library to analyze the visual concepts that are strongly related to sentiments by using 1,200 adjective-noun pairs mined from Flickr images.
In recent years, inspired by the breakthroughs of convolutional neural networks (ConvNets) (LeCun et al., 2015) in machine learning, many authors have proposed novel architectures for sentiment analysis. Cai and Xia (Cai and Xia, 2015) proposed combining two ConvNets, one fed with textual features and the other with visual features, to predict sentiments from tweets. Chen et al. (Chen et al., 2014) used transfer learning from ImageNet weights (Deng et al., 2009) in their ConvNet to deal with biased training data, which contains only images with strong sentiment polarity. You et al. (You et al., 2015) used a probabilistic ConvNet sampling to reduce the impact of noisy data by removing training instances with similar sentiment scores. Recurrent neural networks (Bengio et al., 1994), such as Long Short-Term Memory (LSTM) (Hochreiter and Schmidhuber, 1997), designed to recognize patterns in temporal sequences of data, were also used for sentiment analysis (Mou et al., 2016; Sun et al., 2016). Discriminative face features (Wu et al., 2016), extracted from ConvNet architectures, were also used to estimate people's happiness in the wild. While there are some initiatives showing the importance of outdoor images to the semantic classification of urban areas, such as the study of Santani et al. (Santani et al., 2018), which extracts labels for outdoor urban images, none of the previous studies focused on sentiment analysis of urban outdoor images.
The combination of a set of classifiers, a strategy known as ensemble, has been effectively used by many authors to improve classification performance. In 2017, Araque et al. (Araque et al., 2017) proposed to fuse deep and classic hand-crafted features for sentiment analysis of textual data in social applications by using an ensemble of classifiers. Also in 2017, Poria et al. (Poria et al., 2017) proposed a multimodal affective data analysis framework based on visual, audio, and textual data to create an ensemble application of convolutional neural networks for sentiment analysis in video content; and Al-Azani and El-Alfy (Al-Azani and El-Alfy, 2017) investigated the use of ensemble learning for sentiment analysis of tweets in Arabic dialect.
In the present study, in addition to a comparative study of different ConvNets that are well established in machine learning, we propose a new approach that incorporates into the classification process other attributes extracted directly from the images (without using metadata). We also analyze the classification performance of ensembles of ConvNets.
Sentiment analysis based on images has an inherent subjectivity since it involves the visual recognition of objects, scenes, actions, and events. Therefore, learning approaches need to be robust and have a generalization capability able to cover different domains.
In this context, we evaluate four different experimental setups, illustrated in Figure 2, with varying image datasets, as well as with and without combining the activation maps of the convolution layers with SUN (Patterson et al., 2014) attributes. SUN attributes represent the principal categories used by people to describe scenes and are intended to be an intermediate representation used in applications such as scene classification, search based on semantics, and automatic labeling, to mention some examples.
The first two experimental setups (Figures 2(a)-(b)) are based on a publicly available image dataset (DeepSent) where only positive and negative polarities are considered. The last two (Figures 2(c)-(d)) use the novel urban outdoor image dataset proposed in this work (OutdoorSent), a dataset that takes into account an additional polarity: neutral. Section 3.2 details these datasets.
Each setup compared four ConvNet architectures whose performance has excelled in different areas: VGG-16 (Simonyan and Zisserman, 2014), ResNet (He et al., 2016), and Inception (Szegedy et al., 2016), fine-tuned with pre-trained ImageNet weights (Deng et al., 2009), and the state-of-the-art architecture of You et al. (You et al., 2015), designed specifically for sentiment analysis. Finally, we evaluate the performance of ensembles by combining different ConvNets. Section 3.3 presents the proposed ConvNet framework for sentiment analysis in outdoor images.
In order to evaluate the generalization power of the selected ConvNets, we used datasets with images from different domains. Specifically, we considered one dataset composed of indoor and outdoor images (Section 3.2.1), and another one containing only outdoor images (Section 3.2.2).
3.2.1. Twitter dataset - DeepSent
The first dataset used in this work, known as DeepSent, consists of 1,269 images from Twitter and is available in (You et al., 2015). All samples were manually labeled as positive or negative by five people using the Amazon Mechanical Turk (AMT) crowd-sourcing platform (https://www.mturk.com).
The dataset is subdivided according to the consensus of the labels, that is, depending on how many people attributed the same sentiment to a given image. Table 1 details the distribution of the results, where “five agree” indicates that all five AMT workers labeled the image with the same sentiment, “four agree” indicates that at least four gave the same label and “three agree” that at least three agreed on the rating.
|Sentiment||Five agree||Four agree||Three agree|
Figure 3 shows examples of images labeled as positive and negative.
The dataset proposed in this study, called OutdoorSent, is composed only of outdoor images. We aim to facilitate studies that demand more representative characteristics of different urban areas, considering that indoor images of establishments in the same area can be quite different.
For this dataset, we took into account 40,516 publicly available Instagram images, all geolocated in the city of Chicago. From that initial dataset, we selected only the 19,411 images classified as outdoor by the ConvNet Places365 (Zhou et al., 2018), pre-trained on the Places2 dataset (a repository of eight million images). Specifically, the Residual Network architecture (WideResNet18) was used because it obtained the best trade-off between classification accuracy and speed.
To create a training dataset, a subset of 1,950 images was randomly selected. Each image was labeled based on the evaluation of five different volunteers, who graded it according to the sentiment it represents: 1 (negative), 2 (slightly negative), 3 (neutral), 4 (slightly positive), and 5 (positive). Since we use five different classes for the characterization of sentiment, the subdivision of the dataset according to consensus is not feasible. Thus, the final label was defined based on the average of the three intermediate grades, that is, excluding the highest and lowest grades:
Average below 2.5: negative;
Average between 2.5 and 3.5: neutral;
Average above 3.5: positive.
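The labeling rule above can be sketched in a few lines of Python. This is an illustrative sketch rather than the script used in the study; the function name `label_from_grades` is hypothetical. Because the average of three integer grades is always a multiple of 1/3, it can never be exactly 2.5 or 3.5, so the boundary cases do not arise in practice.

```python
def label_from_grades(grades):
    """Map five volunteer grades (1 to 5) to a sentiment label.

    The lowest and highest grades are dropped and the remaining three
    are averaged, following the rule described in the text.
    """
    if len(grades) != 5:
        raise ValueError("expected exactly five grades")
    middle = sorted(grades)[1:-1]        # drop the lowest and the highest grade
    average = sum(middle) / len(middle)
    if average < 2.5:
        return "negative"
    if average > 3.5:
        return "positive"
    return "neutral"                     # average between 2.5 and 3.5


# Hypothetical grades for three images
print(label_from_grades([1, 2, 2, 3, 5]))
print(label_from_grades([3, 3, 3, 3, 3]))
print(label_from_grades([5, 4, 4, 3, 1]))
```

Dropping the two extreme grades makes the label less sensitive to a single volunteer who disagrees strongly with the others.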
Each of the 30 volunteers (mostly undergraduate students) labeled 300 images divided into blocks of 15 images to make the process less tiring and error-prone (since users could respond to each block at different times). The interface was built using Google Forms, with each block of images explaining the purpose of the project and the methodology that should be used, accompanied by an example. The choice of the images shown in each block was random, and the forms were generated automatically with the aid of a script. Thus, it was possible to guarantee that each form contained images without repetitions. Each volunteer received 15 distinct forms, and the responses of each user were counted only once. Using internal management, we ensured that five different volunteers answered each form.
Table 2 details the class distribution. The difference between the number of samples in the different classes is due to the nature of the images of the dataset, which has more neutral and positive images.
|Sentiment||Quantity of images|
It should also be emphasized that the inclusion of the neutral class aims to make the classification more realistic. In addition to allowing people to classify images where it is not possible to attribute a positive or negative sentiment, it relaxes a very subjective task, since different people may have different opinions about the same image. Although previous studies have considered the neutral sentiment, in some cases it is attributed to inherently negative scenes, such as photos of fatal accidents that show the victim (Ahsan et al., 2017). For this reason, we believe our dataset is more representative.
Figure 4 illustrates examples of images labeled as positive, neutral, and negative.
3.3. ConvNet Framework
Most state-of-the-art architectures for visual content sentiment analysis are shallow in terms of convolutional layers, which limits their ability to extract deep features that could represent human sentiments. For instance, You et al. (You et al., 2015) used a ConvNet with only two convolutional layers in a seminal work for the field. However, as observed by LeCun et al. (LeCun et al., 2015), deep convolutional neural networks are essential for learning complex structures, because the first convolutional layers typically represent low-level features such as edges, orientations, and locations, while deeper layers combine these coarse features to recognize complex structures.
Therefore, we consider in our framework deep state-of-the-art ConvNet architectures widely used in machine learning: VGG-16 (Simonyan and Zisserman, 2014) (the smallest architecture with 13 layers of convolution), Resnet (He et al., 2016) (50 layers), and InceptionV3 (Szegedy et al., 2016) (42 layers). These architectures were fine-tuned with pre-trained ImageNet weights (Deng et al., 2009) to speed up the training phase. We also considered the architecture of You et al. (You et al., 2015) (2 convolutional layers) since it was explicitly designed for sentiment analysis. The ConvNet framework we developed for sentiment analysis of outdoor images is shown in Figure 5.
Another key aspect of our framework is the merging of convolutional activation maps with scene attributes. The feature descriptor fed to the first fully connected layer is the concatenation of the features extracted by the convolutional layers with the SUN attributes, so its dimensionality is the sum of the two. We believe that the synergistic combination of both feature types can capture information related to high-level semantics, improving the classification results. SUN attributes are extracted using the ConvNet Places365 (Zhou et al., 2018) pre-trained on the Places2 dataset.
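To make the merging step concrete, the following is a minimal sketch in plain Python, not the implementation used in this work. The vector sizes, the random weights, and the ReLU activation are assumptions chosen only for illustration; `merge_features` and `dense` are hypothetical helper names.

```python
import random


def merge_features(conv_features, sun_attributes):
    """Concatenate flattened convolutional features with scene attributes."""
    return list(conv_features) + list(sun_attributes)


def dense(inputs, weights, biases):
    """Fully connected layer (one weight row per output unit) with ReLU.

    The ReLU activation is an assumption made for this sketch.
    """
    outputs = []
    for row, bias in zip(weights, biases):
        z = sum(w * x for w, x in zip(row, inputs)) + bias
        outputs.append(max(0.0, z))
    return outputs


# Hypothetical sizes, chosen only to keep the example small.
N_CONV, N_SUN, N_HIDDEN = 8, 4, 3
rng = random.Random(0)
conv = [rng.random() for _ in range(N_CONV)]   # stands in for flattened activation maps
sun = [rng.random() for _ in range(N_SUN)]     # stands in for SUN attribute scores

merged = merge_features(conv, sun)             # descriptor of size N_CONV + N_SUN
weights = [[rng.uniform(-0.1, 0.1) for _ in merged] for _ in range(N_HIDDEN)]
biases = [0.0] * N_HIDDEN
hidden = dense(merged, weights, biases)        # output of the first fully connected layer
```

The point of the sketch is only that the fully connected layer sees both kinds of features in a single input vector, so its weights can learn interactions between them.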
During the initial tests, we observed that sentiment analysis polarization into a binary classification of positive and negative images is not suitable to represent the reality; therefore, we created the OutdoorSent image dataset - where the neutral label is considered. To process these images, we extended the proposed framework with a “neutral” state, as done by Ahsan et al. (Ahsan et al., 2017) for social event images (but not focused on outdoor images).
Finally, although much progress has been made in sentiment analysis by using deep learning techniques, substantial obstacles remain. Notably, a fundamental problem is how to build a robust network architecture that generalizes the knowledge acquired during training. The combination of a set of classifiers, a strategy known as ensemble, has been used to alleviate this problem. As pointed out by Dietterich (Dietterich, 2000), some reasons for that are: (i) the lack of training data compared to the size of the problem space produces many hypotheses with the same accuracy, and by using an ensemble we can average the votes of the classifiers, so as to reduce the risk of choosing the wrong one; (ii) the optimization may fail to converge to the correct minimum, and an ensemble with distinct starting points may approximate the optimal minimum; and (iii) the true function may not be representable by any single machine learning model in the space being searched, but an ensemble can expand this space.
To study the impact of ensembles for sentiment analysis, we analyze different combinations of ConvNets by averaging their class probabilities for final classification.
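The averaging strategy can be illustrated as follows. This is a small sketch under stated assumptions: the probabilities are made-up values for a single image, and `ensemble_predict` is a hypothetical helper, not code from the study.

```python
def ensemble_predict(prob_lists, classes):
    """Average class probabilities over models and return the top class.

    prob_lists holds one probability vector per model, all in the same
    class order as `classes`.
    """
    n_models = len(prob_lists)
    averaged = [
        sum(probs[i] for probs in prob_lists) / n_models
        for i in range(len(classes))
    ]
    best = max(range(len(classes)), key=lambda i: averaged[i])
    return classes[best], averaged


CLASSES = ["negative", "neutral", "positive"]

# Hypothetical outputs of three ConvNets for the same image
model_probs = [
    [0.10, 0.50, 0.40],   # this model alone would predict "neutral"
    [0.05, 0.35, 0.60],
    [0.15, 0.30, 0.55],
]

label, avg = ensemble_predict(model_probs, CLASSES)
print(label, [round(p, 3) for p in avg])
```

In this toy case the first model on its own would choose the neutral class, while the averaged probabilities favor the positive class.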
4. Experiments and Results
In this section, we compare the performance of the four experimental setups discussed in the previous section (Figure 2). Besides evaluating the accuracy of the results, we analyze the images classified as most likely to belong to each class. We also assess the performance of different ensembles of ConvNets.
To obtain more reliable results, we used 5-fold cross-validation: each image dataset is first partitioned into k = 5 equally sized folds. Then, k iterations of training and validation are performed, and within each iteration a different fold is held out for validation while the remaining k - 1 folds are used for learning.
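A minimal sketch of this splitting procedure is shown below. It is illustrative only; the shuffling seed and the helper name `k_fold_indices` are assumptions, and the study may have used a library routine instead.

```python
import random


def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train, validation) index lists for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Slicing with a stride of k gives folds whose sizes differ by at most one.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        validation = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, validation


# Example with a hypothetical dataset of 23 images
splits = list(k_fold_indices(23, k=5))
for train, validation in splits:
    print(len(train), len(validation))
```

Each image appears in exactly one validation fold, so every sample is used for validation once and for training k - 1 times.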
4.1. Experiments on DeepSent - Twitter Dataset
For the first two experimental setups (Figures 2(a)-(b)), we use the three subsets of the DeepSent dataset (three-agree, four-agree, and five-agree). As this dataset has only two classes (positive and negative), the last layer of each neural network architecture has two neurons.
Table 3 shows the accuracies obtained for each architecture considered in the experimental setups using the DeepSent dataset. The last two rows show the results for two hand-crafted algorithms based on color histograms (GCH and LCH) by using the same dataset.
|Architecture||SUN attributes||Five agree||Four agree||Three agree|
|You et al. (You et al., 2015)||without||73.8||72.6||69.2|
|You et al. (You et al., 2015)||with||83.2||80.5||77.2|
|VGG16||without||81.6||79.5||73.4|
|VGG16||with||82.0||79.9||73.9|
|ResNet50||without||89.9||85.8||84.3|
|ResNet50||with||89.0||85.7||82.7|
|InceptionV3||without||89.4||85.0||82.2|
|InceptionV3||with||89.8||85.7||82.0|
|GCH (You et al., 2015)||n/a||68.4||66.5||66.0|
|LCH (You et al., 2015)||n/a||71.0||67.1||66.4|
Note that the ConvNet approaches improve significantly over the hand-crafted algorithms. The insertion of the SUN attributes improved the results of the simplest networks, increasing the accuracy of the ConvNet of You et al. (You et al., 2015) from 73.8% to 83.2%. However, no significant impact is observed for the more complex ConvNets, and the results even degrade slightly for ResNet50.
For each model that makes use of the SUN attributes, the five images that were most likely to belong to each class were selected. Figures 6 and 7 show the selected images for the negative and positive classes, respectively. As we can see in those figures, VGG16 (second row) erroneously classifies one positive and one negative sample. ResNet50 (third row), in turn, incorrectly classifies a positive image.
It is interesting to note that none of the top-ranked five images for all the scenarios evaluated appears in more than one architecture. This is not necessarily a problem, but just a suggestion that the learning process is different for each considered architecture.
4.2. Experiments on OutdoorSent
Table 4 shows the accuracy results for all ConvNets. As we can see, they are lower than those observed for the Twitter dataset (DeepSent). This was expected, since this dataset has three classes instead of two and, despite having only outdoor images, covers a wider range of scenarios. Again, the use of SUN attributes improved the accuracy of the simplest architecture but had a low impact on the complex ones.
|Architecture||without SUN attributes||with SUN attributes|
|You et al. (You et al., 2015)||51.8||57.5|
Besides accuracy, the recall (fraction of relevant cases that have been retrieved over the total amount of relevant cases) was also calculated for each sentiment class (Positive, Neutral, and Negative). Table 5 shows the values obtained in the tests with and without SUN attributes. As can be seen, the incorporation of SUN attributes increased the hit rate of the negative and positive classes but reduced it for the neutral class. The negative class was the one with the lowest recall values, and, for this case, SUN attributes helped to improve the results considerably. For instance, the ConvNet of You et al. (You et al., 2015) presented a recall of 0.0% without SUN attributes and improved this metric to 23.0% when considering them. The best result was provided by InceptionV3 with SUN attributes (recall of 33.0%), almost twice the value obtained without SUN attributes.
|Architecture||SUN attributes||Negative||Neutral||Positive|
|You et al. (You et al., 2015)||without||0.0||81.0||43.0|
|You et al. (You et al., 2015)||with||23.0||74.0||55.0|
|VGG16||without||6.0||72.0||52.0|
|VGG16||with||9.0||72.0||57.0|
|ResNet50||without||27.0||75.0||53.0|
|ResNet50||with||28.0||71.0||57.0|
|InceptionV3||without||18.0||74.0||61.0|
|InceptionV3||with||33.0||67.0||63.0|
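For reference, the per-class recall discussed above can be computed as in the sketch below. The labels are made-up toy data, not results from the experiments, and `per_class_recall` is a hypothetical helper name.

```python
def per_class_recall(true_labels, predicted_labels, classes):
    """Recall per class: correctly retrieved samples over all samples of that class."""
    recall = {}
    for c in classes:
        relevant = sum(1 for t in true_labels if t == c)
        hits = sum(
            1 for t, p in zip(true_labels, predicted_labels) if t == c and p == c
        )
        # A class with no samples has undefined recall.
        recall[c] = hits / relevant if relevant else float("nan")
    return recall


CLASSES = ["negative", "neutral", "positive"]

# Toy ground truth and predictions (hypothetical)
y_true = ["negative"] * 2 + ["neutral"] * 4 + ["positive"] * 4
y_pred = [
    "neutral", "negative",                          # true negatives: 1 of 2 found
    "neutral", "neutral", "positive", "neutral",    # true neutrals: 3 of 4 found
    "positive", "positive", "positive", "neutral",  # true positives: 3 of 4 found
]

recalls = per_class_recall(y_true, y_pred, CLASSES)
print(recalls)
```

Reporting recall per class is useful here because the classes are imbalanced: a model can reach reasonable overall accuracy while almost never retrieving negative images.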
For each model that makes use of the SUN attributes, the five images that had the highest probability of belonging to each class were selected. Figures 8, 9, and 10 show the selected images for the positive, neutral, and negative classes, respectively. Among these images, only one is repeated, indicating that each architecture uses different characteristics of the images for classification.
In the positive class, which achieved the best performance for all models, ResNet50 includes a neutral and a negative image, and InceptionV3 includes an image of the neutral class. Among the images of the neutral class, two negative and three positive images were included (the only ConvNet that had no error in the first five images was InceptionV3). All architectures made misclassifications for some class, and, in general, the most problematic case was the negative class, where six neutral images (four of them by VGG16) and one positive image were classified as negative.
4.3. Results for Ensemble of Classifiers
This section presents results regarding ensembles of ConvNets. For these experiments, we used as the final classification the average class probabilities of the ConvNets that compose the ensemble, i.e., each image receives a probability of belonging to a particular class of sentiment for each ConvNet in the ensemble; thus, the final label refers to the highest average probability towards a specific class. The classification accuracy for three distinct ensembles considering the OutdoorSent dataset is shown in Table 6.
|without SUN attributes||57.34|
|with SUN attributes||59.38|
|without SUN attributes||58.92|
|with SUN attributes||59.79|
|without SUN attributes||58.87|
|with SUN attributes||60.26|
We show these ensembles because they led to the best results. Observe that the ensemble composed of one ConvNet designed explicitly for sentiment analysis (You et al. (You et al., 2015)) and two ConvNets widely used in machine learning (ResNet and Inception) achieved the best accuracy. However, the gain is at most 1.26% compared to each architecture separately. We conjecture that these ConvNets do not have a reasonable level of complementarity. Thus, according to the obtained results, we believe that the gain provided by an ensemble strategy does not justify its additional complexity and computational cost.
5. Outdoor Sentiment in the City
To evaluate the proposed approach in a real-world scenario, we selected an image subset from a publicly available dataset of media objects (images and videos) posted on Flickr from 2004 to 2014 (Thomee et al., 2016). We first extracted only geocoded images posted from Chicago, United States, obtaining 53,550 images. Next, we filtered out images whose geolocation matches any building of the city. For this task, we explored Chicago’s building footprints, which are publicly available in Chicago’s official open data portal (Chicago Open Data, https://data.cityofchicago.org). After this filter, we ended up with 17,469 images in the dataset.
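The building filter amounts to a point-in-polygon test for each image location. The sketch below uses the classic ray-casting test in plain Python. It is an illustration under assumptions, not the procedure used in the study: coordinates are treated as planar, each footprint is a simple polygon given as a list of vertices, and the helper names are hypothetical.

```python
def point_in_polygon(x, y, polygon):
    """Ray-casting test: True if (x, y) lies inside the polygon.

    polygon is a list of (x, y) vertices. Points lying exactly on an
    edge are not handled specially.
    """
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does a horizontal ray starting at (x, y) cross this edge?
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def keep_outdoor(points, footprints):
    """Drop every point that falls inside any building footprint."""
    return [
        p for p in points
        if not any(point_in_polygon(p[0], p[1], f) for f in footprints)
    ]


# Toy example: one square "building" and two image locations
square = [(0, 0), (2, 0), (2, 2), (0, 2)]
locations = [(1, 1), (3, 1)]
outdoor = keep_outdoor(locations, [square])
print(outdoor)
```

With tens of thousands of images and a large number of footprints, a spatial index would normally be used to avoid testing every point against every polygon; the brute-force loop here is only meant to show the logic.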
When applying the ConvNet of You et al. (You et al., 2015) with SUN attributes to infer the associated sentiment (our preferred approach due to its simplicity and its results, which are comparable with those of the other approaches), we obtained 8,492 positive, 8,584 neutral, and 393 negative images. Figure 11 presents the heatmaps for each class (positive, neutral, and negative) according to image geolocation.
Observe that downtown has a higher concentration of images in all cases, as expected. However, the region of highest concentration within this area differs somewhat among the sentiment classes, especially for the negative one. In addition, it is also interesting to note a more prominent occurrence of positive images over the water, a rare phenomenon for negative photos. These differences are an indication that image sentiment could help to understand the subjective characteristics of different areas of the city.
In order to favor this sort of evaluation, we performed a density-based clustering process using DBSCAN (Ester et al., 1996), an algorithm that finds high-density regions separated from one another by low-density regions. Points (images, in our case) in low-density regions are classified as noise and ignored. DBSCAN requires two parameters: the minimum number of points needed to form a dense region (MinPts) and the neighborhood radius (Eps).
We ran this clustering process separately for positive, neutral, and negative images. The positive and neutral images shared one setting of the two parameters, whereas the parameters for the negative images were set separately. Parameter values were chosen following the approach proposed by Ester et al. (Ester et al., 1996).
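For readers unfamiliar with DBSCAN, the following is a compact pure-Python sketch of the algorithm. It is not the implementation used in the study: it uses planar Euclidean distance and a brute-force neighbor search, and the toy points and parameter values (`eps=1.5`, `min_pts=3`) are assumptions chosen only for illustration.

```python
import math

NOISE = -1


def dbscan(points, eps, min_pts):
    """Minimal DBSCAN: returns one cluster id per point, -1 for noise.

    eps is the neighborhood radius and min_pts the minimum number of
    points (the point itself included) needed to form a dense region.
    """
    def neighbors(i):
        return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE              # may later be claimed as a border point
            continue
        cluster += 1                       # i is a core point: start a new cluster
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster        # border point reached from a core point
            if labels[j] is not None:
                continue                   # already assigned, do not expand again
            labels[j] = cluster
            j_neighbors = neighbors(j)
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors)  # j is also a core point: keep expanding
    return labels


# Toy data: two dense groups and one isolated point
toy_points = [
    (0, 0), (0, 1), (1, 0), (1, 1),
    (10, 10), (10, 11), (11, 10), (11, 11),
    (50, 50),
]
toy_labels = dbscan(toy_points, eps=1.5, min_pts=3)
print(toy_labels)
```

For geographic coordinates, a distance that accounts for the curvature of the Earth (for example, the haversine distance) would be a more appropriate choice than the planar distance used in this sketch.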
Figures 12, 13, and 14 show the clustering results for positive, neutral, and negative images, respectively, thus highlighting areas according to this feature. Apart from downtown and some parts of the coast, we do not see much agreement between negative clusters and positive (and neutral) ones.
The performance of the positive and neutral classification, in general, is good. However, one could argue that a few positive images should be neutral and vice versa. For instance, most people could classify the baseball stadium in the bottom of Figure 13 as positive. Those cases are expected to happen due to the subjective nature of the problem.
Analyzing the negative images reported by our classifier (Figure 14), we can find more significant mistakes. For instance, the negative set includes a smiling baby, a family having a picnic in the park, and some animals resting in the zoo. Despite these problems, we believe that the performance is acceptable for a wide range of tasks because most of the instances were correctly classified as negative. A possible way to minimize such errors would be to perform an aggregated analysis, for instance, by considering a group of nearby images in the same area.
To investigate the association of the classified images with some demographic characteristics of geographic areas, we also performed an analysis considering the median income of households per census tract. We obtained the publicly available data provided in the American Community Survey, administered by the United States Census Bureau (https://www.census.gov/programs-surveys/acs). We considered the study of 2015 because it covers the most recent period of the Flickr dataset studied.
We then mapped each image to one census tract and grouped the images of each class according to the median household income per year of the tract into three bands: Low Income, Medium Income, and High Income.
Figure 15 presents the results, as well as some representative images (indicated by arrows) for low and high income. In general, in the low income areas, we have fewer images of all sentiment classes and a slightly smaller number of positive images. This tendency is also observed for the medium income areas. However, in the high income areas this tendency is reversed, and we have a significantly larger number of positive images. This result is interesting because it suggests that different classes of images correlate with certain areas of the city.
Investigating the images of the low and high income groups, we find that the overall classification is coherent. As illustrated in Figure 15, most of the negative and positive images in all groups tend to reflect the associated sentiment correctly, especially the positive ones. Again, probable mistakes are more frequent in the negative class, especially in the high income group. One could argue, for example, that the three bottom examples for the negative class of the high income group might be incorrectly classified. The same is valid for the bottom image of the examples for the negative class of the low income group.
We also observe some differences between groups regarding the same sentiment. For instance, positive images in the low income group tend to reflect more outdoor sports, nature, and houses. Nevertheless, for the high income group, we tend to observe more images related to sculptures, buildings, and sights.
Recent studies have shown that some information obtained from social media has the power to change our perceived physical limits as well as to help better understand the dynamics of cities. In this direction, there are some efforts to classify city areas under different aspects, for example, regarding the smell, cultural habits, noise, and visual aspects (Maisonneuve et al., 2009; Quercia et al., 2014; Quercia et al., 2015; Le Falher et al., 2015; Silva et al., 2017; Mueller et al., 2017).
This classification of urban areas may be useful for a variety of new applications and services. An example would be a new route suggestion tool that suggests the most visually pleasing route, which might be interesting for users spending leisure time in the city, as discussed by Quercia et al. (2014). Our study has the potential to complement these proposals by considering another aspect in this direction: the sentiment about the urban outdoor environment.
In addition, the information that can be obtained automatically from our work can help fields of study where the collection of similar information occurs in ways that do not scale easily, such as interviews and questionnaires. From the categorization of outdoor areas of the city according to the sentiment perceived by users, new large-scale socioeconomic studies can be developed. By correlating these results with demographic indicators such as income and occurrence of crimes, non-obvious patterns can be understood, which may be useful for better urban planning and strategic public policies.
In this sense, it is worth mentioning the theory of “broken windows” proposed by Kelling & Coles. The idea behind this theory is that the appearance of outdoor areas can affect the actual safety of a neighborhood: a broken window leads to another and, in turn, to future crimes (Kelling and Coles, 1997). This and other theories can be revisited at large scale by building on our study.
Our study has some possible limitations. One concerns the labeling of the OutdoorSent dataset. Although we counted on the collaboration of a diverse group of volunteers, we do not have the opinion of all population strata. This means that the labels can be biased toward specific groups of people. Another limiting factor of our proposal is the performance of our classifier. This is associated with the type of images chosen, which is more challenging because of the greater diversity of possible variations. It is noteworthy that we improved the state of the art, but we believe it may be possible to improve the classification performance with other approaches and techniques in future efforts.
This study investigates if semantic attributes (SUN attributes) help to enhance the performance of ConvNets for sentiment analysis in outdoor images. Our experiments considered four different ConvNets, three frequently used in the machine learning area and one designed for sentiment analysis in images.
We also considered two different datasets in the experiments, one proposed in this study (OutdoorSent), and another one publicly available (DeepSent). We find that semantic features tend to improve classification performance, and this improvement is more pronounced for less complex ConvNets.
In addition, we also evaluated an ensemble of classifiers. This strategy improves the results slightly. However, we believe that the complexity introduced by an ensemble strategy does not pay off in the studied problem. Recall that the ConvNet of You et al. (2015) uses only two convolutional layers, being much simpler than the other investigated ConvNets. For this ConvNet, the incorporation of semantic attributes improves the results considerably, making them comparable to the results observed for the more complex ConvNets. This makes this strategy interesting for the problem under study.
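To make the notion of an ensemble of classifiers concrete, the sketch below shows soft voting, one common way to combine models: the class-probability vectors produced by several classifiers for the same image are averaged, and the class with the highest mean probability is chosen. This is an assumption for illustration only; the section does not state which combination rule was used. The function name `soft_vote` and the default class names are hypothetical.

```python
def soft_vote(prob_vectors, labels=("negative", "neutral", "positive")):
    """Combine per-classifier class probabilities by averaging.

    prob_vectors: list of probability lists, one per classifier,
                  each with one entry per class in `labels`.
    labels: class names (assumed here; adjust to the task at hand).
    Returns the winning label and the averaged probabilities.
    """
    n = len(prob_vectors)
    avg = [sum(p[i] for p in prob_vectors) / n for i in range(len(labels))]
    # Index of the class with the highest mean probability.
    best = max(range(len(labels)), key=avg.__getitem__)
    return labels[best], avg
```

Every classifier in the ensemble must be evaluated for each image, which is part of the added complexity that has to be weighed against the slight improvement in results.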
We also showed in a real-world scenario that our results can help to understand the subjective characteristics of different areas of the city, helping to leverage new services.
This work is partially supported by the project CNPq-URBCOMP (#403260/2016-7). We gratefully acknowledge the support of NVIDIA Corporation with the donation of the Titan Xp GPU used for this research. The authors would also like to thank the research agencies CAPES, CNPq, Fundação Araucária, FAPEMIG, and FAPESP.
- Ahsan et al. (2017) U. Ahsan, M. De Choudhury, and I. Essa. 2017. Towards using visual attributes to infer image sentiment of social events. In International Joint Conference on Neural Networks (IJCNN). Anchorage, Alaska, United States, 1372–1379. https://doi.org/10.1109/IJCNN.2017.7966013
- Al-Azani and El-Alfy (2017) Sadam Al-Azani and El-Sayed M. El-Alfy. 2017. Using Word Embedding and Ensemble Learning for Highly Imbalanced Data Sentiment Analysis in Short Arabic Text. In International Conference on Ambient Systems, Networks and Technologies. Madeira, Portugal, 359 – 366. https://doi.org/10.1016/j.procs.2017.05.365
- Araque et al. (2017) Oscar Araque, Ignacio Corcuera-Platas, J. Fernando Sánchez-Rada, and Carlos A. Iglesias. 2017. Enhancing deep learning sentiment analysis with ensemble techniques in social applications. Expert Systems with Applications 77 (2017), 236 – 246. https://doi.org/10.1016/j.eswa.2017.02.002
- Baly et al. (2016) Ramy Baly, Roula Hobeica, Hazem Hajj, Wassim El-Hajj, Khaled Bashir Shaban, and Ahmad Al-Sallab. 2016. A Meta-Framework for Modeling the Human Reading Process in Sentiment Analysis. ACM Transactions on Information Systems (TOIS) 35, 1, Article 7 (Aug. 2016), 21 pages. https://doi.org/10.1145/2950050
- Benevenuto et al. (2015) Fabrício Benevenuto, Matheus Araújo, and Filipe Ribeiro. 2015. Sentiment Analysis Methods for Social Media. In Proceedings of Brazilian Symposium on Multimedia and the Web (WebMedia). ACM, Manaus, Brazil, 11–11. https://doi.org/10.1145/2820426.2820642
- Bengio et al. (1994) Y. Bengio, P. Simard, and P. Frasconi. 1994. Learning Long-term Dependencies with Gradient Descent is Difficult. Trans. Neur. Netw. 5, 2 (March 1994), 157–166. https://doi.org/10.1109/72.279181
- Borth et al. (2013) Damian Borth, Rongrong Ji, Tao Chen, Thomas Breuel, and Shih-Fu Chang. 2013. Large-scale Visual Sentiment Ontology and Detectors Using Adjective Noun Pairs. In Proceedings of the ACM International Conference on Multimedia. ACM, Barcelona, Spain, 223–232. https://doi.org/10.1145/2502081.2502282
- Cai and Xia (2015) Guoyong Cai and Binbin Xia. 2015. Convolutional Neural Networks for Multimedia Sentiment Analysis. In Natural Language Processing and Chinese Computing. Springer International Publishing, Cham, 159–167. https://doi.org/10.1007/978-3-319-25207-0_14
- Campos et al. (2017) Víctor Campos, Brendan Jou, and Xavier Giró i Nieto. 2017. From pixels to sentiment: Fine-tuning CNNs for visual sentiment prediction. Image and Vision Computing 65 (2017), 15 – 22. https://doi.org/10.1016/j.imavis.2017.01.011
- Chen et al. (2014) Tao Chen, Damian Borth, Trevor Darrell, and Shih-Fu Chang. 2014. DeepSentiBank: Visual Sentiment Concept Classification with Deep Convolutional Neural Networks. CoRR abs/1410.8586 (2014). arXiv:1410.8586 http://arxiv.org/abs/1410.8586
- Cranshaw et al. (2012) Justin Cranshaw, Raz Schwartz, Jason I. Hong, and Norman M. Sadeh. 2012. The Livehoods Project: Utilizing Social Media to Understand the Dynamics of a City. In International AAAI Conference on Weblogs and Social Media. Dublin, Ireland.
- De Choudhury et al. (2013) Munmun De Choudhury, Michael Gamon, Scott Counts, and Eric Horvitz. 2013. Predicting Depression via Social Media. In International AAAI Conference on Weblogs and Social Media. Boston, United States, 1–10.
- Deng et al. (2009) J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. 2009. ImageNet: A large-scale hierarchical image database. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Miami, United States, 248–255. https://doi.org/10.1109/CVPR.2009.5206848
- Dhall et al. (2015) A. Dhall, R. Goecke, and T. Gedeon. 2015. Automatic Group Happiness Intensity Analysis. IEEE Transactions on Affective Computing 6, 1 (Jan 2015), 13–26. https://doi.org/10.1109/TAFFC.2015.2397456
- Dietterich (2000) Thomas G. Dietterich. 2000. Ensemble Methods in Machine Learning. In Proceedings of the International Workshop on Multiple Classifier Systems. Springer-Verlag, London, UK, 1–15. http://dl.acm.org/citation.cfm?id=648054.743935
- Ester et al. (1996) Martin Ester, Hans-Peter Kriegel, Jörg Sander, and Xiaowei Xu. 1996. A Density-based Algorithm for Discovering Clusters in Large Spatial Databases with Noise. In Proceedings of the International Conference on Knowledge Discovery and Data Mining. AAAI Press, Portland, Oregon, 226–231. http://dl.acm.org/citation.cfm?id=3001460.3001507
- Golder and Macy (2011) Scott A Golder and Michael W Macy. 2011. Diurnal and seasonal mood vary with work, sleep, and daylength across diverse cultures. Science 333, 6051 (2011), 1878–1881. https://doi.org/10.1126/science.1202775
- Hanjalic et al. (2012) Alan Hanjalic, Christoph Kofler, and Martha Larson. 2012. Intent and Its Discontents: The User at the Wheel of the Online Video Search Engine. In Proceedings of the ACM International Conference on Multimedia. ACM, Nara, Japan, 1239–1248. https://doi.org/10.1145/2393347.2396424
- He et al. (2016) K. He, X. Zhang, S. Ren, and J. Sun. 2016. Deep Residual Learning for Image Recognition. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Las Vegas, United States, 770–778. https://doi.org/10.1109/CVPR.2016.90
- Hochreiter and Schmidhuber (1997) Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long Short-Term Memory. Neural Computation 9, 8 (1997), 1735–1780. https://doi.org/10.1162/neco.1997.9.8.1735
- Huang et al. (2017) Minlie Huang, Qiao Qian, and Xiaoyan Zhu. 2017. Encoding Syntactic Knowledge in Neural Networks for Sentiment Classification. ACM Transactions on Information Systems (TOIS) 35, 3, Article 26 (June 2017), 27 pages. https://doi.org/10.1145/3052770
- Jia et al. (2012) Jia Jia, Sen Wu, Xiaohui Wang, Peiyun Hu, Lianhong Cai, and Jie Tang. 2012. Can We Understand Van Gogh’s Mood?: Learning to Infer Affects from Images in Social Networks. In Proceedings of ACM International Conference on Multimedia. ACM, Nara, Japan, 857–860. https://doi.org/10.1145/2393347.2396330
- Kelling and Coles (1997) George L Kelling and Catherine M Coles. 1997. Fixing broken windows: Restoring order and reducing crime in our communities. Simon and Schuster.
- Kramer et al. (2014) Adam DI Kramer, Jamie E Guillory, and Jeffrey T Hancock. 2014. Experimental evidence of massive-scale emotional contagion through social networks. Proceedings of the National Academy of Sciences 111, 24 (2014), 8788–8790.
- Le Falher et al. (2015) Géraud Le Falher, Aristides Gionis, and Michael Mathioudakis. 2015. Where is the Soho of Rome? Measures and algorithms for finding similar neighborhoods in cities. In International AAAI Conference on Weblogs and Social Media. Oxford, UK.
- LeCun et al. (2015) Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. 2015. Deep learning. Nature 521, 7553 (May 2015), 436–444. http://dx.doi.org/10.1038/nature14539
- Li et al. (2012) Bing Li, Songhe Feng, Weihua Xiong, and Weiming Hu. 2012. Scaring or pleasing: exploit emotional impact of an image. In Proceedings of the 20th ACM international conference on Multimedia. ACM, Nara, Japan, 1365–1366.
- Machajdik and Hanbury (2010) Jana Machajdik and Allan Hanbury. 2010. Affective image classification using features inspired by psychology and art theory. In Proceedings of the 18th ACM international conference on Multimedia. ACM, Firenze, Italy, 83–92.
- Maisonneuve et al. (2009) Nicolas Maisonneuve, Matthias Stevens, Maria E Niessen, and Luc Steels. 2009. NoiseTube: Measuring and mapping noise pollution with mobile phones. In Information technologies in environmental engineering. Springer, Thessaloniki, Greece, 215–228.
- Medhat et al. (2014) Walaa Medhat, Ahmed Hassan, and Hoda Korashy. 2014. Sentiment analysis algorithms and applications: A survey. Ain Shams Engineering Journal 5, 4 (2014), 1093 – 1113. https://doi.org/10.1016/j.asej.2014.04.011
- Middleton and Krivcovs (2016) Stuart E. Middleton and Vadims Krivcovs. 2016. Geoparsing and Geosemantics for Social Media: Spatiotemporal Grounding of Content Propagating Rumors to Support Trust and Veracity Analysis During Breaking News. ACM Transactions on Information Systems (TOIS) 34, 3, Article 16 (April 2016), 26 pages. https://doi.org/10.1145/2842604
- Mou et al. (2016) Wenxuan Mou, Hatice Gunes, and Ioannis Patras. 2016. Automatic recognition of emotions and membership in group videos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops. Las Vegas, United States, 27–35.
- Mueller et al. (2017) Willi Mueller, Thiago H. Silva, Jussara M. Almeida, and Antonio AF Loureiro. 2017. Gender matters! Analyzing global cultural gender preferences for venues using social sensing. EPJ Data Science 6, 1 (2017), 5. https://doi.org/10.1140/epjds/s13688-017-0101-0
- Parrett (2016) George Parrett. 2016. 3.5 million photos shared every minute in 2016. Deloitte. https://goo.gl/uwF81P.
- Patterson and Hays (2012) Genevieve Patterson and James Hays. 2012. SUN Attribute Database: Discovering, Annotating, and Recognizing Scene Attributes. In Proceedings of the 25th Conference on Computer Vision and Pattern Recognition (CVPR). Providence, Rhode Island, United States.
- Patterson et al. (2014) Genevieve Patterson, Chen Xu, Hang Su, and James Hays. 2014. The SUN Attribute Database: Beyond Categories for Deeper Scene Understanding. International Journal of Computer Vision 108, 1-2 (2014), 59–81.
- Poria et al. (2017) Soujanya Poria, Haiyun Peng, Amir Hussain, Newton Howard, and Erik Cambria. 2017. Ensemble application of convolutional neural networks and multiple kernel learning for multimodal sentiment analysis. Neurocomputing 261 (2017), 217 – 230. https://doi.org/10.1016/j.neucom.2016.09.117 Advances in Extreme Learning Machines (ELM 2015).
- Quercia et al. (2014) Daniele Quercia, Rossano Schifanella, and Luca Maria Aiello. 2014. The shortest path to happiness: Recommending beautiful, quiet, and happy routes in the city. In Proceedings of the 25th ACM conference on Hypertext and social media. Santiago, Chile, 116–125.
- Quercia et al. (2015) Daniele Quercia, Rossano Schifanella, Luca Maria Aiello, and Kate McLean. 2015. Smelly maps: the digital life of urban smellscapes. arXiv preprint arXiv:1505.06851 (2015).
- Ribeiro et al. (2016) Filipe N. Ribeiro, Matheus Araújo, Pollyanna Gonçalves, Marcos André Gonçalves, and Fabrício Benevenuto. 2016. SentiBench - a benchmark comparison of state-of-the-practice sentiment analysis methods. EPJ Data Science 5, 1 (2016), 1–29. https://doi.org/10.1140/epjds/s13688-016-0085-1
- Santani et al. (2018) Darshan Santani, Salvador Ruiz-Correa, and Daniel Gatica-Perez. 2018. Looking South: Learning Urban Perception in Developing Cities. ACM Trans. Soc. Comput. 1, 3, Article 13 (Dec. 2018), 23 pages. https://doi.org/10.1145/3224182
- Siersdorfer et al. (2010) Stefan Siersdorfer, Enrico Minack, Fan Deng, and Jonathon Hare. 2010. Analyzing and predicting sentiment of images on the social web. In Proceedings of the 18th ACM international conference on Multimedia. ACM, Firenze, Italy, 715–718.
- Silva et al. (2017) Thiago H. Silva, Pedro O.S. Vaz de Melo, Jussara M. Almeida, Mirco Musolesi, and Antonio A.F. Loureiro. 2017. A large-scale study of cultural differences using urban data about eating and drinking preferences. Information Systems 72, Supplement C (2017), 95 – 116. https://doi.org/10.1016/j.is.2017.10.002
- Simonyan and Zisserman (2014) Karen Simonyan and Andrew Zisserman. 2014. Very Deep Convolutional Networks for Large-Scale Image Recognition. CoRR abs/1409.1556 (2014). arXiv:1409.1556 http://arxiv.org/abs/1409.1556
- Sun et al. (2016) Bo Sun, Qinglan Wei, Liandong Li, Qihua Xu, Jun He, and Lejun Yu. 2016. LSTM for dynamic emotion and group emotion recognition in the wild. In Proceedings of the 18th ACM International Conference on Multimodal Interaction. ACM, Tokyo, Japan, 451–457.
- Szegedy et al. (2016) C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna. 2016. Rethinking the Inception Architecture for Computer Vision. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Las Vegas, United States, 2818–2826. https://doi.org/10.1109/CVPR.2016.308
- Thomee et al. (2016) Bart Thomee, David A. Shamma, Gerald Friedland, Benjamin Elizalde, Karl Ni, Douglas Poland, Damian Borth, and Li-Jia Li. 2016. YFCC100M: The New Data in Multimedia Research. Commun. ACM 59, 2 (Jan. 2016), 64–73. https://doi.org/10.1145/2812802
- Wu et al. (2016) Jianlong Wu, Zhouchen Lin, and Hongbin Zha. 2016. Multi-view common space learning for emotion recognition in the wild. In Proceedings of the 18th ACM International Conference on Multimodal Interaction. ACM, Tokyo, Japan, 464–471.
- You et al. (2015) Quanzeng You, Jiebo Luo, Hailin Jin, and Jianchao Yang. 2015. Robust Image Sentiment Analysis Using Progressively Trained and Domain Transferred Deep Networks. In Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence (AAAI’15). AAAI Press, Austin, Texas, 381–388. http://dl.acm.org/citation.cfm?id=2887007.2887061
- Yuan et al. (2013) Jianbo Yuan, Sean Mcdonough, Quanzeng You, and Jiebo Luo. 2013. Sentribute: Image Sentiment Analysis from a Mid-level Perspective. In Proceedings of the International Workshop on Issues of Sentiment Discovery and Opinion Mining. ACM, Chicago, Illinois, Article 10, 8 pages. https://doi.org/10.1145/2502069.2502079
- Zhan et al. (2019) Xueying Zhan, Yaowei Wang, Yanghui Rao, and Qing Li. 2019. Learning from Multi-annotator Data: A Noise-aware Classification Framework. ACM Transactions on Information Systems (TOIS) 37, 2, Article 26 (Feb. 2019), 28 pages. https://doi.org/10.1145/3309543
- Zhao et al. (2016) Liang Zhao, Feng Chen, Chang-Tien Lu, and Naren Ramakrishnan. 2016. Online Spatial Event Forecasting in Microblogs. ACM Transactions on Spatial Algorithms and Systems (TSAS) 2, 4, Article 15 (Nov. 2016), 39 pages. https://doi.org/10.1145/2997642
- Zhou et al. (2018) B. Zhou, A. Lapedriza, A. Khosla, A. Oliva, and A. Torralba. 2018. Places: A 10 Million Image Database for Scene Recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence 40, 6 (June 2018), 1452–1464. https://doi.org/10.1109/TPAMI.2017.2723009
- Zhou and Huang (2017) Guang-You Zhou and Jimmy Xiangji Huang. 2017. Modeling and Mining Domain Shared Knowledge for Sentiment Analysis. ACM Transactions on Information Systems (TOIS) 36, 2, Article 18 (Aug. 2017), 36 pages. https://doi.org/10.1145/3091995
- Zimbra et al. (2018) David Zimbra, Ahmed Abbasi, Daniel Zeng, and Hsinchun Chen. 2018. The State-of-the-Art in Twitter Sentiment Analysis: A Review and Benchmark Evaluation. ACM Trans. Manage. Inf. Syst. 9, 2, Article 5 (Aug. 2018), 29 pages. https://doi.org/10.1145/3185045