Semantic Image Cropping

07/15/2021 ∙ by Oriol Corcoll, et al.

Automatic image cropping techniques are commonly used to enhance the aesthetic quality of an image: they detect the most beautiful or most salient parts of the image and remove unwanted content, producing a smaller image that is more visually pleasing. In this thesis, I introduce an additional dimension to the cropping problem: semantics. I argue that image cropping can also enhance an image's relevance for a given entity by using the semantic information contained in the image. I call this problem Semantic Image Cropping. To support my argument, I provide a new dataset containing 100 images, each with at least two different entities and four ground-truth croppings collected using Amazon Mechanical Turk. I use this dataset to show that state-of-the-art cropping algorithms that only take aesthetics into account do not perform well on the problem of semantic image cropping. Additionally, I provide a new deep learning system that takes not just aesthetics but also semantics into account to generate image croppings, and I evaluate its performance using my new semantic cropping dataset, showing that the semantic information of an image can help produce better croppings.


1.1 Motivation

The goal of a photographer is to communicate stories, feelings or any other kind of information through images. To achieve this, images must fulfil certain requirements: they should have good composition, be aesthetically pleasing, transmit emotion and tell a story. To produce the perfect image, professionals use many different techniques, one of which is image cropping.

Image cropping is one of the most important tools used by professional photographers, and it addresses the following three problems. First, it can enhance an image by removing unwanted or distracting elements; in this case the resulting cropping includes all the main subjects in the image while some background objects (like trees or posts) are removed. Secondly, it can improve the visual quality of an image by making the main subject stand out. This is usually achieved by picking one of multiple subjects as the main one and removing the others, by centring the image on the main subject, or by a combination of both. Thirdly, it can change the aspect ratio and size of the image so it can be shown where the available space is limited, such as websites or digital frames, without deforming the image or sacrificing its quality. Most state-of-the-art cropping systems try to solve the first problem and ignore the second and third. In this thesis, I work on the last two problems and provide a new dataset that can be used to measure how well a cropping system performs when the last two problems are present, as well as a new automatic cropping system that can serve as the baseline for the semantic cropping problem.

As mentioned before, the semantic cropping problem arises when an image depicts more than one subject and/or when there is a limitation on the target size of the image (aspect ratio or size); in both cases the cropping system needs to decide which is the main subject of the image in order to provide the best cropping. To produce a cropping, a person would take into account the purpose or use of the image. For example, in figure 1.1 a photographer may pick the green cropping if the image is intended for a pet website, or the red cropping if the image should reflect how people enjoy winter; note that this example assumes a target aspect ratio of 1:1. In this thesis, I define the problem of finding the right cropping for an image and (possibly) an aspect ratio by taking into account contextual information (like an entity or description) as "semantic cropping".

Figure 1.1: Two subjects too far apart to both be included in an image with an aspect ratio of 1:1. The green and red rectangles are two of the many possible croppings.

1.2 Thesis Structure

I have organised this thesis as follows. In Chapter 2, Background, I present an overview of methods and literature related to Deep Learning, Convolutional Neural Networks and how these are used for the automatic image cropping problem, ranging from the most basic methods, which identify edges to generate croppings, to models that learn an image's saliency and aesthetic features and use them to generate pleasant croppings. Additionally, this chapter outlines some of the most relevant methods that are able to extract semantic information from an image.

In Chapter 3, Resources, I introduce the publicly available datasets relevant to the problem of automatic image cropping and go into the details of how these datasets can be used for the training and evaluation of different models, including the one I suggest in this thesis. I explain how these datasets differ from each other and why some are good for evaluation and some for training.

In Chapter 4, Semantic Cropping Dataset, I extend the list of publicly available resources with a new dataset that surfaces the problem of automatically producing croppings that make the image as relevant as possible. Since most publicly available datasets that can be used to evaluate an algorithm's performance ignore semantics and relevance, I have designed this new dataset specifically to evaluate the performance of different cropping algorithms when semantics and relevance are important.

In Chapter 5, Model, I suggest a new model that generates croppings taking semantics into account. Here I explain in detail the architecture of the model and how it can automatically generate semantic and non-semantic croppings. I also explain how this model is composed of three main sub-modules and how one of them extracts semantic information from the image. Additionally, I explain how these sub-modules are configured for training and for inference.

Following this, in Chapter 6, Experimental Results, I show the set of experiments I have done to evaluate the performance of the model for both semantic and non-semantic croppings. By the same token, I compare the results achieved by my model to state-of-the-art solutions whose performance is publicly available. Additionally, I use the new dataset for semantic cropping to establish a baseline for this problem.

Finally, in Chapter 7, Conclusions, I summarise my findings and outline a series of next steps to improve the current state of the art in semantic cropping.

2.1 Deep Learning For Image Classification

Traditionally, machine learning for image classification consists of two steps. First, a data scientist trying to solve a problem analyses the data and decides which features are the most important for the given problem. Then, with the help of traditional methods like HOG, SURF, PCA or LDA, she produces low-dimensional vectors that represent the relevant key features and improve, as much as possible, the performance of the selected learning algorithm. Finally, she uses these features as the input of the learning algorithm or model, which performs the classification task required to solve the given problem. This approach is time consuming, expensive and requires good knowledge of the problem's domain. What if these features could be learnt directly from the data, without the need for manual feature engineering?

In 1989, Le Cun et al. [15, 17, 16] already showed that this is possible: they designed a neural network that could learn to recognise handwritten digits and applied their model to a real-world dataset containing handwritten US zip codes, achieving a 5% error rate on their testing dataset. At that time neural networks were mainly limited by the amount of data available and by computational power, but by 2012 the available computational power was orders of magnitude higher, and the creation of a large dataset like ImageNet [2], containing 3.2 million images, together with the design of more complex and deeper networks, allowed deep convolutional methods [14, 30, 8] to become the state of the art in most computer vision classification benchmarks.

Since then, a lot of effort has been put into understanding why convolutional neural networks (CNNs) work so well. For example, Zeiler et al. [36] provide a way to visualise the different filters learned by a CNN at different layers of the network. These visualisations provide a way to diagnose the network and shed some light on what the network is really learning, instead of just relying on its output. Another technique, designed by Zhou et al. [37], is the construction of Class Activation Maps (CAM). This method provides a very simple and easy way of visualising the image areas used by a classification model to determine the class of the image, as shown in figure 2.1. The CAM technique has been used in many different problems like image classification, object localisation, image captioning and image cropping. Moreover, CAMs play a key role in my cropping model, where they are used to localise the areas of the image with the highest aesthetic value.

Figure 2.1: Example of a class activation map for different output classes of a deep learning classifier. Zhou et al. [37]
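The CAM computation itself is just a weighted sum of the last convolutional feature maps. A minimal NumPy sketch, assuming the feature maps and the class-specific weights have already been extracted from a trained classifier (the function name and the normalisation step are illustrative, not from the thesis):

```python
import numpy as np

def class_activation_map(feature_maps, class_weights):
    """Compute a Class Activation Map (Zhou et al.).

    feature_maps: (H, W, K) activations of the last convolutional layer.
    class_weights: (K,) weights from the GAP output to one class unit.
    Returns an (H, W) map of class-specific importance scores in [0, 1].
    """
    # Weighted sum over the K feature maps.
    cam = np.tensordot(feature_maps, class_weights, axes=([2], [0]))
    cam -= cam.min()
    if cam.max() > 0:
        cam /= cam.max()  # normalise to [0, 1] for visualisation
    return cam
```

Upsampling the resulting map to the input resolution gives heat maps like the ones shown in figure 2.1.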

Measuring the aesthetic quality of an image is especially difficult due to the intrinsic subjectivity of the task; nonetheless, it is an important problem in the area of photography that can have a huge impact on tasks like automatic photo editing or photo classification. The release of the relatively large AVA dataset [24], containing over 250,000 images with multiple aesthetic ratings per image, triggered a vast increase in papers [12, 23, 33, 34] using CNNs to try to automatically estimate the aesthetic quality of an image. Most of these methods model the problem of measuring aesthetic quality as a classification problem where images are labelled as high quality or low quality. In this work, I use an aesthetic quality classifier in combination with CAM to provide image croppings that preserve the areas of the image with the highest aesthetic value.

2.2 Deep Learning For Object Detection

In order to understand what is happening in an image or video, it is essential to know what objects or entities are present in the image and how they relate to or interact with each other. An important improvement made by researchers regarding image understanding is in the area of object detection, where deep learning models are being used in real-world problems like autonomous cars or people counting. In this regard, researchers have defined different sub-problems and benchmarks; for example, the sub-problems of object proposals, semantic segmentation and instance segmentation are commonly used in the literature to make evident the different degrees of complexity in detecting objects. The simplest of these sub-problems is to propose bounding boxes that enclose objects: given an image as input, the output is a list of bounding boxes, one for each identified object, together with its class, picked from a small (10-100) predefined list of possible classes relevant to a specific problem. When the identification of an object is done at pixel level, i.e. the output is not a bounding box (for example, two points) but a polygon or mask indicating which pixels belong to the object, the problem is typically called semantic segmentation. It is important to mention that this sub-problem does not make any distinction between instances of a class, i.e. if there are two objects of the same class, the output may contain only the combined pixels from both objects, forming one large object. When this distinction is important and a set of pixels per instance of the same class is needed, the problem is typically called instance segmentation.

Multiple object detection models have been proposed over the past few years; models like YOLO [28], SSD [22], Retinanet [20], Faster R-CNN [29] and Mask R-CNN [7] have scored state-of-the-art results in the most challenging benchmarks available for object detection. All these models have different properties and, to some extent, they behave very differently; despite this, they can be classified into two main categories: one-stage and two-stage object detection architectures. Figure 2.2 shows a sketch of these two types of architectures.

Figure 2.2: Two stage vs one stage object detection models.

A two-stage object detection model differs from a one-stage model in that there is an extra stage to generate generic object proposals. The purpose of this stage is to generate coarse candidate bounding boxes and to discard the background areas of the image, so that the next stage performs the expensive task of classifying and refining only the bounding boxes generated by the previous stage. The choice between a one- or two-stage object detection architecture comes with a trade-off between speed and accuracy: one-stage models tend to be faster but less accurate, whereas two-stage models are usually slower but more accurate.

One of the most popular frameworks for object detection, due to its very good results in different benchmarks, is the R-CNN framework. This framework has evolved since 2013, when the first R-CNN model [5] generated candidate bounding boxes using the selective search technique [32] and then classified them with the combination of a modified version of AlexNet [14], the deep neural network used to classify the ImageNet dataset [2] in 2012, and a Support Vector Machine (SVM). The candidate bounding boxes that are positively classified as objects are then refined by a linear regression model to be as close as possible to the object. R-CNN performs very well but is very slow, since the entire pipeline (CNN and SVMs) has to run on every candidate bounding box produced by the Selective Search module. In order to reduce the computation time and increase performance, Girshick created Fast R-CNN [6]. To avoid extracting image features for each candidate bounding box, Fast R-CNN generates a single feature map per image which is then reused for each candidate bounding box. Another improvement in Fast R-CNN is the elimination of the classification and regression SVMs, incorporating classification and regression sub-networks into the CNN instead. Fast R-CNN is 213 times faster than its predecessor R-CNN and can process an image in around 300ms. Fast R-CNN can be improved further by removing the Selective Search step and incorporating a candidate bounding box generation module; this is what Ren et al. did in their Faster R-CNN [29] model. Faster R-CNN replaces Selective Search with a Region Proposal Network (RPN), a neural network that reuses the features extracted by the CNN to generate bounding box proposals as part of the CNN feed-forward pass. Faster R-CNN runs in around 200ms at inference time and performs better than Fast R-CNN on benchmarks like MS COCO or Pascal VOC. All these models produce bounding boxes for detected objects, but can they be more precise than that? Mask R-CNN [7], developed by He et al., produces pixel-level segmentation by adding an extra module to Faster R-CNN; this module computes a mask for each detected object, leading to state-of-the-art results in pixel-level segmentation benchmarks.

One of the problems with Faster R-CNN and Mask R-CNN, and with two-stage detectors in general, is their inference time, which is usually limited to 10-20 FPS. One-stage detectors have better inference time, reaching 30-60 FPS, at the cost of detection accuracy. A recent paper [20] by Lin et al. closes the gap between two- and one-stage detectors, suggesting that one-stage object detection models can have the accuracy of a two-stage detector while keeping the characteristic high speed of a one-stage detector. In their paper, they argue that the main problem with one-stage detectors is the foreground-background class imbalance, where most of the candidate bounding boxes are background, i.e. do not contain an object. To address this problem they designed a new loss, called focal loss, which gives background candidate bounding boxes less weight in the computed loss, making training converge much faster and leading to an easier and faster classification of objects. This loss was tested on a new object detector model called Retinanet, which reaches accuracy similar to Faster R-CNN with a better inference time of around 122ms.
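For concreteness, the focal loss for a single binary prediction can be sketched as follows; the defaults γ = 2 and α = 0.25 are the values reported in the Retinanet paper, and the function itself is a simplified illustration rather than the thesis implementation:

```python
import numpy as np

def focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Binary focal loss (Lin et al.): down-weights easy examples.

    p: predicted foreground probability; y: 1 for object, 0 for background.
    """
    p_t = p if y == 1 else 1.0 - p          # probability of the true class
    alpha_t = alpha if y == 1 else 1.0 - alpha
    # The (1 - p_t)^gamma factor shrinks the loss of well-classified examples.
    return -alpha_t * (1.0 - p_t) ** gamma * np.log(p_t)
```

With γ = 2, a confidently classified background box (p near 0) contributes almost nothing to the loss, so the abundant easy negatives no longer dominate training.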

In this section, I have introduced Faster R-CNN and Mask R-CNN, which are examples of two-stage object detectors, and Retinanet, which is an example of a one-stage object detector. I use Focal Loss and Retinanet extensively in this thesis and describe them in more detail in Chapter 5, Model.

2.3 Deep Learning For Image Cropping

Photographers strive to produce the best image, but what makes an image the best? An important factor in a photograph is its aesthetic quality, i.e. how much beauty it encloses. A popular way to increase the aesthetic quality of an image, and therefore the amount of beauty in it, is to enhance the main subject by removing unwanted or unnecessary elements. This technique is called image cropping and, in the past few years, thanks to improvements in deep learning, it has gained popularity [1, 11, 18, 35, 4] among the computer vision research community.

Automatic image cropping methods have traditionally been grouped into saliency-based and aesthetic-based methods. Aesthetic-based methods identify the most aesthetically pleasing regions in an image in order to determine which candidate cropping is the best one. Traditional methods used manually engineered features that captured aesthetics; for example, Nishiyama et al. [25] designed features for different photographic techniques like avoiding camera shake or having the right exposure. With the upswing of deep learning and datasets like the AVA dataset [24] with thousands of images, researchers began to design models using these technologies to estimate the aesthetic quality of an image. For example, Kao et al. [11] designed a model that learned to identify the areas in an image with the most aesthetic value, using the AVA dataset [24] to train a deep learning model and then applying the Class Activation Map [37] technique described previously to produce a heat map highlighting the regions with the highest aesthetic value. Additionally, the authors use an SVM to give, for each candidate cropping, a higher score to the ones with a simpler boundary, i.e. croppings that partially cut fewer objects receive better scores. The generated heat map and the SVM are then used to rank candidate croppings.

Similarly, saliency-based methods find the regions of the image that possess the highest aesthetic value by prioritising the areas that attract the most attention. What is attention? Attention can be defined in many different ways. Stentiford et al. [31] defined attention as the ratio of small regions in the image whose colours match each other, and then used it to rank candidate croppings. Fang et al. [4] used a similar definition of attention, but in their case the attention or saliency maps were used to learn a cropping quality model, which then ranked candidate croppings. Others, like Wang et al. [35], use datasets such as AVA [24], which models aesthetics, or Salicon [10], which tracks human eye movements when people are presented with an image to find out what areas attract the most attention, to create a model that learns both aesthetic and saliency features. This model uses a method similar to the Region Proposal Network in Faster R-CNN to directly output candidate croppings with a quality score for each of them.

In this thesis, I use a modified version of the aesthetic model designed by Kao et al. [11] as the baseline for the generation and ranking of croppings and compare its performance against the full version of my semantic cropping model. In Chapter 5, Model, I provide a more detailed description of this model and the modifications I have made to it.

2.4 Word Similarity Measures

Finally, an important capability for this work is the ability to measure how similar two labels or words are; this problem is commonly known as word similarity. The task is not trivial, since similarity can be defined in various ways; moreover, the meaning of a word and the context in which it appears have to be taken into account to produce the right similarity value.

An existing solution to this problem is Wordnet, a lexical database that groups nouns, verbs, adjectives and adverbs into sets of concepts, or synsets; these synsets are linked to each other based on their conceptual relations. A variety of similarity measures are implemented in the Wordnet-based software [26] developed by Pedersen et al.; these measures output how similar two Wordnet synsets are, providing a similarity score. In this project, I use the similarity measure defined by Jiang and Conrath [9] to compute word similarity.
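To illustrate the Jiang-Conrath measure, the sketch below computes it over a tiny hypothetical taxonomy with made-up corpus probabilities; in the thesis the real Wordnet hierarchy and corpus-derived information content are used instead:

```python
import math

# Jiang-Conrath: dist(c1, c2) = IC(c1) + IC(c2) - 2 * IC(lcs(c1, c2)),
# similarity = 1 / dist, where IC(c) = -log P(c).
# The probabilities and taxonomy below are illustrative assumptions.
corpus_prob = {"entity": 1.0, "animal": 0.2, "dog": 0.05, "cat": 0.04}
parents = {"dog": "animal", "cat": "animal", "animal": "entity"}

def ic(concept):
    """Information content of a concept from its (toy) corpus probability."""
    return -math.log(corpus_prob[concept])

def lowest_common_subsumer(c1, c2):
    """Walk up the taxonomy until both concepts share an ancestor."""
    ancestors = set()
    while c1:
        ancestors.add(c1)
        c1 = parents.get(c1)
    while c2 not in ancestors:
        c2 = parents[c2]
    return c2

def jcn_similarity(c1, c2):
    """Jiang-Conrath similarity: inverse of the IC-based distance."""
    dist = ic(c1) + ic(c2) - 2 * ic(lowest_common_subsumer(c1, c2))
    return 1.0 / dist if dist > 0 else float("inf")
```

Identical concepts have zero distance (infinite similarity), and concepts whose common ancestor is very generic score lower.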

3.1 AVA Aesthetic Dataset

The Aesthetic Visual Analysis (AVA) dataset is a large collection of images taken from www.dpchallenge.com, each with a set of ratings that reflect its aesthetic quality. The dataset contains around 250,000 images and an average of 210 ratings per image, where each rating ranges from 1 to 10. It is important to mention that aesthetics is a very subjective concept and the ratings for a given image can vary widely.

Figure 3.1: Images in the AVA dataset classified as high or low aesthetic quality.

This dataset provides, in addition to the ratings, a set of tags for each image. The number of unique tags in the dataset is 66 and, as shown in figure 3.2, the distribution of images per tag is very diverse compared to similar datasets, making the dataset a good candidate for training a neural network.

Figure 3.2: Tag distribution of the images in the AVA dataset. Murray et al. [24]

I use this dataset in a similar way to the work done by Kao et al. in [11]: each image is assigned to the high aesthetics class, the low aesthetics class, or ignored (images assigned to the ignored class are not used for training, testing or benchmarking) as follows:

3.2 MS Coco Dataset

MS Coco is a dataset released by Microsoft in 2015; it contains 328,000 images with objects belonging to 91 different classes, for a total of 2.5 million object instances. Furthermore, the dataset provides not just the class of each object but also pixel-level instance segmentation, captions and key-point information for each image.

Figure 3.3: Example of images in the MS Coco dataset with pixel level segmentation for each object. Lin et al. [21]

This dataset improves on previous ones like Pascal VOC [3] by increasing the number of images and classes, as well as the number of instances per image. Furthermore, it introduces multiple captions per image, which can be used to benchmark models on the problem of image understanding.

Figure 3.4: Class distribution in the MS Coco dataset compared to the Pascal VOC dataset. Lin et al. [21]

I use the MS Coco dataset to train an object detection model using the object classes and bounding boxes. In this thesis, I use neither the pixel-level segmentation nor the caption information.

3.3 FLMS Cropping Dataset

The dataset released by Fang et al. [4] is referred to in this thesis as the FLMS cropping dataset, after its authors' names. This dataset was designed to evaluate the performance of automated cropping methods and provides a collection of 500 ill-composed images, i.e. images that are not cropped and have bad (not ideal) composition. In addition to these 500 images, the authors released a set of 10 croppings per image gathered using Amazon Mechanical Turk (MTurk). MTurk workers had to pass a qualification test to make sure the provided croppings were made by professional photographers and followed industry standards. This dataset has become one of the most popular datasets for benchmarking image cropping models.

Figure 3.5: Example of images in the FLMS cropping dataset with their MTurk croppings. Chen Fang [4]

I use the FLMS dataset to benchmark my implementation of the baseline cropping algorithm, i.e. the one that does not take semantics into account, and compare it to the semantic cropping algorithm and other results published by researchers. As in other papers on automatic image cropping, the benchmarking of the baseline algorithm against other state-of-the-art methods is computed using the 10 croppings and taking the best match; I explain this methodology in detail in Chapter 6, Experimental Results.
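The best-match protocol can be sketched as follows: score a predicted crop against every annotator crop with intersection over union (IoU) and keep the highest value. These are hypothetical helpers, not the exact evaluation code used in the experiments:

```python
def iou(box_a, box_b):
    """Intersection over union of two crops given as (x1, y1, x2, y2)."""
    x1 = max(box_a[0], box_b[0]); y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2]); y2 = min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def best_match_iou(predicted, ground_truths):
    """Score a predicted crop against all annotator crops; keep the best."""
    return max(iou(predicted, gt) for gt in ground_truths)
```

The per-image score is then the IoU against whichever of the 10 annotator croppings the prediction matches best.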

3.4 Flickr Cropping Dataset

Chen et al. collected a dataset [1] specifically designed for training and benchmarking cropping models. This dataset contains 3,413 images and 10 cropping pairs per image, where the croppings in each pair are ranked against each other. The ranking between different croppings is the main novelty of this dataset and one of the reasons to use it to benchmark cropping models. The images in the dataset are public images from the website Flickr; the authors of the paper also used Amazon Mechanical Turk, as in the FLMS cropping dataset, to reduce the original set of images from 31,888 to 3,413 valid images. Additionally, the croppings were generated by MTurk workers and then curated by a different set of MTurk workers.

Figure 3.6: Example of images in the Flickr cropping dataset with their ground truth cropping. Chen et al. [1]

As with the FLMS dataset, I use the 10 unique croppings per image to compare different cropping algorithms and to measure how good they are. It is important to mention that the ranking between croppings provided by this dataset is not used; I only use the croppings on their own.

5.1 High-level Design

In order to solve the aforementioned problems and challenges, I have created a system composed of three modules: semantic, aesthetic and cropping. Figure 5.1 shows the high-level design, outlining the three core modules of the cropping solution, their inputs and outputs, and how they interact with each other. The semantic module is in charge of detecting the most relevant areas in the image for a given entity and providing a relevance score at pixel level. The aesthetic module identifies the areas of the image with the highest aesthetic quality, providing an aesthetic score per pixel. Finally, the cropping module generates a set of candidate croppings and uses the pixel-level relevance and aesthetic scores to rank them and pick the best cropping as the output. In the following sections, I describe in detail how each module works and how they are combined to provide the final cropping.

Figure 5.1: High-level design of the semantic cropping solution.
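The interaction between the three modules can be sketched as below; the maps would come from the semantic and aesthetic modules, and the equal weighting of the two scores is an assumption of this sketch, not the model's actual ranking function:

```python
import numpy as np

def rank_crops(candidates, relevance_map, aesthetic_map, weight=0.5):
    """Rank candidate crops by the pixel-level scores inside each crop.

    candidates: list of (x1, y1, x2, y2); maps: (H, W) arrays in [0, 1].
    `weight` balances semantics against aesthetics (illustrative default).
    """
    def score(box):
        x1, y1, x2, y2 = box
        rel = relevance_map[y1:y2, x1:x2].mean()   # semantic module output
        aes = aesthetic_map[y1:y2, x1:x2].mean()   # aesthetic module output
        return weight * rel + (1 - weight) * aes

    # The cropping module returns the highest-scoring candidate.
    return max(candidates, key=score)
```

A crop that covers the entity's relevant pixels and a high-aesthetics region wins over one that covers neither.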

5.2 Aesthetic Module

The use of deep learning models to predict how aesthetically pleasing an image is has become very popular in the past few years among the research community. As shown by Kao et al. in [11], convolutional neural networks together with class activation maps [37] can be used to predict pixel-level aesthetic scores. In this work, I use a model similar to Kao's, composed of a typical feature extraction network like VGG16 or ResNet50, a Global Average Pooling (GAP) layer and a Softmax layer with two output classes, see figure 5.2. I use this model to classify images into high or low aesthetics. Additionally, as suggested in [37], some extra computation can produce pixel-level aesthetic predictions, i.e. an aesthetic map with pixel-level scores. The design of this aesthetic model, in contrast to [11], strives to be easy to implement while keeping the same performance achieved by Kao's model.

Figure 5.2: Aesthetic module design.

The aesthetic model is composed of a CNN backbone that extracts features, which are then used to classify the image into high or low aesthetics. One of the main reasons for using a popular feature extraction network like VGG16 or ResNet50, as opposed to a custom network like the one suggested in Kao's paper, is that it makes it very easy to apply transfer learning and fine-tuning techniques thanks to the publicly available weights trained on the ImageNet [2] dataset; using these techniques leads to better performance and less training time. In addition to having a different feature extraction sub-network, images are downscaled to 448x448 instead of the typical 224x224 of networks like VGG16 or ResNet50. The intuition behind this change is that aesthetics depends on very fine-grained changes of light and colour, which are hard for the network to appreciate in low-resolution images like 224x224.

In order to predict pixel-level aesthetic quality, I use the Class Activation Map (CAM) [37] technique, which makes it possible to find the regions in the input image used by the CNN to classify the image as a particular class and, furthermore, how important those regions are to the assigned class. Following this idea, the aesthetic module classifies an image into high or low aesthetics and then computes the CAM only for the high aesthetics class; see figure 5.3 for an example. The aesthetic map is computed by combining the last convolutional layer of the backbone network with the weights between the GAP layer and the high aesthetics class unit in the Softmax layer as follows:

M_high(x, y) = Σ_k w_k · f_k(x, y)    (5.1)

where f_k(x, y) is the value of the k-th feature map of the last convolutional layer at position (x, y), and w_k is the weight between the k-th GAP output and the Softmax unit for the high aesthetics class.

Figure 5.3: Example of an aesthetic map with pixel level scores overlaid on top of the original image generated by my aesthetic model.
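The aesthetic map computation, i.e. a weighted sum of the last convolutional feature maps followed by upsampling to the 448x448 input, can be sketched as below; the nearest-neighbour upsampling via np.kron is an illustrative simplification of whatever interpolation the real module uses:

```python
import numpy as np

def aesthetic_map(feature_maps, w_high, out_size=448):
    """Pixel-level aesthetic scores: weighted sum of the last conv layer's
    feature maps (high aesthetics class weights), upsampled to the input
    resolution with crude nearest-neighbour scaling.

    feature_maps: (h, w, K) activations; w_high: (K,) GAP-to-class weights.
    """
    cam = np.tensordot(feature_maps, w_high, axes=([2], [0]))  # (h, w)
    scale = out_size // cam.shape[0]
    return np.kron(cam, np.ones((scale, scale)))  # (out_size, out_size)
```

Overlaying the result on the original image produces maps like the one in figure 5.3.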

5.2.1 Training

I train the aesthetic module in isolation using the images and ratings in the AVA dataset [24]. As mentioned in Chapter 3, Resources, the AVA dataset contains ratings between 1 and 10; in order to use them to train the model, I preprocess the images following an approach similar to [11]. I produce a single rating for each image by computing the average rating, then assign images with an average rating of 4 or less to the low aesthetics class and images with an average rating of 7 or more to the high aesthetics class. Because their ratings are ambiguous, images with an average rating between 4 and 7 are not used for training.
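The assignment rule can be sketched as a small helper; the function name is hypothetical, while the thresholds are the ones stated above:

```python
def ava_class(ratings):
    """Assign an AVA image to a class from its ratings (Kao et al. scheme):
    mean <= 4 -> low aesthetics, mean >= 7 -> high aesthetics, else ignored.
    """
    mean = sum(ratings) / len(ratings)
    if mean <= 4:
        return "low"
    if mean >= 7:
        return "high"
    return None  # ambiguous ratings are excluded from training
```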

I train the model using VGG16 as the backbone, with 70% of the images for training, 20% for testing and 10% for validation. I use the categorical cross entropy loss, modified to take into account the class imbalance created by the way the images were split into low and high aesthetics, which leaves roughly 3 high aesthetics images per low aesthetics image:

L = -(1/N) Σ_{i=1}^{N} Σ_{c=1}^{C} w_c · y_{i,c} · log(ŷ_{i,c})    (5.2)

where N is the number of samples, C is the number of classes, w_c is the weight of class c, y_{i,c} is the ground truth and ŷ_{i,c} the predicted value; in this case C = 2 and, to compensate for the 3:1 imbalance, the low aesthetics class is weighted more heavily than the high aesthetics class. I use the Adam optimiser [13] and train the network for 20 epochs on batches of 70 images using 1 GPU. At the end of each epoch, the set of unseen testing images is used to evaluate the performance of the model, producing an accuracy metric. Additionally, because of the use of weights pre-trained on ImageNet, I normalise pixel values to be between -1 and 1.
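A NumPy sketch of the class-weighted categorical cross entropy in equation 5.2; the weight vector passed by the caller is an assumption of this sketch, standing in for whatever exact values compensate the 3:1 imbalance in training:

```python
import numpy as np

def weighted_cross_entropy(y_true, y_pred, class_weights):
    """Class-weighted categorical cross entropy (eq. 5.2 sketch).

    y_true: (N, C) one-hot labels; y_pred: (N, C) predicted probabilities;
    class_weights: (C,) per-class weights to counter class imbalance.
    """
    eps = 1e-12  # avoid log(0)
    per_sample = -(class_weights * y_true * np.log(y_pred + eps)).sum(axis=1)
    return per_sample.mean()
```

With equal weights this reduces to the standard categorical cross entropy; raising the weight of the minority class scales its contribution to the loss proportionally.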

5.3 Semantic Module

The goal of this thesis is to provide a semantic cropping by identifying the areas of an image where an entity is present and providing the most aesthetically pleasing cropping that includes them. Extracting semantic information from an image is not a solved problem and, depending on the intended use of this information, different solutions have different degrees of complexity. A system that extracts semantic information can range from finding entities in an image to building a knowledge graph with the relations between entities, attributes and actions found in the images. For the purposes of this thesis, finding the position of a given entity is all the complexity the semantic module needs.

I divide the semantic module into two sub-modules: object detection and entity resolution. The object detection sub-module is in charge of finding the objects present in the image, i.e. assigning a label or class to a set of pixels. The entity resolution sub-module estimates how likely it is that a label produced by the object detector refers to the input entity.

5.3.1 Object Detection

As mentioned in previous chapters, object detection networks have improved considerably in the last few years, creating a very promising outlook for tasks that rely on them, like semantic cropping. The object detection sub-module is in charge of finding objects in the input image. For this purpose I have chosen Retinanet [20], a one-stage pyramid-like CNN with the accuracy of a two-stage object detection network, which extends the Feature Pyramid Network (FPN) defined in [19] by the same authors.

Figure 5.4: Retinanet object detection network by Lin et al. [20].

Retinanet uses a backbone network such as ResNet50 as the bottom-up pyramid of features (see section (a) in figure 5.4) and creates the top-down pyramid using the features extracted at different levels of the ResNet50 network combined with new convolutional layers (see section (b) in figure 5.4). Additionally, an object classification sub-network and a bounding box regression sub-network are attached at each level of the top-down pyramid (see sections (c) and (d) in figure 5.4).

Figure 5.5: Retinanet top-down layer creation.

The first top-down layer is built by applying a 1x1 convolutional layer to the last bottom-up layer. Each subsequent layer is built by up-scaling the previous top-down layer by 2, combining the result with the bottom-up layer at the same level (projected through a 1x1 convolutional layer, as for the first layer), and finally extending the combination with a 3x3 convolutional layer, as shown in figure 5.5.
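The merge step of the top-down pathway can be sketched with numpy. This is a shape-level illustration only: `lateral_1x1` and `topdown_merge` are hypothetical helpers, the learned 1x1 convolution is faked with a fixed random projection, and the trailing 3x3 convolution is omitted.

```python
import numpy as np

def lateral_1x1(feature, out_channels=256):
    # Stand-in for a learned 1x1 convolution: a per-pixel linear projection
    # with fixed random weights, used here only to illustrate the shapes.
    rng = np.random.default_rng(0)
    w = rng.standard_normal((feature.shape[-1], out_channels)) * 0.01
    return feature @ w

def upsample2x(feature):
    # Nearest-neighbour upsampling by 2 in both spatial dimensions.
    return feature.repeat(2, axis=0).repeat(2, axis=1)

def topdown_merge(prev_topdown, bottomup):
    """One merge step of the top-down pathway (figure 5.5): upscale the
    previous top-down map by 2 and add the 1x1-projected bottom-up map."""
    return upsample2x(prev_topdown) + lateral_1x1(bottomup)

p5 = np.zeros((4, 4, 256))   # coarsest top-down level
c4 = np.zeros((8, 8, 512))   # bottom-up features one level below
p4 = topdown_merge(p5, c4)
print(p4.shape)              # (8, 8, 256)
```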

To produce object bounding boxes and classes, Retinanet uses a set of anchors of different scales and aspect ratios, as done in the Faster R-CNN [29] model. Figure 5.6 shows anchors with three different aspect ratios. These anchors are evaluated on the final convolutional feature map with a sliding-window approach.

Figure 5.6: Example of the anchors used by Faster R-CNN and Retinanet, Ren et al. [29]. Note that in contrast to the rest of the thesis, here $k$ refers to the number of anchors and not the number of classes.

In addition to the bottom-up and top-down sub-networks, a classification sub-network and a box regression sub-network are added at each level of the top-down sub-network. Each classification sub-network is composed of four 3x3 convolutional layers with 256 filters and relu activations, followed by a final 3x3 convolutional layer with $KA$ filters, where $K$ is the number of classes and $A$ the number of anchors, with a sigmoid activation; this final layer produces the probability of an anchor belonging to each of the object classes. Similarly, the box regression sub-networks share the same four convolutional layers plus an extra 3x3 convolutional layer with $4A$ filters, as shown in figure 5.4.

Another important aspect of Retinanet is the losses used for classification and box regression. A smooth L1 loss is used for box regression, the same loss used in the Faster R-CNN model [29]. The classification sub-networks, on the other hand, use a new loss defined by Lin et al. in [20], called focal loss. This loss exploits the observation that background areas (like sky or grass) vastly outnumber objects, and gives more importance to foreground areas, i.e. areas close to an object. Focal loss is given by the formula:

$\mathrm{FL}(p_t) = -\alpha_t (1 - p_t)^{\gamma} \log(p_t)$ (5.3)

Here $\gamma$ is a focusing factor that gives more importance in the loss to samples that are hard to classify, i.e. samples with low $p_t$, and $\alpha_t$ is a balancing factor between foreground and background areas. $p_t$ is defined as:

$p_t = \begin{cases} p & \text{if } y = 1 \\ 1 - p & \text{otherwise} \end{cases}$ (5.4)
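The focal loss of equations 5.3 and 5.4 can be written directly in numpy for the binary case. This is a minimal sketch, not the Retinanet training code; the default $\gamma = 2$ and $\alpha = 0.25$ follow the values reported by Lin et al. [20].

```python
import numpy as np

def focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Focal loss for binary labels.
    p: predicted probability of the positive class, y: 0/1 ground truth.
    p_t = p if y == 1 else 1 - p        (equation 5.4)
    FL(p_t) = -alpha_t * (1 - p_t)**gamma * log(p_t)   (equation 5.3)"""
    p_t = np.where(y == 1, p, 1.0 - p)
    alpha_t = np.where(y == 1, alpha, 1.0 - alpha)
    return -alpha_t * (1.0 - p_t) ** gamma * np.log(p_t)

# A confidently-correct prediction contributes almost nothing...
easy = focal_loss(np.array([0.95]), np.array([1]))
# ...while a badly misclassified one dominates the loss.
hard = focal_loss(np.array([0.05]), np.array([1]))
print(easy, hard)
```

The $(1 - p_t)^\gamma$ factor is what down-weights the abundant, easily classified background anchors, which is why the loss copes with the extreme foreground/background imbalance.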
Retinanet Training

I configure the Retinanet model to use ResNet50 as the backbone network, creating three pyramid levels from the last convolutional layer of each of the backbone's last three convolutional blocks, plus two additional levels, each built with a 3x3 convolutional layer stacked on top of the last Retinanet top-down layer, as suggested in [20]. I also use anchors with aspect ratios 1:2, 1:1 and 2:1 at scales $2^0$, $2^{1/3}$ and $2^{2/3}$. The areas of the original (unscaled) anchors at the five levels are $32^2$, $64^2$, $128^2$, $256^2$ and $512^2$. The weights of the ResNet50 backbone are initialised with weights pre-trained on the ImageNet dataset [2]. I then train the Retinanet network on the MS Coco [21] dataset; more specifically, I use the 2017 instances dataset, which contains images covering 91 object classes. Training uses the Adam [13] optimisation algorithm on a host with 8 GPUs for 10 epochs with batches of 16 images.
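The anchor configuration above (three aspect ratios times three scales, giving 9 anchors per position at each level) can be sketched as follows; `anchor_sizes` is a hypothetical helper, not Retinanet code, and ratio is taken as height/width.

```python
import numpy as np

def anchor_sizes(base_area, ratios=(0.5, 1.0, 2.0),
                 scales=(2 ** 0, 2 ** (1 / 3), 2 ** (2 / 3))):
    """Width/height pairs of the anchors at one pyramid level: every
    combination of scale and aspect ratio, where each scaled anchor
    preserves the area base_area * scale**2."""
    sizes = []
    for s in scales:
        area = base_area * s ** 2
        for r in ratios:
            w = np.sqrt(area / r)   # w * (w * r) == area
            sizes.append((w, w * r))
    return sizes

# 9 anchors per spatial position at the coarsest 32x32 level
level_anchors = anchor_sizes(32 ** 2)
print(len(level_anchors))   # 9
```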

Figure 5.7: Example of a semantic map where a Gaussian kernel was applied for the entity elephant with pixel level scores overlaid on top of the original image.

5.3.2 Entity Resolution

The Retinanet object detection model trained on MS Coco presents a problem: it only recognises 91 classes of objects, while my cropping algorithm accepts any type of entity. To solve this problem, I split it into two stages, disambiguation and similarity. The input entity is just a string, i.e. it carries no semantics, context or additional meaning, so I first find all possible meanings of the given entity. The second stage computes how similar each meaning of the entity is to each of the objects output by Retinanet; finally, the object with the highest similarity score to any of the possible meanings is picked as the main object in the image, provided the score exceeds a threshold.

As mentioned in previous chapters, WordNet is a lexical database that can do both: disambiguate and compute similarity scores between two different synsets. From the multiple similarity metrics and corpora available in the WordNet package, I decided to use the Jiang-Conrath similarity together with the Brown corpus, having found empirically that it outperforms the other metrics and corpora when compared on the entities of the semantic dataset and the objects detected by Retinanet.
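The two-stage resolution can be sketched as a small pipeline. This is a toy illustration: the `SENSES` dictionary stands in for WordNet's sense inventory, and the pluggable `similarity` function stands in for the Jiang-Conrath similarity over the Brown corpus; none of these names come from the thesis code.

```python
SENSES = {  # entity -> its possible meanings (disambiguation stage)
    "jumbo": ["elephant", "airplane"],
}

def resolve(entity, detected_labels, similarity, threshold=0.5):
    """Return the detected label best matching any sense of `entity`,
    or None if no similarity score exceeds the threshold."""
    senses = SENSES.get(entity, [entity])
    best_label, best_score = None, threshold
    for label in detected_labels:
        for sense in senses:
            score = similarity(sense, label)
            if score > best_score:
                best_label, best_score = label, score
    return best_label

# Toy similarity: exact match scores 1, everything else 0. In the actual
# module this would be a WordNet-based metric between synsets.
sim = lambda a, b: 1.0 if a == b else 0.0
print(resolve("jumbo", ["dog", "elephant"], sim))   # elephant
```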

5.3.3 Semantic Predictions

At prediction time, the semantic module first generates object candidates using Retinanet; each candidate is then compared to the input entity using the entity resolution sub-module. The candidate with the highest similarity score is used to generate a semantic map as follows: I create a new matrix with the same shape as the original image, set all its values to 0, and set to 1 the area where Retinanet locates the object. A Gaussian kernel is then applied to smooth the semantic map, and the matrix is normalised to add up to 1. Figure 5.7 shows an example of a semantic map. It is important to mention that if two objects of the same class are detected in different locations of the image, I use the largest one to generate the semantic map.
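A minimal numpy sketch of this semantic-map construction, assuming the detection is given as a (top, left, bottom, right) box; `semantic_map` is a hypothetical helper, and the separable Gaussian blur is a dependency-free stand-in for a library filter such as `scipy.ndimage.gaussian_filter`.

```python
import numpy as np

def semantic_map(shape, box, sigma=5.0):
    """Zeros everywhere, 1 inside the detected object's box, Gaussian
    smoothing, then normalisation so the whole map sums to 1.
    box = (top, left, bottom, right) in pixel coordinates."""
    m = np.zeros(shape)
    t, l, b, r = box
    m[t:b, l:r] = 1.0
    # Separable Gaussian blur via two passes of 1-D convolution.
    radius = 3 * int(sigma)
    k = np.exp(-0.5 * (np.arange(-radius, radius + 1) / sigma) ** 2)
    k /= k.sum()
    m = np.apply_along_axis(lambda v: np.convolve(v, k, mode="same"), 0, m)
    m = np.apply_along_axis(lambda v: np.convolve(v, k, mode="same"), 1, m)
    return m / m.sum()

m = semantic_map((100, 100), (30, 30, 70, 70))
print(m.sum())               # 1.0 (up to floating point)
print(m[50, 50] > m[0, 0])   # mass concentrated on the object
```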

5.4 Cropping Module

So far I have presented how to identify the aesthetic and semantic areas of an image; the final component of the semantic cropping algorithm is in charge of generating the actual cropping, i.e. combining these two areas or maps to produce a final cropping. The cropping module works by first combining the aesthetic and semantic maps, then generating a set of candidate croppings, and finally using the combined map to score and rank the candidates and pick the best croppings.

The first step in generating a semantic cropping is to produce both the aesthetic and the semantic map using the aesthetic and semantic modules. These two maps reflect how aesthetically pleasing and how semantically relevant each area of the image is. I combine them linearly:

$M = w_a A + w_s S$ (5.5)

where $M$ is the combined map, $A$ and $S$ are the aesthetic and semantic maps respectively, and $w_a$ and $w_s$ are weights that provide a way to give more importance to one map or the other.

Figure 5.8: Example of combining the aesthetic and semantic maps overlaid on top of the original image.

To generate candidate croppings, I follow a sliding-window approach and generate the set of all possible croppings of the input image with a fixed aspect ratio and a fixed padding between consecutive croppings. I then use the combined map to produce a score for each candidate cropping as follows:

$S(c) = \sum_{(x, y) \in c} M_{x,y}$ (5.6)

where $M_{x,y}$ is the value at coordinates $(x, y)$ of the combined map $M$. This score can then be used to rank the cropping candidates and pick the best one, or the best few, depending on the application of the algorithm.
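The whole cropping module, combining the maps (equation 5.5) and scoring sliding-window candidates (equation 5.6), can be sketched as below. The window size, stride and weights are illustrative values, and `best_crop` is a hypothetical helper, not the thesis implementation.

```python
import numpy as np

def best_crop(aesthetic, semantic, w_a=0.5, w_s=0.5, size=64, stride=8):
    """Slide a square window over the combined map and return the
    (top, left) corner of the highest-scoring crop."""
    combined = w_a * aesthetic + w_s * semantic   # equation 5.5
    h, w = combined.shape
    best, best_score = None, -np.inf
    for top in range(0, h - size + 1, stride):
        for left in range(0, w - size + 1, stride):
            # equation 5.6: sum of the combined map inside the window
            score = combined[top:top + size, left:left + size].sum()
            if score > best_score:
                best, best_score = (top, left), score
    return best

aes = np.zeros((128, 128))
sem = np.zeros((128, 128))
sem[80:120, 80:120] = 1.0           # object in the bottom-right corner
print(best_crop(aes, sem))           # a window fully covering the object
```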

6.1 Aesthetic Image Classification

In the literature, the performance of an aesthetic image classification model is usually measured by computing how accurately it classifies images from the AVA dataset [24]; typically 20% of the images in the dataset are reserved for evaluation. In this thesis I follow the same approach and compute the accuracy of my aesthetic model on the 20% of images not used during training and therefore never seen by the network.

Table 6.1 shows how state-of-the-art models perform compared to the aesthetic model defined and trained in this thesis. It is important to mention that my aesthetic model is essentially the one suggested by Kao et al. in [11], but using VGG16 to extract features and trained with different hyper-parameters and image size.

Method                          Accuracy
AVA [24]                        0.667
Kao et al. [11]                 0.763
Wang et al. [35]                0.769
Aesthetic model (this thesis)   0.820

Table 6.1: Accuracy achieved by multiple state-of-the-art models on the AVA dataset [24], including my aesthetic model.

6.2 Object Detection

Object detection results are usually compared using mean average precision (mAP). MS Coco provides an API that, given a set of predictions, automatically computes different metrics, including mAP. I use the MS Coco API to compare the results of training my version of Retinanet with the performance stated in the Retinanet paper [20]; additionally, I include the performance achieved by Faster R-CNN.

Method                         mAP
Faster R-CNN [29]              0.368
Retinanet (ResNet50) [20]      0.357
Semantic model (this thesis)   0.350

Table 6.2: Mean average precision achieved by state-of-the-art models on the MS Coco dataset [21], including my semantic model.

As mentioned before, Retinanet is a one-stage object detection network that achieves a mAP close to that of a two-stage network like Faster R-CNN, with the advantage of taking less time to generate predictions.

6.3 Image Cropping

The performance of models in the task of image cropping is typically measured by computing the Intersection Over Union (IOU) as follows:

$\mathrm{IOU}(c_1, c_2) = \frac{|c_1 \cap c_2|}{|c_1 \cup c_2|}$ (6.1)

where $c_1$ and $c_2$ are two croppings, the intersection is the area shared by both croppings and the union is the area covered by both croppings combined. I evaluate the semantic cropping model using the IOU metric on three different datasets. The first two, the FLMS dataset [4] and the Flickr dataset [1], do not take semantics into account; they are meant to evaluate how much a cropping can enhance the aesthetic quality of an image. The third is the semantic dataset presented in the Chapter 4 Semantic Cropping Dataset chapter. Since the first two datasets present no semantic challenge, to see whether semantic information provides any advantage there I also evaluate the model with both maps enabled, i.e. using the aesthetic and semantic maps together. Table 6.3 shows how state-of-the-art models perform on the FLMS dataset compared to the semantic cropping model. Additionally, figure 6.1 shows qualitative results of the combined cropping method on images from the FLMS cropping dataset [4].
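For axis-aligned rectangular croppings, equation 6.1 reduces to a few lines of code; `iou` below is an illustrative helper with croppings given as (top, left, bottom, right) tuples.

```python
def iou(a, b):
    """Intersection over union of two croppings, each given as a
    (top, left, bottom, right) rectangle."""
    t = max(a[0], b[0])
    l = max(a[1], b[1])
    bt = min(a[2], b[2])
    r = min(a[3], b[3])
    inter = max(0, bt - t) * max(0, r - l)          # overlapping area
    area = lambda c: (c[2] - c[0]) * (c[3] - c[1])
    union = area(a) + area(b) - inter               # combined area
    return inter / union

print(iou((0, 0, 10, 10), (0, 0, 10, 10)))   # 1.0
print(iou((0, 0, 10, 10), (5, 5, 15, 15)))   # 25 / 175 ≈ 0.1429
```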

Method                                 IOU
Fang et al. [4]                        0.6998
Kao et al. [11]                        0.7500
Wang et al. [35]                       0.8100
A2-RL [18]                             0.8204
Aesthetic model (aesthetic map only)   0.8169
Semantic model (semantic map only)     0.4287
Combined model (both maps)             0.8181

Table 6.3: Comparison of the performance of different methods on the FLMS cropping dataset [4].

Figure 6.1: Qualitative results on the FLMS cropping dataset using the combined version of my cropping model. The ground truth cropping is marked with a red bounding box. The generated cropping by my method is marked with a green bounding box.

As can be seen, my model performs very similarly to the best model I am aware of on the FLMS dataset, and the aesthetic-only and combined models achieve similar performance. Additionally, I have compared my model to other state-of-the-art methods with published results on the Flickr dataset [1], which contains more images than the FLMS dataset. As table 6.4 shows, my cropping model performs similarly (marginally better) to the current state-of-the-art cropping model, both when using only the aesthetic module and when combining the aesthetic and semantic modules. Additionally, figure 6.2 shows qualitative results of the combined image cropping method on images from the Flickr Cropping Dataset [1].

Method                                 IOU
Chen et al. [1]                        0.6019
A2-RL [18]                             0.6633
Aesthetic model (aesthetic map only)   0.6639
Semantic model (semantic map only)     0.4695
Combined model (both maps)             0.6633

Table 6.4: Performance on the Flickr Cropping Dataset [1].

Figure 6.2: Qualitative results on the Flickr Cropping Dataset using the combined version of my cropping model. The ground truth cropping is marked with a red bounding box. The generated cropping by my method is marked with a green bounding box.

6.4 Semantic Image Cropping

On the semantic cropping dataset, the semantic model is evaluated on the two flavours of the dataset, i.e. using the croppings I generated manually and using the croppings generated by Mechanical Turk workers. For both flavours I compare the performance of the model with different values of $w_a$ and $w_s$; the results are shown in table 6.5. An important thing to notice is that the MTurk workers produced croppings tighter around the object, sacrificing aesthetics. My croppings give more room to aesthetics, and therefore a combined map with weight on both maps performs better against my ground truth.

Method                                 IOU (mine)   IOU (MTurk)
Aesthetic model (aesthetic map only)   0.5436       0.4154
Semantic model (semantic map only)     0.5407       0.6697
Combined model (both maps)             0.6443       0.5228

Table 6.5: Performance on the semantic cropping dataset.

Figure 6.3 shows the qualitative results of the semantic cropping method on images from the semantic dataset. As you can see, the method gives priority to the input entity but also uses the most aesthetic areas of the image.

Figure 6.3: Qualitative results on the semantic cropping model on the semantic cropping dataset. Each image shows the cropping produced by the combined (full method), aesthetic or semantic modules. The ground truth cropping (mine) is marked with a red bounding box. The entity for each image together with the soft cropping (green bounding box) can be seen on the top label.

7.1 Further work

The problem of semantic cropping is vast and complicated; in this thesis I have only covered the use case where a fixed aspect ratio of 1:1 is required. A possible extension is to increase the number of aspect ratios of the croppings in the semantic cropping dataset. The dataset would also benefit from more images and entities.

Regarding the model, I have mentioned that object detection networks have improved considerably in the past few years, but even with this improvement the current state of object detection is far from ideal, which makes the semantic module imperfect. A temporary solution would be to train the Retinanet model on a dataset covering a larger set of classes; a more generic set of classes would probably give better performance. Current cropping methods use aesthetics or saliency to determine the best cropping for an image; in this work I have only considered aesthetics. An interesting change to the semantic cropping model would be to incorporate a saliency module that detects the salient parts of an image to generate a saliency map; this module could be trained using the Salicon [10] dataset.

Additionally, entity resolution i.e. mapping the given entity into the object detection classes could be improved by not treating each word individually but considering multiple words as a whole entity.

Another interesting experiment would be to not fix manually the weights used to combine the aesthetic and semantic maps, but to learn these two values (or perhaps combine the maps non-linearly) for different types of images.

Bibliography

  • [1] Y. Chen, T. Huang, K. Chang, Y. Tsai, H. Chen, and B. Chen (2017) Quantitative analysis of automatic image cropping algorithms: A dataset and comparative study. CoRR abs/1701.01480. External Links: Link, 1701.01480 Cited by: §2.3, Figure 3.6, §3.4, Chapter 3, §6.3, §6.3, Table 6.4, Chapter 7, Chapter 7.
  • [2] J. Deng, W. Dong, R. Socher, L. J. Li, K. Li, and L. Fei-Fei (2009-06) ImageNet: a large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255. External Links: Document, ISSN 1063-6919 Cited by: §2.1, §2.2, §5.2, §5.3.1.
  • [3] M. Everingham, S. M. Eslami, L. Gool, C. K. Williams, J. Winn, and A. Zisserman (2015-01) The pascal visual object classes challenge: a retrospective. Int. J. Comput. Vision 111 (1), pp. 98–136. External Links: ISSN 0920-5691, Link, Document Cited by: §3.2.
  • [4] C. Fang, Z. Lin, R. Mech, and X. Shen (2014) Automatic image cropping using visual composition, boundary simplicity and content preservation models. In Proceedings of the 22Nd ACM International Conference on Multimedia, MM ’14, New York, NY, USA, pp. 1105–1108. External Links: ISBN 978-1-4503-3063-3, Link, Document Cited by: §2.3, §2.3, Figure 3.5, §3.3, Chapter 3, §6.3, Table 6.3, Chapter 7, Chapter 7.
  • [5] R. B. Girshick, J. Donahue, T. Darrell, and J. Malik (2013) Rich feature hierarchies for accurate object detection and semantic segmentation. CoRR abs/1311.2524. External Links: Link, 1311.2524 Cited by: §2.2.
  • [6] R. B. Girshick (2015) Fast R-CNN. CoRR abs/1504.08083. External Links: Link, 1504.08083 Cited by: §2.2.
  • [7] K. He, G. Gkioxari, P. Dollár, and R. B. Girshick (2017) Mask R-CNN. CoRR abs/1703.06870. External Links: Link, 1703.06870 Cited by: §2.2, §2.2.
  • [8] K. He, X. Zhang, S. Ren, and J. Sun (2015) Deep residual learning for image recognition. CoRR abs/1512.03385. External Links: Link, 1512.03385 Cited by: §2.1.
  • [9] J. J. Jiang and D. W. Conrath (1997) Semantic similarity based on corpus statistics and lexical taxonomy. CoRR cmp-lg/9709008. External Links: Link Cited by: §2.4.
  • [10] M. Jiang, S. Huang, J. Duan, and Q. Zhao (2015-06) SALICON: saliency in context. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2.3, §7.1.
  • [11] Y. Kao, R. He, and K. Huang (2017-03) Automatic image cropping with aesthetic map and gradient energy map. In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Vol. , pp. 1982–1986. External Links: Document, ISSN Cited by: §2.3, §2.3, §2.3, §3.1, §5.2.1, §5.2, §6.1, Table 6.1, Table 6.3.
  • [12] Y. Kao, C. Wang, and K. Huang (2015-Sept) Visual aesthetic quality assessment with a regression model. In 2015 IEEE International Conference on Image Processing (ICIP), Vol. , pp. 1583–1587. External Links: Document, ISSN Cited by: §2.1.
  • [13] D. P. Kingma and J. Ba (2014) Adam: A method for stochastic optimization. CoRR abs/1412.6980. External Links: Link, 1412.6980 Cited by: §5.2.1, §5.3.1.
  • [14] A. Krizhevsky, I. Sutskever, and G. E. Hinton (2012) ImageNet classification with deep convolutional neural networks. In Proceedings of the 25th International Conference on Neural Information Processing Systems - Volume 1, NIPS’12, USA, pp. 1097–1105. External Links: Link Cited by: §2.1, §2.2.
  • [15] Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel (1989-12) Backpropagation applied to handwritten zip code recognition. Neural Comput. 1 (4), pp. 541–551. External Links: ISSN 0899-7667, Link, Document Cited by: §2.1.
  • [16] Y. LeCun, L. Jackel, L. Bottou, A. Brunot, C. Cortes, J. Denker, H. Drucker, I. Guyon, U. Muller, E. Sackinger, P. Simard, and V. Vapnik (1995) Comparison of learning algorithms for handwritten digit recognition. In International Conference on Artificial Neural Networks, pp. 53–60. Cited by: §2.1.
  • [17] Y. LeCun, B. E. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. E. Hubbard, and L. D. Jackel (1990) Handwritten digit recognition with a back-propagation network. In Advances in Neural Information Processing Systems 2, D. S. Touretzky (Ed.), pp. 396–404. External Links: Link Cited by: §2.1.
  • [18] D. Li, H. Wu, J. Zhang, and K. Huang (2017) A2-RL: aesthetics aware reinforcement learning for automatic image cropping. CoRR abs/1709.04595. External Links: Link, 1709.04595 Cited by: §2.3, Table 6.3, Table 6.4.
  • [19] T. Lin, P. Dollár, R. B. Girshick, K. He, B. Hariharan, and S. J. Belongie (2016) Feature pyramid networks for object detection. CoRR abs/1612.03144. External Links: Link, 1612.03144 Cited by: §5.3.1.
  • [20] T. Lin, P. Goyal, R. B. Girshick, K. He, and P. Dollár (2017) Focal loss for dense object detection. CoRR abs/1708.02002. External Links: Link, 1708.02002 Cited by: §2.2, §2.2, Figure 5.4, §5.3.1, §5.3.1, §5.3.1, §6.2, Table 6.2.
  • [21] T. Lin, M. Maire, S. J. Belongie, L. D. Bourdev, R. B. Girshick, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick (2014) Microsoft COCO: common objects in context. CoRR abs/1405.0312. External Links: Link, 1405.0312 Cited by: Figure 3.3, Figure 3.4, Chapter 3, §5.3.1, Table 6.2, Chapter 7.
  • [22] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. E. Reed, C. Fu, and A. C. Berg (2015) SSD: single shot multibox detector. CoRR abs/1512.02325. External Links: Link, 1512.02325 Cited by: §2.2.
  • [23] X. Lu, Z. Lin, H. Jin, J. Yang, and J. Z. Wang (2014) RAPID: rating pictorial aesthetics using deep learning. In Proceedings of the 22Nd ACM International Conference on Multimedia, MM ’14, New York, NY, USA, pp. 457–466. External Links: ISBN 978-1-4503-3063-3, Link, Document Cited by: §2.1.
  • [24] N. Murray, L. Marchesotti, and F. Perronnin (2012-06) AVA: a large-scale database for aesthetic visual analysis. In 2012 IEEE Conference on Computer Vision and Pattern Recognition, Vol. , pp. 2408–2415. External Links: Document, ISSN 1063-6919 Cited by: §2.1, §2.3, §2.3, Figure 3.2, Chapter 3, §5.2.1, §6.1, Table 6.1, Chapter 7.
  • [25] M. Nishiyama, T. Okabe, Y. Sato, and I. Sato (2009) Sensation-based photo cropping. In Proceedings of the 17th ACM International Conference on Multimedia, MM ’09, New York, NY, USA, pp. 669–672. External Links: ISBN 978-1-60558-608-3, Link, Document Cited by: §2.3.
  • [26] T. Pedersen, S. Patwardhan, and J. Michelizzi (2004) WordNet::similarity: measuring the relatedness of concepts. In Demonstration Papers at HLT-NAACL 2004, HLT-NAACL–Demonstrations ’04, Stroudsburg, PA, USA, pp. 38–41. External Links: Link Cited by: §2.4.
  • [27] P. Poirson, P. Ammirato, C. Fu, W. Liu, J. Kosecka, and A. C. Berg (2016) Fast single shot detection and pose estimation. CoRR abs/1609.05590. External Links: Link, 1609.05590
  • [28] J. Redmon, S. K. Divvala, R. B. Girshick, and A. Farhadi (2015) You only look once: unified, real-time object detection. CoRR abs/1506.02640. External Links: Link, 1506.02640 Cited by: §2.2.
  • [29] S. Ren, K. He, R. B. Girshick, and J. Sun (2015) Faster R-CNN: towards real-time object detection with region proposal networks. CoRR abs/1506.01497. External Links: Link, 1506.01497 Cited by: §2.2, §2.2, Figure 5.6, §5.3.1, §5.3.1, Table 6.2.
  • [30] K. Simonyan and A. Zisserman (2014) Very deep convolutional networks for large-scale image recognition. CoRR abs/1409.1556. External Links: Link, 1409.1556 Cited by: §2.1.
  • [31] F. Stentiford (2007-01) Attention based auto image cropping. ICVS Workshop on Computational Attention & Application, pp. . Cited by: §2.3.
  • [32] J. R. R. Uijlings, K. E. A. van de Sande, T. Gevers, and A. W. M. Smeulders (2013) Selective search for object recognition. International Journal of Computer Vision 104 (2), pp. 154–171. External Links: Link Cited by: §2.2.
  • [34] W. Wang, M. Zhao, L. Wang, J. Huang, C. Cai, and X. Xu (2016) A multi-scene deep learning model for image aesthetic evaluation. Signal Processing: Image Communication 47, pp. 511 – 518. External Links: ISSN 0923-5965, Document, Link Cited by: §2.1.
  • [35] W. Wang and J. Shen (2017) Deep cropping via attention box prediction and aesthetics assessment. CoRR abs/1710.08014. External Links: Link, 1710.08014 Cited by: §2.3, §2.3, Table 6.1, Table 6.3.
  • [36] M. D. Zeiler and R. Fergus (2013) Visualizing and understanding convolutional networks. CoRR abs/1311.2901. External Links: Link, 1311.2901 Cited by: §2.1.
  • [37] B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba (2016) Learning deep features for discriminative localization. CVPR. Cited by: Figure 2.1, §2.1, §2.3, §5.2, §5.2.