News Cover Assessment via Multi-task Learning

07/17/2019 ∙ by Zixun Sun, et al. ∙ Tencent

Online personalized news products need a suitable cover image for each article. The news cover must be of high image quality and draw readers' attention at the same time, which is extraordinarily challenging due to the subjectivity of the task. In this paper, we assess news covers from the perspectives of image clarity and object salience. We propose an end-to-end multi-task learning network that performs image clarity assessment and semantic segmentation simultaneously; its outputs can guide news cover assessment. The proposed network is based on a modified DeepLabv3+ model: the network backbone extracts multi-scale spatial features, followed by two branches for image clarity assessment and semantic segmentation, respectively. Experimental results show that the proposed model captures important content in images and performs better than single-task learning baselines on our proposed game-content-based CIA dataset.



1. Introduction

With the growing demand for personalized information consumption, recommendation systems have achieved significant success in news applications (apps), such as "Kuaibao" and "Toutiao" in China, which aim at recommending high-quality articles based on individual interests. In the feed lists of those apps, only titles and cover images are exhibited, based on which readers decide whether to click and read the articles. A high-quality cover significantly increases the article click-through rate (CTR) and simultaneously improves readers' quality of experience (QoE). Thus cover image assessment is particularly crucial for information feed design.

In this paper, we focus on the task of cover assessment for game content, such as in Kings' Campsite, a professional generated content platform for players of Honor of Kings. A high-quality cover image should attract the user's attention and at the same time express the article content. Comparing the quality of images can be a very subjective task: what makes one image preferable to another depends on many factors and may vary across individuals. Intuitively, each news cover is chosen or cropped from the images in the article, and is required to be presented with high clarity and prominent objects. Motivated by this, we decompose the challenging task of determining cover images into two subtasks: image clarity assessment and image semantic segmentation (Mottaghi et al., 2014; Cordts et al., 2016; Chen et al., 2018; Chen et al., 2017, 2018). The main idea is that the semantic segmentation results help to quantify semantic information, including object proportion and position in images, which together with the image clarity assessment results serves as a guideline for news cover selection and cropping.
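To make the guideline concrete, the two semantic quantities named above (object proportion and position) can be read directly off a predicted binary mask. The following is a minimal illustrative sketch in plain Python; the function name and representation are ours, not part of the paper's implementation.

```python
# Minimal sketch (ours, not the paper's code): given a binary foreground
# mask, compute the object proportion and the object's centroid position,
# the two semantic quantities a cover selection/cropping rule could use.

def object_stats(mask):
    """mask: 2D list of 0/1 pixels. Returns (proportion, centroid), where
    centroid is a (row, col) pair or None when there is no foreground."""
    total = sum(len(row) for row in mask)
    fg = [(r, c) for r, row in enumerate(mask)
          for c, v in enumerate(row) if v]
    proportion = len(fg) / total
    if not fg:
        return proportion, None
    cr = sum(r for r, _ in fg) / len(fg)
    cc = sum(c for _, c in fg) / len(fg)
    return proportion, (cr, cc)
```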

In recent years, Convolutional Neural Networks (CNNs) have shown great progress in computer vision research, especially since "AlexNet" appeared and achieved striking results in the ImageNet competition in 2012 (Krizhevsky et al., 2012). CNN classification performance was further improved in 2014, when VGGNet (Simonyan and Zisserman, 2014) and GoogLeNet (Szegedy et al., 2015), with deeper and wider network architectures, achieved high performance in the 2014 ILSVRC (Russakovsky et al., 2015) classification challenge. With the advent of ResNet (He et al., 2016), network architectures became even deeper and achieved still higher performance.

The success of CNNs has spurred research on their application to a variety of computer vision tasks, e.g., image quality assessment (IQA) (Kang et al., 2014; Liu et al., 2017; Bosse et al., 2018; Talebi and Milanfar, 2018) and semantic segmentation (Mottaghi et al., 2014; Cordts et al., 2016; Chen et al., 2014, 2018; Chen et al., 2017, 2018). These studies yield significant improvements over earlier hand-crafted features. On one hand, semantic segmentation performance has been successfully improved by applying Deep Convolutional Neural Networks (DCNNs), as shown by the outstanding performance of the DeepLab series of networks (Chen et al., 2014, 2018; Chen et al., 2017, 2018), especially DeepLabv3 (Chen et al., 2017) and DeepLabv3+ (Chen et al., 2018). On the other hand, image quality with respect to human perception can be accurately predicted with CNN-based methods (Kang et al., 2014; Liu et al., 2017; Bosse et al., 2018; Talebi and Milanfar, 2018).

Motivated by the ability of DeepLabv3+ (Chen et al., 2018) to capture contextual information at multiple scales, in this paper we employ DeepLabv3+ as the network architecture and build a multi-task learning network to assess news covers from the image clarity and semantic segmentation perspectives, as illustrated in Fig. 1. We focus on pursuing a practical game content cover image assessment system, which jointly performs image clarity assessment and semantic segmentation prediction. We experimentally verify the effectiveness of the proposed model on our Cover Image Assessment (CIA) dataset. Another contribution of our proposed multi-task learning network is that it addresses the memory limitations of modern graphics processing units (GPUs) in multi-task deep learning: the memory complexity of our end-to-end multi-task network model is independent of the number of tasks.

Figure 1. Multi-task learning network architecture.

Our proposed multi-task learning network extends DeepLabv3+ by adding an image clarity assessment branch.

2. Related Work

We briefly review the literature related to our approach below.

Image quality assessment (IQA). The human visual system is highly sensitive to edge and contour information in an image (Martini et al., 2012). Some IQA studies take edge structure information as the main consideration for image quality; for example, in (Cohen and Yitzhaky, 2010) the authors apply edge information to detect both blur and noise, which are the major factors in image quality degradation. In (Ni et al., 2016), an edge model is employed to extract salient edge information for screen content image assessment, which outperformed the other state-of-the-art IQA models of the day.

In recent years, the idea of employing CNN-based approaches for no-reference IQA (NR-IQA) tasks has been rising, and the performance of NR-IQA has been significantly improved by such methods (Niu et al., 2019; Kang et al., 2014). For example, in (Kang et al., 2014), a CNN is directly utilized for image quality prediction without a reference image, integrating feature learning and regression into one optimization process. One common ground behind those models is that their network architectures are shallow and narrow, and thus not deep enough for learning high-level features. The emergence of deeper CNNs, such as ResNet-101 (He et al., 2016) and Xception (Chollet, 2017), further promotes the representational abilities of such models. For example, DeepLabv3+ (Chen et al., 2018) employs atrous convolution to extract dense feature maps and capture global multi-scale context, resulting in significant performance improvement on semantic segmentation tasks. In (Niu et al., 2019), a DeepLab-based network is applied to extract spatial features of hyperspectral images and achieves outstanding performance.

Most IQA studies consider low-level features such as color or texture, which are not sufficient for news cover assessment. In news cover assessment, high-level object features are also key factors for subsequent cover image selection or cropping. In this paper, we take both low-level image clarity features and high-level object features into consideration for news cover assessment.

Multi-task learning (MTL). MTL is based on the fundamental idea that different tasks can share a common low-level representation. In many computer vision tasks, MTL has exhibited advantages in performance improvement and memory saving. In (Kokkinos, 2017), a unified architecture that jointly learns low-, mid-, and high-level vision tasks is introduced. With such a universal network, the tasks of boundary detection, normal estimation, saliency estimation, semantic segmentation, semantic boundary detection, proposal generation, and object detection can be addressed simultaneously. In (Misra et al., 2016), a multi-task learning network with "cross-stitch" units is proposed, which shows dramatically improved performance over single-task baselines on the NYUv2 dataset (Silberman et al., 2012). However, prior studies have not explored a multi-task learning architecture for IQA and semantic segmentation, which is the approach we target in this work.

3. Method

Inspired by its superior performance on semantic segmentation tasks and its multi-scale spatial features for capturing the interesting parts of an image, the DeepLabv3+ network (Chen et al., 2018) is chosen as the basis for the proposed multi-task learning based cover image assessment network. Our proposed network architecture is illustrated in Fig. 1; it comprises a feature extractor module, an image clarity assessment module, and a semantic segmentation module. The proposed multi-task learning network modifies DeepLabv3+ by adding an image clarity assessment branch while retaining the encoder-decoder architecture of DeepLabv3+ for the semantic segmentation task. The network first processes the whole image with a deep convolutional neural network (DCNN) and subsequently employs atrous convolution to produce a feature map with multi-scale contextual information. This feature map is then shared between the clarity assessment module and the semantic segmentation module.
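The data flow just described (one shared feature extractor feeding two task branches) can be sketched schematically as below. This is a toy structural illustration, not the authors' network: real tensors and learned layers are replaced by simple numeric stand-ins, but the key point survives, namely that the backbone is computed once and consumed by both heads.

```python
# Structural sketch of the shared-backbone, two-branch design (ours, not
# the paper's implementation). The "features" are toy numbers standing in
# for real feature maps.

def backbone(image):
    """Stand-in for the DCNN + atrous feature extractor: produces one
    shared feature vector (here, per-row means)."""
    return [sum(row) / len(row) for row in image]

def clarity_head(features):
    """Stand-in for the regression branch: one scalar clarity score."""
    return sum(features) / len(features)

def segmentation_head(features, image):
    """Stand-in for the decoder branch: a binary mask of the input size."""
    mean = sum(features) / len(features)
    return [[1 if px > mean else 0 for px in row] for row in image]

def forward(image):
    shared = backbone(image)                 # computed once...
    score = clarity_head(shared)             # ...consumed by branch 1
    mask = segmentation_head(shared, image)  # ...and by branch 2
    return score, mask
```

Because the backbone is shared, adding a task head adds only that head's parameters and activations, which is the source of the memory saving discussed later.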

3.1. Deep image clarity assessment learning

The proposed multi-task learning network has two output layers. The first layer outputs the image clarity assessment score $\hat{s}_i$ for the image indexed by $i$. The second layer outputs a predicted binary mask matrix that distinguishes foreground from background in the image.

In the image clarity assessment module, the feature map is regressed by a sequence of conv3-256, maxpool, conv3-256, maxpool, conv3-256, maxpool, one average pooling layer, and one FC-1 layer. The convolutional layers apply $3 \times 3$ convolution kernels (the conv3-256 notation) and are activated by a rectified linear unit (ReLU), and the max pooling layers downsample the feature map. The output of the FC-1 layer is activated by a sigmoid unit, yielding the estimated image clarity score $\hat{s}_i$ of image $i$. Since image clarity assessment is a regression task, we choose Mean Square Error (MSE) as the loss function for training the image clarity assessment branch:

    L_{cla} = \frac{1}{N} \sum_{i=1}^{N} (\hat{s}_i - s_i)^2,    (1)

where $s_i$ is the ground-truth clarity score of image $i$ and $N$ is the number of training images.
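As a sanity check of Eq. (1), a plain-Python sketch of the MSE loss over a batch of clarity scores might look as follows (the paper trains with framework tensors; the function name is ours):

```python
def mse_loss(pred_scores, true_scores):
    """Mean squared error between predicted and ground-truth clarity
    scores, averaged over the N images, as in Eq. (1)."""
    assert len(pred_scores) == len(true_scores)
    n = len(pred_scores)
    return sum((p - t) ** 2 for p, t in zip(pred_scores, true_scores)) / n
```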

3.2. Deep semantic segmentation learning

For the task of semantic image segmentation, our proposed multi-task learning network keeps the encoder-decoder architecture of DeepLabv3+. In this paper, the feature map from the feature extractor is computed with output stride 16, which is consistent with DeepLabv3+. After a convolution operation and one simple bilinear upsampling with factor 4, the feature map is concatenated with the low-level features from the DCNN and then refined by a few convolutions. Finally, the refined feature map is upsampled by another simple bilinear upsampling with factor 4 and fed into the semantic segmentation prediction.

Note that images with low clarity may introduce noise during semantic segmentation network training. Denote by $y_{ij}$ the binary annotation of the $j$-th pixel of image $i$: a ground-truth label $y_{ij} = 1$ means foreground (object) at the $j$-th pixel and $y_{ij} = 0$ means background. Denote by $\hat{y}_{ij}$ the predicted mask value of the $j$-th pixel of image $i$. The cross-entropy loss function is applied for semantic segmentation,

    L_{seg}(i) = -\sum_{j} \left[ y_{ij} \log \hat{y}_{ij} + (1 - y_{ij}) \log (1 - \hat{y}_{ij}) \right].    (2)
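A plain-Python sketch of the per-image cross-entropy of Eq. (2), with ground-truth 0/1 labels and predicted foreground probabilities, could look like this (function name ours):

```python
import math

def seg_loss(true_mask, pred_mask):
    """Per-image cross-entropy of Eq. (2): sums -[y log p + (1-y) log(1-p)]
    over all pixels. true_mask holds 0/1 labels, pred_mask holds predicted
    foreground probabilities in (0, 1)."""
    loss = 0.0
    for y_row, p_row in zip(true_mask, pred_mask):
        for y, p in zip(y_row, p_row):
            loss -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return loss
```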

3.3. End-to-End multi-task learning network

With the above definitions, we minimize an objective function following the multi-task loss in our proposed network model. To avoid the influence of heavily distorted images, we train the semantic segmentation branch only over images with relatively high clarity, i.e., images whose ground-truth clarity score exceeds a fixed threshold. Our multi-task loss function is defined as:

    L = \frac{1}{N} \sum_{i=1}^{N} \left[ (\hat{s}_i - s_i)^2 + \lambda \, \delta_i \, L_{seg}(i) \right],    (3)

where $N$ denotes the number of training images, the hyper-parameter $\lambda$ controls the balance between the two task losses, and $\delta_i$ is the binary clarity indicator: $\delta_i = 1$ if the ground-truth clarity score of image $i$ exceeds the threshold, and $\delta_i = 0$ otherwise. By convention, an image with low clarity is labeled $\delta_i = 0$, and hence its segmentation loss term is ignored.
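Putting Eqs. (1)-(3) together, the gated multi-task objective can be sketched as below. The threshold argument `tau` stands in for the paper's clarity threshold, whose exact value is not recoverable from this text; all names are ours.

```python
def multitask_loss(pred_scores, true_scores, seg_losses, lam, tau):
    """Multi-task loss of Eq. (3): per image, the squared clarity error
    plus lambda times the segmentation loss, where the segmentation term
    is gated off (indicator 0) for images whose ground-truth clarity
    score is below the threshold tau. seg_losses[i] is L_seg(i)."""
    n = len(true_scores)
    total = 0.0
    for s_hat, s, l_seg in zip(pred_scores, true_scores, seg_losses):
        indicator = 1.0 if s >= tau else 0.0  # low-clarity images skipped
        total += (s_hat - s) ** 2 + lam * indicator * l_seg
    return total / n
```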

4. Experimental Results

In this section, we introduce our dataset and conduct a number of experiments to evaluate the performance of our approach.

4.1. Cover Image Assessment Dataset

(a) Image clarity score = 9.5
(b) Image clarity score = 7.5
(c) Image clarity score = 4.22
(d) Image clarity score = 1.88
Figure 2. Example images from the CIA dataset with their image clarity scores

We experiment on the Cover Image Assessment (CIA) dataset, which consists of game images from Honor of Kings, a multiplayer online battle mobile game. Each image in the dataset is annotated at the pixel level and also labeled with one clarity score. Distinguished from previous image quality assessment datasets, all of which provide the visual quality of distorted images, the CIA dataset assesses images along two dimensions, image clarity and semantic information, which is more practical for cover image assessment. Next, we introduce the detailed collection process of this dataset and provide its statistics.

We collect a batch of original Honor of Kings game videos, in two resolutions, from the Tencent Video website. All the videos are of high quality and their clarity is consistent. We then extract 1021 frames from those videos; the extracted images contain 98 hero (object) classes and one background class. Our annotation tasks include two aspects. One is pixel-level mask annotation for the semantic segmentation task, in which we annotate the heroes in the images. The other is image clarity assessment: we distort each image to 10 levels through Gaussian blur and quantize the corresponding clarity scores accordingly, where a higher clarity score (10 is maximal) corresponds to higher image clarity. A few examples with ratings associated with different levels of clarity are illustrated in Fig. 2. Through this distortion process the CIA dataset is augmented to 8651 images, each with one clarity score and one segmentation mask.
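The clarity labeling step can be illustrated with a small sketch. The linear mapping from blur level to score below is an assumption for illustration only; the paper's exact quantization rule is not specified in this text.

```python
# Hedged sketch of the clarity labeling step: frames are blurred at 10
# levels and a score with maximum 10 is assigned. The linear level->score
# mapping here is our assumption, not the paper's exact rule.

def clarity_score(blur_level, num_levels=10, max_score=10.0):
    """Map blur level 0 (undistorted) .. num_levels-1 (strongest blur)
    to a score in (0, max_score]; higher score = clearer image."""
    if not 0 <= blur_level < num_levels:
        raise ValueError("blur level out of range")
    return max_score * (num_levels - blur_level) / num_levels
```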

4.2. Experimental Setup

We investigate a number of network backbones and training strategies, and evaluate their performance through a standard semantic segmentation metric and a clarity assessment metric. The network backbone is based on popular deep models such as MobileNet (Sandler et al., 2018), Xception (Chollet, 2017), and ResNet-101 (He et al., 2016), and is trained on our proposed CIA dataset. The dataset is randomly split into 7786 images for the training set and 865 images for the testing set. In both the training and testing sets, the proportion of distorted to undistorted images is kept fixed, which guarantees that undistorted images are seen by the network during training. Random cropping to a fixed size is employed during training as a data augmentation method. We use the PyTorch framework and employ Stochastic Gradient Descent (SGD) with momentum for network training. The models are trained for a fixed number of epochs, which ensures the convergence of all models. The batch size and the hyper-parameter $\lambda$ are held fixed during both training and testing. The learning rate follows a polynomial decay policy.
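The polynomial decay policy can be sketched as follows; the decay power of 0.9, common in DeepLab-style training schedules, is our assumption, since the paper's factor is not recoverable from this text.

```python
def poly_lr(base_lr, step, max_steps, power=0.9):
    """Polynomial learning-rate decay:
    lr = base_lr * (1 - step / max_steps) ** power.
    power=0.9 is an assumed value (common in DeepLab-style training),
    not confirmed by the paper."""
    return base_lr * (1.0 - step / max_steps) ** power
```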

In this paper, we train the multi-task learning network with two strategies, referred to as end-to-end training and multi-stage training. Specifically, in end-to-end training, the image clarity assessment and semantic segmentation tasks are trained simultaneously, with the multi-task loss (3) minimized in each iteration. In multi-stage training, the semantic segmentation branch is trained first, and the converged feature map from the feature extractor is shared with the clarity assessment module; we then fine-tune the subsequent convolutional and fully-connected layers with mini-batches of 8.

We compute the Pearson Linear Correlation Coefficient (LCC) and the mean Intersection over Union (mIoU) as evaluation metrics for the image clarity assessment task and the semantic segmentation task, respectively. The LCC is computed as follows:

    \mathrm{LCC} = \frac{\sum_{i} (s_i - \bar{s})(\hat{s}_i - \bar{\hat{s}})}{\sqrt{\sum_{i} (s_i - \bar{s})^2} \sqrt{\sum_{i} (\hat{s}_i - \bar{\hat{s}})^2}},    (4)

where $\bar{s}$ and $\bar{\hat{s}}$ are the means of the ground-truth and predicted clarity scores, respectively. LCC measures the linear correlation between the predicted image clarity scores and the ground truth; a larger LCC value implies more accurate image clarity assessment. The mIoU is defined over image pixels, following the standard protocol (Everingham et al., 2010); a larger mIoU means higher semantic segmentation accuracy. From the semantic segmentation results we can easily compute the corresponding semantic information, which together with the predicted image clarity score helps with news cover assessment.
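Both evaluation metrics are straightforward to compute; the sketch below gives plain-Python versions of the LCC of Eq. (4) and a per-image mIoU (function names ours; mIoU is computed over flat label lists for brevity):

```python
import math

def lcc(true_scores, pred_scores):
    """Pearson linear correlation coefficient of Eq. (4)."""
    n = len(true_scores)
    mt = sum(true_scores) / n
    mp = sum(pred_scores) / n
    num = sum((t - mt) * (p - mp) for t, p in zip(true_scores, pred_scores))
    den = (math.sqrt(sum((t - mt) ** 2 for t in true_scores))
           * math.sqrt(sum((p - mp) ** 2 for p in pred_scores)))
    return num / den

def miou(true_labels, pred_labels, num_classes=2):
    """Mean IoU over classes for flat lists of per-pixel class labels;
    classes absent from both prediction and ground truth are skipped."""
    ious = []
    for c in range(num_classes):
        inter = sum(1 for t, p in zip(true_labels, pred_labels)
                    if t == c and p == c)
        union = sum(1 for t, p in zip(true_labels, pred_labels)
                    if t == c or p == c)
        if union:
            ious.append(inter / union)
    return sum(ious) / len(ious)
```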

4.3. Performance Evaluation

We first evaluate our proposed multi-task learning network performance over several network backbones on CIA dataset, and then further demonstrate the effectiveness of our proposed multi-task learning model through comparing with single-task models.

(a) Image clarity score = 8.2036 (ground truth 9.5), IoU = 0.8369, object proportion = 1.49%
(b) Image clarity score = 5.7954 (ground truth 5.625), IoU = 0.7534, object proportion = 1.57%
(c) Image clarity score = 4.5992 (ground truth 4.2188), IoU = 0.7514, object proportion = 2.13%
(d) Image clarity score = 2.3295 (ground truth 2.31), IoU = 0.7518, object proportion = 2.42%
Figure 3. Example images from the CIA dataset processed by our multi-task network. Predicted image clarity scores (with ground truth in parentheses), predicted IoU, and corresponding object proportions are shown below each image.

Effectiveness of Network Structure. To demonstrate the ability of our proposed multi-task learning network structure to assess image clarity and predict semantic segmentation simultaneously, we train the multi-task learning network on the CIA dataset and compare the performance of different network backbones: MobileNet, Xception, and ResNet-101. The network is trained in an end-to-end manner, and the network backbones are initialized with weights pre-trained on ImageNet. Fig. 3 shows a few predicted results from the CIA dataset under our proposed multi-task learning network. The segmentation results are visualized in Fig. 3(a)-3(d), and the predicted image clarity scores, ground-truth scores, predicted IoU, and corresponding object proportions are shown below each image. The object proportion is computed from the corresponding semantic segmentation. The results in Fig. 3 suggest that our proposed multi-task learning network works: it predicts image clarity and semantic segmentation simultaneously, which helps in designing cover image selection and cropping strategies. The detailed performance under different network backbones is summarized in Table 1. As can be seen, ResNet-101 achieves the best mIoU and an LCC on par with MobileNet's, and thus performs best overall. Based on these results, we choose ResNet-101 as our network backbone in the following studies.

Type/Network    MobileNet    Xception    ResNet-101
LCC             0.9809       0.9559      0.9805
mIoU            0.7603       0.7165      0.7746
Table 1. Performance evaluation on the CIA dataset for different network backbones
Models    Image clarity only    Semantic segmentation only    Multi-task
LCC       0.9741                –                             0.9805
mIoU      –                     0.7618                        0.7746
Table 2. Performance comparison among different models

Effectiveness of multi-task learning. Next, to further demonstrate whether the proposed multi-task learning model improves results, we train baseline networks with the image clarity assessment branch only and with the semantic segmentation branch only. The comparison results are shown in Table 2, where the image clarity and semantic segmentation columns correspond to the single-task learning models. Our proposed multi-task learning network outperforms both the single-task semantic segmentation model and the single-task clarity assessment model, in terms of mIoU and LCC respectively, which verifies the effectiveness of the proposed multi-task learning model. Besides, when multiple tasks are learned with several single-task networks, the memory demand of the back-propagation algorithm grows linearly with the number of tasks. Under our proposed end-to-end multi-task network model, in contrast, the memory complexity is independent of the number of tasks. This indicates that the proposed multi-task learning network has advantages in fulfilling multiple tasks and targets with a limited memory budget.

Comparison between two different training strategies. Finally, we evaluate the impact of different training strategies on the proposed multi-task learning based model. Instead of only training the multi-task learning network end-to-end, we also evaluate the performance of the proposed network under multi-stage training, in which the clarity assessment module is fine-tuned on the multi-scale spatial feature map from the image segmentation network pre-trained on the CIA dataset. The comparison results are presented in Table 3; the end-to-end training method clearly leads to better performance on both image clarity assessment and segmentation.

Training strategy    End-to-end    Multi-stage
LCC                  0.9805        0.8548
mIoU                 0.7746        0.7470
Table 3. Performance comparison between different training strategies

5. Conclusion

News cover assessment is a critical step toward suitable cover photo generation. To address the challenges in news cover assessment, we have introduced an end-to-end multi-task learning network, which predicts image clarity and semantic segmentation in parallel and reduces memory consumption compared to executing the tasks independently. Various news cover selection and cropping strategies can then be designed using the image clarity assessment results and the semantic information on salient objects. We have experimentally shown the effectiveness of the proposed multi-task learning network on our game-content-based CIA dataset. Designing news cover selection and cropping strategies according to the predicted image clarity and the semantic information on salient objects is our main future research direction. Research in these directions is underway, and we consider the work in this paper a first step toward news cover assessment.

References

  • Bosse et al. (2018) Sebastian Bosse, Dominique Maniry, Klaus-Robert Müller, Thomas Wiegand, and Wojciech Samek. 2018. Deep neural networks for no-reference and full-reference image quality assessment. IEEE Transactions on Image Processing 27, 1 (2018), 206–219.
  • Chen et al. (2014) Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, and Alan L Yuille. 2014. Semantic image segmentation with deep convolutional nets and fully connected crfs. arXiv preprint arXiv:1412.7062 (2014).
  • Chen et al. (2018) Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, and Alan L Yuille. 2018. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE transactions on pattern analysis and machine intelligence 40, 4 (2018), 834–848.
  • Chen et al. (2017) Liang-Chieh Chen, George Papandreou, Florian Schroff, and Hartwig Adam. 2017. Rethinking atrous convolution for semantic image segmentation. arXiv preprint arXiv:1706.05587 (2017).
  • Chen et al. (2018) Liang-Chieh Chen, Yukun Zhu, George Papandreou, Florian Schroff, and Hartwig Adam. 2018. Encoder-decoder with atrous separable convolution for semantic image segmentation. In Proceedings of the European Conference on Computer Vision (ECCV). 801–818.
  • Chollet (2017) François Chollet. 2017. Xception: Deep learning with depthwise separable convolutions. In Proceedings of the IEEE conference on computer vision and pattern recognition. 1251–1258.
  • Cohen and Yitzhaky (2010) Erez Cohen and Yitzhak Yitzhaky. 2010. No-reference assessment of blur and noise impacts on image quality. Signal, image and video processing 4, 3 (2010), 289–302.
  • Cordts et al. (2016) Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo Rehfeld, Markus Enzweiler, Rodrigo Benenson, Uwe Franke, Stefan Roth, and Bernt Schiele. 2016. The cityscapes dataset for semantic urban scene understanding. In Proceedings of the IEEE conference on computer vision and pattern recognition. 3213–3223.
  • Everingham et al. (2010) Mark Everingham, Luc Van Gool, Christopher KI Williams, John Winn, and Andrew Zisserman. 2010. The pascal visual object classes (voc) challenge. International journal of computer vision 88, 2 (2010), 303–338.
  • He et al. (2016) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition. 770–778.
  • Kang et al. (2014) Le Kang, Peng Ye, Yi Li, and David Doermann. 2014. Convolutional neural networks for no-reference image quality assessment. In Proceedings of the IEEE conference on computer vision and pattern recognition. 1733–1740.
  • Kokkinos (2017) Iasonas Kokkinos. 2017. Ubernet: Training a universal convolutional neural network for low-, mid-, and high-level vision using diverse datasets and limited memory. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 6129–6138.
  • Krizhevsky et al. (2012) Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. 2012. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems. 1097–1105.
  • Liu et al. (2017) Xialei Liu, Joost van de Weijer, and Andrew D Bagdanov. 2017. Rankiqa: Learning from rankings for no-reference image quality assessment. In Proceedings of the IEEE International Conference on Computer Vision. 1040–1049.
  • Martini et al. (2012) Maria G Martini, Chaminda TER Hewage, and Barbara Villarini. 2012. Image quality assessment based on edge preservation. Signal Processing: Image Communication 27, 8 (2012), 875–882.
  • Misra et al. (2016) Ishan Misra, Abhinav Shrivastava, Abhinav Gupta, and Martial Hebert. 2016. Cross-stitch networks for multi-task learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 3994–4003.
  • Mottaghi et al. (2014) Roozbeh Mottaghi, Xianjie Chen, Xiaobai Liu, Nam-Gyu Cho, Seong-Whan Lee, Sanja Fidler, Raquel Urtasun, and Alan Yuille. 2014. The role of context for object detection and semantic segmentation in the wild. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 891–898.
  • Ni et al. (2016) Zhangkai Ni, Lin Ma, Huanqiang Zeng, Canhui Cai, and Kai-Kuang Ma. 2016. Screen content image quality assessment using edge model. In 2016 IEEE International Conference on Image Processing (ICIP). IEEE, 81–85.
  • Niu et al. (2019) Zijia Niu, Wen Liu, Jingyi Zhao, and Guoqian Jiang. 2019. DeepLab-Based Spatial Feature Extraction for Hyperspectral Image Classification. IEEE Geoscience and Remote Sensing Letters 16, 2 (2019), 251–255.
  • Russakovsky et al. (2015) Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. 2015. Imagenet large scale visual recognition challenge. International journal of computer vision 115, 3 (2015), 211–252.
  • Sandler et al. (2018) Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, and Liang-Chieh Chen. 2018. Mobilenetv2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 4510–4520.
  • Silberman et al. (2012) Nathan Silberman, Derek Hoiem, Pushmeet Kohli, and Rob Fergus. 2012. Indoor segmentation and support inference from rgbd images. In European Conference on Computer Vision. Springer, 746–760.
  • Simonyan and Zisserman (2014) Karen Simonyan and Andrew Zisserman. 2014. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014).
  • Szegedy et al. (2015) Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. 2015. Going deeper with convolutions. In Proceedings of the IEEE conference on computer vision and pattern recognition. 1–9.
  • Talebi and Milanfar (2018) Hossein Talebi and Peyman Milanfar. 2018. NIMA: Neural image assessment. IEEE Transactions on Image Processing 27, 8 (2018), 3998–4011.