We address regression problems, which aim at analyzing the relationship between dependent variables (targets) and independent variables (inputs). Regression has been applied to a variety of computer vision problems, including crowd counting [shi2018crowd], age estimation [huo2016deep], affective computing [ponce2016chalearn], image super-resolution [tai2017image] and visual tracking [zhang2017robust]. Pioneering works in this area typically learn a mapping function from hand-crafted features (e.g., Histogram of Oriented Gradients (HOG), Scale-Invariant Feature Transform (SIFT)) to the desired output (e.g., ages, affective scores or density maps).
Recently, transforming a regression problem into an optimizable robust loss function jointly trained with a deep Convolutional Neural Network (CNN) has proven successful to some extent. Most existing deep-learning based regression approaches optimize the L2 loss function together with a regularization term, where the goal is to minimize the mean square error between the network prediction and the ground truth. However, it is well known that the mean square error is sensitive to outliers, which are essentially the samples that lie at an abnormal distance from other training samples in the objective space. In this case, samples that are rarely encountered in the training data may have a disproportionally high weight and consequently influence the training procedure by reducing the generalization ability. To this end, Girshick [girshick2015fast] introduced the SmoothL1 loss for bounding box regression. As a special case of the Huber loss [huber1964robust], the SmoothL1 loss combines the concepts of the L2 and L1 losses: it behaves as an L1 loss when the absolute value of the error is high and switches to an L2 loss when the absolute value of the error is close to zero. In addition, Belagiannis et al. [belagiannis2015robust] proposed a deep regression network that achieves robustness to outliers by minimizing Tukey's biweight function [huber2011robust, black1996unification].
While tremendous progress has since been achieved by employing robust statistical estimation together with specially designed network architectures to explicitly address outliers, such methods may still fail to generalize well in practice. As studied in [dietterich2000ensemble], a single model can be suboptimal for statistical, computational and representational reasons. To this end, a great deal of research has gone into designing multiple regression systems [dietterich2000ensemble, breiman2001random, ren2016ensemble, qiu2017oblique]. However, existing methods for ensembling CNNs [walach2016learning, han2016incremental, cortes2014deep] typically train multiple CNNs, which usually leads to much larger computational complexity and hardware consumption. Thus, these CNN ensembles are rarely used in practical systems.
In this paper, we propose a Deep Negative Correlation Learning (DNCL) approach which learns a pool of diversified regressors in a "divide and conquer" manner. Each regressor is jointly optimized with the CNN by an amended cost function, which penalizes correlations with the others. Our approach inherits the advantage of traditional Negative Correlation Learning (NCL) [liu2000evolutionary, brown2005managing] approaches, which systematically control the bias-variance-covariance trade-off in the ensemble. Firstly, by dividing the task into multiple "negatively correlated" sub-problems, the proposed method shares the essence of ensemble learning and yields more robust estimations than a single network [ren2016ensemble, brown2005managing]. Secondly, thanks to the rich feature hierarchies in deep networks, each sub-problem can be solved by a feature subset. In this way, the proposed method has a similar number of parameters to a single network and is thus much more efficient than most existing deep ensemble learning methods [walach2016learning, han2016incremental, cortes2014deep]. Simplicity and efficiency are central to our design, and the proposed method is complementary to other advanced strategies for the individual regression tasks.
A preliminary version of this work was presented in CVPR 2018 [shi2018crowd], which applied DNCL to crowd counting. This paper extends the initial version in the following aspects:
We provide more theoretical insights on the Rademacher complexity.
We extend the original work to deal with more regression based problems, which allows the use of state-of-the-art network structures that give an important boost to performance for the proposed method.
A more comprehensive literature review, considerable new analysis and intuitive explanations are added to the initial results.
2 Related Work
We first briefly introduce the commonly used loss function for regression based deep learning computer vision tasks, followed by summarizing the existing ensemble regression techniques.
Deep Regression. Recently, learning a mapping function to predict a set of interdependent continuous values with deep networks has become popular. One example is object detection, where the target is to regress the bounding box for precise localization [ren2015faster]. Other examples include regressing facial points in facial landmark detection [sun2013deep] and body positions in human pose estimation [toshev2014deeppose]. The L2 loss function is a natural choice for solving such problems. Zhang et al. [zhang2014facial] further utilized L2 regularization to increase the robustness of the network for both landmark detection and attribute classification. Similar strategies were also applied in object detection [wang2014deep].
The commonly used L2 loss in regression problems may not generalize well in the presence of outliers, because outliers can have a disproportionally high weight and consequently influence the training procedure by reducing the generalization ability and increasing the convergence time. To this end, the SmoothL1 loss [girshick2015fast] was reported to be more robust than the L2 loss when outliers are present in the dataset:

\[ \mathrm{SmoothL1}(r) = \begin{cases} 0.5\, r^2 & \text{if } |r| < 1, \\ |r| - 0.5 & \text{otherwise}, \end{cases} \tag{1} \]

where $r$ stands for the prediction error. Motivated by the recent success of robust statistics [huber2011robust], an M-estimator based [black1996unification] loss function, called the Tukey loss [belagiannis2015robust], was proposed for both human pose estimation and age estimation [belagiannis2015robust]. More specifically,

\[ \mathrm{Tukey}(r_i) = \begin{cases} \dfrac{c^2}{6}\left[1 - \left(1 - \left(\dfrac{r_i}{c}\right)^2\right)^3\right] & \text{if } |r_i| \le c, \\[2mm] \dfrac{c^2}{6} & \text{otherwise}, \end{cases} \tag{2} \]

where $c$ is a tuning parameter, commonly set to $4.6851$, which gives approximately $95\%$ asymptotic efficiency as L2 minimization on the standard normal distribution of residuals. $r_i$ is a scaled version of the residual $y_i - \hat{y}_i$, obtained by computing the median absolute deviation (MAD):

\[ r_i = \frac{y_i - \hat{y}_i}{1.4826 \times \mathrm{MAD}}, \qquad \mathrm{MAD} = \operatorname*{median}_{i} \Big| (y_i - \hat{y}_i) - \operatorname*{median}_{j} (y_j - \hat{y}_j) \Big|, \tag{3} \]

where $y_i$, $\hat{y}_i$ and $N$ stand for the ground-truth label, the predicted result and the number of data samples, respectively. In the case of regressing multiple outputs, the MAD values are calculated independently for each output dimension.
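For reference, the two robust losses and the MAD-scaled residual discussed above can be sketched in a few lines of NumPy. This is a minimal sketch: the threshold of 1 for SmoothL1 and the constants 4.6851 and 1.4826 follow the common conventions in the cited works, and the function names are illustrative:

```python
import numpy as np

def smooth_l1(r, beta=1.0):
    """SmoothL1 (Huber-style) loss: quadratic near zero, linear for large residuals."""
    r = np.abs(r)
    return np.where(r < beta, 0.5 * r ** 2 / beta, r - 0.5 * beta)

def tukey_biweight(r, c=4.6851):
    """Tukey's biweight loss: bounded, so large residuals (outliers) incur a constant cost."""
    out = np.full_like(r, c ** 2 / 6.0, dtype=float)
    inside = np.abs(r) <= c
    out[inside] = (c ** 2 / 6.0) * (1.0 - (1.0 - (r[inside] / c) ** 2) ** 3)
    return out

def scaled_residuals(y, y_hat):
    """Scale residuals by the median absolute deviation (MAD); assumes MAD is nonzero."""
    r = y - y_hat
    mad = 1.4826 * np.median(np.abs(r - np.median(r)))
    return r / mad
```

Note how the bounded Tukey loss caps the contribution of any single outlier, while SmoothL1 only dampens it from quadratic to linear growth.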
Our proposed DNCL method could also be regarded as a loss function, which is readily pluggable into existing CNN architecture and amenable to training via backpropagation. Without extra parameters, the proposed methods mimic ensemble learning and have a better control of the trade-off between the intrinsic bias, variance and co-variance. We evaluate the proposed method on multiple challenging and diversified regression tasks. When combined with the state-of-the-art network structure, our method could give an important boost to the performance of existing loss functions mentioned above.
Ensemble Regression. Ensemble methods are widely regarded to be better than a single model if the ensemble is both "accurate" and "diversified" [dietterich2000ensemble, breiman2001random, ren2016ensemble]. As studied in [dietterich2000ensemble], a single model can be less generalizable from the statistical, computational and representational points of view. To this end, a great deal of research has gone into designing multiple regression systems. For instance, the accuracy and diversity in a typical decision tree ensemble [breiman2001random, barandiaran1998random, zhang2017robust, qiu2017oblique] were guaranteed by allowing each decision tree to grow to its maximum depth and by utilizing feature subspaces, respectively. Boosting [friedman2001greedy] generates a new regressor with an amended loss function based on the loss of the existing ensemble models.
Motivated by the success of ensemble methods, several deep regression ensembles have been proposed as well. However, existing methods for training CNN ensembles [walach2016learning, cortes2014deep] usually generate multiple CNNs separately. In this case, the resulting system yields a much larger computational complexity compared with single models and is thus usually very slow in terms of both training and inference, which naturally limits its applicability in resource-constrained scenarios.
One of the exceptions is the Deep Regression Forest (DRF) [Shen_2018_CVPR], which reformulated the split nodes as a fully connected layer of a CNN and learned the parameters of the CNN and the tree nodes jointly by an alternating strategy. Firstly, by fixing the leaf nodes, the internal nodes of the trees as well as the CNN were optimized by back-propagation. After that, both the CNN and the internal nodes were frozen and the leaf nodes were learned by iterating a step-size free and fast-converging update rule derived from Variational Bounding. We show that the proposed method can also be combined with the concept of DRF. The resulting system is much simpler to learn and yields a significant improvement, as elaborated in Section 4.3.
The proposed method is generic and can be applied to a wide range of regression tasks. It mimics ensemble learning without extra parameters and helps to learn more generalizable features through better control of the trade-off between the intrinsic bias, variance and covariance. We evaluate it on multiple challenging and diversified regression tasks, including crowd counting, age estimation, personality analysis and image super-resolution. Simplicity is central to our design, and the strategies adopted in the proposed method are complementary to many other specially designed techniques for each task. When combined with the state-of-the-art network structure for each task, our proposed method is able to yield an important boost over the baseline methods. Below we provide a detailed review of the recent advances in each task.
Crowd Counting. Counting by regression is currently perceived as the state-of-the-art. Regression-based methods have been widely studied and reported to be computationally feasible with modern hardware, robust to parameter choices and accurate across various challenging scenarios. A deep CNN [zhang2015cross] was trained alternately with two related learning objectives, crowd density classification and crowd counting. However, it relied heavily on a switchable learning approach, and it was not clear how these two objective functions can alternately assist each other. Wang et al. [wang2015deep] proposed to directly regress the total number of people by adopting AlexNet [krizhevsky2012imagenet], which has since been found to be worse than methods regressing the density map. This observation suggests that reasoning with the rich spatial layout information in convolutional feature maps is necessary. Boominathan et al. [boominathan2016crowdnet] proposed a framework consisting of both deep and shallow networks for crowd counting. It was reported to be more robust to scale variations, which have also been addressed explicitly by other studies [shi2018multiscale, zhang2016single, onoro2016towards]. Switching CNN was introduced in [sam2017switching], where patches from a grid within a crowd scene are relayed to independent CNN regressors based on the crowd count prediction quality of the CNNs established during training. Arteta et al. [arteta2016counting] augmented and interleaved density estimation with foreground-background segmentation and explicit local uncertainty estimation under a new deep multi-task architecture. Noroozi et al. [noroozi2017representation] used counting as a pretext task to train a neural network with a contrastive loss and showed improved results on transfer learning benchmarks.
Personality Analysis. Recent personality-related work with visual cues has attempted to identify personality from body movement [lepri2012connecting], facial expression change [biel2012facetube, sanchez2013inferring], combined acoustic cues [abadi2015inference], eye gaze [batrinca2011please], and so on. In addition, recognizing personality traits using deep learning on images or videos has also been extensively studied. The 'ChaLearn 2016 Apparent Personality Analysis competition' [ponce2016chalearn] provided an excellent platform where researchers could assess their deep models on a large annotated big-five personality traits dataset. Instead of classifying pre-defined personality categories, common practice uses a finer-grained representation, in which personalities are distributed in a five-dimensional space spanned by the dimensions of Extraversion, Agreeableness, Conscientiousness, Neuroticism and Openness [ponce2016chalearn]. This is advantageous in the sense that personality states can be represented at any level of the aforementioned big-five personality traits. A Deep Bimodal Regression framework based on both video and audio input was utilized in [zhang2016deep] to identify personality. A similar work from Güçlütürk et al. [guccluturk2016deep] introduced a deep audio-visual residual network for multimodal personality trait recognition. In addition, a volumetric convolution and Long Short-Term Memory (LSTM) based network was introduced by Subramaniam et al. [subramaniam2016bi] for learning audio-visual temporal patterns. A pre-trained CNN was employed by Gürpınar et al. [gurpinar2016combining] to extract facial expressions as well as ambient information for personality analysis. For more related work on personality analysis, please refer to recent surveys [junior2018first, escalante2018explaining].
Age Estimation. Age estimation from face images has been gaining popularity since the pioneering work of [geng2007automatic]. Conventional regression methods include, but are not limited to, kernel methods [guo2009human, guo2011simultaneous], hierarchical regression [han2015demographic], randomized trees [montillo2009age] and label distribution learning [geng2013facial]. Recently, end-to-end learning with CNNs has also been widely studied for age estimation. Yi et al. [yi2014age] first proposed a four-layer CNN for age estimation. Niu et al. [niu2016ordinal] reformulated age estimation as an ordinal regression problem solved by end-to-end deep learning; in particular, age estimation in their setting was transformed into a series of binary classification sub-problems. Ranking CNN was introduced in [chen2017using], where each base CNN was trained with ordinal age labels. In [agustsson2017anchored], Anchored Regression Networks were introduced as a smoothed relaxation of a piece-wise linear regressor for age estimation through the combination of multiple linear regressors over soft assignments to anchor points. Li et al. [li2018deep] designed a Deep Cross-Population (DCP) age estimation model with a two-stage training strategy, in which a novel cost-sensitive multi-task loss function was first used to learn transferable aging features by training on the source population; then, a novel order-preserving pairwise loss function was utilized to align the aging features of the two populations. DEX [rothe2018deep] solved age estimation by way of deep classification followed by a softmax expected value refinement. Shen et al. [Shen_2018_CVPR] extended the idea of a randomized forest into deep scenarios and showed remarkable performance for age estimation.
Single Image Super-Resolution. With far-reaching applications in medical imaging, satellite imaging, security and surveillance, single image super-resolution is a classic computer vision problem, which aims to recover a high-resolution (HR) image from a low-resolution (LR) image. Since the introduction of fully convolutional networks for super-resolution [dong2016image], many advanced deep architectures have been proposed. For instance, the Cascaded Sparse Coding Network (CSCN) [wang2015deep] combined the strengths of sparse coding and deep networks. An efficient sub-pixel convolution layer was introduced in [shi2016real] to better upscale the final LR feature maps into the HR output. A PCA-inspired collaborative representation cascade was introduced in [zhang2017collaborative]. A novel residual dense network (RDN) was designed in [zhang2018residual] to fully exploit the hierarchical features from all the convolutional layers; specifically, the authors proposed the residual dense block (RDB) to extract abundant local features via densely connected convolutional layers. A deeply-recursive convolutional network (DRCN) was proposed in [kim2016deeply], which increased the network depth by a recursive layer without introducing new parameters for additional convolutions. A very deep fully convolutional encoding-decoding framework was employed in [mao2016image] to combine convolution and deconvolution. Han et al. [han2018image] reformulated image super-resolution as a single-state recurrent neural network (RNN) with finite unfoldings and further designed a dual-state variant, the Dual-State Recurrent Network (DSRN). Deep Back-Projection Networks (DBPN) [haris2018deep] exploited iterative up- and down-sampling layers to provide an error feedback mechanism for projection errors at each stage. In [zhang2018image], a residual-in-residual (RIR) structure was introduced. The concept of non-local learning was adopted in [zhang2019residual] for image super-resolution. For more research in this direction, please refer to [timofte2017ntire].
3 Proposed Method
3.1 Preliminaries
Before elaborating the proposed regression method, we first briefly present the notation and background knowledge. We assume that we have access to $N$ training samples $\{(\mathbf{x}_i, \mathbf{y}_i)\}_{i=1}^{N}$. The samples are $H \times W \times C$ dimensional: $\mathbf{x}_i \in \mathbb{R}^{H \times W \times C}$, where $H$, $W$ and $C$ denote the height, width and number of channels of the input image, respectively. Our objective is to predict the regression labels $\mathbf{y}_i \in \mathbb{R}^{O}$. We denote a generic data point by $\mathbf{x}$ and its label by $\mathbf{y}$, with $i$ denoting the placeholder for the index wherever necessary. Similarly, we use $I = H \times W \times C$ and $O$ to represent the dimensionality of a generic input data point and its label, respectively. We achieve our goal by learning a mapping function $\mathcal{F}: \mathbb{R}^{I} \rightarrow \mathbb{R}^{O}$.
The learning problem is to use the set $\{(\mathbf{x}_i, \mathbf{y}_i)\}_{i=1}^{N}$ to learn a mapping function $\mathcal{F}$, parameterized by $\theta$, to approximate the labels as accurately as possible:

\[ \theta^{*} = \arg\min_{\theta} \ \mathbb{E}_{(\mathbf{x}, \mathbf{y})} \big[ \ell\big(\mathcal{F}(\mathbf{x}; \theta), \mathbf{y}\big) \big]. \tag{4} \]

In practice, as the data distribution is unknown, Eqn. (4) is usually approximated by its empirical version:

\[ \theta^{*} = \arg\min_{\theta} \ \frac{1}{N} \sum_{i=1}^{N} \ell\big(\mathcal{F}(\mathbf{x}_i; \theta), \mathbf{y}_i\big). \tag{5} \]
To lighten the notation, we omit the input and parameter vectors: without ambiguity, instead of $\mathcal{F}(\mathbf{x}; \theta)$ we simply write $\mathcal{F}$. We use the shorthand expectation operator $\mathbb{E}[\cdot]$ to represent the generalization ability on testing data. The bias-variance decomposition theorem [brown2005managing] states that the regression error of a predictor $\mathcal{F}$ can be decomposed into its bias and variance:

\[ \mathbb{E}\big[(\mathcal{F} - \mathbf{y})^2\big] = \big(\mathbb{E}[\mathcal{F}] - \mathbf{y}\big)^2 + \mathbb{E}\big[(\mathcal{F} - \mathbb{E}[\mathcal{F}])^2\big]. \tag{6} \]
It is a property of the generalization error in which bias and variance have to be balanced against each other for best performance.
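This balance can be checked numerically. A minimal sketch with a synthetic predictor (the bias and noise levels are purely illustrative) shows that the mean square error equals the squared bias plus the variance, up to floating-point error:

```python
import numpy as np

rng = np.random.default_rng(0)
y = 1.0                                            # fixed target value
# predictor with a systematic bias of 0.2 and noise of std 0.3 (illustrative numbers)
F = 1.0 + 0.2 + 0.3 * rng.standard_normal(100_000)

mse = np.mean((F - y) ** 2)                        # overall regression error
bias_sq = (np.mean(F) - y) ** 2                    # squared bias
variance = np.var(F)                               # variance of the predictor
assert abs(mse - (bias_sq + variance)) < 1e-9      # the decomposition holds exactly
```

The identity is exact for sample moments, which makes it a convenient sanity check when instrumenting training code.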
A single model, however, turns out to be far from optimal in practice, as evidenced by several studies, both theoretical [ren2016ensemble, brown2005managing] and empirical [fernandez2014we, zhang2017benchmarking]. Consider the ensemble output $\bar{\mathcal{F}}$ obtained by averaging the individual responses $\mathcal{F}_k$, i.e.,

\[ \bar{\mathcal{F}} = \frac{1}{K} \sum_{k=1}^{K} \mathcal{F}_k. \tag{7} \]
Here we restrict our analysis to the uniform combination case, which is commonly used in practice, although the decompositions presented below generalize to non-uniformly weighted ensembles as well. Posing the ensemble as a single learning unit, its bias-variance decomposition is

\[ \mathbb{E}\big[(\bar{\mathcal{F}} - \mathbf{y})^2\big] = \big(\mathbb{E}[\bar{\mathcal{F}}] - \mathbf{y}\big)^2 + \mathbb{E}\big[(\bar{\mathcal{F}} - \mathbb{E}[\bar{\mathcal{F}}])^2\big]. \tag{8} \]

Considering the ensemble output in Eqn. (7), it is straightforward to show

\[ \mathbb{E}\big[(\bar{\mathcal{F}} - \mathbf{y})^2\big] = \overline{\mathrm{bias}}^2 + \frac{1}{K}\,\overline{\mathrm{var}} + \Big(1 - \frac{1}{K}\Big)\,\overline{\mathrm{cov}}, \tag{9} \]

where $\overline{\mathrm{bias}}$, $\overline{\mathrm{var}}$ and $\overline{\mathrm{cov}}$ denote the averaged bias, variance and covariance of the individual models, with the covariance term $\overline{\mathrm{cov}} = \frac{1}{K(K-1)} \sum_{k} \sum_{l \neq k} \mathbb{E}\big[(\mathcal{F}_k - \mathbb{E}[\mathcal{F}_k])(\mathcal{F}_l - \mathbb{E}[\mathcal{F}_l])\big]$.

The bias-variance-covariance decomposition in Eqn. (9) illustrates that, in addition to the individual bias and variance, the generalization error of an ensemble also depends on the covariance between the individuals.
It is natural to show

\[ \bar{\mathcal{F}} - \mathbf{y} = \frac{1}{K} \sum_{k=1}^{K} (\mathcal{F}_k - \mathbf{y}). \]

Then, by Jensen's inequality, it is easy to show

\[ \mathbb{E}\big[(\bar{\mathcal{F}} - \mathbf{y})^2\big] \le \frac{1}{K} \sum_{k=1}^{K} \mathbb{E}\big[(\mathcal{F}_k - \mathbf{y})^2\big]. \tag{10} \]

Eqn. (10) explains the effect of error correlations in an ensemble model by stating that the quadratic error of the ensemble estimator is guaranteed to be less than or equal to the average quadratic error of the component estimators. This is also in line with the strength-correlation theory [breiman2001random], which advocates learning a set of both accurate and decorrelated models.
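The averaging inequality above can be verified on synthetic data; a minimal sketch (the number of regressors and the noise model are illustrative):

```python
import numpy as np

rng = np.random.default_rng(42)
y = rng.standard_normal(1000)                    # targets
preds = y + rng.standard_normal((8, 1000))       # 8 noisy base regressors
ensemble = preds.mean(axis=0)                    # uniform ensemble combination

err_ens = np.mean((ensemble - y) ** 2)           # quadratic error of the ensemble
err_avg = np.mean((preds - y) ** 2)              # average quadratic error of members
# Jensen's inequality: the ensemble error never exceeds the average member error
assert err_ens <= err_avg + 1e-12
```

With independent noise, the gap between the two errors is large; with perfectly correlated members, the inequality collapses to an equality, which is exactly why decorrelation matters.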
3.2 Deep Negative Correlation Learning
3.2.1 Our Method
Conventional ensemble learning methods such as bagging [breiman1996bagging] and Random Forest [breiman2001random] train multiple models independently. This may not be optimal because, as demonstrated in Eqn. (10), the ensemble error consists of both the individual errors and the interactions within the ensemble. Based on this, we propose a "divide and conquer" deep learning approach by learning a correlation-regularized ensemble on top of deep networks with the following objective:

\[ \ell_k = \frac{1}{2}\big(\mathcal{F}_k - \mathbf{y}\big)^2 - \lambda \big(\mathcal{F}_k - \bar{\mathcal{F}}\big)^2, \tag{11} \]

where $\lambda$ controls the strength of the negative correlation penalty.
More specifically, we consider our mapping function as an ensemble of $K$ predictors as defined in Eqn. (7), where each base predictor $\mathcal{F}_k$ is posed as

\[ \mathcal{F}_k(\mathbf{x}_i) = r_k \circ f_D \circ f_{D-1} \circ \cdots \circ f_1(\mathbf{x}_i), \]

where $k$, $i$ and $D$ stand for the index of individual models, the index of data samples and the depth of the network, respectively. More specifically, each predictor in the ensemble consists of a cascade of feature extractors $f_1, \dots, f_D$ and a regressor $r_k$. Motivated by the recent success of CNNs on visual recognition tasks, each feature extractor $f_d$ is embodied by a typical layer of a CNN. Below we present the details for each task.
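Concretely, the amended per-regressor cost can be sketched in NumPy. This is a minimal sketch assuming the classical NCL penalty form, a squared deviation from the ensemble mean as in [liu2000evolutionary]; the function name and array layout are illustrative:

```python
import numpy as np

def ncl_losses(preds, y, lam=0.5):
    """Per-regressor NCL loss: an accuracy term minus a correlation penalty.

    preds : (K, N) predictions of K base regressors on N samples.
    y     : (N,) regression targets.
    lam   : penalty strength; lam = 0 recovers conventional, independently
            trained ensemble members.
    """
    ens = preds.mean(axis=0, keepdims=True)   # uniform ensemble output
    return 0.5 * (preds - y) ** 2 - lam * (preds - ens) ** 2
```

Minimizing this cost pushes each regressor toward the target while rewarding deviation from the ensemble mean, which is what makes the members negatively correlated rather than merely independent.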
3.2.2 Network Structure
The proposed method can be efficiently encapsulated into existing deep CNNs thanks to their rich feature hierarchies. In our implementation, as illustrated in Fig. 3, the lower levels of feature extractors $f_1, \dots, f_{D-1}$ are shared by all predictors for efficiency. Furthermore, building on the lessons learnt from the subspace idea in ensemble learning [breiman2001random], the highest-level feature extractor $f_D$ outputs a different feature subset for each regressor to insert more diversity. In this study, this is implemented via the well-established "group convolution" strategy [krizhevsky2012imagenet]. Each regressor is optimized by the amended cost function defined in Eqn. (11). Generally speaking, the network specification is problem dependent, and we show that the proposed method is end-to-end trainable and independent of the backbone network architecture.
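The channel-partitioning idea can be illustrated with plain NumPy. This is a minimal sketch assuming, for illustration only, 1x1 convolution regressors applied to disjoint channel subsets of a shared feature map; the names and shapes are hypothetical:

```python
import numpy as np

def grouped_regressors(features, weights):
    """Each regressor sees only its own channel subset of the shared feature map.

    features : (C, H, W) shared top-level feature map.
    weights  : (K, C // K) per-regressor 1x1 kernels over its channel subset.
    Returns  : (K, H, W) one spatial prediction map per regressor.
    """
    K = weights.shape[0]
    subsets = np.split(features, K, axis=0)            # feature subspace per regressor
    return np.stack([np.tensordot(w, f, axes=([0], [0]))
                     for w, f in zip(weights, subsets)])
```

Because the regressors only add K small kernels on top of a shared trunk, the parameter count stays close to that of a single network, in contrast to training K separate CNNs.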
Crowd counting. We employ a deep pretrained VGG16 network for this task and make several modifications. Firstly, the stride of the fourth max-pool layer is set to 1. Secondly, the fifth pooling layer is removed. This provides us with a much larger feature map with richer information. To handle the receptive-field mismatch caused by the removal of the stride in the fourth max-pool layer, we then double the receptive field of the convolutional layers after the fourth max-pool layer by using the technique of holes introduced in [yu2015multi]. We also include another variant of the proposed method called "NCL", which is a shallow network optimized with the same loss function. The details of this network will be elaborated in Section XI.
Personality analysis. We utilize a truncated 20-layer version of the SphereFace model [liu2017sphereface] for personality analysis. We first detect and align faces for each input image with the well-established MTCNN [zhang2016joint]. As we are dealing with videos, in order to speed up training and reduce the risk of over-fitting, we take a similar approach as in [wang2016temporal] and first sparsely sample 10 frames from each video in a randomized manner. Average pooling is further used to aggregate the multiple results for the same video.
Age estimation. For age estimation, we use the network backbone of deep forest [Shen_2018_CVPR]. It reformulates the split nodes of a decision forest as a fully connected layer of a CNN and learns both split nodes and leaf nodes in an iterative manner. More specifically, by fixing the leaf nodes, the split nodes as well as the CNN parameters are optimized by back-propagation. Then, by fixing the split nodes, the leaf nodes are optimized by iterating a step-size free and fast converging update rule derived from Variational Bounding. Instead of using this iterative strategy, we use the proposed NCL loss in each node to make them both accurate and diversified.
Image super-resolution. For image super-resolution, we choose the state-of-the-art DRRN [tai2017image] as our network backbone and change the L2 loss into the proposed NCL loss. More specifically, an enhanced residual unit structure is recursively learned in a recursive block, and several recursive blocks are stacked to learn the residual image between the HR and LR images. The residual image is then added to the input LR image from a global identity branch to estimate the HR image.
Eqn. (11) can be regarded as a smoothed version of Eqn. (10) that improves the generalization ability of the ensemble models. Please note that the optimal value of $\lambda$ may not necessarily be 0.5 because of the discrepancy between the training and testing data [brown2005managing]. By setting $\lambda = 0$, we recover conventional (non-boosting) ensemble learning, where each model is optimized independently. It is straightforward to show that the first part of Eqn. (11) corresponds to the bias plus an extra term $(\mathcal{F}_k - \bar{\mathcal{F}})^2$, while the second part accounts for the variance, the covariance and the same extra term. Since the extra term appears in both parts, it cancels out when they are combined by subtraction, as done in Eqn. (11). Thus, by introducing the second part of Eqn. (11), we aim at achieving better "diversity" with negatively correlated base models, balancing the bias, variance and ensemble covariance components to reduce the overall mean square error (MSE).
To demonstrate this, consider the scenario in Fig. 2. We use a regression ensemble of $K$ regressors with ground truth $y$. Each curve in Fig. 2 illustrates the evolution of one regressor $\mathcal{F}_k$ when trained with gradient descent, i.e., $\mathcal{F}_k^{t+1} = \mathcal{F}_k^{t} - \eta \,\partial \ell_k / \partial \mathcal{F}_k^{t}$, where $\eta$ and $\ell_k$ stand for the learning rate and the mean-square loss function, respectively, $k$ is the index of individual models in the ensemble and $t$ stands for the iteration index. Although both conventional ensemble learning (Fig. 2(a)) and NCL (Fig. 2(b)) may lead to correct estimations by simple model averaging, NCL results in much more diversified individual models, which makes error cancellation possible on testing data.
For generalization, consider the artificial spirals dataset in Fig. 1(a), on which an ensemble of three single-hidden-layer feed-forward networks (SLFNs) is trained. The ensemble is then evaluated on data samples densely sampled on the 2D plane. The first row of Fig. 1 shows that the NCL ensemble leads to more diversified SLFNs, compared with conventional ensemble learning as illustrated in the second row of Fig. 1, thus making the resulting ensemble generalize well on testing data. Creating diverse sets of models has been extensively studied, both theoretically [brown2005managing, brown2005diversity, ren2016ensemble, dietterich2000ensemble, minku2009impact, lee2016stochastic, alhamdoosh2014fast, zhou2002ensembling] and empirically [fernandez2014we, hansen1990neural]. More specifically, Breiman [breiman2001random] derived a VC-type bound for the generalization ability of ensemble models, which advocates both accurate and decorrelated individual models. In addition, our method also differs from the classical work of [liu2000evolutionary], which trains multiple shallow networks.
3.2.3 Connection with the Rademacher Complexity
We now show a bound on the Rademacher complexity [bartlett2002rademacher] of the proposed deep negative correlation learning. Firstly, we make no distinction between convolution and fully-connected (FC) layers, because FC layers can easily be transformed into convolution layers with proper kernel size and padding values.
Definition 1 (Rademacher Complexity). For a dataset $S = \{\mathbf{x}_1, \dots, \mathbf{x}_N\}$ generated by a distribution $\mathcal{D}$ on a set $\mathcal{X}$ and a real-valued function class $\mathcal{F}$ on $\mathcal{X}$, the empirical Rademacher complexity of $\mathcal{F}$ is the random variable

\[ \hat{R}_S(\mathcal{F}) = \mathbb{E}_{\sigma}\left[ \sup_{f \in \mathcal{F}} \frac{2}{N} \sum_{i=1}^{N} \sigma_i f(\mathbf{x}_i) \right], \]

where $\sigma_1, \dots, \sigma_N$ are usually referred to as Rademacher variables and are independent random variables uniformly chosen from $\{-1, +1\}$. The Rademacher complexity of $\mathcal{F}$ is $R_N(\mathcal{F}) = \mathbb{E}_S\big[\hat{R}_S(\mathcal{F})\big]$.
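Since the supremum in the definition runs over the function class, for a finite class it can be estimated by Monte Carlo over random sign vectors. A minimal sketch, assuming the $2/N$ normalization convention and representing the class by the matrix of its values on the sample (both assumptions, for illustration):

```python
import numpy as np

def empirical_rademacher(preds, n_draws=2000, seed=0):
    """Monte Carlo estimate of the empirical Rademacher complexity.

    preds : (M, N) values f(x_1), ..., f(x_N) for each of M functions in a
            finite class, evaluated on a fixed sample of size N.
    """
    rng = np.random.default_rng(seed)
    M, N = preds.shape
    sigma = rng.choice([-1.0, 1.0], size=(n_draws, N))   # Rademacher variables
    # sup over the class of the correlation with the random signs, averaged over draws
    return np.mean(np.max(sigma @ preds.T, axis=1)) * 2.0 / N
```

A class containing only the zero function has complexity zero (it cannot fit random noise at all), while richer classes score higher, which matches the intuition given in Remark 1.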
The empirical Rademacher complexity is widely regarded as a proxy for the generalization ability, based on the following theorem:

Theorem 1 (Koltchinskii and Panchenko [koltchinskii2002empirical]). Fix $\delta \in (0, 1)$ and let $\mathcal{F}$ be a class of functions mapping from $\mathcal{X}$ to $[0, 1]$. Let $S = \{\mathbf{x}_1, \dots, \mathbf{x}_N\}$ be drawn independently according to a probability distribution $\mathcal{D}$. Then, with probability at least $1 - \delta$ over random draws of samples of size $N$, every $f \in \mathcal{F}$ satisfies:

\[ \mathbb{E}\big[f(\mathbf{x})\big] \le \frac{1}{N} \sum_{i=1}^{N} f(\mathbf{x}_i) + \hat{R}_S(\mathcal{F}) + 3\sqrt{\frac{\ln(2/\delta)}{2N}}. \]

In addition, we have
Lemma 1. For a function class $\mathcal{F}$ and a function $h: \mathbb{R} \rightarrow \mathbb{R}$, let $h \circ \mathcal{F} = \{h \circ f : f \in \mathcal{F}\}$. If $h$ is $L$-Lipschitz continuous, i.e., $|h(a) - h(b)| \le L\,|a - b|$ for all $a, b$, then for any sample $S$:

\[ \hat{R}_S(h \circ \mathcal{F}) \le L\, \hat{R}_S(\mathcal{F}). \]

Proof. We provide the proof for $N = 1$; the general case works iteratively.
Based on Lemma 1, we have the following conclusion:
Lemma 2. Let , , . Assume that . Then for any sample of size :
Proof. From Lemma 1 and the Lipschitz continuity of the composed mapping, we have the stated bound. This completes the proof.
Furthermore, by combining the function class defined in Lemma 2 with Theorem 1, we have

Lemma 3. Let $S = \{\mathbf{x}_1, \dots, \mathbf{x}_N\}$ be a dataset generated by a distribution $\mathcal{D}$ on a set $\mathcal{X}$ and let $\{y_1, \dots, y_N\}$ be the corresponding labels. For a function class $\mathcal{F}$ which maps the data to $[0, 1]$, define the commonly used mean square error as $L(f) = \mathbb{E}\big[(f(\mathbf{x}) - y)^2\big]$ and the empirical mean square error as $\hat{L}(f) = \frac{1}{N}\sum_{i=1}^{N} (f(\mathbf{x}_i) - y_i)^2$. Assume that $y \in [0, 1]$. Then, for a fixed $\delta \in (0, 1)$, with probability at least $1 - \delta$ over random draws of samples of size $N$, every $f \in \mathcal{F}$ satisfies
We now relate the empirical Rademacher complexity of the proposed method on the training set to that of a standard network:
Proposition 1. Denote by $\hat{R}_S^{g}$ the empirical Rademacher complexity of the group-convolution based method and by $\hat{R}_S^{c}$ that of the conventional method. Then
The $*$ operation in Eqn. (22) stands for the convolution operator, applied to the feature subsets of the feature maps. More specifically, we divide the feature maps along the channel axis into $K$ subsets, and the same procedure is applied to the kernel filters.
Remark 1. The empirical Rademacher complexity measures the ability of functions from a function class (when applied to a fixed set $S$) to fit random noise. It is a more modern notion of complexity that is distribution dependent and defined for any class of real-valued functions. On the one hand, by setting $K > 1$, our method works in a "divide and conquer" manner and the overall Rademacher complexity is reduced by a factor of $K$, which, intuitively speaking, makes the function easier to learn. On the other hand, $K$ may also affect the empirical error term. For instance, setting an extremely large value of $K$ may lead to a larger empirical error, because much less input feature is provided for each base predictor.
4 Experiments
In this section, we investigate the feasibility of the proposed method on four regression tasks: crowd counting, personality analysis, age estimation and single image super-resolution. The proposed method is implemented in Caffe [jia2014caffe]. In order to further understand the merits of the proposed method, we also include some variants of it. More specifically, for each task, we replace the proposed loss function with the L2, SmoothL1 and Tukey losses, referred to as "L2", "SmoothL1" and "Tukey", respectively. For the SmoothL1 loss, instead of using a fixed threshold in Eqn. (1), we treat it as another hyper-parameter and optimize it on the training data. We do not compare the proposed method explicitly with naive implementations of multiple-CNN ensembles, as their computational cost is much larger and they are thus of less interest to us. We highlight the best results in each case in red; the second and third best methods are highlighted in green and blue, respectively. As different evaluation protocols may be utilized in different applications, we put a ↑ after each metric to indicate the cases where a larger value is better. Similarly, ↓ is used in cases where a smaller value indicates better performance.
4.1 Crowd Counting
For crowd counting, we evaluate the proposed methods on three benchmark datasets: the UCF_CC_50 dataset [idrees2013multi], the Shanghaitech dataset [zhang2016single] and the WorldExpo’10 dataset [zhang2015cross]. The proposed networks are trained using Stochastic Gradient Descent with a mini-batch size of 1 and a fixed momentum value of 0.9. Weight decay with a fixed value of 0.0005 is used as a regularizer. We use a larger fixed learning rate in the last convolution layer of our crowd model to enlarge the gradient signal for effective parameter updating, and a relatively smaller learning rate in the other layers. We set the ensemble size to 64. More specifically, we use a convolution layer as the regressor on each output feature-map subset to get the final crowd density map. Each regressor is sparsely connected to a small portion of feature maps from the last convolutional layer of the VGG16 network, implemented via the well-established “group convolution” strategy [krizhevsky2012imagenet, wang2016stct]. We also include another variant of the proposed method called “NCL”, which is a shallower network optimized with the same loss function; the details of this network are elaborated in the discussion below.
The widely used mean absolute error (MAE) and the root mean squared error (RMSE) are adopted to evaluate the performance of different methods. They are defined as follows:

$$\mathrm{MAE} = \frac{1}{N}\sum_{i=1}^{N}\left|z_i - \hat{z}_i\right|, \qquad \mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(z_i - \hat{z}_i\right)^2},$$

where $N$ represents the total number of images in the testing datasets, and $z_i$ and $\hat{z}_i$ are the ground truth and the estimated count, respectively, for the $i$-th image.
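The two metrics above can be computed directly from the per-image counts. A short sketch, with hypothetical ground-truth and estimated counts:

```python
import numpy as np

def mae(y_true, y_pred):
    """Mean absolute error over per-image counts."""
    return float(np.mean(np.abs(np.asarray(y_true, float) - np.asarray(y_pred, float))))

def rmse(y_true, y_pred):
    """Root mean squared error over per-image counts."""
    d = np.asarray(y_true, float) - np.asarray(y_pred, float)
    return float(np.sqrt(np.mean(d ** 2)))

counts_gt  = [120, 80, 45]   # illustrative ground-truth counts
counts_est = [110, 95, 40]   # illustrative predictions
print(mae(counts_gt, counts_est))   # (10 + 15 + 5) / 3 = 10.0
print(rmse(counts_gt, counts_est))  # sqrt((100 + 225 + 25) / 3) ≈ 10.80
```

RMSE penalizes large per-image errors more heavily than MAE, which is why the two can rank methods differently.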
Table I: Comparison on the UCF_CC_50 dataset.

| Method | CNN-based | MAE↓ | RMSE↓ |
|---|---|---|---|
| Rodriguez et al. [rodriguez2011density] | ✗ | 655.7 | 697.8 |
| Lempitsky et al. [lempitsky2010learning] | ✗ | 493.4 | 487.1 |
| Idrees et al. [idrees2013multi] | ✗ | 419.5 | 541.6 |
| Zhang et al. [zhang2015cross] | ✓ | 467.0 | 498.5 |
| Zhang et al. [zhang2016single] | ✓ | 377.6 | 509.1 |
| Zeng et al. [zeng2017multi] | ✓ | 363.7 | 468.4 |
| Marsden et al. [MarsdenMLO16a] | ✓ | 338.6 | 424.5 |
| Onoro et al. [onoro2016towards] | ✓ | 333.7 | 425.2 |
| Sam et al. [sam2017switching] | ✓ | 318.1 | 439.2 |
| Walach et al. [walach2016learning] | ✓ | 364.2 | - |
UCF_CC_50 dataset. The challenging UCF_CC_50 dataset [idrees2013multi] contains 50 images that are randomly collected from the Internet. The number of heads per image varies over a very wide range, and this large variation in head count, together with the small amount of training data, makes accurate counting on UCF_CC_50 difficult. We follow the standard evaluation protocol by splitting the dataset randomly into five parts, each containing ten images, and employ five-fold cross-validation to evaluate the performance. Since perspective maps are not provided, we generate the ground-truth density maps using the method of Zhang et al. [zhang2016single].
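The five-fold protocol on the 50 images can be sketched as follows. The random seed and the index bookkeeping are illustrative assumptions, not the paper's exact split:

```python
import random

def five_fold_splits(num_images=50, seed=0):
    """Randomly split image indices into 5 parts of 10 images each and
    yield (train, test) index lists for five-fold cross-validation."""
    idx = list(range(num_images))
    random.Random(seed).shuffle(idx)
    folds = [idx[i * 10:(i + 1) * 10] for i in range(5)]
    for k in range(5):
        test = folds[k]
        train = [i for j, f in enumerate(folds) if j != k for i in f]
        yield train, test

for train, test in five_fold_splits():
    assert len(train) == 40 and len(test) == 10
```

Each image appears in the test set exactly once, and the reported numbers are averaged over the five runs.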
We compare our method on this dataset with the state-of-the-art methods. In [rodriguez2011density, lempitsky2010learning, idrees2013multi], hand-crafted features are used to regress the density map from the input image. Several CNN-based methods [zhang2015cross, boominathan2016crowdnet, zhang2016single, zeng2017multi, MarsdenMLO16a, onoro2016towards, sam2017switching] are also considered here due to their superior performance on this dataset. Table I summarizes the detailed results. Firstly, it is obvious that most deep learning methods outperform hand-crafted features significantly. In [boominathan2016crowdnet], Boominathan et al. proposed to employ a shallow network to assist the training process of the deep VGG network. With the proposed deep negative correlation learning strategy, it is also interesting to see that: (1) both our deep (“DNCL”) and shallow (“NCL”) networks work well; (2) deep networks (“DNCL”) are better than shallower networks (“NCL”), as expected, although the shallower network still gives competitive results and may be advantageous in resource-constrained scenarios as it is computationally cheaper; (3) the deeper version of the proposed method outperforms all others on this dataset; and (4) the proposed method performs favorably against a naive multiple-CNN ensemble [walach2016learning].
Table II: Comparison on the Shanghaitech dataset.

| Method | Part_A MAE↓ | Part_A RMSE↓ | Part_B MAE↓ | Part_B RMSE↓ |
|---|---|---|---|---|
| Zhang et al. [zhang2015cross] | 181.8 | 277.7 | 32.0 | 49.8 |
| Zhang et al. [zhang2016single] | 110.2 | 173.2 | 26.4 | 41.3 |
| Sam et al. [sam2017switching] | 90.4 | 135.0 | 21.6 | 33.4 |
| Liu et al. [liu2018decidenet] | - | - | 20.8 | 29.4 |
Table III: MAE↓ on the five test scenes of the WorldExpo’10 dataset and the average over scenes.

| Method | Scene 1 | Scene 2 | Scene 3 | Scene 4 | Scene 5 | Avg. |
|---|---|---|---|---|---|---|
| Zhang et al. [zhang2015cross] | 9.8 | 14.1 | 14.3 | 22.2 | 3.7 | 12.9 |
| Zhang et al. [zhang2016single] | 3.4 | 20.6 | 12.9 | 13.0 | 8.1 | 11.6 |
| Sam et al. [sam2017switching] | 4.4 | 15.7 | 10.0 | 11.0 | 5.9 | 9.4 |
| Liu et al. [liu2018decidenet] | 2.0 | 13.1 | 8.9 | 17.4 | 4.8 | 9.3 |
Shanghaitech dataset. The Shanghaitech dataset [zhang2016single] is a large-scale crowd counting dataset, which contains 1198 annotated images with a total of 330,165 persons. This dataset is the largest one in the literature in terms of the number of annotated pedestrians. It consists of two parts: Part_A, consisting of 482 images that are randomly collected from the Internet, and Part_B, including 716 images that are taken from the busy streets of Shanghai. Each part is further divided into training and testing subsets. The crowd density varies significantly between the subsets, making it difficult to estimate the number of pedestrians.
We compare our method with six existing methods on the Shanghaitech dataset. The detailed results for each method are listed in Table II. Again, we see that all deep learning methods outperform hand-crafted features significantly. The shallow model in [zhang2016single] employs a much wider structure through a multi-column design and performs better than the shallower CNN model in [zhang2015cross] in both cases. The deeper version of the proposed method performs consistently better than the shallower one, as expected, because it employs a much deeper pre-trained model. Moreover, it is interesting to see that with deep negative correlation learning, even a relatively shallow network structure is on par with the much more complicated state-of-the-art switching strategy [sam2017switching]. Finally, our deep structure leads to the best performance in terms of MAE on Part_A and RMSE on Part_B.
WorldExpo’10 dataset The WorldExpo’10 dataset [zhang2015cross] is a large-scale and cross-scene crowd counting dataset. It contains 1132 annotated sequences which are captured by 108 independent cameras, all from Shanghai 2010 WorldExpo’10. This dataset consists of 3980 frames with a total of 199,923 labeled pedestrians, which are annotated at the centers of their heads. Five different regions of interest (ROI) and the perspective maps are provided for the test scenes.
We follow the standard evaluation protocol and use all the training frames to learn our model. The quantitative results are given in Table III. Again, we observe that learned representations are more robust than hand-crafted features. Even without using the perspective information, our results are comparable with those of another deep learning method [zhang2015cross], which used perspective normalization to crop patches covering a fixed area in square meters with 0.5 overlap at testing time. The deeper version of our proposed method outperforms all others in terms of average performance.
4.2 Personality Analysis
For personality analysis, the ensemble size is set to 16. We use the ChaLearn personality dataset [ponce2016chalearn], which consists of short video clips totaling 41.6 hours (4.5M frames). In this dataset, people face and speak to the camera. Each video is annotated with the Big Five personality traits (Extraversion, Agreeableness, Conscientiousness, Neuroticism and Openness) [ponce2016chalearn]; the annotation was done via Amazon Mechanical Turk. For the evaluation, we follow the standard protocol of the ECCV 2016 ChaLearn First Impression Challenge [ponce2016chalearn] and use the mean accuracy $\mathcal{A}$ and the coefficient of determination $R^2$, defined as follows:

$$\mathcal{A} = 1 - \frac{1}{N}\sum_{i=1}^{N}\left|t_i - p_i\right|, \qquad R^2 = 1 - \frac{\sum_{i=1}^{N}\left(t_i - p_i\right)^2}{\sum_{i=1}^{N}\left(t_i - \bar{t}\right)^2},$$

where $N$ denotes the total number of testing samples, $t_i$ the ground truth, $p_i$ the prediction, and $\bar{t}$ the average value of the ground truth.
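Both evaluation quantities are straightforward to compute from the trait scores; a sketch with illustrative values:

```python
import numpy as np

def mean_accuracy(t, p):
    """ChaLearn-style mean accuracy: 1 minus the mean absolute error."""
    t, p = np.asarray(t, float), np.asarray(p, float)
    return float(1.0 - np.mean(np.abs(t - p)))

def r_squared(t, p):
    """Coefficient of determination R^2."""
    t, p = np.asarray(t, float), np.asarray(p, float)
    ss_res = np.sum((t - p) ** 2)
    ss_tot = np.sum((t - t.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)

t = [0.5, 0.7, 0.2, 0.9]   # hypothetical ground-truth trait scores
p = [0.6, 0.6, 0.3, 0.8]   # hypothetical predictions
print(mean_accuracy(t, p))  # every error is 0.1, so 1 - 0.1 = 0.9
print(r_squared(t, p))
```

Note that $R^2$ compares the residual error against a constant predictor (the ground-truth mean), so it can be negative for very poor predictions.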
Footnotes for Table IV: winner of the ChaLearn First Impressions Challenge (ECCV 2016); winner of the ChaLearn First Impressions Challenge (ICPR 2016).
We train the whole network end-to-end. For each mini-batch, we randomly select 10 videos, generating a total batch size of 100 frames. We train the network for a fixed number of iterations and decrease the learning rate by a factor of 10 at scheduled iterations.
The quantitative comparison between the proposed method and other state-of-the-art works on personality recognition is shown in Table IV. Moreover, Table V compares the details of several of the latest personality recognition methods. In contrast to other approaches, ours can be trained end-to-end using only one pre-trained model. Moreover, unlike most methods, which fuse both acoustic and visual cues, our proposed method uses only video frames as input. The teams from NJU-LAMDA to BU-NKU-v1 are the top five participants in the ChaLearn Challenge on First Impressions [ponce2016chalearn]. Note that BU-NKU was the only team not using audio in the challenge, and their predictions were comparatively poor. After adding the acoustic cues, the same team won the ChaLearn Challenge on First Impressions [ponce2016chalearn]. Importantly, our method considers only the visual stream. Firstly, we observe that deeply learned representations transfer well between face verification and personality analysis. This can be verified by the last four results in Table IV: by utilizing a state-of-the-art face verification network and good practices in video classification [wang2016temporal], those methods outperform the previous state of the art. Secondly, the L2, SmoothL1 and Tukey losses all lead to comparably good results for this task. Finally, the proposed method outperforms all compared methods on both metrics in all scenarios.
| Method | MAE↓ (MORPH) | CS↑ (MORPH, %) | MAE↓ (FG-NET) | CS↑ (FG-NET, %) |
|---|---|---|---|---|
| Human workers [han2015demographic] | 6.3 | 51.0 | 4.70 | 69.5 |
| Rothe et al. [rothe2016some] | 3.45 | - | 5.01 | - |
4.3 Age Estimation
We use the same training and evaluation protocols as in [Shen_2018_CVPR]. More specifically, we first use a standard face detector [viola2001rapid] to detect faces and further localize the facial landmarks by AAM [cootes2001active]. The ensemble size is 5. After that, we perform face alignment to guarantee that all eyeballs stay at the same position in the image. We further augment the training data by the following strategies: (1) cropping images with random offsets, (2) adding Gaussian noise to the original images, and (3) randomly flipping images from left to right. We compare the proposed method with various state-of-the-art methods on two standard benchmarks: MORPH [ricanek2006morph] and FG-NET [panis2016overview].
As for the evaluation metrics, we follow existing methods and choose the Mean Absolute Error (MAE) as well as the Cumulative Score (CS). The CS is calculated as $\mathrm{CS}(l) = \frac{K_l}{K} \times 100\%$, where $K$ is the total number of testing images and $K_l$ is the number of testing facial images whose absolute error between the estimated age and the ground-truth age is not greater than $l$ years. Here, we set the same error level $l = 5$ as in [Shen_2018_CVPR].
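The cumulative score is simple to implement; a sketch with hypothetical ages:

```python
def cumulative_score(ages_true, ages_pred, error_level=5):
    """CS(l): percentage of test images whose absolute age error
    does not exceed `error_level` years."""
    hits = sum(1 for t, p in zip(ages_true, ages_pred)
               if abs(t - p) <= error_level)
    return 100.0 * hits / len(ages_true)

gt  = [23, 45, 31, 60, 18]   # illustrative ground-truth ages
est = [25, 52, 33, 58, 30]   # illustrative predictions
print(cumulative_score(gt, est))  # errors 2, 7, 2, 2, 12 -> 3 of 5 -> 60.0
```

Unlike MAE, CS is insensitive to the magnitude of errors beyond the chosen level, so the two metrics complement each other.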
We first summarize our results on the MORPH dataset in Table VI. MORPH contains more than 55,000 images of about 13,000 people of different races. We perform our evaluation under the first setting (Setting I) [Shen_2018_CVPR], which selects 5,492 images of people of Caucasian descent from the original MORPH dataset to reduce cross-ethnicity effects. In this setting, the 5,492 images are randomly partitioned into two subsets, one for training and the other for testing. The random partition is repeated 5 times, and the final performance is averaged over these 5 different partitions. Since the DRF method [Shen_2018_CVPR] assumed each leaf node follows a normal distribution, minimizing its negative log-likelihood loss was equivalent to minimizing the L2 loss of each node (in fact, the released implementation of DRF [Shen_2018_CVPR] used the L2 loss to avoid observing negative loss values during training). As can be seen from Table VI, the proposed method achieves the best performance on this dataset and outperforms the current state of the art by a clear margin.
We then conduct experiments on FG-NET [panis2016overview], which contains 1002 facial images of 82 individuals. Each individual in FG-NET has more than 10 photos taken at different ages. The FG-NET data is challenging because the images exhibit large variations in lighting conditions, poses and expressions. We follow the protocol of [Shen_2018_CVPR] and perform “leave one out” cross-validation on this dataset. The quantitative comparisons on the FG-NET dataset are shown in Table VI. As can be seen, our method achieves better results (MAE: 3.71 vs. 3.85 and CS: 81.8 vs. 80.6) than DRF [Shen_2018_CVPR].
4.4 Image Super-resolution
We follow exactly the same training and evaluation protocol as prior work. More specifically, following [kim2016accurate, schulter2015fast], a training set of 291 images, where 91 images are from Yang et al. [yang2010image] and the other 200 images are from the Berkeley Segmentation Dataset [martin2001database], is used for training. The method is evaluated on the Set5 [bevilacqua2012low], Set14 [zeyde2010single], BSD100 [martin2001database] and Urban100 [huang2015single] datasets, which contain 5, 14, 100 and 100 images, respectively. The initial learning rate is set to 0.1 and then halved every 10 epochs. Since a large learning rate is used in our work, we adopt adjustable gradient clipping [tai2017image] to boost the convergence rate while suppressing exploding gradients. Peak Signal-to-Noise Ratio (PSNR), Structural SIMilarity (SSIM) [wang2004image] and the Information Fidelity Criterion (IFC) [sheikh2005information] are used for the quantitative evaluations.
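Of the three metrics, PSNR is the simplest to state explicitly: it is the ratio between the peak signal power and the mean squared reconstruction error, on a log scale. A minimal sketch (the toy images are illustrative; SSIM and IFC are more involved and omitted here):

```python
import numpy as np

def psnr(ref, test, peak=255.0):
    """Peak Signal-to-Noise Ratio in dB between two images of equal shape."""
    ref = np.asarray(ref, dtype=float)
    test = np.asarray(test, dtype=float)
    mse = np.mean((ref - test) ** 2)
    return float(10.0 * np.log10(peak ** 2 / mse))

a = np.full((4, 4), 100.0)
b = np.full((4, 4), 110.0)   # constant error of 10 -> MSE = 100
print(psnr(a, b))            # 10 * log10(255^2 / 100) ≈ 28.13 dB
```

Higher PSNR means a smaller pixel-wise error; a gain of even a few tenths of a dB is considered meaningful on these benchmarks.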
Table VII summarizes the main results of both PSNR (in dB) and SSIM on the four testing sets, together with the corresponding IFC results. Firstly, we find that image super-resolution is extremely challenging, as most state-of-the-art approaches perform comparably well. However, it is still clear that the proposed method outperforms the original L2 loss in most cases, leading to even better results than the more recent dual-state recurrent network [han2018image]. In addition, other loss functions such as the SmoothL1 and Tukey losses are both outperformed by the L2 loss by a large margin. Qualitative comparisons among DRRN [tai2017image], SmoothL1, Tukey and our proposed method are illustrated in Fig. 4. As can be seen, our method produces relatively sharper edges on repeated patterns, while the other methods may give blurry results.
After demonstrating the superiority of the proposed method by extensively comparing them with many state-of-the-art methods on multiple datasets, we now provide more discussions to shed light upon their rationale and sensitivities with some hyper-parameters.
NCL or Conventional Ensemble Learning? In Table VIII, we compare the performance of the proposed method with conventional ensemble learning, choosing crowd counting as a study case. It is widely accepted that training deep networks like VGG remains challenging. In [boominathan2016crowdnet], a shallow network was proposed to assist the training and improve the performance of the deep VGG network. When compared with the results achieved on the UCF_CC_50 dataset by the other methods shown in Table I, our implementation of a conventional ensemble method using a single VGG network leads to much improved results. However, it still over-fits severely compared with other state-of-the-art methods. More specifically, it was outperformed by recent methods such as the multi-column structure [zhang2016single], the multi-scale Hydra method [onoro2016towards], and the advanced switching strategy [sam2017switching]. In contrast, the proposed method leads to much improved performance compared with this baseline in all cases and outperforms all the aforementioned methods. As illustrated in Fig. 2, the NCL mechanism used here encourages diversity in the ensemble and is thus more likely to allow error canceling. For more results on other datasets, please refer to Table IX and Table X.
The learning objective function in Eqn. (11) is also in line with Breiman’s strength-correlation theory [breiman2001random] on the VC-type bound for the generalization ability of ensemble models, which advocates both accurate and decorrelated individual models. It is also well appreciated that the individual models should exhibit different patterns of generalization; a simple intuitive explanation is that a million identical estimators are obviously no better than a single one.
DNCL for other loss functions. It is widely accepted that encouraging diversity can produce better ensembles. Although DNCL is derived under the commonly used L2 loss function, here we show that naively applying this idea to other loss functions can also be beneficial. To this end, we replace the first part in Eqn. (11) with other loss functions while keeping the second part unchanged to make the ensemble negatively correlated. We report the detailed results in Table IX and Table X. Firstly, the results show that the proposed ensemble strategy still generates better results than a single model for each loss function, though it is outperformed by the proposed method. Secondly, one can observe that in some cases NCL with other loss functions is outperformed by the conventional ensemble. This indicates that other diversity measures [tang2006analysis] may be better suited when other loss functions are used.
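The construction described here, a per-model fit term plus an unchanged diversity penalty, can be sketched as below. This is a hedged illustration of the negative correlation idea, not the paper's exact Eqn. (11): the penalty weight `lam`, the `base` switch, and the toy data are all assumptions.

```python
import numpy as np

def ncl_loss(preds, target, lam=0.1, base="l2"):
    """Negative correlation learning objective (sketch): the base empirical
    loss of each model minus a diversity penalty lam * mean_k (f_k - f_bar)^2.
    `preds` has shape (K, N): K base models, N samples."""
    preds = np.asarray(preds, float)
    f_bar = preds.mean(axis=0)                    # ensemble prediction
    if base == "l2":
        fit = np.mean((preds - target) ** 2)
    else:                                         # e.g. an L1-style base loss
        fit = np.mean(np.abs(preds - target))
    diversity = np.mean((preds - f_bar) ** 2)     # disagreement among models
    return float(fit - lam * diversity)

preds = np.array([[1.0, 2.0], [3.0, 4.0]])        # K=2 models, N=2 samples
target = np.array([2.0, 3.0])
# lam=0 reduces to independent training of each regressor
assert ncl_loss(preds, target, lam=0.0) == np.mean((preds - target) ** 2)
print(ncl_loss(preds, target, lam=0.5))  # 1.0 - 0.5 * 1.0 = 0.5
```

Minimizing the subtracted penalty rewards models that disagree around the ensemble mean, which is exactly what makes their errors more likely to cancel.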
Effect of the diversity weight and the ensemble size. The diversity weight controls the correlation between the models in the ensemble. On the one hand, setting it to zero is equivalent to training each regressor independently. On the other hand, an overly large value overemphasizes the effect of diversity and may lead to poor individual regressors. We empirically find that a relatively small value usually leads to satisfactory results. The ensemble size stands for the number of base regressors in the ensemble. Theoretically speaking, conventional ensemble learning methods such as bagging and decision-tree ensembles require large ensemble sizes [rodriguez2006rotation, fernandez2014we, zhang2017benchmarking] to perform well. However, under the constraint of using the same number of parameters, increasing the ensemble size passes less input information to each base model, which may lead to worse performance. We empirically find that the proposed method works well even with a relatively small ensemble size. For crowd counting, an ensemble size between 32 and 64 generates satisfactory results, and it is set to 64 by default, as no significant improvement is observed with more regressors. A more detailed report on the effect of the ensemble size is provided in Table XI. Similarly, the performance of personality analysis and image super-resolution is stable when the ensemble size is within [8, 16] and [16, 32], respectively (we use 16 for personality analysis, as noted in Section 4.2). For age estimation, we use the same ensemble size of 5 as in the original paper [Shen_2018_CVPR].
Independent of the network backbone. While tremendous progress has been achieved in the vision community by aggressively exploring deeper [he2016deep] or wider architectures [Zagoruyko2016WRN], specially designed network architectures [shi2016rank, liu_tpami_2018], or heuristic engineering tricks [he_bag] with the standard “convolution + pooling” recipe, we want to emphasize that the proposed method is independent of the network backbone and largely complementary to those strategies. To show this, we first observe that combining the proposed NCL learning strategy with a “special-purpose” network for each task leads to improved results. In order to further demonstrate the independence between the proposed method and the network backbone, we choose crowd counting as an example and train a relatively shallow model named NCL, constructed by stacking several Multi-Scale Blobs as shown in Fig. 5, aiming to increase the depth and expand the width of the crowd model within a single network. The Multi-Scale Blob (MSB) is an Inception-like module that enhances feature diversity by combining feature maps from different network branches; more specifically, it contains multiple filters with different kernel sizes. This also makes the network more robust to scale changes of the crowds in the images.
Motivated by VGGNet [simonyan2014very], to make the model more discriminative, we realize the larger kernels by stacking two and three small convolutional layers, respectively. In our adopted network, the first convolution layer is followed by a max-pooling layer. After that, we stack two MSB modules as demonstrated in Fig. 5, where the first MSB module is followed by a max-pooling layer. The numbers of feature maps of each convolution layer in these two MSB modules are 24 and 32, respectively. Finally, we use the same type of convolution layer on each of the feature-map subsets as a regressor to get the final crowd density map.
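The multi-branch structure of an MSB module can be sketched as below. This is only an illustration of the Inception-like pattern: the kernel sizes, branch widths, and random filters are assumptions, not the paper's trained configuration.

```python
import numpy as np

def conv2d_same(img, kernel):
    """Naive single-channel 2-D filtering (cross-correlation form, as in
    CNNs) with zero 'same' padding, so spatial dimensions are preserved."""
    kh, kw = kernel.shape
    ph, pw = kh // 2, kw // 2
    padded = np.pad(img, ((ph, ph), (pw, pw)))
    out = np.zeros_like(img, dtype=float)
    H, W = img.shape
    for i in range(H):
        for j in range(W):
            out[i, j] = np.sum(padded[i:i + kh, j:j + kw] * kernel)
    return out

def msb_block(img, kernel_sizes=(1, 3, 5), maps_per_branch=2, seed=0):
    """Inception-like Multi-Scale Blob (sketch): parallel branches with
    different kernel sizes, outputs concatenated along the channel axis."""
    rng = np.random.default_rng(seed)
    branches = []
    for k in kernel_sizes:
        for _ in range(maps_per_branch):
            branches.append(conv2d_same(img, rng.standard_normal((k, k))))
    return np.stack(branches)   # (len(kernel_sizes) * maps_per_branch, H, W)

x = np.random.default_rng(1).standard_normal((8, 8))
out = msb_block(x)
print(out.shape)  # (6, 8, 8)
```

Because every branch preserves the spatial size, the concatenated output can feed directly into the next MSB module or the density-map regressor.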
The main results of the shallow network can be found in Table I, Table II and Table III. With the proposed negative correlation learning strategy, it is interesting to see that: (1) both our deep and shallow networks work well; (2) deep networks (DNCL) are better than shallower networks (NCL), as expected. However, the shallower network (NCL) still leads to competitive results and may be advantageous in resource-constrained scenarios as it is computationally cheaper.
Other Aggregation Methods. Our loss function is derived under the widely used ensemble setting in which each base model is assigned equal importance. In this part, we investigate the effect of DNCL when different base models have different importance. To this end, we use another convolution layer to aggregate the results from the base models and report the results on the Shanghaitech Part_A dataset, summarized in Table XII. The experimental results show that the proposed equal-weight method achieves better results. Nevertheless, as diversity is also enhanced in this “weighted average” variant, its results are still better than the conventional ensemble, as expected.
Visualization of the Enhanced Diversities. In this section, we provide further evidence of the enhanced diversity in our ensemble method. We choose crowd counting as our study case and compute the pair-wise Euclidean distance between each pair of predictions from the base models. From sub-figures (e) and (f), we can observe that there exists a larger discrepancy among the base models in the proposed method, as expected. Finally, a more diversified ensemble leads to better final performance, as can be seen in sub-figures (b), (c) and (d).
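The diagnostic described here amounts to a symmetric distance matrix over the base models' flattened density maps; a sketch with toy predictions:

```python
import numpy as np

def pairwise_distances(preds):
    """Euclidean distance between every pair of flattened predictions.
    `preds` has shape (K, ...) with one prediction per base model;
    returns a symmetric (K, K) matrix with zeros on the diagonal."""
    flat = np.asarray(preds, float).reshape(len(preds), -1)
    diff = flat[:, None, :] - flat[None, :, :]
    return np.sqrt(np.sum(diff ** 2, axis=-1))

# Three toy "density maps" flattened to two values each (illustrative).
preds = np.array([[0.0, 0.0],
                  [3.0, 4.0],
                  [0.0, 0.0]])
D = pairwise_distances(preds)
print(D[0, 1])  # distance between model 0 and model 1: 5.0
```

Larger off-diagonal entries indicate more diverse base predictions, which is the behavior the diversity penalty is meant to encourage.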
In this paper, we present a simple yet effective learning strategy for regression. We pose a typical deep regression network as an ensemble learning problem and learn a pool of weak regressors on convolutional feature maps. The main component of this ensemble architecture is the introduction of negative correlation learning (NCL), which aims to improve the generalization capability of ensemble models. We show that the proposed method has sound generalization capability through managing the intrinsic diversity of the ensemble. The proposed method is generic and independent of the backbone network architecture. Extensive experiments on several challenging tasks, including crowd counting, personality analysis, age estimation and image super-resolution, demonstrate the superiority of the proposed method over other loss functions and the current state of the art.
This research was supported by NSFC (NO. 61620106008), the national youth talent support program, and Tianjin Natural Science Foundation (17JCJQJC43700, 18ZXZNGX00110).