Constrained Mutual Convex Cone Method for Image Set Based Recognition

03/14/2019 · Naoya Sogi et al. · University of Tsukuba; City, University of London; UCL

In this paper, we propose a method for image-set classification based on convex cone models. Image-set classification aims to classify a set of images, typically obtained from video frames or multi-view cameras, into the class of the target object. To classify a set accurately and stably, it is essential to represent the structural information of the set accurately. There are various representative image features, such as histogram-based features, HLAC, and convolutional neural network (CNN) features. We should note that most of them are non-negative and thus can be effectively represented by a convex cone. This leads us to introduce the convex cone representation to image-set classification. To establish a convex cone based framework, we mathematically define multiple angles between two convex cones, and then define the geometric similarity between the cones using the angles. Moreover, to enhance the framework, we introduce a discriminant space that maximizes the between-class variance (gaps) among the convex cones projected onto it and minimizes the within-class variance of the projected cones, similar to the Fisher discriminant analysis. Finally, the classification is performed based on the similarity between the projected convex cones. The effectiveness of the proposed method is demonstrated experimentally using five databases: the CMU PIE dataset, ETH-80, the CMU Motion of Body dataset, the YouTube Celebrity dataset, and a private database of multi-view hand shapes.


1 Introduction

In this paper, we propose a method for image-set classification based on convex cone models, which can precisely represent the geometrical structure of an image set. In particular, we discuss the effectiveness of combining the proposed method with convolutional neural network (CNN) features extracted from a high-level hidden layer of a trained CNN.

For the last decade, image set based classification methods have gained substantial attention in various applications using multi-view images or videos, such as 3D object recognition and motion analysis. The essence of image set based classification lies in how to measure the similarity between two image sets effectively and at low computational cost. To this end, several types of methods using different set models have been proposed (Fukui and Yamaguchi, 2005; Sakano and Mukawa, 2000; Fukui and Yamaguchi, 2007; Fukui et al., 2006; Fukui and Maki, 2015; Kim et al., 2007; Wang et al., 2008; Cevikalp and Triggs, 2010; Lu et al., 2017, 2015; Hayat et al., 2015; Feng et al., 2016; Shah et al., 2017; Yamaguchi et al., 1998).

In this paper, among the above methods, we focus on subspace based methods, considering the compactness of a subspace model, the simple geometrical relationship between class subspaces, and the practical and efficient computation. In this type of method, a set of images is compactly modeled by a subspace in a high-dimensional vector space, where the subspace is generated by applying principal component analysis (PCA) to the image set without data centering. After each image set is converted to a subspace, the similarity between two subspaces can be calculated by using the canonical angles between them (Afriat, 1957; Hotelling, 1936). Typical subspace-based methods are the mutual subspace method (MSM) (Yamaguchi et al., 1998) and its extension, the constrained mutual subspace method (CMSM) (Fukui and Yamaguchi, 2005).

Besides the above advantages, the validity of the subspace representation is also supported by the following physical characteristic: images of a convex object with Lambertian reflectance under various illumination conditions can be represented by a low-dimensional subspace, called an illumination subspace (Georghiades et al., 2001; Belhumeur and Kriegman, 1998; Lee et al., 2005). In other words, in object recognition, the subspace of an object can be stably generated from only a few sample images taken under different illumination conditions. Our convex cone representation is an enhanced extension of the subspace representation.

Conventional subspace-based methods take raw intensity vectors or hand-crafted features as input. Regarding more discriminative features, many recent studies have revealed that CNN features are effective inputs for various types of classifiers (Sharif Razavian et al., 2014; Chen et al., 2016; Guanbin Li and Yu, 2015; Azizpour et al., 2016). Inspired by these successes, we expect that CNN features can also serve as discriminative inputs for subspace based methods such as MSM and CMSM. In this paper, we verify the effectiveness of CNN features for subspace based methods as the baseline. To the best of our knowledge, this paper is the first comprehensive report on the validity of the combination of MSM/CMSM and CNN features.

Figure 1: Conceptual diagram of the proposed constrained mutual convex cone method (CMCM). First, a set of CNN features is extracted from an image set. Then, each set of CNN features is represented by a convex cone. After the convex cones are projected onto the discriminant space $\mathcal{D}$, the classification is performed by measuring the similarity based on the angles between the two projected convex cones.

CNN feature vectors have only non-negative values when the rectified linear unit (ReLU) (Nair and Hinton, 2010) is used as the activation function. Although many types of features have a non-negativity constraint, in this paper we focus on CNN features. This characteristic does not allow CNN features to be combined with negative coefficients; accordingly, a set of CNN features forms a convex cone, rather than a subspace, in a high-dimensional vector space.

For example, it is well known that a set of frontal face images under various illumination conditions forms a convex cone, referred to as an illumination cone (Georghiades et al., 2001; Belhumeur and Kriegman, 1998; Lee et al., 2005). The illumination cone is a stricter representation than the illumination subspace mentioned above. Several previous studies have demonstrated the advantages of the convex cone representation over the subspace representation (Kobayashi and Otsu, 2008; Kobayashi et al., 2010; Wang et al., 2017, 2018). These advantages naturally motivated us to replace the subspace with a convex cone when modeling a set of CNN features, or any other type of feature with a non-negativity constraint.

In this framework, it is necessary to consider how to calculate the geometric similarity between two convex cones. To this end, we define multiple angles between two convex cones, following the definition of the canonical angles (Hotelling, 1936; Afriat, 1957) between two subspaces. Although the canonical angles between two subspaces can be obtained analytically from the orthonormal basis vectors of the subspaces, the definition of angles between two convex cones is not trivial, as the non-negativity constraint must be considered. In this paper, we define multiple angles between convex cones sequentially, from the smallest to the largest, by repeatedly applying the alternating least squares method (Tenenhaus, 1988). Then, the geometric similarity between two convex cones is defined based on the obtained angles. We call the classification method using this similarity index the mutual convex cone method (MCM), corresponding to the mutual subspace method (MSM).

Moreover, to enhance the performance of the MCM, we introduce a discriminant space $\mathcal{D}$, which maximizes the between-class variance (gaps) among the convex cones projected onto it and minimizes the within-class variance of the projected convex cones, similar to the Fisher discriminant analysis (Fisher, 1936). The class separability can be increased by projecting the class convex cones onto the discriminant space, as shown in Fig. 1. As a result, the classification ability of MCM is enhanced, similar to the projection of class subspaces onto a generalized difference subspace (GDS) in CMSM (Fukui and Maki, 2015). Finally, we perform the classification using the angles between the projected convex cones. We call this enhanced method the “constrained mutual convex cone method (CMCM),” corresponding to the constrained MSM (CMSM). This idea was motivated by our previous preliminary work (Sogi et al., 2018), and this paper provides a deeper analysis with extensive and comprehensive experiments.

The main contributions of this paper are summarized as follows.

  1. We verify the validity of the combination of MSM/CMSM and CNN features, which has not yet been reported in the fields of computer vision and machine learning.

  2. To enhance the framework of subspace based methods, we introduce a convex cone representation to accurately and compactly represent a set of features with a non-negativity constraint, as typified by CNN features.

  3. We introduce two novel mechanisms in our image set based classification: a) multiple angles between two convex cones to measure the similarity between the cones; and b) a discriminant space to increase the class separability among convex cones.

  4. We propose two novel image set based classification methods, called MCM and CMCM, based on convex cone representation and the discriminant space.

The paper is organized as follows. In Section 2, we describe the algorithms of conventional methods, such as MSM and CMSM. In Section 3, we describe the details of the proposed method. In Section 4, we demonstrate the validity of the proposed method through visualization and classification experiments using four public datasets, i.e., CMU PIE (Gross et al., 2010), ETH-80 (Leibe and Schiele, 2003), CMU Motion of Body (Gross and Shi, 2001), and YouTube Celebrity (Kim et al., 2008), as well as a private database of multi-view hand shapes. Section 5 concludes the paper.

2 Related work

In this section, we first describe the algorithms for the MSM and CMSM, which are standard methods for image set classification. Then, we provide an overview of the concept of convex cones.

2.1 Mutual subspace method based on canonical angles

Figure 2: Conceptual diagram of the canonical angles and canonical vectors. The first canonical vectors $\mathbf{u}_1$ and $\mathbf{v}_1$ form the smallest angle $\theta_1$ between the subspaces. The second canonical vectors form the smallest angle $\theta_2$ in a direction orthogonal to $\mathbf{u}_1$ and $\mathbf{v}_1$.

MSM is a classifier based on canonical angles between two subspaces, where each subspace represents an image set.

Given an $N_1$-dimensional subspace $\mathcal{V}_1$ and an $N_2$-dimensional subspace $\mathcal{V}_2$ in a $D$-dimensional vector space, where $N_1 \leq N_2$, the canonical angles $\{0 \leq \theta_1 \leq \cdots \leq \theta_{N_1} \leq \frac{\pi}{2}\}$ between $\mathcal{V}_1$ and $\mathcal{V}_2$ are recursively defined as follows (Hotelling, 1936; Afriat, 1957):

$$\cos \theta_i = \max_{\substack{\mathbf{u}_i \in \mathcal{V}_1,\ \mathbf{v}_i \in \mathcal{V}_2 \\ \mathbf{u}_i \perp \mathbf{u}_j,\ \mathbf{v}_i \perp \mathbf{v}_j\ (j = 1, \ldots, i-1)}} \frac{\mathbf{u}_i^{\top} \mathbf{v}_i}{\|\mathbf{u}_i\| \, \|\mathbf{v}_i\|}, \tag{1}$$

where $\mathbf{u}_i$ and $\mathbf{v}_i$ are the canonical vectors forming the $i$-th smallest canonical angle $\theta_i$ between $\mathcal{V}_1$ and $\mathcal{V}_2$. The $i$-th canonical angle $\theta_i$ is the smallest angle in the direction orthogonal to the canonical vectors of the preceding angles $\theta_1, \ldots, \theta_{i-1}$, as shown in Fig. 2.

The canonical angles can be calculated from the orthogonal projection matrices onto the subspaces $\mathcal{V}_1$ and $\mathcal{V}_2$. Let $\{\boldsymbol{\phi}_i\}_{i=1}^{N_1}$ be orthonormal basis vectors of $\mathcal{V}_1$ and $\{\boldsymbol{\psi}_i\}_{i=1}^{N_2}$ be orthonormal basis vectors of $\mathcal{V}_2$. The projection matrices $\mathbf{P}_1$ and $\mathbf{P}_2$ are calculated as $\mathbf{P}_1 = \sum_{i=1}^{N_1} \boldsymbol{\phi}_i \boldsymbol{\phi}_i^{\top}$ and $\mathbf{P}_2 = \sum_{i=1}^{N_2} \boldsymbol{\psi}_i \boldsymbol{\psi}_i^{\top}$, respectively. Then, $\cos^2 \theta_i$ is the $i$-th largest eigenvalue of $\mathbf{P}_1 \mathbf{P}_2$ or $\mathbf{P}_2 \mathbf{P}_1$. Alternatively, the canonical angles can be easily obtained by applying the singular value decomposition (SVD) to the orthonormal basis vectors of the subspaces.

The geometric similarity between two subspaces $\mathcal{V}_1$ and $\mathcal{V}_2$ is defined by using the canonical angles as follows:

$$\mathrm{sim}(\mathcal{V}_1, \mathcal{V}_2) = \frac{1}{N_1} \sum_{i=1}^{N_1} \cos^2 \theta_i. \tag{2}$$

In MSM, an input subspace is classified by comparison with class subspaces using this similarity as shown in Fig.3.
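To make the computation concrete, the following Python sketch computes the canonical angles via SVD and the similarity of Eq. (2). This is a minimal illustration under our own assumptions (the helper names `subspace_basis`, `canonical_cosines`, and `msm_similarity` are hypothetical), not the authors' implementation.

```python
# Minimal sketch: canonical angles between two subspaces and the MSM
# similarity of Eq. (2). Assumes NumPy only.
import numpy as np

def subspace_basis(X, dim):
    """Basis of the subspace of an image set X (D x n): PCA without
    centering, i.e., the leading left singular vectors of the data matrix."""
    U, _, _ = np.linalg.svd(X, full_matrices=False)
    return U[:, :dim]

def canonical_cosines(V1, V2):
    """Cosines of the canonical angles between subspaces with orthonormal
    bases V1 (D x N1) and V2 (D x N2): singular values of V1^T V2."""
    s = np.linalg.svd(V1.T @ V2, compute_uv=False)
    return np.clip(s, 0.0, 1.0)

def msm_similarity(V1, V2, n_angles=None):
    """Eq. (2): mean of cos^2 over the first canonical angles."""
    c = canonical_cosines(V1, V2)
    if n_angles is not None:
        c = c[:n_angles]
    return float(np.mean(c ** 2))
```

An input set would then be assigned to the class whose reference subspace yields the largest `msm_similarity` value, as illustrated in Fig. 3.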

Figure 3: Conceptual diagram of conventional MSM. Each image set is represented by a subspace, which is generated by applying the PCA to the set. In classification, the similarity between two subspaces is measured based on the canonical angles between them. An input subspace is assigned to the class of the subspace with the greatest similarity.

2.2 Constrained MSM

The essence of the constrained MSM (CMSM) is the application of the MSM after projection onto a generalized difference subspace (GDS) (Fukui and Maki, 2015), as shown in Fig. 4. The GDS is designed to contain only the difference components among class subspaces. Thus, the projection of class subspaces onto the GDS can increase the class separability among them, substantially improving the classification ability of MSM (Fukui and Maki, 2015).

2.3 Convex cone model

In this subsection, we explain the definition of a convex cone and the projection of a vector onto a convex cone. A convex cone $\mathcal{C}$ is defined by a finite set of basis vectors $\{\mathbf{b}_i\}_{i=1}^{d}$ as follows:

$$\mathcal{C} = \Bigl\{\, \mathbf{x} \;\Big|\; \mathbf{x} = \sum_{i=1}^{d} \alpha_i \mathbf{b}_i,\ \alpha_i \geq 0 \,\Bigr\}. \tag{3}$$

As indicated by this definition, the difference between a subspace and a convex cone is whether the combination coefficients are constrained to be non-negative.

Given a set of feature vectors $\{\mathbf{x}_i\}_{i=1}^{n}$, the basis vectors $\{\mathbf{b}_i\}_{i=1}^{d}$ of a convex cone representing the distribution of the set can be obtained by non-negative matrix factorization (NMF) (Lee and Seung, 1999; Kim and Park, 2008). Let $\mathbf{X} = [\mathbf{x}_1, \ldots, \mathbf{x}_n]$ and $\mathbf{B} = [\mathbf{b}_1, \ldots, \mathbf{b}_d]$. NMF generates the basis vectors by solving the following optimization problem:

$$\min_{\mathbf{B}, \mathbf{H}} \|\mathbf{X} - \mathbf{B}\mathbf{H}\|_F^2 \quad \mathrm{s.t.}\ \mathbf{B} \geq 0,\ \mathbf{H} \geq 0, \tag{4}$$

where $\|\cdot\|_F$ denotes the Frobenius norm. We use the alternating non-negativity-constrained least squares based method (Kim and Park, 2008) to solve this problem.
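As a concrete illustration of Eq. (4), the sketch below obtains cone basis vectors with an off-the-shelf NMF solver. Note that the paper uses the alternating non-negativity-constrained least squares method of Kim and Park (2008); substituting scikit-learn's NMF solver here is our simplifying assumption.

```python
# Minimal sketch: convex cone basis vectors via NMF (Eq. (4)).
import numpy as np
from sklearn.decomposition import NMF

def cone_basis(X, n_basis, max_iter=500):
    """X: (D, n) matrix whose columns are non-negative feature vectors.
    Returns B: (D, n_basis), the basis matrix of the convex cone."""
    model = NMF(n_components=n_basis, init="nndsvda", max_iter=max_iter)
    B = model.fit_transform(X)   # X ~= B @ H with B >= 0, H >= 0
    return B
```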

Figure 4: Conceptual diagram of the constrained MSM (CMSM). By projecting class subspaces onto the generalized difference subspace, the separability between the classes is increased. By measuring the similarities among the projected subspaces using the canonical angles, the input subspace is assigned to either class 1 or 2.

Although the basis vectors can be easily obtained by NMF, the projection of a vector onto the convex cone is slightly complicated by the non-negativity constraint on the coefficients. In Kobayashi and Otsu (2008), a vector $\mathbf{x}$ is projected onto the convex cone by applying the non-negative least squares method (Bro and De Jong, 1997) as follows:

$$\hat{\boldsymbol{\alpha}} = \mathop{\arg\min}_{\boldsymbol{\alpha}} \|\mathbf{x} - \mathbf{B}\boldsymbol{\alpha}\|_2^2 \quad \mathrm{s.t.}\ \boldsymbol{\alpha} \geq 0. \tag{5}$$

The projected vector $\hat{\mathbf{x}}$ is obtained as $\hat{\mathbf{x}} = \mathbf{B}\hat{\boldsymbol{\alpha}}$.

Finally, the angle $\theta$ between the convex cone $\mathcal{C}$ and a vector $\mathbf{x}$ can be calculated as follows:

$$\theta = \arccos \frac{\mathbf{x}^{\top} \hat{\mathbf{x}}}{\|\mathbf{x}\| \, \|\hat{\mathbf{x}}\|}. \tag{6}$$
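Eqs. (5) and (6) map directly onto a standard non-negative least squares solver; a minimal sketch (using `scipy.optimize.nnls`, our choice of solver) is shown below.

```python
# Minimal sketch: projection onto a convex cone (Eq. (5)) and the angle
# between a cone and a vector (Eq. (6)).
import numpy as np
from scipy.optimize import nnls

def project_onto_cone(B, x):
    """B: (D, d) cone basis, x: (D,) vector. Solves Eq. (5) and returns
    the projected vector B @ alpha with alpha >= 0."""
    alpha, _ = nnls(B, x)
    return B @ alpha

def cone_vector_angle(B, x):
    """Eq. (6): angle between the cone spanned by B and the vector x."""
    p = project_onto_cone(B, x)
    denom = np.linalg.norm(x) * np.linalg.norm(p)
    if denom == 0.0:   # zero projection: x sees the cone at 90 degrees
        return np.pi / 2
    return float(np.arccos(np.clip(x @ p / denom, -1.0, 1.0)))
```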

3 Proposed method

In this section, we explain the algorithms in the MCM and CMCM, after establishing the definition of geometric similarity between two convex cones.

3.1 Geometric similarity between two convex cones

We define the geometric similarity between two convex cones. To this end, we consider how to define multiple angles between two convex cones, analogous to the canonical angles. Let two convex cones $\mathcal{C}_1$ and $\mathcal{C}_2$ be formed by the basis vectors $\{\mathbf{b}_i^1\}_{i=1}^{d_1}$ and $\{\mathbf{b}_i^2\}_{i=1}^{d_2}$, respectively. Assume that $d_1 \leq d_2$ for convenience. The angles between two convex cones cannot be obtained analytically like the canonical angles between two subspaces, as the non-negativity constraint must be considered. Alternatively, we find the two vectors $\mathbf{u} \in \mathcal{C}_1$ and $\mathbf{v} \in \mathcal{C}_2$ that are closest to each other, and define the angle between the two convex cones as the angle formed by these two vectors. In this way, we sequentially define multiple angles, from the smallest to the largest.

Figure 5: Conceptual diagram of the procedure for searching for the pairs of vectors $(\mathbf{u}_i, \mathbf{v}_i)$. The first pair $(\mathbf{u}_1, \mathbf{v}_1)$ can be found by the alternating least squares method. The second pair $(\mathbf{u}_2, \mathbf{v}_2)$ is obtained by searching the orthogonal complement of $\mathrm{Span}(\{\mathbf{u}_1, \mathbf{v}_1\})$.

First, we search for the pair of $D$-dimensional vectors $\mathbf{u}_1 \in \mathcal{C}_1$ and $\mathbf{v}_1 \in \mathcal{C}_2$ that have the maximum correlation, using the alternating least squares method (ALS) (Tenenhaus, 1988). The first angle $\theta_1$ is defined as the angle formed by $\mathbf{u}_1$ and $\mathbf{v}_1$. This pair can be found by using the following algorithm:

Algorithm to search for the pair $(\mathbf{u}_1, \mathbf{v}_1)$
Let $P_1(\cdot)$ and $P_2(\cdot)$ be the projections of a vector onto $\mathcal{C}_1$ and $\mathcal{C}_2$, respectively. For the details of the projection, see Section 2.3.

  1. Randomly initialize $\mathbf{v}$ and normalize its length to 1.

  2. $\mathbf{u} \leftarrow P_1(\mathbf{v}) / \|P_1(\mathbf{v})\|$.

  3. $\mathbf{v}' \leftarrow P_2(\mathbf{u}) / \|P_2(\mathbf{u})\|$.

  4. $\delta \leftarrow \|\mathbf{v}' - \mathbf{v}\|$.

  5. If $\delta$ is sufficiently small, the procedure is completed. Otherwise, return to 2) setting $\mathbf{v} \leftarrow \mathbf{v}'$.

  6. $\mathbf{u}_1 \leftarrow \mathbf{u}$, $\mathbf{v}_1 \leftarrow \mathbf{v}'$.

For the second angle $\theta_2$, we search for a pair of vectors $\mathbf{u}_2$ and $\mathbf{v}_2$ with the maximum correlation, under the constraint that they have the minimum correlation with $\mathbf{u}_1$ and $\mathbf{v}_1$. Such a pair can be found by applying the ALS to the convex cones projected onto the orthogonal complement of the subspace spanned by $\mathbf{u}_1$ and $\mathbf{v}_1$, as shown in Fig. 5. Then $\theta_2$ is formed by $\mathbf{u}_2$ and $\mathbf{v}_2$. In this way, we can sequentially obtain all the pairs of vectors forming the $i$-th angle $\theta_i$.

With the resulting angles $\{\theta_i\}$, we define the geometrical similarity between two convex cones $\mathcal{C}_1$ and $\mathcal{C}_2$ as follows:

$$\mathrm{sim}(\mathcal{C}_1, \mathcal{C}_2) = \frac{1}{N_c} \sum_{i=1}^{N_c} \cos^2 \theta_i, \tag{7}$$

where $N_c$ is the number of angles used for calculating the similarity.
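The following sketch puts Section 3.1 together: the ALS search for the closest pair, deflation onto the orthogonal complement of the found pairs (cf. Fig. 5), and the similarity of Eq. (7). It reuses `project_onto_cone` from Section 2.3; the initialization, iteration limit, and convergence test are our assumptions.

```python
# Minimal sketch: multiple angles between two convex cones and Eq. (7).
import numpy as np

def closest_pair(B1, B2, n_iter=100, tol=1e-6, seed=0):
    """ALS search for unit vectors u in cone(B1), v in cone(B2) with
    maximum correlation. Assumes the projections stay nonzero."""
    rng = np.random.default_rng(seed)
    v = rng.random(B1.shape[0])
    v /= np.linalg.norm(v)
    for _ in range(n_iter):
        u = project_onto_cone(B1, v)
        u /= np.linalg.norm(u)
        v_new = project_onto_cone(B2, u)
        v_new /= np.linalg.norm(v_new)
        if np.linalg.norm(v_new - v) < tol:
            return u, v_new
        v = v_new
    return u, v

def cone_similarity(B1, B2, n_angles):
    """Eq. (7): mean of cos^2 over the first n_angles angles."""
    cos2 = []
    for _ in range(n_angles):
        u, v = closest_pair(B1, B2)
        cos2.append(np.clip(u @ v, -1.0, 1.0) ** 2)
        # Deflate: restrict both cones to the orthogonal complement of
        # span{u, v} before searching for the next pair (cf. Fig. 5).
        Q, _ = np.linalg.qr(np.stack([u, v], axis=1))
        P = np.eye(len(u)) - Q @ Q.T
        B1, B2 = P @ B1, P @ B2
    return float(np.mean(cos2))
```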

3.2 Mutual convex cone method

The mutual convex cone method (MCM) classifies an input convex cone based on the similarities defined by Eq.(7) between the input and the class convex cones. MCM consists of two phases, a training phase and a recognition phase, as summarized in Fig.6.

Assume that we are given $C$ class image sets, each consisting of $n$ images.

Figure 6: Process flow of the proposed mutual convex cone method (MCM), which consists of a training phase and a recognition phase.

Training Phase

  1. Feature vectors $\{\mathbf{x}_i^c\}_{i=1}^{n}$ are extracted from the images of class $c$.

  2. The basis vectors of the class-$c$ convex cone, $\{\mathbf{b}_i^c\}$, are generated by applying NMF to the set of feature vectors $\{\mathbf{x}_i^c\}$.

  3. $\{\mathbf{b}_i^c\}$ are registered as the reference convex cone $\mathcal{C}_c$ of class $c$.

  4. The above process is conducted for all $C$ classes.

Recognition Phase

  1. A set of images is input.

  2. Feature vectors $\{\mathbf{x}_i\}_{i=1}^{n}$ are extracted from the images.

  3. The basis vectors of the input convex cone $\mathcal{C}_{in}$, $\{\mathbf{b}_i^{in}\}$, are generated by applying NMF to the input set of feature vectors.

  4. The input image set is classified based on the similarity (Eq. (7)) between the input convex cone $\mathcal{C}_{in}$ and each class reference convex cone $\mathcal{C}_c$.
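A compact end-to-end sketch of both phases, built on the hypothetical helpers `cone_basis` (Section 2.3) and `cone_similarity` (Section 3.1), might look as follows; `extract_features` stands in for any non-negative feature extractor, e.g., ReLU CNN features.

```python
# Minimal sketch: MCM training and recognition phases.
import numpy as np

def mcm_train(class_image_sets, n_basis, extract_features):
    """One reference cone basis per class (training phase, steps 1-4)."""
    return [cone_basis(np.stack([extract_features(im) for im in images],
                                axis=1), n_basis)
            for images in class_image_sets]

def mcm_classify(image_set, reference_cones, n_basis, n_angles,
                 extract_features):
    """Recognition phase, steps 1-4: returns the index of the best class."""
    X = np.stack([extract_features(im) for im in image_set], axis=1)
    B_in = cone_basis(X, n_basis)
    sims = [cone_similarity(B_in, B_c, n_angles) for B_c in reference_cones]
    return int(np.argmax(sims))
```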

3.3 Generation of discriminant space

To enhance the performance of the mutual convex cone method, we introduce a discriminant space $\mathcal{D}$, which maximizes the between-class variance and minimizes the within-class variance of the convex cones projected onto $\mathcal{D}$, similarly to the Fisher discriminant analysis (FDA). In our method, the within-class variance is calculated from the basis vectors of the convex cones, and the between-class variance is calculated from the gaps among the convex cones, to effectively utilize the information carried by the cones.

We define these gaps as follows. Let $\mathcal{C}_c$ be the $c$-th class convex cone with basis vectors $\{\mathbf{b}_i^c\}_{i=1}^{d_c}$, let $P_c(\cdot)$ be the projection of a vector onto $\mathcal{C}_c$ as defined through Eq. (5), and let $C$ be the number of classes. We consider a set of vectors $\mathbf{u}_1^c \in \mathcal{C}_c$, $c = 1, \ldots, C$, such that the sum of the correlations between all the pairs is maximized. Such a set of vectors can be obtained by using the following algorithm. This algorithm is almost the same as the generalized canonical correlation analysis (Vía et al., 2005, 2007), except that the non-negative least squares (LS) method is used instead of the standard LS method.

Procedure to search for the set of first vectors $\{\mathbf{u}_1^c\}_{c=1}^{C}$

  1. Randomly initialize a vector $\mathbf{z}$ and normalize its length to 1.

  2. Project $\mathbf{z}$ onto each convex cone, and then normalize each projection as $\mathbf{u}^c = P_c(\mathbf{z}) / \|P_c(\mathbf{z})\|$.

  3. $\mathbf{z}' \leftarrow \frac{1}{C} \sum_{c=1}^{C} \mathbf{u}^c$, normalized to unit length.

  4. If $\|\mathbf{z}' - \mathbf{z}\|$ is sufficiently small, the procedure is completed with $\mathbf{u}_1^c = \mathbf{u}^c$. Otherwise, return to 2) setting $\mathbf{z} \leftarrow \mathbf{z}'$.

Next, we search for the set of second vectors $\{\mathbf{u}_2^c\}$ with the maximum sum of correlations, under the constraint that they have the minimum correlation with the previously found $\{\mathbf{u}_1^c\}$. The second vectors can be obtained by applying the above procedure to the convex cones projected onto the orthogonal complement of the first vectors. In the same way, the set of $k$-th vectors $\{\mathbf{u}_k^c\}$ can be sequentially obtained by applying the procedure to the convex cones projected onto the orthogonal complement of all the previously found vectors. In this way, we finally obtain the sets of vectors $\{\mathbf{u}_k^c\}$. With these sets, we define the difference vectors as follows:

$$\mathbf{d}_k^{ab} = \mathbf{u}_k^a - \mathbf{u}_k^b, \quad 1 \leq a < b \leq C. \tag{8}$$

Considering that each difference vector represents the gap between two convex cones, we define the between-class variance matrix $\mathbf{S}_b$ using these vectors as follows:

$$\mathbf{S}_b = \sum_{k=1}^{N_d} \sum_{a<b} \mathbf{d}_k^{ab} {\mathbf{d}_k^{ab}}^{\top}, \tag{9}$$

where the number of the used sets, $N_d$, can be set from 1 to the number of the obtained sets.

Next, we define the within-class variance matrix $\mathbf{S}_w$ using the basis vectors of all the class convex cones as follows:

$$\mathbf{S}_w = \sum_{c=1}^{C} \sum_{i=1}^{d_c} (\mathbf{b}_i^c - \bar{\mathbf{b}}^c)(\mathbf{b}_i^c - \bar{\mathbf{b}}^c)^{\top}, \tag{10}$$

where $\bar{\mathbf{b}}^c = \frac{1}{d_c} \sum_{i=1}^{d_c} \mathbf{b}_i^c$. Finally, the $N_{\mathcal{D}}$-dimensional discriminant space $\mathcal{D}$ is spanned by the eigenvectors corresponding to the $N_{\mathcal{D}}$ largest eigenvalues $\lambda$ of the following generalized eigenvalue problem:

$$\mathbf{S}_b \mathbf{w} = \lambda \mathbf{S}_w \mathbf{w}. \tag{11}$$
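The sketch below summarizes Section 3.3 under our reading of the text: the alternating projection loop for the vectors $\mathbf{u}_k^c$ (the mean-update step is our MAXVAR-style assumption), the scatter matrices of Eqs. (9) and (10), and Eq. (11) interpreted as the Fisher-style generalized eigenproblem, with a small regularizer to keep $\mathbf{S}_w$ invertible. It reuses `project_onto_cone` from Section 2.3.

```python
# Minimal sketch: generation of the discriminant space D (Section 3.3).
import numpy as np
from itertools import combinations
from scipy.linalg import eigh

def gcca_vectors(cone_bases, n_sets, n_iter=100, tol=1e-6, seed=0):
    """For k = 1..n_sets, find u_k^c for every class cone by alternating
    projection; deflate onto the orthogonal complement after each set.
    Assumes the projections stay nonzero."""
    rng = np.random.default_rng(seed)
    D = cone_bases[0].shape[0]
    bases, found, U = [B.copy() for B in cone_bases], [], []
    for _ in range(n_sets):
        z = rng.random(D)
        z /= np.linalg.norm(z)
        for _ in range(n_iter):
            us = [project_onto_cone(B, z) for B in bases]
            us = [u / np.linalg.norm(u) for u in us]
            z_new = np.mean(us, axis=0)
            z_new /= np.linalg.norm(z_new)
            if np.linalg.norm(z_new - z) < tol:
                z = z_new
                break
            z = z_new
        U.append(np.stack(us, axis=1))         # columns: u_k^c, c = 1..C
        found.append(z)
        Q, _ = np.linalg.qr(np.stack(found, axis=1))
        P = np.eye(D) - Q @ Q.T                # deflation projector
        bases = [P @ B for B in cone_bases]
    return U

def discriminant_space(cone_bases, n_sets, n_dims, reg=1e-6):
    D = cone_bases[0].shape[0]
    S_b = np.zeros((D, D))
    for U_k in gcca_vectors(cone_bases, n_sets):
        for a, b in combinations(range(U_k.shape[1]), 2):
            d = U_k[:, a] - U_k[:, b]          # difference vector, Eq. (8)
            S_b += np.outer(d, d)              # Eq. (9)
    S_w = np.zeros((D, D))
    for B in cone_bases:
        Bc = B - B.mean(axis=1, keepdims=True)
        S_w += Bc @ Bc.T                       # Eq. (10)
    _, W = eigh(S_b, S_w + reg * np.eye(D))    # Eq. (11), ascending order
    return W[:, ::-1][:, :n_dims]              # leading eigenvectors span D
```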

3.4 Constrained mutual convex cone method

Figure 7: Process flow of the proposed constrained MCM (CMCM). CMCM is an enhanced version of MCM with the projection of the class convex cones onto the discriminant space $\mathcal{D}$.

We construct the constrained MCM (CMCM) by incorporating the projection onto the discriminant space $\mathcal{D}$ into the MCM. CMCM consists of a training phase and a recognition phase, as shown in Fig. 7. In the following, we explain each phase for the case in which each of the $C$ classes has $n$ images.

Training Phase

  1. Feature vectors $\{\mathbf{x}_i^c\}$ are extracted from the images of each class $c$.

  2. The basis vectors of the $c$-th class convex cone, $\{\mathbf{b}_i^c\}$, are generated by applying NMF to each class set of feature vectors.

  3. The sets of difference vectors $\{\mathbf{d}_k^{ab}\}$ are generated according to the procedure described in Section 3.3.

  4. The discriminant space $\mathcal{D}$ is generated by solving Eq. (11) using $\mathbf{S}_b$ and $\mathbf{S}_w$.

  5. The basis vectors $\{\mathbf{b}_i^c\}$ are projected onto the discriminant space, and the lengths of the projected basis vectors are normalized to 1. The set of these basis vectors, $\{\hat{\mathbf{b}}_i^c\}$, forms the projected convex cone.

  6. $\{\hat{\mathbf{b}}_i^c\}$ are registered as the reference convex cone of class $c$.

Recognition Phase

  1. A set of images is input.

  2. Feature vectors $\{\mathbf{x}_i\}$ are extracted from the images.

  3. The basis vectors of the input convex cone, $\{\mathbf{b}_i^{in}\}$, are generated by applying NMF to the set of feature vectors.

  4. The basis vectors are projected onto the discriminant space $\mathcal{D}$, and the lengths of the projected basis vectors are normalized to 1. The normalized projections are denoted by $\{\hat{\mathbf{b}}_i^{in}\}$.

  5. The input set is classified based on the similarity (Eq. (7)) between the projected input convex cone and each projected class reference convex cone.

4 Evaluation experiments

In this section, we demonstrate the effectiveness of the proposed methods through four experiments. The first experiment uses the ETH-80 dataset to verify the effectiveness of using multiple angles between convex cones as the similarity between them. The second experiment analyzes the attributes of the difference vectors between two convex cones by visualizing them as images. The third experiment evaluates the classification performance of the proposed methods with a large number of training samples, using three datasets: 1) ETH-80 (Leibe and Schiele, 2003), 2) CMU Motion of Body (CMU MoBo) (Gross and Shi, 2001), and 3) YouTube Celebrities (YTC) (Kim et al., 2008). The fourth experiment demonstrates the robustness of the proposed methods against the small sample size (SSS) problem, considering the situation in which only few training samples are available for learning. In this experiment, we use the multi-view hand shape dataset (Ohkawa and Fukui, 2012).

4.1 Effectiveness of using multiple angles

Figure 8: Results of classification experiment. The vertical axis denotes accuracy, and the horizontal axis denotes the number of angles used for calculating the similarity.

In this experiment, we verify the effectiveness of using multiple angles for calculating the similarity between convex cones, through a classification experiment using the ETH-80 dataset. The ETH-80 dataset consists of object images of eight different categories, captured from 41 viewpoints. Each category contains ten objects. One object randomly sampled from each category was used for training, and the remaining nine objects were used for testing. As an input image set, we used the 41 multi-view images of each object. The images were scaled to 32 × 32 pixels and converted to grayscale. The vectorized grayscale images were used as input, i.e., the dimension of the feature vector is 1024.

We evaluated the classification performance of the mutual convex cone method (MCM) and the constrained MCM (CMCM) while varying the number of angles used for calculating the similarity. As baselines, the mutual subspace method (MSM) and the constrained MSM (CMSM) were also evaluated. The dimensions of the reference subspaces and convex cones were set to 20, and those of the input subspaces and convex cones were set to 10.

Fig. 8 shows the accuracy of each method against the number of angles used for calculating the similarity. We can confirm that the accuracy of MCM and CMCM increases as the number of angles increases. This result clearly shows the importance of comparing the whole structures of convex cones by using multiple angles, rather than only the minimum angle, for accurate classification.

When only one or two angles are used, the accuracy of CMCM is lower than that of CMSM. However, as the number of angles increases, CMCM outperforms the subspace-based MSM and CMSM. This indicates that multiple angles are required to compare the structures of two convex cones.

4.2 Validity of difference vectors between convex cones

Figure 9: Results of visualizing the difference vectors between two convex cones and the difference vectors between the subspaces of neutral and smile. The parts with values larger than the threshold, which is automatically determined by Otsu's binarization (Otsu, 1979), are emphasized in red.

In this experiment, we demonstrate the validity of the difference vectors $\{\mathbf{d}_k\}$ between convex cones by visualizing them on two sets of facial expressions, neutral and smile, extracted from the CMU PIE dataset (Gross et al., 2010). Each set has 20 frontal face images taken under various illumination conditions.

After representing the two sets of raw images as convex cones, we generated the difference vectors between the two convex cones according to Eq.(8). For comparison, we also calculated the difference vectors between the canonical vectors of two subspaces of the two sets. We set the number of basis vectors of each convex cone to 5 and the dimension of each subspace to 5.

Figure 10: Mean images of the absolute values of (a) the difference vectors between convex cones and (b) the difference vectors between subspaces. The parts with values larger than the threshold, which is automatically determined by Otsu's binarization (Otsu, 1979), are emphasized in red.

Fig. 9 shows the visualizations of the two types of difference vectors. We can see that both sets of difference vectors emphasize the regions around the smile lines and eyes. These regions move substantially, compared with the other regions, when the expression changes from neutral to smile. However, the resolutions of the variations captured by the two types are slightly different. To take a closer look at this difference, we calculated the mean images of the absolute values of the difference vectors, as shown in Fig. 10. The difference vectors between the subspaces roughly capture the differences over the whole face. On the other hand, the difference vectors between the convex cones clearly capture the fine differences around the smile lines and eyes.

Figure 11: Results of the experiment using synthesized data. After generating convex cones $\mathcal{C}_1$ and $\mathcal{C}_2$ for the two sets, we calculated the difference vectors $\{\mathbf{d}_k\}$ between them. Then, we evaluated the cosine similarities between the two convex cones $\mathcal{C}_d$ and $\mathcal{C}_{\mathrm{diff}}$, which are spanned by $\{\mathbf{d}_k\}$ and by the difference images between pairs of the original images, respectively.

Besides, to verify how well a set of difference vectors between two convex cones captures the difference in their structures, we conducted a comparison experiment using two synthetic convex cones $\mathcal{C}_1$ and $\mathcal{C}_2$, which are shown in Fig. 11. Each convex cone is spanned by three basis vectors, which were generated by applying NMF to a set of images of one of two different objects synthesized under 100 illumination conditions. We calculated the difference vectors $\{\mathbf{d}_k\}$ between $\mathcal{C}_1$ and $\mathcal{C}_2$. Let the convex cone spanned by $\{\mathbf{d}_k\}$ be convex cone $\mathcal{C}_d$. Note that the $\{\mathbf{d}_k\}$ are not orthogonal to each other, so they span a convex cone. Besides $\mathcal{C}_d$, we generated a convex cone $\mathcal{C}_{\mathrm{diff}}$, which is spanned by three basis vectors obtained by applying NMF to a set of difference image vectors between pairs of object images of classes 1 and 2. According to our definition, we expect $\mathcal{C}_d$ to have a high correlation with $\mathcal{C}_{\mathrm{diff}}$. In fact, the first three cosine similarities between $\mathcal{C}_d$ and $\mathcal{C}_{\mathrm{diff}}$ are 0.9104, 0.8478, and 0.5426, respectively. These high correlations support that the set of difference vectors, namely, the convex cone spanned by them, effectively captures the structural difference between the convex cones.

4.3 Comparison of classification performance with conventional methods

In this subsection, we evaluate the classification performance of the proposed methods in comparison with various conventional methods using three public datasets. In the following, the details of each dataset and the experimental protocols are described, followed by the experimental results.

4.3.1 ETH-80 dataset

The ETH-80 dataset consists of images of eight different categories, captured from 41 viewpoints. Each category contains ten objects. Five objects randomly sampled from each category were used for training, and the remaining objects were used for testing. As an input image set, we used the 41 multi-view images of each object. For consistency with previous works, we used images scaled to 32 × 32 pixels (Shah et al., 2017; Hayat et al., 2015). We evaluated the classification performance of each method in terms of the average accuracy over ten trials with randomly divided datasets.

For MSM and CMSM, the dimensions of class subspaces, input subspaces, and GDS were set to 50, 30, and 395, respectively. For MCM and CMCM, the numbers of the basis vectors of class and input convex cones were set to 50 and 30, respectively. The dimension of the discriminant space was set to 450. We determined these dimensionalities by cross-validation using the training data.

In this experiment, we used CNN features as the feature vectors. To obtain CNN features under our experimental setting, we slightly modified the original ResNet-50 (He et al., 2016) trained on the ImageNet database (Russakovsky et al., 2015). First, we replaced the final 1000-way fully connected (FC) layer of the original ResNet-50 with a 1024-way FC layer and applied the ReLU function. Then, we added a $C$-way FC layer with softmax (where $C$ is the number of classes) behind the new 1024-way FC layer.

Moreover, to extract more effective CNN features from our modified ResNet, we fine-tuned our ResNet using the learning set. A CNN feature vector was extracted from the 1024-way FC layer every time an image was input into our ResNet. As a result, the dimensionality of a CNN feature vector was 1024.
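The modification described above corresponds to the following PyTorch sketch; it is a minimal reconstruction under our assumptions (the layer names, the `torchvision` weights API, and returning the 1024-dimensional feature alongside the logits are our choices, not the authors' code).

```python
# Minimal sketch: ResNet-50 with the final 1000-way FC layer replaced by a
# 1024-way FC + ReLU, followed by a num_classes-way FC layer (softmax is
# applied in the loss). The 1024-dim ReLU output is the non-negative CNN
# feature used by the cone-based methods.
import torch.nn as nn
from torchvision import models

class ModifiedResNet50(nn.Module):
    def __init__(self, num_classes):
        super().__init__()
        backbone = models.resnet50(
            weights=models.ResNet50_Weights.IMAGENET1K_V1)
        backbone.fc = nn.Identity()                  # drop the 1000-way FC
        self.backbone = backbone
        self.feature_fc = nn.Sequential(nn.Linear(2048, 1024), nn.ReLU())
        self.classifier = nn.Linear(1024, num_classes)

    def forward(self, x):
        feature = self.feature_fc(self.backbone(x))  # non-negative, 1024-dim
        logits = self.classifier(feature)
        return logits, feature
```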

In our fine-tuned CNN, an input image set was classified based on the average of the class confidence values output by the last FC layer with softmax. In this section, we refer to this method as “softmax”.

4.3.2 CMU MoBo dataset

The CMU MoBo dataset (Gross and Shi, 2001) consists of videos of 25 people walking on a treadmill. Although the original purpose of this dataset was research on human gait analysis (Gross and Shi, 2001), in this experiment we conducted image set based face classification following previous works (Shah et al., 2017; Hayat et al., 2015; Cevikalp and Triggs, 2010; Wang et al., 2008).

The face images were detected from the video frames by the Viola–Jones detection algorithm (Viola and Jones, 2004). The detected face images were resized to 40 × 40 pixels and converted to grayscale. The face images extracted from one video were treated as one image set.

The dataset contains four walking patterns (videos) of each person, except for one person. We used videos of 24 people with all walking patterns. One video randomly sampled from each person was used for training, and the remaining three videos were used for testing. We repeated the evaluation ten times with different random selections.

For MSM and CMSM, the dimensions of class subspaces, input subspaces, and GDS were set to 50, 50, and 1000, respectively. For MCM and CMCM, the numbers of the basis vectors of class and input convex cones were set to 50 and 30, respectively. The dimension of the discriminant space was set to 1000. We determined these dimensionalities by cross-validation using the training data. CNN features were extracted from the fine-tuned ResNet under this experimental setting, according to the same procedure used in the previous experiments.

Method                                 ETH-80         CMU MoBo       YTC
DCC (Kim et al., 2007)                 91.75 ± 3.74   88.89 ± 2.45   51.42 ± 4.95
MMD (Wang et al., 2008)                77.50 ± 5.00   92.50 ± 2.87   54.04 ± 3.69
CHISD (Cevikalp and Triggs, 2010)      79.53 ± 5.32   96.52 ± 1.18   60.42 ± 5.95
MMDML (Lu et al., 2015)                94.5 ± 3.5     97.8 ± 1.0     -
ADNT (Hayat et al., 2015)              98.12 ± 1.69   97.92 ± 0.73   71.35 ± 4.83
PLRC (Feng et al., 2016)               87.72 ± 5.67   93.74 ± 4.3    61.28 ± 6.37
Reconstruct Model (Shah et al., 2017)  94.75 ± 4.32   98.33 ± 1.27   66.45 ± 5.07
softmax                                96.50 ± 2.29   98.61 ± 1.52   64.18 ± 2.20
CNN feature + MSM                      99.50 ± 1.05   99.17 ± 0.97   64.26 ± 2.89
CNN feature + CMSM                     99.50 ± 1.05   99.58 ± 0.67   66.45 ± 2.36
CNN feature + MCM                      99.50 ± 1.05   98.75 ± 1.22   64.11 ± 2.68
CNN feature + CMCM                     99.75 ± 0.79   99.58 ± 0.67   66.74 ± 2.12

Table 1: Experimental results (recognition rate (%) ± standard deviation) for the three public datasets.

4.3.3 YTC dataset

The YTC dataset (Kim et al., 2008) contains 1910 videos of 47 people. Similarly to (Shah et al., 2017), as an image set, we used a set of face images extracted from a video by the incremental learning tracker (Ross et al., 2008). All the extracted face images were scaled to 30 × 30 pixels and converted to grayscale. Three videos per person were randomly selected as training data, and six videos per person were randomly selected as test data. We conducted five-fold cross-validation according to the above procedure.

For MSM and CMSM, the dimensions of class subspaces, input subspaces, and GDS were set to 70, 10, and 824, respectively. For MCM and CMCM, the numbers of the basis vectors of class and input convex cones were set to 50 and 40, respectively. The dimension of the discriminant space was set to 1000. We determined these dimensionalities by cross-validation using the training data. CNN features were extracted from the fine-tuned ResNet under this experimental setting, according to the same procedure used in the previous experiments.

4.3.4 Results and discussion

Table 1 shows the classification results of the proposed methods and various conventional methods, including several deep neural network based methods. First of all, we can see that the subspace-based methods and the proposed MCM/CMCM achieve comparable or better performance than the conventional methods on all the datasets. In particular, it is notable that the proposed methods achieve competitive results against more complex methods using deep learning, such as softmax, MMDML, and ADNT. On ETH-80 and MoBo, they show very high recognition rates compared with these deep learning based methods. The conventional methods do not explicitly consider the structural information of an image set. In contrast, the proposed methods effectively extract detailed structural information through the convex cone representation. This difference in the classification mechanism leads to the advantage of our methods.


Figure 12: ROC curves of subspace and convex cone based methods for the YTC dataset.

CMCM outperformed MCM in all the cases. This indicates that the projection onto the discriminant space captures useful geometrical information that increases the class separability among the class convex cones, as we expected. CMSM also improves the performance of MSM. However, the degree of improvement by CMCM is larger than that by CMSM. This implies that the discriminant space works better with the convex cone representation to enhance the class separability among the class cones.

The results on ETH-80 and MoBo clearly show the effectiveness of both the cone and subspace based methods against the conventional methods. However, it may be difficult to argue the advantage of CMCM over CMSM on these datasets, since both achieved almost 100% recognition rates with near-zero equal error rates (EERs). These databases seem to be relatively easy for both types of methods to classify.

On the other hand, YTC is difficult for all the methods, so an apparent difference between the recognition rates of the two types can be observed. To visually confirm this advantage, we calculated the receiver operating characteristic (ROC) curves of the four subspace and cone based methods, as shown in Fig. 12. The ROC curves clearly indicate the strength of CMCM over CMSM. This superiority is also supported by the average area under the curve (AUC): 0.9002 for CMSM and 0.9341 for CMCM.

4.4 Robustness against limited training data

A major issue with deep neural networks is that a large number of training samples is required to train the networks accurately. Therefore, robustness against the small sample size (SSS) problem is a necessary characteristic for methods using CNN features in practice. In this experiment, we evaluated the robustness of the different methods against the SSS problem using our private multi-view hand shape dataset (Ohkawa and Fukui, 2012).

4.4.1 Experimental protocol

Figure 13: Sample images of the multi-view hand shape dataset used in the experiments. Each row shows a hand shape from various viewpoints.

The multi-view hand shape dataset consists of 30 classes of hand shapes. Each class was collected from 100 subjects at a speed of 1 fps for 4 s using a multi-camera system equipped with seven synchronized cameras placed at intervals of 10 degrees. During data collection, the subjects were asked to rotate their hands at a constant speed to increase the number of viewpoints. Figure 13 shows several sample images from the dataset. The total number of images collected was 84000 (= 30 classes × 4 frames × 7 cameras × 100 subjects).

We randomly divided the subjects into two sets: one set was used for training, and the other for testing. We evaluated the performance of the methods by setting the number of subjects used for training to 1, 2, 3, 4, 5, 10, and 15. In each case, the total number of training images was 30 classes × 7 cameras × 4 frames × the number of training subjects. We set the number of subjects used for testing to 50. As an input image set, we used the 28 (= 7 cameras × 4 frames) images of a subject. Thus, the total number of convex cones for testing was 1500 (= 30 classes × 50 subjects).

To extract CNN features from the images, we fine-tuned the ResNet using the training images under each experimental condition.

Training subjects  softmax  MSM    CMSM   MCM    CMCM
1                  36.07    62.27  65.87  63.07  67.87
2                  71.41    73.47  74.73  74.60  75.33
3                  83.87    85.27  87.40  85.67  87.47
4                  86.60    87.60  91.00  88.27  91.33
5                  91.60    91.13  92.87  92.07  93.53
10                 95.73    95.27  95.73  95.40  96.27
15                 96.53    96.20  96.27  96.67  97.00

Table 2: Change in the accuracy (%) against the number of training subjects.

4.4.2 Results and discussion

Table 2 shows the accuracy versus the number of training subjects. From the table, we can see that the overall performance of CMCM was better than that of the other methods. In particular, CMCM works well when the number of training subjects is small. For example, when only one training subject is available, CMSM and CMCM achieve an error rate of about half that of softmax. Moreover, CMCM outperforms the subspace based methods, MSM and CMSM. This further indicates that the convex cone based method can represent the distribution of a set of CNN features more accurately than the subspace based methods.

5 Conclusion

In this paper, we proposed a method based on the convex cone model for image-set classification, referred to as the constrained mutual convex cone method (CMCM). We discussed the combination of the proposed method with CNN features, although our method can be applied to various types of features with a non-negativity constraint.

The main contributions of this paper are 1) the introduction of a convex cone model to represent a set of feature vectors compactly and accurately; 2) the definition of the geometrical similarity of two convex cones based on the angles between them, which are obtained by the alternating least squares method; 3) the proposal of a method, i.e., MCM, for classifying convex cones using the angles as the similarity index; 4) the introduction of a discriminant space that maximizes between-class variance (gaps) between convex cones and minimizes within-class variance; and 5) the proposal of the constrained MCM (CMCM), which incorporates the above projection into the MCM.

We verified the effectiveness of the multiple angles and the discriminant space, which are the essence of the proposed framework, through two experiments. Then, we evaluated the classification performance of the proposed methods by comparing them with various types of conventional methods. The proposed methods achieved competitive results regardless of whether the number of training samples was large or small.

Acknowledgements.
Part of this work was supported by JSPS KAKENHI Grant Number JP16H02842.

References

  • Afriat (1957) Afriat SN (1957) Orthogonal and oblique projectors and the characteristics of pairs of vector spaces. In: Mathematical Proceedings of the Cambridge Philosophical Society, vol 53, pp 800–816
  • Azizpour et al. (2016) Azizpour H, Razavian AS, Sullivan J, Maki A, Carlsson S (2016) Factors of transferability for a generic ConvNet representation. IEEE Transactions on Pattern Analysis and Machine Intelligence 38(9):1790–1802
  • Belhumeur and Kriegman (1998) Belhumeur PN, Kriegman DJ (1998) What is the set of images of an object under all possible illumination conditions? International Journal of Computer Vision 28(3):245–260
  • Bro and De Jong (1997) Bro R, De Jong S (1997) A fast non-negativity-constrained least squares algorithm. Journal of Chemometrics 11(5):393–401
  • Cevikalp and Triggs (2010) Cevikalp H, Triggs B (2010) Face recognition based on image sets. In: Computer Vision and Pattern Recognition, IEEE, pp 2567–2573

  • Chen et al. (2016) Chen JC, Patel VM, Chellappa R (2016) Unconstrained face verification using deep CNN features. In: 2016 IEEE Winter Conference on Applications of Computer Vision, pp 1–9
  • Feng et al. (2016) Feng Q, Zhou Y, Lan R (2016) Pairwise linear regression classification for image set retrieval. In: Computer Vision and Pattern Recognition, pp 4865–4872

  • Fisher (1936) Fisher RA (1936) The use of multiple measurements in taxonomic problems. Annals of Human Genetics 7(2):179–188
  • Fukui and Maki (2015) Fukui K, Maki A (2015) Difference subspace and its generalization for subspace-based methods. IEEE Transactions on Pattern Analysis and Machine Intelligence 37(11):2164–2177
  • Fukui and Yamaguchi (2005) Fukui K, Yamaguchi O (2005) Face recognition using multi-viewpoint patterns for robot vision. In: The Eleventh International Symposium of Robotics Research, pp 192–201
  • Fukui and Yamaguchi (2007) Fukui K, Yamaguchi O (2007) The kernel orthogonal mutual subspace method and its application to 3D object recognition. In: Asian Conference on Computer Vision, pp 467–476
  • Fukui et al. (2006) Fukui K, Stenger B, Yamaguchi O (2006) A framework for 3D object recognition using the kernel constrained mutual subspace method. In: Asian Conference on Computer Vision, pp 315–324
  • Georghiades et al. (2001) Georghiades AS, Belhumeur PN, Kriegman DJ (2001) From few to many: illumination cone models for face recognition under variable lighting and pose. IEEE Transactions on Pattern Analysis and Machine Intelligence 23(6):643–660
  • Gross and Shi (2001) Gross R, Shi J (2001) The CMU motion of body (MoBo) database. Tech. Rep. CMU-RI-TR-01-18, Carnegie Mellon University, Pittsburgh, PA
  • Gross et al. (2010) Gross R, Matthews I, Cohn J, Kanade T, Baker S (2010) Multi-PIE. Image and Vision Computing 28(5):807–813
  • Guanbin Li and Yu (2015) Li G, Yu Y (2015) Visual saliency based on multiscale deep features. In: 2015 IEEE Conference on Computer Vision and Pattern Recognition, pp 5455–5463

  • Hayat et al. (2015) Hayat M, Bennamoun M, An S (2015) Deep reconstruction models for image set classification. IEEE Transactions on Pattern Analysis and Machine Intelligence 37(4):713–727
  • He et al. (2016) He K, Zhang X, Ren S, Sun J (2016) Deep residual learning for image recognition. In: The IEEE Conference on Computer Vision and Pattern Recognition
  • Hotelling (1936) Hotelling H (1936) Relations between two sets of variates. Biometrika 28(3/4):321–377
  • Kim and Park (2008) Kim H, Park H (2008) Nonnegative matrix factorization based on alternating nonnegativity constrained least squares and active set method. SIAM Journal on Matrix Analysis and Applications 30(2):713–730
  • Kim et al. (2008) Kim M, Kumar S, Pavlovic V, Rowley H (2008) Face tracking and recognition with visual constraints in real-world videos. In: Computer Vision and Pattern Recognition, IEEE, pp 1–8
  • Kim et al. (2007) Kim TK, Kittler J, Cipolla R (2007) Discriminative learning and recognition of image set classes using canonical correlations. IEEE Transactions on Pattern Analysis and Machine Intelligence 29(6):1005–1018
  • Kobayashi and Otsu (2008) Kobayashi T, Otsu N (2008) Cone-restricted subspace methods. In: International Conference on Pattern Recognition, pp 1–4
  • Kobayashi et al. (2010) Kobayashi T, Yoshikawa F, Otsu N (2010) Cone-restricted kernel subspace methods. In: IEEE International Conference on Image Processing, pp 3853–3856
  • Lee and Seung (1999) Lee DD, Seung HS (1999) Learning the parts of objects by non-negative matrix factorization. Nature 401(6755):788
  • Lee et al. (2005) Lee KC, Ho J, Kriegman DJ (2005) Acquiring linear subspaces for face recognition under variable lighting. IEEE Transactions on Pattern Analysis and Machine Intelligence 27(5):684–698
  • Leibe and Schiele (2003) Leibe B, Schiele B (2003) Analyzing appearance and contour based methods for object categorization. In: IEEE Conference on Computer Vision and Pattern Recognition, vol 2, pp 409–415
  • Lu et al. (2015) Lu J, Wang G, Deng W, Moulin P, Zhou J (2015) Multi-manifold deep metric learning for image set classification. In: Computer Vision and Pattern Recognition, pp 1137–1145
  • Lu et al. (2017) Lu J, Wang G, Zhou J (2017) Simultaneous feature and dictionary learning for image set based face recognition. IEEE Transactions on Image Processing 26(8):4042–4054
  • Nair and Hinton (2010) Nair V, Hinton GE (2010) Rectified linear units improve restricted Boltzmann machines. In: Proceedings of the 27th International Conference on Machine Learning, pp 807–814

  • Ohkawa and Fukui (2012) Ohkawa Y, Fukui K (2012) Hand-shape recognition using the distributions of multi-viewpoint image sets. IEICE Transactions on Information and Systems 95(6):1619–1627
  • Otsu (1979) Otsu N (1979) A threshold selection method from gray-level histograms. IEEE Transactions on Systems, Man, and Cybernetics 9(1):62–66, DOI 10.1109/TSMC.1979.4310076
  • Ross et al. (2008) Ross DA, Lim J, Lin RS, Yang MH (2008) Incremental learning for robust visual tracking. International Journal of Computer Vision 77(1-3):125–141
  • Russakovsky et al. (2015) Russakovsky O, Deng J, Su H, Krause J, Satheesh S, Ma S, Huang Z, Karpathy A, Khosla A, Bernstein M, et al. (2015) Imagenet large scale visual recognition challenge. International Journal of Computer Vision 115(3):211–252
  • Sakano and Mukawa (2000) Sakano H, Mukawa N (2000) Kernel mutual subspace method for robust facial image recognition. In: International Conference on Knowledge-Based Intelligent Engineering Systems and Allied Technologies, vol 1, pp 245–248
  • Shah et al. (2017) Shah SAA, Nadeem U, Bennamoun M, Sohel FA, Togneri R (2017) Efficient image set classification using linear regression based image reconstruction. In: Computer Vision and Pattern Recognition Workshops, pp 601–610
  • Sharif Razavian et al. (2014) Sharif Razavian A, Azizpour H, Sullivan J, Carlsson S (2014) CNN features off-the-shelf: an astounding baseline for recognition. In: IEEE Conference on Computer Vision and Pattern Recognition workshops, pp 806–813
  • Sogi et al. (2018) Sogi N, Nakayama T, Fukui K (2018) A method based on convex cone model for image-set classification with CNN features. In: International Joint Conference on Neural Networks (IJCNN), pp 1–8
  • Tenenhaus (1988) Tenenhaus M (1988) Canonical analysis of two convex polyhedral cones and applications. Psychometrika 53(4):503–524
  • Vía et al. (2005) Vía J, Santamaría I, Pérez J (2005) Canonical correlation analysis (CCA) algorithms for multiple data sets: Application to blind SIMO equalization. In: 13th European Signal Processing Conference, pp 1–4
  • Vía et al. (2007) Vía J, Santamaría I, Pérez J (2007) A learning algorithm for adaptive canonical correlation analysis of several data sets. Neural Networks 20(1):139–152
  • Viola and Jones (2004) Viola P, Jones MJ (2004) Robust real-time face detection. International Journal of Computer Vision 57(2):137–154
  • Wang et al. (2008) Wang R, Shan S, Chen X, Gao W (2008) Manifold-manifold distance with application to face recognition based on image set. In: Computer Vision and Pattern Recognition, IEEE, pp 1–8
  • Wang et al. (2017) Wang Z, Zhu R, Fukui K, Xue JH (2017) Matched shrunken cone detector (MSCD): Bayesian derivations and case studies for hyperspectral target detection. IEEE Transactions on Image Processing 26(11):5447–5461
  • Wang et al. (2018) Wang Z, Zhu R, Fukui K, Xue JH (2018) Cone-based joint sparse modelling for hyperspectral image classification. Signal Processing 144:417–429
  • Yamaguchi et al. (1998) Yamaguchi O, Fukui K, Maeda K (1998) Face recognition using temporal image sequence. In: Proceedings Third IEEE International Conference on Automatic Face and Gesture Recognition, pp 318–323