MReT2019
The code and database for MReT2019
Exterior contour and interior structure are both vital features for classifying objects. However, most existing methods consider exterior contour features and interior structure features separately, and thus fail when classifying patchy image structures that have similar contours and flexible structures. To address these limitations, this paper proposes a novel Multi-Orientation Region Transform (MORT), which characterizes contour and structure features simultaneously, for patchy image structure classification. MORT is performed over multiple orientation regions at multiple scales to integrate patchy features, and thus enables a better description of the shape in a coarse-to-fine manner. Moreover, the proposed MORT can be combined with deep convolutional neural network techniques to further enhance classification accuracy. Very encouraging experimental results are obtained on the challenging ultra-fine-grained cultivar recognition task, the insect wing recognition task, and the large-variation butterfly recognition task, which demonstrate the effectiveness and superiority of the proposed MORT over state-of-the-art methods in classifying patchy image structures. Our code and three patchy image structure datasets are available at: https://github.com/XiaohanYu-GU/MReT2019.
Patchy image structure classification is a fundamental research topic in the computer vision and artificial intelligence (AI) research communities. Recent years have witnessed great progress in patchy image structure classification in applications closely related to AI agriculture and smart farming, such as plant branch structure classification [8], leaf vein classification [9], and insect wing vein classification [41]. Patchy image structures characterize topological information of both exterior contours and interior structures of a target shape, which are important for shape modeling and classification. An example illustrating patchy image structures is shown in Fig. 1. Unfortunately, classifying such patchy image structures remains an open problem, because target shapes from different categories may have highly similar contours and flexible interior structures.
In the past decades, many researchers have devoted their efforts to addressing this challenging problem. Existing approaches can be broadly classified into two categories: (1) handcrafted feature based methods and (2) deep convolutional neural network (ConvNets) techniques. Most approaches in the first category focus on developing effective feature representations of the shape contour [11, 20, 7, 28, 25, 4]. Although these methods achieve promising performance in general shape classification tasks, they may fail when the contours of objects from different classes are highly similar, as with the patchy image structures shown in Fig. 1.
Alternatively, one may consider applying ConvNets techniques, which are currently driving advances in image classification tasks, to patchy image structure classification. The most powerful and practical ConvNets methods are the so-called backbone networks, such as Alexnet [52], VGG-16 [21], and ResNet-50 [34], which have proven very effective in large-scale image classification tasks. More recently, approaches focusing on fine-grained classification [fu2017look, Zheng_2019_CVPR, Yang2018Learning, chen2019destruction] may also contribute to the challenging patchy image structure classification. However, a major limitation is that deep learning techniques rely heavily on the number of training samples; that is, they may fail to perform well when only limited training samples are provided [37].
In this paper, we address the patchy image structure classification problem by proposing a novel multi-orientation region transform method that can simultaneously characterize the exterior contour, the flexible interior structure, and the inter-relationship between them. The contributions of our work are summarized as follows: (1) A Multi-Orientation Region Transform (MORT) is proposed to provide a unified discriminative description of the flexible interior connection structure, the shape of the exterior contour, and their inter-relationship in the patchy distance map of a patchy image structure with arbitrary orientation. (2) The proposed MORT is theoretically and experimentally shown to be rotation and translation invariant, which is important for effective and practical shape description and matching. (3) The proposed MORT can be combined with ConvNets based features for further performance enhancement. (4) The encouraging experimental results demonstrate the effectiveness and efficiency of the proposed MORT over state-of-the-art benchmarks on patchy image structure classification.
Existing methods for patchy image classification can be coarsely categorized into handcrafted feature based methods and deep convolutional neural network (ConvNets) techniques. A common strategy of handcrafted feature based methods is to build contour-based feature representations, which have yielded desirable performance for shape classification on various publicly available shape databases [7, 11, 25, 28]. Another line of work focuses on deformation-based shape analysis and classification. Recent progress on deformation-based shape analysis [20] enables simple physical interpretations of the resulting deformations. However, the proposed similarity metric is based on differential quantities, which are highly sensitive to noise and local perturbations. To overcome this sensitivity, a deformation based curved shape representation (DBCSR) [4] is proposed based on the relative transformation matrices between neighboring points, such that shapes are represented as elements of a matrix Lie group. However, these methods may fail when classifying shapes with similar contours at ultra-fine granularity (e.g., leaf images from the same species but different cultivars).
Recently, ConvNets methods have been successfully applied in various visual tasks, thus may also contribute to the challenging patchy image structure classification. Most existing ConvNets methods are designed based on the so-called backbone networks, such as Alexnet [52], VGG-16 [21], and ResNet-50 [34]. Given the effectiveness and practicality of these backbone networks in image classification tasks, they are naturally regarded as the first options for exploring new classification tasks [wu2019ip102].
Another promising alternative is to apply fine-grained classification techniques, which are particularly capable of classifying similar shapes. The authors of [lin2017improved] introduced statistics normalization methods to improve an architecture that captures second-order statistics of convolutional features in a translationally invariant manner. The authors of [li2018towards] proposed to utilize second-order information to improve fine-grained classification performance. Some researchers propose to first locate the discriminative regions and then classify based on these regions [huang2016part]. Such methods, however, require additional bounding box annotations on targeted regions. Another technique localizes discriminative regions with an attention mechanism in an unsupervised manner, without requiring extra annotations. Unfortunately, these methods [zheng2017learning, fu2017look, Zheng_2019_CVPR] require additional network structures (e.g., attention mechanisms), leading to extra computation overhead in both the training and inference stages. To that end, the authors of [Yang2018Learning] proposed a weakly supervised method that does not require annotations of parts or key areas. The authors of [chen2019destruction] introduced a very effective Destruction and Construction Learning (DCL) method for fine-grained classification. In their work, the input images are partitioned into local regions and then shuffled by a region confusion mechanism, which leads the classification network to focus on discriminative regions for spotting subtle differences. Given a large amount of training data, ConvNets based methods can yield desirable performance on classification tasks. With limited training data, however, ConvNets based methods may dramatically overfit [37].
The aforementioned limitations motivate us to develop an effective and efficient method that does not require large training sets for the challenging patchy image structure classification. Our work shares the assumption of the state-of-the-art handcrafted feature based methods [3, 11, 7, 28, 25, 20, 4], namely that the contours of shapes are provided.
In this section, a novel Multi-Orientation Region Transform (MORT) method is proposed to encode not only global shape features but also local structures within a shape for finer-level structure pattern analysis. To better capture the interior structure, the shape of the exterior contour, and their inter-relationship, a patchy distance map (PDM) is first developed. The proposed MORT (embedded with the PDM) and its discrete form are then described to construct the final feature descriptor. Finally, we present the rotation, translation and scale invariance analysis of the proposed feature descriptor, as well as the similarity measure using the transform coefficient matrices.
In set theory, when all sets under consideration are subsets of a given universal set, the absolute complement of a set is defined as the set of elements that are in the universal set but not in that set. Based on these notions, a patchy shape is defined to be composed of two subsets: an interior set and its absolute complement. Specifically, given a shape with patchy structure, the patches are categorized into two types: (1) interior patches, defined as closed sub-regions inside the shape when the contour is filtered out (e.g., closed pattern patches in a butterfly image); and (2) complementary patches, defined as the absolute complement of the interior patches.
In order to integrate both geometrical and topological features of the patchy shape, we propose a patchy distance map (PDM) to enable comprehensive skeleton-based feature description, in which each patch is encoded with distance transform. Specifically, we apply Euclidean Distance Transform [12] to each patch, such that each pixel is assigned with a value computed by the distance between the pixel and the nearest patch boundary. By normalizing the distance map in each patch, both small patch and large patch can provide equally detailed description based on the PDM. An example of visualizing PDM is given in Fig. 2. For the convenience of description, we denote as the interior patchy distance map function, and as the complementary patch distance map function.
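The per-patch distance encoding described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the patches are already given as an integer label image (0 for boundary pixels, a positive id per patch), uses SciPy's `distance_transform_edt`, and approximates "distance to the nearest patch boundary" by the distance to the nearest pixel outside the patch. The function name `patchy_distance_map` is hypothetical.

```python
import numpy as np
from scipy.ndimage import distance_transform_edt


def patchy_distance_map(labels):
    """Build a normalized distance map, one patch at a time.

    labels: 2D integer array; 0 marks boundary pixels (contour or veins),
            each positive value marks the pixels of one patch.
    Returns a float array with values in [0, 1].
    """
    labels = np.asarray(labels)
    pdm = np.zeros(labels.shape, dtype=float)
    for patch_id in np.unique(labels):
        if patch_id == 0:
            continue  # boundary pixels keep the value 0
        mask = labels == patch_id
        # Distance from each patch pixel to the nearest pixel outside the patch.
        dist = distance_transform_edt(mask)
        peak = dist.max()
        if peak > 0:
            # Per-patch normalization: small and large patches share one range.
            pdm[mask] = dist[mask] / peak
    return pdm


if __name__ == "__main__":
    demo = np.zeros((7, 9), dtype=int)
    demo[1:6, 1:6] = 1   # a large 5x5 patch
    demo[2:5, 7:8] = 2   # a thin 3x1 patch
    print(np.round(patchy_distance_map(demo), 2))
```

Because each patch is divided by its own maximum, the thin patch reaches the same peak value as the large one, which is the property the normalization step is meant to provide.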
Given a contour point ( ranges from 0 to 1), the arc length , from to its end point on the contour along clock-wise direction, is defined as , where is the perimeter of contour (or the number of contour points when the contour is sampled in discrete form), and is the scale index ( where ). Denote and by coordinates and , a region integral on at scale is defined as:
(1) | ||||
where is the vertical distance from the origin to the boundary line defined by: . is the vertical distance from the origin to the base line defined by: . is the angle between the base line and y-axis. and is the 2D Dirac delta function defined as
(2) |
Here, in Eq. (1) ensures that only the points on the intersection of base line and boundary line are counted for integral calculation. Given a point indexed by under scale , variables , and together determine the integral region, denoted as , in Eq. (1). As moves along the contour, the region integral is performed over different defined by the scanning range and orientation of the boundary line at all possible scales (see Fig. 3).
The MORT on a given patchy distance map function at point is defined as
(3) |
The MORT at a single point
generates a vector of
coefficients with each describing the region integral for one scale at point . An example of the proposed MORT on point is illustrated in Fig. 3. By moving along the contour for a complete loop, which results in a varying , the above vector grows into a MORT coefficient matrix of by dimensions:(4) |
This matrix describes how the region integrals at different scales vary synchronously as the contour point moves. At the coarsest scale, the contour point and its end point become the same point after a complete loop, so the region integral is performed over the whole region of the given image and all elements in the first row of the matrix share the same value. At the subsequent scales, the contour point and its end point cut a half, a quarter, and so on, off the contour to steer the region integral at different orientations (see Fig. 3(b)&(c)). The greater the scale index is, the smaller the scanned integral region, which provides finer descriptions of local details of the target (see Fig. 3(d)-(i)). For example, the 2nd, 3rd, and 4th rows of the matrix are visualized in Fig. 3(d), Fig. 3(e), and Fig. 3(f) when the point moves to the position shown in Fig. 3(b).
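The pairing of each contour point with its scale-dependent end point can be sketched on a sampled contour. This is an illustrative sketch under stated assumptions: the arc length at scale index k is taken to be the perimeter divided by 2^k (consistent with the half and quarter cuts described above), contour points are assumed to be stored in clockwise order, and the integer rounding and the name `chord_end_indices` are hypothetical choices rather than part of the original method.

```python
import numpy as np


def chord_end_indices(num_points, num_scales):
    """Index of the end point paired with every contour point at every scale.

    Returns an integer array E of shape (num_scales, num_points), where
    E[k, i] is the contour index reached from point i after walking an arc
    of num_points // 2**k samples along the stored contour order.
    """
    start = np.arange(num_points)
    rows = []
    for k in range(num_scales):
        step = num_points // (2 ** k)  # arc length in samples: N, N/2, N/4, ...
        rows.append((start + step) % num_points)
    return np.stack(rows)


if __name__ == "__main__":
    print(chord_end_indices(num_points=8, num_scales=3))
```

At scale index 0 every point is paired with itself (a full loop), which matches the observation that the first row of the coefficient matrix is constant.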
Let be the subset of pixels enclosed in the outer contour of the object shape, where is the total number of pixels enclosed in the contour. The procedure of computing the discrete MORT coefficient matrix is presented in Algorithm 1, in which Eq. (1) becomes the sum of patchy distance map (PDM) values inside the moving integral region .
As Steps 2, 3 and 4 take time , and respectively, the algorithm 1 has a computational complexity of .
By extracting features from various orientations and scales, the image region can be described in a coarse-to-fine manner to provide comprehensive description of the shape. More importantly, this allows encoding the inter-relationship of contour and interior context of a shape.
When the initial location of the contour point that steers the Multi-Orientation Region Transform (i.e., the red point in Fig. 3) moves clockwise, the columns of the MORT coefficient matrix shift to the right. The magnitudes of its 1D Fourier transform coefficients are calculated by
(5) |
where and . To make the generated feature descriptor robust to noise and compact, the lowest M order coefficients are used to describe the target, where . And are used to construct the final feature descriptor as
(6) |
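The invariance to the starting point can be checked numerically. The sketch below assumes a coefficient matrix whose rows correspond to scales and whose columns follow the contour points; it takes the magnitude of the 1D FFT along each row and keeps the lowest-order terms. The function name, the absence of any normalization, and the random test matrix are assumptions made for illustration, and the exact form of Eq. (5) and Eq. (6) may differ.

```python
import numpy as np


def fourier_magnitude_descriptor(coeff_matrix, num_low_orders):
    """Keep the magnitudes of the lowest-order Fourier coefficients per row.

    coeff_matrix: array of shape (num_scales, num_contour_points).
    num_low_orders: number of low-order coefficients kept for each row.
    """
    coeff_matrix = np.asarray(coeff_matrix, dtype=float)
    spectrum = np.fft.fft(coeff_matrix, axis=1)  # 1D FFT along the contour
    return np.abs(spectrum[:, :num_low_orders])


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    matrix = rng.random((4, 64))
    # A different starting point circularly shifts the columns.
    shifted = np.roll(matrix, shift=17, axis=1)
    same = np.allclose(
        fourier_magnitude_descriptor(matrix, 10),
        fourier_magnitude_descriptor(shifted, 10),
    )
    print("descriptor unchanged by circular shift:", same)
```

A circular shift only multiplies each Fourier coefficient by a unit-modulus phase factor, so the magnitudes, and hence the descriptor, stay the same.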
In this section, we present the rotation, translation and scale invariance analysis of the feature descriptor, which is important for effective and practical shape matching [6].
Lemma 1: The feature descriptor is rotation invariant. According to Eq. (5) and Eq. (6), every element of the descriptor is invariant to the initial location of the contour point that steers the MORT, because a change of starting point only shifts the columns of the coefficient matrix circularly and the Fourier magnitudes are unchanged by such a shift. The descriptor is therefore invariant to rotation of the whole shape.
Lemma 2: A translation of by a vector (i.e. do not change any element in : . This shows that every element in is translation invariant, and thus is translation invariant.
Lemma 3: A scaling of by a factor (i.e. changes each element in by a factor : . In summary, the feature descriptor is invariant to rotation and translation, and becomes scaling invariant if it is normalized by the area of the shape.
Given two sets of matrices (refer to Eq. (6)) and representing the MORT coefficient matrices of the test and model patchy shapes respectively. Each matrix set is comprised of matrix pairs of and and is the number of types in each class. Note that in can only be matched against the matrix of the same type in , that is with the same type index . The dissimilarity between two given samples can be measured using the fast Minkowski distance of the two matrices:
(7) | ||||
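The type-wise matching rule described above can be sketched as a Minkowski distance accumulated over matrices of the same type. This is a hedged illustration: the order `p` (defaulting to 1), the function name `mort_dissimilarity`, and the representation of each sample as a list of equally ordered matrices are assumptions, since the exact form of Eq. (7) is not reproduced here.

```python
import numpy as np


def mort_dissimilarity(test_set, model_set, p=1):
    """Minkowski distance of order p between two sets of descriptor matrices.

    test_set, model_set: sequences of arrays; the matrix at position t in one
    set is only compared with the matrix at position t in the other set.
    """
    if len(test_set) != len(model_set):
        raise ValueError("both sets must contain the same matrix types")
    total = 0.0
    for a, b in zip(test_set, model_set):
        a = np.asarray(a, dtype=float)
        b = np.asarray(b, dtype=float)
        if a.shape != b.shape:
            raise ValueError("matrices of the same type must share a shape")
        total += np.sum(np.abs(a - b) ** p)
    return float(total ** (1.0 / p))


if __name__ == "__main__":
    interior_a, complementary_a = np.ones((2, 3)), np.zeros((2, 3))
    interior_b, complementary_b = np.zeros((2, 3)), np.zeros((2, 3))
    print(mort_dissimilarity([interior_a, complementary_a],
                             [interior_b, complementary_b]))
```

Keeping the comparison type-wise means an interior-patch matrix is never matched against a complementary-patch matrix, as required in the text.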
It is worth noting that the proposed method (if needed) can be used as a feature extractor together with a classifier, such as the classic support vector machine (SVM), for classification tasks [17].
In this section, we first introduce three publicly available patchy image structure datasets for classification evaluation. We then present the experimental results, including both classification accuracy and computational time, of the proposed MORT together with state-of-the-art methods. Finally, we present a fused MORT that integrates MORT with ConvNets methods via feature-level fusion, and report evaluation results on the three patchy image structure datasets.
The SoyCultivarVein dataset is a publicly available dataset that comprises 100 categories (cultivars) with 6 samples (leaf images) in each cultivar, for a total of 100 × 6 = 600 images [Yu_IEEEAccess2019]. The leaves in the SoyCultivarVein dataset are highly similar because they all belong to the same species, making it a new and challenging dataset for the artificial intelligence and pattern analysis research community.
The butterfly patchy image structure dataset (BtfPIS) is constructed by applying Canny edge detection [33] to the binarized images of the first 50 images in each class of the publicly available Leeds butterfly dataset [32]. There are 50 × 10 = 500 images in the BtfPIS dataset. The BtfPIS dataset comprises large intra-class variations, including rotation and scaling variations, in each class.
The insect wing patchy image structure dataset (IwPIS) [38] is also adopted for evaluation. It comprises 25 classes of insect wings with 2 samples in each class. Although small, IwPIS is very challenging for classification tasks because all the samples are from the same order, Diptera. In other words, samples in IwPIS have very small inter-class variances, which makes them difficult to classify.
Table 1: Results on the SoyCultivarVein dataset.

Algorithm | Accuracy (%) | Time (ms) |
---|---|---|
SC | 37.58 | |
SC+DP | 36.90 | |
IDSC | 49.07 | |
IDSC+DP | 46.93 | |
SRV | 37.09 | |
DBCSR (opt) | 35.72 | |
DBCSR (uni) | 34.89 | |
HSC | 43.98 | |
MDM | 39.39 | |
HF | 40.46 | |
Proposed MORT | 53.43 |
We compare the proposed MORT with ten state-of-the-art shape-based benchmarks. They are: (1) two versions of Shape Contexts, i.e., standard Shape Contexts (SC) and Shape Contexts with dynamic programming (SC-DP) [3], (2) Inner Distance Shape Contexts (IDSC) and Inner Distance Shape Contexts with dynamic programming (IDSC-DP) [11], (3) square-root velocity (SRV) method [20], (4) deformation-based curved shape representation (DBCSR) with uniform sampling and optimal sampling [4], (5) Hierarchical String Cuts (HSC) [25], (6) Multiscale Distance Matrix (MDM) [7], and (7) Height Functions (HF) [28]. The widely used Nearest Neighbor score (1NN) is employed for performance measurement, as used in the benchmark methods.
Note that, the MORT does not restrain how the interior and complementary patches shall be defined. In our experiments, for SoyCultivarVein dataset, the interior patches are defined as the patches enclosed only by vein points, while the complementary patches are those enclosed by a mixture of vein and contour points (see Figs. 4(a-c)). For BtfPIS dataset and IwPIS dataset, the interior patches are patches with bright pixels (i.e., their intensities are above or equal to the binarization threshold), while the complementary patches are those formed by dark pixels (i.e., their intensities are below the threshold), as illustrated in Figs. 4(d-f) and Figs. 4(g-i), respectively.
Table 1 lists the average classification accuracies of MORT and the state-of-the-art shape-based methods. We repeat the classification evaluation 1000 times, each time randomly selecting three samples from each cultivar to construct the model set and using the remaining samples as the testing set, and report the average results. Using the 1NN evaluation protocol, the proposed MORT achieves the highest average classification accuracy of 54.20% (10.22% higher than HSC and 5.13% higher than IDSC), demonstrating its superiority in the ultra-fine-grained cultivar classification task. Table 1 also lists the computational cost (the average time for each matching) of all the competing methods. The matching speed of MORT ranks second among all methods, demonstrating the efficiency of the MORT method.
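The repeated random-split 1NN protocol described above can be sketched as follows. This is a minimal illustration that assumes a precomputed pairwise dissimilarity matrix and a fixed number of model samples drawn from every class in each repetition; the function name `repeated_1nn_accuracy`, its parameters, and the toy data are hypothetical.

```python
import numpy as np


def repeated_1nn_accuracy(dist, labels, n_model_per_class, n_repeats=1000, seed=0):
    """Average 1NN accuracy over repeated random model/test splits.

    dist: (n, n) array of pairwise dissimilarities between all samples.
    labels: length-n array of class labels.
    n_model_per_class: samples drawn from each class to form the model set.
    """
    dist = np.asarray(dist, dtype=float)
    labels = np.asarray(labels)
    rng = np.random.default_rng(seed)
    classes = np.unique(labels)
    accuracies = []
    for _ in range(n_repeats):
        model_idx = np.concatenate([
            rng.choice(np.flatnonzero(labels == c), size=n_model_per_class,
                       replace=False)
            for c in classes
        ])
        is_test = np.ones(len(labels), dtype=bool)
        is_test[model_idx] = False
        test_idx = np.flatnonzero(is_test)
        # Each test sample takes the label of its closest model sample.
        nearest = model_idx[np.argmin(dist[np.ix_(test_idx, model_idx)], axis=1)]
        accuracies.append(np.mean(labels[nearest] == labels[test_idx]))
    return float(np.mean(accuracies))


if __name__ == "__main__":
    feats = np.array([0.0, 0.1, 0.2, 0.3, 10.0, 10.1, 10.2, 10.3])
    labs = np.array([0, 0, 0, 0, 1, 1, 1, 1])
    pairwise = np.abs(feats[:, None] - feats[None, :])
    print(repeated_1nn_accuracy(pairwise, labs, n_model_per_class=2, n_repeats=100))
```

Computing the dissimilarity matrix once and re-sampling only the index sets keeps the 1000 repetitions cheap, since no descriptor has to be recomputed between splits.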
To verify the rotation invariance of the proposed method, we construct a rotated SoyCultivarVein dataset by rotating each leaf image in the SoyCultivarVein dataset by a random angle (from to ). We compare the proposed method with two rotation invariant shape classification methods, HSC and MDM, on the rotated SoyCultivarVein dataset. The experimental results (see Table 2, compared with Table 1) confirm the theoretical analysis of the rotation invariance of MORT.
Table 2: Results on the rotated SoyCultivarVein dataset.

Algorithm | Accuracy (%) |
---|---|
HSC | 45.11 |
MDM | 39.30 |
Proposed MORT | 54.81 |

Table 3: Results on the BtfPIS dataset.

Algorithm | Accuracy (%) | Time (ms) |
---|---|---|
SC | 42.57 | |
SC+DP | 41.93 | |
IDSC | 57.45 | |
IDSC+DP | 54.32 | |
SRV | 60.49 | |
DBCSR (opt) | 52.95 | |
DBCSR (uni) | 58.53 | |
HSC | 60.78 | |
MDM | 54.84 | |
HF | 20.64 | |
Proposed MORT | 75.02 |
Table 3 lists the average classification accuracies of MORT and the state-of-the-art shape-based methods. We repeat the classification evaluation 1000 times, each time randomly selecting 25 samples to construct the test set (the remaining 25 samples are used as the model set), and report the average results. Using the 1NN evaluation protocol, MORT achieves the highest classification accuracy of 75.02%, which is significantly higher than the state-of-the-art benchmarks (14.24% higher than HSC and 17.57% higher than IDSC). The superior performance of the proposed method over the state-of-the-art benchmarks demonstrates the robustness of MORT in classifying images with large intra-class variances. The matching speed of MORT on the BtfPIS dataset also ranks second among all methods.
Table 4 shows the average classification accuracies and average matching time of MORT and the state-of-the-art methods on the challenging IwPIS dataset. Using the same repeating strategy and 1NN evaluation (half of the images as the model set and the remaining as the testing set) as in the previous experiments, MORT achieves the highest classification accuracy of 44.79%, which is higher than all ten benchmarks (4.01% higher than SC and 13.87% higher than IDSC). The average matching time (for each matching) of MORT is ms, which is the lowest among all the benchmarks. The results show the superiority of MORT over state-of-the-art benchmarks in classifying patchy image structures with small inter-class variances.
Combining complementary feature representations may significantly improve the classification performance [wang2015beyond]. However, a key question is whether the proposed MORT and the state-of-the-art ConvNets methods can provide each other with complementary features for further performance improvement. To that end, we propose a fused MORT by concatenating the original MORT feature matrices and the feature vectors extracted from a state-of-the-art ConvNets method, DCL [chen2019destruction], and then compare the results of ConvNets methods and the fused MORT. To facilitate the comparison between the proposed method and ConvNets methods, we combine the fused MORT (as the feature encoder) with an SVM (as the classifier) using the same protocol as adopted in [wu2019ip102].
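Feature-level fusion by concatenation can be sketched in a few lines. This is only an illustration of the concatenation step: the array sizes are made up, the function name `fuse_features` is hypothetical, and no rescaling of the two feature blocks is applied because none is described in the text. The fused vector would then be passed to a classifier such as an SVM.

```python
import numpy as np


def fuse_features(mort_matrices, convnet_vector):
    """Concatenate flattened descriptor matrices with a ConvNet feature vector.

    mort_matrices: sequence of 2D arrays (e.g. one per patch type).
    convnet_vector: 1D array of features from a trained network.
    Returns a single 1D feature vector.
    """
    parts = [np.asarray(m, dtype=float).ravel() for m in mort_matrices]
    parts.append(np.asarray(convnet_vector, dtype=float).ravel())
    return np.concatenate(parts)


if __name__ == "__main__":
    # Illustrative sizes only: two 3x4 descriptor matrices and an 8-D vector.
    interior = np.ones((3, 4))
    complementary = np.zeros((3, 4))
    deep = np.arange(8.0)
    fused = fuse_features([interior, complementary], deep)
    print(fused.shape)
```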
We compare the proposed MORT with the following state-of-the-art ConvNets methods: (1) Three state-of-the-art ConvNets models: Alexnet [52], VGG-16 [21], and ResNet-50 [34]; and (2) three fine-grained state-of-the-art methods: improved B-CNN [lin2017improved], fast-MPN-COV [li2018towards], and DCL [chen2019destruction].
All the models are implemented in PyTorch 1.0.0, pretrained on ImageNet [deng2009imagenet], and then fine-tuned on each dataset. In all experiments, the input images are resized to 440 × 440 and randomly cropped to 384 × 384 for training. Standard data augmentations are applied, including random rotation within 15 degrees and horizontal flip with probability 0.5. For fast-MPN-COV and improved B-CNN, the models are trained with the best setting in the implementation of fast-MPN-COV [li2018towards]. For the remaining methods, the models are trained with the default settings in the implementation of DCL [chen2019destruction].

Table 4: Results on the IwPIS dataset.

Algorithm | Accuracy (%) | Time (ms) |
---|---|---|
SC | 40.78 | |
SC+DP | 37.59 | |
IDSC | 30.92 | |
IDSC+DP | 34.90 | |
SRV | 36.40 | |
DBCSR (opt) | 33.75 | |
DBCSR (uni) | 27.72 | |
HSC | 35.15 | |
MDM | 27.06 | |
HF | 36.96 | |
Proposed MORT | 44.79 |
For the SoyCultivarVein, BtfPIS and IwPIS datasets, we select the first half of the images from each category as the training set and the remaining images as the testing set. Table 5 lists the classification accuracies of all the competing ConvNets methods on the three datasets. The fused MORT achieves the best classification accuracy of 65% and 98.00% on the SoyCultivarVein and BtfPIS datasets, respectively. On IwPIS, however, the fused MORT ranks third in classification accuracy (tied with DCL), 6% and 12% lower than the improved B-CNN and fast-MPN-COV, respectively. A possible reason is that the fused MORT has a much lower feature dimension (1 × 2092) per image than the improved B-CNN (1 × 262144) and fast-MPN-COV (1 × 32896). The overall results show that the fused MORT surpasses the original MORT in classification accuracy on all three patchy image structure datasets, indicating that ConvNets methods and the proposed MORT can provide each other with complementary information. The superior results of the fused MORT over the ConvNets methods on SoyCultivarVein and BtfPIS indicate that the proposed MORT can extract discriminative features even with very limited training samples, leading to the performance improvements of the fused MORT over the sole use of DCL or the original MORT.
Table 5: Classification accuracy (%) of the ConvNets methods and the fused MORT on the three datasets.

Algorithm | SoyCultivarVein | BtfPIS | IwPIS |
---|---|---|---|
Alexnet | 14 | 93.20 | 44 |
VGG-16 | 16 | 95.60 | 48 |
ResNet-50 | 28 | 96.80 | 52 |
improved B-CNN | 63 | 73.60 | 74 |
fast-MPN-COV | 51 | 77.20 | 80 |
DCL | 39 | 97.20 | 68 |
fused MORT | 65 | 98.00 | 68 |
In this paper, we presented a novel Multi-Orientation Region Transform (MORT) method, which is rotation, translation and scale invariant, for effective and efficient classification of patchy image structures. The proposed MORT can extract local structural features at various scales and orientations for comprehensive shape description. The encouraging experimental results on three patchy image structure databases demonstrate the effectiveness and efficiency of MORT for patchy image shape classification. Moreover, the results also indicate that the proposed MORT and ConvNets methods can provide each other with important complementary features to further improve the classification performance.
This work was supported in part by the Australian Research Council under Discovery Grant DP180100958 and Linkage Grant LP170100326.