Tree bark re-identification using a deep-learning feature descriptor

12/06/2019 ∙ by Martin Robert, et al. ∙ Université Laval

The ability to visually re-identify objects is a fundamental capability in vision systems. Oftentimes, it relies on collections of visual signatures based on descriptors, such as Scale Invariant Feature Transform (SIFT) or Speeded Up Robust Features (SURF). However, these traditional descriptors were designed for a certain domain of surface appearances and geometries (limited relief). Consequently, highly-textured surfaces such as tree bark pose a challenge to them. In turn, this makes it more difficult to use trees as identifiable landmarks for navigational purposes (robotics) or to track felled lumber along a supply chain (logistics). We thus propose to use data-driven descriptors trained on bark images for tree surface re-identification. To this effect, we collected a large dataset containing 2,400 bark images with strong illumination changes, annotated by surface and with the ability to pixel-align them. We used this dataset to sample more than 2 million 64x64 pixel patches to train our novel local descriptors DeepBark and SqueezeBark. Our DeepBark method shows a clear advantage over the hand-crafted descriptors SIFT and SURF. Furthermore, we demonstrate that DeepBark can reach a Precision@1 of 99.8% on a database of 7,900 images with only 11 relevant images per query. Our work thus suggests that re-identifying tree surfaces in a challenging context is possible, and we also make a new dataset public.

1 Introduction

The tracking of objects is an important concept in many fields. For instance, tracking within the supply chain is a key element of the Industry 4.0 philosophy [7, 19]. In the forestry industry, for example, it would consist of tracking trees from the forest to their entrance in the wood yard [24, 27]. In the context of mobile robotics, being able to uniquely identify trees might improve localization in forests [17, 25, 32], as one would be able to use trees as visual landmarks. In order to perform tracking on trees, one must be able to re-identify them, potentially from bark images. In this paper, we explore precisely this problem, by developing a method to compare images of tree bark and determine whether they come from the same surface or not.

The difficulty of re-identifying bark surfaces arises in part from the self-similar nature of their texture. Moreover, the bark texture induces large changes in appearance when lit from different angles. This is due to the presence of deep troughs in bark, for many tree species. Another difficulty is the absence of a dataset tailored to this problem. There are already-existing bark datasets [13, 35, 9], but these are geared towards tree species classification.

To this effect, we first collected our own dataset with 200 uniquely-identified bark surface samples, for a total of 2,400 bark images. With these images, we produced a feature-matching dataset enabling the training of deep learning feature descriptors. We also established the first state-of-the-art bark retrieval performance, showing promising results in challenging conditions. In particular, our approach far surpasses common local feature descriptors such as SIFT [23] and SURF [4], as well as the data-driven descriptor DeepDesc [30]; see Figure 1 for a qualitative assessment.

In short, our contributions can be summarized as follows:

  • We introduce a novel dataset of tree bark pictures for image retrieval. These pictures contain fiducial markers from which camera-plane transformations can be inferred.

  • We train a local feature descriptor via Deep Learning and demonstrate that it can match, with great success, different images of the same bark surface.

2 Related Work

Our problem is related to three main areas: image retrieval, local feature descriptors and metric learning. Below, we discuss these. We also discuss the application of computer vision methods to the identification of bark images.

2.1 Image retrieval

The problem of image retrieval can be defined as follows: given a query image, the goal is to find other images in a database that look similar to the query. In mobile robotics, for example, a specialization of this problem is known as Visual Place Recognition (VPR) [41, 2, 40, 10, 11], where image retrieval is used to perform localization. There, the objective is to determine whether a location has already been visited, given its visual appearance. The robot can then localize itself by finding previously-seen images which are geo-referenced. In the area of surveillance, the problem is defined as Person Re-Identification (Person Re-Id) and aims to follow an individual through a number of security camera recordings [18, 41, 40, 14]. This technique requires building or learning a function that maps multiple images of an individual to the same compact description, despite variations in viewpoint, illumination, pose or even clothing. Our tree bark re-identification is closest to this Person Re-Id problem, since we desire to track an individual bark surface despite changes in illumination and viewpoint.

2.2 Local feature descriptor

To describe and compare images while being invariant to viewpoint and illumination changes, we chose to use local feature descriptors. The goal of these descriptors is to summarize the visual content of an image patch. The ideal descriptor is a) compact (low dimensionality), b) fast to compute, c) distinctive and d) robust to illumination, translation and rotation. A widespread approach is to use hand-crafted descriptors. They often rely on histograms of orientation, scale and magnitude of image gradients, as in SIFT or SURF. Different variants have appeared over the years, either trying to alleviate the computation cost [8, 26] or simply trying to improve performance [1, 3].
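For concreteness, the following is a minimal sketch of how such hand-crafted keypoints and descriptors are typically extracted with OpenCV; the file name and the 500-keypoint cap are illustrative and not tied to our pipeline.

```python
# Minimal sketch (not our pipeline) of hand-crafted local description with
# OpenCV. cv2.SIFT_create() ships with recent OpenCV builds; older releases
# required opencv-contrib-python.
import cv2

img = cv2.imread("bark_sample.jpg", cv2.IMREAD_GRAYSCALE)  # hypothetical file

sift = cv2.SIFT_create(nfeatures=500)            # cap the number of keypoints
keypoints, descriptors = sift.detectAndCompute(img, None)
print(len(keypoints), descriptors.shape)         # descriptors: (N, 128) float32
```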

Recently, data-driven approaches based on machine learning have appeared. Some learn a parametric function that maps image patches to compact descriptions that can be compared by their distance [6, 30]. Instead of describing an image patch alone, [36] takes two patches at once and directly provides a similarity probability. There is also work proposing a pipeline trained end-to-end (detector + descriptor) [37, 12]. For a good comparison between hand-crafted and data-driven descriptors on different tasks, see [28, 40].

2.3 Metric learning

To build a learned local feature descriptor, we turned to the field of metric learning. It is a training paradigm that tries to learn a distance function between data points. The goal is in line with points c) and d) of an ideal descriptor, since it seeks to make this distance small for similar examples, and large for dissimilar ones. This approach has been explored in [15, 30, 12], where training relied on the so-called contrastive loss. Another line of work attempts instead to make the inter-class variation larger than the intra-class variation by a chosen margin in the vector space; this formulation corresponds to the triplet loss [29, 2]. [33] instead chose to compare a similar pair of examples to multiple negative ones, using a clever batch construction process.
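As an illustration of the margin formulation mentioned above, the sketch below implements a basic triplet loss on L2-normalized embeddings in PyTorch. The margin value is an arbitrary placeholder, and this is not the loss we use for our descriptors (we use the N-pair-mc loss; see subsection 4.4).

```python
# Illustrative sketch of a margin-based triplet loss on L2-normalized
# embeddings (not the loss used to train our descriptors).
import torch
import torch.nn.functional as F

def triplet_loss(anchor, positive, negative, margin=0.2):
    """anchor, positive, negative: (B, D) L2-normalized embeddings."""
    d_pos = (anchor - positive).pow(2).sum(dim=1)   # distance to the similar example
    d_neg = (anchor - negative).pow(2).sum(dim=1)   # distance to the dissimilar one
    # Penalize triplets where the negative is not farther than the positive by the margin.
    return F.relu(d_pos - d_neg + margin).mean()

a, p, n = (F.normalize(torch.randn(8, 128), dim=1) for _ in range(3))
print(triplet_loss(a, p, n).item())
```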

2.4 Vision applied to bark/wood texture

Exploiting the information present in bark images has been explored before. For instance, hand-crafted features such as Local Binary Patterns (LBP) [20, 34, 35], SIFT descriptors [13] and Gabor filters [42] have been used to perform tree species recognition. Closer to our work, [5] compared variants of the LBP method for image retrieval, but only at the species level. If bark is considered as a texture problem, one can find interesting work such as [38, 39], which use ground textures such as asphalt, wood flooring or other textured surfaces to enable robots to localize themselves. However, their technique is based on images with almost no variations, and each query is compared with a single set of SIFT descriptions covering their whole texture map. Data-driven approaches such as deep learning have also been applied to bark images, but strictly for species classification [9].

3 Problem Definition

The problem we are addressing is an instance of re-identification. Given an existing database of bark images and a query image, our goal is to find all images in the database that correspond to the same physical surface, and hence the same tree. We assume that the query has a meaningful match in our database, i.e., we are not trying to solve an open-set problem; see FAB-MAP [10] for the detection of novel locations.

3.1 Image global signature

We perform the bark image search via global image signatures. These signatures are extracted for each image (database and query), as depicted in Figure 2. For this, we mostly follow the method used in [31], summarized below. First, a keypoint detector extracts a collection of keypoints from the image. For each of these keypoints, we extract a description of dimension 128, yielding a list of descriptions. These descriptions can come from standard descriptors, such as SIFT or SURF, or from our novel descriptors, described further down. The remaining component of an image signature is a Bag of Words (BoW) representation, calculated from this list of descriptions. We also apply the standard TF-IDF weighting. In [31], the comparison between two BoWs is done using the cosine distance. Instead, we L2-normalize every BoW and use the L2 distance to compare them. This way, our distance ranking is equivalent to the pure cosine distance, but without using a dot product.

Figure 2: Illustration of the signature extraction pipeline for a single image. First, the keypoints are detected. Then, a descriptor is computed for each keypoint, creating a list of descriptions. Finally, a Bag of Words (BoW) representation is computed by quantizing all the descriptions with a visual vocabulary, resulting in the global signature of the image.
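The sketch below illustrates this signature computation with plain NumPy. The visual vocabulary (k-means centroids) and the IDF weights are assumed to have been computed offline on training images, and all names and shapes are illustrative rather than taken from our implementation.

```python
# Sketch of a TF-IDF-weighted, L2-normalized BoW signature (illustrative only).
import numpy as np

def bow_signature(descriptors, vocabulary, idf):
    """descriptors: (N, 128); vocabulary: (V, 128) k-means centroids; idf: (V,)."""
    # Hard-assign each descriptor to its nearest visual word.
    dists = np.linalg.norm(descriptors[:, None, :] - vocabulary[None, :, :], axis=2)
    words = dists.argmin(axis=1)
    # Term-frequency histogram over the vocabulary, weighted by IDF.
    tf = np.bincount(words, minlength=len(vocabulary)).astype(np.float64)
    sig = tf * idf
    # L2-normalize so that ranking by L2 distance matches ranking by cosine distance.
    return sig / (np.linalg.norm(sig) + 1e-12)

# Ranking a database of signatures against a query:
# order = np.argsort(np.linalg.norm(db_signatures - query_signature, axis=1))
```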

3.2 Signature matching

The search is performed mainly by computing a score between the query image signature and each image signature of the entire database, then retrieving the best matches based on this score. For the BoW technique, we simply use the L2 distance between two BoWs as the score. Another way to calculate a score between two images begins by taking the distance between every description of one image and every description of the other, to obtain a collection of putative matching pairs of features. Then, to filter out potential false matches, one needs to add extra constraints. In this paper, we explore two such filters. The first one is the Lowe Ratio (LR) test introduced in [23]. The second one is a Geometric Verification (GV), which is a simple neighbor check. It begins by taking a putative match and retrieving the keypoint associated with each of its two descriptions. Following this, we find the nearest neighbors of each of these keypoints in their respective images. Finally, the match is accepted if at least a minimum percentage of the neighbors of one keypoint have a match with the neighbors of the other keypoint. The number of matches left after filtering is then used as the matching score between the two bark images.
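A sketch of the two filters is given below; the ratio, the neighborhood size k and the acceptance fraction are placeholders, since the exact values used in our experiments are not restated here.

```python
# Sketch of the Lowe Ratio test and of the neighbor-based Geometric
# Verification described above (thresholds are placeholders).
import numpy as np

def lowe_ratio_matches(desc_q, desc_db, ratio=0.8):
    """Keep matches whose best distance is clearly smaller than the second best."""
    matches = []
    for i, d in enumerate(desc_q):
        dists = np.linalg.norm(desc_db - d, axis=1)
        j1, j2 = np.argsort(dists)[:2]
        if dists[j1] < ratio * dists[j2]:
            matches.append((i, int(j1)))
    return matches

def geometric_verification(matches, kp_q, kp_db, k=8, min_frac=0.5):
    """Accept a match if enough neighbors of its two keypoints are matched together."""
    match_set = set(matches)
    kept = []
    for (i, j) in matches:
        # k nearest keypoints of i in the query image, and of j in the database image
        ni = np.argsort(np.linalg.norm(kp_q - kp_q[i], axis=1))[1:k + 1]
        nj = set(int(x) for x in np.argsort(np.linalg.norm(kp_db - kp_db[j], axis=1))[1:k + 1])
        agree = sum(1 for a in ni for b in nj if (int(a), b) in match_set)
        if agree >= min_frac * k:
            kept.append((i, j))
    return kept   # len(kept) serves as the matching score between the two images
```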

4 Our approach: Data-driven descriptors

Considering that the highly-textured surfaces of tree bark cause problems for hand-crafted descriptors, we present here the main contribution of our paper: data-driven descriptors for bark image re-identification. First, we describe our bark image dataset. Then, we discuss how to process this dataset in order to generate the keypoint-aligned image patches required to train our descriptors. These descriptors are then described in detail, followed by the necessary training details.

4.1 Bark Image Datasets

In order to develop our data-driven descriptors, we collected a dataset of tree bark images. To ensure drastic illumination changes, we took the pictures at night, and varied the position of a 550 lumen LED lamp. We also varied the position of the camera, an LG Q6 cellphone with a resolution of 4160 x 3120 pixels. Since our training approach (subsection 4.2) requires keypoint-aligned image patches, we used fiducial markers on a wooden frame attached to trees to automate and increase the precision of the image registration, as shown in Figure 3.

We collected bark images for only two different tree species, namely Red Pine (RP), an evergreen (50 trees, 100 unique bark surfaces), and Elm (EL), a deciduous tree (50 trees, 100 unique bark surfaces). Each bark surface was surrounded by a custom-made wooden frame which left visible a surface of 757.5 cm² (a rectangle of 50.5 cm by 15 cm). We limited ourselves to only two species to avoid positively biasing the image retrieval results; indeed, neural networks can easily distinguish between tree species [9]. In total, we took 12 images per distinct surface with the aforementioned variations. To make our evaluation on EL bark more challenging, we also collected unseen bark images from elm trees without any markers. To keep these new images close to our original appearance distribution, we took them at night with 3 different illumination angles, but with limited changes in point of view. We collected a total of 30 images per tree with some physical overlap, spread nearly uniformly around the trunk. This gave us a total of 750 manually-cropped non-relevant images for any EL query, taken at a scale similar to all of our other images.

Figure 3: Images from our database of the same surface of Elm (EL) bark, but under different illuminations and camera angles. In each image, there are four fiducial markers on a custom-made wooden frame, used for pixel-wise registration.

4.2 Descriptor Training Dataset

Our descriptors require a dataset made of 64x64 patches for training with metric learning. Moreover, these patches not only need to be properly indexed per bark surface, they must also show the exact same physical location. After automatically cropping the excess information from the images (background, frame, shadow, etc.), we registered every image of a bark surface to a common reference frame via a homography. We used the fiducial markers affixed to the wooden frame surrounding the bark surface (see Figure 3) to estimate these transformations. Then, for each bark image, we detected the maximum number of keypoints and projected them to the reference frame via its homography. We filtered the keypoints in the reference frame to require a minimum of 32 pixels between them, so as to minimize overlap. This resulted in around 800-1000 distinct keypoints per surface. For each of these keypoints, we then found the 12 image patches (one per image, see subsection 4.1) using a homography that gives the transformation from the reference frame to a specific bark image. This resulted in a collection of 64x64 image patches corresponding to the exact same physical location on the bark, but with changes in illumination and point of view (rotation, scaling and perspective). Figure 4 shows three images of a unique bark surface, with the manual correspondence between keypoints. Figure 5 shows the 12 patches of a keypoint extracted according to the algorithm used to create the training dataset.
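The sketch below illustrates this alignment step with OpenCV: a homography is estimated from the fiducial-marker correspondences, each image is warped into the reference frame, and 64x64 patches are cut around the reference-frame keypoints. Warping the whole image is a simplification of the per-keypoint projection described above, and the marker detection and reference-frame size are assumptions.

```python
# Illustrative sketch of the patch alignment, assuming the four marker corners
# have already been detected in each image (marker detection omitted).
import cv2
import numpy as np

def homography_to_reference(marker_pts_img, marker_pts_ref):
    """Estimate the homography mapping an image into the common reference frame."""
    H, _ = cv2.findHomography(marker_pts_img, marker_pts_ref, cv2.RANSAC)
    return H

def aligned_patches(image, H_img_to_ref, ref_keypoints, ref_size=(2000, 600), patch=64):
    """Warp the image into the reference frame and cut one patch per keypoint."""
    warped = cv2.warpPerspective(image, H_img_to_ref, ref_size)   # ref_size is hypothetical
    half = patch // 2
    patches = []
    for x, y in np.asarray(ref_keypoints, dtype=int):
        if half <= x < ref_size[0] - half and half <= y < ref_size[1] - half:
            patches.append(warped[y - half:y + half, x - half:x + half])
    return patches
```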

Figure 4: Top row: pictures of the same bark surface with strong changes in illumination. Each circle color is a distinct keypoint. Bottom row: close up of the red keypoints from their respective images. This highlights the importance for a descriptor to be as immune as possible to such illumination changes.
Figure 5: Actual example of 64x64 patches of a keypoint. Red arrows indicate the orientation of the original bark images.

4.3 DeepBark and SqueezeBark Descriptors

To perform description extraction, we implemented two different architectures with PyTorch 0.4.1. The first one, DeepBark, is based on a pre-trained version of ResNet-18 [16]. We removed the average pooling and the fully connected layers, and replaced them with a single fully connected layer (no activation function). The second one, SqueezeBark, is a smaller network based on a pre-trained version of SqueezeNet 1.1 [21]. We again removed the average pooling and the fully connected layers, and replaced them with a max pooling layer (to reduce the feature map) and a fully connected layer (no activation function). In both cases, the network computes a 128-dimensional vector, fed to an L2 normalization layer. Removing this last fully connected layer and counting the parameters of the remaining convolutional layers, DeepBark comprises a total of 10,994,880 parameters and SqueezeBark 719,552 parameters. Our intention here is to be able to compare the impact of network size on descriptor quality.
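As a rough PyTorch sketch (written against a current torchvision API rather than the 0.4.1 version we used), a DeepBark-style network can be assembled as follows; the 512x2x2 flattened input size of the final layer follows from feeding 64x64 patches through the ResNet-18 trunk and is our assumption, not a detail taken from the paper.

```python
# Sketch of a DeepBark-style descriptor: ResNet-18 trunk without its average
# pooling and classification layers, one fully connected layer to 128-D,
# followed by L2 normalization. Written for a recent torchvision API.
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision import models

class BarkDescriptor(nn.Module):
    def __init__(self, dim=128):
        super().__init__()
        backbone = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)
        self.trunk = nn.Sequential(*list(backbone.children())[:-2])  # drop avgpool + fc
        self.fc = nn.Linear(512 * 2 * 2, dim)                        # no activation

    def forward(self, x):                       # x: (B, 3, 64, 64) patches
        f = self.trunk(x).flatten(1)            # (B, 2048) for 64x64 inputs
        return F.normalize(self.fc(f), dim=1)   # L2-normalized 128-D descriptor

print(BarkDescriptor()(torch.randn(2, 3, 64, 64)).shape)   # torch.Size([2, 128])
```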

4.4 Training details

To train our networks (DeepBark or SqueezeBark), we chose the N-pair-mc loss [33]. The only difference in our implementation is that, instead of using L2 regularization to avoid degeneracy, we L2-normalize the descriptor vectors to keep them on a hypersphere [29].
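A minimal sketch of this objective on L2-normalized embeddings is shown below: each keypoint contributes an (anchor, positive) pair, and the positives of the other keypoints in the batch act as its negatives. The similarity scaling factor is our own addition for illustration (some scaling is usually helpful once embeddings are normalized) and is not a value from the paper.

```python
# Sketch of the N-pair-mc objective on L2-normalized embeddings.
import torch
import torch.nn.functional as F

def n_pair_mc_loss(anchors, positives, scale=10.0):
    """anchors, positives: (N, D) normalized embeddings of N distinct keypoints."""
    logits = scale * anchors @ positives.t()     # (N, N) pairwise similarities
    targets = torch.arange(anchors.size(0), device=anchors.device)
    # Row i must prefer its own positive (column i) over the N-1 other positives.
    return F.cross_entropy(logits, targets)

a = F.normalize(torch.randn(16, 128), dim=1)
p = F.normalize(torch.randn(16, 128), dim=1)
print(n_pair_mc_loss(a, p).item())
```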

Our dataset is composed of 64x64 patches for around 70,800 distinct keypoints in the training set and 17,500 in the validation set for most of our experiments. Using 12 patches per keypoint for training and 2 for validation, this totals 884,600 64x64 bark image patches. At each iteration, we only used a pair of examples for every keypoint in the training set. However, to ensure an equal probability for every patch to be seen together with every other patch, we randomly selected each patch pair. We added online data augmentation in the form of color, luminosity and blurriness jitter. Each input image was normalized to roughly [-1, 1] by subtracting 127.5 and then dividing by 128. We optimized using Adam [22], starting from an initial learning rate that we reduced by a factor of 0.5 each time the validation plateaued for 20 iterations.

We built the validation set by finding all of the keypoints available in the bark images set aside for validation, and randomly selecting 2 patches from the 12 available for each distinct keypoint. This gave us a fixed validation set, where every patch had a corresponding one. During training, we validated our model by selecting 50 keypoints with their 2 examples at a time and performing a retrieval test to calculate the Precision at rank 1 (P@1). The final validation score was simply the average of the P@1 calculated for every batch of 50 keypoints. After training, we selected the model with the highest validation score. The training was stopped either by early stopping, when the validation score stagnated for 40 iterations, or when a maximum number of iterations was reached.

5 Results

Besides DeepBark and SqueezeBark, we also analysed the hand-crafted descriptors SIFT and SURF. We also included DeepDesc, a learned descriptor trained on the multi-view stereo dataset [6]. All descriptors use the SIFT keypoint detector, except for SURF, which uses its own detector. For all experiments, we used a ratio of 0.8 for the LR test, and fixed the neighborhood size and acceptance percentage of the GV filter. Each visual vocabulary was computed from the training images of the respective experiment, clustered using the k-means algorithm. As we will see later, these parameters offered good performance, and we did not try to adjust any of them to further improve the results.

Image retrieval can be evaluated in multiple ways. In our case, we favored metrics based on an ordered set, as they align best with our problem description. Hence, we chose to present results with the Precision at rank K (P@K), Recall at rank K (R@K), R-Precision (R-P) and Average Precision (AP). P@K and R@K are defined as:

P@K = rel(K) / K        (1)
R@K = rel(K) / |R|      (2)

In Equations 1 and 2, K is a rank and the function rel(K) returns the number of relevant images ranked between the first rank and rank K (K included). In Equation 2, R is the set of relevant images; when K is equal to the number of relevant images (11 in our case), P@K is called the R-P. The advantage of R-P is that it can reach a value of 1 and simultaneously reflects Precision@11 and Recall@11. We also present Precision-Recall (PR) graphs, based on:

P(i) = rel(rank(i)) / rank(i)        (3)

In Equation 3, i represents one relevant image and rank(i) is the rank at which this image is found. Taking the mean of every P(i) gives the AP. Keep in mind that these metrics are calculated for every query, then averaged together. Thus, instead of AP we report the mean Average Precision (mAP).
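For clarity, the sketch below computes these metrics for a single query, assuming the ranking covers the whole database so that every relevant image eventually appears; variable names are illustrative.

```python
# Sketch of P@K, R@K and AP for one query. `ranking` is the list of database
# image ids ordered by matching score; `relevant` is the set of relevant ids.
def precision_recall_at_k(ranking, relevant, k):
    rel_k = sum(1 for img in ranking[:k] if img in relevant)   # rel(K)
    return rel_k / k, rel_k / len(relevant)                    # P@K, R@K

def average_precision(ranking, relevant):
    """Mean of P(i) = rel(rank(i)) / rank(i) over the relevant images (Eq. 3)."""
    precisions, hits = [], 0
    for rank, img in enumerate(ranking, start=1):
        if img in relevant:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(relevant) if relevant else 0.0

ranking, relevant = [3, 7, 1, 9, 4, 12], {3, 9, 12}
print(precision_recall_at_k(ranking, relevant, 3))   # (0.333..., 0.333...)
print(average_precision(ranking, relevant))          # (1/1 + 2/4 + 3/6) / 3 = 0.666...
```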

5.1 Hyperparameters search

Our approach comprises a number of hyperparameters to select. The first is the maximum allowable number of keypoints in an image. From experiments, increasing this limit beyond 500 keypoints did not significantly improve the performance of any descriptor. The second hyperparameter is the downsizing factor of the original image. Downsizing an image increases the receptive field of any method, without changing its process. Our experiments showed that downsizing generally helped every descriptor. Our third hyperparameter is the sigma of the Gaussian blur performed before passing the image through the keypoint detector. Note that the blur was used for keypoint detection only; after that, we used either the unblurred image to compute the descriptions of the learned descriptors (DeepBark, SqueezeBark and DeepDesc), or the blurred image for SIFT and SURF. The latter was necessary, as they use the keypoint information found on the blurred image. We found that the best blur value varied greatly between descriptors. A sample of the results is shown in Table 1, and the values chosen for the experiment of subsection 5.3 are shown in Table 2. These values were found by averaging the results of 36 randomly-selected queries run against the validation set for each hyperparameter combination.
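The sketch below shows the preprocessing controlled by these hyperparameters, using OpenCV: the image is downsized by a factor d, blurred with a Gaussian of standard deviation sigma for keypoint detection, and the number of keypoints is capped at 500. Which of the blurred or unblurred resized image is then described depends on the descriptor, as explained above; the default values are placeholders.

```python
# Sketch of the preprocessing driven by the grid-searched hyperparameters.
import cv2

def detect_keypoints(image, d=2.0, sigma=1.0, max_kp=500):
    resized = cv2.resize(image, None, fx=1.0 / d, fy=1.0 / d)
    blurred = cv2.GaussianBlur(resized, (0, 0), sigma) if sigma > 0 else resized
    keypoints = cv2.SIFT_create(nfeatures=max_kp).detect(blurred, None)
    # Learned descriptors describe `resized`; SIFT/SURF describe `blurred`.
    return keypoints, resized, blurred
```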

Descriptor     Downsize   σ=0     σ=1     σ=2     σ=3
DeepBark       1.0        0.795   0.816   0.838   0.826
DeepBark       1.5        0.902   0.904   0.886   0.750
DeepBark       2.0        0.937   0.914   0.745   0.606
SqueezeBark    1.0        0.098   0.111   0.116   0.131
SqueezeBark    1.5        0.114   0.139   0.131   0.126
SqueezeBark    2.0        0.167   0.159   0.136   0.106
SURF           1.0        0.154   0.194   0.288   0.354
SURF           1.5        0.301   0.326   0.384   0.402
SURF           2.0        0.290   0.359   0.409   0.452
SIFT           1.0        0.124   0.220   0.348   0.404
SIFT           1.5        0.162   0.318   0.417   0.419
SIFT           2.0        0.210   0.359   0.389   0.245
DeepDesc       1.0        0.053   0.051   0.051   0.040
DeepDesc       1.5        0.043   0.063   0.053   0.045
DeepDesc       2.0        0.076   0.086   0.048   0.040
Table 1: Results of the hyperparameter grid search, averaged over 36 random queries on the validation set. The downsize factor is how much the size of the original image was divided, and σ is the sigma used for the Gaussian blur applied before keypoint detection. The values reported are the R-P of the GV.
Descriptor     Downsize   σ    Avg. Keypoint Num.
SIFT           1.5        3    469.4 ± 69.9
SURF           2.0        3    499.6 ± 4.8
DeepDesc       2.0        1    497.0 ± 17.4
SqueezeBark    2.0        0    492.8 ± 18.4
DeepBark       2.0        0    492.8 ± 18.4
Table 2: Hyperparameters chosen after careful examination of the grid search, with the mean (± standard deviation) number of keypoints found at test time. The number of keypoints was capped at 500.

5.2 Impact of training data size

Data-driven approaches based on Deep Learning tend to be data hungry. To check the impact of the training data size, we created 5 training scenarios per tree species, using 10%, 20%, 30%, 40% and 50% of that species' dataset. All trained descriptors were validated and tested on the same folds (10% and 40%, respectively) of each species' dataset. We stopped training when the validation P@1 stagnated for 40 consecutive iterations.

Table 3 shows the performance of the DeepBark descriptor for each training set size. For each species, the P@1, R-P and mAP are reported for the three scoring techniques: GV, LR and BoW. Note that the BoW is also affected by the size of the training set, since its visual vocabulary is computed from that same training set. From these metrics, we concluded that performance gains were minimal beyond 40%. This confirmed that our training database is sufficiently large to obtain good performance. For reference, when using 50% of RP as training data, we had access to approximately 42,700 distinct keypoints, yielding 512,000 bark image patches of 64x64 pixels.

Red Pine
Scoring  Metric   10%     20%     30%     40%     50%
BoW      P@1      0.971   0.985   0.985   0.996   0.994
BoW      R-P      0.578   0.651   0.705   0.722   0.751
BoW      mAP      0.633   0.713   0.769   0.785   0.812
GV       P@1      0.988   0.990   0.998   0.996   0.998
GV       R-P      0.727   0.790   0.828   0.842   0.857
GV       mAP      0.777   0.848   0.892   0.905   0.922
LR       P@1      1.000   1.000   1.000   1.000   1.000
LR       R-P      0.822   0.890   0.921   0.930   0.938
LR       mAP      0.882   0.932   0.956   0.962   0.967

Elm
Scoring  Metric   10%     20%     30%     40%     50%
BoW      P@1      0.940   0.956   0.971   0.979   0.983
BoW      R-P      0.558   0.635   0.662   0.710   0.706
BoW      mAP      0.607   0.691   0.721   0.759   0.764
GV       P@1      0.944   0.965   0.977   0.981   0.983
GV       R-P      0.670   0.706   0.742   0.763   0.757
GV       mAP      0.707   0.752   0.791   0.816   0.806
LR       P@1      0.985   0.996   0.998   0.998   1.000
LR       R-P      0.613   0.689   0.726   0.748   0.747
LR       mAP      0.665   0.740   0.779   0.800   0.798
Table 3: Performance of the DeepBark descriptor when training with 10%, 20%, 30%, 40% and 50% of the data from a single tree species. Of the remaining data, 10% and 40% were used for validation and testing, respectively. Hyperparameters were fixed throughout testing.

5.3 Descriptors comparison

We selected 50% of the red pine bark surfaces and 50% of the elm bark surfaces to create a test set, while using the remaining data for the training and validation sets. This corresponded to 80 unique bark surfaces for training, 20 for validation and 100 for testing, while keeping the ratio between tree species at 50/50 in each set. Our data-driven descriptors DeepBark and SqueezeBark were trained for 200 iterations, and we kept the model with the best validation score. With 12 images for each bark surface, the test set has a total of 1,200 images, 600 per tree species. Each of these images was used as a query during the retrieval test, and the results were averaged over all queries. We report the results in Figure 6 as PR curves. This way, all 11 true positives are taken into account in our experiments, properly estimating how well our approach withstands strong illumination/viewpoint changes.

From Figure 6, we can see that for almost all descriptors, GV and LR are better scoring methods than BoW. This is understandable, as BoW is intended more as a pre-filtering tool to reduce the number of potential candidates. We can also see that DeepBark clearly dominates SIFT, SURF and DeepDesc, and that its precision stays over 98% up to a recall of 6 images with the GV. Interestingly, the results for SqueezeBark are mixed. This might indicate that finding a good descriptor for bark images under strong illumination changes is a difficult problem, and thus requires a neural architecture with sufficient capacity.

Figure 6: PR curves for all descriptors, tested on 50% of RP and EL. Learned descriptors were trained on the remaining 50% of the bark surfaces. Each of the 1,200 images of the test set is used as a query. No extra negative examples were added.

5.4 Generalization across species

In the experiments of subsection 5.3, we reported results for networks trained on both species, instead of training and testing each architecture on a single species. Our intention was to double the amount of training data, and to benefit from the potential synergy between species often seen in deep networks (multi-task learning). Here, we look precisely at the generalization of our networks across species. We thus devised two experiments to evaluate the generalization from one kind of bark to the other, and vice versa. The first one uses a training set with 80% of the RP data and the remaining 20% as the validation set, with all of the EL data as the test set (labelled RP->EL). We also performed the converse (EL->RP). In Figure 7, we only report the PR curve for the GV, as the trend is similar for the other scoring methods. Figure 7 shows that DeepBark is capable of generalizing across species, but that SqueezeBark does so to a lesser extent. Also, there is no clear trend for the generalization direction, since SqueezeBark generalized better from EL to RP while DeepBark generalized better in the opposite direction (from RP to EL).

Figure 7: PR Curve for the generalization test using the GV method. The arrow -> indicates the generalization direction (trained on -> tested on).

5.5 Extra negative examples

To test how our system would perform on a larger database, we added 6,700 true negative elm examples with a crop size similar to the query images. Half of them were original images, and the other half were generated via data augmentation, by applying either a rotation, a scaling or an affine transformation. Note that the original 3,350 images contain some physical overlap, as they come from 25 trees.

We reused the DeepBark network and the visual vocabulary previously trained in subsection 5.3. For the test, we removed the red pine images and kept the elm images, which we split into two crops (top and bottom halves), giving us a total of 1,200 images. We thus obtained a database of 7,900 bark images. Again, every query had 11 relevant images. This experiment is the only one where we split bark images into two crops, done solely to increase the database size. This has the side effect of also lowering the performance, as the visible bark (and thus the number of visible features) is reduced by half. This can be seen by comparing Figure 6 and Figure 8.

Figure 8: PR curves for the negative examples test on SIFT and DeepBark. Numbers in the legend indicate how many negative examples were added.
Scoring  Metric   0       600     1600    3300    6700
BoW      P@1      0.952   0.937   0.924   0.910   0.885
BoW      R-P      0.611   0.569   0.537   0.504   0.471
BoW      mAP      0.659   0.611   0.572   0.532   0.491
GV       P@1      0.998   0.998   0.998   0.998   0.998
GV       R-P      0.832   0.832   0.832   0.832   0.831
GV       mAP      0.874   0.874   0.874   0.873   0.872
LR       P@1      0.999   0.999   0.998   0.998   0.998
LR       R-P      0.769   0.763   0.756   0.748   0.736
LR       mAP      0.812   0.804   0.794   0.783   0.768
Table 4: Results of the negative examples test for DeepBark. Numbers in the header indicate how many negative examples were added.

Among the three scoring methods evaluated, the most affected by the amount of negative examples was the BoW, as seen in Figure 8 and Table 4. The LR filter displays a smaller degree of degradation, as a function of the amount of extra negative examples. However, it still retains almost the same P@1. Finally, when looking at the GV, it is clear that the impact of extra negative examples is negligible. This again demonstrates the importance of performing GV filtering. We can also extrapolate that our approach with GV would work on a much larger, realistic dataset.

5.6 Computing time considerations

Even if the LR test and the GV filter perform better, it is unrealistic to use them to search a whole database in a realistic scenario. Instead, the BoW can be used as a pre-filter to propose putative candidates to the other methods. To this end, we provide Table 5, which shows the R@K for various values of K. These results suggest that keeping the 200 best matching scores calculated using the BoW on DeepBark would retain 73.9% of the 11 relevant images among 7,900 possible matches. As shown by [11], the BoW is fast to compare and can handle large datasets. To get a sense of the time that could be saved by this pre-filtering, we report in Table 6 the average time of these operations using our current implementation on a single thread. It is important to note that the BoW technique could be made even faster by using an inverted index and by taking advantage of its sparsity (on average, 71.8% of its entries are null in our experiments). From this, we can see that applying the GV on the top 200 candidates from the original 7,900 images can be accomplished in 35.88 s, while the BoW comparison only took 0.016 s for all 7,900 images.

Descriptor       R@25    R@50    R@100   R@200
SIFT-0           0.248   0.316   0.403   0.520
SIFT-6700        0.150   0.176   0.215   0.268
DeepBark-0       0.728   0.795   0.857   0.908
DeepBark-6700    0.561   0.625   0.681   0.739
Table 5: R@K for different values of K using the BoW. Results taken from the experiment with negative examples. The number beside each method name indicates how many negative examples were added.
Method           BoW     LR        GV
Mean time (ms)   0.002   131.499   179.387
Table 6: Single-signature comparison time, averaged over 500 comparisons, using our current implementation on a single thread of an Intel Core i7.
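A sketch of this two-stage search is given below: the BoW pre-filter keeps the K best candidates (K = 200, following Table 5), and only those are re-scored with the slower GV. The signature and GV scoring functions refer to the earlier sketches and are illustrative, not our exact implementation.

```python
# Sketch of the BoW pre-filter followed by GV re-ranking discussed above.
import numpy as np

def two_stage_search(query_sig, db_sigs, gv_score, k=200):
    """db_sigs: (M, V) L2-normalized BoW signatures; gv_score(i) -> GV match count."""
    # Stage 1: cheap L2 distance on BoW signatures over the whole database.
    candidates = np.argsort(np.linalg.norm(db_sigs - query_sig, axis=1))[:k]
    # Stage 2: expensive GV scoring restricted to the K retained candidates.
    scored = [(int(i), gv_score(int(i))) for i in candidates]
    return sorted(scored, key=lambda s: s[1], reverse=True)
```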

6 Conclusion

In this paper, we explored bark image re-identification in the challenging context of strong illumination and viewpoint variations. To this effect, we introduced a novel bark image dataset, from which we can extract over 2 million keypoint-registered image patches. Using the latter, we developed two learned local feature descriptors based on Deep Learning and metric learning, namely DeepBark and SqueezeBark. After showing that a descriptor can perform well with only 40% of the dataset from one tree species, we showed that both our descriptors performed better than SIFT, SURF and DeepDesc with any of the three scoring methods presented. Our results indicate that, using our descriptor DeepBark, retrieval is viable even for large databases with thousands of negative examples. Moreover, the approach can be sped up by using the Bag-of-Words as a pre-filter.

Our results are very encouraging, but performance in real-life scenarios might differ, and thus more data should be collected. It would also be interesting to quantify the effect of the BoW vocabulary size, the generalization capacity over more tree species, or the effect of using different keypoint detectors. Further improvements to the training procedure could be made, such as allowing more training iterations, trying other networks, adding pre-training or employing hard-mining approaches. Finally, our results open the way for new bark re-identification applications.

7 Acknowledgement

We would like to thank Marc-André Fallu for access to his red pine plantation. The authors would also like to thank the "Fonds de recherche du Québec – Nature et technologies (FRQNT)" for their financial support. We are also grateful to Fan Zhou for help with the data collection. Finally, we thank NVIDIA for their Hardware Grant Program.

References

  • [1] P. F. Alcantarilla, A. Bartoli, and A. J. Davison (2012) KAZE features. In ECCV, pp. 214–227. Cited by: §2.2.
  • [2] R. Arandjelovic, P. Gronat, A. Torii, T. Pajdla, and J. Sivic (2018-06) NetVLAD: cnn architecture for weakly supervised place recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence 40 (6), pp. 1437–1451. Cited by: §2.1, §2.3.
  • [3] R. Arandjelovic (2012) Three things everyone should know to improve object retrieval. In CVPR, pp. 2911–2918. Cited by: §2.2.
  • [4] H. Bay, T. Tuytelaars, and L. Van Gool (2006) SURF: speeded up robust features. In CVIU, Vol. 110, pp. 404–417. Cited by: §1.
  • [5] S. Boudra, I. Yahiaoui, and A. Behloul (2015) A comparison of multi-scale local binary pattern variants for bark image retrieval. In ACIVS, pp. 764–775. Cited by: §2.4.
  • [6] M. Brown, Gang Hua, and S. Winder (2011-01) Discriminative learning of local image descriptors. Transactions on Pattern Analysis and Machine Intelligence 33 (1), pp. 43–57. Cited by: §2.2, §5.
  • [7] O. Cakici, H. Groenevelt, and A. Seidmann (2011-11) Using rfid for the management of pharmaceutical inventory-system optimization and shrinkage control. Decision Support Systems 51, pp. 842–852. Cited by: §1.
  • [8] M. Calonder, V. Lepetit, C. Strecha, and P. Fua (2010) BRIEF: binary robust independent elementary features. In ECCV, pp. 778–792. Cited by: §2.2.
  • [9] M. Carpentier, P. Giguère, and J. Gaudreault (2018-03) Tree species identification from bark images using convolutional neural networks. IROS. Cited by: §1, §2.4, §4.1.
  • [10] M. Cummins and P. Newman (2008-06) FAB-map: probabilistic localization and mapping in the space of appearance. The International Journal of Robotics Research 27 (6), pp. 647–665. Cited by: §2.1, §3.
  • [11] M. Cummins and P. Newman (2009-06) Highly scalable appearance-only slam - fab-map 2.0. In Robotics: Science and Systems, pp. 1828–1833. Cited by: §2.1, §5.6.
  • [12] D. DeTone, T. Malisiewicz, and A. Rabinovich (2018) SuperPoint: self-supervised interest point detection and description. CVPRW, pp. 337–349. Cited by: §2.2, §2.3.
  • [13] S. Fiel and R. Sablatnig (2010-02) Automated identification of tree species from images of the bark , leaves or needles. pp. 67–74. Cited by: §1, §2.4.
  • [14] D. Gray, S. Brennan, and H. Tao (2007) Evaluating appearance models for recognition, reacquisition, and tracking. In International Workshop on Performance Evaluation for Tracking and Surveillance, Cited by: §2.1.
  • [15] R. Hadsell, S. Chopra, and Y. LeCun (2006) Dimensionality reduction by learning an invariant mapping. In CVPR, pp. 1735–1742. Cited by: §2.3.
  • [16] K. He, X. Zhang, S. Ren, and J. Sun (2016-06) Deep residual learning for image recognition. In CVPR, Vol. 19, pp. 770–778. Cited by: §4.3.
  • [17] T. Hellstrom, P. Larkeryd, T. Nordfjell, and O. Ringdahl (2009) Autonomous forest vehicles: historic, envisioned and state-of-the-art. International Journal of Forest Engineering 20 (1), pp. 31–38. Cited by: §1.
  • [18] A. Hermans, L. Beyer, and B. Leibe (2017) In defense of the triplet loss for person re-identification. CoRR abs/1703.07737. External Links: 1703.07737 Cited by: §2.1.
  • [19] V. Hinkka (2012-05) Challenges for building rfid tracking systems across the whole supply chain. International Journal of RF Technologies Research and Applications 3, pp. 201–218. Cited by: §1.
  • [20] Z. Huang, Z. Quan, and J. Du (2006) Bark classification based on contourlet filter features using rbpnn. In Intelligent Computing, pp. 1121–1126. Cited by: §2.4.
  • [21] F. Iandola, S. Han, M. W. Moskewicz, K. Ashraf, W. Dally, and K. Keutzer (2016-02) SqueezeNet: alexnet-level accuracy with 50x fewer parameters and <0.5mb model size. CoRR abs/1602.07360. External Links: 1602.07360 Cited by: §4.3.
  • [22] D. P. Kingma and J. Ba (2015) Adam: a method for stochastic optimization. CoRR abs/1412.6980. External Links: 1412.6980 Cited by: §4.4.
  • [23] D. G. Lowe (2004-11) Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision 60 (2), pp. 91–110. Cited by: §1, §3.2.
  • [24] F. Mtibaa and A. Chaabane (2014-01) Forestry wood supply chain information system using rfid technology. IIE Annual Conference and Expo, pp. 1562–1571. Cited by: §1.
  • [25] F. T. Ramos, J. Nieto, and H. F. Durrant-Whyte (2007-04) Recognising and modelling landmarks to close loops in outdoor slam. In ICRA, pp. 2036–2041. Cited by: §1.
  • [26] E. Rublee, V. Rabaud, K. Konolige, and G. Bradski (2011) ORB: an efficient alternative to sift or surf. In ICCV, pp. 2564–2571. Cited by: §2.2.
  • [27] S. P. Sannikov and V. V. Pobedinskiy (2018-05) Automated system for natural resources management. CEUR-WS. Cited by: §1.
  • [28] J. L. Schonberger, H. Hardmeier, T. Sattler, and M. Pollefeys (2017-07) Comparative evaluation of hand-crafted and learned local features. In CVPR, pp. 6959–6968. Cited by: §2.2.
  • [29] F. Schroff, D. Kalenichenko, and J. Philbin (2015-06) FaceNet: a unified embedding for face recognition and clustering. In CVPR, pp. 815–823. Cited by: §2.3, §4.4.
  • [30] E. Simo-Serra, E. Trulls, L. Ferraz, I. Kokkinos, P. Fua, and F. Moreno-Noguer (2015-12) Discriminative learning of deep convolutional feature point descriptors. In ICCV, pp. 118–126. Cited by: §1, §2.2, §2.3.
  • [31] J. Sivic and A. Zisserman (2003) Video google: a text retrieval approach to object matching in videos. In ICCV, pp. 1470–1477 vol.2. Cited by: §3.1.
  • [32] N. Smolyanskiy, A. Kamenev, J. Smith, and S. T. Birchfield (2017) Toward low-flying autonomous mav trail navigation using deep neural networks for environmental awareness. IROS, pp. 4241–4247. Cited by: §1.
  • [33] K. Sohn (2016) Improved deep metric learning with multi-class n-pair loss objective. In NIPS, pp. 1857–1865. Cited by: §2.3, §4.4.
  • [34] M. Sulc and J. Matas (2013-11) Kernel-mapped histograms of multi-scale lbps for tree bark recognition. In IVCNZ, Vol. , pp. 82–87. Cited by: §2.4.
  • [35] M. Svab (2014) Computer-vision-based tree trunk recognition. Univerza v Ljubljani. Cited by: §1, §2.4.
  • [36] Xufeng Han, T. Leung, Y. Jia, R. Sukthankar, and A. C. Berg (2015-06) MatchNet: unifying feature and metric learning for patch-based matching. In CVPR, Vol. 07-12-June, pp. 3279–3286. Cited by: §2.2.
  • [37] K. M. Yi, E. Trulls, V. Lepetit, and P. Fua (2016) LIFT: learned invariant feature transform. In ECCV, Cited by: §2.2.
  • [38] L. Zhang, A. Finkelstein, and S. Rusinkiewicz (2017) High-precision localization using ground texture. ICRA, pp. 6381–6387. Cited by: §2.4.
  • [39] L. Zhang and S. Rusinkiewicz (2018-06) Learning to detect features in texture images. In CVPR, pp. 6325–6333. Cited by: §2.4.
  • [40] L. Zheng, Y. Yang, and Q. Tian (2016-08) SIFT meets cnn: a decade survey of instance retrieval. Transactions on Pattern Analysis and Machine Intelligence PP. Cited by: §2.1, §2.2.
  • [41] Z. Zheng, L. Zheng, and Y. Yang (2017-12) A discriminatively learned cnn embedding for person reidentification. ACM Trans. Multimedia Comput. Commun. Appl. 14 (1), pp. 13:1–13:20. Cited by: §2.1.
  • [42] Zheru Chi, Li Houqiang, and Wang Chao (2003-12) Plant species recognition based on bark patterns using novel gabor filter banks. In NIPS, Vol. 2, pp. 1035–1038. Cited by: §2.4.