Deep Representations for Cross-spectral Ocular Biometrics

11/21/2019 · by Luiz A. Zanlorensi et al. · Universidade da Beira Interior / PUCPR

One of the major challenges in ocular biometrics is the cross-spectral scenario, i.e., matching images acquired at different wavelengths (typically visible (VIS) against near-infrared (NIR)). This article designs and extensively evaluates cross-spectral ocular verification methods, for both the closed- and open-world settings, using well-known deep learning representations based on the iris and periocular regions. Using as inputs the bounding boxes of non-normalized iris/periocular regions, we fine-tune Convolutional Neural Network (CNN) models (based either on the VGG16 or ResNet-50 architectures), originally trained for face recognition. Based on the experiments carried out on two publicly available cross-spectral ocular databases, we report results for intra-spectral and cross-spectral scenarios, with the best performance being observed when fusing ResNet-50 deep representations from both the periocular and iris regions. When compared to the state-of-the-art, the proposed solution consistently reduces the Equal Error Rate (EER) values by about 90% on the PolyU Bi-spectral and Cross-eye-cross-spectral datasets. Lastly, we evaluate the effect that the "deepness" factor of feature representations has on recognition effectiveness, and - based on a subjective analysis of the most problematic pairwise comparisons - we point out further directions for this field of research.


1 Introduction

Iris recognition using NIR wavelength images acquired under controlled environments can be considered a mature technology, which has proved effective in different scenarios [34]. In contrast, performing iris recognition in uncontrolled environments and at the VIS wavelength is still a challenging problem [33, 35]. Some of the latest research addresses biometric recognition in cross-spectral scenarios, i.e., using images of eyes from the same subject obtained at the VIS and NIR wavelengths [17, 41, 29, 4].

Recently, machine learning techniques based on deep learning have achieved great popularity due to the results reported in the literature, which advance the state-of-the-art in various problems, such as speech recognition [15, 47, 19], natural language processing [12, 43], digit and character recognition [20, 16, 21] and face recognition [32, 6]. In the field of ocular biometrics, deep learning representations have been advocated both for the periocular [25, 37] and iris [22, 11, 3, 30, 36, 29, 45, 46] regions, with interesting and promising results being reported.

As stated in previous works [27, 22], a frequent and open problem in ocular recognition is the matching of heterogeneous images captured at different resolutions, distances and with different devices (cross-sensor and cross-spectral). Regarding these problems, it is difficult to design a robust handcrafted feature extractor to address the intra-class variations present in these scenarios. In this sense, several recent works demonstrate that deep representations achieve better results than handcrafted features in iris and periocular region recognition [22, 25, 37, 45].

Having in mind that deep learning frameworks are typically able to produce robust representations, in this article we apply this family of frameworks to extract and combine features from the ocular region, obtained at different wavelengths, i.e., VIS and NIR. The strategy described in this article is composed of methodologies drawn from the literature. For both the iris and ocular traits we use as input the bounding-box delimited regions used in state-of-the-art methods [46, 25]. Then, the features from these traits were extracted using an approach similar to the one proposed in [46]. In this direction, the main contribution of this article is the extensive experiments on two datasets comparing iris, periocular, and fusion results for both cross-spectral (VIS to NIR) and intra-spectral (VIS to VIS, NIR to NIR) matching, reaching new state-of-the-art results. There are also the following four-fold contributions: (i) we show that deep learning yields robust representations on two well-known cross-spectral databases (PolyU and Cross-Eyed) for ocular verification using closed- and open-world protocols; (ii) we report how two off-the-shelf networks can be fine-tuned from the face domain to the periocular and iris ones; (iii) we analyze the use of a single deep representation extraction scheme for both cross-spectral and same-spectrum scenarios; and (iv) we assess the benefits of fusing the periocular and iris representations to improve recognition accuracy.

The remainder of this work is organized as follows. In Section 2, we describe some recent works that use deep learning for iris and periocular recognition. Section 3 provides the details of the proposed approach. Section 4 presents the databases, metrics and evaluation protocol used in our empirical evaluation. The results are presented and discussed in Section 5. Lastly, the conclusions are given in Section 6.

2 Related Work

This section surveys the works that use deep learning frameworks for iris and periocular recognition. Also, we summarize the most relevant ocular recognition methodologies focused on the cross-spectral scenario.

One of the first works applying deep learning to iris recognition was the DeepIris framework, proposed by Liu et al. [22]. Having as a goal the recognition of heterogeneous irises using images obtained by different sensors (i.e., the cross-sensor scenario), the authors proposed a framework that establishes the similarity between a pair of iris images with a CNN that learns a bank of pairwise filters. The experiments were performed on the Q-FIRE and CASIA cross-sensor databases, reporting promising results with EER values of % and %, respectively.

Another deep learning application for cross-sensor iris recognition, designated DeepIrisNet, was proposed by Gangwar & Joshi [11]. In their study, two CNN architectures were presented and used to extract features and representations of iris images. Compared to the baselines, their methodology showed better robustness with respect to five different factors: effect of segmentation, image rotation, input size, training size, and network size.

Nguyen et al. [30] argued that generic descriptors obtained from deep learning frameworks can appropriately represent iris features from NIR images acquired in controlled environments. The authors compared five CNN architectures trained on the ImageNet database [9]: AlexNet, VGG, Inception, ResNet and DenseNet. Deep representations were extracted from normalized iris images at different depths of each CNN model. Afterward, a simple multi-class SVM was applied to perform the identification. The experiments were carried out on the LG2200 (ND-CrossSensor-Iris-2013) and CASIA-Iris-Thousand databases and compared with a baseline feature descriptor [8]. As the main result, the authors argued that features extracted from intermediate layers of the networks achieved better results than the representations from the deeper layers.

Luz et al. [25] extracted deep representations of the periocular region using the VGG16 CNN model. The authors reported promising results by using transfer learning techniques from the face recognition domain, followed by fine-tuning with the ocular images. The experiments achieved the state-of-the-art in the NICE.II and MobBIO databases, which were acquired in uncontrolled environments at the VIS wavelength.

Also using the NICE.II database, Silva et al. [42] proposed a method for fusing iris and periocular deep representations by means of feature selection using Particle Swarm Optimization (PSO). Similar to the methodology proposed in [25], the iris and periocular deep representations were extracted with the VGG16 model trained for face recognition and fine-tuned for each trait. Promising results were reported in the verification mode using only iris information and also using the iris and periocular fusion.

Proença and Neves [37] argue that periocular recognition performance is optimized when the iris and sclera regions are discarded. These authors also describe a CNN-based processing chain that defines the regions of interest in the input image. In their approach, a segmentation process is only required to create the training samples. This process consists in generating a periocular image of a subject containing an ocular (sclera and iris) region belonging to other subjects. Then, the generated samples are used for data augmentation and to feed the learning phase of the CNN model. The experiments were performed on the UBIRIS.v2 and FRGC databases and consistently advanced the state-of-the-art in the closed-world setting.

Zanlorensi et al. [46] evaluated the impact of segmentation for noise removal and of normalization when deep representations are extracted from iris images. The experiments showed that deep representations extracted from an iris bounding box, without any segmentation process, achieved better results than those from normalized and segmented images. In addition, the authors compared representations extracted from the VGG16 and ResNet-50 models and the impact of using data augmentation techniques. A new state-of-the-art was reached in the NICE.II database using only information from the iris region.

In terms of cross-sensor iris recognition, the methodology proposed by Nalla and Kumar [29] introduced a domain adaptation framework to address this problem and reported a new approach using Markov random fields. The experiments were performed using two cross-sensor iris databases, IIT-D CLI and ND Cross sensor 2012, and one cross-spectral iris database, PolyU. On the PolyU database, in the closed-world verification protocol, they achieved an EER of % in NIR vs NIR comparisons and % in VIS vs VIS comparisons. Using the Markov random fields on cross-spectral comparisons, their methodology achieved % EER.

In [45], the authors evaluated a range of deep learning architectures applied to cross-spectral iris recognition. The experiments were performed on the PolyU and Cross-Eyed databases. The experimental analysis indicates that iris features extracted from CNN models are generally sparse and can be used for template compression. Several hashing algorithms were evaluated, and the most effective was supervised discrete hashing, which achieved more accurate performance while reducing the size of the iris templates. The best results were achieved by incorporating supervised discrete hashing on the deep representations extracted with a CNN model trained with a softmax cross-entropy loss. This methodology reached EER values of and on the PolyU and Cross-Eyed databases, respectively. However, the authors did not report the system performance on the open-world protocol, which is a more realistic scenario. Also, this methodology requires an approach for the segmentation and normalization of the iris. To the best of our knowledge, this work is the state-of-the-art on cross-spectral recognition in the verification mode. Thus, it is used for comparison with the methodology presented in this paper.

Also addressing the cross-spectral scenario, Hernandez-Diaz et al. [18] proposed a method using a ResNet-101 model pre-trained on the ImageNet database [9] to extract deep representations from periocular images. The experiments were carried out in verification mode using the IIITD Multispectral Periocular database [41] in three different spectra: Visible, Night Vision, and Near-Infrared. The results were reported using features extracted at each layer of the model, with chi-square distance and cosine similarity used to perform the matching. The authors stated that the features extracted from intermediate layers of the ResNet-101 model achieved the best results in the cross-spectral experiments.

Recently, two contests were held using the Cross-Eyed database, aiming to recognize iris and periocular (without the iris region) traits in a cross-spectral environment [38, 39]. However, as stated by Wang and Kumar [45], the results reported in these competitions should be considered preliminary, as they employed a comparison protocol that is less challenging than usual (only 3 images of each class were used in the inter-class comparison, instead of all against all) and did not provide information regarding which images of each class were used in the inter-class matching (only reporting that the images were randomly selected). Other problems include the unavailability of source code and of methodological details, which limits reproducibility.

Previous works on cross-spectral recognition such as [29, 45] use only iris traits and require a methodology for iris segmentation and normalization. Our proposal in this article combines information from the iris and periocular regions. Also, for the iris trait, we use only a bounding box, which requires neither segmentation for noise removal nor normalization steps.

For completeness, there are several other deep-learning-based applications with ocular images, such as: spoofing detection [28], recognition of mislabeled left and right iris images [10], liveness detection [13], iris/periocular region location/detection [40, 24], sclera and iris segmentation [23, 5], gender classification [44] and sensor model identification [26].

3 Methodology

In this paper, we analyze the use of deep representations from the eye regions (iris and periocular) in the cross-spectral scenario, i.e., obtaining models able to match VIS against NIR wavelength images. Particularly, we evaluate and combine deep representations extracted from two modalities (traits): the iris and periocular regions. In the periocular modality, features were extracted from the entire image (considering the iris, sclera, skin, eyelids and eyelashes components). In turn, the iris features were extracted from a bounding box, i.e., a cropped image that contains only the iris region, as described by Zanlorensi et al. [46]. These bounding boxes were generated manually by coarse annotations, are publicly available to the research community (https://web.inf.ufpr.br/vri/databases/iris-periocular-coarse-annotations/), and appear in [24]. Samples of the periocular and iris images used in this work are shown in Figure 1.
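
For illustration, cropping such a bounding box requires no segmentation or normalization. The following sketch assumes a hypothetical annotation format (x, y, width, height) and file names; it is not the authors' code.

```python
# Minimal sketch: cropping a non-normalized iris bounding box from a
# periocular image. The annotation format and file names are assumptions.
import cv2

def crop_iris(image_path, bbox):
    """Return the iris region delimited by a coarse bounding box."""
    image = cv2.imread(image_path)   # full periocular image (BGR)
    x, y, w, h = bbox                # coarse annotation: top-left corner + size
    return image[y:y + h, x:x + w]   # no segmentation or normalization is applied

iris_crop = crop_iris("subject001_L_vis_01.png", bbox=(120, 80, 140, 140))
cv2.imwrite("subject001_L_vis_01_iris.png", iris_crop)
```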

Figure 1: VIS (a,c) and NIR (b,d) samples from the PolyU (a,b) and Cross-Eyed (c,d) databases. The first and second rows show periocular and iris images, respectively.

Deep representations from the periocular and iris regions were extracted using an approach similar to the one proposed in [46]. In this way, the VGG16 [32] and ResNet-50 [6] CNN models trained for face recognition were fine-tuned to each modality. We chose these models because they reported promising results in recent works on ocular recognition [25, 46, 42, 45]. The architecture modifications for both models consist of the removal of the last layer and the addition of two new layers. The first one is a fully-connected layer with 256 neurons that is used as the feature representation and aims to reduce the feature dimensionality, since the original VGG16 and ResNet-50 models output 4,096 and 2,048 features, respectively. The other added layer has a softmax cross-entropy loss function and is used only in the training phase, in an identification mode. We chose a feature vector of 256 features based on the results reported by Luz et al. [25], where the authors evaluated different feature vector sizes and stated that a vector of this size (256) showed the best trade-off regarding matching time, amount of memory required and matching effectiveness. The strategy applied to extract features from NIR and VIS images is detailed in Figure 2.
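
A minimal Keras sketch of this head replacement is shown below, assuming a ResNet-50 backbone; the input size, learning rate, number of identities and the face-weights file path are illustrative placeholders, not the authors' actual settings.

```python
# Sketch: replace the classification head with a 256-d representation layer
# plus a softmax layer used only during fine-tuning in identification mode.
import tensorflow as tf
from tensorflow.keras import layers, models, optimizers

NUM_CLASSES = 418  # identities in the training set (placeholder value)

backbone = tf.keras.applications.ResNet50(
    include_top=False, weights=None, input_shape=(224, 224, 3), pooling="avg")
# backbone.load_weights("resnet50_face_weights.h5")  # face-recognition weights (placeholder path)

features = layers.Dense(256, name="deep_representation")(backbone.output)  # 256-d descriptor
logits = layers.Dense(NUM_CLASSES, activation="softmax", name="id_softmax")(features)

train_model = models.Model(backbone.input, logits)          # used only for fine-tuning
train_model.compile(optimizer=optimizers.SGD(learning_rate=1e-3),
                    loss="categorical_crossentropy", metrics=["accuracy"])
# All layers remain trainable (no weights are frozen), as described in the text.

feature_extractor = models.Model(backbone.input, features)  # used at test time
```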

Figure 2: The cross-/intra-spectral ocular recognition strategy. A single model (ResNet50 or VGG16) is used to learn features from both spectra: NIR and VIS.

The number of epochs used for training was chosen based on a validation subset composed of a fraction of the training set images. After defining the number of epochs, the CNN models were trained using the entire training set. The training was performed with the Stochastic Gradient Descent (SGD) optimizer and without freezing any weights of the pre-trained layers.

In the test phase, as previously mentioned, the last layer of each model was removed and the features were extracted from the new fully-connected layer, composed of 256 neurons.

The all-against-all matching was performed using the cosine distance metric, which measures the cosine of the angle between two vectors. Regarding the similarity of biometric features/representations, it is known that the orientation of the vectors matters more than their magnitude. The cosine distance metric faithfully captures this property, being given by:

$$d(\mathbf{A}, \mathbf{B}) = 1 - \frac{\mathbf{A} \cdot \mathbf{B}}{\|\mathbf{A}\| \, \|\mathbf{B}\|} \qquad (1)$$

where $\mathbf{A}$ and $\mathbf{B}$ stand for the feature vectors.

The iris and periocular region representations were combined by applying a score-level fusion technique. Similar to approaches that also used score-level fusion for the iris and periocular region traits [1, 2, 29], and also based on the individual performance of each trait in our experiments, we chose to use weights of and for the periocular region and iris representations, respectively. To perform fusion at the score level, we first compute the matching for each trait independently, and then calculate the weighted arithmetic mean between the cosine distances computed for the iris and periocular modalities.
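
The following sketch illustrates these matching and fusion steps; the fusion weights shown are placeholders, not necessarily the values used in the paper.

```python
# Sketch: cosine distance between 256-d descriptors and weighted score-level
# fusion of the periocular and iris distances (weights are placeholders).
import numpy as np

def cosine_distance(a, b):
    """1 - cos(angle) between two feature vectors, as in Eq. (1)."""
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def fused_score(peri_a, peri_b, iris_a, iris_b, w_peri=0.7, w_iris=0.3):
    """Weighted arithmetic mean of the per-trait cosine distances."""
    return (w_peri * cosine_distance(peri_a, peri_b) +
            w_iris * cosine_distance(iris_a, iris_b))

rng = np.random.default_rng(0)
pa, pb, ia, ib = (rng.normal(size=256) for _ in range(4))  # dummy 256-d descriptors
print(fused_score(pa, pb, ia, ib))
```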

It is important to note that, in the model learning process, all images (NIR and VIS) were used to feed the CNN models, so that a single model learns discriminant features from images captured in both spectra. To the best of our knowledge, this procedure is similar to the one adopted in [45] for the CNN architecture. In the test phase, features are extracted from all images, whether NIR or VIS. However, note that for evaluating the cross-spectral scenario, only images acquired under different wavelengths are paired for matching.

4 Databases, Metrics and Protocol

This section describes the databases used, the experimental protocol defined and the metrics considered appropriate to provide a meaningful comparison between our method and the baselines.

4.1 Databases

Two well-known databases were used in our empirical evaluation: 1) the PolyU and 2) the Cross-Eyed databases, described below.

4.1.1 PolyU database

The PolyU (PolyU Bi-spectral) database is composed of images obtained simultaneously under both NIR and VIS wavelengths. For every spectrum, there are 15 samples of each eye (left and right) from 209 subjects (418 classes), totaling 12,540 images [29].

4.1.2 Cross-Eyed database

The Cross-Eyed (Cross-eyed-cross-spectral) database has images from 120 subjects (240 classes). There are 8 samples from each class for every spectrum. All images were obtained at a distance of 1.5 meters, in an uncontrolled indoor environment, with a wide variation of ethnicities, eye colors, and lighting reflections [38].

4.2 Metrics

For evaluating the algorithms, we chose the EER metric, which is determined by the intersection point of the FAR (False Acceptance Rate) and FRR (False Rejection Rate) curves generated when the acceptance/rejection threshold is varied.
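
As an illustration, the EER can be approximated from genuine and impostor distance scores as in the sketch below; the threshold search by curve crossing is a common approximation, not necessarily the exact procedure used by the authors.

```python
# Sketch: EER as the point where FAR and FRR cross, given distance scores
# (lower distance = more similar).
import numpy as np

def compute_eer(genuine, impostor):
    thresholds = np.sort(np.concatenate([genuine, impostor]))
    far = np.array([(impostor <= t).mean() for t in thresholds])  # impostors accepted
    frr = np.array([(genuine > t).mean() for t in thresholds])    # genuines rejected
    idx = np.argmin(np.abs(far - frr))
    return (far[idx] + frr[idx]) / 2.0

genuine = np.random.normal(0.3, 0.1, 1000)   # toy genuine distances
impostor = np.random.normal(0.8, 0.1, 5000)  # toy impostor distances
print(f"EER = {100 * compute_eer(genuine, impostor):.2f}%")
```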

We also report the decidability score d' [7]. This index measures how well separated the two types of distributions (genuine and impostor) are, in the sense that recognition errors correspond to the regions where both distributions overlap:

$$d' = \frac{|\mu_G - \mu_I|}{\sqrt{(\sigma_G^2 + \sigma_I^2)/2}} \qquad (2)$$

where $\mu_G$, $\mu_I$, $\sigma_G$ and $\sigma_I$ stand for the means and standard deviations of the genuine and impostor distributions, respectively.

Whereas the d' index can be related to the feature vector discrimination ability of an approach, the EER metric measures the real performance of a biometric system. Therefore, regarding a real-world application, we consider the EER as the primary metric in the results reported in this work.
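
For reference, a direct NumPy implementation of Eq. (2) could look as follows (a sketch, not the authors' code):

```python
# Sketch: decidability index d' from genuine and impostor distance scores.
import numpy as np

def decidability(genuine, impostor):
    mu_g, mu_i = np.mean(genuine), np.mean(impostor)
    var_g, var_i = np.var(genuine), np.var(impostor)
    return abs(mu_g - mu_i) / np.sqrt((var_g + var_i) / 2.0)

print(decidability(np.random.normal(0.3, 0.1, 1000),
                   np.random.normal(0.8, 0.1, 5000)))
```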

4.3 Protocol

In all experiments, only the verification setting was considered, in which pairs of images are compared in order to determine whether a subject is who he claims to be. For this, following a one-against-all pairwise matching strategy, all pairs of genuine and impostor comparisons were generated.

For a fair comparison with the state-of-the-art methods, the test protocol used in this work follows the procedures given in [29, 45], which consist of a closed-world protocol, where different instances of the same class are distributed between the training and test sets. In the PolyU database, the first ten instances from every subject were used for training and the remainder (five) were employed for the matching. In the Cross-Eyed database, the first five instances from every subject were used for training and the remaining three instances were employed for the matching.

To perform the experiments, we considered that, in both databases, the NIR and VIS images were obtained synchronously. Thus, in the intra-class comparisons of the cross-spectral scenario, images of the same index were not matched, because such a pair represents the same image captured in different spectra. Note that, in the work by Wang and Kumar [45], the authors considered that the spectrum images of the Cross-Eyed database were obtained non-synchronously (based on the numbers of intra- and inter-class comparisons), so they matched NIR against VIS images of the same index in the intra-class comparison. Then, for a fair comparison with the state-of-the-art method [45], in the closed-world protocol we also report results considering that the NIR and VIS images were obtained non-synchronously in the Cross-Eyed database.

In order to evaluate the robustness of the proposed methodology, we also report results on the open-world protocol, in which the training and test sets contain images from different classes. In other words, there are no images from the same subject in both training and testing. In this protocol, for both databases, we use the first half of the subjects for training and the second half for testing.
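
A minimal sketch of the two splits is given below, assuming each sample is indexed by a (class_id, instance_id) pair; the helper names and instance counts (which follow the PolyU setting described above) are ours, for illustration only.

```python
# Sketch: closed-world vs. open-world splits over (class_id, instance_id) samples.
def closed_world_split(samples, n_train_instances=10):
    # every class appears in both sets, split by instance index
    train = [s for s in samples if s[1] < n_train_instances]
    test = [s for s in samples if s[1] >= n_train_instances]
    return train, test

def open_world_split(samples, n_classes):
    # first half of the classes for training, second half for testing
    train = [s for s in samples if s[0] < n_classes // 2]
    test = [s for s in samples if s[0] >= n_classes // 2]
    return train, test

samples = [(c, i) for c in range(418) for i in range(15)]  # PolyU-like indexing
cw_train, cw_test = closed_world_split(samples)
ow_train, ow_test = open_world_split(samples, n_classes=418)
```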

The distributions of images and classes in the training and test sets, as well as the number of genuine and impostor pairs generated in the test phase, for both databases and protocols, are detailed in Table 1.

Database Protocol Scenario Train/Test Images(Classes) Gen./Imp. pairs
PolyU CW Cross
PolyU CW Intra
PolyU OW Cross
PolyU OW Intra
Cross-Eyed CW Cross
Cross-Eyed CW Intra
Cross-Eyed OW Cross
Cross-Eyed OW Intra
Table 1: Genuine and impostor matches for the Closed-world (CW) and Open-world (OW) protocols on Cross- and Intra-spectral scenarios. *The comparison with the state-of-the-art methods was performed using the closed-world protocol.

The mean and standard deviation over the repetitions of the EER and decidability figures obtained by the proposed methodology are reported.

5 Results and Discussion

In this section, we present and discuss the results observed for the intra-spectral and cross-spectral scenarios, in both the iris and periocular modalities. We start by providing the results using the closed-world protocol, in order to establish a baseline with respect to the state-of-the-art. We also investigate the impact of the feature vector size and of the weights used to merge information from the periocular region and iris traits. Then, the results using the open-world protocol are presented, to assess how robust the obtained deep representations are. Using the ResNet-50 model, a comparison of the verification effectiveness using features extracted at various network depths is performed. Lastly, we perform a subjective analysis of the pairwise errors.

In a complementary setting, we explore the advantages yielded by fusing representations of the periocular and iris traits to improve performance. Similar to previous works [29, 1, 2] that applied higher weights to the most discriminating traits, and also considering that in all our experiments the periocular region reported better results than the iris, we decided to use constant weights of and , respectively, for the periocular and iris representations when obtaining the fused score by linear combination.

The experiments performed in this work used an NVIDIA Titan Xp GPU with 12 GB of memory and 3,840 CUDA cores, and the TensorFlow and Keras frameworks were used to implement the CNN models.

5.1 Closed-world protocol

Tables 2 and 3 report the results observed in verification mode, for the cross-spectral and intra-spectral scenarios (NIR against NIR and VIS against VIS), using the closed-world protocol. Similarly to Nalla and Kumar [29], and also to guarantee a fair comparison with their method, the fusion of the two spectra on the PolyU database was carried out by linear combination, using weights of and , respectively, for the NIR and VIS images. However, based on the individual spectral results, on the Cross-Eyed database we used weights of and for the VIS and NIR representations, respectively. Also, on the Cross-Eyed database, we can see that the spectral fusion using iris representations extracted by the VGG16 model reported worse results than using only the VIS spectral information. The results show that the representations obtained from NIR images presented a high EER value, which penalized the fusion of spectra. Therefore, a lower weight for the NIR representations may improve the fusion result. The results of those fusions are shown in Tables 2 and 3 (VIS and NIR Fusion section).

Approach Modality EER (%) Decidability
Cross-Spectral
CNN with SDH [45]* Iris
CNN with SDH [45] Iris
VGG16 with SDH [45]* Iris
Proposed VGG16 Iris
ResNet50 with SDH [45]* Iris
Proposed ResNet50 Iris
Proposed VGG16 Periocular
Proposed ResNet50 Periocular
Proposed VGG16 Fusion
Proposed ResNet50 Fusion
VIS vs VIS
Nalla and Kumar [29]* Iris
Proposed VGG16 Iris
Proposed ResNet50 Iris
Proposed VGG16 Periocular
Proposed ResNet50 Periocular
Proposed VGG16 Fusion
Proposed ResNet50 Fusion
NIR vs NIR
Nalla and Kumar [29]* Iris
Proposed VGG16 Iris
Proposed ResNet50 Iris
Proposed VGG16 Periocular
Proposed ResNet50 Periocular
Proposed VGG16 Fusion
Proposed ResNet50 Fusion
VIS and NIR Fusion
Nalla and Kumar [29]* Iris
Proposed VGG16 Iris
Proposed ResNet50 Iris
Proposed VGG16 Periocular
Proposed ResNet50 Periocular
Proposed VGG16 Fusion
Proposed ResNet50 Fusion
Table 2: Results - closed-world protocol on the PolyU database. *Using only 140 subjects from a total of 209.
Approach Modality EER (%) Decidability
Cross-spectral
CNN with SDH [45] Iris
VGG16 with SDH [45] Iris
Proposed VGG16* Iris
Proposed VGG16 Iris
ResNet50 with SDH [45] Iris
Proposed ResNet50* Iris
Proposed ResNet50 Iris
Proposed VGG16* Periocular
Proposed VGG16 Periocular
Proposed ResNet50* Periocular
Proposed ResNet50 Periocular
Proposed VGG16* Fusion
Proposed VGG16 Fusion
Proposed ResNet50* Fusion
Proposed ResNet50 Fusion
VIS vs VIS
Proposed VGG16 Iris
Proposed ResNet50 Iris
Proposed VGG16 Periocular
Proposed ResNet50 Periocular
Proposed VGG16 Fusion
Proposed ResNet50 Fusion
NIR vs NIR
Proposed VGG16 Iris
Proposed ResNet50 Iris
Proposed VGG16 Periocular
Proposed ResNet50 Periocular
Proposed VGG16 Fusion
Proposed ResNet50 Fusion
VIS and NIR Fusion
Proposed VGG16 Iris
Proposed ResNet50 Iris
Proposed VGG16 Periocular
Proposed ResNet50 Periocular
Proposed VGG16 Fusion
Proposed ResNet50 Fusion
Table 3: Results - closed-world protocol on the Cross-Eyed database. *same protocol used by Wang and Kumar [45].

It can be seen that, for both databases, the proposed approach achieves better results than the state-of-the-art methods, both in the cross-spectral and in the intra-spectral scenarios, even though the protocol used in this paper is more challenging. For example, in the PolyU database, we used images from all 209 subjects in the experiments, while the approaches proposed by Wang and Kumar [45] and by Nalla and Kumar [29] used images from only 140 subjects. In the Cross-Eyed database, based on the number of intra-class comparison pairs reported in the experiments by Wang and Kumar [45], the authors considered that the database contains images obtained non-synchronously. Images from the Cross-Eyed database were obtained using a dual sensor with a beam splitter, so the NIR and VIS images are acquired simultaneously. However, we visually verified that images of the same index, i.e., those that should correspond to the same capture in NIR and VIS, present a random shift in each spectrum. Thus, for a fair comparison with the state-of-the-art approaches, we report results using both protocols, considering the images obtained synchronously and non-synchronously. Note that we collected the state-of-the-art results from the original papers [29, 45], i.e., we did not reimplement any approach from these works.

In terms of CNN architectures, the ResNet-50 model reported lower EER values than the VGG16 model in all cases. However, in some cases, specifically in the PolyU database, the representations extracted with the VGG16 model achieved a better separation of the intra- and inter-class distributions, as can be seen in the decidability index.

The results show that, in Cross-Eyed, the periocular modality achieves better results than the iris one. However, in the PolyU database, there is no significant difference between iris and periocular representations, mainly in the intra-spectral experiments. From a visual inspection of the pairwise comparison errors (some examples are shown in Section 5.5), we noticed that, in the PolyU database, some uncontrolled conditions present in the images, such as pose, eye gaze, and rotation, may penalize the quality of the periocular representations. These conditions are more controlled in the Cross-Eyed images. Also, Cross-Eyed images are smaller than PolyU images, so the iris region is even smaller, and the periocular images are better centered on the iris region in the Cross-Eyed database than in the PolyU one. Nevertheless, Cross-Eyed images present a more significant difference in color and illumination among classes, which makes them more distinct and may explain the better results in VIS against VIS comparisons than in NIR against NIR.

5.2 Feature size and fusion weights analyses

In this section, we analyze and discuss the impact of feature vector size and the weights used for the fusion of the iris and periocular region representations.

As stated in Section 3, we chose the feature size of 256 based on the experiments and results reported by Luz et al. [25]. Nevertheless, we also performed experiments creating new models with different sizes in the last layer before the softmax one, i.e., the layer used to extract the features (representations). The results of the fusion of iris and periocular representations extracted with these models are presented in Table 4. Luz et al. [25] stated that, for the cosine distance metric, high-dimensional vectors resulted in better performance. Conversely, our results show that representations extracted with the ResNet-50 model achieve lower EER values when the feature vector is smaller. The same occurs with the VGG16 model features in the PolyU database. Regarding the decidability index, the size of the feature vector does not appear to have much impact. These results may be related to the fact that both models can generate sparse feature vectors, as stated by Wang and Kumar [45]. Thus, a bigger feature vector will not always improve the performance of the biometric system. Here, we decided to keep a feature vector size of 256 because it keeps a good trade-off between EER and decidability.

Figure 3: Periocular weight impact on the trait fusion (VGG16 and ResNet50 features) in the cross-spectral scenario on the PolyU (top row) and Cross-Eyed (bottom row) databases.
Model Feat. Size PolyU Cross-Eyed
EER (%) Decidability EER (%) Decidability
ResNet50
VGG16
Table 4: Feature vector size results fusing the iris and periocular region traits in the cross-spectral scenario.

As described in Section 3, similar to some approaches in the literature [1, 2, 29] and based on the individual performance in our experiments, we chose weights of and for the periocular and iris fusion, respectively. Nevertheless, in this section, we evaluate the impact of different iris and periocular weights on the fusion of the trait representations in the cross-spectral scenario, for both models. We impose $w_p + w_i = 1$, with $w_p, w_i \in [0, 1]$, where $w_p$ and $w_i$ stand for the periocular and iris weights, respectively. The results are reported in Figure 3.

Even though the EER values are lower using features extracted with the ResNet-50 model, we can observe a similar behaviour regarding the weight variation in both databases for both models. That is, the best results are achieved when the weights are appropriately combined. We can also observe that the periocular trait has more impact on the Cross-Eyed database than on the PolyU database. We also note that, on the PolyU database, in some cases, a fusion with a higher iris weight (using VGG16 features) may achieve a lower EER value.
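
The weight sweep behind Figure 3 can be sketched as below; it reuses the compute_eer helper from the sketch in Section 4.2, and the step size is an assumption for illustration.

```python
# Sketch: EER of the fused score as the periocular weight varies from 0 to 1
# (iris weight = 1 - w_peri); scores and labels are NumPy arrays.
import numpy as np

def sweep_fusion_weights(peri_scores, iris_scores, labels, step=0.1):
    # labels: 1 for genuine pairs, 0 for impostor pairs
    results = {}
    for w in np.arange(0.0, 1.0 + 1e-9, step):
        fused = w * peri_scores + (1.0 - w) * iris_scores
        results[round(float(w), 1)] = compute_eer(fused[labels == 1],
                                                  fused[labels == 0])
    return results
```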

5.3 Open-world protocol

The experimental results observed for the open-world scenario are presented in Tables 5 and 6 for the PolyU and Cross-Eyed databases, respectively. Notice that this protocol is more challenging since there are no samples of the test classes in the training set. Another factor that makes it more difficult is that, compared to the closed-world protocol, fewer images are available for model training and there are more images in the test set, increasing the number of genuine and impostor comparison pairs.

Approach Modality EER (%) Decidability
Cross-spectral
Proposed ResNet50 Iris
Proposed ResNet50 Periocular
Proposed ResNet50 Fusion
VIS vs VIS
Proposed ResNet50 Iris
Proposed ResNet50 Periocular
Proposed ResNet50 Fusion
NIR vs NIR
Proposed ResNet50 Iris
Proposed ResNet50 Periocular
Proposed ResNet50 Fusion
Table 5: Verification in the open-world protocol on the PolyU database.
Approach Modality EER (%) Decidability
Cross-spectral
Proposed ResNet50 Iris
Proposed ResNet50 Periocular
Proposed ResNet50 Fusion
VIS vs VIS
Proposed ResNet50 Iris
Proposed ResNet50 Periocular
Proposed ResNet50 Fusion
NIR vs NIR
Proposed ResNet50 Iris
Proposed ResNet50 Periocular
Proposed ResNet50 Fusion
Table 6: Results - open-world protocol on the Cross-Eyed database.
Figure 4: ROC curves comparing the closed- and open-world protocols on the PolyU (top row) and Cross-Eyed (bottom row) databases.

To perceive the differences in performance, a comparison between the closed- and open-world results is shown with the ROC curves in Figure 4. Even though a fully fair comparison between the closed- and open-world protocols is not feasible, because the number of subjects used for learning is different, it is noticeable that the open-world protocol yields worse performance in all modes compared to the closed-world protocol. Nevertheless, we conclude that fusing the periocular and iris representations also leads to promising results in the open-world protocol, given that the observed decidability was higher than three for both databases considered.

5.4 ResNet-50: Performance vs. Network Depth

Having concluded that the ResNet-50 model yields the best results in terms of EER in our experiments, our next goal was to perceive how the verification performance varies with respect to the depth of the layer from which the representations are taken. In this experiment, we considered all the convolution layers with stride equal to 2, resulting in four different depths to be tested. For each of the four possibilities (depths), the same modifications described in the methodology section were made, adding a fully-connected layer with 256 neurons and a layer with a softmax cross-entropy loss function. The verification results using the different depths are reported in Table 7 for the PolyU and Cross-Eyed databases.
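
A sketch of how such truncated models can be built in Keras is given below; the listed cut points are the standard stage outputs of the Keras ResNet-50 implementation, used here as plausible stride-2 cut layers rather than the authors' exact choices, and the input size and class count are placeholders.

```python
# Sketch: truncate ResNet-50 at a stage output, then attach the same
# 256-d representation + softmax head used in the full-depth model.
import tensorflow as tf
from tensorflow.keras import layers, models

CUT_POINTS = ["conv2_block3_out", "conv3_block4_out",
              "conv4_block6_out", "conv5_block3_out"]  # increasing depth

def truncated_resnet(cut_layer, num_classes=418):
    base = tf.keras.applications.ResNet50(include_top=False, weights=None,
                                          input_shape=(224, 224, 3))
    x = layers.GlobalAveragePooling2D()(base.get_layer(cut_layer).output)
    x = layers.Dense(256, name="deep_representation")(x)
    out = layers.Dense(num_classes, activation="softmax")(x)
    return models.Model(base.input, out)

model = truncated_resnet("conv3_block4_out")
print(f"{model.count_params() / 1e6:.1f}M parameters")
```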

Spec. Trait layers layers layers layers
(26M) (14.5M) (15.6M) (24.1M)
PolyU
VIS Iris
VIS Perioc.
VIS Fusion
NIR Iris
NIR Perioc.
NIR Fusion
Cross Iris
Cross Perioc.
Cross Fusion
Cross-Eyed
VIS Iris
VIS Perioc.
VIS Fusion
NIR Iris
NIR Perioc.
NIR Fusion
Cross Iris
Cross Perioc.
Cross Fusion
Table 7: EER values observed for different depths (trainable parameters) of the ResNet-50 architecture, using the closed-world protocol.

It can be observed that the largest degradation of the results when using shallower models occurs in the Cross-Eyed database. In all cases, the VIS against VIS comparison reports the best results, and it is the scenario with the lowest degradation across the different depths of the model.

As shown by the NIR against NIR and cross-spectral results in the Cross-Eyed database, some EER values for the fusion of traits are higher than the ones using information from the periocular region only. This behavior is due to the weights used in the fusion of features, where the low discrimination of the iris region penalizes and degrades the fused matching score, as discussed in Section 5.2.

The experiments performed by Nguyen et al. [30] show that features extracted from intermediate layers of the networks achieved better results than deep-layer representations. However, our results report lower EER rates using features extracted from deeper layers. It is important to point out that in [30] the ResNet-152 model (i.e., a deeper model than the ResNet-50 used in our work) was employed. The same behavior can be observed in the work by Hernandez-Diaz et al. [18], where the authors stated that features extracted from the intermediate layers of the ResNet-101 model reported the best results. Thus, the deepest layer reported in this work is approximately at the same depth as the intermediate layers reported by Nguyen et al. [30] and by Hernandez-Diaz et al. [18]. In another work, Hernandez-Diaz et al. [14] reported that, using the ResNet-50 model, representations from intermediate layers achieved better results in the UBIPr periocular database [31]. Conversely, in this work, periocular representations extracted from the last layer of the ResNet-50 model achieved the best results. Notice that the UBIPr database has larger images (with resolution varying according to the acquisition distance) than the PolyU and Cross-Eyed databases, and also a more extensive periocular region, containing eyebrow information, which can explain why a shallower model can extract more discriminant features from the intermediate layers in that case.

As described in [45], a disadvantage of the VGG16 model, when compared to ResNet, is its larger number of trainable parameters (compared to their CNN with SDH methodology). As previously stated, in our case the best results were observed when using the ResNet-50 model, which after the modifications has about four times fewer parameters than VGG16. As shown in Table 7, smaller networks in terms of depth lead to increasingly high losses in performance, while also decreasing the number of trainable parameters by nearly 10M, which can be an interesting solution for embedded systems and other cases where the computational complexity might be a concern. The ResNet with layers has more trainable parameters than the other truncated models, since it considers an input image of pixels and filters. In addition, its convolutional part is connected to a fully-connected layer containing 256 neurons, added for reduction of the feature dimensionality.

5.5 Subjective evaluation

In order to provide some insight into the weaknesses of the solutions proposed in this paper, and also to provide a basis for subsequent improvements of the technology, this section highlights some notable cases of pairwise image comparisons that led to matching errors (using the closed-world protocol). The results are shown in Figure 5, grouped into the worst genuine comparisons (where the system rejected a true match) and the best impostor comparisons (where the system accepted a false match).

Figure 5: Pairwise comparison errors (worst genuine and best impostor pairs) in the VIS against VIS scenario on the Cross-Eyed (left) and PolyU (right) databases. Periocular and iris matching modalities are presented in the top and bottom rows, respectively.

Although Figure 5 only shows VIS images, we noticed that pose and gaze are factors that can also lead to matching errors in the NIR against NIR and cross-spectral scenarios. We also observed confusions between images of the same subject but from different classes (left and right eyes), regardless of the spectral scenario. Thus, we believe that it is possible to improve the recognition accuracy by using information based on the angle of the periocular region images and also by applying a preprocessing step to determine the left and right eyes (i.e., a soft biometrics process). Also, based on the pairwise comparison errors, we can state that another factor that may improve system accuracy is the centralization/resizing of the periocular image based on the iris region size and location, similar to the method proposed by Hernandez-Diaz et al. [14].

6 Conclusion

In this work, we performed extensive experiments on two databases for both cross-spectral and intra-spectral ocular recognition. A strategy combining methodologies from the literature was applied to reach new state-of-the-art results on both databases. This shows that there is still room for improvement by applying and merging known methodologies from the literature to tackle cross-spectral ocular recognition.

We also discussed how deep representations from the iris and ocular regions (extracted using the VGG16 and ResNet-50 architectures) can be fused to improve the recognition performance on the challenging cross-spectral recognition problem. We used CNN models that were pre-trained for face recognition and fine-tuned each one for a specific biometric modality: iris and periocular. A single model per trait was trained for feature extraction using both NIR and VIS images. The matching phase, in verification mode, was performed using the cosine distance metric. In order to provide a fair comparison with the state-of-the-art approaches, we used the closed-world protocol. However, we also reported results on the open-world protocol to evaluate the robustness of the proposed methodology.

Our experiments showed that the models learned on the ResNet-50 architecture reported better results in terms of EER than their VGG16 counterparts, both in the PolyU and Cross-Eyed databases. Interestingly, we note that even this simple processing chain was able to advance the state-of-the-art results in both datasets.

Overall, in most of the experiments, features taken from the periocular region were observed to provide better performance than iris features, with the fusion of these two modalities improving the EER value and the decidability index over the best individual trait.

In a complementary way, we analyzed the impact of the feature vector size and of the iris and periocular weights used for the fusion of the trait representations, as well as how the recognition performance varies with respect to the depth of the models used for feature extraction, i.e., by using intermediate layers of the ResNet-50 model to obtain the feature sets used in the matching phase.

Finally, a subjective analysis of the worst genuine and best impostor pairwise image comparisons was also performed, showing that factors such as the angle of image capture may interfere with the accuracy of the recognition system. In this direction, as future work, we plan to investigate how to build representations that take eye gaze and pose into account.

Acknowledgment

This work was supported by grants from the National Council for Scientific and Technological Development (CNPq) (Nos. 428333/2016-8, 313423/2017-2 and 306684/2018-2) and the Coordination for the Improvement of Higher Education Personnel (CAPES). We also gratefully acknowledge the support of NVIDIA Corporation with the donation of the Titan Xp GPU used for this research. The fourth author's work is funded by FCT/MEC through national funds and co-funded by FEDER - PT2020 partnership agreement under the projects UID/EEA/50008/2019 and POCI-01-0247-FEDER-033395.

References

  • [1] N. U. Ahmed, S. Cvetkovic, E. H. Siddiqi, A. Nikiforov, and I. Nikiforov (2016-12) Using fusion of iris code and periocular biometric for matching visible spectrum iris images captured by smart phone cameras. In 2016 23rd International Conference on Pattern Recognition (ICPR), pp. 176–180. External Links: Document, ISSN Cited by: §3, §5.2, §5.
  • [2] N. U. Ahmed, S. Cvetkovic, E. H. Siddiqi, A. Nikiforov, and I. Nikiforov (2017) Combining iris and periocular biometric for matching visible spectrum eye images. Pattern Recognition Letters 91, pp. 11 – 16. Note: Mobile Iris CHallenge Evaluation (MICHE-II) External Links: ISSN 0167-8655, Document Cited by: §3, §5.2, §5.
  • [3] A. S. Al-Waisy, R. Qahwaji, S. Ipson, S. Al-Fahdawi, and T. A. M. Nagem (2017) A multi-biometric iris recognition system based on a deep learning approach. Pattern Analysis and Applications. Cited by: §1.
  • [4] F. M. Algashaam, K. Nguyen, M. Alkanhal, V. Chandran, W. Boles, and J. Banks (2017) Multispectral Periocular Classification With Multimodal Compact Multi-Linear Pooling. IEEE Access 5, pp. 14572–14578. Cited by: §1.
  • [5] C. S. Bezerra, R. Laroca, D. R. Lucio, E. Severo, L. F. Oliveira, A. S. Britto Jr., and D. Menotti (2018-10) Robust iris segmentation based on fully convolutional networks and generative adversarial networks. In Conference on Graphics, Patterns and Images (SIBGRAPI), Vol. , pp. 281–288. External Links: Document, ISSN 2377-5416 Cited by: §2.
  • [6] Q. Cao, L. Shen, W. Xie, O. M. Parkhi, and A. Zisserman (2017) VGGFace2: A dataset for recognising faces across pose and age. CoRR. External Links: Link, 1710.08092 Cited by: §1, §3.
  • [7] J. Daugman (2003) The importance of being random: statistical principles of iris recognition. Pattern Recognition 36 (2), pp. 279–291. External Links: Document, ISBN 0031-3203, ISSN 00313203 Cited by: §4.2.
  • [8] J. Daugman (2004) How iris recognition works. IEEE Trans. on Circuits and Systems for Video Technology 14 (1), pp. 21–30. Cited by: §2.
  • [9] J. Deng, W. Dong, R. Socher, L. J. Li, K. Li, and L. Fei-Fei (2009) ImageNet: a large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255. Cited by: §2, §2.
  • [10] Y. Du, T. Bourlai, and J. Dawson (2016) Automated classification of mislabeled near-infrared left and right iris images using convolutional neural networks. In IEEE International Conference on Biometrics Theory, Applications and Systems (BTAS), Vol. , pp. 1–6. External Links: Document, ISSN Cited by: §2.
  • [11] A. Gangwar and A. Joshi (2016) DeepIrisNet: deep iris representation with applications in iris recognition and cross-sensor iris recognition. In IEEE Intern. Conference on Image Processing, Vol. , pp. 2301–2305. External Links: Document, ISSN Cited by: §1, §2.
  • [12] X. Glorot, A. Bordes, and Y. Bengio (2011) Domain adaptation for large-scale sentiment classification: a deep learning approach. In Proceedings of the 28th International Conference on International Conference on Machine Learning, pp. 513–520. Cited by: §1.
  • [13] L. He, G. Poggi, C. Sansone, L. Verdoliva, H. Li, F. Liu, N. Liu, Z. Sun, and Z. He (2016) Multi-patch convolution neural network for iris liveness detection. In 2016 IEEE 8th International Conference on Biometrics Theory, Applications and Systems (BTAS), pp. 1–7. Cited by: §2.
  • [14] K. Hernandez-Diaz, F. Alonso-Fernandez, and J. Bigun (2018-Sep.) Periocular recognition using cnn features off-the-shelf. In 2018 International Conference of the Biometrics Special Interest Group (BIOSIG), Vol. , pp. 1–5. External Links: Document, ISSN Cited by: §5.4, §5.5.
  • [15] G. Hinton, L. Deng, D. Yu, G. Dahl, A. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T. Sainath, and B. Kingsbury (2012) Deep Neural Networks for Acoustic Modeling in Speech Recognition: The Shared Views of Four Research Groups. IEEE Signal Processing Magazine 29 (6), pp. 82–97. Cited by: §1.
  • [16] A.G. Hochuli, L.S. Oliveira, A.S. B. Jr, and R. Sabourin (2018) Handwritten digit segmentation: is it still necessary?. Pattern Recognition 78, pp. 1 – 11. External Links: ISSN 0031-3203, Document Cited by: §1.
  • [17] M. S. Hosseini, B. N. Araabi, and H. Soltanian-Zadeh (2010) Pigment melanin: Pattern for iris recognition. IEEE Transactions on Instrumentation and Measurement 59 (4), pp. 792–804. External Links: Document, 0911.5462, ISBN 0018-9456, ISSN 00189456 Cited by: §1.
  • [18] F. A. K. Hernandez-Diaz and J. Bigun (2019) Cross spectral periocular matching using resnet features. In International Conference on Biometrics(ICB), Vol. , pp. 1–6. Note: In Press External Links: ISSN Cited by: §2, §5.4.
  • [19] S. Kim, T. Hori, and S. Watanabe (2017) Joint CTC-attention based end-to-end speech recognition using multi-task learning. In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4835–4839. External Links: Document, ISBN 978-1-5090-4117-6 Cited by: §1.
  • [20] R. Laroca, E. Severo, L. A. Zanlorensi, L. S. Oliveira, G. R. Gonçalves, W. R. Schwartz, and D. Menotti (2018-07) A robust real-time automatic license plate recognition based on the yolo detector. In 2018 International Joint Conference on Neural Networks (IJCNN), Vol. , pp. 1–10. External Links: Document, ISSN 2161-4407 Cited by: §1.
  • [21] R. Laroca, L. A. Zanlorensi, G. R. Gonçalves, E. Todt, W. R. Schwartz, and D. Menotti (2019) An efficient and layout-independent automatic license plate recognition system based on the YOLO detector. arXiv preprint arXiv:1909.01754 (), pp. 1–14. Cited by: §1.
  • [22] N. Liu, M. Zhang, H. Li, Z. Sun, and T. Tan (2016) DeepIris: learning pairwise filter bank for heterogeneous iris verification. Pattern Recognition Letters 82, pp. 154–161. External Links: ISSN 0167-8655, Document Cited by: §1, §1, §2.
  • [23] D. R. Lucio, R. Laroca, E. Severo, A. S. Britto Jr., and D. Menotti (2018-10) Fully convolutional networks and generative adversarial networks applied to sclera segmentation. In 2018 IEEE 9th International Conference on Biometrics Theory, Applications and Systems (BTAS), Vol. , pp. 1–7. External Links: Document, ISSN 2474-9699 Cited by: §2.
  • [24] D. R. Lucio, R. Laroca, L. A. Zanlorensi, G. Moreira, and D. Menotti (2019-10) Simultaneous iris and periocular region detection using coarse annotations. In Conference on Graphics, Patterns and Images (SIBGRAPI), Vol. , pp. 1–8. Note: In Press Cited by: §2, §3.
  • [25] E. Luz, G. M., L. A. Z. Junior, and D. Menotti (2018) Deep periocular representation aiming video surveillance. Pattern Recognition Letters 114, pp. 2 – 12. Cited by: §1, §1, §1, §2, §2, §3, §5.2.
  • [26] F. M. et al. (2017) A deep learning approach for iris sensor model identification. Pattern Recognition Letters. External Links: ISSN 0167-8655, Document Cited by: §2.
  • [27] M. D. Marsico, A. Petrosino, and S. Ricciardi (2016) Iris recognition through machine learning techniques: a survey. Pattern Recognition Letters 82, pp. 106 – 115. Note: An insight on eye biometrics External Links: ISSN 0167-8655 Cited by: §1.
  • [28] D. Menotti, G. Chiachia, A. Pinto, W. R. Schwartz, H. Pedrini, A. X. Falcão, and A. Rocha (2015) Deep representations for iris, face, and fingerprint spoofing detection. IEEE Transactions on Information Forensics and Security 10 (4), pp. 864–879. Cited by: §2.
  • [29] P. R. Nalla and A. Kumar (2017) Toward more accurate iris recognition using cross-spectral matching. IEEE Transactions on Image Processing 26 (1), pp. 208–221. Cited by: §1, §1, §2, §2, §3, §4.1.1, §4.3, §5.1, §5.1, §5.2, Table 2, §5.
  • [30] K. Nguyen, C. Fookes, A. Ross, and S. Sridharan (2018) Iris recognition with off-the-shelf CNN features: a deep learning perspective. IEEE Access 6, pp. 18848–18855. Cited by: §1, §2, §5.4.
  • [31] C. N. Padole and H. Proenca (2012-03) Periocular recognition: Analysis of performance degradation factors. In IAPR International Conference on Biometrics (ICB), New Delhi, India, pp. 439–445. External Links: ISBN 978-1-4673-0397-2 Cited by: §5.4.
  • [32] O. M. Parkhi, A. Vedaldi, and A. Zisserman (2015) Deep face recognition. In British Machine Vision Conference (BMVC), pp. 1–12. Cited by: §1, §3.
  • [33] H. Proença and L. A. Alexandre (2005) UBIRIS: A noisy iris image database. In 13th International Conference on Image Analysis and Processing - ICIAP 2005, Vol. 3617, pp. 970–977. Cited by: §1.
  • [34] H. Proença and L. A. Alexandre (2012) Toward covert iris biometric recognition: experimental results from the NICE contests. IEEE Transactions on Information Forensics and Security 7 (2). External Links: Document, ISSN 1556-6013 Cited by: §1.
  • [35] H. Proença, S. Filipe, R. Santos, J. Oliveira, and L. A. Alexandre (2010) The UBIRIS.v2: a database of visible wavelength iris images captured on-the-move and at-a-distance. IEEE Transactions on Pattern Analysis and Machine Intelligence 32 (8), pp. 1529–1535. External Links: Document, ISSN 0162-8828 Cited by: §1.
  • [36] H. Proença and J. C. Neves (2017) IRINA: Iris Recognition (Even) in Inaccurately Segmented Data. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Vol. 2017-Janua, pp. 6747–6756. External Links: Document, ISBN 978-1-5386-0457-1 Cited by: §1.
  • [37] H. Proença and J. C. Neves (2018) Deep-PRWIS: Periocular Recognition Without the Iris and Sclera Using Deep Learning Frameworks. IEEE Transactions on Information Forensics and Security 13 (4), pp. 888–896. External Links: Document, ISSN 1556-6013 Cited by: §1, §1, §2.
  • [38] A. Sequeira, L. Chen, P. Wild, J. Ferryman, F. Alonso-Fernandez, K. B. Raja, R. Raghavendra, C. Busch, and J. Bigun (2016) Cross-Eyed - Cross-Spectral Iris/Periocular Recognition Database and Competition. In 2016 International Conference of the Biometrics Special Interest Group (BIOSIG), Vol. P-260, pp. 1–5. Cited by: §2, §4.1.2.
  • [39] A. F. Sequeira, L. Chen, J. Ferryman, P. Wild, F. Alonso-Fernandez, J. Bigun, K. B. Raja, R. Raghavendra, C. Busch, T. de Freitas Pereira, S. Marcel, S. S. Behera, M. Gour, and V. Kanhangad (2017-10) Cross-eyed 2017: cross-spectral iris/periocular recognition competition. In 2017 IEEE International Joint Conference on Biometrics (IJCB), Vol. , pp. 725–732. External Links: ISSN 2474-9699 Cited by: §2.
  • [40] E. Severo, R. Laroca, C. S. Bezerra, L. A. Zanlorensi, D. Weingaertner, G. M., and D. Menotti (2018-07) A benchmark for iris location and a deep learning detector evaluation. In 2018 International Joint Conference on Neural Networks (IJCNN), Vol. , pp. 1–7. External Links: Document, ISSN 2161-4407 Cited by: §2.
  • [41] A. Sharma, S. Verma, M. Vatsa, and R. Singh (2014) On cross spectral periocular recognition. 2014 IEEE International Conference on Image Processing, ICIP 2014, pp. 5007–5011. External Links: Document, ISBN 9781479957514 Cited by: §1, §2.
  • [42] P. H. Silva, E. Luz, L. A. Zanlorensi, D. Menotti, and G. Moreira (2018) Multimodal feature level fusion based on particle swarm optimization with deep transfer learning. In 2018 Congress on Evolutionary Computation (CEC), pp. 1–8. Cited by: §2, §3.
  • [43] R. Socher, J. Pennington, E. H. Huang, A. Y. Ng, and C. D. Manning (2011) Semi-supervised recursive autoencoders for predicting sentiment distributions. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pp. 151–161. Cited by: §1.
  • [44] J. Tapia and C. Aravena (2017) Gender classification from nir iris images using deep learning. In Deep Learning for Biometrics, pp. 219–239. Cited by: §2.
  • [45] K. Wang and A. Kumar (2019) Cross-spectral iris recognition using cnn and supervised discrete hashing. Pattern Recognition 86, pp. 85 – 98. Cited by: §1, §1, §2, §2, §2, §3, §3, §4.3, §4.3, §5.1, §5.2, §5.4, Table 2, Table 3.
  • [46] L. A. Zanlorensi, E. Luz, R. Laroca, A. S. Britto Jr., L. S. Oliveira, and D. Menotti (2018-10) The impact of preprocessing on deep representations for iris recognition on unconstrained environments. In Conference on Graphics, Patterns and Images (SIBGRAPI), Vol. , pp. 289–296. External Links: Document, ISSN 2377-5416 Cited by: §1, §1, §2, §3, §3.
  • [47] Y. Zhang, W. Chan, and N. Jaitly (2017) Very deep convolutional networks for end-to-end speech recognition. In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4845–4849. External Links: Document, 1610.03022, ISBN 978-1-5090-4117-6, ISSN 15206149 Cited by: §1.