Gabor kernels are widely recognized as an effective tool for iris feature extraction, as proposed by Daugman[Daugman_PAMI_1993], and still are a dominant approach for calculating iris codes in iris recognition systems. Daugman’s method is straightforward in principle: convolve a normalized iris image with a set of kernels (only real parts of Gabor kernels are used in commercial systems [daugman2015information]
), binarize the result to form a code, and use only “strong” (not “fragile”) bits corresponding to non-occluded parts of the iris in calculating the distance between two iris samples. An important question is how to find optimal kernels that offer highly individual iris codes. Should these kernels be selected only from a family of Gabor kernels, as in a dominant approach, or can they be selected from a broader set of functions? And if we start from a broader set of kernels and run a hyperparameter search, will we eventually (and independently of the database) converge to Gabor kernels? Since they have solid theoretical foundations in computer vision (e.g., they serve as uncertainty-minimising elementary functions, and have been found to model activation profiles of visual cortex neurons[jones1987evaluation]), observing that such a hyperparameter search always converges to Gabor kernels would simply justify our choice for using these kernels in iris recognition. If, however, optimal filtering kernels turn out to be different than Gabor, what do they look like? And how to find them effectively given a database of iris images? Do they perform better than Gabor, or worse? This paper answers these questions.
We propose to use a single-layer convolutional neural network that replicates a Daugman’s approach to iris recognition. This network provides a framework to learn data-driven kernels for iris recognition that can be directly implanted into open-source tools such as OSIRIS [othman2016osiris]
. The input to the network is a normalized iris image and the output is a feature vector of sizewhich mimics an iris code. Figure 1
shows the high-level overview of the pipeline. The iris occlusion mask is incorporated directly into training, so the network does not use non-iris areas in searching for optimal filters. We also propose a triplet loss function which incorporates normalized occlusion masks into the loss and also uses a domain-specific distance metric and a soft margin[hermans2017defense]. We considered starting the training procedure from randomly initialized weights, and from weights initialized with open-source Gabor kernels.
The main and interesting findings presented in this paper are:
the network training procedure converges to kernels deviant of Gabor kernels, instead converging on a combination edge detectors and blob detectors,
the above happens independently of the weight initialization procedure (initialization with Gabor kernels or random numbers),
these learned kernels are also shown to offer similar performance than the only known open-source Gabor kernels included with the OSIRIS tool.
The data used to train the kernels was a large set of in-house data of more than 340,000 iris images collected from 3,000 distinct eyes. For testing result reporting purposes we used three independent datasets of varying difficulty:
WVU Non-ideal Iris Database – Release 1,
Live partition from the entire LivDet 2017 Liveness Detection-Iris – Warsaw Subset.
Practical contributions of this paper include:
kernels that are ready to be used in any implementation of Daugman’s algorithm (e.g., OSIRIS), that were learned using a large database of 340,000 iris images collected from 3,000 classes,
source codes of the framework allowing to train filters for own database of iris images along with all visualization scripts [aboyd-github].
2 Related Work
In a work by Zhao [zhao2017towards], the use of triplet loss for the purpose of iris recognition is explored. This paper outlines the increase in performance and generalization that can be attained by applying a modified triplet loss that includes occlusion masks to the task of iris recognition. Direct comparisons are made to the performance to the 2D Gabor kernels included with the OSIRIS tool. Our paper differs from this in that instead of creating a multi-layer network to learn the kernels, we create a one layer network to learn the kernels that can be used outside of the model and accurately mimic the Daugman approach to iris recognition. Their work was later extended [wang2019towards] to show performance increases using a dilated neural network. The purpose of their work was to increase performance of an iris recognition system, whereas we wish to apply iris data to a one layer convolutional network and examine the kernels that are learned to determine whether Gabor kernels are optimal. Additionally, our work applies the occlusion masks directly to the loss function, rather than passing through a network as in the above works.
The use of batch mining in our work is described in a work by Hermans [hermans2017defense]. In their work, extensive experimentation is done on different triplet selection methodologies and they show that the use of hard mining within batches (referred to as Batch Hard) was optimal for their application to person re-identification. We adapt this approach to our task and perform Batch Hard selection of triplets. Also described in this paper is the use of a soft-margin for triplet loss calculation. They describe that instead of using an parameter to determine separation between classes within triplets, a log based approach provided a smoother decay without a cutoff as with the original approach. In our work we use this soft-margin to calculate the triplet loss.
In a work by Ahmad and Fuller [ahmad2019thirdeye], both the Batch Hard mining and soft-margin approach is explored for the purpose of iris recognition without normalization. In this work, this approach to triplet loss is validated by showing that a triplet based network can achieve better performance than end to end deep networks such as DeepIrisNet [gangwar2016deepirisnet]. Our work is different from this in that we use normalized images as the input to the network and include the occlusion mask in the loss function. In their work, a five convolutional layer structure is proposed with a pooling layer. No kernel analysis is performed. In our work we analyse the kernels learned by our single convolutional layer.
Figure 2 shows three examples from each of the databases used in this paper. The top row represents the data used to train the network and the bottom three rows are samples from each of the three subject-disjoint testing datasets.
3.1.1 Training, Validation and Testing Datasets
In this work a set of in-house data was employed to train the network. This set consists of 339,400 iris images from 3,000 classes. All images in this dataset are live irises and are of size . Images in this set were acquired using the LG 2200, LG 4000 and IrisGuard AD100 sensors. In the training pipeline, this data is further subdivided into training and validation. The training subset consists of 80% of the data and the validation set is the remaining 20%. The training and validation subsets are subject-disjoint meaning out of the 3,000 classes, 2,400 are used in training and 600 for validation.
To test the trained network, it was decided to use the following three subject-disjoint and cross-sensor databases:
CASIA-Iris-Thousand V4 [casia-database]
which contains 20,000 images from 1,000 subjects. Both left and right eye images were acquired for each subject, meaning there are 2,000 classes total in this dataset. All images in this dataset were acquired using the IKEMB-100 sensor from IrisKing.
WVU Non-Ideal Iris Database-Release 1 [crihalmeanu2007protocol]
, which contains 3,042 live iris images from a total of 231 subjects, totalling 461 individual classes. Included in this database are images collected in real environments including samples of off-angled, blurred, sensor noise and occlusions.
LivDet 2017 Liveness Detection-Iris – Warsaw Subset (live subset) [yambay2017livdet]
, which consists of 5,168 images from 468 classes. The sensor used to capture these images was the IrisGuard AD100 sensor and a setup composed of Aritech ARX-3M3C camera with SONY EX-View CCD sensor (with an increased NIR sensitivity), equipped with Fujinon DV10X7.5A-SA2 lens and B+W 092 NIR filter. This database contains clean, constrained, high-quality images.
3.1.2 Genuine and Impostor List Generation
The genuine list for both CASIA, Warsaw and WVU databases includes all possible genuine pairs. For the impostor list generation, in order to reduce the number of impostor comparisons but also maintain a representative sample of comparisons, a sample selection strategy was devised. This strategy works as follows: firstly, we only compare images of right eyes with other classes of right eyes and images of left eyes with classes of left eyes. Next, for each class, we select one random image from this reference class and compare it to a random image from another impostor class of the same side of the face. We then select another random image from the first reference class and compare with another different impostor class of the same eye. This is repeated until every class has a comparison with an image in every other class of the same eye type.
For the CASIA-Iris-Thousand database there are 2,000 classes consisting of 1,000 of both left and right eyes. Each of these classes contains 10 images, totalling to 20,000 images. For genuine comparisons, this means there are genuine comparisons per class, resulting in 90,000 total genuine comparisons. For impostor comparisons, using the above described method results in comparisons for each class (one class to all other classes of eyes on the same side of the face). This results in impostor comparisons.
Applying the same methodology, the resulting number of genuine comparisons for the Warsaw database is 47,408 and the number of impostor comparisons equals 218,556. For the WVU database there are 10,344 genuine comparisons and 212,060 impostor comparisons.
3.2 Image Preprocessing
There were two steps undertaken to preprocess the images before they could be used in the network. The first preprocessing step was to segment and normalize the iris to a standard size and after this data alignment was performed.
3.2.1 Segmentation and Normalization
The software employed to segment and normalize all iris images in this work is OSIRIS [othman2016osiris]. The open-source OSIRIS software outputs a normalized iris image of size , as well as the corresponding normalized occlusion mask of the same size. During image segmentation and normalization, if there was a failure, this sample was excluded from the subset. The reasoning for excluding these samples is that only valid training samples are desired so that the network learns iris-related features exclusively. On the network training set there was only 101 failures out of the 339,400 images, so the final network training set comprises 339,299 normalized iris images and occlusion masks from 3,000 classes. For the CASIA-Iris-Thousand database, there was 30 failures and a final testing dataset includes 19,970 images from 2,000 classes. For the Warsaw database, out of the 5,168 images there were 13 failures and for the WVU database out of the 3,043 images there were 3 failures.
3.2.2 Data Alignment
To prevent contradictory inter-class information from being presented to the network, a data alignment scheme was developed and hence incorporates rotation compensation [Daugman_PAMI_1993] into the iris recognition system. The data alignment method operates as follows: for each unique class, the image with the highest mean Pearson Correlation Coefficient (PCC) when compared to all images in that same class is selected as the reference. Using this reference image, each of the other images in the same class are individually rotated one pixel at a time along the x-axis until they reach the point with the highest positive PCC in relation to the reference image. The corresponding occlusion masks are also aligned accordingly.
3.3 Open-source Gabor Kernels
OSIRIS software offers six hand-crafted Gabor kernels. These Gabor kernels consist of three pairs at different scales: , and . Each image is convolved with each of the six kernels independently and then from each of the six feature outputs 256 points are extracted from predefined coordinates. This results in an Iris Code of length 1536. The list of which 256 coordinates to extract the value from is included with the OSIRIS tool, and identical bit sampling was applied in our network.
3.4 Network Architecture
One Convolutional Layer: For this work the network architecture was designed to implement a Daugman-style approach to iris recognition, as shown in Figure 3. To this end, the model only contains one single convolutional layer with six feature maps (three pairs of filters at different scales of , and to mimic the filters provided with the OSIRIS software). The sigmoidactivation function is applied to the output of this convolutional layer to normalize all features to the space.
The input to this network is a normalized iris image of size . The associated normalized mask is used in the loss function and as such is not regarded as the input to the network. After this convolutional layer, the output of each feature map is concatenated to form a feature vector of size .
Bit Sampling Layer: Upon concatenation of the convolutional outputs, this feature vector is passed into an non-trainable custom layer we denote as the bit sampling layer. The purpose is to extract final iris code bits from the predefined 256 locations in each of the six resulting feature maps, as provided by the OSIRIS tool, resulting in a feature vector of size in the range of 0 to 1. This feature vector serves as the output of the network.
Wrap Padding: To prevent the shrinking of the input image after undergoing convolution and to make sure that the selected bits are not just representing a region of zero padding, the input images are padded using a wrapping technique before undergoing convolution. This padding takes one side of an image and wraps it around to the opposite side of varying size depending on the convolutional filter size. As shown in Figure 3, for the filter, a pad of 4 pixels in both y-directions and 7 pixels in each x-direction is applied.
3.5 Triplet Loss
Since the problem of developing effective kernels for meaningful iris feature extraction is inherently a verification problem rather than a classification problem, the application of a triplet loss seemed natural. The goal of triplet loss is to minimize the feature embedding distance between instances of the same class and to maximise the feature embedding distance of different classes. To do this triplets are formed, which include an image called a anchor, another image of the same class as the anchor called the positive and an image of a different class to the anchor and positive called the negative. The triplet loss then takes the output of the network a feature vector for each of the anchor, positive and negative and then compares these feature vectors. Based on these vectors the network weights are updated to push the seen negative classes further and further away from the anchors while pulling the instances of the same class closer and closer.
3.5.1 Network Modifications
To extend our single convolutional layer model to be used in a network with triplet loss, we instantiate this network with shared weights for three inputs: the anchor, the positive and the negative, as shown in Fig. 3. The output vector of size for each input is then concatenated to form a feature vector of where the first elements correspond to the anchor, the second elements correspond to the positive and the final elements represent the negative image. This concatenated feature vector appears in the loss function as the predicted label and from here we extract each image’s features and calculate the distance.
3.5.2 Incorporation of Masks
In iris recognition, it is essential that occlusion masks are used in iris code comparisons. These occluded regions give us no useful information about the iris and also may appear very similar across different subjects. To this end, we propose a methodology to incorporate the exclusion of these occluded regions into the loss function.
For both the anchor/positive and anchor/negative pairs, a combined occlusion mask is calculated by excluding each pixel that is occluded in either the anchor or the positive/negative. The result is two masks of size that only represents regions that contain useful iris information in both individual samples for both pairs. We then pass these masks through the same bit sampling layer defined above to output the bits in the mask that represent the points that were sampled from the output of the convolution, resulting in two binary vectors of length .
To utilize the fact that when using triplet loss there is no true labels for the data as the function is simply trying to maximise a distance between samples rather than between the true and predicted labels, true labels can be defined to anything the user wishes. Using this knowledge, in order to make the combined masks of both the anchor/positive and anchor/negative pairs appear in the loss function, the true labels are set to be a concatenation of these two binary vectors, the first features corresponding to the combined mask of the anchor and positive samples and the second features corresponding to the combined mask of the anchor and negative samples. Because the true and predicted vectors must be the same size, we add a vector of zeros of size to the concatenation of the two combined masks. This results in a total feature vector of , the same as the predicted vector detailed above. This final set of zeros of length is simply discarded in the loss function.
3.5.3 Distance Metric
In general, for triplet loss based solutions, the Euclidean squared distance between two samples is used as the distance metric [ahmad2019thirdeye]. For the task of iris recognition, however, we use a metric that incorporates occlusion masks:
where and denote feature vectors of two samples being compared, and denote the corresponding masks (one denotes iris pixel, zero denotes occlusion), and . One may see a close similarity between and the fractional Hamming distance calculated when and are binary vectors. Ideally, the value of would be zero for the anchor/positive pair () and one for the anchor/negative pair ().
Traditional triplet losses use a margin value to specify how far we wish to separate out the anchor/positve () and anchor/negative () as follows:
However, setting is dataset-specific and requires hyper-parameter tuning to find out what works best as this value affects training accuracy and convergence. Additionally, this means the loss calculation acts as a hinge function. Instead, we applied a soft margin, as proposed in [hermans2017defense]:
This soft-margin loss function acts similarly to the hinge function but instead of having a hard cut-off, the loss decays exponentially and also does not require the discovery of an appropriate value of .
3.5.5 Triplet Mining
It has been demonstrated that the use of triplet mining has shown an increase in training accuracy and generalization [hermans2017defense]. Following this, it was decided to implement the selection of batch hard triplets for training. If the batch size is , batch generation is completed by randomly selecting unique classes (this set of classes is denoted as ). One random image is then selected from each class in as the anchor and another random image from the same class is selected as the positive. We do not do any mining on the anchor/positive pair. When selecting the negative for a triplet, for each of the anchor/positive pairs independently we randomly select new classes that do not appear in (denoted as ). One random image is then selected from each class in . The embeddings are then calculated using the current model weights when setting each of these images from are set as the negative for the current anchor/positive sample. Distances and are then calculated as described above. The negative that corresponds to the smallest distance is set as the negative for that triplet.
This is repeated for each anchor/positive pair in . It is important to note that occlusion masks were taken into account when calculating . After this batch is used in training, the model weights will update such that the hardest samples gradually grows further away from . This process of selecting the hardest negative only is referred to as batch hard negative mining. Because this hard mining is only performed on a small sample of the overall data, these discovered hardest triplets can be considered moderately hard with respect to the entire dataset. It was found that a batch size of was optimal with regards to time. Since this operation of batch hard mining is , increasing this batch size beyond results in batches taking too long to be created for a feasible training regimen.
Due to GPU memory and time constraints, from the within validation subset we use random triplets to act as the actual validation set for the training. This validation set is persistent throughout training so that an accurate representation on training progression is provided. Training was stopped for both the randomly initialized weight network and gabor initialized network after 20,000 batches as this was subjectively decided to represent the convergence of both networks based on the validation loss.
4.1 Kernel Analysis
Figure 4 shows 2D and 3D plots of the used kernel sets: (a) original approximations of Gabor kernels as offered by the existing open-source tool, (b) kernels that were learned from data using random weight initialization, and (c) kernels learned from the same data using Gabor initialized weights. From this figure it is clear that the data-driven kernels did not converge on a similar structure to the original Gabor kernels, instead, some interesting behaviour is observed.
On initial observation of the data-driven kernels, both sets appear to have converged to similar structures. For both pairs of data-driven kernels, the network converged on what appears to be edge detectors. This can be seen in the 3D plots in that the kernels look like flat planes, meaning they extract edge based features. For the kernels, the network converged on blob detectors and for the kernels the final kernels look like simple waves. There is slight differences between the randomly initialized kernels and the Gabor initialized kernels, however, both sets converged on similar looking kernels even though the weights were initialized completely differently. Both the and the data-driven pairs do not display Gabor wavelet properties, so we can say these converged to something deviant of Gabor. The kernels for both weight sets, however, do exhibit some Gabor wavelet properties.
4.2 Evaluation Methodology
To perform a fair comparison between the data-driven kernels and the open-source Gabor kernels, the testing methodology involved the use of the OSIRIS tool to generate the matching scores. First, segmentation and normalization is performed on the testing databases using the out-of-the-box OSIRIS configuration. Then matching scores are calculated using the iris codes generated by original Gabor kernels. Then, we swap out these original kernels and replace them with the kernels that were learned from data and then the exact same encoding and matching as before is performed. We make sure that the learned kernels all sum to zero by subtracting the mean value of each kernel on all elements in that kernel, as OSIRIS binarizes the filter responses at zero to produce the iris code.
4.3 Hand-crafted versus Data-driven
Figure 5 shows genuine/impostor distributions as well as the corresponding ROC curves, along with the decidability value , which evaluates the separation between two distributions for each of the three testing datasets. The same methodology is applied to the original Gabor kernels, the randomly initialized data-driven kernels and the Gabor initialized data-driven kernels. These experiments are conducted independently on all three testing datasets.
From the results in Figure 5, the performance of the learned kernels in comparison to the hand-crafted Gabor kernels is very similar, with the hand-crafted set and the data-driven kernels performing almost identically on all three testing databases. These hand-crafted Gabor kernels were developed for the purpose of extracting features on all iris images. The data-driven kernels were learned on a large database of iris images. Even though both the hand-crafted and data-driven kernels both extract different features, it seems that both perform comparably in extracting unique iris features. The interesting takeaway from this is what the kernels actually converged to. The data-driven kernels did not converge to Gabor kernels, which hints that an optimal solution for the kernels for iris recognition may be from outside the family of Gabor wavelets. If a simple network with minimal parameters can learn something that performs almost identically to the time-intensive task of hand-crafting these Gabor kernels, it may be that there are some undiscovered kernel set that is optimal. We were asking ourselves in the introduction whether a neural network that replicates a Daugman approach to iris recognition ‘thinks’ that Gabor kernels are optimal by converging on this family of kernels, however, we do not see this!
4.4 Weight Initialization
Another question posed in the introduction was whether the weight initialization for this shallow network has an affect on accuracy. To do this we trained the network from two starting points, one using random weight initialization and the other using the hand-crafted Gabor kernels as the starting point. From Figure 5 we can see that there is no discernible difference between these kernels sets in terms of results, even though it can be seen in Figure 4 that these two sets of data-driven kernels appear significantly different than open-source Gabor. There are some clear similarities between the kernels the network converged on, but none of these are explicitly Gabor. This means that no matter the starting point, and even if the network is started from Gabor, it does not converge to Gabor instead opting for a combination of edge detectors, blob detectors and simple waves.
5 Discussions and Conclusions
In this work we develop and release a neural network framework that mimics Daugman’s approach to iris recognition that can be used to learn data-driven kernels for the purpose of iris recognition. We use this framework along with a large iris database to learn data-driven kernels and then we compare these learned kernels to hand-crafted, open-source Gabor kernels included with the OSIRIS tool.
To answer the questions posed in the introduction, we see that the learned kernels perform almost identically to the hand-crafted kernels. Shown is that the initial weight initialization also does not play a huge role in the convergence of the network as the two weight sets converged on similar looking and performing kernels.
A goal of this paper was to determine if a neural network that mimics Daugman’s approach converges on Gabor kernels, widely accepted as the optimal kernel set for iris recognition. Through learning data-driven kernels we show that this may not be true. The data-driven kernels learned in this work do not resemble Gabor, instead are comprised of a mixture of edge detectors, blob detectors and simple waves. This may mean that the optimal kernel set for iris recognition could be from a broader family than Gabor kernels alone.
Through experimentation, we see that the learned kernels perform very similarly to the hand-crafted kernels. This could mean that we have hit the maximum accuracy possible using a one convolutional layer with six feature maps. To increase accuracy, it seems that additional operations are needed such as the addition of extra layers in the network.