The authenticity and integrity of acquired finger vein (FV) images play a significant role in the overall security of a finger-vein-based biometric system. With the advent of forgery techniques, it is important to link FV images to their corresponding acquisition devices. An FV sample image that cannot be linked to a legitimate sensor of the recognition system should raise an alarm and stop the authentication process. Therefore, reliable and trustworthy algorithms for establishing finger vein authenticity and integrity are vital.
Many biometric modalities (e.g. face, fingerprint, palm, and finger vein images) are vulnerable to attacks. Presentation attacks, which present spoof artifacts to the biometric sensor, and insertion attacks, which bypass the sensor by inserting biometric samples into the transmission process between the sensor and the feature extraction module, are the most important examples of attacks on the user interface of a biometric system (see Figure 1).
Sensor / camera identification in general can be achieved at different levels: camera model level, brand level, and device level. In biometric systems, we often intend to work at device level in order to uniquely identify the sensor instance that captured a certain sample. Still, sensor model identification is of interest as well [maser2021identifying], for two purposes: securing a finger vein recognition system against insertion attacks in case the attacker does not know the employed sensor model, and enabling device-selective processing of the image data.
To identify the source device of an image, many algorithms have been proposed. The most prominent way to deduce sensor information from images is to exploit the image-inherent Photo Response Non-Uniformity (PRNU). The PRNU is an intrinsic characteristic of an image caused by the differing sensitivity of pixels to light, which in turn is due to inhomogeneities of the silicon wafer and imperfections during sensor fabrication.
Lukas et al. [lukas2006digital] and Fridrich [fridrich2009digital] propose to compute the residual image noise by subtracting a denoised version of the image from the image itself. The link between an image and a candidate device is established by evaluating the similarity between the PRNU factor (also called the PRNU fingerprint) and the residual noise using Normalized Cross-Correlation (NCC).
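The residual-and-correlation idea can be illustrated with a minimal sketch. It is not the implementation used in the cited works: the mean filter below is only a stand-in for the wavelet-based denoisers typically used, the reference pattern is a plain average of residuals, and all function names are hypothetical.

```python
import numpy as np

def box_denoise(img, k=3):
    """Simple k x k mean filter; an assumed stand-in for a real denoiser."""
    pad = k // 2
    padded = np.pad(img, pad, mode="reflect")
    out = np.zeros_like(img, dtype=np.float64)
    for dy in range(k):
        for dx in range(k):
            out += padded[dy:dy + img.shape[0], dx:dx + img.shape[1]]
    return out / (k * k)

def noise_residual(img):
    """Residual W = I - denoise(I)."""
    img = np.asarray(img, dtype=np.float64)
    return img - box_denoise(img)

def reference_pattern(images):
    """Average the residuals of several images from the same sensor."""
    residuals = [noise_residual(img) for img in images]
    return np.mean(residuals, axis=0)

def ncc(a, b):
    """Normalized cross-correlation of two equally sized arrays."""
    a = a - a.mean()
    b = b - b.mean()
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float((a * b).sum() / denom) if denom > 0 else 0.0
```

A test image would then be attributed to the sensor whose reference pattern yields the highest NCC with the image's residual, usually subject to a decision threshold.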
Another, more recent way of source camera identification is based on deep learning, specifically on CNN-based methods. Ahmed et al. [ahmed2019comparative] proposed a CNN model (three convolutional layers with a Softmax classifier) and compared the CNN-based result to a PRNU-based result, showing the PRNU-based approach to perform better than the proposed CNN approach. Baroffio et al. [baroffio2016camera] obtained good accuracy with a three-convolutional-layer CNN model on a larger dataset. Tuama et al. [tuama2016camera] obtained good results as well, looking into different CNN models, among them AlexNet, GoogleNet, and an architecture proposed in their work. Bondi et al. [bondi2016first] proposed a CNN model with four convolutional layers along with an SVM classifier instead of a fully connected layer for classification. Note that so far, CNN-based techniques have proven successful in device model identification, while PRNU-based techniques are able to support both model-level and device-level identification.
Moving to the biometric domain, Bartlow et al. investigated the identification of sensors from fingerprint images using a PRNU-based technique. They examined the effectiveness of sensor identification even when only a limited number of samples is available. Focusing on the iris subdomain, Kauba et al. [Kauba18c] used PRNU- and image-texture-based methods. For the texture-based method, the authors extracted texture descriptors by applying DSIFT, DMD, and LBP. The extracted features are represented by Fisher Encoding and classified by an SVM. Banerjee et al. [banerjee2017image] evaluated the applicability of different PRNU estimation schemes to deduce sensor information from NIR iris images. Moving from classical approaches to deep learning in iris sensor identification, Marra et al. [marra2018deep] proposed a three-layer CNN architecture with a Softmax layer at the end.
Shifting from the iris to the FV subdomain, Maser et al. [maser2019prnu, maser2019prnuOAGM] and Söllinger et al. [sollinger2019prnu] applied PRNU-based sensor identification methods to finger vein datasets. Maser et al. also proposed a texture-based sensor model identification approach [maser2021identifying], in which features are extracted from several classic and statistical image properties such as histogram, wavelet variance, entropy, and LBP. An SVM is applied to discriminate the origins of the sample images. In this work, we compare our results to theirs. So far, deep-learning-based techniques have not been used to identify the origin of vascular sample data at all.
In this work, we focus on identifying the origin of finger vein sample images using a CNN-based approach. Besides the evaluation of existing CNN models, we also propose a custom one and show its beneficial properties.
This work is structured as follows: In Section II, we describe the six CNN models used. Section III discusses the properties of the considered finger vein sample datasets and the setup of the conducted experiments. Next, we discuss and analyze the experimental results in Section IV, and finally, we end this manuscript with a conclusion in Section V.
II CNN models’ structure
In this section, we briefly discuss five state-of-the-art CNN models and introduce a novel CNN model adapted to the target application. To select the most appropriate CNN models, we examine a full range of prominent CNN models with varying properties, from a simple variant of AlexNet to more complex architectures like the Xception model. This should give us a deeper understanding of which type of model is suitable for learning the patterns of finger vein samples. We also study the complexity of the introduced models in Table I.
II-A Marra and Bondi Models
These two models are simple stacked networks (AlexNet family networks, both have been used in camera / sensor identification before).
(i) Bondi Model: The CNN model proposed by Bondi et al. [bondi2016first] is a stack of convolutional layers that ends with a fully-connected layer. We adopted the model and preserved the given specification as closely as possible. The only change we made to their proposed model is that the SVM classifier is replaced by a Softmax layer. The detailed structure of the Bondi model is given in the above-mentioned paper. As Table I shows, the Bondi model is a relatively light stacked network.
(ii) Marra Model: Marra et al. proposed a network [marra2018deep] which is an AlexNet variant with a reduced number of layers compared to its predecessor. Because it has two fully-connected layers with 1024 and 2048 neurons, the number of trainable parameters is significantly increased (Table I), which causes the high complexity of the model.
In the above section, we discussed two relatively shallow CNN networks (Bondi and Marra). Goodfellow et al. [goodfellow2013multi] showed that increasing the depth of a network led to better performance. In other words, increasing the number of layers in a network yields enriched feature maps. We therefore wanted to know how various deeper networks perform on our FV databases.
II-B VGG16 Model
We employ the VGG16 model introduced in [simonyan2014very]. VGG16 is an example of a deep network and, as a result, showed improvements in performance with respect to its predecessor models. Furthermore, VGG16 is designed to enrich the feature maps by stacking more layers, yielding a deeper network than the simple convolutional stacks described in Section II-A. We adopted ConvNet configuration B of the VGG network, as presented in Table 1 of [simonyan2014very], with a slight modification of the input layer and an adapted number of output classes. In addition, we reduced the number of convolutional layers with 512 channels from 4 to 2.
Even though deeper networks have advantages over shallower networks, higher network depth may lead to another problem called degradation. To avoid this potential problem, Residual networks have been introduced.
II-C 50-Layer ResNet Model (ResNet50)
The 50-layer Residual network exploits the concept of deep residual learning. One problem of deep networks is degradation: as the network depth increases, accuracy often saturates. Such saturation implies that the model is not optimized well. In addition, adding more layers to a deep network can lead to higher training error. He et al. addressed these problems by introducing deep residual networks (a.k.a. ResNet); therefore, we select the Residual network proposed by He et al. in [he2016deep] as a further candidate. In summary, (i) the deep Residual network is easier to optimize (the training error decreases) than its plain stacked counterparts, and (ii) it gains accuracy as the network depth increases. Further, (iii) the complexity of a Residual network is low compared to plain stacked CNN networks (see Table I); e.g. a ResNet with 152 layers has fewer parameters than a VGG model as discussed in Section II-B. Details of the ResNet architecture are given in [he2016deep].
II-D Xception Model
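The identity shortcut at the heart of residual learning can be sketched in a few lines. The example below is a simplification under stated assumptions: it uses dense (matrix) layers instead of convolutions and omits batch normalization, and the function names are hypothetical. It only illustrates that a block computes y = ReLU(F(x) + x), so that a block whose residual branch F outputs zero passes its input through unchanged (up to the final ReLU).

```python
import numpy as np

def relu(x):
    """Element-wise rectified linear unit."""
    return np.maximum(x, 0.0)

def residual_block(x, w1, w2):
    """Compute y = relu(F(x) + x) with F(x) = w2 @ relu(w1 @ x).

    Dense layers are an assumed simplification of the convolutional
    layers used in an actual ResNet block.
    """
    fx = w2 @ relu(w1 @ x)   # residual branch F(x)
    return relu(fx + x)      # identity shortcut adds the input back
```

Because the shortcut carries the input forward, additional blocks only need to learn a correction to the identity mapping, which is the property that eases the optimization of very deep networks.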
We selected a variation of the Inception model called Xception, proposed by Chollet [chollet2017xception]. The Xception model is claimed to be capable of learning with fewer parameters. The philosophy behind this architecture is to decouple the mapping of cross-channel correlations and spatial correlations in the feature maps of CNNs. To achieve this decoupling, the depth-wise separable convolution is applied, which works as follows: a spatial convolution is executed independently over each channel of the input (the depth-wise convolution), then a point-wise convolution ($1 \times 1$) is applied, which projects the channels output by the depth-wise convolution onto a new channel space. It is worth mentioning that, in contrast to Inception modules, Xception applies no nonlinearity between the depth-wise and the point-wise operation. In summary, the Xception model is a linear stack of depth-wise separable convolution layers with residual connections. The details of the Xception architecture are given in [chollet2017xception].
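The two steps of a depth-wise separable convolution can be made concrete with a small NumPy sketch. The function names, the channels-last layout, and the choice of valid padding with stride 1 are assumptions made for illustration. As in common deep learning frameworks, the "convolution" is computed as a cross-correlation.

```python
import numpy as np

def depthwise_conv(x, dw_kernels):
    """Filter each channel independently.

    x: (H, W, C) input, dw_kernels: (k, k, C), one k x k filter per channel.
    Valid padding, stride 1.
    """
    h, w, _ = x.shape
    k = dw_kernels.shape[0]
    out_h, out_w = h - k + 1, w - k + 1
    out = np.zeros((out_h, out_w, x.shape[2]))
    for i in range(k):
        for j in range(k):
            # Shifted view of the input, weighted per channel.
            out += x[i:i + out_h, j:j + out_w, :] * dw_kernels[i, j, :]
    return out

def pointwise_conv(x, pw_weights):
    """1 x 1 convolution: mix channels at every spatial position.

    x: (H, W, C_in), pw_weights: (C_in, C_out).
    """
    return x @ pw_weights

def separable_conv(x, dw_kernels, pw_weights):
    """Depth-wise convolution followed by a point-wise projection."""
    return pointwise_conv(depthwise_conv(x, dw_kernels), pw_weights)
```

The result equals a standard convolution whose kernel factorizes into a per-channel spatial filter and a channel-mixing matrix, which is where the parameter savings come from.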
II-E 6-layer CNN Model (FV2021)
Many architectures could serve as a novel network that is small and still exploits the strengths of the most prominent CNN models, and most of them might work. We propose a small model (potentially also well suited for mobile devices) that aims to achieve the same accuracy as the large CNN models. Thus, one advantage of the FV2021 model is that it has the lowest complexity (Table I). We use Separable Convolution (SC, as used in the Xception network) instead of the classic convolution layer. As explained before, separable convolution performs a depth-wise spatial convolution (which acts on each input channel separately) followed by a point-wise convolution that mixes the resulting output channels. Thus, in developing FV2021, we took advantage of cross-channel correlations as well as spatial correlations. To this end, we use small receptive fields as well as $1 \times 1$ convolution filters, which can be seen as a linear transformation of the input channels. The network architecture is composed of two sequential blocks: the first block has a skip connection, while the second block has a residual connection (a connection with a convolution operator). To reduce the computational complexity in the first layer, the number of filters is reduced to 32 and the kernel size is 2. The parameters of each convolution block are denoted as follows: Separable Conv. (number of filters, receptive field size, s = strides).
II-F Complexity of CNN Architectures
Complexity and memory consumption are vital criteria for selecting an algorithm in general. When selecting CNN models in particular, the performance, resource consumption, and complexity of the model can be the Achilles' heel for practical applications. One way to estimate the complexity and resource consumption of a CNN is to count the trainable parameters used by the architecture. We show the number of total and trainable parameters for each of the discussed CNN architectures in Table I. In addition, the last column shows the number of weighted layers. Please note that in each model the last fully-connected (FC) layer includes the Softmax.
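Table I: Total and trainable parameters and number of weighted layers of the considered CNN models.

| Model | Total parameters | Trainable parameters | Weighted layers |
|---|---|---|---|
| Bondi | 2,681,368 | 2,681,304 | 4 Conv + 2 FC |
| Marra | 65,563,720 | 65,563,720 | 3 Conv + 2 FC |
| VGG16 | 55,097,288 | 55,077,064 | 8 Conv + 3 FC |
| ResNet50 | 23,597,832 | 23,544,712 | 50 Conv + 1 FC |
| Xception | 20,877,296 | 20,822,768 | 36 Conv + 1 FC |
| FV2021 | 314,632 | 314,376 | 6 Conv + 1 FC |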
Table I reveals that FV2021 has the smallest number of trainable parameters (about 0.3 million), while the other models have far more, ranging from roughly 2.7 to 65.6 million. Thus, the proposed CNN has the lowest complexity among the models discussed in this section. We discuss the sensor identification performance of the CNN architectures in Section IV.
III Finger Vein Sample Data & Experimental Design
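To see why separable convolutions help keep the parameter count low, the following sketch compares the number of trainable parameters of a standard and a depth-wise separable convolution layer. The layer sizes in the usage example are hypothetical and not taken from FV2021; the formulas assume a depth multiplier of 1 and a single bias vector on the layer output.

```python
def conv_params(k, c_in, c_out, bias=True):
    """Trainable parameters of a standard k x k convolution layer."""
    return k * k * c_in * c_out + (c_out if bias else 0)

def separable_conv_params(k, c_in, c_out, bias=True):
    """Trainable parameters of a depth-wise separable convolution layer."""
    depthwise = k * k * c_in   # one k x k filter per input channel
    pointwise = c_in * c_out   # 1 x 1 convolution mixing the channels
    return depthwise + pointwise + (c_out if bias else 0)

# Hypothetical layer: 3 x 3 kernel, 32 input channels, 64 output channels.
standard = conv_params(3, 32, 64)             # 18496
separable = separable_conv_params(3, 32, 64)  # 2400
```

For this hypothetical layer, the separable variant needs less than one seventh of the parameters of the standard convolution.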
We consider eight different, well-known, and publicly accessible finger vein databases, each acquired with a distinct prototype near-infrared sensing device. In this work, we took 120 samples from each database. The databases are as follows:
Information on the size of the original samples and on how the samples were drawn from the datasets is given in [sollinger2019prnu, maser2021identifying].
III-A Finger Vein Region of Interest (ROI)
In finger vein recognition, features are typically not extracted from the raw sample image but from a region of interest, i.e. the portion of the image containing only finger vein texture. In addition, an insertion attack can also be mounted using ROI samples (in case the sensor does not deliver a raw sample to the recognition module but ROI data instead). Thus, we produced cropped image samples (ROI datasets) from the original samples to be able to test our approach on these data as well. To produce the ROI datasets, we follow the approach proposed by Maser et al. in [maser2021identifying].
The original samples, as shown in Figure 5, can be discriminated easily: besides the differences in size (which can of course be adjusted by an attacker), the sample images can probably be distinguished by the extent and luminance of the background. To illustrate this, we display each image's histogram beside the example in Figure 5; these histograms clearly exhibit very different structures. Thus, even texture descriptors can easily identify the origin of the respective original sample images. This is not necessarily the case for ROI data.
To investigate the differences between raw sample data and ROI data in more detail, we analyzed the range of luminance values and their variance across all datasets. Figures 3 and 4 display the results in the form of box-plots, where the left box-plot corresponds to the original raw sample data and the right one to the ROI data. We can clearly see that the luminance distribution properties change dramatically once we move from the original datasets to the ROI datasets. For example, original HKPU_FV samples can be discriminated from FV_USM, MMCBNU, PALMAR, UTFVP, and THU_FVFDT samples by considering the luminance value distribution alone. For the ROI data, the differences are no longer as pronounced. When looking at the variance value distributions, we observe no such strong discrepancy between original sample and ROI data; still, for some datasets variance can be used as a discrimination criterion (e.g. PALMAR vs. HKPU_FV in the original data, FV_USM vs. HKPU_FV in the ROI data). Consequently, we expect the discrimination of the considered datasets to be much more challenging when focusing on the ROI data only.
III-B Pipeline setup and preprocessing
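A per-dataset luminance analysis of this kind could be sketched as follows. We assume here that each image is summarized by its mean luminance and its luminance variance, and that the box-plots are drawn from the five-number summary of these per-image values; the exact statistics behind Figures 3 and 4 may differ, and all function names are hypothetical.

```python
import numpy as np

def luminance_stats(image):
    """Return (mean luminance, luminance variance) of a grayscale image."""
    img = np.asarray(image, dtype=np.float64)
    return img.mean(), img.var()

def five_number_summary(values):
    """Minimum, quartiles, and maximum, i.e. the values a box-plot shows."""
    q = np.percentile(values, [0, 25, 50, 75, 100])
    return dict(zip(("min", "q1", "median", "q3", "max"), q.tolist()))

def summarize_dataset(images):
    """Summarize per-image mean luminance and variance for one dataset."""
    means, variances = zip(*(luminance_stats(im) for im in images))
    return {
        "mean_luminance": five_number_summary(means),
        "variance": five_number_summary(variances),
    }
```

Comparing the summaries of two datasets then shows directly whether their luminance ranges overlap, which is the criterion discussed above.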
Each dataset consists of 120 images, for a total of 960 images. To enhance the image samples and improve the contrast, we applied Contrast Limited Adaptive Histogram Equalization (CLAHE). The entire data (960 images from eight datasets) are shuffled, and we then randomly take 70% of the data for the training set, 10% for the validation set, and 20% for the test set. The splitting policy ensures that data samples used during training are never used during validation or testing. Thus, the reported performance is not biased, since the training, validation, and test sets are pairwise disjoint. The Adam optimizer is applied for all networks except Bondi and Marra, for which SGD is used as per the authors' recommendation. The batch size is set to 64. To feed the input to the CNN models, we normalized the image samples to a uniform width and height, so that the patches fed to the networks have the same size for uncropped samples and ROI data. To compare the results of the CNN-based approaches with a PRNU-based approach, we use the results given in [sollinger2019prnu]. The authors worked on five patches taken from different locations of the image samples. For the comparison, we consider their results for one patch size to be comparable to our results on original image samples, and their results for a second patch size to be comparable to our results on ROI samples.
III-C Evaluation metrics
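The shuffled 70/10/20 split can be sketched as below. The function name and the fixed seed are assumptions for illustration; the sketch operates on sample indices, so that the same partition can be applied to images and labels alike.

```python
import random

def split_indices(n_samples, train=0.7, val=0.1, seed=0):
    """Shuffle sample indices and split them into train/validation/test.

    The test set receives whatever remains after the train and
    validation portions (20% for the default fractions).
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_train = round(n_samples * train)
    n_val = round(n_samples * val)
    train_idx = indices[:n_train]
    val_idx = indices[n_train:n_train + n_val]
    test_idx = indices[n_train + n_val:]
    return train_idx, val_idx, test_idx

# 960 images from eight datasets, as in the experiments described above.
train_idx, val_idx, test_idx = split_indices(960)
```

Since the three lists are consecutive slices of one permutation, they are disjoint by construction; for 960 samples they contain 672, 96, and 192 indices, respectively.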
We use classical measures to rate our sensor identification task, which is basically a multi-class classification problem. We use the area under the curve of the receiver operating characteristic (AUC-ROC), which relates the true positive rate (TPR) to the false positive rate (FPR). Analyzing the AUC-ROC is meaningful because it shows the ability of a classifier to distinguish between classes.
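For illustration, the AUC-ROC can be computed without explicitly tracing the curve, using the fact that it equals the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one (ties counted as one half). How the per-class values are aggregated in the multi-class case is an assumption in this sketch: we use a one-vs-rest scheme with an unweighted (macro) average, and the function names are hypothetical.

```python
def binary_auc(scores, labels):
    """AUC-ROC for binary labels (1 = positive, 0 = negative)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    # Count positive/negative pairs ranked correctly; ties count half.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def macro_ovr_auc(score_rows, labels, n_classes):
    """Unweighted mean of one-vs-rest AUCs.

    score_rows[i][c] is the score of sample i for class c (e.g. a
    Softmax output), labels[i] is the true class index of sample i.
    """
    aucs = []
    for c in range(n_classes):
        scores_c = [row[c] for row in score_rows]
        labels_c = [1 if y == c else 0 for y in labels]
        aucs.append(binary_auc(scores_c, labels_c))
    return sum(aucs) / n_classes
```

The pairwise formulation is quadratic in the number of samples, which is unproblematic for test sets of the size used here.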
In sensor identification, Precision is a further important metric because it indicates the proportion of positive decisions that are actually correct, i.e. TP / (TP + FP); a high value indicates a well-performing classifier. In the field of biometrics, it is vital to verify the correct sensor (true positive), whereas it would be a catastrophe if the biometric system verified the wrong sensor (false positive). Therefore, Precision is more important than Recall and is consequently also used to assess our results.
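Precision for a multi-class problem can be sketched as follows: for each class (sensor), true positives are samples correctly assigned to it and false positives are samples of other classes assigned to it. Averaging the per-class values without weighting (macro averaging) is an assumption of this sketch, as are the function names.

```python
def per_class_precision(y_true, y_pred, n_classes):
    """Precision TP / (TP + FP) for every class index."""
    tp = [0] * n_classes
    fp = [0] * n_classes
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[p] += 1   # predicted class p and it was correct
        else:
            fp[p] += 1   # predicted class p but the true class differs
    # A class that is never predicted gets precision 0.0 by convention.
    return [tp[c] / (tp[c] + fp[c]) if tp[c] + fp[c] else 0.0
            for c in range(n_classes)]

def macro_precision(y_true, y_pred, n_classes):
    """Unweighted mean of the per-class precision values."""
    values = per_class_precision(y_true, y_pred, n_classes)
    return sum(values) / n_classes
```

Every false positive for a sensor lowers that sensor's Precision directly, which is why the metric reflects the cost of accepting a sample from the wrong device.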
In this section, we discuss the results of applying the six CNN models to sensor identification on original samples (uncropped datasets) as well as on cropped samples (ROI data).
IV-A Results of the six CNN models
In the following paragraphs, we analyze the outcomes of the six CNN models. The first and second columns of Table II (results are rounded to five digits after the decimal point) show the AUC-ROC scores of the six applied CNN models on the original samples and on their corresponding ROIs.
As expected, the AUC-ROC scores achieved on the original samples are excellent (first column). All CNN models demonstrate perfect or near-perfect results; only the modified Bondi model and ResNet50 fall slightly below 1.00, by a narrow gap (0.0001). The situation is similar for the ROI datasets (second column), where almost all models exhibit excellent results. The scores of the Xception, FV2021, ResNet50, and VGG16 models are reported in Table II. Compared to these four models, the results of the Marra and Bondi models are inferior.
The third column of Table II displays the Precision scores of all CNN models on original samples (uncropped datasets). The Xception, FV2021, VGG16, and Marra models are superior to ResNet50 and the Bondi model. Their Precision scores are either perfect or close to perfect. From the Precision results, we can thus infer that the results obtained by these four models are highly reliable and accurate.
Moving from the original (uncropped) samples to the ROI data, the fourth column of Table II shows the performance of the applied models on the region of interest. Ordered by Precision score from highest to lowest, the models rank as follows: Xception, FV2021, ResNet50, VGG16, Marra, and Bondi. Thus, the Xception model and the proposed FV2021 model perform slightly better than the others.
We would like to emphasize that the number of false positives (FP) for the ResNet50, VGG16, Marra, and Bondi models is relatively high, which causes their Precision scores to be lower than those of the Xception and FV2021 models.
IV-B Comparison of various approaches
In this section, we compare the performance of various approaches for identifying the FV image origin. As explained in Section III-B, we compare against the results of the PRNU-based approach proposed by Söllinger et al. [sollinger2019prnu] and the texture-based approach of Maser et al. [maser2021identifying]. Table III shows the results of these three approaches. We observe the superiority of the deep learning (CNN) methods, using the proposed FV2021 and the Xception CNN models, respectively, over the PRNU-based and the texture-based approach. The FV2021 and Xception models compete closely and their results are approximately equal, but due to the significantly lower complexity of FV2021 (as confirmed by the number of trainable parameters in Table I), we consider FV2021 the superior CNN model.
IV-C Single Sensor-based Results
In this section, Table IV displays the per-sensor results of the employed CNN models for all uncropped datasets (instead of the overall results shown before).
All sensors are discriminated perfectly except for MMCBNU: the ResNet50 and Bondi models experienced some difficulties in discriminating the MMCBNU sensor.
Table V shows the results of the employed CNN models for all ROI datasets. The excellent performance of the Xception, FV2021, and ResNet50 models on all sensors can be seen. Among these, only FV2021 achieves results of either 1.0 or 0.999 on every sensor.
In this research, we studied the results of using five existing CNN models and a novel CNN model (FV2021) for sensor identification on the ROI data as well as on the original finger vein samples. The finger vein samples are taken from eight databases. The performance of the proposed FV2021 and the Xception models is superior to that of the other CNN models. We also compared the CNN-based results with other results, including PRNU correlation-based and texture descriptor-based approaches; the CNN-based results show slightly better performance. The two top-performing CNN architectures perform very similarly in terms of sensor identification accuracy, but due to its much lower model complexity, we recommend the proposed FV2021. The result achieved by FV2021 is excellent: the AUC-ROC score is 0.9997 for ROI data and 1.0 for original samples.