1 Introduction
Authentication on the basis of “who we are”, instead of “something we possess” or “something we remember”, offers convenience and, often, stronger system security. One of the important factors in making biometric passwords as widespread as text based ones is template protection. Text based password authentication provides strong template protection, whereas biometric data generally enjoys weaker protection due to the difficulty of exact matching. Given the sensitive nature of biometric data, algorithms that provide the same level of template security without compromising matching accuracy would be ideal.
A typical password authentication system uses a sample of the user’s password to extract and store a template. It is desirable that this template is stored in a protected and cancelable manner for system security. During authentication, a new template is extracted from the presented password and matched against the stored template, and access is granted or denied depending on the matching score. In the case of text based passwords, a one way noninvertible transform (i.e. a hash) of the password is stored as the template. During verification, a password is entered, its hash value is calculated, and the hash is compared with the stored hash. If the two strings match exactly, their hashes match as well, and access is granted. In such a scenario, the stored hash reveals no information about the original password (protection), and if the password is compromised, it can be changed and a new password registered (cancelability).
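For text passwords, this hash-and-compare flow is simple to sketch (a minimal illustration using Python's standard library; a production system would add per-user salts and a slow password hash rather than bare SHA-512):

```python
import hashlib

def enroll(password: str) -> str:
    # Store only the one-way hash of the password, never the password itself.
    return hashlib.sha512(password.encode()).hexdigest()

def verify(attempt: str, stored_template: str) -> bool:
    # If the two strings match exactly, their hashes match as well.
    return hashlib.sha512(attempt.encode()).hexdigest() == stored_template

template = enroll("correct horse")
assert verify("correct horse", template)       # exact match: access granted
assert not verify("correct horsf", template)   # any change: access denied
```

The stored hexdigest reveals nothing about the password itself, which is exactly the property the rest of the paper seeks to obtain for faces.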
This kind of security would be ideal for biometric based authentication as well but, unlike text passwords, biometric modalities lack two important aspects: 1) they rarely match exactly between different readings, and 2) they cannot be changed if compromised. Thus, the objective of cancelable biometrics approaches is to extract templates from biometric modalities that are 1) protected, i.e. given the template, it should be infeasible to extract any information about the original modality, and 2) cancelable, i.e. if compromised, it should be possible to extract a new template from the same modality.
1.1 Contribution
We tackle these objectives by using a deep convolutional neural network (CNN) to learn a robust mapping of face classes to maximum entropy binary (MEB) codes. The mapping is robust enough to tackle the problem of exact matching, yielding the same code for new samples of a user as the code assigned during training. This exact matching enables us to store a hash of the code as the template of the user. The hash function used could be any function that follows the random oracle model; in our case we choose SHA-512, since it is the current standard for string based passwords and offers strong security. Once hashed, the template has no correlation with the code assigned to the user. Furthermore, the codes assigned to users are bitwise randomly generated and thus possess maximum entropy and have no correlation with the original biometric modality (the user’s face). These properties make attacks on the template very difficult, leaving brute force attacks as the only feasible option. Cancelability is achieved by changing the codes assigned to users and relearning the mapping.
Exploiting the large learning capacity of the CNN with powerful regularization, we also achieve state-of-the-art matching performance on the PIE, Extended Yale B and MultiPIE databases. Note that, in this work, we focus on the use-case of using faces as passwords and thus validate our results on data collected in controlled environments.
1.2 Related Work
A variety of template protection algorithms have been applied to faces. Schemes that use cryptosystem based approaches include the fuzzy commitment schemes of Ao and Li [1], Lu et al. [11] and Van Der Veen et al. [23], and the fuzzy vault scheme of Wu and Qiu [24]. In general, fuzzy commitment schemes suffer from limited error correcting capacity or short keys. In fuzzy vault schemes the data is stored in the open between chaff points, which also incurs a storage overhead. Quantization schemes were used by Sutcu et al. [16, 17] to generate somewhat stable keys. There are also several works that combine the face data with user specific keys: combination with a password by Chen and Chandran [2], user specific token binding by Ngo et al. [12, 21, 22], biometric salting by Savvides et al. [14], and user specific random projection schemes by Teoh and Yuang [20] and Kim and Toh [9]. Hybrid approaches that combine transform based cancelability with cryptosystem based security, like [5], have also been proposed, but they give out user specific information to generate the template, creating possibilities of masquerade attacks. Pandey and Govindaraju [13] proposed a security centric scheme that used features extracted from local regions of the face to obtain exact matching and thus benefited from the security of hash functions. Although more secure, the matching accuracy of the scheme suffered, and the feature space being hashed was not uniformly distributed.
On the image recognition side, deep CNN algorithms like DeepFace [18] have shown exceptional performance and hold the current state-of-the-art results for face recognition. There is also some recent work that seeks to map data to binary codes using deep neural networks, like [3]. Although mapping to binary codes (or learning hash functions) in this manner may seem similar to our approach, these methods are fundamentally different from what we are trying to achieve. Algorithms such as [3] seek to learn a natural binary representation of the data, and thus the binary codes they map to are correlated with the data distribution. Our MEB codes have no correlation with the original data distribution. This gives us the template security we seek, but also makes the problem more challenging, since the mapping function we need to learn is more complex.

2 Algorithm
In this section of the paper we describe the individual components of our architecture in more detail. An overview of the algorithm is shown in Figure 1.
2.1 Convolutional Neural Networks
Convolutional neural networks (CNNs) [10] are biologically inspired models which contain three basic components: convolution, pooling and fully connected layers. In the convolution layer one tries to learn a filter bank given input feature maps. The input of a convolution layer is a 3D tensor with $K$ 2D feature maps of size $H \times W$. Let $x^{(l,k)}_{ij}$ denote the component at row $i$ and column $j$ in the $k$th feature map, and let $x^{(l,k)}$ denote the complete $k$th feature map at layer $l$. If one wants to learn a set of $K'$ filters of size $h \times w$, the output for the next layer will still be a 3D tensor, with $K'$ 2D feature maps of size $(H-h+1) \times (W-w+1)$. More formally, the convolution layer computes the following:

$x^{(l+1,k')} = f\Big( \sum_{k=1}^{K} W^{(l,k,k')} * x^{(l,k)} + b^{(l,k')} \Big)$    (1)

where $W^{(l,k,k')}$ denotes the filter that connects feature map $k$ to output map $k'$ at layer $l$, $b^{(l,k')}$ is the bias for the $k'$th output feature map, $f(\cdot)$ is some elementwise nonlinearity function, and $*$ denotes the discrete 2D convolution.
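The convolution of Equation (1) can be sketched directly in NumPy (a naive, loop-based illustration; as in most deep learning libraries, the code computes cross-correlation rather than flipped-kernel convolution, and the ReLU nonlinearity and layer sizes are our own choices for the example):

```python
import numpy as np

def conv_layer(x, W, b, f=lambda v: np.maximum(v, 0.0)):
    """x: (K, H, W) input maps; W: (K2, K, h, w) filter bank;
    b: (K2,) biases; f: elementwise nonlinearity (ReLU here)."""
    K, H, Wd = x.shape
    K2, _, h, w = W.shape
    Ho, Wo = H - h + 1, Wd - w + 1          # "valid" output size
    out = np.zeros((K2, Ho, Wo))
    for k2 in range(K2):
        for i in range(Ho):
            for j in range(Wo):
                # Elementwise products summed over all input maps.
                out[k2, i, j] = np.sum(x[:, i:i + h, j:j + w] * W[k2]) + b[k2]
    return f(out)

y = conv_layer(np.random.randn(3, 8, 8), np.random.randn(4, 3, 3, 3), np.zeros(4))
assert y.shape == (4, 6, 6)  # (H-h+1) x (W-w+1) per output map
```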
The pooling (or subsample) layer takes a 3D feature map and tries to downsample/summarize the content with less spatial resolution. Pooling is commonly done for every feature map independently and with nonoverlapping windows. An intuition of such operation is to have some built in invariance against small translations as well as reduce the spatial resolution and thus save computation for the upper layers. For average (mean) pooling, the output will be the average value inside the pooling window, and for max pooling the output will be the maximum value inside the pooling window.
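Non-overlapping max pooling over each map can be written compactly with a reshape (a NumPy sketch; the window size of 2 in the example is illustrative):

```python
import numpy as np

def max_pool(x, p=2):
    """Non-overlapping max pooling, applied to each feature map independently.
    x: (K, H, W) with H and W divisible by the window size p."""
    K, H, W = x.shape
    # Expose each p x p window as its own pair of axes, then reduce over them.
    return x.reshape(K, H // p, p, W // p, p).max(axis=(2, 4))

x = np.arange(16, dtype=float).reshape(1, 4, 4)
assert max_pool(x).tolist() == [[[5.0, 7.0], [13.0, 15.0]]]
```

Swapping `.max` for `.mean` in the reduction gives average pooling.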
The fully connected layer connects all the input units from the lower layer $l$ to all the output units in the next layer $l+1$. In more detail, the next layer output is calculated by:

$x^{(l+1)} = f\big( W^{(l)} x^{(l)} + b^{(l)} \big)$    (2)

where $x^{(l)}$ is the vectorized input from layer $l$, and $W^{(l)}$ and $b^{(l)}$ are the parameters of the fully connected layer at layer $l$.

A CNN is commonly composed of several stacks of convolution and pooling layers followed by a few fully connected layers. The last layer is normally associated with some loss to provide training signals, and training can be done by gradient descent on the parameters with respect to the loss. For example, in classification the last layer is normally a softmax layer, and the cross entropy loss is calculated against the 1 of K representation of the class labels. In more detail, let $z$ be the preactivation of the last layer, let $o$ denote the final output with $o_i$ its $i$th component, and let $t$ denote the target 1 of K vector with $t_i$ its $i$th dimension; then

$o_i = \frac{\exp(z_i)}{\sum_j \exp(z_j)}$    (3)

$L = -\sum_i t_i \log o_i$    (4)

where $L$ is the loss function.
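The softmax output and cross entropy loss of Equations (3) and (4) can be sketched as follows (a minimal NumPy illustration; subtracting the max before exponentiating is a standard numerical-stability trick not spelled out in the text):

```python
import numpy as np

def softmax(z):
    # Equation (3); max-subtraction keeps exp() from overflowing.
    e = np.exp(z - z.max())
    return e / e.sum()

def cross_entropy(o, t):
    # Equation (4): t is the 1-of-K target vector, so only the
    # true-class term contributes to the sum.
    return -np.sum(t * np.log(o))

z = np.array([2.0, 1.0, 0.1])
o = softmax(z)
t = np.array([1.0, 0.0, 0.0])
assert abs(o.sum() - 1.0) < 1e-12                      # valid distribution
assert abs(cross_entropy(o, t) - (-np.log(o[0]))) < 1e-12
```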
2.2 Maximum Entropy Binary Codes
Our first step of training is to assign a unique code to each user to be enrolled. From a template security point of view, these codes should ideally possess two properties. First, they should possess high entropy. Since a hash of these codes is the final protected template, the higher the entropy of the codes, the larger the search space for a brute force attack. In order to make brute force attacks in the code domain infeasible, we use binary codes of length $K$ bits, experimenting with values of $K$ up to 1024. The second desirable property of the codes is that they should not be correlated with the original biometric modality. Any correlation between the biometric samples and the secure codes can be exploited by an attacker to reduce the search space during a brute force attack. One example to illustrate this is binary features extracted from faces: even though the dimensionality of the feature vector may be high, given the feature extraction algorithm and the type of data, the number of possible values the vector can take is severely reduced. In order to prevent such a reduction of entropy, the codes we use are bitwise randomly generated and have no correlation with the original biometric samples. This makes the space to be hashed truly uniformly distributed. More precisely, let $b_i \sim \mathrm{Bernoulli}(0.5)$ be the binary variable for each bit of the code, where $\mathrm{Bernoulli}(0.5)$ is the maximum entropy Bernoulli distribution; the resultant MEB code with $K$ independent bits is thus $\mathbf{b} = [b_1, \ldots, b_K]$. We denote the code for user $u$ by $\mathbf{b}_u$.

2.3 Learning the Mapping
In order to learn a robust mapping of a user’s face samples to the codes, we make some modifications to the CNN training procedure. The 1 of K encoding of the class labels is replaced by the MEB codes assigned to each user. Since we now want several bits of the network output to be one instead of a single bit, we use sigmoid activation instead of softmax. In more detail:
$o_i = \frac{1}{1 + \exp(-z_i)}$    (5)

$L = -\sum_i \big[ b_i \log o_i + (1 - b_i) \log(1 - o_i) \big]$    (6)

where $o_i$ is the $i$th output from the last layer and $L$ is the binary cross entropy loss computed against the bits $b_i$ of the user’s MEB code.
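The sigmoid output and binary cross entropy of Equations (5) and (6) can be sketched as follows (a minimal NumPy illustration; the example values are arbitrary):

```python
import numpy as np

def sigmoid(z):
    # Equation (5): independent per-bit activations in (0, 1).
    return 1.0 / (1.0 + np.exp(-z))

def binary_cross_entropy(o, b):
    # Equation (6): per-bit cross entropy summed against the MEB code b.
    return -np.sum(b * np.log(o) + (1.0 - b) * np.log(1.0 - o))

z = np.array([3.0, -3.0, 0.0])
b = np.array([1.0, 0.0, 1.0])
assert abs(sigmoid(np.array([0.0]))[0] - 0.5) < 1e-12  # sigmoid(0) = 0.5
assert binary_cross_entropy(sigmoid(z), b) > 0.0       # loss is nonnegative
```

Unlike the softmax of Equation (3), the sigmoid outputs are not constrained to sum to one, which is what allows several bits of the target code to be one simultaneously.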
2.3.1 Data Augmentation
Deep learning algorithms generally require a large number of training samples, whereas training samples are generally limited in the case of biometric data. In order to magnify the number of training samples per user, we perform the following data augmentation. For each training sample of size $m \times m$ we extract all possible crops of size $c \times c$. Each crop is also flipped along its vertical axis, yielding a total of $2(m - c + 1)^2$ crops. The crops are then resized back to $m \times m$ and used for training the CNN.
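The crop-and-flip augmentation can be sketched as follows (a NumPy illustration; the 32 → 28 sizes are assumptions for the example, and the resize back to the original resolution is omitted for brevity):

```python
import numpy as np

def augment(img, c):
    """All c x c crops of a square m x m image plus their mirror images:
    2 * (m - c + 1)**2 crops in total."""
    m = img.shape[0]
    crops = []
    for i in range(m - c + 1):
        for j in range(m - c + 1):
            crop = img[i:i + c, j:j + c]
            crops.append(crop)
            crops.append(crop[:, ::-1])  # flip along the vertical axis
    return crops

crops = augment(np.zeros((32, 32)), 28)
assert len(crops) == 2 * (32 - 28 + 1) ** 2  # 50 crops per image
```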
2.3.2 Regularization
The large learning capacity of deep neural networks comes with the inherent risk of overfitting. The number of parameters in the network is often enough to memorize the entire training set, and the performance of such a network does not generalize to new data. Beyond these general concerns, mapping to MEB codes is equivalent to learning a highly complex function, where each dimension of the function output can be regarded as an arbitrary binary partition of the classes. This further increases the risk of overfitting, and powerful regularization techniques need to be employed to achieve good matching performance.
We apply dropout [8] on all fully connected layers with a 0.5 probability of discarding each hidden activation. Dropout is a very effective regularizer and can also be regarded as training an ensemble of an exponential number of neural networks that share the same parameters, thereby reducing the variance of the resulting model.
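A dropout layer of this kind can be sketched as follows (the "inverted" variant shown here rescales surviving activations at training time so that expected values are unchanged; the original formulation in [8] instead scales the weights at test time):

```python
import numpy as np

def dropout(activations, p_drop=0.5, training=True, seed=None):
    if not training:
        return activations  # at test time, all units are used
    rng = np.random.default_rng(seed)
    # Zero each activation independently with probability p_drop,
    # then rescale the survivors to preserve the expected value.
    mask = rng.random(activations.shape) >= p_drop
    return activations * mask / (1.0 - p_drop)

h = np.ones(100000)
out = dropout(h, p_drop=0.5, seed=0)
assert abs(out.mean() - 1.0) < 0.05  # expectation approximately preserved
```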
2.4 Protected Template
Even though the MEB codes assigned to each user have no correlation with the original samples, another step of taking a hash of the code is required to generate the protected template. Given the parameters of the network, it is not possible to entirely recover the original samples from the code (due to the max pooling operation in the forward pass of the network), but some information is leaked. Using a hash digest of the code as the final protected template prevents any information leakage. The hash function used can be any function that follows the random oracle model. For our experiments we utilize SHA-512, yielding the final protected template $h_u = \mathrm{SHA512}(\mathbf{b}_u)$.
During verification, a new sample of the enrolled user is fed through the network to get the network output $o$. We then binarize this output via a simple thresholding operation, yielding the code for the sample $\mathbf{b}' = [b'_1, \ldots, b'_K]$, where $b'_i = \mathbb{1}[o_i \geq 0.5]$ and $\mathbb{1}[\cdot]$ is the indicator function. At this point, the SHA-512 hash of the code could be taken and compared with the stored hash for the user. Due to the exact matching nature of the framework, this would yield a matching score of true/false nature. This is not ideal for a biometric based authentication system, since it is desirable to obtain a tunable score in order to adjust the false accept (FAR) and false reject rates (FRR). In order to obtain an adjustable score, several crops and their flipped counterparts are taken from the new sample (in the manner described in Section 2.3.1) and the hash is calculated for each one, yielding a set of hashes $\mathcal{H}$. We define the final matching score as the number of hashes in $\mathcal{H}$ that match the stored template, scaled by the cardinality of $\mathcal{H}$. Thus, the score for matching against user $u$ is given by

$s_u = \frac{1}{|\mathcal{H}|} \sum_{h \in \mathcal{H}} \mathbb{1}\big[ h = \mathrm{SHA512}(\mathbf{b}_u) \big]$    (7)

Now a threshold on the score can be set to achieve the desired value of FAR/FRR. Note that the framework provides the flexibility to work in both verification and identification modes. For identification, the hash set $\mathcal{H}$ can be matched against the templates of all the users stored in the database.
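The verification path, from sigmoid outputs to the score of Equation (7), can be sketched as follows (a hedged illustration: the 0.5 binarization threshold and the serialization of bits into a string before hashing are our assumptions for the example, and the 4-bit code is purely illustrative, far below the lengths used for actual security):

```python
import hashlib
import numpy as np

def hash_code(bits) -> str:
    # SHA-512 over the serialized bit string; this digest is what is stored
    # as the protected template (the serialization scheme is an assumption).
    return hashlib.sha512("".join(str(int(b)) for b in bits).encode()).hexdigest()

def match_score(network_outputs, stored_template: str) -> float:
    """network_outputs: one sigmoid output vector per crop of the new sample.
    Equation (7): fraction of crops whose binarized code hashes to the
    stored template."""
    hashes = [hash_code(o >= 0.5) for o in network_outputs]
    return sum(h == stored_template for h in hashes) / len(hashes)

code = np.array([1, 0, 1, 1])
template = hash_code(code)
outputs = [np.array([0.9, 0.1, 0.8, 0.7]),   # binarizes to the enrolled code
           np.array([0.2, 0.1, 0.8, 0.7])]   # first bit flips: no match
assert match_score(outputs, template) == 0.5
```

Note that the comparison happens entirely in the hash domain; the plaintext code is never needed at verification time.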
3 Experiments
We now describe the databases, evaluation protocols, and specifics of the parameters used for experimental evaluation.
3.1 Databases
In this study we tackle the problem of using faces as passwords and thus choose face databases collected in controlled environments for experimentation. We use evaluation protocols with variations in lighting, session and pose that would be typical of the application.
The CMU PIE [15] database consists of 41,368 images of 68 people under 13 different poses, 43 different illumination conditions, and with 4 different expressions. We use 5 poses (c27, c05, c29, c09 and c07) and all illumination variations for our experiments. 10 images are randomly chosen for training and the rest are used for testing.
The extended Yale Face Database B [6] contains 2432 images of 38 subjects with frontal pose and under different illumination variations. We use the cropped version of the database for our experiments. Again, we use 10 randomly selected images for training and the rest for testing.
The CMU MultiPIE [7] face database contains more than 750,000 images of 337 people recorded in 4 different sessions, 15 view points and 19 illumination conditions. We use this database to highlight the algorithm’s robustness to changes in session and lighting conditions. We chose the two sessions (3 and 4) with the largest number of common users (198) between them. 10 randomly chosen frontal faces from session 3 were used for enrollment, and all frontal faces from session 4 were used for verification.
3.2 Evaluation Metrics
We use the genuine accept rate (GAR) at 0 false accept rate (FAR) as the evaluation metric. We also report the equal error rate (EER) as an alternative operating point for the system. Since the traintest splits we use are randomly generated, we report the mean and standard deviation of the results for 10 different splits.
3.3 Experimental Parameters
We use the same training procedure for all databases. The CNN architecture consists of two convolutional layers, each followed by a max pooling layer, then two fully connected layers, and finally the output layer. We use the rectifier activation function for all layers, and apply dropout with a probability of 0.5 of discarding activations to both fully connected layers. MEB codes of dimensionality $K$ are assigned to each user. All training images are resized to a fixed size and roughly aligned using eye center locations. For augmentation we extract crops and their flipped counterparts as described in Section 2.3.1. Each crop is also illumination normalized using the algorithm in [19]. We train the network by minimizing the cross entropy loss against the user codes using mini-batch stochastic gradient descent. 5% of the training samples are initially held out for validation to determine the mentioned training parameters. Once the network is trained, the SHA-512 hashes of the codes are stored as the protected templates and the original codes are purged. During verification, crops are extracted from the new sample, preprocessed, and fed through the trained network. Finally, the SHA-512 hash of each crop’s binarized output is calculated and matched to the stored template, yielding the matching score in Equation 7.
3.4 Results
The results of our experiments are shown in Table 1. We report the mean and standard deviation of GAR at zero FAR, and the EER, for the 10 different train-test splits at varying bits of security (BoS), i.e. code lengths $K$. We achieve high GARs on PIE, Extended Yale B and MultiPIE with up to 1024 bits of security at the strict operating point of zero FAR. During experimentation we observed that our results were stable with respect to $K$, making the parameter selectable purely on the basis of the desired template security. A comparison of our results to other face template protection algorithms on the PIE database is shown in Table 2. Our algorithm offers significantly higher template security, with a true 1024 BoS due to the MEB codes. In terms of matching performance we outperform [5], which offers acceptable BoS, and are comparable to [4], which lacks adequate BoS for protection against brute force attacks.
Database   BoS (K)   GAR@0FAR   EER
PIE
Yale
MultiPIE
4 Security Analysis
We analyze the security of the system in a stolen template scenario: the attacker has possession of the protected templates, knowledge of the template generation algorithm, and the CNN parameters. Given these, the attacker’s goal is to extract information about the original biometric of the users. The only assumption we make is that the hash function we use follows the random oracle model. Due to this, given the hash digests, the attacker cannot extract any information about the MEB codes assigned to the users. This removes the possibility of using the CNN parameters to reverse engineer the face from the secure codes. The only way the attacker can get the codes is by brute forcing through all possible values the codes can take, hashing each one, and comparing to the hashed templates. Since an MEB code of length $K$ can take $2^K$ values, the search space even for the minimum code length we use makes brute force attacks computationally infeasible.
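A back-of-envelope sketch of the brute force cost (the 256-bit code length and the hash rate are illustrative assumptions, not figures from the text):

```python
# Brute forcing a K-bit MEB code space requires hashing and comparing up to
# 2**K candidate codes. Even at a generous 10**12 hashes per second, the
# time required is astronomical. K = 256 here is purely illustrative.
K = 256
search_space = 2 ** K
hashes_per_second = 10 ** 12
seconds_per_year = 60 * 60 * 24 * 365
years = search_space / hashes_per_second / seconds_per_year
assert search_space > 10 ** 77   # more candidates than atoms in the universe
assert years > 10 ** 57          # far beyond any feasible computation
```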
Another possible attack is brute force in the input domain, i.e. feeding random noise or faces into the network and hoping for a match. This comes down to the question of the entropy of faces in general, which is beyond the scope of this paper, but we do empirically analyze the behavior of the network under such attacks. So far, the imposter scores have been calculated using other enrolled users. We now analyze the score distribution when face samples that have not been seen by the network are fed to it. For this experiment, we enroll samples from the Extended Yale B database and use all the faces in MultiPIE as imposter samples. In addition to unseen faces, we also feed 1 million samples of random noise through the network. The results of this experiment are shown in Figure 3, with the attack distribution representing the scores for the faces from MultiPIE and the random noise. The scores for the attack data are always zero and well separated from the genuine scores, empirically verifying the security of the system against attacks in the input space.
5 Conclusion and Future Work
We presented a template protection algorithm which achieves provable security by using MEB codes to address the issue of uniformity and by relying on the strength of standard hash functions. We achieved high GARs at the strict operating point of zero FAR, and showed that the exceptional performance of deep CNNs can be utilized to minimize the loss of matching accuracy in template protection algorithms. The current work deals with the problem of using faces as passwords in controlled environments, and we plan to extend our results to faces in uncontrolled environments, other biometric modalities, and broader applications like Microsoft Windows picture passwords. Our future efforts also seek to make a formal analysis of our algorithm from an information theoretic perspective.
References
 [1] M. Ao and S. Z. Li. Near infrared face based biometric key binding. In Advances in Biometrics, pages 376–385. Springer, 2009.

 [2] B. Chen and V. Chandran. Biometric based cryptographic key generation from faces. In Digital Image Computing Techniques and Applications, 9th Biennial Conference of the Australian Pattern Recognition Society on, pages 394–401. IEEE, 2007.
 [3] V. Erin Liong, J. Lu, G. Wang, P. Moulin, and J. Zhou. Deep hashing for compact binary codes learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2475–2483, 2015.
 [4] Y. C. Feng and P. C. Yuen. Binary discriminant analysis for generating binary face template. Information Forensics and Security, IEEE Transactions on, 7(2):613–624, 2012.
 [5] Y. C. Feng, P. C. Yuen, and A. K. Jain. A hybrid approach for generating secure and discriminating face template. Information Forensics and Security, IEEE Transactions on, 5(1):103–117, 2010.
 [6] A. Georghiades, P. Belhumeur, and D. Kriegman. From few to many: Illumination cone models for face recognition under variable lighting and pose. IEEE Trans. Pattern Anal. Mach. Intelligence, 23(6):643–660, 2001.
 [7] R. Gross, I. Matthews, J. Cohn, T. Kanade, and S. Baker. Multi-PIE. Image and Vision Computing, 28(5):807–813, 2010.
 [8] G. E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, and R. R. Salakhutdinov. Improving neural networks by preventing co-adaptation of feature detectors. arXiv preprint arXiv:1207.0580, 2012.
 [9] Y. Kim and K.-A. Toh. A method to enhance face biometric security. In Biometrics: Theory, Applications, and Systems, 2007. BTAS 2007. First IEEE International Conference on, pages 1–6. IEEE, 2007.
 [10] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. In Proceedings of the IEEE, pages 2278–2324, 1998.
 [11] H. Lu, K. Martin, F. Bui, K. Plataniotis, and D. Hatzinakos. Face recognition with biometric encryption for privacy-enhancing self-exclusion. In Digital Signal Processing, 2009 16th International Conference on, pages 1–8. IEEE, 2009.
 [12] D. C. Ngo, A. B. Teoh, and A. Goh. Biometric hash: high-confidence face recognition. Circuits and Systems for Video Technology, IEEE Transactions on, 16(6):771–775, 2006.
 [13] R. K. Pandey and V. Govindaraju. Secure face template generation via local region hashing. In Biometrics (ICB), 2015 International Conference on, pages 1–6. IEEE, 2015.
 [14] M. Savvides, B. V. Kumar, and P. K. Khosla. Cancelable biometric filters for face recognition. In ICPR 2004, volume 3, pages 922–925. IEEE, 2004.
 [15] T. Sim, S. Baker, and M. Bsat. The CMU pose, illumination, and expression (PIE) database. In Automatic Face and Gesture Recognition, 2002. Proceedings. Fifth IEEE International Conference on, pages 46–51. IEEE, 2002.
 [16] Y. Sutcu, Q. Li, and N. Memon. Protecting biometric templates with sketch: Theory and practice. Information Forensics and Security, IEEE Transactions on, 2(3):503–512, 2007.
 [17] Y. Sutcu, H. T. Sencar, and N. Memon. A secure biometric authentication scheme based on robust hashing. In Proceedings of the 7th workshop on Multimedia and security, pages 111–116. ACM, 2005.
 [18] Y. Taigman, M. Yang, M. Ranzato, and L. Wolf. DeepFace: Closing the gap to human-level performance in face verification. In Computer Vision and Pattern Recognition (CVPR), 2014 IEEE Conference on, pages 1701–1708. IEEE, 2014.
 [19] X. Tan and B. Triggs. Enhanced local texture feature sets for face recognition under difficult lighting conditions. Image Processing, IEEE Transactions on, 19(6):1635–1650, 2010.
 [20] A. Teoh and C. T. Yuang. Cancelable biometrics realization with multispace random projections. Systems, Man, and Cybernetics, Part B: Cybernetics, IEEE Transactions on, 37(5):1096–1106, 2007.
 [21] A. B. Teoh, A. Goh, and D. C. Ngo. Random multispace quantization as an analytic mechanism for biohashing of biometric and random identity inputs. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 28(12):1892–1901, 2006.
 [22] A. B. Teoh, D. C. Ngo, and A. Goh. Personalised cryptographic key generation based on facehashing. Computers & Security, 23(7):606–614, 2004.
 [23] M. Van Der Veen, T. Kevenaar, G.-J. Schrijen, T. H. Akkermans, F. Zuo, et al. Face biometrics with renewable templates. In Proceedings of SPIE, volume 6072, page 60720J, 2006.
 [24] Y. Wu and B. Qiu. Transforming a pattern identifier into biometric key generators. In ICME 2010, pages 78–82. IEEE, 2010.