Ear biometrics has become a popular research topic. A recent challenge, the Unconstrained Ear Recognition Challenge, has shown the difficulties of performing person identification from ear images in the wild. To complement the identity-related information in ear images, soft biometric traits, such as age and gender, can serve as auxiliary cues. For this purpose, in this paper, we extensively investigate the tasks of age and gender classification from ear images.
Biometric characteristics are expected to change little over time, to be easy to obtain, and to be unique for each individual. Because of these properties, the ear is an important modality in biometric studies and in forensic science for identification. For example, compared to facial appearance, which is influenced by changes in facial expression, facial hair, or makeup, ear appearance is relatively constant. The auricle is also a defining feature of the face. Among the ear parts, the earlobe is the most frequently used in forensic cases; it is the only part of the ear that continues to grow and change its shape. The ear can still be visible in images captured by security cameras when the face is wholly or partly covered, and can be used as auxiliary information for identification. Also, when faces are viewed in profile, the ear can be easily captured from video recordings or photos.
Although there have been many studies on using ear images for person identification [1, 6], the number of studies on extracting soft biometric traits, such as age and gender, from ear images is limited. To the best of our knowledge, this study is the first work on age classification from ear images. However, there have been a few previous works on using ear images for gender classification [7, 8, 9, 10]. In [7], the ear-hole is used as the reference point for the measurements: the Euclidean distances between the ear hole and seven ear features, identified from masked ear images, are calculated. The authors used an internal database of 342 samples for the experiments and employed a Bayes classifier, a KNN classifier, and neural networks. The best performance, 90.42% classification accuracy, is achieved by KNN. In
[8], profile face images and ear images are used separately and are classified by support vector machines (SVM) with a histogram intersection kernel. Score-level fusion based on Bayesian analysis is performed to improve the accuracy. The 2D images of the UND biometrics dataset, collection F, have been used for the experiments. Fusion leads to 97.65% accuracy, whereas face-only performance is around 95.43% and ear-only accuracy is around 91.78%. In [9], Gabor filters are utilized to extract features, and classification is performed on the extracted features based on dictionary learning: the dictionary is built from training samples and used in the test phase to represent a test sample as a linear combination of the training data. The UND biometric dataset, collection J, which contains large appearance, pose, and illumination variability, has been used in the experiments. The best accuracy reported in the paper, 89.49%, is achieved using 128 features. In [10], gender classification is performed on both 2D and 3D ear images. 3D ears are automatically detected and aligned. The experiments were performed on the UND dataset, collections F and J2. Histogram of Indexed Shapes features were extracted and classified by SVM. The average performance of the system was 92.94%.
In this paper, we present an extensive analysis of age and gender classification from ear images. We have explored the use of both geometric features and appearance-based features for ear representation. The geometric features are based on eight landmarks determined on the ear; from these landmarks, we have calculated 14 different distances as well as two areas to form the feature vector. To classify these features, four different classifiers have been employed: logistic regression, random forests, support vector machines, and neural networks. The appearance-based methods are based on well-known deep convolutional neural network (CNN) models, namely AlexNet, VGG-16, GoogLeNet, and SqueezeNet. They have been fine-tuned twice: first on a large-scale ear dataset to provide domain adaptation, then on the small-scale target ear dataset. In the experiments, the appearance-based methods have outperformed the geometric feature-based methods. We have achieved 94% accuracy for gender classification, exceeding the accuracies attained in previous studies, and 52% accuracy for age classification. In summary, the contributions of the paper can be listed as follows:
We have explored geometric and appearance-based features for age and gender classification from ear images.
For geometric features, we have used eight landmark points on the ear and derived 16 features from them.
We have achieved superior performance for gender classification compared to the previous work. We have presented the first work on age classification from ear images.
The remainder of the paper is organized as follows. In Section II, we explain the geometric features, the classifiers used with them, and the convolutional neural networks used for ear appearance representation and classification. In Section III, we introduce the dataset and experimental setup, and present the obtained results. Finally, in Section IV, we conclude the paper and point out future research directions.
In this section, we present the utilized geometric features and the employed classifiers on them, as well as the appearance-based representation and classification.
2.1 Geometric Features
The identified landmarks and the performed measurements are shown in Fig. 1. The geometric features calculated from these landmarks are listed in Table I, where the Selected column marks the features found important by the random forest classifier. In summary, we have used 8 landmark points and calculated 16 measurements from them to generate the feature vector. To calculate the rectangular area of the ear, we used the outermost point between Obs and Obi on the left, Sa on the upper side, Pa on the right, and Sba at the bottom. To calculate the polygon area of the ear, the Obs, Sa, Pa, Sba, Obi, and T points were used. The remaining measurements are distances between two landmarks, as listed in Table I. Definitions of the utilized landmarks are as follows:
Otobasion superius (Obs) : It is the point where the helix connects to the head in the temporal region, and it determines the upper limit of the junction of the ear with the face.
Otobasion inferius (Obi) : This is the connecting point of the earlobe to the cheek. It determines the lower bound of the junction of the ear with the face.
Tragus (T) : Tragus is a protruding part in front of the hearing canal.
Superaurale (Sa) : It is the highest point of the auricle.
Subaurale (Sba) : It is the lowest point of the auricle. This point varies from person to person: for a person without a free-hanging earlobe (attached earlobes), it is the point at the bottom of the ear where the ear joins the facial skin.
Postaurale (Pa) : It is the outermost point on the back curve of the ear.
Preaurale (Pra) : It is the frontmost point of the ear, at the level where the helix attaches to the head.
Intertragic notch (Intno) : It is the deep notch between the tragus and antitragus.
|Feature||Selected|
|Ear Rectangle Area||+|
|Ear Polygon Area||-|
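As an illustration, such distance and area features can be computed from 2D landmark coordinates as sketched below; the coordinates are hypothetical, and the polygon area uses the shoelace formula:

```python
import math

# Hypothetical 2D landmark coordinates (in pixels); the landmark names follow
# the definitions above, but the values are made up for illustration.
landmarks = {
    "Obs": (30, 10), "Obi": (25, 90), "T": (35, 55), "Sa": (50, 0),
    "Sba": (40, 100), "Pa": (80, 50), "Pra": (20, 50), "Intno": (38, 62),
}

def distance(p, q):
    """Euclidean distance between two landmark points."""
    return math.hypot(p[0] - q[0], p[1] - q[1])

def polygon_area(points):
    """Area of a simple polygon via the shoelace formula."""
    area = 0.0
    for i, (x1, y1) in enumerate(points):
        x2, y2 = points[(i + 1) % len(points)]
        area += x1 * y2 - x2 * y1
    return abs(area) / 2.0

# Example distance feature: ear height, from Superaurale to Subaurale.
ear_height = distance(landmarks["Sa"], landmarks["Sba"])

# Polygon area feature over the six points named in the text.
ear_polygon = polygon_area(
    [landmarks[k] for k in ("Obs", "Sa", "Pa", "Sba", "Obi", "T")]
)
```

The 14 distance features follow the same pattern as `ear_height`, one call per landmark pair listed in Table I.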
Since each geometric feature has a different value range, in order to normalize them, for each feature we have calculated its mean value and the standard deviation on the training set. Then, we have normalized them so that they have zero mean and unit variance.
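A minimal sketch of this normalization, with the statistics computed on the training set only and then applied to any split:

```python
import math

def zscore_fit(train_features):
    """Compute per-feature mean and standard deviation on the training set."""
    n = len(train_features)
    dim = len(train_features[0])
    means = [sum(row[j] for row in train_features) / n for j in range(dim)]
    stds = [
        math.sqrt(sum((row[j] - means[j]) ** 2 for row in train_features) / n)
        for j in range(dim)
    ]
    return means, stds

def zscore_apply(features, means, stds):
    """Normalize so each feature has zero mean and unit variance (train stats)."""
    return [
        [(x - m) / s if s > 0 else 0.0 for x, m, s in zip(row, means, stds)]
        for row in features
    ]
```

Applying `zscore_apply` to the test set with the training-set statistics avoids leaking test information into the normalization.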
Using the random forest classifier, the importance of each feature was measured with respect to its predictive power for the output variable. According to these importances, we determined a threshold value to select features; this way, six of the 16 geometric features were selected. We have observed that, since the amount of available data was limited, this feature selection scheme improved the results.
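This selection step could be sketched with scikit-learn as below; the data is synthetic and the threshold value is an assumption (the paper does not state the one it used):

```python
import random
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data: 100 samples with 16 features, where the label
# depends only on feature 0 so that it receives a high importance score.
random.seed(0)
X = [[random.random() for _ in range(16)] for _ in range(100)]
y = [1 if row[0] > 0.5 else 0 for row in X]

rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Keep features whose importance exceeds the uniform level 1/16
# (an illustrative choice of threshold).
threshold = 1.0 / 16
selected = [i for i, imp in enumerate(rf.feature_importances_) if imp > threshold]
```

Since the importances sum to one, thresholding at the uniform level always discards at least some features.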
2.2 Classification of Geometric Features
In this section, the classifiers which were used with geometric features, are explained and information about their parameters is presented. These classifiers are logistic regression, random forest, support vector machines, and neural networks.
2.2.1 Logistic Regression
Logistic regression provides a linear discrimination between different classes. It models the class probability with the logistic (sigmoid) function and is trained by minimizing the logistic loss. Due to the limited amount of training data, in the experiments we used logistic regression with L2 regularization.
2.2.2 Random Forest
Random forest is an ensemble machine learning algorithm and was used both for classification and for feature selection in this study. A random forest consists of sub-trees, each learned on a part of the training data. The advantage of using sub-trees is that the predictions of the individual sub-trees have lower correlation, since each sub-tree uses a different sample of the training set instead of all of it. In the end, the predictions of the sub-trees are aggregated. In our experiments, we obtained the best result using 1000 sub-trees in the forest.
2.2.3 Support Vector Machines
Support vector machine  is a classifier that finds a decision boundary between two classes while enforcing a margin. When classes cannot be separated linearly, nonlinear kernels are employed. For the gender classification problem we used a binary SVM classifier, but age classification involves more than two classes, so we applied the one-vs-one scheme. In the experiments, the radial basis function (RBF) kernel was used; the gamma value was set to 1/(number of features) and the penalty parameter to C=250. These values were determined empirically according to the accuracies obtained on the validation set.
2.2.4 Neural Network
We have employed a neural network that contains 3 hidden layers. This parameter was again determined empirically according to the accuracies obtained on the validation set. As we increased the number of layers, the training accuracy increased while the validation accuracy decreased, which indicates overfitting due to the limited amount of data.
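The four classifiers with the stated settings could be instantiated as below using scikit-learn; the hidden-layer widths of the neural network are assumptions, since the text only specifies that there are three hidden layers:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier

def build_classifiers():
    """Classifiers configured as described in Section 2.2."""
    return {
        # L2-regularized logistic regression.
        "logreg": LogisticRegression(penalty="l2", max_iter=1000),
        # Random forest with 1000 sub-trees.
        "rf": RandomForestClassifier(n_estimators=1000),
        # RBF kernel; gamma="auto" equals 1 / n_features, and SVC handles
        # multiclass problems with a one-vs-one scheme by default.
        "svm": SVC(kernel="rbf", gamma="auto", C=250),
        # 3 hidden layers; the layer sizes here are assumed, not from the paper.
        "mlp": MLPClassifier(hidden_layer_sizes=(32, 16, 8), max_iter=2000),
    }
```

Each classifier exposes the same `fit`/`predict` interface, so they can be evaluated in a single loop over the normalized geometric features.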
2.3 Appearance-based Representation and Classification
In this study, to represent and classify the ear appearance, we have employed convolutional neural networks. We have benefited from well-known CNN architectures, namely AlexNet, VGG-16, GoogLeNet, and SqueezeNet. At the end of all these deep architectures, a softmax layer is used as the classifier.
The first deep convolutional neural network architecture used in this study is AlexNet. It is one of the most popular CNN architectures, being the winning model of the ILSVRC 2012 challenge. AlexNet contains only five convolutional layers and three fully connected layers; therefore, it is a relatively shallow architecture. In the network training, the dropout method has been used to prevent overfitting.
The VGG architecture has two versions: VGG-16, which was used in this study, and VGG-19. VGG-16 contains 13 convolutional layers and 3 fully connected layers (16 weight layers in total), followed by a softmax classifier as in AlexNet. The main difference between AlexNet and VGG-16 is that VGG-16 is a deeper network and uses many small (3×3) filters.
GoogLeNet is a deeper network, containing 22 layers. It is based on the inception module and is mainly a concatenation of several inception modules. The inception module contains several filters of different sizes; the outputs of the different filters are combined, and this way multiple features are extracted from the input data. The architecture is also efficient in terms of the number of parameters: although it is deeper than AlexNet, it has about twelve times fewer parameters.
The last CNN architecture, SqueezeNet, proposes a new approach to reduce the number of parameters and the model size: 1×1 filters are used rather than 3×3 filters. This architecture also contains residual connections to improve the efficiency of back-propagation learning. In addition, there are no fully connected layers; an average pooling layer is used instead.
We have used pretrained models, which were trained on the ImageNet dataset, and then applied fine-tuning on the ear datasets. In our previous works on age and gender classification from face images and on person identification from ear images, we have shown that transferring a pretrained deep CNN model from a closer domain leads to improved performance. That is, for age and gender classification from face images, transferring a model pretrained on face images, and for person identification from ear images, transferring a model pretrained on ear images, is better than transferring a model pretrained on generic object images, such as the ones from the ImageNet dataset. With this finding in mind, we have applied two-stage fine-tuning, as shown in Fig. 2. In the first stage, we have fine-tuned the pretrained CNN models on a large ear dataset to provide domain adaptation. In the second stage, we have performed fine-tuning once more, this time using our ear dataset that contains age and gender labels; these ear images were obtained from profile face images, as illustrated in Fig. 3. For the first stage, we have used the Multi-PIE ear dataset, which was prepared by processing the Multi-PIE face dataset. It is currently the largest ear dataset, containing 17183 ear images from 205 different subjects, and has been created by running an ear detector on the profile and close-to-profile face images available in the Multi-PIE dataset.
In all training steps, the learning rate of the last fully connected layer of AlexNet, VGG-16, and GoogLeNet has been increased ten times. Increasing the learning rates of the last layers during fine-tuning is a typical approach to improve classification accuracy, since these layers focus more on high-level features and on classification. The output of the softmax layer has been adjusted to the number of classes in all models. The global learning rate has been set to 0.0001 for all models except SqueezeNet, for which we set it to 0.0004.
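In PyTorch terms, such a setup corresponds to optimizer parameter groups with different learning rates. Below, a tiny stand-in model replaces the actual CNN backbones (the layer sizes are arbitrary, chosen only for illustration):

```python
import torch
import torch.nn as nn

# Stand-in network: a small stack in place of AlexNet/VGG-16/GoogLeNet,
# with a final fully connected layer producing 5 age-class scores.
model = nn.Sequential(
    nn.Linear(128, 64), nn.ReLU(),
    nn.Linear(64, 5),  # the "last fully connected" layer
)

base_lr = 1e-4  # global learning rate used for most models in the paper

# Split parameters: the last layer vs. everything else.
last_ids = {id(p) for p in model[2].parameters()}
base_params = [p for p in model.parameters() if id(p) not in last_ids]
last_params = [p for p in model.parameters() if id(p) in last_ids]

# The last layer is trained with a ten times higher learning rate.
optimizer = torch.optim.SGD(
    [
        {"params": base_params, "lr": base_lr},
        {"params": last_params, "lr": 10 * base_lr},
    ],
    momentum=0.9,
)
```

The same parameter-group pattern applies unchanged when `model` is a pretrained backbone whose final classifier layer has been replaced to match the number of age or gender classes.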
3 Experimental Results
Our dataset contains profile face images of 338 different subjects, all over 18 years old. Sample images from the dataset can be seen in Fig. 3. The subjects are categorized into five age groups: 18-28, 29-38, 39-48, 49-58, and 59-68+. The age groups were determined with respect to the changes in the geometric features: the geometric measurements are relatively close to each other within the specified age ranges. As listed in Table II, the distribution of subjects over the age groups is relatively even. Among the 338 subjects, 188 are male and 150 are female. The dataset contains just one profile image per subject. The OpenCV ear detection implementation has been utilized to detect and crop ear regions from the profile face images, and false detections have been eliminated manually. The dataset has been divided into a training set (80%) and a test set (20%); in addition, a validation set has been selected from the training data for validation purposes during training. The distribution of the training and test sets is given in Table III. To prevent an unbalanced class distribution, each age and gender group of the dataset has been divided separately according to the 80%-20% train-test procedure. Moreover, the training and test sets do not contain the same subjects, i.e., the experiments have been conducted in a subject-independent manner.
|Group||Train Data||Validation Data||Test Data|
In the experiments, ear images have been resized to 256×256 pixel resolution for the deep CNN models. During training, five different crops, at 224×224 resolution for VGG-16 and GoogLeNet and 227×227 resolution for AlexNet and SqueezeNet, have been taken from these 256×256 images. During testing, a single crop is taken from the center of the image at the size appropriate for the used architecture.
The number of images in our dataset is limited for training deep convolutional neural network models. To overcome this limitation and increase the amount of data, we have applied data augmentation and obtained 55 different variations of each image. We have also performed data augmentation on the Multi-PIE ear dataset. For data augmentation, we have utilized the imgaug tool (http://github.com/aleju/imgaug). Images have been created with several transformation techniques. First, a flipped version of the original image has been created. Then, images with different brightness levels have been generated by adding positive and negative values to the pixel intensity values; these values are in the range [-55, +55] with an increment of ten. Brightness has also been changed by multiplying the pixel intensity values with constant values chosen between 0.5 and 1.5 in increments of 0.1. To improve the generalization of the deep CNN models, we have applied Gaussian blur and dropout. For Gaussian blur, we have produced blurred images at different levels using the sigma values 0.25, 0.5, 0.75, 1, 1.25, 1.5, and 2. For dropout, some pixels have been dropped to create new noisy images. The last augmentation method, sharpening, has been applied to each image with values chosen between 0.5 and 2.0 in increments of 0.1. After all these processes, we have obtained 14795 training images for age classification and 14960 training images for gender classification.
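The brightness, blur, and sharpening settings described above can be enumerated as plain parameter lists, as in the sketch below; flipping and pixel dropout are omitted here, and the actual pipeline used the imgaug tool:

```python
# Additive brightness offsets: -55 to +55 in steps of ten.
additive_offsets = list(range(-55, 56, 10))

# Multiplicative brightness factors: 0.5 to 1.5 in steps of 0.1.
multiplicative_factors = [round(0.5 + 0.1 * i, 1) for i in range(11)]

# Gaussian blur sigma levels, as listed in the text.
blur_sigmas = [0.25, 0.5, 0.75, 1, 1.25, 1.5, 2]

# Sharpening values: 0.5 to 2.0 in steps of 0.1.
sharpen_values = [round(0.5 + 0.1 * i, 1) for i in range(16)]
```

Each entry in these lists yields one augmented variant per source image, which together with flipping and dropout accounts for the 55 variations mentioned above.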
|Method||Accuracy|
|Logistic Regression||47% (Geometric)|
|Random Forest||47% (Geometric)|
|3 hidden layers NN||58% (Geometric)|
3.1 Gender Classification Results
Gender classification results are presented in Table IV. In the table, the first column gives the name of the classifier and the second the corresponding classification accuracy; to remind the reader of the features used, the feature type is included in parentheses in the second column. As can be seen from the table, the appearance-based approaches are superior to the classifiers that utilize geometric features. Considering that the chance level of correct gender classification is 50%, the results obtained with the geometric features are very poor. One main reason for this inferior performance could be the normalization step applied to the geometric features: during normalization (making the features have zero mean and unit variance), discriminative information about gender might have been lost. Therefore, the effect of normalization requires further analysis. The appearance-based approaches have achieved around 90% accuracy. The best performance has been obtained with the GoogLeNet architecture, with 94% correct classification. This accuracy exceeds the gender classification accuracies achieved in previous studies on gender classification from ear images [7, 8, 9, 10]. A comparison of these approaches is given in Table V. Overall, in compliance with the findings of the previous work, we have found that ear images provide useful information for classifying the gender of a subject.
|Method||Dataset||No. of Img.||Accuracy|
|SVC ||UND Collection F||942||91.7%|
|Majority Voting ||UND Collection J2||2430||89.49%|
|SVM ||UND Collection J2||2430||91.92%|
|SVM ||UND Collection F||942||92.94%|
|KNN ||Internal dataset||342||90.42%|
3.2 Age Classification Results
Age classification results are presented in Table VI. The first column gives the name of the classifier and the second the corresponding classification accuracy; the feature type is again included in parentheses in the second column. This time, the performance gap between the geometric feature-based methods and the appearance-based methods is smaller; however, the appearance-based methods have again been found superior. Using geometric features, the best performance, 43% accuracy, is achieved with the 3-hidden-layer neural network and logistic regression. The best overall performance has been obtained with the appearance-based method using the GoogLeNet architecture, with 52% correct classification. Compared to the performance achieved for gender classification, the age classification accuracy is relatively low. One possible reason for this outcome is the limited number of samples per age group: since the number of classes is higher in age classification, the number of samples per class is smaller. We plan to extend the dataset and analyze the results further. Since the accuracies obtained by the geometric feature-based and appearance-based methods are close, combining the two approaches could be another way to improve the performance. Overall, appearance provides more information than geometric features and has therefore been found more useful for age and gender classification.
|Method||Accuracy|
|Logistic Regression||43% (Geometric)|
|Random Forest||34% (Geometric)|
|3 hidden layers NN||43% (Geometric)|
In this paper, we have presented a thorough study on age and gender classification from ear images. To the best of our knowledge, this study is the first work on age classification from ear images and one of the few studies on gender classification using ear images. In the study, we have employed both geometric features and appearance-based features for ear representation. The geometric features are calculated with respect to eight anthropometric landmarks on the ear and consist of 14 distance measurements and two area calculations. These features have then been classified using four different methods: logistic regression, random forests, support vector machines, and neural networks. The appearance-based methods are based on deep convolutional neural networks: the well-known CNN models AlexNet, VGG-16, GoogLeNet, and SqueezeNet have been adopted for the study. To transfer them efficiently to the task at hand, they have first been fine-tuned on a large-scale ear dataset built from the profile and close-to-profile face images available in the Multi-PIE face dataset; afterwards, the updated models have been fine-tuned again on the small-scale target ear dataset. As a result of the experiments, the appearance-based methods have been found superior to the methods based on geometric features. We have achieved 94% accuracy for gender classification, whereas 52% accuracy has been obtained for age classification. These results indicate that ear images provide useful cues for age and gender classification. However, gender classification using geometric features requires further work: it has been noticed that, for gender classification, the geometric features are sensitive to the normalization, so better normalization schemes have to be explored. For age estimation, we believe the main reason for the lower performance is the lack of a sufficient number of training samples from each age group.
We plan to extend the dataset and train the age classification system with a larger number of samples. We also aim to make comparisons by performing experiments on popular datasets, such as UND-F and UND-J2. Furthermore, we plan to investigate the complementarity between the geometric and appearance-based features, and to combine profile face images and ear images for age and gender classification.
This work was supported by Istanbul Technical University Research Fund, ITU BAP, Project No. MGA-2017-40893.
-  Ž. Emeršič, V. Štruc, and P. Peer, “Ear recognition: More than a survey,” Neurocomputing, vol. 255, pp. 26–39, 2017.
-  Ž. Emeršič, D. Štepec, V. Štruc, P. Peer, A. George, A. Ahmad, E. Omar, T. E. Boult, R. Safdari, Y. Zhou et al., “The unconstrained ear recognition challenge,” arXiv preprint arXiv:1708.06997, 2017.
-  A. Kumar and C. Wu, “Automated human identification using ear imaging,” Pattern Recognition, vol. 45, no. 3, pp. 956–968, 2012.
-  M. S. Yavuz, E. Tatlısumak, B. Özyurt, and M. Aşırdizer, “The investigation of the effects of observers’ gender in personal identification from auricle morphology,” Journal of Forensic Medicine, vol. 27, no. 3, pp. 173–181.
-  M. S. Nixon, I. Bouchrika, B. Arbab-Zavar, and J. N. Carter, “On use of biometrics in forensics: gait and ear,” in Signal Processing Conference, 2010 18th European. IEEE, 2010, pp. 1655–1659.
-  A. Abaza, A. Ross, C. Hebert, M. A. F. Harrison, and M. S. Nixon, “A survey on ear biometrics,” ACM computing surveys (CSUR), vol. 45, no. 2, p. 22, 2013.
-  P. Gnanasivam and S. Muttan, “Gender classification using ear biometrics,” in Proceedings of the Fourth International Conference on Signal and Image Processing 2012 (ICSIP 2012). Springer, 2013, pp. 137–148.
-  G. Zhang and Y. Wang, “Hierarchical and discriminative bag of features for face profile and ear based gender classification,” in Biometrics (IJCB), 2011 International Joint Conference on. IEEE, 2011, pp. 1–8.
-  R. Khorsandi and M. Abdel-Mottaleb, “Gender classification using 2-d ear images and sparse representation,” in Applications of Computer Vision (WACV), 2013 IEEE Workshop on. IEEE, 2013, pp. 461–466.
-  J. Lei, J. Zhou, and M. Abdel-Mottaleb, “Gender classification using automatically detected and aligned 3d ear range data,” in Biometrics (ICB), 2013 International Conference on. IEEE, 2013, pp. 1–7.
-  P. Yan and K. Bowyer, “Empirical evaluation of advanced ear biometrics,” in Computer Vision and Pattern Recognition-Workshops, 2005. CVPR Workshops. IEEE Computer Society Conference on. IEEE, 2005, pp. 41–41.
-  A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in Advances in neural information processing systems, 2012, pp. 1097–1105.
-  K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv preprint arXiv:1409.1556, 2014.
-  C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, “Going deeper with convolutions,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 1–9.
-  F. N. Iandola, S. Han, M. W. Moskewicz, K. Ashraf, W. J. Dally, and K. Keutzer, “Squeezenet: Alexnet-level accuracy with 50x fewer parameters and <0.5 MB model size,” arXiv preprint arXiv:1602.07360, 2016.
-  D. Yaman, F. Eyiokur, and H. K. Ekenel, “Domain adaptation for ear recognition using deep convolutional neural networks,” IET Biometrics, vol. 7, no. 3, pp. 199–206, 2018.
-  R. Gross, I. Matthews, J. Cohn, T. Kanade, and S. Baker, “Multi-pie,” Image and Vision Computing, vol. 28, no. 5, pp. 807–813, 2010.
-  N. Sezgin, “Investigation of age-dependent changes in human faces in digital environment as metric,” Ph.D. dissertation, Istanbul University, Turkey, 2017.
-  P. Guyomarc’h and C. N. Stephan, “The validity of ear prediction guidelines used in facial approximation,” Journal of forensic sciences, vol. 57, no. 6, pp. 1427–1441, 2012.
-  M. de Menezes, R. Rosati, C. Allievi, and C. Sforza, “A photographic system for the three-dimensional study of facial morphology,” The Angle Orthodontist, vol. 79, no. 6, pp. 1070–1077, 2009.
-  D. J. Hurley, B. Arbab-Zavar, and M. S. Nixon, “The ear as a biometric,” in Handbook of biometrics. Springer, 2008, pp. 131–150.
-  C. Cortes and V. Vapnik, “Support-vector networks,” Machine learning, vol. 20, no. 3, pp. 273–297, 1995.
-  N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout: A simple way to prevent neural networks from overfitting,” The Journal of Machine Learning Research, vol. 15, no. 1, pp. 1929–1958, 2014.
-  O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein et al., “Imagenet large scale visual recognition challenge,” arXiv preprint arXiv:1409.0575, 2014.
-  G. Ozbulak, Y. Aytar, and H. K. Ekenel, “How transferable are cnn-based features for age and gender classification?” in Biometrics Special Interest Group (BIOSIG), 2016 International Conference of the. IEEE, 2016, pp. 1–6.
-  (2000) Open source computer vision library. [Online]. Available: https://opencv.org/