Human Recognition Using Face in Computed Tomography

05/28/2020 ∙ by Jiuwen Zhu, et al. ∙ 5

With the mushrooming use of computed tomography (CT) images in clinical decision making, management of CT data becomes increasingly difficult. From the patient identification perspective, using the standard DICOM tag to track patient information is challenged by issues such as misspellings, lost files, and site variation. In this paper, we explore the feasibility of leveraging the faces in 3D CT images as biometric features. Specifically, we propose an automatic processing pipeline that first detects facial landmarks in 3D for ROI extraction and then generates aligned 2D depth images, which are used for automatic recognition. To boost the recognition performance, we employ transfer learning to reduce the data sparsity issue and introduce a group sampling strategy to increase inter-class discrimination when training the recognition network. Our proposed method is capable of capturing underlying identity characteristics in medical images while reducing memory consumption. To test its effectiveness, we curate 600 3D CT images of 280 patients from multiple sources for performance evaluation. Experimental results demonstrate that our method achieves a 1:56 identification accuracy of 92.53% and a 1:1 verification accuracy of 96.12%.




I Introduction

Computed tomography (CT) has become an essential medical imaging modality since its introduction into clinical practice in 1973 [1]. It produces tomographic images by combining many X-ray measurements taken from different angles, and thus reveals internal body structure and pathological characteristics. It is often used to screen for diseases of the head, brain, lung, heart, etc. The widespread use of CT in clinical diagnosis also brings challenges in CT image management. CT images may be partially lost or corrupted due to erroneous operations. For example, CT images are usually stored in the Digital Imaging and Communications in Medicine (DICOM) format, which contains not only the CT data but also the patient’s name and ID. When there are input errors in the name or ID information, or these fields get corrupted, the correspondence between multiple CT scans may no longer be reliable, making it difficult for doctors to review all of a patient’s historical CT scans. In addition, people may visit more than one hospital in order to confirm a diagnosis, and doctors then need to confirm whether the CT images obtained in other hospitals belong to the same patient. There is therefore an increasing demand for automatically associating different CT scans with the corresponding subjects, so that the 3D CT images can be identified, registered, and archived.

Exploring biometrics in medical images has already attracted some research interest [2, 3]. Medical signals such as electrocardiogram (ECG), electroencephalogram (EEG), and blood pressure have been considered informative modalities for the next generation of biometric characteristics [2]. In addition, exploring biometric features in medical images can help protect medical data and provide references for improving biometric security systems. For example, ECG has been employed in biometric applications [3], and an ECG-based biometric authentication algorithm has been proposed for unlocking mobile devices [4]. To the best of our knowledge, our work is the first attempt to discover a biometric identifier from 3D CT images.

The challenge of our work is to capture the most subject-discriminative features from 3D CT images. A CT image is usually a stack of slices, which can be represented as a volume, i.e., a collection of voxels. The number of slices, slice spacing, and voxel resolution generally differ between CT images, which leads to large variation among them. Moreover, 3D CT images are susceptible to a number of artifacts, caused by factors such as patient movement [5], representation method [6], and radiation dose [7]. To achieve the best performance, we focus on the face area, which occupies the majority of the CT image, especially in head and neck CT. Furthermore, face recognition is a prominent and increasingly mature biometric technique for identity authentication [8]. Thus, we utilize the 3D CT face as our biometric.

We investigate existing 3D data processing methods. 3D volumetric data admits a variety of representations and corresponding learning methods. For example, 3D convolutional neural networks (CNNs) [9, 10] are proposed to deal with volumetric data directly. In [9], V-Net, a special type of CNN, is proposed for segmentation of 3D clinical prostate magnetic resonance imaging (MRI) data. However, such 3D CNN-based methods usually require uniform input data (i.e., a fixed number of voxels and volume size), which may not suit our diverse, irregular 3D CT images. Furthermore, due to the computational cost of 3D convolution and data sparsity, the volume representation is limited in resolution. Vote3D [11] is a method proposed for dealing with the sparsity problem. Its main idea is to use a sliding window and a voting mechanism to handle 3D volume data. However, the operation is still based on sparse data, which may result in loss of detailed information. PointNet [12] and PointNet++ [13] are proposed for representation learning on irregular 3D point cloud data, and have reported promising performance on classification and segmentation. However, this advantage is unnecessary for 3D data that can be regularly arranged. PointNet-based methods also limit the input size, usually to fewer than 2048 points. Multi-view CNNs [14] are proposed to render 3D data into 2D images and then apply 2D CNNs for classification tasks. With this design, promising performance has been reported in [14] on shape classification and retrieval tasks.

Since there are a number of excellent classification networks for 2D image-based classification and identification tasks, we also propose to perform 3D CT based person identification by projecting CT into multi-view 2D renderings.

Unlike common computer vision tasks such as image classification and face recognition, which usually have millions of images, medical imaging tasks typically have relatively few images. To cope with this, one popular solution is to apply transfer learning [15, 16], which has proved effective in quantifying the severity of radiographic knee osteoarthritis [15] and in improving schizophrenia (SZ) classification performance [16]. A popular and simple strategy is to fine-tune a pretrained network on the target dataset. In such cases, finding a suitable source domain dataset and minimizing the gap between the source and target domains is crucial. In our task, we use the ample face depth images obtained with RGB-D sensors to perform transfer learning towards CT-based face recognition.

We summarize the main contributions of this work as follows:

  1. To the best of our knowledge, this is the first attempt to explore the biometric characteristic of 3D CT images for human recognition, and we show 3D CT images do contain subject discriminative information, and achieve 92.53% rank-1 identification rate on a dataset with 600 CT images.

  2. We propose an automatic processing pipeline for 3D CTs, which first detects facial landmarks in 3D CTs for ROI extraction and then generates aligned 2D depth images projected from 3D CTs.

  3. We employ transfer learning and a group sampling strategy to handle the small data issue, which is found to improve the person recognition performance by a large margin.

  4. We validate the effectiveness and robustness of our method using datasets from multiple sources. The results show our model is able to capture the discriminative characteristics from CT images.

The remainder of this paper is organized as follows. We briefly review related works in Section II. The proposed 3D CT based recognition method is detailed in Section III. In Section IV, we provide the experiments on several public 3D CT datasets. Discussions of the method and results are provided in Section V. We finally conclude this work in Section VI.

II Related Work

II-A Point Cloud Based 3D Modeling

PointNet [12] is proposed for representation learning from point clouds, a data format consisting of irregularly listed point coordinates. PointNet is a novel neural network that takes a point cloud as input, learns the spatial pattern of each point, and finally aggregates the individual features into a global view. The network structure is simple but efficient. It has been successfully applied to object classification, part segmentation, and scene semantic analysis tasks. A hierarchical feature learning method, PointNet++ [13], is proposed to better capture local structure and to better handle the variable density of point clouds. Its main idea is to extract features hierarchically, from small local scales up to a global pattern. It divides and groups the input points into overlapping local regions by measuring distances in a metric space, and produces higher-level features by applying PointNet. Such a method achieves good performance on irregular point cloud data. However, this advantage is unnecessary for 3D data that can be regularly arranged. Moreover, it demands heavy computation and large memory.

VoxelNet [17], proposed by Apple Inc., offers a novel way of handling point cloud data. It encodes point cloud data into a descriptive volumetric representation by introducing a voxel feature encoding layer, so that the problem becomes one of operating on 3D volume data. However, it has been shown to perform well only on LiDAR datasets, and its performance on other tasks is uncertain. PointGrid [18] integrates points and grids for dense 3D data. It holds a constant number of points in each grid cell and obtains a local, geometrical shape representation. However, the random screening of points causes a loss of detailed information, which makes it difficult to recognize small-scale features.

We note that those methods are usually applied to object identification tasks, in which the target objects differ macroscopically in shape or color and which do not require high-resolution source data. They may not be suitable for tasks that require finer granularity, such as facial texture.

II-B 2D Rendering Based 3D Modeling

3D shape models can be naturally encoded by a 3D convolutional neural network. However, this approach may be limited by computational complexity and restricted memory. In addition, a 3D model is usually defined on its surface, which ignores the observed information reflected in the pixels of 2D images [19]. Su et al. show that applying a 2D CNN to multiple 2D views can achieve better recognition results [20]. They build classifiers of 3D shapes based on 2D rendered images, which outperform classifiers based directly on 3D representations. Compared to the general 3D volume-based method, a 2D CNN on multiple 2D images consumes less memory, which allows an increased resolution and better expresses fine-grained patterns. Another advantage of 2D representations is that they offer additional training data augmentation: the 3D model of a subject is a single data point, so a 3D training dataset lacks richness and diversity, whereas projecting a 3D model into 2D space easily yields a series of related 2D images, which is beneficial for network training. 2D images are also widely available and have already been organized into massive image databases such as ImageNet [21], which offer a plethora of information conducive to pretraining powerful features. The depth information and observed information can also be generated for improved performance [20]. The approach has also achieved high results in recent retrieval benchmarks [22]. Owing to these advantages, it has been adopted for a number of classification and detection tasks. For example, Hegde et al. [23] propose FusionNet for 3D object classification after generating multiple representations. Deng et al. [22] develop a 3D shape retrieval method named CM-VGG based on clock matching and convolutional neural networks. Kalogerakis et al. [24], inspired by the projection strategy, develop a deep architecture for 3D object segmentation by combining multi-view FCNs and a CRF. It learns to adaptively select part-based views to obtain special view-based shape representations.

2D CNNs have achieved a series of breakthroughs in image classification and segmentation tasks. They naturally integrate low-, mid-, and high-level features and classifiers in an end-to-end manner, and the networks are flexible because layers can be stacked in different ways. Among many promising methods, ResNet [25], proposed by He et al., is an efficient and powerful structure that can be easily implemented and modified. It addresses the network degradation problem that arises as the depth of the network increases. Many improvements based on ResNet have been proposed, such as ResNeXt, DenseNet, MobileNet, and ShuffleNet. By combining the benefits of multi-view CNNs with the mature implementation of ResNet, our framework offers a remedy for the small sample sizes that are common in the medical imaging field.

II-C Depth Representation for 3D Object

Recent developments in depth imaging sensors have led to effective applications of depth cameras, which are actively used for a variety of image recognition tasks [26]. Depth cameras produce high-quality depth images and are gaining significance in multimedia analysis and human-machine interaction [27]. The major principle of a depth imaging system is to extract the depth silhouette of the target object and discard the background, which makes it insensitive to lighting conditions and provides more spatial characteristics. Owing to these benefits, depth cameras have proved helpful for many tasks. For example, Papazov et al. [28] employ a commodity depth sensor for 3D head pose estimation and facial landmark localization. Kamal et al. [29] describe a novel method for recognizing human activities from video using depth silhouettes. Raghavendra et al. [30] present a novel face recognition system that focuses on the multiple depth images rendered by a light field camera. Ge et al. [31] generate 2D multi-view projections from a single depth image and fuse them to produce a robust 3D hand pose estimate. They observe that the transformation between the 3D model, the depth image, and multiple 2D projections is practicable and flexible. The idea has also been employed in a multi-view video coding system.


Depth representation can also play an important role in assisting detection or surface segmentation tasks when combined with RGB images [33, 34]. Such studies all take advantage of the strong relevance between the depth representation and the 3D model; in other words, both describe one specific subject. The flexibility of this transformation can be utilized to improve algorithm performance.

In our tasks, we exploit our medical imaging data and make use of the depth representation of facial part for human recognition.

III Proposed Approach

In this paper, we propose an effective approach for exploring the biometric features in CT images for face recognition. We first normalize different 3D CT images to the same spacing and organize them as 3D volumes. Then, we detect 3D facial landmarks in the 3D CT volumes by learning a modified face alignment network (denoted as FAN below). The detected 3D facial landmarks are used for cropping the facial region from the CTs. Next, 3D rendering and 3D-to-2D projection are performed to obtain face depth images, which are used for face recognition. Finally, we leverage transfer learning to adapt a face recognition model pretrained on face depth images in the RGB-D domain to the face depth images computed from CTs. The overview of our approach is illustrated in Fig. 1. Our method takes the uniqueness of CT images into consideration and learns the most discriminative features for face recognition tasks. We adopt transfer learning and a novel training strategy to improve performance. We describe the detailed procedure below.

Fig. 1: An overview of the proposed approach for 3D CT based face recognition.

III-A CT Normalization and Face Landmark Detection

While CTs are widely used in medical diagnosis, there is large variation among different 3D CT images. First, the slice spacing of 3D CT images captured by different CT machines usually differs. Second, the body areas covered by different CT scans usually differ. For example, some CT scans cover the body from the legs to the head, while others contain only the head. Therefore, our first step is to extract the facial regions from these various CT scans and normalize their slice spacing to the same scale.

We extract facial regions from CTs by detecting facial landmarks, as is widely done in face recognition. However, while facial landmark detection in 2D and 3D face images has been widely studied [35, 36], facial landmark detection from CTs is a relatively new problem because of the modality difference between CTs and traditional 2D and 3D face images. We therefore propose a CT facial alignment network (CT-FAN) to localize a set of pre-defined facial landmarks in each CT image. Our CT-FAN consists of an input block with 4 convolution layers and an Hourglass (HG) [37] block. Following the widely used settings in visible RGB image-based face recognition [38, 39], we pre-define facial landmarks on each 3D CT face image, including the left eye center, right eye center, nose tip, and left mouth corner. We have manually annotated these landmarks for all the CT images using 3D Slicer [40]. The landmarks of each CT image form a set of points, each of which is a coordinate in 3D space, and landmark detection from CTs can be formulated as learning a mapping from an input CT image to this landmark set.

Considering the ambiguity of landmarks in 3D CTs, instead of directly predicting the coordinate values of the landmarks, we represent each landmark with a Gaussian blob (or heatmap) centered at the landmark. The goal of our CT-FAN then becomes predicting these heatmaps from the input CT image, and the network parameters are learned with the conventional Adam solver. After training, given a testing CT image, the network predicts the extended (heatmap) landmarks, and each final facial landmark is computed as an average of the 3D points in the corresponding predicted blob. A simple mean square error (MSE) loss is adopted for minimizing the differences.


Then, the facial ROI volume from each CT image can be easily computed based on the predicted landmarks and the bounding box.
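The heatmap representation and the averaging step described above can be sketched in a few lines. This is only an illustration under stated assumptions: the function names, the blob width `sigma`, and the `keep_ratio` used to decide which voxels belong to a predicted blob are all hypothetical, and the intensity-weighted average is one plausible reading of "an average of the 3D points".

```python
import numpy as np

def gaussian_heatmap(shape, center, sigma=2.0):
    """Return a 3D Gaussian blob of the given shape centred at `center` (z, y, x)."""
    zz, yy, xx = np.meshgrid(
        np.arange(shape[0]), np.arange(shape[1]), np.arange(shape[2]), indexing="ij"
    )
    sq_dist = (zz - center[0]) ** 2 + (yy - center[1]) ** 2 + (xx - center[2]) ** 2
    return np.exp(-sq_dist / (2.0 * sigma ** 2))

def decode_landmark(heatmap, keep_ratio=0.5):
    """Average the coordinates of voxels whose response is close to the peak.

    `keep_ratio` is an assumed cut-off; the average is weighted by the response.
    """
    mask = heatmap >= keep_ratio * heatmap.max()
    coords = np.argwhere(mask)      # (num_voxels, 3) integer coordinates
    weights = heatmap[mask]         # responses in the same (C) order as coords
    return (coords * weights[:, None]).sum(axis=0) / weights.sum()

def mse_loss(pred, target):
    """Mean square error between a predicted and a target heatmap."""
    return float(np.mean((pred - target) ** 2))

# Target heatmap for one annotated landmark, and the landmark recovered from it.
target = gaussian_heatmap((24, 24, 24), center=(10, 12, 8))
recovered = decode_landmark(target)
```

In training, `mse_loss` would be evaluated between the network's predicted heatmaps and such targets, one heatmap per landmark.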

Fig. 2: The results of face landmark detection. (a)-(c) are nose landmark detection results from a 3D CT image, and (d)-(f) are nose landmark detection results from another 3D CT image.

III-B 3D Rendering and Depth Image Generation

After obtaining the facial ROI volume, we perform 3D rendering and 3D-to-2D projection to generate CT depth images, which are used for the final face recognition task. Due to the diversity of slice spacing among CTs, scale normalization is crucial. However, conventional scaling by interpolation may lead to voxel holes (i.e., voxels with no value assigned). Therefore, we employ the Visualization Toolkit (VTK) [41, 42] and construct a 3D rendering model for each CT face image. Compared to other visualization methods designed for natural images, VTK is more suitable for medical image management and better restores the original 3D contour information of subjects.

The 3D rendering of a CT face image takes the facial ROI volume as input and is controlled by two parameters used for 3D model construction: an iso-surface extraction threshold and a surface normal rendering angle. One of the two is set to a fixed value, and the other ranges in [-400, -300].

Then, we project the rendered 3D CTs to 2D to obtain CT face depth images. The key to projection is the relationship between 2D and 3D coordinates. VTK can simulate the action of taking a photograph of the rendered model, and it transforms the coordinates naturally. Assuming the rendered model consists of a set of 3D voxel coordinates, the depth image is obtained as a set of 2D pixels by applying a transformation function to these coordinates. The transformation is defined by a number of parameters, such as the camera location, camera orientation, scaling parameter, coordinates of the image center, and size of the 2D image. By default, the camera faces the object.
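To make the idea of a depth image concrete, the sketch below computes a simple orthographic "first hit" depth map directly from a volume with NumPy. It is a stand-in for illustration, not the VTK rendering pipeline used in the paper: the function name is hypothetical, the camera is fixed to look along one array axis, and the default threshold of -350 is simply a value inside the [-400, -300] range mentioned above, on the assumption that this range refers to the iso-surface threshold.

```python
import numpy as np

def depth_image_from_volume(volume, threshold=-350.0, axis=1):
    """Orthographic first-hit depth map of a CT volume.

    Each ray runs along `axis`. The depth is derived from the index of the
    first voxel whose intensity reaches `threshold` (an assumed iso-surface
    value). Rays that hit nothing get 0; hit rays are scaled into (0, 1] so
    that surfaces nearer to the ray origin are brighter.
    """
    solid = volume >= threshold
    hit = solid.any(axis=axis)
    first = solid.argmax(axis=axis)   # index of the first True along each ray
    n = volume.shape[axis]
    return np.where(hit, (n - first) / n, 0.0)

# Toy volume: air everywhere (-1000) except a short column of soft tissue.
toy = np.full((4, 5, 6), -1000.0)
toy[1, 2:, 3] = 40.0
depth = depth_image_from_volume(toy)
```

A perspective camera with configurable location and orientation, as described in the text, would replace the fixed axis-aligned rays used here.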

The parameter setting is closely related to our task. In a common CT scan, the human head may be rotated to different degrees, e.g., some patients look up or bow their heads, and others turn right or left. In addition, our dataset has a limited number of images per patient (most individuals have only one CT), which makes it difficult to acquire CTs with different head poses. Therefore, we use different camera locations for data augmentation to simulate different head poses. The camera location is determined by three types of rotation, i.e., the pitch, roll, and yaw rotation angles. For each 3D CT image, we obtain 90 depth images with different head poses.
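The pose augmentation can be pictured as enumerating a grid of pitch, yaw, and roll angles and rotating the camera accordingly. In the sketch below, the angle values are hypothetical and were chosen only so that the grid contains 6 * 5 * 3 = 90 poses, matching the number of depth images per CT; the axis convention and the function names are likewise assumptions.

```python
import itertools
import math

def rotation_matrix(pitch_deg, yaw_deg, roll_deg):
    """Compose R = Rz(roll) * Ry(yaw) * Rx(pitch) as a 3x3 nested list."""
    p, y, r = (math.radians(a) for a in (pitch_deg, yaw_deg, roll_deg))
    rx = [[1, 0, 0], [0, math.cos(p), -math.sin(p)], [0, math.sin(p), math.cos(p)]]
    ry = [[math.cos(y), 0, math.sin(y)], [0, 1, 0], [-math.sin(y), 0, math.cos(y)]]
    rz = [[math.cos(r), -math.sin(r), 0], [math.sin(r), math.cos(r), 0], [0, 0, 1]]

    def matmul(a, b):
        return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
                for i in range(3)]

    return matmul(rz, matmul(ry, rx))

# Hypothetical angle grids in degrees (not the values used in the paper).
PITCH = [-25, -15, -5, 5, 15, 25]
YAW = [-30, -15, 0, 15, 30]
ROLL = [-10, 0, 10]

def camera_poses(distance=1.0):
    """Rotate a camera that starts in front of the face at (0, 0, distance)."""
    start = [0.0, 0.0, distance]
    poses = []
    for pitch, yaw, roll in itertools.product(PITCH, YAW, ROLL):
        rot = rotation_matrix(pitch, yaw, roll)
        position = [sum(rot[i][k] * start[k] for k in range(3)) for i in range(3)]
        poses.append(((pitch, yaw, roll), position))
    return poses

poses = camera_poses()
```

Each entry of `poses` would then drive one rendering of the same 3D model, giving one depth image per simulated head pose.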

III-C Face Separation

In order to suppress the influence of non-facial noise (neck or hands) and obtain more distinguishable CT images, we then apply face segmentation. A fully convolutional encoder-decoder network (CEDN) [43] is employed for face contour extraction. We use VGG16 [44] as the basic framework and modify it to serve as the encoder. The decoder is a corresponding deconvolution network.

We first train the model on the Helen [45] dataset, which includes 2,000 face images with 68 landmarks on the eyes, nose, mouth, eyebrows, chin, etc. The network is then transferred to the depth images of CTs.

The separated depth images are then obtained by combining each depth image with the corresponding output of the CEDN network.

III-D Image Adjustment

Due to the limited number of samples, which is common in medical image analysis, training from scratch is insufficient for complex tasks. We employ transfer learning and adopt a depth image dataset collected in a natural environment as the source dataset for better performance. To reduce the gap between the source and target domains, we analyze the data distributions of the two datasets. The analysis results are shown in Fig. 3(a) and Fig. 3(b). The pixel values of the natural depth images and those of the CT depth images are concentrated in different ranges. We therefore select an appropriate threshold and normalize the depth images with a normalization function that is governed by a scaling parameter and the threshold. The adjusted result is shown in Fig. 3(c). The normalized image is used as the input to the network.

Fig. 3: The procedure and results of image adjustment. (a) and (b) are data distribution maps from source domain and target domain, respectively. (c) is the normalization results.

III-E Classification Model and Training Strategy

We introduce a modified ResNet50 [25] as our backbone for the face identification and verification tasks, and use the triplet loss [46] as the loss function because of the limited samples. To this end, the anchor, positive, and negative samples in each batch need to be selected appropriately. The loss function for the face recognition task is formulated over the anchor, positive sample, and negative sample of each batch, and involves an absolute value operation and a margin threshold.
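For readers unfamiliar with the triplet loss, a minimal sketch of its common hinge form follows. The exact distance used in the paper is not recoverable from this text, so the plain Euclidean distance and the margin value of 0.2 below are assumptions made for illustration.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two feature vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge on d(anchor, positive) - d(anchor, negative) + margin.

    The loss is zero once the negative is farther from the anchor than the
    positive by at least `margin` (an assumed value).
    """
    return max(euclidean(anchor, positive) - euclidean(anchor, negative) + margin, 0.0)

# An easy triplet (loss 0) and a hard one (positive farther than negative).
easy = triplet_loss([0.0, 0.0], [0.0, 1.0], [3.0, 4.0])
hard = triplet_loss([0.0, 0.0], [3.0, 4.0], [0.0, 1.0])
```

Because only triplets that violate the margin contribute a gradient, how the anchor, positive, and negative samples are drawn for each batch matters, which motivates the sampling strategy that follows.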

To reduce errors in determining the three samples used in the loss calculation, each batch needs to be highly diverse. Since our datasets are collected from multiple sources, there is possible dissimilarity between the 3D CTs of different patients. We therefore employ a group sampling training strategy for 3D CTs, which forces the network to focus on the inter-individual differences between CTs. We first divide the depth images of the same patient, identified by the patient ID (used as the label), into random subgroups. The subgroup size is related to the batch size. Then, a number of labels are randomly selected. Finally, each batch consists of one random subgroup from each selected label.
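The three steps of this sampling strategy can be sketched as follows. The function and parameter names are hypothetical. The values 15 and 18 in the usage lines mirror the 15*18 grouping setting listed in Table II, but which factor is the subgroup size and which is the number of labels per batch is an assumption here.

```python
import random
from collections import defaultdict

def make_subgroups(samples, subgroup_size, rng):
    """Split (image_id, label) pairs into random subgroups that share a label."""
    by_label = defaultdict(list)
    for image_id, label in samples:
        by_label[label].append(image_id)
    subgroups = {}
    for label, ids in by_label.items():
        rng.shuffle(ids)
        subgroups[label] = [ids[i:i + subgroup_size]
                            for i in range(0, len(ids), subgroup_size)]
    return subgroups

def sample_batch(subgroups, labels_per_batch, rng):
    """Pick labels at random, then one random subgroup from each picked label."""
    chosen = rng.sample(sorted(subgroups), labels_per_batch)
    batch = []
    for label in chosen:
        group = rng.choice(subgroups[label])
        batch.extend((image_id, label) for image_id in group)
    return batch

# 20 toy patients with 90 depth images each; one batch of 18 labels x 15 images.
rng = random.Random(0)
samples = [(f"patient{p}_view{v}", p) for p in range(20) for v in range(90)]
subgroups = make_subgroups(samples, subgroup_size=15, rng=rng)
batch = sample_batch(subgroups, labels_per_batch=18, rng=rng)
```

Every batch built this way contains several images of each selected patient, so valid anchor-positive pairs always exist, and several different patients, so negatives are available as well.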

IV Experiments

IV-A Dataset and Protocol

Our CT dataset consists of five open-source head and neck datasets from TCIA Collections, including Head-Neck-PET-CT, QIN-HEADNECK, HNSCC-3DCT-RT, CPTAC-HNSCC, and Head-Neck Cetuximab. The Cancer Imaging Archive (TCIA) is a large archive of medical images. Some datasets also offer other image-related information such as patient treatment results, treatment details, genomics and pathology related information, and expert analysis.

Head-Neck-PET-CT [47] consists of 298 patients from FDG-PET/CT and radiation therapy programs for head and neck (H&N) cancer patients supported by 4 different institutions in Quebec. All patients underwent FDG-PET/CT scans between April 2006 and November 2014. The same transformation is applied to all images, preserving the time interval between serial scans. The patient treatments and image scanning protocols are elaborated in [47]. QIN-HEADNECK is a set of multiple positron emission tomography/computed tomography (PET/CT) 18F-FDG scans, acquired before and after therapy and with follow-up scans, of head and neck cancer patients. The data contribute to the research activities of the National Cancer Institute’s (NCI’s) Quantitative Imaging Network (QIN). HNSCC-3DCT-RT consists of 3D high-resolution fan-beam CT scans of 31 head-and-neck squamous cell carcinoma (HNSCC) patients, acquired with a Siemens 16-slice CT scanner using a standard clinical protocol. CPTAC-HNSCC contains subjects from the National Cancer Institute’s Clinical Proteomic Tumor Analysis Consortium Head-and-Neck cancer (CPTAC-HNSCC) cohort. Head-Neck Cetuximab is a subset of RTOG 0522/ACRIN 4500, a randomized phase III trial of radiation therapy and chemotherapy for stage III and IV head and neck cancer.

In total, 600 3D CT images of 280 patients are collected after integrating and screening the above five datasets. We randomly select 224 subjects for training and the remaining 56 subjects for testing. Through data augmentation, we obtain a total of 54,000 CT depth images (600 CT images with 90 depth images each). Since the number of CT depth images may differ between classes, we divide the training and testing sets with a similar ratio for each class, i.e., each class is split 8:2.

We perform both face identification and verification experiments using the CT depth images obtained by our approach to verify the effectiveness of leveraging CT depth images for human identification. Firstly, we perform face identification and calculate the identification accuracy (ACC) for performance evaluation. One sample from each class in the test dataset is selected for the gallery, and the remaining CT depth images are used as probes. We measure the distance between each probe and all gallery samples, and the class of the gallery sample with the smallest distance is taken as the predicted label. Our gallery set contains 54 CT depth images, and the remaining test CT depth images form the probe set. Face identification simulates the 1 vs. N matching task for an unknown 3D CT in practice. An illustration is shown in Fig. 4. In addition, we perform face verification to replicate 1 vs. 1 face matching scenarios. Specifically, given a random pair of CT depth images, the pair is called a genuine pair if the two images are from the same subject; otherwise, it is called an impostor pair. The verification accuracy (VACC) and the area under the ROC curve (AUC) are adopted for evaluation. The threshold calculated from the ROC curve is used to separate positive and negative pairs.
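The two protocols can be summarized in a short sketch that operates on feature vectors already extracted by the recognition network. The function names, the Euclidean distance, and the toy features are assumptions for illustration; the threshold is passed in directly here rather than derived from an ROC curve.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two feature vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def rank1_accuracy(gallery, probes):
    """1 vs. N identification.

    gallery: {label: feature}, one sample per class.
    probes:  list of (feature, true_label).
    """
    correct = 0
    for feature, label in probes:
        predicted = min(gallery, key=lambda g: euclidean(gallery[g], feature))
        correct += predicted == label
    return correct / len(probes)

def verification_accuracy(pairs, threshold):
    """1 vs. 1 verification.

    pairs: list of (feature_a, feature_b, is_genuine). A pair is accepted as
    genuine when its distance does not exceed `threshold`.
    """
    correct = sum((euclidean(a, b) <= threshold) == is_genuine
                  for a, b, is_genuine in pairs)
    return correct / len(pairs)

# Toy example with two enrolled subjects and 2D features.
gallery = {"subject_a": [0.0, 0.0], "subject_b": [10.0, 0.0]}
probes = [([1.0, 0.0], "subject_a"), ([9.0, 1.0], "subject_b")]
acc = rank1_accuracy(gallery, probes)
```

Sweeping `threshold` over the observed pair distances and recording the true and false positive rates at each value yields the ROC curve from which the AUC and the operating threshold are taken.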

Fig. 4: An example of a CT probe face image and several gallery CT face images after our data processing pipeline.
Model Cropping Rotation Pretrain Mean ACC (std) Mean VACC (std) Mean AUC (std) Data format
VGG16 n .2482(0.026) .6733(0.028) .7518(0.023) 3D volume
VGG16 n .2024(0.059) .6693(0.021) .7272(0.017) 3D volume
VGG16 n .2461(0.085) .6951(0.047) .7589(0.052) 3D volume
VGG16 n .3444(0.041) .7568(0.033) .8265(0.022) 3D volume
ResNet10 n .2930(0.068) .6612(0.034) .7301(0.018) 3D volume
ResNet10 n .2820(0.044) .6829(0.030) .7446(0.039) 3D volume
ResNet10 n .2863(0.060) .7210(0.009) .7989(0.021) 3D volume
ResNet10 n .3571(0.085) .7608(0.026) .8423(0.029) 3D volume
ResNet18 n .3236(0.078) .6831(0.027) .7597(0.019) 3D volume
ResNet18 n .2837(0.048) .6747(0.051) .7342(0.04) 3D volume
ResNet18 n .2877(0.045) .7079(0.032) .7883(0.031) 3D volume
ResNet18 n .2775(0.076) .7318(0.043) .8052(0.041) 3D volume
ResNet34 n .3407(0.049) .7234(0.029) .7893(0.038) 3D volume
ResNet34 n .2930(0.042) .6831(0.032) .7379(0.027) 3D volume
ResNet34 n .3271(0.097) .7620(0.024) .8032(0.058) 3D volume
ResNet34 n .4000(0.094) .7771(0.056) .8535(0.048) 3D volume
PointNet - - n .3634(0.054) .7089(0.039) .7263(0.038) Point cloud
PointNet (ModelNet40 pretrained) - - y .3762(0.045) .6786(0.010) .7529(0.097) Point cloud
PointNet++ - - n .2066(0.013) .6308(0.009) .6907(0.014) Point cloud
PointNet++ (ModelNet40 pretrained) - - y .2055(0.017) .6355(0.009) .6965(0.13) Point cloud
Proposed - - n .8712(0.032) .9374(0.009) .9857(0.004) CT Depth images
Proposed (ImageNet pretrained) - - y .8764(0.051) .9513(0.027) .9849(0.005) CT Depth images
Proposed (Depth images pretrained) - - y .9253(0.044) .9612(0.016) .9936(0.004) CT Depth images

TABLE I: Face identification and verification results of our method and the other baseline methods. ‘Pretrain’ indicates whether a pretrained model is used. Mean ACC, Mean VACC, and Mean AUC indicate the mean identification accuracy, the mean verification accuracy, and the mean area under the ROC curve, respectively. The standard deviations of the 5-fold test are shown in brackets.

Fig. 5: The box plots of face identification and verification. Four baselines, i.e., VGG, ResNet34, pretrained PointNet, and pretrained PointNet++, and our proposed model are shown.
Fig. 6: The ROC curves of face verification, in which (a) highlights the true positive rates (TPR) at false positive rates (FPR) ranging in [0, 0.05], and (b) shows the whole ROC for FPR ranging in [0, 1].
Row ID Grouping Proportion Grouping setting (E*L) Pretrain Data stages Mean ACC (std) Mean VACC (std) Mean AUC (std)
1 15*18 y Projected .8778(0.034) .9419(0.024) .9708(0.020)
2 Segmented .8842(0.048) .9543(0.024) .9831(0.009)
3 3*90 y Final .8512(0.035) .9312(0.012) .9785 (0.008)
4 5*54 .8517(0.024) .9745(0.015) .9208(0.008)
5 10*27 .8928(0.030) .9349(0.013) .9807(0.002)
6 18*15 .8652(0.032) .9316(0.008) .9819(0.004)
7 30*9 .8709(0.027) .9417(0.007) .9821(0.009)
8 90*3 .8648(0.031) .9267(0.010) .9797(0.006)
9 15*18 n Final .7485(0.025) .8956(0.009) .9632(0.009)
10 y .8670(0.020) .9402(0.009) .9860(0.005)
11 n .8192(0.032) .9170(0.016) .9771(0.006)
12 y .9097(0.012) .9471(0.015) .9888(0.006)
13 n .8377(0.034) .9328(0.012) .9809(0.003)
14 y .8516(0.031) .9325(0.011) .9825(0.004)
15 15*18 n Final .8712(0.032) .9374(0.009) .9857(0.004)
16 y .9253(0.044) .9612(0.016) .9936(0.004)
TABLE II: Face identification and verification results of our method with different training settings. Grouping and Proportion relate to different choices of training strategy. Grouping setting, Pretrain, and Data stages indicate the parameter values (E and L), whether a pretrained model is used, and the data processing stage, respectively.
Fig. 7: Face identification and verification results on our CT dataset when gradually adding the individual steps of our processing pipeline.
Fig. 8: Examples of face depth images: (a) source-domain face depth images of one subject from the RGB-D dataset in [48], and (b-d) 2D face depth images generated from 3D CTs of three subjects.

IV-B Implementation Details

We employ transfer learning to mitigate the challenges caused by the small sample size of our CT dataset. We first train our classification model, ResNet50, on depth images from an RGB-D face dataset that contains 581,366 depth images captured from different shooting angles of 450 subjects [49, 48]. The model is then fine-tuned on CT depth images.

For the training strategy, we use the grouping setting E*L = 15*18 reported in Table II and keep the batch size the same across experiments. The CT depth images are normalized and randomly cropped before being fed to the network. The classification model for CT depth images is implemented in PyTorch and trained with the Adam optimizer from a preset initial learning rate.

For the PointNet and PointNet++ baselines, we extract point clouds from the CT depth images so that they can be fed to these networks. Specifically, we sample 4096 points from each depth image as input and set the batch size to 48. We use the multi-scale grouping (MSG) model for PointNet++. For 3D volume-based baselines such as 3D VGG, 3D ResNet10, 3D ResNet18, and 3D ResNet34, the 3D CTs produced by different pre-processing methods are fed into the 3D network directly. We employ 5-fold cross validation for all baseline methods and our model.
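The point-cloud extraction for the PointNet baselines can be illustrated with a minimal sketch. It assumes that background pixels have depth 0 and that each pixel maps to an (x, y, depth) point; neither detail is specified in the text, and the function names `depth_to_points` and `sample_points` are hypothetical.

```python
import random


def depth_to_points(depth, background=0.0):
    """Convert a 2D depth image (list of rows) into (x, y, z) points.

    Pixels equal to `background` are skipped. Treating 0 as background
    is an assumption made for this sketch.
    """
    points = []
    for y, row in enumerate(depth):
        for x, z in enumerate(row):
            if z != background:
                points.append((float(x), float(y), float(z)))
    return points


def sample_points(points, n_points=4096, seed=0):
    """Return exactly n_points points.

    Samples without replacement when enough points exist; otherwise
    pads with randomly repeated points so the output size is fixed.
    """
    if not points:
        raise ValueError("no foreground points to sample from")
    rng = random.Random(seed)
    if len(points) >= n_points:
        return rng.sample(points, n_points)
    padding = [rng.choice(points) for _ in range(n_points - len(points))]
    return points + padding


# Toy 100x100 depth image with an 80x80 foreground region.
depth = [[0.0] * 100 for _ in range(100)]
for y in range(10, 90):
    for x in range(10, 90):
        depth[y][x] = 1.0 + 0.01 * x

cloud = sample_points(depth_to_points(depth), n_points=4096)
```

Fixing the number of points per sample keeps every input the same shape, which is what allows the clouds to be stacked into batches.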

IV-C Results of Our Model and Baselines

First, we compare our method with other 3D-based methods, such as PointNet, PointNet++, and 3D ResNet, on the 1 vs. 54 identification task. The results are summarized in Table I. Our method performs much better than the baseline methods. Specifically, our pretrained model obtains an identification accuracy of 92.53%, more than 52.53% higher than the best baseline method (3D ResNet34), and our un-pretrained model obtains an identification accuracy of 87.12%, 50.78% higher than the best baseline method (PointNet). The 3D CNNs achieve lower accuracy, which may be due to the small number of training samples, since each 3D CT provides only one data point. We also evaluate data augmentation by random 3D rotation for the 3D CNNs. However, such augmentation did not help, possibly because of the 3D deformation it introduces. In addition, it is difficult to find a large 3D dataset similar to 3D CTs for pre-training. The performance of PointNet and PointNet++ is also weak, because their limited input representation may lose detailed information. We further evaluate the pretrained PointNet and PointNet++ models mentioned in [13]. Transferring from ModelNet40 does not help, which is largely due to the gap between CT depth images and natural 3D point clouds. Such point-cloud-based models are also limited in terms of input size. In contrast, our model is designed to learn discriminative features. We also draw box plots and ROC curves in Fig. 5 and Fig. 6 using the best model of each network, i.e., VGG with cropping and rotation, ResNet34 with cropping and rotation, PointNet pretrained on ModelNet40, and PointNet++ pretrained on ModelNet40. The ROC results suggest that our model achieves the best performance, indicating a robust feature learning capacity.
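To make the identification metric concrete, the following sketch computes a rank-1 (1 vs. N) identification accuracy. It assumes that each image is represented by a feature embedding and that a probe is assigned to the gallery subject with the highest cosine similarity; the matching rule, the toy embeddings, and the function names are illustrative assumptions rather than details taken from the paper.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank1_accuracy(probes, gallery):
    """Fraction of probes whose most similar gallery entry has the true identity.

    probes:  list of (true_id, embedding)
    gallery: dict mapping subject id -> embedding (one entry per subject)
    """
    correct = 0
    for true_id, embedding in probes:
        best_id = max(gallery, key=lambda gid: cosine(embedding, gallery[gid]))
        if best_id == true_id:
            correct += 1
    return correct / len(probes)


# Toy example with three enrolled subjects and four probes.
gallery = {"A": [1.0, 0.0, 0.0], "B": [0.0, 1.0, 0.0], "C": [0.0, 0.0, 1.0]}
probes = [
    ("A", [0.9, 0.1, 0.0]),
    ("B", [0.2, 0.8, 0.1]),
    ("C", [0.6, 0.1, 0.5]),  # closer to A than to C, so it is misidentified
    ("C", [0.1, 0.0, 0.9]),
]
acc = rank1_accuracy(probes, gallery)  # 3 of 4 probes are correct
```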

We then evaluate the 1 vs. 1 verification task; the results are shown in Table I. Our method obtains the best verification performance among all compared methods. Specifically, it achieves a verification accuracy of 96.12%, 18.41% higher than the best baseline method (ResNet34). In addition, the AUC of our method (0.9936) suggests strong discriminative performance, indicating that it can reliably identify subjects from 3D CTs. The 3D volume-based and point-cloud-based baselines prove less effective than our model.
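The verification metrics reported above (ROC, AUC, and the TPR at low FPR highlighted in Fig. 6) can be computed from pairwise similarity scores as in the sketch below. The scores and labels are toy values, and the function names are hypothetical; the paper does not describe its evaluation code.

```python
def roc_curve(scores, labels):
    """Return (fpr, tpr) lists obtained by sweeping a decision threshold.

    scores: similarity score per image pair (higher means more similar)
    labels: 1 for a same-subject pair, 0 for a different-subject pair
    """
    pairs = sorted(zip(scores, labels), key=lambda t: -t[0])
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    fpr, tpr = [0.0], [0.0]
    tp = fp = 0
    i = 0
    while i < len(pairs):
        threshold = pairs[i][0]
        # Consume all pairs tied at this score before emitting a point.
        while i < len(pairs) and pairs[i][0] == threshold:
            if pairs[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        fpr.append(fp / n_neg)
        tpr.append(tp / n_pos)
    return fpr, tpr


def auc(fpr, tpr):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum(
        (fpr[k] - fpr[k - 1]) * (tpr[k] + tpr[k - 1]) / 2
        for k in range(1, len(fpr))
    )


def tpr_at_fpr(fpr, tpr, max_fpr):
    """Highest TPR reachable while keeping FPR at or below max_fpr."""
    return max(t for f, t in zip(fpr, tpr) if f <= max_fpr)


scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3]
labels = [1, 1, 0, 1, 0, 0]
fpr, tpr = roc_curve(scores, labels)
area = auc(fpr, tpr)
tpr_low = tpr_at_fpr(fpr, tpr, max_fpr=0.05)
```

Reporting the TPR at a small FPR, as in Fig. 6(a), is useful for patient matching because a false match between two patients is usually more costly than a missed match.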

IV-D Results of Different Training Settings

To validate the robustness of our training strategy, we compare it with other training settings. There are two important operations in our proposed method. First, we shuffle the samples during subgroup generation, so that each subgroup can contain depth images generated from different 3D CTs of the same patient. Second, we split the training and testing sets according to the number of 3D CTs obtained from the same patient, and we expect this partition to be beneficial. To verify this, we perform experiments in which at least one of the two operations is removed, while the other parameters are kept the same as in the original setting. The results are shown in rows 9-16 of Table II. Both operations are beneficial, and each one brings some improvement. With both operations, the identification accuracy improves by at least 3.35% and 1.56% over the un-pretrained and pretrained models trained without both operations, respectively. These results indicate that a suitable partition of training and testing data is crucial for 3D CTs collected from multiple sources.
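One possible reading of the second operation is a patient-level split that is stratified by the number of CT scans per patient, so that every fold contains a similar mix of single-scan and multi-scan patients. The sketch below illustrates that reading only; the exact partition rule is not given in the text, and `stratified_patient_folds` is a hypothetical helper.

```python
import random
from collections import defaultdict


def stratified_patient_folds(scans_per_patient, k=5, seed=0):
    """Assign patients to k folds, balancing the number of scans per patient.

    scans_per_patient: dict mapping patient id -> number of 3D CT scans.
    Patients with the same scan count are shuffled and dealt round-robin,
    so each fold receives a similar share of every scan-count group.
    """
    rng = random.Random(seed)
    by_count = defaultdict(list)
    for patient_id, n_scans in scans_per_patient.items():
        by_count[n_scans].append(patient_id)

    folds = [[] for _ in range(k)]
    slot = 0
    for n_scans in sorted(by_count):
        group = sorted(by_count[n_scans])
        rng.shuffle(group)
        for patient_id in group:
            folds[slot % k].append(patient_id)
            slot += 1
    return folds


# Toy cohort: 10 patients with 1 scan, 5 with 2 scans, 5 with 3 scans.
scans = {f"p{i:02d}": 1 for i in range(10)}
scans.update({f"p{i:02d}": 2 for i in range(10, 15)})
scans.update({f"p{i:02d}": 3 for i in range(15, 20)})

folds = stratified_patient_folds(scans, k=5, seed=0)
test_patients = folds[0]
train_patients = [p for fold in folds[1:] for p in fold]
```

Splitting by patient rather than by image also keeps all depth images of one patient on the same side of the train/test boundary within a fold.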

There are two parameters in our training strategy, i.e., the subgroup size E and the selected label number L. To validate the robustness of our training strategy, we evaluate different parameter settings with the same batch size. The results are shown in rows 3-8 and 16 of Table II. The best result is obtained with the setting E*L = 15*18. The triplet loss is sensitive to the choice of positive and negative samples. Thus, our design of each mini-batch enables the network to focus on the inter-individual differences of 3D CTs.
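The grouping strategy can be sketched as a mini-batch sampler. The sketch assumes that a batch is built by drawing L distinct patient labels and then E depth images for each label, giving E*L images per batch (270 for every setting listed in Table II). This interpretation, the toy data, and the name `sample_group_batch` are assumptions for illustration.

```python
import random
from collections import Counter


def sample_group_batch(images_by_label, E, L, rng):
    """Draw L distinct labels and E images per label.

    images_by_label: dict mapping patient label -> list of image ids.
    Returns a list of (label, image_id) pairs of length E * L.
    Labels with fewer than E images are sampled with replacement.
    """
    labels = rng.sample(sorted(images_by_label), L)
    batch = []
    for label in labels:
        pool = images_by_label[label]
        if len(pool) >= E:
            picks = rng.sample(pool, E)
        else:
            picks = [rng.choice(pool) for _ in range(E)]
        batch.extend((label, image_id) for image_id in picks)
    return batch


# Toy data: 30 patients with 40 depth images each.
images_by_label = {
    f"patient_{p}": [f"p{p}_img{i}" for i in range(40)] for p in range(30)
}
rng = random.Random(0)
batch = sample_group_batch(images_by_label, E=15, L=18, rng=rng)
per_label = Counter(label for label, _ in batch)
```

Because every batch contains several images of each selected patient alongside images of other patients, positive and negative pairs for a triplet-style loss are always available within the batch.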

To evaluate the necessity of our data processing pipeline, we use the data generated by each step as the network input, including the original projected 2D images (Projected), the face-segmented images (Segmented), and the final adjusted images (Final). The results are shown in rows 1, 2, and 16 of Table II and in Fig. 7. The final processed data achieve the best accuracy and AUC. Each stage improves the identification accuracy, by 0.64% (Projected to Segmented) and 4.11% (Segmented to Final), respectively. The results demonstrate that each step of our pipeline contributes to the final accuracy.

V Discussions

Transfer learning is employed in our tasks. However, there exists a gap between natural depth images and CT depth images. The natural depth images are taken from every frame of four videos per subject. Samples are shown in Fig. 8(a), where each row corresponds to one video. The images from different videos are highly similar, which indicates that the ROI (face area) of a subject is consistent across videos.

Meanwhile, we analyze our CT depth images, especially the images generated from different 3D CTs of the same patient. Unlike natural depth images, these images are not consistent. For example, a patient may have a tube in the nose during one CT scan but not during another, as shown in Fig. 8(b). Such conditions are unusual but do occur in the medical field. In addition, the time interval between multiple CT scans may vary, which can lead to changes in facial shape. Patients who undergo CT scans are generally ill and may become increasingly emaciated when their condition is serious. For example, the images in Fig. 8(c) come from the same patient, and different rows correspond to different 3D CTs. The face in the first row is clearly thinner than that in the second row, which leads to a variation in facial appearance. The images in Fig. 8(d) also come from the same patient. The face in the second row has fuller cheeks, possibly because the patient took a deep breath during scanning. Such conditions may not appear in a laboratory environment. However, our dataset contains such 3D CTs, which may also be common in other medical image collections.

It is difficult for a network to recognize the differences and similarities between an unusual image and typical images. For example, the network may fail to recognize CT depth images with a tube at test time because such images do not appear in the training dataset. Our model is limited in handling such 3D CTs. Using style transfer methods to generate more images covering different cases may be helpful. Style transfer methods are largely based on generative adversarial networks (GANs). However, training such networks on 3D CTs may be limited by the small sample problem, and the generated images may not be close to real images.

Our method transfers a model pretrained on natural depth images. Minimizing the gap between the source domain (natural depth images) and the target domain (CT depth images) is therefore crucial.

VI Conclusions

In this paper, we explore the biometric characteristics of 3D CTs and use them to perform face recognition and verification. We propose an automatic processing pipeline for human recognition based on 3D CT. The pipeline first detects facial landmarks for ROI extraction and then projects the 3D CT faces to 2D to obtain depth images. To address the small training data issue and improve inter-class separability, we use transfer learning and a group sampling strategy to train our classification network on face depth images obtained from 3D CTs. We perform experiments on 3D CTs collected from multiple sources, which show that the proposed method performs better than state-of-the-art methods such as point cloud networks and 3D CNNs. We also validate and discuss the effectiveness of the data processing and transfer learning modules.

Our work is important to the integrity of medical image big data. Because the volume of medical images is inevitably expanding, automatic human recognition technologies such as the proposed CT-based face recognition are necessary to ensure correct associations between different scans of the same patient. In addition, the proposed CT-based face recognition method can be used to identify unknown CT images when the metadata is lost. Our work can improve the efficiency of hospital operations to a certain extent, avoid cumbersome manual inspections, and bring convenience to hospital visits and medical treatment. Our future work includes combining the CT images with other modalities of the same individual to perform multi-modality or cross-modality human recognition.


  • [1] G. N. Hounsfield, “Computerized transverse axial scanning (tomography): Part 1. description of system,” The British Journal of Radiology, vol. 46, no. 552, pp. 1016–1022, 1973.
  • [2] F. Agrafioti, F. M. Bui, and D. Hatzinakos, “Medical biometrics: The perils of ignoring time dependency,” in IEEE International Conference on Biometrics: Theory, Applications, and Systems, 2009, pp. 1–6.
  • [3] F. Porée, G. Kervio, and G. Carrault, “ECG biometric analysis in different physiological recording conditions,” Signal, Image and Video Processing, vol. 10, no. 2, pp. 267–276, 2016.
  • [4] J. S. Arteaga-Falconi, H. Al Osman, and A. El Saddik, “ECG authentication for mobile devices,” IEEE Transactions on Instrumentation and Measurement, vol. 65, no. 3, pp. 591–600, 2015.
  • [5] U. Bhowmik, M. Z. Iqbal, and R. Adhami, “Mitigating motion artifacts in FDK based 3d cone-beam brain imaging system using markers,” Open Engineering, vol. 2, no. 3, pp. 369–382, 2012.
  • [6] L. W. Goldman, “Principles of CT: multislice CT,” Journal of Nuclear Medicine Technology, vol. 36, no. 2, pp. 57–68, 2008.
  • [7] O. Barkan, J. Weill, A. Averbuch, and S. Dekel, “Adaptive compressed tomography sensing,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2013, pp. 2195–2202.
  • [8] M. Wang and W. Deng, “Deep face recognition: A survey,” arXiv preprint arXiv:1804.06655, 2018.
  • [9] F. Milletari, N. Navab, and S.-A. Ahmadi, “V-net: Fully convolutional neural networks for volumetric medical image segmentation,” in IEEE International Conference on 3D Vision, 2016, pp. 565–571.
  • [10] D. Maturana and S. Scherer, “Voxnet: A 3d convolutional neural network for real-time object recognition,” in IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2015, pp. 922–928.
  • [11] D. Z. Wang and I. Posner, “Voting for voting in online point cloud object detection.” in Robotics: Science and Systems, vol. 1, no. 3, 2015, pp. 10–15 607.
  • [12] C. R. Qi, H. Su, K. Mo, and L. J. Guibas, “Pointnet: Deep learning on point sets for 3d classification and segmentation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 652–660.
  • [13] C. R. Qi, L. Yi, H. Su, and L. J. Guibas, “Pointnet++: Deep hierarchical feature learning on point sets in a metric space,” in Advances in Neural Information Processing Systems, 2017, pp. 5099–5108.
  • [14] H. Su, S. Maji, E. Kalogerakis, and E. Learned-Miller, “Multi-view convolutional neural networks for 3d shape recognition,” in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 945–953.
  • [15] J. Antony, K. McGuinness, N. E. O’Connor, and K. Moran, “Quantifying radiographic knee osteoarthritis severity using deep convolutional neural networks,” in IEEE International Conference on Pattern Recognition, 2016, pp. 1195–1200.
  • [16] J. Kim, V. D. Calhoun, E. Shim, and J.-H. Lee, “Deep neural network with weight sparsity control and pre-training extracts hierarchical features and enhances classification performance: Evidence from whole-brain resting-state functional connectivity patterns of schizophrenia,” Neuroimage, vol. 124, pp. 127–146, 2016.
  • [17] Y. Zhou and O. Tuzel, “Voxelnet: End-to-end learning for point cloud based 3d object detection,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 4490–4499.
  • [18] T. Le and Y. Duan, “Pointgrid: A deep network for 3d shape understanding,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 9204–9214.
  • [19] C. R. Qi, H. Su, M. Nießner, A. Dai, M. Yan, and L. J. Guibas, “Volumetric and multi-view cnns for object classification on 3d data,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 5648–5656.
  • [20] H. Su, S. Maji, E. Kalogerakis, and E. Learned-Miller, “Multi-view convolutional neural networks for 3D shape recognition,” in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 945–953.
  • [21] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “Imagenet: A large-scale hierarchical image database,” in IEEE Conference on Computer Vision and Pattern Recognition, 2009, pp. 248–255.
  • [22] M. Savva, F. Yu, H. Su, M. Aono, B. Chen, D. Cohen-Or, W. Deng, H. Su, S. Bai, X. Bai, et al., “Shrec16 track: largescale 3d shape retrieval from shapenet core55,” in Proceedings of the Eurographics Workshop on 3D Object Retrieval, 2016, pp. 89–98.
  • [23] V. Hegde and R. Zadeh, “Fusionnet: 3d object classification using multiple data representations,” arXiv preprint arXiv:1607.05695, 2016.
  • [24] P. Shui, P. Wang, F. Yu, B. Hu, Y. Gan, K. Liu, and Y. Zhang, “3d shape segmentation based on viewpoint entropy and projective fully convolutional networks fusing multi-view features,” in IEEE International Conference on Pattern Recognition, 2018, pp. 1056–1061.
  • [25] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
  • [26] X. Xu, J. Tang, X. Zhang, X. Liu, H. Zhang, and Y. Qiu, “Exploring techniques for vision based human activity recognition: Methods, systems, and evaluation,” Sensors, vol. 13, no. 2, pp. 1635–1650, 2013.
  • [27] L. Zhang, J. Sturm, D. Cremers, and D. Lee, “Real-time human motion tracking using multiple depth cameras,” in IEEE/RSJ International Conference on Intelligent Robots and Systems, 2012, pp. 2389–2395.
  • [28] C. Papazov, T. K. Marks, and M. Jones, “Real-time 3d head pose and facial landmark estimation from depth images using triangular surface patch features,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 4722–4730.
  • [29] S. Kamal, A. Jalal, and D. Kim, “Depth images-based human detection, tracking and activity recognition using spatiotemporal features and modified hmm,” Journal of Electrical Engineering & Technology, vol. 11, no. 3, pp. 1921–1926, 2016.
  • [30] R. Raghavendra, K. B. Raja, and C. Busch, “Presentation attack detection for face recognition using light field camera,” IEEE Transactions on Image Processing, vol. 24, no. 3, pp. 1060–1075, 2015.
  • [31] L. Ge, H. Liang, J. Yuan, and D. Thalmann, “Robust 3d hand pose estimation in single depth images: from single-view cnn to multi-view cnns,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 3593–3601.
  • [32] P. Merkle, A. Smolic, K. Muller, and T. Wiegand, “Multi-view video plus depth representation and coding,” in IEEE International Conference on Image Processing, vol. 1, 2007, pp. I–201.
  • [33] Z. Deng and L. Jan Latecki, “Amodal detection of 3d objects: Inferring 3d bounding boxes from 2d ones in rgb-depth images,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 5762–5770.
  • [34] D. Hoiem and P. Kohli, “Surface segmentation from RGB and depth images,” Aug. 25 2015, US Patent 9,117,281.
  • [35] X. Zhu, Z. Lei, X. Liu, H. Shi, and S. Z. Li, “Face alignment across large poses: A 3d solution,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 146–155.
  • [36] Z. Zhang, P. Luo, C. C. Loy, and X. Tang, “Facial landmark detection by deep multi-task learning,” in European Conference on Computer Vision.   Springer, 2014, pp. 94–108.
  • [37] A. Newell, K. Yang, and J. Deng, “Stacked hourglass networks for human pose estimation,” in European Conference on Computer Vision.   Springer, 2016, pp. 483–499.
  • [38] R. Valle, J. M. Buenaposada, A. Valdes, and L. Baumela, “Face alignment using a 3d deeply-initialized ensemble of regression trees,” vol. 189, 2019, p. 102846.
  • [39] X. Dong, Y. Yan, W. Ouyang, and Y. Yang, “Style aggregated network for facial landmark detection,” 2018, pp. 379–388.
  • [40] S. Pieper, M. Halle, and R. Kikinis, “3d slicer,” in IEEE International Symposium on Biomedical Imaging: Nano to Macro, 2004, pp. 632–635.
  • [41] W. J. Schroeder, B. Lorensen, and K. Martin, The visualization toolkit: an object-oriented approach to 3D graphics.   Kitware, 2004.
  • [42] M. Bozorgi and F. Lindseth, “Gpu-based multi-volume ray casting within VTK for medical applications,” International Journal of Computer Assisted Radiology and Surgery, vol. 10, no. 3, pp. 293–300, 2015.
  • [43] J. Yang, B. Price, S. Cohen, H. Lee, and M.-H. Yang, “Object contour detection with a fully convolutional encoder-decoder network,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 193–202.
  • [44] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv preprint arXiv:1409.1556, 2015.
  • [45] V. Le, J. Brandt, Z. Lin, L. Bourdev, and T. S. Huang, “Interactive facial feature localization,” in European Conference on Computer Vision.   Springer, 2012, pp. 679–692.
  • [46] F. Schroff, D. Kalenichenko, and J. Philbin, “FaceNet: A unified embedding for face recognition and clustering,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 815–823.
  • [47] M. Vallières, E. Kay-Rivest, L. J. Perrin, X. Liem, C. Furstoss, H. J. Aerts, N. Khaouam, P. F. Nguyen-Tan, C.-S. Wang, K. Sultanem, et al., “Radiomics strategies for risk assessment of tumour failure in head-and-neck cancer,” Scientific Reports, vol. 7, no. 1, p. 10117, 2017.
  • [48] J. Cui, H. Han, S. Shan, and X. Chen, “RGB-D face recognition: A comparative study of representative fusion schemes,” in Chinese Conference on Biometric Recognition, 2018, pp. 358–366.
  • [49] H. Zhang, H. Han, J. Cui, S. Shan, and X. Chen, “RGB-D face recognition via deep complementary and common feature learning,” in Proceedings of IEEE International Conference on Automatic Face and Gesture Recognition, 2018, pp. 1–8.