Autism spectrum disorder (ASD) is a neurodevelopmental disorder that affects the social communication and behavior of children [1, 2]. According to the Centers for Disease Control and Prevention, one in 59 children in the United States is diagnosed with ASD (https://www.cdc.gov/ncbddd/autism/data.html). Diagnosing ASD can be difficult because (1) the type and severity of symptoms vary across a wide spectrum, and (2) the behavior of children with autism depends on non-autism-specific factors such as cognitive functioning and age. Facial attributes, including expressions, have been suggested as effective markers in autism-related clinical studies [1, 2, 4, 5].
Convolutional neural networks (CNNs) produce state-of-the-art results for recognizing different facial attributes (e.g., expressions, gender, and action units (AUs)) in the wild [6, 7, 8, 9, 10]. The high accuracy achieved in these recognition tasks can be attributed to large-scale labeled datasets, such as AffectNet [6] and EmotioNet [11], that enable CNNs to learn rich and generalizable representations. However, datasets at such scale do not exist for ASD, making it difficult to apply CNNs directly in the autism field.
In this paper, we introduce a system for ASD classification using facial attributes. Along with two widely used categorical facial attributes (facial expressions and AUs) for natural images, our system also predicts two continuous facial affect attributes (arousal and valence) that have been found to be effective in autism-related clinical studies. For simplicity, we use the term facial attributes to refer to facial expressions, AUs, arousal, and valence. Since there are no publicly available datasets for autism with all of these attributes, we learn representations for them by leveraging two large-scale facial datasets of natural images that are collected in a wide variety of settings, including age, gender, race, pose, and lighting variations. The contributions of this work are: (1) an ASD classification system based on facial attributes, (2) a statistical analysis showing the importance of these facial attributes in improving the performance of our system, and (3) an analysis of single-task vs. multi-task learning for facial attribute recognition.
2 Related Work
We train a CNN-based model that takes a facial image as input and outputs four facial attributes to be used for ASD prediction. In this section, we briefly review existing work on facial attribute recognition and its application in autism.
Facial attribute recognition: With recently curated, large-scale datasets [6, 11, 13, 14, 15], it has become possible to train CNNs for facial attribute recognition. These networks can learn facial representations either independently [7, 16, 17] or simultaneously. Most existing datasets contain annotations for one or two facial attributes. In this work, we combine two large-scale datasets [6, 11] and train a model to produce four facial attributes simultaneously for ASD classification.
Facial attributes, including expressions, have been suggested as effective markers for autism [1, 2, 4, 5]. With recent developments in technology, including sensors and artificial intelligence, affective computing is gaining interest in the autism community. Egger et al. [19] use head orientation and expression to study autism-related behavior. Rudovic et al. [12] use facial landmarks and body pose along with captured audio and bio-signals for automatic perception of children’s affective state and engagement. In this work, we use representations of different facial attributes for ASD classification.
3 Our System for ASD classification
Our system, shown in Fig. 1, takes a video as input and uses a CNN to extract four facial attributes per frame from the participant’s detected face: facial expressions, AUs, arousal, and valence. (We use a HoG-based face detector for its good trade-off between speed and accuracy on an iPad; any other face detector can be used.) The outputs corresponding to the four facial attributes are concatenated to form a $d$-dimensional feature vector per frame, represented as $\mathbf{f} = [\mathbf{f}_{au}, \mathbf{f}_{exp}, f_{aro}, f_{val}]$, where $\mathbf{f}_{au}$ and $\mathbf{f}_{exp}$ are feature vectors corresponding to the AUs and expressions available in the dataset, while $f_{aro}$ and $f_{val}$ are scalar values between -1 and 1 that correspond to arousal and valence, respectively. We apply temporal feature extraction methods to the sequence of per-frame vectors $\mathbf{f}$ to extract a single lower-dimensional temporal feature vector per video. Each temporal feature vector is fed to a binary classifier for ASD prediction. In this section, we describe our system for ASD classification in detail.
Facial attribute recognition: In this paper, we are interested in ASD classification. However, there are no large publicly available datasets that provide annotated videos with facial attributes as labels for ASD. Therefore, we use publicly available large-scale datasets that provide one or more facial attributes for natural images in the wild. We use these datasets to train a CNN-based model that simultaneously predicts different facial attributes. Our network is a standard CNN that learns spatial representations by stacking convolution and down-sampling units, as shown in Table 1. During training, we minimize the following multi-task loss function:
$$\min_{\mathbf{w}} \sum_{t=1}^{T} \sum_{i=1}^{N} \mathcal{L}_t\left(y_i^t, f(x_i; \mathbf{w})\right) + \Phi(\mathbf{w})$$
Here, $f$ is a function of the input $x_i$ and the learnable parameters $\mathbf{w}$, $y_i^t$ is the label of sample $i$ for task $t$, $\mathcal{L}_t$ is a task-specific loss function, $\Phi$ is the regularization term, $T$ is the number of tasks, and $N$ is the number of data samples.
ASD classification: After training our model on the publicly available facial datasets, we generate a $d$-dimensional feature vector for each frame in the participant video by feeding the video into our trained CNN model frame by frame. These vectors are stacked to form a feature matrix $\mathbf{F} \in \mathbb{R}^{n \times d}$ per video, where $n$ denotes the total number of frames in the video. Due to the temporal nature of the data, there may be redundancies in the feature matrix that could hinder the analysis of the differences between ASD and non-ASD participants. Therefore, we project this high-dimensional feature matrix to an $m$-dimensional vector using temporal analysis methods. In particular, we compute a mean vector $\boldsymbol{\mu} \in \mathbb{R}^{d}$ and a standard deviation vector $\boldsymbol{\sigma} \in \mathbb{R}^{d}$ that contain the mean and standard deviation of each feature across frames.
In addition, we compute an activation vector $\mathbf{a}$ that captures the mean activation time per action unit, because of its significance in interpretability. We define $\mathbf{a}$ as:
$$\mathbf{a} = \frac{1}{n} \sum_{j=1}^{n} \mathbb{1}\left[\mathbf{f}_{au}^{(j)} > \tau\right]$$
where $\mathbf{f}_{au}^{(j)}$ is the AU output for frame $j$, $\mathbb{1}[\cdot]$ is an indicator function applied element-wise, and $\tau$ is a threshold. We use a fixed value of $\tau$ in our experiments.
Similarly, because it has been shown that the percentages of positive arousal and positive valence frames are meaningful for autism-related studies [19, 12], we also compute these features. We concatenate the vectors and scalars obtained after temporal analysis to produce an $m$-dimensional feature vector $\mathbf{v}$. We feed $\mathbf{v}$ to a binary classifier to predict whether the participant is affected by ASD.
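The temporal analysis described above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the implementation used in our experiments: the function name `temporal_features`, the column layout (AUs first, then expressions, then arousal and valence, as in Section 4.2), and the default threshold `tau=0.5` are assumptions made for the example.

```python
import numpy as np

def temporal_features(frames, num_aus=12, num_expr=8, tau=0.5):
    """Collapse an (n_frames, d) matrix of per-frame outputs into one vector.

    Assumed column layout: [AUs | expressions | arousal | valence].
    tau is a hypothetical AU activation threshold.
    """
    frames = np.asarray(frames, dtype=float)

    # Mean and standard deviation of every feature across frames.
    mean = frames.mean(axis=0)
    std = frames.std(axis=0)

    # Fraction of frames in which each AU exceeds the threshold.
    activation = (frames[:, :num_aus] > tau).mean(axis=0)

    # Fraction of frames with positive arousal and positive valence.
    arousal = frames[:, num_aus + num_expr]
    valence = frames[:, num_aus + num_expr + 1]
    positive = np.array([(arousal > 0).mean(), (valence > 0).mean()])

    return np.concatenate([mean, std, activation, positive])


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    video = rng.uniform(-1.0, 1.0, size=(100, 22))  # synthetic stand-in data
    print(temporal_features(video).shape)
```

With 12 AUs, 8 expressions, and 2 affect values per frame, the output has 2 × 22 + 12 + 2 = 58 entries regardless of the number of frames, which is what makes videos of different lengths comparable.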
In this section, we first study the performance of our system on facial attribute recognition on different facial datasets. We then study the impact of each facial attribute on ASD classification along with their statistical significance.
4.1 Facial attribute recognition
Dataset: Most of the existing datasets provide annotations for one or two facial attributes. To train a network with all four facial attributes (expressions, AUs, arousal, and valence), we combine two publicly available datasets: (1) AffectNet [6] and (2) EmotioNet [11]. The expression, arousal, and valence labels are from AffectNet, and the AU labels are from EmotioNet. Because not all images have labels for all facial attributes, we fill each missing attribute value with UNK, which is ignored during training. The resulting dataset contains about 1.2 million samples. For AffectNet, we split the training set into two subsets: training (285K) and validation (2.4K). Following prior work, we use AffectNet’s validation set as the test set (5.5K). For EmotioNet, we split the training set into three subsets: training (754K), validation (63K), and testing (126K).
We train our models in PyTorch for a total of 30 epochs using stochastic gradient descent with a momentum of 0.9 and an initial learning rate of 0.01. For faster convergence, we decrease the learning rate by 5% after every epoch. The annotations differ across facial attributes: some are continuous (arousal and valence), and some are discrete (AUs and expressions). Therefore, we use task-specific loss functions to learn representations for different facial attributes. In particular, we minimize the cross-entropy loss for expressions, the binary cross-entropy loss for AUs, and the sum of L1 and L2 losses for arousal and valence. For multi-task learning, we use the sum of the task-specific loss functions, similar to [20, 24]. We also use an inverse class probability weighting scheme for each loss function to address class imbalance. We use standard data augmentation strategies such as random flipping, cropping, rotation, and shearing while training our models.
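To make the summed multi-task objective concrete, the following NumPy sketch computes the three task-specific losses and adds them, skipping samples whose label is UNK. It is an illustration under stated assumptions, not our training code: the function names, the dictionary keys, the use of `-1` to mark an unknown expression label, and the use of `NaN` to mark unknown AU, arousal, and valence labels are all hypothetical. The inverse class probability weighting and the regularization term are omitted for brevity.

```python
import numpy as np

UNK_CLASS = -1  # hypothetical marker for a missing expression label

def expression_loss(logits, labels):
    """Softmax cross-entropy, averaged over samples with a known label."""
    known = labels != UNK_CLASS
    if not known.any():
        return 0.0
    z = logits[known]
    z = z - z.max(axis=1, keepdims=True)  # shift for numerical stability
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    rows = np.arange(int(known.sum()))
    return float(-log_probs[rows, labels[known]].mean())

def au_loss(logits, targets):
    """Binary cross-entropy on logits; NaN targets (UNK) are ignored."""
    known = ~np.isnan(targets)
    if not known.any():
        return 0.0
    z, y = logits[known], targets[known]
    # Stable form: max(z, 0) - z*y + log(1 + exp(-|z|))
    return float((np.maximum(z, 0) - z * y + np.log1p(np.exp(-np.abs(z)))).mean())

def affect_loss(pred, target):
    """Sum of L1 and L2 losses; NaN targets (UNK) are ignored."""
    known = ~np.isnan(target)
    if not known.any():
        return 0.0
    diff = pred[known] - target[known]
    return float(np.abs(diff).mean() + (diff ** 2).mean())

def multitask_loss(outputs, labels):
    """Unweighted sum of the task-specific losses."""
    return (expression_loss(outputs["expr"], labels["expr"])
            + au_loss(outputs["au"], labels["au"])
            + affect_loss(outputs["arousal"], labels["arousal"])
            + affect_loss(outputs["valence"], labels["valence"]))
```

Masking per task lets a single batch mix AffectNet images (no AU labels) with EmotioNet images (no expression, arousal, or valence labels) without either dataset contributing spurious gradients to the tasks it does not annotate.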
Results: We use CNNs to predict facial attributes for a given input image in both single-task and multi-task settings. In the single-task set-up, the input image is fed to four different CNNs, where each CNN predicts a different facial attribute. In the multi-task set-up, the input image is fed to a single CNN that predicts all facial attributes at once. A comparison between the single-task and multi-task learning set-ups is shown in Fig. 1(a). Furthermore, different CNN units (e.g., the bottleneck block in ResNet [21]) have been proposed in the literature to learn richer representations. To find a suitable trade-off between accuracy and network complexity, we study three different convolutional units: (1) the Bottleneck unit [21], (2) the EESP unit [23], and (3) the MobileNet unit [22]. Following the conventions in the literature, we use the following metrics to evaluate the performance of our model: (1) the average of F1-score and accuracy for AUs, (2) F1-score for expressions, and (3) correlation coefficient (CC) for arousal and valence.
We make the following observations from the results shown in Fig. 1(b): (1) multi-task learning delivers better performance than single-task learning for all facial attributes except AUs, with a notable improvement for arousal, and (2) the EESP unit delivers similar performance to the Bottleneck and MobileNet units, but is much more efficient and uses far fewer parameters and floating point operations (FLOPs). The second observation is in contrast to other large-scale datasets, such as ImageNet, where complex models deliver better performance. This suggests that facial expression datasets are not as complex as ImageNet and that complex CNN models learn redundant parameters without giving significant performance gains. We note that the recognition performance of our method is on par with existing CNN-based methods [6, 7].
4.2 Application to ASD classification
Dataset: We collect a video dataset of 105 children (ASD: 62 and non-ASD: 43) with one video per participant using an iPad application; 88 of these children (ASD: 49 and non-ASD: 39) finish the experiment and then consent to use their data for our research. The diagnostic labels, ASD or non-ASD, are provided by clinicians based on the neuropsychological tests, which are done independently of these experiments.
During the experiment, each participant watches an expert-designed video stimulus on an iPad. The video stimulus is a compilation of short video clips that pair a dynamic naturalistic scene with a social communication scene. The two clips are shown simultaneously, side by side, on a vertically split iPad screen. While the participant watches the stimulus, our application records the participant’s facial response using the iPad’s front camera. Each recorded video is about 6 minutes and 35 seconds long (9,575 valid frames) per participant.
Methods: We construct a 22-dimensional feature vector per frame from the four facial attributes produced by the CNN. The first 12 values in this vector represent the probability of each action unit, the next 8 values represent the probability of each expression, and the last two values represent the arousal and valence attributes. This results in an $n \times 22$ matrix per participant, where $n$ is the number of valid frames. We then use temporal analysis methods (see Section 3) to construct a 58-dimensional feature vector per participant. (Temporal features can also be learned using methods such as RNNs and temporal CNNs. However, we find that these methods generalize poorly on our dataset, likely because they require a large amount of training data.) This feature vector comprises 44 values of mean and standard deviation per dimension ($22 \times 2$), 12 values representing the mean percentage activation time of the action units, and two values representing the percentage of positive arousal and positive valence. We train seven binary classifiers (logistic regression, LASSO, LDA, QDA, SVM with RBF kernel, XGBoost, and a two-hidden-layer neural network (NN)) using these 58-dimensional feature vectors for ASD classification. Since the dataset is limited, we measure the classification performance (F1 score, sensitivity, and specificity) using leave-one-out cross-validation.
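The evaluation protocol can be sketched as follows. This is a hedged illustration of leave-one-out cross-validation with the three reported metrics, not our experimental code: the nearest-centroid classifier is a deliberately simple stand-in for the seven classifiers listed above, and the function names and the convention that label 1 denotes ASD are assumptions made for the example.

```python
import numpy as np

def nearest_centroid_predict(X_train, y_train, x):
    """Stand-in classifier: assign x to the class with the closest mean."""
    c0 = X_train[y_train == 0].mean(axis=0)
    c1 = X_train[y_train == 1].mean(axis=0)
    return int(np.linalg.norm(x - c1) < np.linalg.norm(x - c0))

def leave_one_out(X, y, predict=nearest_centroid_predict):
    """Hold out each participant once and predict it from all the others."""
    n = len(y)
    preds = np.empty(n, dtype=int)
    for i in range(n):
        keep = np.arange(n) != i
        preds[i] = predict(X[keep], y[keep], X[i])
    return preds

def binary_metrics(y_true, y_pred):
    """F1 score, sensitivity, and specificity with label 1 as positive."""
    tp = int(np.sum((y_true == 1) & (y_pred == 1)))
    tn = int(np.sum((y_true == 0) & (y_pred == 0)))
    fp = int(np.sum((y_true == 0) & (y_pred == 1)))
    fn = int(np.sum((y_true == 1) & (y_pred == 0)))
    return {
        "f1": 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else 0.0,
        "sensitivity": tp / (tp + fn) if (tp + fn) else 0.0,
        "specificity": tn / (tn + fp) if (tn + fp) else 0.0,
    }
```

Given a feature matrix `X` with one 58-dimensional row per participant and a label vector `y`, `binary_metrics(y, leave_one_out(X, y))` returns the three scores. Because every participant is held out exactly once, each prediction comes from a model that never saw that participant, which matters when only 88 videos are available.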
Results: Fig. 2(a) compares the performances of seven ASD classifiers that use representations from different CNNs. Our system achieves the best F1 score, sensitivity, and specificity with the Bottleneck as the base feature extractor. We also note that the ASD classification performance improves by 7% when we add features related to arousal, valence, and facial expressions. This result is consistent with our statistical analysis (Fig. 2(c)) where we found these three attributes are the most significant.
Fig. 2: ASD classification results: (a) comparison of different binary classification methods, (b) impact of different facial attributes on the classification performance with the Bottleneck as the CNN unit, and (c) statistical significance of different facial attributes using Student’s t-test.
We presented an end-to-end system for ASD classification using different facial attributes: facial expressions, AUs, arousal, and valence. The multi-task learning approach used in our experiments is more effective at recognizing different facial attributes than the single-task approach. We also showed that the representations of different facial attributes used in our study are statistically significant and improve the ASD classification performance by about 7%.
Acknowledgement: This work is supported by NIH awards K01 MH104739, R21 MH103550; the NSF Expedition in Socially Assistive Robotics #1139078; and Simons Award #383661. We would like to thank Nicholas Nuechterlein, Erin Barney, James Snider, Minah Kim, Yeojin Amy Ahn, Madeline Aubertine, Kelsey Jackson, Quan Wang, Adham Atyabi, and participants for data collection and participation in this work.
-  A Ting Wang, Mirella Dapretto, Ahmad R Hariri, Marian Sigman, and Susan Y Bookheimer, “Neural correlates of facial affect processing in children and adolescents with autism spectrum disorder,” Journal of the American Academy of Child & Adolescent Psychiatry, vol. 43, no. 4, pp. 481–490, 2004.
-  E Loth, L Garrido, J Ahmad, E Watson, A Duff, and B Duchaine, “Facial expression recognition as a candidate marker for autism spectrum disorder: how frequent and severe are deficits?,” Molecular autism, 2018.
-  Tony Charman, “Variability in neurodevelopmental disorders: evidence from autism spectrum disorders,” in Neurodevelopmental Disorders. 2014.
-  Mirella Dapretto, Mari S Davies, Jennifer H Pfeifer, Ashley A Scott, Marian Sigman, Susan Y Bookheimer, and Marco Iacoboni, “Understanding emotions in others: mirror neuron dysfunction in children with autism spectrum disorders,” Nature neuroscience, 2006.
-  H Ozgen, GS Hellemann, MV De Jonge, FA Beemer, and H van Engeland, “Predictive value of morphological features in patients with autism versus normal controls,” Journal of autism and developmental disorders, 2013.
-  A. Mollahosseini, B. Hasani, and M. H. Mahoor, “Affectnet: A database for facial expression, valence, and arousal computing in the wild,” IEEE Transactions on Affective Computing, 2018.
-  Jiabei Zeng, Shiguang Shan, and Xilin Chen, “Facial expression recognition with inconsistently annotated datasets,” in Proceedings of the European conference on computer vision (ECCV), 2018, pp. 222–237.
-  Yong Li, Jiabei Zeng, Shiguang Shan, and Xilin Chen, “Patch-gated cnn for occlusion-aware facial expression recognition,” in 2018 24th International Conference on Pattern Recognition (ICPR). IEEE, 2018, pp. 2209–2214.
-  Corentin Kervadec, Valentin Vielzeuf, Stéphane Pateux, Alexis Lechervy, and Frédéric Jurie, “Cake: Compact and accurate k-dimensional representation of emotion,” in Image Analysis for Human Facial and Activity Recognition (BMVC Workshop), 2018.
-  C Fabian Benitez-Quiroz, Ramprakash Srinivasan, Qianli Feng, Yan Wang, and Aleix M Martinez, “Emotionet challenge: Recognition of facial expressions of emotion in the wild,” arXiv preprint arXiv:1703.01210, 2017.
-  C Fabian Benitez-Quiroz, Ramprakash Srinivasan, and Aleix M Martinez, “Emotionet: An accurate, real-time algorithm for the automatic annotation of a million facial expressions in the wild,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 5562–5570.
-  Ognjen Rudovic, Jaeryoung Lee, Miles Dai, Bjorn Schuller, and Rosalind Picard, “Personalized machine learning for robot perception of affect and engagement in autism therapy,” Science Robotics, vol. 3, 2018, doi:10.1126/scirobotics.aao6760.
-  Emad Barsoum, Cha Zhang, Cristian Canton Ferrer, and Zhengyou Zhang, “Training deep networks for facial expression recognition with crowd-sourced label distribution,” in Proceedings of the 18th ACM International Conference on Multimodal Interaction. ACM, 2016, pp. 279–283.
-  Manuel G Calvo and Daniel Lundqvist, “Facial expressions of emotion (kdef): Identification under different display-duration conditions,” Behavior research methods, vol. 40, no. 1, pp. 109–115, 2008.
-  Xing Zhang, Lijun Yin, Jeffrey F Cohn, Shaun Canavan, Michael Reale, Andy Horowitz, Peng Liu, and Jeffrey M Girard, “Bp4d-spontaneous: a high-resolution spontaneous 3d dynamic facial expression database,” Image and Vision Computing, vol. 32, no. 10, pp. 692–706, 2014.
-  Olivia Wiles, A Koepke, and Andrew Zisserman, “Self-supervised learning of a facial attribute embedding from video,” arXiv preprint arXiv:1808.06882, 2018.
-  Shan Li and Weihong Deng, “Deep facial expression recognition: A survey,” arXiv preprint arXiv:1804.08348, 2018.
-  Guosheng Hu, Li Liu, Yang Yuan, Zehao Yu, Yang Hua, Zhihong Zhang, Fumin Shen, Ling Shao, Timothy Hospedales, Neil Robertson, et al., “Deep multi-task learning to recognise subtle facial expressions of mental states,” in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 103–119.
-  Helen L Egger, Geraldine Dawson, Jordan Hashemi, Kimberly LH Carpenter, Steven Espinosa, Kathleen Campbell, Samuel Brotkin, Jana Schaich-Borg, Qiang Qiu, Mariano Tepper, et al., “Automatic emotion and attention analysis of young children at home: a researchkit autism feasibility study,” npj Digital Medicine, vol. 1, no. 1, pp. 20, 2018.
-  Theodoros Evgeniou and Massimiliano Pontil, “Regularized multi–task learning,” in Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 2004, pp. 109–117.
-  Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.
-  Andrew G Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam, “Mobilenets: Efficient convolutional neural networks for mobile vision applications,” arXiv preprint arXiv:1704.04861, 2017.
-  Sachin Mehta, Mohammad Rastegari, Linda Shapiro, and Hannaneh Hajishirzi, “Espnetv2: A light-weight, power efficient, and general purpose convolutional neural network,” arXiv preprint arXiv:1811.11431, 2018.
-  Zhanpeng Zhang, Ping Luo, Chen Change Loy, and Xiaoou Tang, “Facial landmark detection by deep multi-task learning,” in European Conference on Computer Vision. Springer, 2014, pp. 94–108.