Close tracking of motor function development in infants could lead to the discovery of prodromal risk markers of developmental disruptions such as autism spectrum disorder [ali2020early], cerebral palsy [centers2012data], and pediatric feeding disorders [lindberg1991early], among other conditions. Screening for motor delays enables earlier, more targeted interventions that have cascading effects on multiple domains of infant development and behavioral conditions, including social communication and interaction, cognitive ability, and working memory capacity [anderson2013role]. However, only about 29% of US children under 5 years of age receive developmental screening due to expense and shortages of testing resources, contributing negatively to lifelong outcomes for infants at risk for developmental delays [canadian2016recommendations].
Over the past decade, deep learning-based assistance systems have been used in the field of automated medical prediction and diagnosis, particularly to process visual data captured from subjects of interest performing daily activities and to detect the onset of diseases [thevenot2017survey]. Many such vision-based systems analyze and predict adult human behavior by estimating and tracking face and body poses via powerful deep learning models. However, deep models trained on large-scale adult pose datasets exhibit limited success in estimating infant poses due to the significant differences in body ratios, typical poses, and activity types. A recent study shows that successful mainstream human body pose estimation algorithms yield poor results when tested on infant images [huang2021infant]. Privacy and security considerations hinder the availability of sufficient infant data required to train a robust deep model from scratch, making this an exceptionally constrained “small data problem” [liu2019bed].
In order to bridge the infant–adult domain gap in facial landmark estimation, in this paper, we (i) introduce InfAnFace, the Infant Annotated Faces dataset of 200 images of infants in the wild with precise 2D facial landmark and attribute annotations, sampled in Fig. 1, (ii) perform a detailed qualitative study of the performance of current facial landmarking algorithms on InfAnFace to quantify the domain gap (see Fig. 2) and establish benchmarks for future tailored algorithms to surpass, and (iii) analyze our results to provide informed guidance for such future work.
2 Related work
Facial landmark estimation has garnered attention thanks to its key role in facial behavior analysis. Traditional methods based on cascaded regression trees have found applications in many areas, such as affective experience prediction [yin2020facial]. More recently, deep learning methods based on convolutional neural networks (CNNs) have dominated the field. Examples include the face alignment network (FAN) model, which achieves 2D and 3D facial landmark localization through stacked hourglass structures with a hierarchical residual block [bulat_how_2017], and the high-resolution network (HRNet) model, which uses multi-resolution convolution to detect 2D facial landmarks [wang_deep_2021]. The 3FabRec model achieves few-shot facial landmark localization by training an auto-encoder to project the image to a latent domain, then predicting landmarks using a lightweight decoder [browatzki_3fabrec_2020].
Evaluation of the robustness of such algorithms is facilitated by several datasets of faces “in the wild,” including: the Helen dataset, which contains 2,330 high resolution images with both 194 and 68 annotated facial landmarks; the annotated faces in the wild (AFW) dataset, which contains 205 images with large variation in head poses [zhu2012face]; and various 300-W image sets discussed below [sagonas_300_2016]. Most facial landmarking models are trained and evaluated using such benchmark datasets, where infant faces are scarcely represented at all (e.g. only making up 1.4% of the AFW dataset). Our work seeks to fill this representational gap by providing a benchmark dataset for infant facial landmarking.
3 The InfAnFace Dataset
Images were canvassed from Google (Images) and YouTube via a wide range of search queries, to represent a diversity of appearances, poses, expressions, scene conditions, and image quality. The target age for infants was between 0 and 12 months. Infants photographed in the wild assume a large variety of face and body angles, and our dataset reflects this natural variation.
The 100 images from Google were converted to PNG format, and the 100 images from YouTube were captured as screenshots in PNG format. Many were cropped to focus on a single subject. For myriad reasons, including differences in search engines and how we used them, and differences in still vs. video capture, the photos from Google appear to be of higher quality than the stills taken from YouTube.
3.1 Landmark and attribute annotations
Each infant image was human-annotated with 68 facial landmarks, adhering to the industry standard Multi-PIE layout [gross_multi-pie_2010], used, for instance, in the popular 300 faces in the wild (300-W) adult faces dataset [sagonas_300_2016]. Following conventions for 2D-based annotations, landmarks obscured by the face itself (e.g. when turned to the side) are assigned to the nearest point on the face which is visible in the image, as the true projected position is hard to estimate. We employed the human–artificial intelligence hybrid tool AH-CoLT [huang2019ah], with the landmark predictions of the FAN model [bulat_how_2017] serving as the starting point for the human annotations. (Consequently, this model cannot be used as an impartial benchmark with which to compare our dataset with others; cf. Section 4.) See Fig. 1 for a sample of annotated images from InfAnFace.
We also include binary annotations for adverse attributes, indicating whether each face is: turned (if the eyes, nose, and mouth are not clearly visible), tilted (if the head axis, projected on the image plane, is or more beyond upright), occluded (if landmarks are covered by body parts or objects), and excessively expressive (if the facial muscles are tense, as when crying, laughing, etc.). To aid our analysis, we define the Common subset of InfAnFace to be those with faces which are free from all four of the annotated adverse attributes, and we define the Challenging subset to be its complement. See Table 1 for a breakdown of the attribute counts and the newly derived subsets against the original Google and YouTube sets.
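The subset definition above can be sketched in a few lines of Python. This is only an illustration: the `FaceAttributes` record, its field names, and the image identifiers are hypothetical and do not reflect the actual annotation file format of InfAnFace.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FaceAttributes:
    """Binary adverse-attribute flags for one face (hypothetical record layout)."""
    turned: bool
    tilted: bool
    occluded: bool
    expressive: bool

    def is_common(self) -> bool:
        # A face is Common only if none of the four adverse attributes is set.
        return not (self.turned or self.tilted or self.occluded or self.expressive)


def split_subsets(annotations):
    """Split {image_id: FaceAttributes} into (common_ids, challenging_ids)."""
    common = sorted(k for k, a in annotations.items() if a.is_common())
    challenging = sorted(k for k, a in annotations.items() if not a.is_common())
    return common, challenging


# Made-up example annotations, for illustration only.
annotations = {
    "img_a": FaceAttributes(False, False, False, False),
    "img_b": FaceAttributes(True, False, False, False),
    "img_c": FaceAttributes(False, False, True, True),
}
common_ids, challenging_ids = split_subsets(annotations)
```

By construction the two subsets are complementary, so every image falls into exactly one of them.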
3.2 Facial landmark characteristics
We compare the InfAnFace dataset against three well-known adult image sets: 300-W Test (600 images), 300-W Common (554 images), and 300-W Challenging (135 images) [sagonas_300_2016]. These are partially drawn from earlier datasets so a wide range of sources is represented. Fig. 3 shows the mean absolute and scaled positions of the ground truth landmarks for the infant faces and the 300-W adult faces. The infant faces appear stouter than the adult ones, and this is confirmed by the ratios of the mean lengths of each minimal bounding box, recorded in Table 2.
The second column of Table 2 shows mean ratios of measurements of particular interest for facial landmarking: the interocular distance (between the outer corners of the eyes) and the minimal bounding box size (the geometric mean of the dimensions of the smallest upright box containing all ground truth landmarks). These are common choices of normalization factor for the normalized mean error, which will be defined in Section 4. While we expect existing facial landmarking systems to perform better on standard adult image sets compared to our infant image sets, these ratios show that this performance gap will be exaggerated under the box normalization, compared to the interocular normalization. Note that these differences could reflect both intrinsic variations in infant and adult facial geometry and also incidental differences like the distribution of facial poses.
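As a minimal sketch of the two normalization factors, the following Python functions compute them from a list of 68 `(x, y)` ground truth landmarks. The index pair used for the outer eye corners (36 and 45, 0-based) is an assumption based on the common ordering of the 68-point layout, and the toy landmark values are made up.

```python
import math

# Assumption: 0-based indices of the outer eye corners in the 68-point layout.
OUTER_EYE_CORNERS = (36, 45)


def interocular_distance(landmarks):
    """Euclidean distance between the outer corners of the eyes."""
    (x1, y1) = landmarks[OUTER_EYE_CORNERS[0]]
    (x2, y2) = landmarks[OUTER_EYE_CORNERS[1]]
    return math.hypot(x2 - x1, y2 - y1)


def box_size(landmarks):
    """Geometric mean of the width and height of the smallest upright
    box containing all landmarks."""
    xs = [x for x, _ in landmarks]
    ys = [y for _, y in landmarks]
    width = max(xs) - min(xs)
    height = max(ys) - min(ys)
    return math.sqrt(width * height)


# Toy face: most points sit at the center, a few define the extent.
landmarks = [(50.0, 50.0)] * 68
landmarks[0] = (0.0, 0.0)      # left extreme, top
landmarks[16] = (100.0, 0.0)   # right extreme
landmarks[8] = (50.0, 64.0)    # bottom extreme
landmarks[36] = (20.0, 30.0)   # outer corner of one eye
landmarks[45] = (80.0, 30.0)   # outer corner of the other eye

iod = interocular_distance(landmarks)  # 60.0
box = box_size(landmarks)              # sqrt(100 * 64) = 80.0
ratio = iod / box                      # 0.75
```

A dataset whose faces have a smaller interocular-to-box ratio will see the same pixel error mapped to a relatively larger normalized error under the interocular factor, and vice versa, which is why the choice of factor affects cross-domain comparisons.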
4 Model evaluation and analysis
In order to situate the infant dataset in relation to existing facial landmarking efforts, we performed predictions using two recent 2D facial landmarking models, HRNet [wang_deep_2021] and 3FabRec [browatzki_3fabrec_2020]. (Specifically, we used the HRNetV2-W18 and 3FabRec lms_300w models, both pretrained on 300-W training data. For evaluations, HRNet was initialized with minimal ground truth bounding boxes, and 3FabRec on 30%-padded ground truth bounding boxes, to match their published error results.) A selection of ground truth and predicted landmarks is shown in Fig. 5.
The main error metric used in facial landmarking is the normalized mean error (NME): the mean Euclidean distance (in pixels) between each predicted landmark and the corresponding ground truth landmark, divided by a normalization factor (in pixels) to achieve scale-independence. We employ the two normalization factors defined in Section 3.2, but as alluded to there, neither is known to be domain-invariant, so our use of both factors and of two models helps us mitigate potential domain biases. Table 3 presents the means of NME in each image subset, under our four configurations. Note that although NME is sometimes reported as a percentage, it can exceed 100 in value.
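The NME definition above translates directly into code. The following is a minimal sketch in plain Python; the function name and the toy coordinates are illustrative, and the normalization factor is passed in so that either the interocular distance or the box size can be used.

```python
import math


def normalized_mean_error(predicted, ground_truth, norm_factor):
    """Mean Euclidean distance between matching landmarks, divided by
    a normalization factor (all quantities in pixels)."""
    if not predicted or len(predicted) != len(ground_truth):
        raise ValueError("landmark lists must be non-empty and of equal length")
    if norm_factor <= 0:
        raise ValueError("normalization factor must be positive")
    total = sum(
        math.hypot(px - gx, py - gy)
        for (px, py), (gx, gy) in zip(predicted, ground_truth)
    )
    return total / len(predicted) / norm_factor


# Toy example with three landmarks: per-point errors are 5, 0, and 4 pixels.
ground_truth = [(0.0, 0.0), (10.0, 0.0), (10.0, 10.0)]
predicted = [(3.0, 4.0), (10.0, 0.0), (10.0, 14.0)]
nme = normalized_mean_error(predicted, ground_truth, norm_factor=10.0)
nme_percent = 100.0 * nme
```

Here the mean pixel error is 3 and the factor is 10, so the NME is 0.3, or 30 when reported as a percentage. Nothing bounds the mean pixel error by the normalization factor, which is why the percentage form can exceed 100.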
[Table 3. Column headers: 300-W Test, 300-W Common, 300-W Challenging, YouTube, Common, Challenging.]
Further characterizations of the error distribution include the failure rate (FR), the fraction of images in the dataset with NME greater than a set threshold, and the area under the curve (AUC) of the cumulative NME distribution up to a set threshold. Table 3 also shows the AUC and FR for NME normalized by the interocular distance, with a threshold of . Fig. 4 plots the cumulative NME distribution curves themselves, under both normalizations, and with a threshold of for greater context.
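Both summary statistics can be computed from a list of per-image NME values. The sketch below makes two assumptions that are not stated in the text: the AUC is divided by the threshold so that it lies in [0, 1] (a common convention), and the threshold of 0.08 and the NME values are hypothetical, chosen only for illustration.

```python
def failure_rate(nmes, threshold):
    """Fraction of images whose NME exceeds the threshold."""
    return sum(1 for e in nmes if e > threshold) / len(nmes)


def auc(nmes, threshold):
    """Area under the empirical cumulative NME distribution on
    [0, threshold], divided by the threshold (assumed convention).

    The empirical distribution is a step function, so its area up to
    the threshold is the mean of max(0, threshold - nme) over images.
    """
    area = sum(max(0.0, threshold - e) for e in nmes) / len(nmes)
    return area / threshold


# Hypothetical per-image NME values and threshold, for illustration only.
nmes = [0.02, 0.04, 0.06, 0.12]
threshold = 0.08

fr = failure_rate(nmes, threshold)    # one of four images fails
auc_value = auc(nmes, threshold)      # (0.06 + 0.04 + 0.02 + 0) / 4 / 0.08
```

In this toy case the FR is 0.25 and the AUC is 0.375. Unlike the mean NME, neither statistic is affected by how far a failing image lies beyond the threshold, which makes them more robust to a few extreme outliers.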
4.1 Analysis and interpretation
The numerical results in Table 3 support an ordering of difficulty between five natural subsets, namely, from easiest to hardest: 300-W Common, 300-W Test, 300-W Challenging, Google, and YouTube. What is more, they lay bare a significant performance gap between the adult sets and the infant sets (which, in line with our expectations from Section 3.2, is more pronounced under the box norm). A comparison of the two sets of cumulative error graphs in Fig. 4 suggests that, in contrast to the adult datasets, the Google and YouTube infant sets perform reasonably well only up to a point, after which the errors of their worst-performing members balloon rapidly off the charts.
A visual perusal of the infant landmark predictions grouped by attribute, as in Fig. 5, hints at a connection between adverse conditions and poorer landmark estimations. This link is corroborated by a comparison of the performance on the Common and Challenging infant sets: the Common infant subset, free of our adverse conditions, has a failure rate of 0.00 at across both models and normalizations, while the Challenging subset struggles.
These considerations expose adverse conditions such as tilt and occlusion as correlates and likely causes of poor facial landmark estimations on infant faces. We believe, though, that such conditions are endemic to infant images captured in the wild and thus, infant-focused algorithms should seek to overcome them. While these adverse attributes appear to dramatically deflate model performance, they are also conditions frequently encountered in computer vision and deep learning, and diverse tools exist to counteract them. Such tools include general techniques such as data augmentation and regularization, computer vision methods for occlusion detection and mitigation, and specific facial processing techniques such as 3D pose modeling [zhu_face_2019].
Coming back to the Common infant subset, we note that although some of its performance metrics are comparable to those on the adult 300-W Challenging set, the latter consists of faces in a range of difficult conditions. The 300-W Common and 300-W Test sets are fairer targets, and there remains a significant gulf in the magnitude of performance metrics on these sets compared to InfAnFace Common. This indicates that there is room for improvement and, in particular, for the fruitful application of domain adaptation techniques from machine learning.
4.2 t-SNE visualization
An elegant visual companion to our analyses can be found in the t-SNE plots in Fig. 2, which offer a glimpse into how the HRNet neural network “perceives” each image relative to one another. Specifically, each infant or adult image is processed by HRNet into a -dimensional vector (prior to landmark estimation), and the set of these representations is compressed by the t-SNE algorithm [van_der_maaten_visualizing_2008] into a set of two-dimensional coordinates for each image, with relative relationships probabilistically preserved. We plot this set of coordinates three times in Fig. 2, with different coloring systems highlighting the set membership and prediction error for each image. (We employed the scikit-learn [scikit-learn] implementation of t-SNE, with perplexity and otherwise default settings.)
Going from left to right, images in the t-SNE representation follow a gradient of increasing landmark estimation error (NME). The 300-W subsets cluster roughly in a manner consistent with this ordering, with 300-W Common gravitating to the left and 300-W Challenging to the right. The infant images fall into two categories: those generally located on the right side of the main cluster, and those residing in the small “satellite” cluster in the top right. Our observations about the poor performance of InfAnFace Challenging are reinforced by the fact that it comprises the entirety of the poorest-performing satellite cluster. The InfAnFace Common cluster is more integrated with the 300-W images, but a notable domain gap remains, particularly in comparison with the more natural 300-W Test and 300-W Common images.
5 Conclusion and future work
We have presented the diverse and richly annotated InfAnFace dataset, filling a significant gap in facial landmark estimation. Our analysis of existing facial landmarking models reveals a performance differential between infant and adult subjects. We characterized this differential as stemming both from a fundamental domain difference between infant and adult faces and from a higher natural occurrence of adverse conditions such as difficult head poses. Correspondingly, we hypothesize that progress can be made in infant facial landmark estimation by (i) applying domain adaptation tools from deep learning to mitigate the domain gap, and (ii) implementing general computer vision and machine learning techniques to mitigate the adverse conditions. We anticipate that such developments will open doors for meaningful medical applications in predictive diagnostics.