Young children have wide-ranging, sophisticated knowledge about the world around them, yet the origin of early knowledge is often unclear. Within the first few months of life, infants show meaningful expectations about objects and agents (Spelke and Kinzler, 2007). Well before learning to speak, infants can discriminate between many common categories; at 3-4 months, infants can discriminate simple shapes (Bomba and Siqueland, 1983) and animal classes (Quinn et al., 1993), preferring to look at exemplars from a novel class (e.g., bird) after observing exemplars from a different class (dogs). How much of this early knowledge can be learned by relatively generic learning architectures receiving sensory data through the eyes of a developing child, and how much of it requires more substantive inductive biases?
This is, of course, a modern reformulation of the age-old nature vs. nurture question that is central in psychology. Answering this question requires both a precise characterization of the sensory data received by humans during development and determining what generic models can learn from this data without assuming strong priors. Although addressing this question in its full generality would require unprecedentedly large and rich datasets and hence still remains out of reach, we can hope to make real progress in more narrowly defined domains, such as the development of visual categories, thanks to new large-scale developmental datasets (Sullivan et al., 2020; Smith and Slone, 2017; Bambach et al., 2018) and the recent progress in deep learning methods.
In this paper, our goal is precisely to achieve such progress by utilizing modern self-supervised deep learning techniques (He et al., 2019; Chen et al., 2020a) and a recent longitudinal egocentric dataset of headcam videos (SAYCam) recorded from the perspective of developing children (Sullivan et al., 2020). The scale and longitudinal nature of this dataset allow us to train a large-scale model “through the eyes” of individual developing children; in this case, based on 150-200 hours of video sampled regularly from 6 months to 32 months of age. Our choice of self-supervised learning avoids extra supervision that a child would not have access to; training only on data from individual children ensures a strict subset of actual developmental experience. We trained self-supervised models on raw unlabeled videos, with the hope of extracting useful high-level visual representations. The acquired visual representations were then evaluated based on their ability to distinguish common visual categories in the child’s environment, using only linear readouts. Our results demonstrate, for the first time, the emergence of powerful, high-level visual representations from natural videos collected from a child’s perspective, using generic self-supervised learning methods. More specifically, we show that these emergent visual representations are powerful enough to support (i) high accuracy in non-trivial visual categorization tasks, (ii) invariance to natural transformations, and (iii) generalization to unseen category exemplars from a handful of training exemplars.
2 Related work
In developmental psychology, there is extensive experimental work on the acquisition of perceptual categories in children. As discussed in the introduction, 3-4-month-old infants can discriminate between many common categories (Bomba and Siqueland, 1983; Quinn et al., 1993), such as dogs vs. birds or triangles vs. squares. It can be hard to know whether infants possess these contrasts before entering the lab (as opposed to acquiring them during the experiment), but other work probes knowledge that is more clearly acquired at home. For example, at 6-9 months, infants already seem to know the meanings of many common nouns referring to food or body-part categories, such as “apple” or “mouth” (Bergelson and Swingley, 2012). Soon after children begin speaking, there is a vocabulary explosion; a six-year-old knows approximately 14000 words, implying they learn about 9 or 10 words a day in early development (Carey and Bartlett, 1978; Bloom, 2002) and suggesting a similarly rapid acquisition of a large amount of categorical knowledge. Although linguistic supervision (e.g. in the form of verbal labels) can guide and sharpen the acquisition of perceptual categories in young infants (Xu et al., 2005), experimental evidence regarding early categorization points to a primarily unsupervised, perceptually driven process (Behl-Chadha, 1996; Quinn, 2002).
Learning useful, high-level representations without explicit labels is also a major goal in machine learning. Unsupervised or self-supervised learning methods have been experiencing a robust revival recently, with state-of-the-art self-supervised methods now rivaling the representational power of supervised learning in downstream tasks (He et al., 2019; Chen et al., 2020b, a). Currently, the most successful approaches to self-supervised learning are based on contrastive learning (Hadsell et al., 2006), where the basic idea is to learn nearby embeddings for semantically similar objects (e.g. images or videos) by pulling their embeddings together, and distant embeddings for semantically dissimilar objects by pushing their embeddings apart. In the absence of explicit labels to determine semantic similarity, one often uses data augmentation to create semantically similar objects, e.g. by applying color distortions to an image (Chen et al., 2020b). Contrastive self-supervised learning has been applied to both images (Oord et al., 2018; Hjelm et al., 2018; He et al., 2019; Chen et al., 2020b, a) and videos (Sermanet et al., 2018; Zhuang et al., 2020; Knights et al., 2020) with promising results. However, these works were primarily motivated by computer vision applications, and did not apply self-supervised learning methods to a developmentally realistic, longitudinal, first-person video dataset. Some relatively large first-person video datasets do exist in the computer vision literature; however, they are either not longitudinal, e.g. Charades-Ego (Sigurdsson et al., 2018), or not developmentally realistic (i.e. not recorded from the perspective of developing children), e.g. KrishnaCam (Singh et al., 2016). Moreover, we are not aware of any prior systematic efforts to apply modern self-supervised learning techniques to such datasets.
Conversely, there are some developmentally realistic, first-person datasets recorded from the perspective of children (Jayaraman et al., 2015; Fausey et al., 2016; Bambach et al., 2018), but these datasets are not longitudinal; instead, they were collected from multiple children in relatively short segments. Hence, unlike the SAYCam dataset (Sullivan et al., 2020) that we use in this paper, they are not ideally suited to addressing the fundamental nature vs. nurture question that we are interested in.
Our main contributions in this paper are as follows:
- We show for the first time that it is possible to learn useful, high-level visual representations from longitudinal, naturalistic video data representative of the visual experiences of developing infants, using state-of-the-art self-supervised learning methods.
- We develop a novel self-supervised learning objective for learning high-level visual representations from video data based on the principle of temporal invariance (Földiák, 1991; Wiskott and Sejnowski, 2002) and show that this objective yields better representations than state-of-the-art image-based and temporal contrastive self-supervised learning objectives on the SAYCam dataset.
- From SAYCam, we curate a large, developmentally realistic dataset of labeled images for evaluating self-supervised models and analyze the learned visual representations.
We use the SAYCam dataset (Sullivan et al., 2020) in this study, hosted on the Databrary repository for behavioral science: https://nyu.databrary.org/. Researchers can apply for access to the dataset, with approval from their institution’s IRB.
This dataset contains approximately 500 hours of longitudinal egocentric audiovisual data from three children: 221 hours from child S, 141 hours from child A, and 137 hours from child Y. The data were collected from head-mounted cameras worn by the children over an approximately two year period (ages 6-32 months) with a frequency of 1-2 hours of recording per week. Figure 1a illustrates the overall structure of the dataset. Although the dataset contains both video and audio data, we only make use of the video component in this paper, as our focus is on studying the development of high-level visual representations. In future work, it would be interesting to consider the potential benefits of the audio data in the development of high-level visual representations. The native spatial resolution of the videos is 640×480 pixels for all babies and the temporal resolution is 30 frames per second for children A and S, and 25 frames per second for baby Y. Before we apply any self-supervised learning algorithms on the data, we first resize the frames (using bicubic interpolation) so that the minor edge is 256 pixels and then take the 224×224 center crop of the frame, shifted by 16 pixels upward, to exclude the time stamps at the bottom of the video frames that exist for a subset of the videos. A comparative dimensionality analysis of the SAYCam dataset is provided in Appendix A1.
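The crop geometry described above can be made concrete with a short sketch. It only computes the resized frame size and the crop box for a given input resolution; the function name, the rounding convention, and the coordinate convention (origin at the top-left, y increasing downward) are assumptions made for illustration and are not taken from the released code.

```python
def resize_and_crop_box(width, height, minor_edge=256, crop=224, shift_up=16):
    """Return the resized frame size and the (left, top, right, bottom) crop box.

    Assumes image coordinates with the origin at the top-left corner, so
    shifting the crop upward means subtracting from its top coordinate.
    """
    # Scale so that the shorter side becomes `minor_edge` pixels.
    scale = minor_edge / min(width, height)
    new_w, new_h = round(width * scale), round(height * scale)

    # Standard center crop, then move the window up by `shift_up` pixels.
    left = (new_w - crop) // 2
    top = (new_h - crop) // 2 - shift_up
    return (new_w, new_h), (left, top, left + crop, top + crop)


size, box = resize_and_crop_box(640, 480)
print(size, box)
```

For a 640×480 frame, this gives a resized frame of 341×256 and a crop box whose top edge sits at row 0, so the bottom 32 rows of the resized frame (where the time stamps appear) are discarded.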
Although only raw videos are required for training, we need annotated data for model evaluation, preferably from the same dataset. Fortunately, data from one baby (child S) comes with rich annotations for 25% of the videos, transcribed by human annotators. Using these annotations, we manually curated a large dataset of labeled frames, containing 58K frames from 26 classes. Further details on how this labeled dataset was created can be found in Appendix A2. Figure 1b shows the 26 classes in the labeled dataset and the number of frames in each class. Example images from the final labeled dataset are shown in Figure 1c.
Our modeling effort evaluates the feasibility of learning high-level visual representations from a subset of an individual child’s experience. The aim is to measure what is learnable, in principle, without necessarily constraining the algorithms to be psychologically plausible. With this aim in mind, we trained deep convolutional networks from scratch, using self-supervised learning algorithms on the headcam videos. After training, we evaluated the self-supervised models on downstream classification tasks with developmentally-relevant categories, freezing the trunk of the model and only training linear readouts from the model’s penultimate, embedding layer. We used the MobileNetV2 architecture in all experiments below due to its favorable efficiency-accuracy trade-off (Sandler et al., 2018). This architecture has an embedding layer of 1280 units. Pre-trained models and training/testing code are available at: https://github.com/eminorhan/baby-vision.
To train models on the headcam video data without using any labels, we adapted a self-supervised learning objective based on the principle of temporal invariance (Földiák, 1991; Wiskott and Sejnowski, 2002). This objective is based on the observation that higher level variables in a visual scene change on a slower time scale than lower level variables, hence a model that learns to be invariant to changes on fast time scales may learn useful high-level features. We implemented this idea with a standard classification set-up, by dividing the entire video dataset into a finite number of temporal classes of equal duration. A schematic illustration of this temporal classification objective is presented in Figure 2. Separate models were trained on data from each child to ensure they capture individual rather than aggregate visual experience. Below, we present results demonstrating the effects of various experimental factors (Figure 5), such as the frame rate at which the videos are sampled, the segment length (i.e. the duration of each temporal class), and data augmentation, on properties of the learned representations. For the remaining results, we always use our best self-supervised model, as measured by classification performance on the downstream classification tasks. Our best model is a temporal classification model that uses a sampling rate of 5 fps (frames per second), a segment length of 288 seconds, and data augmentation in the form of color and grayscale augmentations as in Chen et al. (2020a).
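As a minimal sketch of how the temporal classification labels could be assigned, assume that the sampled frames of one child are indexed consecutively in chronological order and that class boundaries are placed at fixed frame counts. The function names are hypothetical, and how the released code handles boundaries between separate video files is not specified here.

```python
def temporal_class(frame_index, fps=5, segment_seconds=288):
    """Map a chronological frame index to its temporal class label.

    Assumes frames are sampled at `fps` and concatenated in time order,
    so each class covers fps * segment_seconds consecutive frames.
    """
    frames_per_segment = fps * segment_seconds
    return frame_index // frames_per_segment


def num_temporal_classes(total_seconds, segment_seconds=288):
    """Number of classes needed to cover `total_seconds` of video (ceiling division)."""
    return -(-total_seconds // segment_seconds)


# With 5 fps and 288-second segments, each class spans 1440 frames.
print(temporal_class(1439), temporal_class(1440))
# Illustrative only: class count if all 221 hours from child S were used.
print(num_temporal_classes(221 * 3600))
```

Under these assumptions, a longer segment length directly reduces the number of output classes of the classifier, which is the sense in which longer segments correspond to fewer temporal classes.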
Static contrastive learning.
To build a strong purely image-based baseline model that does not make use of any temporal information, we trained models with the momentum contrast (MoCo) objective (He et al., 2019) on the headcam data (now treated as a collection of images with no temporal information). We used the “improved” implementation (V2) of MoCo proposed in Chen et al. (2020b). This objective currently achieves near state-of-the-art results on ImageNet among self-supervised learning methods. The basic idea in contrastive learning is to learn similar embeddings for semantically similar (“positive”) pairs of frames and dissimilar embeddings for semantically dissimilar (“negative”) pairs. We used the PyTorch implementation provided by Chen et al. (2020b) for this model, with the same hyper-parameter choices and data augmentation strategies as in Chen et al. (2020b). Further implementation details can be found in Appendix A3.
Temporal contrastive learning.
We also trained a temporal contrastive learner that takes the temporal relationship between frames into account. This model is similar to the static contrastive learner above, with the difference that each frame’s two immediate neighbors are now treated as positive examples with respect to that frame (temporally non-adjacent frames are still considered as negative pairs, as in the static model). Effectively, this model treats temporal jitter between neighboring frames as another type of data augmentation. A similar temporal contrastive learning model was previously proposed by Knights et al. (2020).
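The positive-pair rule of the temporal contrastive learner can be sketched as follows. This is an illustration of the rule as stated in the text, not the training code: the function names are hypothetical, frames are assumed to be indexed in chronological order within a single video, and frames at the start or end of a video are assumed to have only one neighbor.

```python
def positive_indices(i, num_frames):
    """Indices of the frames treated as positives for frame i.

    Positives are the two immediate temporal neighbors; neighbors that
    would fall outside the video are dropped (an assumption).
    """
    return [j for j in (i - 1, i + 1) if 0 <= j < num_frames]


def is_positive_pair(i, j):
    """True if frames i and j are temporally adjacent, False otherwise."""
    return abs(i - j) == 1


print(positive_indices(5, 10))  # neighbors on both sides
print(positive_indices(0, 10))  # first frame has a single neighbor
```

All remaining frames act as negatives for frame i, exactly as in the static model, so the only change relative to the static learner is where the positives come from.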
In addition to the self-supervised models above, we considered several baseline models as controls: (i) an untrained MobileNetV2 model with random weights, (ii) a MobileNetV2 model pre-trained on ImageNet, (iii) HOG features (histogram of oriented gradients) as a shallow baseline (Dalal and Triggs, 2005). For the HOG model, we used the implementation in skimage.feature. Further details can be found in Appendix A3.
5 Evaluation and analysis of the learned representations
Downstream linear classification tasks.
As our main evaluation metric, we evaluated the classification accuracy of our self-supervised models, as well as the accuracy of the baseline models, on two downstream linear classification tasks: (i) the curated, labeled subset of the annotated data from child S described above (we call this dataset labeled S), and (ii) the Toybox dataset (Wang et al., 2018).
The Toybox dataset is a video dataset consisting of 12 object categories (airplane, ball, car, cat, cup, duck, giraffe, helicopter, horse, mug, spoon, truck), with 30 different exemplars in each category, each undergoing 10 different transformations, such as translations and rotations, for approximately 20 seconds each (plus a brief canonical shot of the object). When the videos are sampled at 1 fps, the entire dataset contains 7K frames from each of the 12 categories for a total of 84K frames (example images from the dataset are shown in Appendix A4). We chose the Toybox dataset because it contains developmentally realistic objects (toys) belonging to 12 early-learned basic-level categories in child development (Wang et al., 2018). We reasoned that this dataset would thus pose less of a distribution shift problem for our models trained on infant headcam videos, compared to a more complex and diverse dataset such as ImageNet, which is developmentally less realistic (although we do provide results on ImageNet in Appendix A5). Another advantage of the Toybox dataset is that it allows us to evaluate the robustness of self-supervised models to a variety of natural transformations.
Because both the labeled S and the Toybox datasets are video datasets originally, they contain temporal correlations between nearby frames. This raises the potential concern that with random iid train-test splits, even models employing relatively low-level strategies might be able to perform well by exploiting these temporal correlations. To address this concern, in addition to evaluating the performance of simple baseline models like the shallow HOG model and the random MobileNetV2 model on these datasets, we also introduced more challenging train-test splits for both datasets. For the labeled S dataset, we reduced the temporal correlations between train and test data by subsampling the entire dataset by a factor of 10 (i.e. an effective frame rate of 0.1 fps). For the Toybox dataset, we introduced an exemplar split to examine generalization to novel category exemplars, using 90% of the exemplars for training and 10% for testing (i.e. 27 vs. 3 exemplars from each class).
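The two harder splits can be sketched in a few lines. This is a hedged illustration rather than the evaluation code: the function names, the fixed random seed, and the use of integer exemplar identifiers are assumptions, and the split is applied independently within each class.

```python
import random


def exemplar_split(exemplar_ids, test_fraction=0.1, seed=0):
    """Split the exemplars of one class into disjoint train and test sets.

    All frames of a held-out exemplar go to the test set, so no test
    object is ever seen during training of the linear readout.
    """
    ids = sorted(exemplar_ids)
    random.Random(seed).shuffle(ids)
    n_test = round(len(ids) * test_fraction)
    return ids[n_test:], ids[:n_test]


def subsample(frames, factor=10):
    """Keep every `factor`-th frame to reduce temporal correlations."""
    return frames[::factor]


train_ids, test_ids = exemplar_split(range(30))
print(len(train_ids), len(test_ids))
print(subsample(list(range(100)))[:3])
```

With 30 exemplars per class and a 10% test fraction, this yields the 27 vs. 3 split described above; subsampling frames labeled at 1 fps by a factor of 10 corresponds to an effective rate of 0.1 fps.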
Figure 3 shows the top-1 classification accuracy of all models on the linear classification tasks for both random iid splits (with 50% training-50% test data) and the more challenging splits discussed above. We observe that the self-supervised models with the temporal classification objective perform well in all cases, sometimes even outperforming the strong ImageNet-trained baseline model. Of particular note is the fact that temporal classification models trained on data from children A and Y are able to generalize well to labeled data from baby S. The temporal classification model outperformed the contrastive self-supervised models in all conditions. The random MobileNetV2 model and the shallow HOG model generally perform poorly compared to the other models, suggesting that learning and sufficiently deep models are necessary for these tasks. The self-supervised models continue to perform reasonably well on the more challenging splits, suggesting that their performance cannot be explained in terms of simple, low-level strategies.
To give a more intuitive sense of the representational power of the self-supervised models, we also set up four challenging but natural binary classification tasks from the labeled S dataset: car vs. road, door vs. window, foot vs. hand, chair vs. table. These tasks are challenging because the two classes in each task are semantically related and co-occur in many images, creating ambiguities. Figure 4a shows that the self-supervised TC-S model achieves relatively high accuracy in all tasks (comparable to the ImageNet-trained model) and more importantly the spatial attention maps verify that the model’s choices seem to be based on the correct regions in ambiguous images (Figure 4b), suggesting that the model does not exploit possible spurious correlations to perform well in these tasks (see below for a description of how these spatial attention maps were computed): e.g. in Figure 4b, note how the model attends to the upper part of the image with the red socks in the third row for a foot choice, and the high-chair on the left in the fourth row for a chair choice.
The effect of various experimental factors on downstream linear classification accuracy.
Figure 5 shows the effects of sampling rate, segment length, and data augmentation on classification accuracy in the downstream labeled S task (random iid split condition). Higher sampling rates, longer segments (i.e. fewer temporal classes), and using data augmentation all improve classification accuracy in the downstream task. Increasing the sampling rate can be seen as a form of data augmentation, and the effect of these two factors is intuitively clear: data with more variation enables the learning of stronger features. The dependence on segment length, on the other hand, is a function of the average time scale over which semantically meaningful changes take place in the videos. It is interesting to note that the optimal time scale appears to be fairly long (5 minutes).
Spatial attention maps.
To better understand what kind of features the self-supervised models rely on to make their decisions in the downstream classification tasks, we computed spatial attention maps from the final spatial layer of the model (features or feats18 for short). This layer is a 1280×7×7 layer in MobileNetV2 and there is a single global spatial averaging layer between this layer and the output layer. After fitting a linear classifier on top of our best self-supervised model, for each output class in the labeled S dataset, we created a composite attention map from the feats18 layer by taking a linear sum of all 1280 spatial maps in that layer, where the weights in the linear combination were determined by the output weights for the corresponding class (see Appendix A6 for further details).
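The composite map described above is a weighted sum of spatial feature maps, with one weight per feature taken from the linear readout for the class of interest. The sketch below illustrates this on a toy example using plain Python lists to stay self-contained; in practice one would use a tensor library and 1280 maps of size 7×7. The function name is hypothetical, and any additional processing described in Appendix A6 is not reproduced here.

```python
def composite_attention_map(feature_maps, class_weights):
    """Weighted sum of K spatial maps (each H x W) using K class weights.

    feature_maps:  list of K maps, each a list of H rows with W values.
    class_weights: list of K readout weights for a single output class.
    """
    height, width = len(feature_maps[0]), len(feature_maps[0][0])
    composite = [[0.0] * width for _ in range(height)]
    for fmap, weight in zip(feature_maps, class_weights):
        for y in range(height):
            for x in range(width):
                composite[y][x] += weight * fmap[y][x]
    return composite


# Toy example with K=2 features on a 2x2 spatial grid.
maps = [[[1.0, 0.0], [0.0, 0.0]],
        [[0.0, 0.0], [0.0, 2.0]]]
print(composite_attention_map(maps, [0.5, -1.0]))
```

Because only a global spatial average separates this layer from the output layer, the spatial mean of the composite map equals the class logit up to the bias term, which is why high values in the map can be read as locations that support the class decision.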
We then upsampled these composite attention maps and multiplied them with the input image, creating image masks that show where the composite map was most activated. Figure 6 shows examples of masked images for the cat class with both actual cat images (a) and non-cat images (b). These images suggest that the model, in general, attends to the correct regions in cat images, but the spatial extent of attention is usually larger than the cat itself, suggesting possible reliance on contextual cues as well. For the non-cat images, the composite map is usually silent, as would be expected from a successful classifier. But there are occasional regions of high activation even in these images, suggesting that the composite maps—and hence the outputs themselves—are not purely class selective.
Analysis of single feature selectivity.
To measure how distributed vs. localized the representation of class information is, we quantified the class selectivity of individual features in the model by the following class selectivity index (CSI) (Morcos et al., 2018; Leavitt and Morcos, 2020):
$$\mathrm{CSI} = \frac{\mu_{\max} - \mu_{-\max}}{\mu_{\max} + \mu_{-\max}}$$
where $\mu_{\max}$ denotes the average response of the feature to its most activating class and $\mu_{-\max}$ denotes its average response to the remaining classes. In computing the average responses, we averaged across the spatial dimensions to obtain a single value per feature per image. The CSI ranges from 0 (for features completely agnostic between classes) to 1 (for features perfectly selective for a single class).
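For concreteness, the class selectivity index of a single feature can be computed from its class-conditional mean responses as the difference between the mean response to the most activating class and the mean response to the remaining classes, divided by their sum (Morcos et al., 2018). The sketch below assumes non-negative responses, which is what keeps the index between 0 and 1; the function name and the handling of an all-zero feature are illustrative choices.

```python
def class_selectivity_index(class_means):
    """CSI of one feature from its per-class mean responses.

    class_means: list with one non-negative mean response per class
    (already averaged over images and spatial positions).
    """
    mu_max = max(class_means)
    k = class_means.index(mu_max)
    rest = class_means[:k] + class_means[k + 1:]
    mu_rest = sum(rest) / len(rest)
    denom = mu_max + mu_rest
    # A feature that never responds is treated as unselective (assumption).
    return (mu_max - mu_rest) / denom if denom else 0.0


print(class_selectivity_index([1.0, 1.0, 1.0]))  # class-agnostic feature
print(class_selectivity_index([2.0, 0.0, 0.0]))  # perfectly selective feature
print(class_selectivity_index([3.0, 1.0, 1.0]))  # intermediate case
```

The three calls illustrate the two endpoints of the range and an intermediate value of 0.5.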
Figure 7a shows the distribution of CSIs of individual features in different layers of our best self-supervised model. Single features were generally not very selective for individual classes, pointing to a more distributed representation of class information. The layers close to the output, in general, had higher CSIs, consistent with a similar observation made in Leavitt and Morcos (2020) for supervised models. Figure 7b shows 10 highly activating images from the labeled S dataset for 3 example features with high CSIs, selective for the carseat, computer, and floor classes, respectively. The example features shown in this figure are from the highest spatial layer of the network (feats18). More examples are presented in Appendix A7.
To our knowledge, our work is the first to systematically explore what can be learned from the naturalistic visual experience that children receive during their development. However, it has several limitations that are important to keep in mind and that would be worthwhile to address in future work.
First, although SAYCam offers an unprecedented look at the experience of individual children, the training videos are a very small fraction of a child’s total visual experience, equivalent to 1 week of visual experience (well-distributed over two years of development). Scaling this up to better approximate a 2.5 year old’s experience would require roughly two orders of magnitude more data. Recent results in machine learning suggest that increases in data size on this scale can lead to very large qualitative improvements in model behavior (Halevy et al., 2009; Orhan, 2019; Xie et al., 2019; Brown et al., 2020).
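The scale gap mentioned above can be checked with a rough back-of-the-envelope calculation. The figure of 12 waking hours per day is an assumption introduced here purely for illustration (it is not stated in the text), and the function name is hypothetical.

```python
HOURS_PER_WEEK = 7 * 24  # 168 hours, comparable to 150-200 hours of training video


def experience_ratio(training_hours, age_years=2.5, waking_hours_per_day=12):
    """Ratio of total waking visual experience to the training video duration.

    waking_hours_per_day is an illustrative assumption, not a measured value.
    """
    total_hours = age_years * 365 * waking_hours_per_day
    return total_hours / training_hours


print(HOURS_PER_WEEK)
print(experience_ratio(150), experience_ratio(200))
```

Under this assumption, a 2.5-year-old has accumulated on the order of 11,000 waking hours, which is roughly 55 to 73 times the 150-200 hours of training video, i.e. a gap approaching two orders of magnitude.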
Second, the dataset used in this study includes videos only, hence it ignores the embodied aspect of visual development in children. Humans (and other animals) control their bodies to select the visual experiences they receive. This creates a rich array of sensorimotor inputs that likely help the observer better factorize the sources of variation in their visual experiences. Visual development also has a significant haptic component in animals, especially in dexterous animals like primates, as they can haptically explore the objects around them, which gives them high-quality information about the shapes of objects, for example. Recent computational studies suggest that taking this embodied perspective into account can improve representation learning both in terms of learning speed and in terms of generalization capacity (Jacobs and Xu, 2019; Hill et al., 2019).
Third, and related to the previous point, we also ignored the multimodal nature of cognitive development. Particularly relevant for the development of visual categories is word learning in children. Experimental studies in developmental psychology show that learning object names can change the visual features children use for word learning (Smith et al., 2002; Gershkoff-Stowe and Smith, 2004). We plan to address the role of language through the auditory component of SAYCam in future work.
Fourth, our best self-supervised models currently require unrealistic data augmentation strategies, such as color distortions and grayscaling. It remains to be seen whether equally powerful models can be learned without such unrealistic data augmentation strategies.
In this work, we took a first step toward rigorously addressing a fundamental nature vs. nurture question regarding the acquisition of basic visual categories in developing children: can these visual categories be learned through generic learning mechanisms or do they require more substantive inductive biases? By applying modern self-supervised learning algorithms to a strict subset of the visual experiences of individual developing children, we demonstrated the emergence of powerful high-level visual representations, underscoring the power of generic learning mechanisms. Our analysis suggests that although these representations do not strictly correspond to abstract categories (Figure 6), they are abstract enough to support (i) high accuracy in non-trivial downstream categorization tasks (Figure 3), (ii) invariance to natural transformations (random conditions in Figure 3) and (iii) generalization to unseen category exemplars (Figure 3b; exemplar split). It still remains open how far we can push generic learning mechanisms, through even larger and richer data sources, to learn mental representations ever closer to those acquired by children early in their development.
We are very grateful to the volunteers who contributed recordings to the SAYCam dataset (Sullivan et al., 2020). We thank Jessica Sullivan for her generous assistance with the dataset. This work was partly funded by NSF Award 1922658 NRT-HDR: FUTURE Foundations, Translation, and Responsibility for Data Science.
- Bambach et al. (2018). Toddler-Inspired Visual Object Learning. Advances in Neural Information Processing Systems.
- Behl-Chadha (1996). Basic-level and superordinate-like categorical representations in early infancy. Cognition 60 (2), pp. 105–141.
- Bergelson and Swingley (2012). At 6-9 months, human infants know the meanings of many common nouns. Proceedings of the National Academy of Sciences 109 (9), pp. 3253–8.
- Bloom (2002). How children learn the meanings of words. MIT Press.
- Bomba and Siqueland (1983). The nature and structure of infant form categories. Journal of Experimental Child Psychology 35 (2), pp. 294–328.
- Brown et al. (2020). Language models are few-shot learners. arXiv preprint arXiv:2005.14165.
- Carey and Bartlett (1978). Acquiring a single new word. In Proceedings of the Stanford Child Language Conference, pp. 17–29.
- Chen et al. (2020a). A simple framework for contrastive learning of visual representations. arXiv preprint arXiv:2002.05709.
- Chen et al. (2020b). Improved baselines with momentum contrastive learning. arXiv preprint arXiv:2003.04297.
- Dalal and Triggs (2005). Histograms of oriented gradients for human detection. In 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05), Vol. 1, pp. 886–893.
- Fausey et al. (2016). From faces to hands: changing visual input in the first two years. Cognition 152, pp. 101–107.
- Földiák (1991). Learning invariance from transformation sequences. Neural Computation 3 (2), pp. 194–200.
- Gershkoff-Stowe and Smith (2004). Shape and the first hundred nouns. Child Development 75 (4), pp. 1098–1114.
- Hadsell et al. (2006). Dimensionality reduction by learning an invariant mapping. In 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’06), Vol. 2, pp. 1735–1742.
- Halevy et al. (2009). The unreasonable effectiveness of data. IEEE Intelligent Systems 24 (2), pp. 8–12.
- He et al. (2019). Momentum contrast for unsupervised visual representation learning. arXiv preprint arXiv:1911.05722.
- Hill et al. (2019). Emergent systematic generalization in a situated agent. arXiv preprint arXiv:1910.00571.
- Hjelm et al. (2018). Learning deep representations by mutual information estimation and maximization. arXiv preprint arXiv:1808.06670.
- Jacobs and Xu (2019). Can multisensory training aid visual learning? A computational investigation. Journal of Vision 19 (11), pp. 1–1.
- Jayaraman et al. (2015). The faces in infant-perspective scenes change over the first year of life. PLoS ONE 10 (5).
- Knights et al. (2020). Temporally coherent embeddings for self-supervised video representation learning. arXiv preprint arXiv:2004.02753.
- Leavitt and Morcos (2020). Selectivity considered harmful: evaluating the causal impact of class selectivity in DNNs. arXiv preprint arXiv:2003.01262.
- Mahajan et al. (2018). Exploring the limits of weakly supervised pretraining. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 181–196.
- Morcos et al. (2018). On the importance of single directions for generalization. In International Conference on Learning Representations.
- Oord et al. (2018). Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748.
- Orhan (2019). Robustness properties of Facebook’s ResNeXt WSL models. arXiv preprint arXiv:1907.07640.
- Quinn et al. (1993). Evidence for representations of perceptually similar natural categories by 3-month-old and 4-month-old infants. Perception 22 (4), pp. 463–475.
- Quinn (2002). Category representation in young infants. Current Directions in Psychological Science 11 (2), pp. 66–70.
- Sandler et al. (2018). MobileNetV2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4510–4520.
- Sermanet et al. (2018). Time-contrastive networks: self-supervised learning from video. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pp. 1134–1141.
- Sigurdsson et al. (2018). Actor and observer: Joint modeling of first and third-person videos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7396–7404.
- Singh et al. (2016). KrishnaCam: Using a longitudinal, single-person, egocentric dataset for scene understanding tasks. In 2016 IEEE Winter Conference on Applications of Computer Vision (WACV), pp. 1–9.
- Smith et al. (2002). Object name learning provides on-the-job training for attention. Psychological Science 13 (1), pp. 13–19.
- Smith and Slone (2017). A developmental approach to machine learning? Frontiers in Psychology 8.
- Spelke and Kinzler (2007). Core knowledge. Developmental Science 10 (1), pp. 89–96.
- Sullivan et al. (2020). SAYCam: A large, longitudinal audiovisual dataset recorded from the infant’s perspective. PsyArXiv. https://doi.org/10.31234/osf.io/fy8zx
- Wang et al. (2018). Seeing neural networks through a box of toys: the Toybox dataset of visual object transformations. arXiv preprint arXiv:1806.06034.
- Wiskott and Sejnowski (2002). Slow feature analysis: unsupervised learning of invariances. Neural Computation 14 (4), pp. 715–770.
- Xie et al. (2019). Self-training with noisy student improves imagenet classification. arXiv preprint arXiv:1911.04252.
- Xu et al. (2005). Labeling guides object individuation in 12-month-old infants. Psychological Science 16 (5), pp. 372–377.
- Zhuang et al. (2020). Unsupervised learning from video with deep neural embeddings. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 9563–9572.
Appendix A1 Comparative dimensionality analysis
Figure A1 shows the variance explained in same-sized subsets of 4 different image and video datasets: ImageNet, the headcam videos from child S, and the matched first-person and third-person videos from the Charades-Ego dataset (Sigurdsson et al., 2018). The images or video frames in each dataset were first passed through the largest ResNeXt WSL model with an embedding layer of size 2048 (Mahajan et al., 2018). We then performed PCA on the embeddings from each dataset, looking at the variance explained as a function of the number of retained dimensions. The video datasets were sampled at 1 fps. Figure A1 shows that, as expected, the video datasets (first-person and third-person videos from the Charades-Ego dataset and the headcam videos from child S) have lower information content than ImageNet, due to temporal correlations in the video datasets. The first-person video datasets (first-person Charades-Ego and the headcam data from child S) have slightly higher information content than the third-person Charades-Ego dataset, presumably because of the higher degree of variability due to natural distortions and perturbations in these first-person videos.
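The variance-explained curves described in this appendix can be computed along the following lines. This is a minimal sketch rather than the code used for the paper: the function name `explained_variance_curve` and the synthetic low-rank data are illustrative assumptions, standing in for the 2048-dimensional ResNeXt WSL embeddings.

```python
import numpy as np


def explained_variance_curve(embeddings):
    """Cumulative fraction of variance explained by the top-k principal components.

    `embeddings` has shape (n_samples, n_dims); entry k-1 of the returned
    array is the fraction of variance retained by the first k components.
    """
    x = np.asarray(embeddings, dtype=np.float64)
    x = x - x.mean(axis=0, keepdims=True)       # center each dimension
    s = np.linalg.svd(x, compute_uv=False)      # singular values, descending
    var = s ** 2                                # proportional to PCA eigenvalues
    return np.cumsum(var) / var.sum()


# Synthetic stand-in for real embeddings (assumption): 500 samples in 64
# dimensions that are close to rank 8, mimicking a highly correlated dataset.
rng = np.random.default_rng(0)
latent = rng.normal(size=(500, 8))
mixing = rng.normal(size=(8, 64))
toy = latent @ mixing + 0.01 * rng.normal(size=(500, 64))

curve = explained_variance_curve(toy)
print(f"variance explained by 8 components: {curve[7]:.4f}")
```

Under this reading, a dataset whose curve saturates with fewer retained dimensions has lower information content, which is the sense in which the video datasets are compared with ImageNet.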
Appendix A2 Curation process for the labeled S dataset
As mentioned in the main text, the headcam data from one of the infants (child S) comes with rich annotations for 25% of the videos, transcribed by human annotators. Using these annotations, we manually curated a large dataset of labeled frames, containing 58K frames from 26 classes. The annotations include information such as the objects being looked at by the child, the objects being touched by the child, and the objects being referred to, as well as the utterances made, together with approximate time stamps. We used the "objects being looked at by the child" field to assemble a large collection of labeled frames for evaluation purposes. This field often includes multiple labels for each cell (a cell is the collection of frames between two consecutive time stamps). We only considered the first label in each cell and performed basic string processing operations to reduce the redundancies in the labels due to annotation inconsistencies (e.g., capitalization, typos, synonymous labels, etc.). This reduced the final number of unique labels to 414. We used these labels and the time stamps provided in the annotations to label individual frames in the videos, where we sampled the frames at 1 fps (frames per second).
For evaluation purposes, we further modified this noisy labeled dataset as follows. To obtain a dataset with a sufficiently large number of frames from each class, we restricted ourselves to the top 30 classes containing the largest number of frames. To obtain a balanced dataset, we then removed the top two classes (mom and book), which contained significantly more frames than the remaining classes. For the remaining classes, to make sure that the labels are clean enough for evaluation, we manually went through each of them, removing frames or changing their labels as necessary. We finally removed any classes that contained fewer than 100 frames. This yielded a labeled dataset containing a total of 58K frames from 26 classes. The final classes and the number of frames in each class are shown in Figure 1b in the main text.
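The automatic part of this curation process can be sketched as follows. This is an illustrative reconstruction under stated assumptions, not the actual curation script: the comma separator between labels in a cell, the `SYNONYMS` table, and the function names are all hypothetical, and the manual cleaning pass that precedes the final size filter is not modeled.

```python
from collections import Counter

# Hypothetical synonym table; the real mapping was built by hand.
SYNONYMS = {"mommy": "mom", "kitty": "cat"}


def normalize_label(cell):
    """Keep the first label of a cell, lowercase it, and merge synonyms."""
    first = cell.split(",")[0].strip().lower()   # assumes comma-separated labels
    return SYNONYMS.get(first, first)


def curate(frame_cells, top_k=30, drop_top=2, min_frames=100):
    """Return {class: frame count} after the frequency-based filtering steps."""
    labels = [lab for lab in map(normalize_label, frame_cells) if lab]
    counts = Counter(labels)
    ranked = [name for name, _ in counts.most_common(top_k)]  # top-k classes
    kept = ranked[drop_top:]                                  # drop dominant classes
    return {name: counts[name] for name in kept if counts[name] >= min_frames}


# Toy per-frame annotations (one entry per frame sampled at 1 fps).
frame_cells = (
    ["Mom, book"] * 500 + ["book"] * 400 + ["Cat"] * 150
    + ["kitty, toy"] * 50 + ["ball"] * 120 + ["spoon"] * 30
)
curated = curate(frame_cells)
print(curated)
```

In this toy example the two dominant classes (mom and book) are removed for balance, and spoon is dropped for having fewer than 100 frames.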
Appendix A3 Model implementation details
We trained the temporal classification models with the Adam optimizer with learning rate and a batch size of
(maximum batch size we could fit into 4 GPUs using data parallelism). Models trained with 1 fps data were trained for 20 epochs and models trained with 5 fps data were trained for 6 epochs. Final top-1 training accuracy in the temporal classification task was always in the 80-85% range. Before feeding the frames into the model, we always applied the standard ImageNet normalization step (see below). In addition, in the data augmentation conditions, we also applied the probabilistic color jittering and grayscaling transformations from Chen et al. (2020a):
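transforms.Compose([transforms.RandomApply([transforms.ColorJitter(0.8, 0.8, 0.8, 0.2)], p=0.8),
transforms.RandomGrayscale(p=0.2),  # grayscaling step; p=0.2 assumed, following Chen et al. (2020a)
transforms.ToTensor(),
transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])])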
For the static and temporal contrastive learning models, we used the PyTorch implementation of MoCo-V2 provided by Chen et al. (2020b) as is, with the same hyper-parameter choices and data augmentation strategies (code available at: https://github.com/facebookresearch/moco). We trained the models for 6 epochs with headcam video frames sampled at 5 fps. The learning rate was reduced by a factor of 10 in the final epoch.
For the histogram of oriented gradients (HOG) model, we used the implementation provided in scikit-image (skimage.feature) with the following arguments: orientations=9, pixels_per_cell=(16, 16), cells_per_block=(3, 3), block_norm='L2', visualize=False, transform_sqrt=False, feature_vector=True, multichannel=True. To fit linear classifiers on top of these features, we used the SGD classifier in scikit-learn (sklearn.linear_model), SGDClassifier, with the following arguments: loss="hinge", penalty="l2", alpha=0.0001, max_iter=10.
Appendix A4 Example images from the Toybox dataset
Appendix A5 Linear classification results on ImageNet
Although the ImageNet dataset poses a significant distribution shift challenge for models trained on the infant headcam videos, we still evaluated the performance of linear classifiers trained on top of our self-supervised models and obtained the following top-1 accuracies on the ImageNet validation set: TC-S: 20.9%, TC-A: 18.1%, TC-Y: 17.6%, MoCo-V2-S: 16.4%. We also observed that it was possible to achieve close to 25% top-1 accuracy with a temporal classification model trained on data from all three children. For comparison, a linear classifier trained on top of a random, untrained MobileNetV2 model (RandomNet) yields a top-1 accuracy of 1.2%. For these ImageNet results, we trained the linear classifiers for 20 epochs with the Adam optimizer using a learning rate of 0.0005 and a batch size of 1024. The training and validation images from ImageNet were subjected to the standard ImageNet pre-processing pipeline.
Appendix A6 Spatial attention maps
The spatial attention maps shown in Figure 4b and Figure 6 in the main text were generated from the final spatial layer of the network. This layer is a 1280×7×7 layer in MobileNetV2. There is a single global spatial averaging layer between this layer and the output layer. After fitting a linear classifier on top of our best self-supervised model, for each output class in the dataset, we created a composite attention map by taking a linear sum of all 1280 7×7 spatial maps, where the weights in the linear combination were determined by the output weights from the corresponding feature to the output node. This results in a single 7×7 spatial map for each image. We then upsampled this 7×7 map to the image size (224×224) using bicubic interpolation, divided each pixel value by the standard deviation across all pixels, multiplied the entire map by 10 to amplify it and finally passed it through a pixelwise sigmoid non-linearity, i.e., m <- sigmoid(10.0*m/std(m)). We then multiplied the attention map with the presented image pixel-by-pixel to obtain the masked images shown in Figure 4b and Figure 6 in the main text. Figure A3 below shows further examples of spatial-attention-multiplied images for the computer class.
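The attention map computation described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code used to produce the figures: the function name `attention_mask` and the random inputs are hypothetical, and `scipy.ndimage.zoom` with `order=3` (cubic spline interpolation) is used as a stand-in for bicubic upsampling.

```python
import numpy as np
from scipy.ndimage import zoom
from scipy.special import expit  # numerically stable sigmoid


def attention_mask(feature_maps, class_weights, image_size=224):
    """Composite attention map for one class.

    feature_maps:  (C, H, W) activations of the final spatial layer.
    class_weights: (C,) linear readout weights for the chosen class.
    """
    # Weighted sum over channels gives a single (H, W) map.
    composite = np.tensordot(class_weights, feature_maps, axes=1)
    # Upsample to the image size (cubic spline stand-in for bicubic).
    factor = image_size / composite.shape[0]
    m = zoom(composite, factor, order=3)
    # m <- sigmoid(10.0 * m / std(m))
    return expit(10.0 * m / m.std())


# Random stand-ins (assumption) for real activations, weights, and image.
rng = np.random.default_rng(0)
feats = rng.random((1280, 7, 7))
weights = rng.normal(size=1280)
image = rng.random((224, 224, 3))

mask = attention_mask(feats, weights)
masked = image * mask[..., None]  # pixel-by-pixel multiplication per channel
print(mask.shape, masked.shape)
```

Because there is only a global average pooling layer between the final spatial layer and the readout, the channel-weighted sum indicates which spatial locations contribute most to the score of the chosen class.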