Seeing Neural Networks Through a Box of Toys: The Toybox Dataset of Visual Object Transformations

06/15/2018 ∙ by Xiaohan Wang, et al.

Deep convolutional neural networks (CNNs) have enjoyed tremendous success in computer vision in the past several years, particularly for visual object recognition. However, how CNNs work remains poorly understood, and the training of deep CNNs is still considered more art than science. To better characterize deep CNNs and the training process, we introduce a new video dataset called Toybox. Images in Toybox come from first-person, wearable camera recordings of common household objects and toys being manually manipulated to undergo structured transformations like rotations and translations. We also present results from initial experiments using deep CNNs that begin to examine how different distributions of training data can affect visual object recognition performance, and how visual object concepts are represented within a trained network.





Many recent breakthroughs in computer vision, such as for the problem of visual object recognition, have been driven by the creation and use of large-scale labeled image datasets collected from the Internet, with ImageNet being a canonical example [Deng et al.2009]. In a sense, these datasets have coevolved with the algorithms that learn so successfully from them—researchers continue to develop algorithms that work better and better on the types of data distributions found on the Internet, and also continue to collect ever larger and more densely labeled datasets using similar online data collection methods.

While the scientific advances and practical applications generated by these efforts have undoubtedly transformed the landscape of AI and machine learning, there are still many fundamental open research questions about how (and how well) intelligent agents can learn to recognize objects under very different types of training regimens. In addition, there are many problems going beyond recognition that may require richer visual experiences with objects than are typically gathered online, for instance problems of visual commonsense reasoning or mental simulation. Two areas that will especially benefit from studying such questions are:

Figure 1: Toybox examples. (Color adjusted for PDF view.)

1. Research in AI + cognitive science to study the development of human object recognition.

Research in developmental psychology has produced many interesting findings about the number and types of object instances children experience while learning category concepts. In a fascinating diary study of her infant son, Mervis [Mervis1987] observed that his initial concept of “duck” was likely based on seeing live ducks at a nearby park, his plush duck toy, a plastic duck rattle, a rubber duck, and a handful of other odd, duck-themed household objects, toys, and picture books, and was gradually generalized and pruned over time.

Recent wearable-camera-based infant studies have found similar uneven distributions of experience across object instances, categories, and viewpoints, and these distributions also change with an infant’s age [James et al.2014, Smith et al.2018]. It is likely not the case that infants learn despite these irregularities, but rather that infant learning leverages these distributional properties, e.g., through bootstrapping, curriculum learning, etc. In other words, the infant’s “learning algorithm” has coevolved alongside the neural, sensorimotor, cognitive, and sociocultural factors that combine to create an infant’s visual world. Thus, AI research that explores interactions between training distributions and learning algorithms is poised to play a critical role in the cognitive science of visual learning, including for understanding the effects of technology (e.g., advent of print media, television, and now Internet) on child development.

2. Research in AI + robotics to advance the learning capabilities of embodied agents.

Robots or other physically embodied agents (e.g., stationary cameras) may have access to online information but will likely also need to be able to learn from their immediate environment. For example, a household robot may need to learn about specific objects in a person’s kitchen through a series of naturalistic visuomotor interactions that generate complex, unevenly distributed, and heavily occluded object views. What kinds of learning algorithms would be capable of learning from such data? One-shot learning, active recognition, and various forms of active learning are all relevant to this question, and will benefit from the creation of richer visual datasets.

The Toybox dataset. Here, we present a new dataset called Toybox (see example images in Figure 1) designed to facilitate computational experiments on visual object recognition and related vision problems, especially in the context of studying aspects of embodied (e.g., human or robot) visual object experience, including: (1) continuously sampled views of objects undergoing several different types of transformations, including rotation, translation, and zooming; (2) an egocentric perspective (i.e., handheld, first-person views), which means that objects are held in naturalistic grips and thus are always partially occluded; (3) a range of everyday categories, including household objects, animals, and vehicles; (4) a diversity of object instances, with 30 distinct physical objects representing each category.

We also present two examples of the kinds of studies that Toybox is designed to support: one experiment to study the effects of instance and viewpoint diversity on recognition performance, and a second experiment to investigate how hidden layer neurons in a trained neural network respond to continuous variations in object pose.

Multi-View Object Recognition Datasets

As described above, ImageNet and many other widely used, “Google Image Search”-type vision datasets contain only one image per real-world object [Deng et al.2009]. In addition, object viewpoints are constrained by the fact that most online images are created by adult humans using handheld camera devices [Torralba and Efros2011].

Figure 2: Viewpoint distributions of ImageNet categories, from the ObjectNet3D dataset [Xiang et al.2016].

Providing an interesting demonstration of this viewpoint bias, the ObjectNet3D dataset contains images from ImageNet annotated with the 3D pose of pictured objects [Xiang et al.2016]. As shown in Figure 2, different categories show very different viewpoint distributions, based on how people (adults) tend to encounter certain objects in everyday life.

Taking a complementary approach, several vision datasets have been created to provide multiple views of the same physical object, as reviewed in Table 1. These datasets are typically of two types: (1) discrete but structured object viewpoints are collected with the help of a turntable, e.g., the NORB, RGB-D, and iLab-20M datasets [LeCun, Huang, and Bottou2004, Lai et al.2011, Borji, Izadi, and Itti2016]; or (2) continuous but unstructured object viewpoints are collected using human handheld object and/or camera manipulations, e.g., the Intel Egocentric and CORe50 datasets [Ren and Philipose2009, Lomonaco and Maltoni2017].

Figure 3: Comparison of viewpoint distributions across several multi-view datasets. (Toybox off-axis viewpoints are estimations only and will vary from object to object.)

We designed the Toybox dataset to capture the advantages of both approaches. Toybox contains egocentric-perspective videos of the camera-wearer holding and manipulating various objects in structured ways, e.g. completing two full revolutions of an object along a specified axis of rotation at a (roughly) constant speed, among other types of transformations. We also include a period of unstructured manipulation to capture a random assortment of off-axis views. (Details about the dataset and collection methods are given below.)

Figure 3 shows the viewpoint distributions provided by several existing multi-view datasets, as well as by the Toybox dataset. As can be seen in this figure, most of the existing multi-view datasets use turntables, and thus no bottom-facing views of objects are available. Toybox aims to provide a more complete set of object views. We do not know how valuable such views might (or might not) be, but at least having the data available will enable experiments to study viewpoint distributions and visual learning in more detail.

Dataset | Reference | Categories | Objs/Cat | Viewpoints/Obj | Other Variants | Imgs/Obj | Total Imgs
COIL-100 | [Nene et al.1996] | 100 | 1 | 72 | n/a | 72 | 7,200
SOIL-47 | [Burianek et al.2000] | 47 | 1 | 21 | lighting | 42 | 1,974
NORB | [LeCun, Huang, and Bottou2004] | 5 | 10 | 648 | lighting | 3,888 | 194,400
ALOI | [Geusebroek and others2005] | 1000 | 1 | 72 | lighting | 111 | 110,250
3D Objects on Turntable | [Moreels and Perona2007] | 100 | 1 | 144 | lighting | 432 | 43,200
3D Object | [Savarese and Fei-Fei2007] | 8 | 10 | 24 | zooming | 72 | 7,000
Intel Egocentric | [Ren and Philipose2009] | 42 | 1 | various | background, manual activity | 1,600 | 70,000
EPFL-GIMS08 | [Ozuysal, Lepetit, and P.Fua2009] | 1 | 20 | 120 | n/a | 120 | 2,299
RGB-D | [Lai et al.2011] | 51 | 3-14 | 750 | camera resolution | 750 | 250,000
BigBIRD | [Singh et al.2014] | 100 | 1 | 600 | n/a | 600 | 60,000
iCubWorld-Trfms. | [Pasquale et al.2016] | 20 | 10 | 150-200 | lighting, background, zooming | 3,600 | 720,000
iLab-20M | [Borji, Izadi, and Itti2016] | 15 | 25-160 | 88 | lighting, background, focus | 18,480 | 21,798,480
CORe50 | [Lomonaco and Maltoni2017] | 10 | 5 | various | indoor/outdoor, slight handheld movement | 300 | 164,866
eVDS | [Culurciello and Canziani2017] | 35 | 37-97 | various | n/a | 144 | 420,000
Toybox | [this paper] | 12 | 30 | 4,200 | translating, zooming, manual activity | 6,600 | 2,300,000
Stereo pairs not included in counts.  RGB-D video.  Updated counts from dataset website. Handheld objects.   Egocentric video. Unstructured viewpoint distributions.
Table 1: Computer vision datasets that contain multiple real (i.e., not synthesized) images of the same physical object.

Toybox: Dataset Organization and Collection

Figure 4 provides an overview of the Toybox dataset. Representative video clips from Toybox can be viewed in Supplementary Video 1. This section provides details about the design of the dataset and our recording methods.

Selection of categories and objects. Toybox contains 12 categories, roughly grouped into three super-categories: household items (cup, mug, spoon, ball), animals (duck, cat, horse, giraffe), and vehicles (car, truck, airplane, helicopter).

To maximize the usefulness of Toybox for comparisons with studies of human learning, all 12 of these categories are among the most common early-learned nouns for typically developing children in the U.S. [Fenson et al.2007]. Categories were also selected to provide shape variety in each super-category (e.g., spoon vs. ball, duck vs. cat) as well as shape similarity (e.g., cup vs. mug, car vs. truck).

Each category contains 30 individual physical objects. For both animals and vehicles, we cannot include real objects, and so objects are either realistic, scaled-down models or “cartoony” toy objects. Objects were purchased mostly in local stores, with some acquired online. Individual objects were selected to provide a variety of shapes, colors, sizes, etc., and can be considered a representative sampling of typical objects available in the U.S.

Recording devices. Videos were recorded using Pivothead Original Series wearable cameras, which are worn like sunglasses and have the camera located just above the bridge of the wearer’s nose. Camera settings included: video resolution set to 1920 x 1080; frame rate set to 30 fps; quality set to SFine; focus set to auto; and exposure set to auto.

Canonical views. For each category, we defined a canonical view of the object, roughly centered in front of the camera-wearer’s eyes. For example, mugs start in an upright position with handle to the right. Animals and vehicles start in an upright position facing towards the left.

Video clips. For each object, a set of 12 videos was recorded, as shown in Figure 4 and Supplementary Video 1. Each clip is 20 seconds long, with the exception of absent/present video clips, which are 2 seconds long. For rotations, each clip contains two full revolutions of the object; for translations, each video contains three back-and-forth translations starting from the minus end of each axis. Rotations and translations were controlled to have an approximately constant velocity over the 20-second duration of the video. Thus, the pose of the object in every frame of a given video clip can be estimated according to its time.
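The time-to-pose mapping described above can be sketched in a few lines. This is only an illustration under idealized assumptions: angular velocity is treated as exactly constant, frame 0 is assumed to show the canonical view, and the function name `rotation_angle_deg` is hypothetical rather than part of any Toybox tooling.

```python
FPS = 30              # frame rate from the recording settings
CLIP_SECONDS = 20.0   # duration of a rotation clip
REVOLUTIONS = 2       # full revolutions per rotation clip


def rotation_angle_deg(frame_index: int) -> float:
    """Estimate the rotation angle (degrees from the canonical view) of a frame.

    Assumes perfectly constant angular velocity and that frame 0 shows the
    canonical view; real clips are only approximately constant-speed.
    """
    t = frame_index / FPS                                    # seconds since clip start
    degrees_per_second = 360.0 * REVOLUTIONS / CLIP_SECONDS  # 36 degrees per second
    return (t * degrees_per_second) % 360.0
```

Because the manipulation is manual, such estimates should be treated as approximate pose labels rather than ground truth.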

Recording procedures. Objects were semi-randomly assigned to individual camera-wearers (members of our research lab) such that no individual was over-represented in any category or object size class, to reduce any biases related to specific personal attributes or individual hand gestures. All videos were collected in an indoor setting against an off-white wall. Recordings were made across various times of day and lighting conditions, and so there is variation in lighting across different objects (as can be seen in Figure 4).

Figure 4: Toybox overview. Toybox contains 12 categories with 30 individual physical objects per category. There are 12 video clips per object. Each clip contains a defined transformation of the object: two full revolutions for rotation clips and three back-and-forth shifts for translation clips. A final “hodgepodge” clip contains unstructured object motion, mostly rotations. Please see Supplementary Video 1 for representative clips.

Sample Experiments Using Toybox

For initial, proof-of-concept object recognition experiments with Toybox, we use the transfer learning methodology appearing in many recent studies, e.g., [Bambach et al.2016, Pasquale et al.2016], which involves re-training the last layer of a pre-trained, deep convolutional neural network.

In particular, we use the Inception v3 network, as implemented in the TensorFlow software library [Abadi and others2015]. Inception is a representative convolutional neural network that has been shown to be highly successful in recognition tasks [Szegedy et al.2016]. The Inception v3 model we used here was pre-trained on the ImageNet ILSVRC 2012 dataset, which contains 1.2 million images from 1,000 categories. More than half of the Toybox categories appear among the 1,000 categories used for pre-training; the exceptions are helicopter, giraffe, horse, and duck. Our ongoing research includes training from scratch (see future work).
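To make the idea of re-training only the last layer concrete, here is a minimal pure-Python sketch. It is not the TensorFlow pipeline used in the experiments: the feature vectors stand in for the fixed activations of the pre-trained network's final hidden layer, and the function names, learning rate, and epoch count are all illustrative assumptions.

```python
import math
import random


def softmax(z):
    """Numerically stable softmax over a list of logits."""
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]


def train_last_layer(features, labels, n_classes, epochs=200, lr=0.5, seed=0):
    """Fit only a linear softmax layer on top of frozen feature vectors (SGD)."""
    rng = random.Random(seed)
    dim = len(features[0])
    W = [[rng.uniform(-0.01, 0.01) for _ in range(dim)] for _ in range(n_classes)]
    b = [0.0] * n_classes
    for _ in range(epochs):
        for x, y in zip(features, labels):
            logits = [sum(w * xi for w, xi in zip(W[c], x)) + b[c]
                      for c in range(n_classes)]
            p = softmax(logits)
            for c in range(n_classes):
                g = p[c] - (1.0 if c == y else 0.0)  # cross-entropy gradient w.r.t. logit c
                for j in range(dim):
                    W[c][j] -= lr * g * x[j]
                b[c] -= lr * g
    return W, b


def predict(W, b, x):
    """Return the index of the class with the largest logit."""
    logits = [sum(w * xi for w, xi in zip(Wc, x)) + bc for Wc, bc in zip(W, b)]
    return max(range(len(logits)), key=logits.__getitem__)
```

The key point is that the feature extractor is never updated; only `W` and `b` of the new output layer are learned from the new training images.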

Experiment 1 studied instance diversity and view diversity, and Experiment 2 studied viewpoint-dependent hidden layer representations, as described in more detail below.

Experiment 1: Using Toybox to study the effects of instance diversity and view diversity on recognition

In Experiment 1, we re-trained the last layer of the ILSVRC 2012 pre-trained Inception v3 network using images from Toybox, and then tested recognition performance using images from the same categories from ImageNet. Test performance is measured as the top-1 error rate.

Note that the choice of using ImageNet images (instead of hold-out Toybox images) for testing was deliberate. We aimed to explore how well training on a small number of handheld, often toy objects would be able to generalize to the very different objects/views in ImageNet (e.g., training on toy cats to recognize real cats). Certainly other testing approaches would also be interesting and will be pursued in future work. We constructed this ImageNet test set to contain 100 images/category across the 12 Toybox categories.

Instance diversity. We first looked at the effect of instance diversity on recognition performance by varying the number of individual physical objects per category in the training dataset, while keeping the total number of training images per category fixed at 1100 and uniformly drawn from the various video clips contained in the Toybox dataset.

For example, with one object per category, each of the 12 categories is represented by 1100 images of a single object from that category. With two objects per category, each category is represented by 1100 images uniformly drawn across two objects (550 images per object on average).
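The sampling scheme above can be sketched as follows. This is a hedged illustration, not the authors' code: `frames_by_object` is a hypothetical mapping from object identifiers to lists of frame identifiers for one category, and pooling all frames before drawing is one simple way to obtain a uniform draw across the chosen objects.

```python
import random


def sample_training_set(frames_by_object, n_objects, total_images=1100, seed=0):
    """Pick n_objects objects of one category, then draw total_images frames
    uniformly at random (without replacement) from their pooled frames."""
    rng = random.Random(seed)
    chosen = rng.sample(sorted(frames_by_object), n_objects)
    pool = [(obj, frame) for obj in chosen for frame in frames_by_object[obj]]
    return rng.sample(pool, total_images)
```

With equal numbers of frames per object, each of the `n_objects` objects contributes `total_images / n_objects` images on average, which matches the 550-per-object figure for the two-object case.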

Results from this experiment are shown in Figure 5A. A training set with images of only a single Toybox object per category (i.e., 1100 images per object) yields an average error rate of 60.63%, which while not excellent, is well below the random-guessing baseline error rate of 91.7%. Adding a second object (i.e., about 550 images per each of two objects) further reduces error to 51.98%. Adding more objects per category (with total training images per category fixed at 1100) continues to improve performance significantly, with our final experiment using 30 objects per category yielding an average error rate of 21.43%.
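For reference, the top-1 error rate and the random-guessing baseline quoted above can be computed as in this small sketch (the function names are illustrative). With 12 equally likely categories, a random guess is wrong with probability 1 - 1/12, or about 91.7%.

```python
def top1_error(predicted, actual):
    """Fraction of test images whose top-scoring predicted label is wrong."""
    wrong = sum(1 for p, a in zip(predicted, actual) if p != a)
    return wrong / len(actual)


def chance_error(n_classes):
    """Expected top-1 error of uniform random guessing over balanced classes."""
    return 1.0 - 1.0 / n_classes
```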

We also characterized the performance improvement by computing best-fit lines using both linear and exponential models. As shown in Figure 5A, the exponential curve yields a better fit. Therefore, at least from the perspective of this model fitting, it appears that increasing object diversity will reduce the error rate in an exponential manner, with much greater improvements in performance for the first few added objects, and smaller increases thereafter (especially after about 20 individual objects).
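A comparison of linear and exponential fits can be sketched as below. The exact exponential form used for Figure 5A is not stated in the text, so this sketch assumes the simple model y = a * exp(k * x) and fits it by ordinary least squares on log(y); that choice, and all function names, are assumptions for illustration. Fitting in log space minimizes error in log(y) rather than in y, so a nonlinear least-squares fit could give slightly different parameters.

```python
import math


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx


def sse(model, xs, ys):
    """Sum of squared errors of a model on the original (not log) scale."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys))


def compare_fits(xs, ys):
    """Return (linear SSE, exponential SSE) for positive-valued ys."""
    m, c = fit_line(xs, ys)
    k, log_a = fit_line(xs, [math.log(y) for y in ys])
    linear = lambda x: m * x + c
    exponential = lambda x: math.exp(log_a + k * x)
    return sse(linear, xs, ys), sse(exponential, xs, ys)
```

Calling `compare_fits` with the number of objects per category as `xs` and the measured error rates as `ys` returns the two residuals; the smaller one indicates the better-fitting model.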

View diversity. We also looked at view diversity, by varying the number of images per object included in the Toybox training set. By sampling these images uniformly across all Toybox video clips, the number of images per object can be used as a proxy for views per object. We conducted this experiment under three conditions, with the total number of objects per category fixed at 6, 12, and 24, respectively.

For example, for the 12 objects per category condition, we varied the total number of images per object from 2 to 100, drawn uniformly across all 12 objects. Specifically, if we pick 2 images per object, the training dataset would have 2 × 12 = 24 images per category, and similarly, if we pick 100 images per object, the training dataset would have 100 × 12 = 1,200 images per category.

Figure 5B shows results from this experiment. (Although we experimented with numbers of images per object up to 100, we noticed a near constant error rate once this number exceeded 40, and so the graph in the figure is truncated at 40 images per object.) Results across the three conditions show similar trends, and so we focus our discussion here on the 12 objects/category condition (blue data points and curve).

With a single image per object, the average top-1 error rate is 33.0%. This error rate is subsequently reduced to 27.5% if we have 10 images per object, and is further reduced to 25.6% and 24.8% for 20 and 40 images per object, respectively. As with object diversity, the effects of view diversity appear to show an exponential trend, with only modest improvements after about 5-10 views per object.

Figure 5: Effects of object diversity and view diversity on recognition performance, measured as top-1 error rate on an ImageNet-sourced test set. Each trial was run 5-6 times, shown as individual data points. A. Recognition as a function of instance diversity, i.e., number of objects per category in the Toybox training set, with the total size of the training set held fixed. B. Recognition as a function of view diversity, i.e., number of images per object in the Toybox training set, ranging from 2 to 40 images per object.
Figure 6: A novel method to identify neurons that correlate with a rotating object using our Toybox video dataset. A. Temporal raster plot of neural activations in the final hidden layer of the pre-trained Inception v3 network while “watching” a rotating mug. Each row shows an individual neuron, and the x-axis depicts time/rotation. B. The same neurons sorted based on their FFT amplitude from high (top) to low (bottom). Note the stripe pattern on the top half of the plot showing strong periodicity. C. Four different types of neurons identified using this method. D. Comparison of FFT analysis of mug, cup, and car, showing top 10 neurons after FFT amplitude sorting.

Experiment 2: Using Toybox to study viewpoint-dependent hidden layer representations

One fundamental question related to deep neural networks and object recognition is how objects are represented within the hidden layers of a network. We propose that the structured transformations of objects in Toybox videos may help us to better understand these representations.

Similar to Experiment 1, we used the ILSVRC 2012 pre-trained Inception v3 network, with the final output layer retrained on Toybox data with 1100 images per category selected across all Toybox objects. We also used the same ImageNet-sourced test set with 100 images per category.

Quantifying neuron temporal activation profiles. We began by studying the “temporal” activation profiles of neurons in the last hidden layer of the Inception network, while the network is receiving a sequence of Toybox input images depicting a mug rotating along the z(+) axis for two full cycles. Figure 6 shows visualizations of these activations.

In particular, Figure 6A depicts the activations over time of all 2048 neurons in the final hidden layer of the network. Each row shows the activations of an individual neuron (unsorted in this subfigure), and the x-axis indicates time, which also approximates the rotation degree of the mug. (This visualization method is adapted from the temporal raster plots used in neural physiology research.) The various neuron “firing patterns” are clearly heterogeneous: some neurons are constantly firing throughout the two rotation cycles, some remain silenced, and some fluctuate as the mug rotates.

To differentiate these neuron types, we applied a fast Fourier transform (FFT) to the activation of each neuron over the two rotation cycles. To capture general viewpoint trends, we focused our FFT analysis on a frequency of 4 (i.e., four cycles within the 20-second-long video that contains two complete rotations). We then sorted the 2048 neurons shown in Figure 6A based on their FFT amplitude at the frequency of 4; larger amplitudes indicate more robust oscillations. Figure 6B shows a visualization of the same neurons but sorted (top-to-bottom) by their FFT amplitudes.
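The sorting step can be sketched without any FFT library, because only one frequency bin is needed. The sketch below evaluates the discrete Fourier transform directly at bin k (k = 4 meaning four cycles over the clip) and ranks neurons by the resulting amplitude. The function names and the layout of `activations` (one time series per neuron) are assumptions for illustration, not the authors' implementation.

```python
import cmath
import math


def dft_bin(signal, k):
    """Discrete Fourier transform of `signal` evaluated at the single bin k."""
    n = len(signal)
    return sum(x * cmath.exp(-2j * math.pi * k * i / n)
               for i, x in enumerate(signal))


def sort_neurons_by_amplitude(activations, k=4):
    """activations[j] is the activation time series of neuron j.

    Returns (order, amps): neuron indices sorted from highest to lowest
    amplitude at bin k, and the amplitude of each neuron.
    """
    amps = [abs(dft_bin(series, k)) for series in activations]
    order = sorted(range(len(activations)), key=lambda j: amps[j], reverse=True)
    return order, amps
```

A neuron that is constantly active or constantly silent has near-zero amplitude at bin 4 and ends up at the bottom of the ranking, while a neuron whose activation oscillates four times over the clip rises to the top.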

Since FFT analysis also returns the phase information (in this case, positive phase correlates with handle presence and negative phase with handle absence), we were able to identify four different types of neurons based on their activation profiles. Examples of these four types are shown in Figure 6C and also in Supplementary Video 2: (1) neurons that fire when the mug handle is present (blue line); (2) neurons that fire when the mug handle is behind or in front of the mug body (yellow line); (3) neurons that fire throughout the video clip (black line); and (4) neurons that do not fire at all (green line; these neurons presumably do not contribute to the representation of the mug).
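One way the four neuron types might be separated automatically is sketched below, using the amplitude and phase at bin 4 together with the mean activation. The thresholds and label names are hypothetical, and the mapping from phase sign to handle visibility follows the description above for this particular mug clip; it would need to be re-derived for other clips or starting poses.

```python
import cmath
import math


def dft_bin(signal, k):
    """Discrete Fourier transform of `signal` evaluated at the single bin k."""
    n = len(signal)
    return sum(x * cmath.exp(-2j * math.pi * k * i / n)
               for i, x in enumerate(signal))


def classify_neuron(series, k=4, amp_threshold=0.1, silent_threshold=1e-3):
    """Assign one of four illustrative labels to a neuron's activation series.

    amp_threshold and silent_threshold are hypothetical cut-offs.
    """
    n = len(series)
    mean = sum(series) / n
    coeff = dft_bin(series, k)
    amplitude = 2 * abs(coeff) / n          # peak amplitude of the bin-k sinusoid
    if amplitude < amp_threshold:
        return "silent" if mean < silent_threshold else "constant"
    # Phase sign convention taken from the text for this clip only.
    return "handle_present" if cmath.phase(coeff) > 0 else "handle_hidden"
```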

We also tested this FFT analysis method on objects from other categories using our Toybox videos. As shown in Figure 6D, the ability to identify robust oscillating neurons mainly depends on the degree of asymmetry of the object along the z-axis. For instance, we were able to identify neurons with more robust oscillation for a mug and a car than for a cup (which is symmetric along the z-axis).

Figure 7: Effects of silencing hidden layer neurons on output layer activations, averaged over 100 images per category from the ImageNet-sourced test set. A. Silencing the top N neurons based on FFT amplitude sorting leads to a much steeper reduction in the normalized logit value of the mug output neuron. The blue line shows the reduction from silencing N randomly selected neurons as a control. B. Similar to A, showing softmax values instead of logit values. C. Silencing mug-preferred neurons (MPNs) has a similar effect on the logit value of the cup output neuron but has no effect on that of the car output neuron. D. Zoomed-in plot of softmax values, showing that silencing the top 20 neurons decreases mug prediction confidence while increasing cup prediction confidence, consistent with the fact that the majority of these neurons correlate with the presence of the mug handle.

Effects of neuron silencing. To investigate the implicit representations of the various types of neurons identified above, we performed a neuron silencing/lesion experiment by selectively “zeroing out” the activations of certain neurons in the last hidden layer, and then observing effects on recognition performance. Figure 7 shows results from these experiments, where the logit and softmax values for particular output neurons are shown as averages over 100 images from various categories in the ImageNet-based test set.

First, for test images from the mug category, we silenced N neurons (N varying from 0 to 2048) in the last hidden layer and examined the changes of both logit and softmax values of the mug neuron in the output layer. As shown in Figure 7A, silencing 0 of these hidden layer neurons has no effect, while silencing all 2048 neurons reduces the normalized logit value of the mug neuron to 0. Randomly silencing a subset of N neurons leads to a linear reduction of the logit value with respect to N. However, if we silence neurons based on the FFT amplitude sorting as shown in Figure 6B (i.e., first silencing the neuron with the highest FFT amplitude, then the top two, then top three, and so on), we observed a much steeper drop in mug logit value at the beginning. After 700 neurons, silencing has no more reduction effect on the mug logit value. Similar effects can be seen with the softmax value of the mug neuron (Figure 7B). In a sense, by selecting neurons with highest FFT amplitude, we can identify what we call mug-preferred neurons (MPNs).
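The silencing procedure can be sketched as a cumulative computation over one output neuron. This sketch makes two assumptions that the text does not spell out: the output logit is a weighted sum of hidden activations with no bias term (consistent with the logit reaching 0 when all neurons are silenced), and "normalized" means divided by the unsilenced logit. The argument `order` can be any ranking of hidden neurons, for example one based on FFT amplitude or a random shuffle as a control.

```python
def silencing_curve(weights, hidden, order):
    """Normalized logit of one output neuron after silencing the first N
    hidden neurons in `order`, for N = 0 .. len(order).

    weights: output-neuron weights, one per hidden neuron (no bias assumed).
    hidden:  hidden-layer activations for one image (or an average).
    """
    contributions = [w * h for w, h in zip(weights, hidden)]
    base = sum(contributions)        # logit with nothing silenced
    remaining = base
    curve = [1.0]
    for j in order:
        remaining -= contributions[j]    # zeroing neuron j removes its term
        curve.append(remaining / base)
    return curve
```

Plotting the curve for an FFT-based ranking against the curve for a random ranking reproduces the kind of comparison shown in Figure 7A.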

To examine the specificity of these MPNs, we tested the silencing effect on cup and car output neurons. Silencing the top MPNs has a significant impact on the logit value of the cup neuron (Figure 7C). This is not surprising given that a mug and a cup share many common features. However, the zoomed in softmax plot shows that silencing the top 20 MPNs slightly increases the softmax value of the cup output neuron, which is consistent with the fact that most of these neurons fire when the handle is present (Figure 7D and Supplementary Video 2). In other words, these neurons might be contributing to the difference between a mug and a cup.
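The opposite movement of the mug and cup softmax values can be illustrated with a toy model. Everything in this sketch is hypothetical: a three-neuron hidden layer (body shape, handle, wheels) and hand-picked weights stand in for the real 2048-neuron layer, purely to show why removing a handle-coding neuron can lower mug confidence while raising cup confidence.

```python
import math

# Hypothetical weights from hidden neurons [body, handle, wheels] to each class.
WEIGHTS = {
    "mug": [1.0, 1.5, -0.5],
    "cup": [1.0, -0.5, -0.5],
    "car": [-0.5, -0.5, 2.0],
}


def softmax(z):
    """Numerically stable softmax over a list of logits."""
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]


def class_probs(hidden, silenced=()):
    """Softmax class probabilities after zeroing the listed hidden neurons."""
    h = [0.0 if j in silenced else v for j, v in enumerate(hidden)]
    names = list(WEIGHTS)
    logits = [sum(w * x for w, x in zip(WEIGHTS[name], h)) for name in names]
    return dict(zip(names, softmax(logits)))
```

For a mug-like input such as `[1.0, 1.0, 0.0]`, calling `class_probs` with `silenced=(1,)` removes the handle evidence, so the mug and cup logits become equal and probability mass shifts from mug to cup.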

In contrast, silencing the top MPNs has almost no effect on the car output neuron (Figure 7C). In fact, the effect of silencing MPNs is almost identical to that of silencing random neurons (dotted grey line in 7C, see also 7A blue line).

These experiments showed that the MPNs contribute significantly to mug identity and much less to the identity of other categories like car. A small portion of the MPNs may be coding the handle feature to differentiate a mug from a cup. Interestingly, although silencing one or a few neurons that are most prominent does decrease the input value to a specific output neuron, there is a significant amount of value that remains. This result confirms that object features do not seem to be represented by a single or few neurons, but rather by an ensemble of neurons.

Discussion and Future Work

In this paper, we presented the new Toybox dataset of egocentric visual object transformations. We also provided results from two sample experiments showing how this dataset can be used to study visual learning, including (1) effects of instance diversity and view diversity on recognition performance, and (2) using a novel FFT-based method to classify hidden layer neurons according to how they represent various category- and viewpoint-dependent visual properties.

In future research, in addition to continuing the types of experiments presented here, we expect that the Toybox dataset will be valuable for studying new types of representations and learning algorithms that lend themselves to continuous image sequence inputs. For example, in human vision, object motion is critical for segmentation, and also likely plays a role in many other aspects of object detection and recognition [Ohki et al.2005, Sabbah, Gemmer, and others2017]. How motion features affect recognition performance, and how object motion might contribute to the learning phase as well (for instance, by providing a real-time version of data augmentation), are currently open research questions in the study of visual learning. With its structured object transformations and wide selection of categories and object instances, we believe the Toybox dataset will help drive continued research advances on these and many other important questions in AI and cognitive science.


Many thanks to Fernanda Elliot, Joshua Palmer, Soobeen Park, Joel Michelson, Aneesha Dasari, Ellis Brown, Max de Groot, Harsha Vankayalapati, and Joseph Eilbert for help in data collection. We would also like to thank Linda Smith, Chen Yu, Fuxin Li, and Jim Rehg for early discussions that influenced this research. This research was funded in part by a Vanderbilt Discovery Grant, “New explorations in visual object recognition,” and by NSF award #1730044.


  • [Abadi and others2015] Abadi, M., et al. 2015. TensorFlow: Large-scale machine learning on heterogeneous systems. Software available from
  • [Bambach et al.2016] Bambach, S.; Crandall, D. J.; Smith, L. B.; and Yu, C. 2016. Active viewing in toddlers facilitates visual object learning: An egocentric vision approach. In Proceedings of the 38th Annual Conference of the Cognitive Science Society.
  • [Borji, Izadi, and Itti2016] Borji, A.; Izadi, S.; and Itti, L. 2016. iLab-20M: A large-scale controlled object dataset to investigate deep learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2221–2230.
  • [Burianek et al.2000] Burianek, J.; Ahmadyfard, A.; Kittler, J.; et al. 2000. SOIL-47, the Surrey object image library. Centre for Vision, Speech and Signal Processing, University of Surrey. [Online].
  • [Culurciello and Canziani2017] Culurciello, E., and Canziani, A. 2017. e-Lab video data set.
  • [Deng et al.2009] Deng, J.; Dong, W.; Socher, R.; Li, L.-J.; Li, K.; and Li, F.-F. 2009. Imagenet: A large-scale hierarchical image database. In Computer Vision and Pattern Recognition, 2009, 248–255. IEEE.
  • [Fenson et al.2007] Fenson, L.; Bates, E.; Dale, P. S.; Marchman, V. A.; Reznick, J. S.; and Thal, D. J. 2007. MacArthur-Bates communicative development inventories. Paul H. Brookes Publishing Company.
  • [Geusebroek and others2005] Geusebroek, J.-M., et al. 2005. The amsterdam library of object images. International Journal of Computer Vision 61(1):103–112.
  • [James et al.2014] James, K. H.; Jones, S. S.; Smith, L. B.; and Swain, S. N. 2014. Young children’s self-generated object views and object recognition. Journal of Cognition and Development 15(3):393–401.
  • [Lai et al.2011] Lai, K.; Bo, L.; Ren, X.; and Fox, D. 2011. A large-scale hierarchical multi-view rgb-d object dataset. In Robotics and Automation (ICRA), 2011 IEEE International Conference on, 1817–1824. IEEE.
  • [LeCun, Huang, and Bottou2004] LeCun, Y.; Huang, F. J.; and Bottou, L. 2004. Learning methods for generic object recognition with invariance to pose and lighting. In Computer Vision and Pattern Recognition, 2004., volume 2, II–104. IEEE.
  • [Lomonaco and Maltoni2017] Lomonaco, V., and Maltoni, D. 2017. Core50: a new dataset and benchmark for continuous object recognition. In Levine, S.; Vanhoucke, V.; and Goldberg, K., eds., Proceedings of the 1st Annual Conference on Robot Learning, volume 78 of Proceedings of Machine Learning Research, 17–26. PMLR.
  • [Mervis1987] Mervis, C. 1987. Child-basic object categories and early lexical development. In Neisser, U., ed., Concepts and Conceptual Development: Ecological and Intellectual Factors in Categorization. Cambridge University Press. 201–233.
  • [Moreels and Perona2007] Moreels, P., and Perona, P. 2007. Evaluation of features detectors and descriptors based on 3d objects. International Journal of Computer Vision 73(3):263–284.
  • [Nene et al.1996] Nene, S. A.; Nayar, S. K.; Murase, H.; et al. 1996. Columbia object image library (COIL-100). Technical report CUCS-005-96.
  • [Ohki et al.2005] Ohki, K.; Chung, S.; Ch’ng, Y. H.; Kara, P.; and Reid, R. C. 2005. Functional imaging with cellular resolution reveals precise micro-architecture in visual cortex. Nature 433(7026):597.
  • [Ozuysal, Lepetit, and P.Fua2009] Ozuysal, M.; Lepetit, V.; and P.Fua. 2009. Pose estimation for category specific multiview object localization. In Conference on Computer Vision and Pattern Recognition.
  • [Pasquale et al.2016] Pasquale, G.; Ciliberto, C.; Rosasco, L.; and Natale, L. 2016. Object identification from few examples by improving the invariance of a deep convolutional neural network. In Intelligent Robots and Systems (IROS), 4904–4911. IEEE.
  • [Ren and Philipose2009] Ren, X., and Philipose, M. 2009. Egocentric recognition of handled objects: Benchmark and analysis. In Computer Vision and Pattern Recognition Workshops, 2009. CVPR Workshops 2009. IEEE Computer Society Conference on, 1–8. IEEE.
  • [Sabbah, Gemmer, and others2017] Sabbah, S.; Gemmer, J. A.; et al. 2017. A retinal code for motion along the gravitational and body axes. Nature 546(7659):492.
  • [Savarese and Fei-Fei2007] Savarese, S., and Fei-Fei, L. 2007. 3d generic object categorization, localization and pose estimation. In Computer Vision, 2007. ICCV 2007. IEEE 11th International Conference on, 1–8. IEEE.
  • [Singh et al.2014] Singh, A.; Sha, J.; Narayan, K. S.; Achim, T.; and Abbeel, P. 2014. Bigbird: A large-scale 3d database of object instances. In Robotics and Automation (ICRA), 2014 IEEE International Conference on, 509–516.
  • [Smith et al.2018] Smith, L. B.; Jayaraman, S.; Clerkin, E.; and Yu, C. 2018. The developing infant creates a curriculum for statistical learning. Trends in cognitive sciences.
  • [Szegedy et al.2016] Szegedy, C.; Vanhoucke, V.; Ioffe, S.; Shlens, J.; and Wojna, Z. 2016. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2818–2826.
  • [Torralba and Efros2011] Torralba, A., and Efros, A. A. 2011. Unbiased look at dataset bias. In Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, 1521–1528. IEEE.
  • [Xiang et al.2016] Xiang, Y.; Kim, W.; Chen, W.; Ji, J.; Choy, C.; Su, H.; Mottaghi, R.; Guibas, L.; and Savarese, S. 2016. Objectnet3d: A large scale database for 3d object recognition. In European Conference Computer Vision (ECCV).