On the Capability of Neural Networks to Generalize to Unseen Category-Pose Combinations

07/15/2020
by Spandan Madan, et al.

Recognizing an object's category and pose lies at the heart of visual understanding. Recent work suggests that deep neural networks (DNNs) often fail to generalize to category-pose combinations not seen during training. However, it is unclear when and how such generalization may be possible. Does the number of combinations seen during training impact generalization? Is it better to learn category and pose in separate networks, or in a single shared network? Furthermore, what are the neural mechanisms that drive the network's generalization? In this paper, we answer these questions by analyzing state-of-the-art DNNs trained to recognize both object category and pose (position, scale, and 3D viewpoint) with quantitative control over the number of category-pose combinations seen during training. We also investigate the emergence of two types of specialized neurons that can explain generalization to unseen combinations: neurons selective to category and invariant to pose, and vice versa. We perform experiments on MNIST extended with position or scale, the iLab dataset with vehicles at different viewpoints, and a challenging new dataset for car model recognition and viewpoint estimation that we introduce in this paper, the Biased-Cars dataset. Our results demonstrate that as the number of combinations seen during training increases, networks generalize better to unseen category-pose combinations, facilitated by an increase in the selectivity and invariance of individual neurons. We find that learning category and pose in separate networks rather than in a shared one increases such selectivity and invariance, as separate networks are not forced to preserve information about both category and pose. This enables separate networks to significantly outperform shared ones at predicting unseen category-pose combinations.
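To make the idea of "quantitative control over the number of category-pose combinations seen during training" concrete, here is a minimal sketch of one way such a split could be built. It is not the authors' procedure: the function name `split_combinations`, the integer encoding of categories and poses, and the diagonal-offset scheme for choosing seen pairs are all assumptions made for illustration.

```python
from itertools import product


def split_combinations(n_categories, n_poses, seen_per_category):
    """Split the category-pose grid into seen and unseen (category, pose) pairs.

    Hypothetical scheme (an assumption, not taken from the paper):
    category c is paired with poses c, c+1, ..., c+seen_per_category-1
    (mod n_poses), so every category appears in training, and every pose
    does too whenever n_categories >= n_poses.
    """
    if not 1 <= seen_per_category <= n_poses:
        raise ValueError("seen_per_category must be between 1 and n_poses")

    seen = {
        (c, (c + k) % n_poses)
        for c in range(n_categories)
        for k in range(seen_per_category)
    }
    all_pairs = set(product(range(n_categories), range(n_poses)))
    return seen, all_pairs - seen


# Example: a 5 x 5 grid where each category is seen with 2 of the 5 poses.
seen, unseen = split_combinations(n_categories=5, n_poses=5, seen_per_category=2)
```

Raising `seen_per_category` increases the number of combinations available for training while the remaining pairs stay held out for testing, which is the kind of knob the abstract describes varying.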


research
12/07/2020

Generating unseen complex scenes: are we there yet?

Although recent complex scene conditional generation models generate inc...
research
08/19/2021

DECA: Deep viewpoint-Equivariant human pose estimation using Capsule Autoencoders

Human Pose Estimation (HPE) aims at retrieving the 3D position of human ...
research
02/13/2023

Capsules as viewpoint learners for human pose estimation

The task of human pose estimation (HPE) deals with the ill-posed problem...
research
03/23/2023

Learning and generalization of compositional representations of visual scenes

Complex visual scenes that are composed of multiple objects, each with a...
research
04/20/2023

Generalizing Neural Human Fitting to Unseen Poses With Articulated SE(3) Equivariance

We address the problem of fitting a parametric human body model (SMPL) t...
research
02/10/2021

Systematic Generalization for Predictive Control in Multivariate Time Series

Prior work has focused on evaluating the ability of neural networks to r...
