Scene Parsing by Weakly Supervised Learning with Image Descriptions

09/27/2017
by Ruimao Zhang, et al.

This paper investigates a fundamental problem of scene understanding: how to parse a scene image into a structured configuration (i.e., a semantic object hierarchy with object interaction relations). We propose a deep architecture consisting of two networks: i) a convolutional neural network (CNN) that extracts the image representation for pixel-wise object labeling and ii) a recursive neural network (RsNN) that discovers the hierarchical object structure and the inter-object relations. Rather than relying on elaborate annotations (e.g., manually labeled semantic maps and relations), we train our deep model in a weakly supervised manner by leveraging the descriptive sentences of the training images. Specifically, we decompose each sentence into a semantic tree consisting of nouns and verb phrases, and apply these tree structures to discover the configurations of the training images. Once these scene configurations are determined, the parameters of both the CNN and the RsNN are updated accordingly by backpropagation. The entire model training is accomplished through an Expectation-Maximization method. Extensive experiments show that our model produces meaningful and structured scene configurations, and achieves more favorable scene labeling results on the PASCAL VOC 2012 and SYSU-Scenes datasets than other state-of-the-art weakly supervised deep learning methods. SYSU-Scenes is a dedicated dataset that we release to facilitate further research on scene parsing; it contains more than 5000 scene images with their sentence-based semantic descriptions.
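
To make the idea of a semantic tree "consisting of nouns and verb phrases" concrete, here is a minimal Python sketch of one possible representation. It is an illustration under stated assumptions, not the authors' implementation: the `TreeNode` class, the `entities` and `relations` helpers, the binary-branching layout, and the example sentence are all hypothetical, and the tree is built by hand rather than produced by a sentence parser.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class TreeNode:
    """Hypothetical node: a noun at a leaf, a verb phrase at an internal node."""
    label: str
    children: Optional[List["TreeNode"]] = None

    def is_leaf(self) -> bool:
        return not self.children


def entities(node: TreeNode) -> List[str]:
    """Collect the leaf nouns under a node, left to right."""
    if node.is_leaf():
        return [node.label]
    found: List[str] = []
    for child in node.children:
        found.extend(entities(child))
    return found


def relations(node: TreeNode) -> List[Tuple[str, List[str], List[str]]]:
    """List (verb phrase, left entities, right entities), bottom-up.

    Assumes every internal node has exactly two children (an assumption
    made for this sketch only).
    """
    if node.is_leaf():
        return []
    found = []
    for child in node.children:
        found.extend(relations(child))
    left, right = node.children
    found.append((node.label, entities(left), entities(right)))
    return found


# Hand-built tree for the made-up sentence "a man rides a horse next to a dog".
tree = TreeNode("next to", [
    TreeNode("rides", [TreeNode("man"), TreeNode("horse")]),
    TreeNode("dog"),
])

print(entities(tree))
print(relations(tree))
```

In this toy layout the leaves play the role of the objects to be labeled in the image, and each internal node records a relation between the groups of objects beneath it, which mirrors the object hierarchy with interaction relations described in the abstract.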


