Bridging the Gap between Semantics and Multimedia Processing

11/25/2019
by   Marcio Ferreira Moreno, et al.
EPFL
puc-rio
ibm

In this paper, we give an overview of the semantic gap problem in multimedia and discuss how machine learning and symbolic AI can be combined to narrow this gap. We describe the gap in terms of a classical architecture for multimedia processing and discuss a structured approach to bridge it. This approach combines machine learning (for mapping signals to objects) and symbolic AI (for linking objects to meanings). Our main goal is to raise awareness and discuss the challenges involved in this structured approach to multimedia understanding, especially in view of the latest developments in machine learning and symbolic AI.


I Introduction

A classic problem in multimedia representation and understanding is the semantic gap problem [Sikos-L-F-2017]. It states that there is a large representational gap between the audiovisual signals that compose multimedia objects and the concepts represented by these signals. For instance, the dominant color and movement trajectory of a given set of pixels in a video clip, which are low-level characteristics of the clip, usually do not provide much information about the meaning of the set of pixels, at least not to computers. But recent developments in artificial intelligence (AI) are changing that.

Backed by large training datasets, current machine learning methods are able to extrapolate complex patterns from low-level multimedia data. These patterns are embodied in trained models which can be used to classify or identify persons and objects with reasonable speed and accuracy in images, audio clips, and to a lesser extent video clips [Zhang-W-2019].

But being able to identify persons and objects in multimedia data only solves half of the problem. To emulate human cognition and truly understand a scene—for instance, to determine who is doing what and the consequences of those actions—computers need additional information: they need common sense knowledge and domain knowledge, and also the capacity to infer new knowledge from preexisting knowledge. This is where symbolic AI comes in. The basic idea of symbolic AI is to describe the world, its entities, and their relationships using a formal language and to develop efficient algorithms to query and deduce things from these formal descriptions.
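As a minimal sketch of this basic idea, assume a toy formal language in which every statement is a subject-predicate-object triple. The predicate names and the `query` helper below are hypothetical and chosen only for illustration; they are not taken from any particular knowledge representation system.

```python
from typing import List, Optional, Set, Tuple

# A fact is a (subject, predicate, object) triple.
Triple = Tuple[str, str, str]

# Toy formal description of a small part of the world.
# Entity and predicate names are illustrative assumptions.
FACTS: Set[Triple] = {
    ("Marat", "is_a", "Person"),
    ("Corday", "is_a", "Person"),
    ("Marat", "occupation", "journalist"),
    ("Corday", "enemy_of", "Marat"),
}


def query(
    facts: Set[Triple],
    s: Optional[str] = None,
    p: Optional[str] = None,
    o: Optional[str] = None,
) -> List[Triple]:
    """Return the facts matching a pattern; None acts as a wildcard."""
    return sorted(
        t
        for t in facts
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    )


# Who is described as a person?
print(query(FACTS, p="is_a", o="Person"))
# What do we know about Corday's relations?
print(query(FACTS, s="Corday", p="enemy_of"))
```

Querying here is plain pattern matching over stored statements; deducing new statements from existing ones requires rules in addition to facts.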

In this paper, we give an overview of the semantic gap problem in multimedia and discuss how machine learning (ML) and symbolic AI can be combined to narrow this gap. More specifically, we highlight the fact that what we call the semantic gap consists of many gaps which exist between the various layers of multimedia representation. A structured approach to tackle the gap as a whole is thus to tackle each of these smaller gaps individually, through a combination of ML (for mapping signals to objects) and symbolic AI (for linking objects to meaning).

Our main goal here is to raise awareness and discuss the challenges involved in this structured approach to multimedia understanding, especially in view of the latest developments in ML and symbolic AI.

The rest of the paper is organized as follows. In Section II, we define what we mean by “symbolic AI” and justify why we need it. In Section III, we describe how the semantic gap problem is distributed among the various layers of multimedia representation, and discuss a structured approach for multimedia understanding. In Section IV, we discuss the challenges involved in such a structured approach. Finally, in Section V, we draw our conclusions.

II Why We Need Symbolic AI

To illustrate the kind of applications enabled by the combination of symbolic AI with machine learning and multimedia, consider Figure 1.

Fig. 1: The Death of Marat (detail), by Jacques-Louis David, 1793. (WikiMedia) Labeled regions: knife, Jean-Paul Marat, wound, letter from Charlotte Corday.

Suppose we are given this picture and suppose the only thing we know about it is what we can infer from the image. We can see it depicts Jean-Paul Marat (assuming we can identify him), a stab wound, a blood-tainted knife, and a letter addressed to him and signed by Charlotte Corday (assuming we can read the contents of the letter). The analogy here is that we have extracted this information (these facts) using pattern matching. Although such basic facts allow us to perform simple computational tasks, such as keyword-based image classification and search, they are not enough to understand the image.

To truly understand what is being depicted in Figure 1 we need more than basic facts. We need (1) general knowledge about the world, (2) specific knowledge about the persons named, and (3) the capacity to combine general and specific knowledge with the facts extracted from the image in order to infer new facts.

Now suppose we are given (1), (2), and (3). From our general knowledge of the world, and possibly by further analyzing the image, we can assert with high confidence that Marat is holding the letter and that he has a stab wound on the chest. From this and from the blood-tainted knife depicted below him, we might infer that the depicted knife is the object that caused the wound. Since knives are not autonomous beings, we might also conclude that someone (possibly himself) stabbed Marat in the chest. But who and why?

To answer these questions, we will need more information. Suppose we are told that Marat was a journalist and political agitator, and one of the leaders of a radical political faction in the Reign of Terror period of the French Revolution (c. 1793). Suppose we are also told that Charlotte Corday, who signed the letter, was a declared political enemy of Marat: she blamed him for a number of killings in Paris and other cities and believed that he was a grave threat to the French Republic. In light of these new facts, we can conclude that Figure 1 looks like the scene of a political murder.

By combining this conclusion with the additional fact that Charlotte Corday is known to have murdered Jean-Paul Marat with a knife while he was in his bathtub, holding a letter from her, we can infer with a high degree of confidence that Figure 1 must be a graphical representation of this incident, that is, of the politically motivated assassination of Jean-Paul Marat by Charlotte Corday.

The derivation of this last fact from the visual patterns of Figure 1 has only been possible because we have had access not only to basic facts extracted from the image, but also to facts about the world (common sense knowledge) and about the depicted objects and persons (domain knowledge), and because we could combine all of these facts and make inferences.
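The chain of inferences above can be sketched as a small forward-chaining rule engine. This is only an illustration under stated assumptions: forward chaining is one simple inference technique among many, and the triple encoding, the predicate names, and the three rules below are hypothetical simplifications of the reasoning described in the text.

```python
from typing import Dict, Iterator, List, Optional, Set, Tuple

Triple = Tuple[str, str, str]
Rule = Tuple[List[Triple], Triple]  # (premises, conclusion)
Bindings = Dict[str, str]

# Facts extracted from the image plus domain knowledge (illustrative names).
FACTS: Set[Triple] = {
    ("marat", "has", "stab_wound"),
    ("knife", "is", "bloodstained"),
    ("marat", "holds", "letter"),
    ("letter", "signed_by", "corday"),
    ("corday", "enemy_of", "marat"),
}

# Hypothetical common-sense rules; terms starting with "?" are variables.
RULES: List[Rule] = [
    ([("?v", "has", "stab_wound"), ("knife", "is", "bloodstained")],
     ("knife", "caused_wound_of", "?v")),
    ([("knife", "caused_wound_of", "?v")],
     ("?v", "was", "stabbed")),
    ([("?v", "was", "stabbed"), ("?v", "holds", "?l"),
      ("?l", "signed_by", "?s"), ("?s", "enemy_of", "?v")],
     ("?s", "suspected_of_stabbing", "?v")),
]


def match(pattern: Triple, fact: Triple, bindings: Bindings) -> Optional[Bindings]:
    """Extend bindings so that pattern equals fact, or return None."""
    new = dict(bindings)
    for p, f in zip(pattern, fact):
        if p.startswith("?"):
            if new.setdefault(p, f) != f:
                return None
        elif p != f:
            return None
    return new


def satisfy(premises: List[Triple], facts: Set[Triple],
            bindings: Bindings) -> Iterator[Bindings]:
    """Yield every variable binding that satisfies all premises."""
    if not premises:
        yield bindings
        return
    for fact in facts:
        extended = match(premises[0], fact, bindings)
        if extended is not None:
            yield from satisfy(premises[1:], facts, extended)


def forward_chain(facts: Set[Triple], rules: List[Rule]) -> Set[Triple]:
    """Apply the rules repeatedly until no new fact can be derived."""
    known = set(facts)
    changed = True
    while changed:
        changed = False
        for premises, conclusion in rules:
            # Collect all bindings first so `known` is not mutated mid-iteration.
            for b in list(satisfy(premises, known, {})):
                derived = tuple(b.get(term, term) for term in conclusion)
                if derived not in known:
                    known.add(derived)
                    changed = True
    return known


for fact in sorted(forward_chain(FACTS, RULES) - FACTS):
    print(fact)
```

The derived facts are only as trustworthy as the rules. A realistic system would also need to represent uncertainty, which this sketch ignores.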

One of the main goals of symbolic AI is to enable the representation and manipulation of pieces of knowledge by computers in ways that resemble or emulate the kind of manipulations performed by humans, that is, manipulations similar to the considerations that enabled us to determine the true meaning of Figure 1. The combination of this capacity with multimedia opens up many possibilities. For instance, the Marat murder example is an application of automated image understanding. Two related applications are video understanding and audio understanding, which are often more complex as they involve the extraction of temporal information.

Other applications of symbolic AI to multimedia include the semantic retrieval, classification, recommendation, and inspection of multimedia data—for example, to automatically identify suspicious activity in surveillance videos, generate age ratings for music and movies, and identify risk factors for diseases in medical images and videos.

III A Structured Approach to Multimedia Understanding

III-A The many semantic gaps

There is a hierarchy of layers of processing and representation separating raw multimedia data (arrays of bytes) from their semantics (meaning). The so-called semantic gap can occur between any two of these layers. To see why this is the case, consider Figure LABEL:fig:gap which depicts the classic structure of a bottom-up pipeline for extracting semantics from multimedia content [Hare2006MindTG].