Multimodal Attribute Extraction

11/29/2017 ∙ by Robert L. Logan IV, et al. ∙ University of California, Irvine

The broad goal of information extraction is to derive structured information from unstructured data. However, most existing methods focus solely on text, ignoring other types of unstructured data such as images, video and audio which comprise an increasing portion of the information on the web. To address this shortcoming, we propose the task of multimodal attribute extraction. Given a collection of unstructured and semi-structured contextual information about an entity (such as a textual description, or visual depictions) the task is to extract the entity's underlying attributes. In this paper, we provide a dataset containing mixed-media data for over 2 million product items along with 7 million attribute-value pairs describing the items which can be used to train attribute extractors in a weakly supervised manner. We provide a variety of baselines which demonstrate the relative effectiveness of the individual modes of information towards solving the task, as well as study human performance.




1 Introduction

Given the large collections of unstructured and semi-structured data available on the web, there is a crucial need to enable quick and efficient access to the knowledge content within them. Traditionally, the field of information extraction has focused on extracting such knowledge from unstructured text documents, such as job postings, scientific papers, news articles, and emails. However, the content on the web increasingly contains more varied types of data, including semi-structured web pages, tables that do not adhere to any schema, photographs, videos, and audio. Given a query by a user, the appropriate information may appear in any of these different modes, and thus there’s a crucial need for methods to construct knowledge bases from different types of data, and more importantly, combine the evidence in order to extract the correct answer.

Motivated by this goal, we introduce the task of multimodal attribute extraction. Provided contextual information about an entity, in the form of any of the modes described above, along with an attribute query, the goal is to extract the corresponding value for that attribute. While attribute extraction on the domain of text has been well-studied [4, 7, 16, 18, 20], to our knowledge this is the first time attribute extraction using a combination of multiple modes of data has been considered. This introduces additional challenges to the problem, since a multimodal attribute extractor needs to be able to return values provided any kind of evidence, whereas modern attribute extractors treat attribute extraction as a tagging problem and thus only work when attributes occur as a substring of text.

In order to support research on this task, we release the Multimodal Attribute Extraction (MAE) dataset, a large dataset containing mixed-media data for over 2.2 million commercial product items, collected from a large number of e-commerce sites using the Diffbot Product API. The dataset is freely available online. The collection of items is diverse and includes categories such as electronic products, jewelry, clothing, vehicles, and real estate. For each item, we provide a textual product description, a collection of images, and an open-schema table of attribute-value pairs (see Figure 1 for an example). The provided attribute-value pairs only give a very weak source of supervision: where the value might appear in the context is not known, and further, it is not even guaranteed that the value can be extracted from the provided evidence. In all, there are over 4 million images and 7.6 million attribute-value pairs. By releasing such a large dataset, we hope to drive progress on this task, similar to how the Penn Treebank [14], SQuAD [19], and ImageNet [6] have driven progress on syntactic parsing, question answering, and object recognition, respectively.

To assess the difficulty of the task and the dataset, we first conduct a human evaluation study using Mechanical Turk, which demonstrates that all available modes of information are useful for detecting values. We also train and provide results for a variety of machine learning models on the dataset. We observe that a simple most-common value classifier, which always predicts the most common value for a given attribute, provides a very difficult baseline for more complicated models to beat (33% accuracy). In our current experiments, we are unable to train an image-only classifier that outperforms this simple model, despite using modern neural architectures such as VGG-16 [21] and Google's Inception-v3 [22]. However, we are able to obtain significantly better performance using a text-only classifier (59% accuracy). We hope to improve upon these results and obtain more accurate models in further research.

Figure 1: An example item, along with its image, textual description, and tabular attributes.

2 Multimodal Product Attribute Extraction

Since a multimodal attribute extractor needs to be able to return values for attributes which occur in images as well as text, we cannot treat the problem as a labeling problem, as is done in existing approaches to attribute extraction. We instead define the problem as follows: given a product item i and a query attribute a, we need to extract a corresponding value v from the evidence provided for i, namely a textual description (D_i) and a collection of images (I_i). For example, in Figure 1, we observe the image and the description of a product, and examples of some attributes and values of interest. For training, for a set of product items P, we are given, for each item i in P, its textual description D_i and the images I_i, along with a set A_i of attribute-value pairs (a, v). In general, the products at query time will not be in P, and we do not assume any fixed ontology for products, attributes, or values. We evaluate performance on this task as the accuracy of the predicted value against the observed value; however, since there may be multiple correct values, we also include a hits@k evaluation.
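The task definition above can be made concrete with a toy example. The field names and the lookup-based "extractor" below are illustrative only (they are not the dataset's actual schema or the paper's model); a real extractor must predict the value from the description and images alone:

```python
# A hypothetical MAE-style training record: each product item i has a
# textual description D_i, a collection of images I_i, and a set A_i of
# weakly supervised attribute-value pairs. Field names are our own.
product = {
    "description": "Solid oak dining table with a natural finish.",
    "images": ["img_001.jpg", "img_002.jpg"],   # I_i (references only)
    "attributes": {                              # A_i
        "material": "oak",
        "color finish": "natural",
    },
}

def extract(product, attribute):
    """Toy stand-in for an extractor's interface: given a product and a
    query attribute a, return a value v. Here we simply look up the gold
    table; a learned model would use the description and images instead."""
    return product["attributes"].get(attribute)
```

Note that the gold table is available only at training time, and only as weak supervision: nothing guarantees that "oak" actually appears in the description or is visible in the images.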

# products               2.2 m
# images                 4.0 m
# attribute-value pairs  7.6 m
# unique attributes      2.1 k
# unique values          23.6 k

Table 1: MAE dataset statistics.

Figure 2: Histograms of attribute and value counts.

The MAE Dataset

The MAE dataset is composed of mixed media data for 2.2 million product items, obtained by running the Diffbot Product API on over 20 million web pages from 1068 different commercial websites. As in the task definition, there is a textual description, set of product images, and open-schema table of product attributes for every item. The Diffbot API obtains this information using a machine learning based extractor which uses visual, textual and layout features of the fully rendered product webpage. For example, attribute-value pairs are automatically extracted from tables present on product webpages. Due to the automated nature of this collection process, there is some noise present in the dataset. For instance, the same attribute may be represented many different ways (e.g. Length, length, len.). We use regular-expression based preprocessing to normalize the most common attributes, however, we leave values unnormalized. We also remove any attribute-value pairs that satisfy any of the following frequency conditions: the attribute occurs less than 500 times, the value occurs less than 50 times, or the attribute’s most common value makes up more than 80% of the attribute-value pairs. The data is split into a training, validation, and test set using an 80-10-10 split.
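The frequency-based filtering described above can be sketched as follows. The thresholds come from the text; the function name and exact implementation are our own, and the paper's actual preprocessing pipeline may differ in details:

```python
from collections import Counter

def filter_pairs(pairs, min_attr=500, min_val=50, max_mode_frac=0.80):
    """Drop (attribute, value) pairs matching any of the paper's frequency
    conditions: the attribute occurs fewer than min_attr times, the value
    occurs fewer than min_val times, or the attribute's most common value
    makes up more than max_mode_frac of that attribute's pairs."""
    attr_counts = Counter(a for a, _ in pairs)
    val_counts = Counter(v for _, v in pairs)
    per_attr = {}  # per-attribute value histograms
    for a, v in pairs:
        per_attr.setdefault(a, Counter())[v] += 1
    kept = []
    for a, v in pairs:
        if attr_counts[a] < min_attr:
            continue
        if val_counts[v] < min_val:
            continue
        mode_count = per_attr[a].most_common(1)[0][1]
        if mode_count > max_mode_frac * attr_counts[a]:
            continue
        kept.append((a, v))
    return kept
```

For example, with toy thresholds, an attribute whose most common value covers 90% of its pairs is removed entirely, while one at exactly 80% is kept.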

Mechanical Turk Evaluation

Since the attributes and values have been extracted as they appear on the web sites, there is no guarantee that the attribute-value pairs appear in either the product images or textual descriptions. We perform a study using Amazon Mechanical Turk to determine the extent to which this issue affects the dataset, as well as collect a gold evaluation dataset of attribute-value pairs that are guaranteed to show up in the context information. Mechanical Turk workers are presented a product’s images and textual description, and asked to determine whether they can predict the value for a given product attribute (from a list of choices) using the provided information, and if so, using which pieces of information. We use a majority vote to eliminate noise in these annotations. The (preliminary) results of this study suggest that only 42% of the attribute-value pairs can be found using contextual information. Of those, 35% could be found using the product’s image and 70% could be found using the textual description. This suggests that while textual descriptions are the most useful mode for attribute extraction, there is still beneficial information contained in images.

Figure 3: Basic architecture of the multimodal attribute extraction model.

3 Multimodal Fusion Model

In this section, we formulate a novel extraction model for the task that builds upon architectures recently used in tasks such as image captioning, question answering, and VQA. The model is composed of three separate modules: (1) an encoding module that uses modern neural architectures to jointly embed the query, text, and images into a common latent space; (2) a fusion module that combines these embedded vectors into a single dense vector using an attribute-specific attention mechanism; and (3) a similarity-based value decoder which produces the final value prediction. We provide an overview of this architecture in Figure 3.


Encoding Module

We assign a dense embedding to each attribute and value, i.e. attribute a is represented by a d-dimensional vector e_a, and value v by e_v, where the vectors are learned during training. For the textual description D_i, we first tokenize the text using the Stanford tokenizer [13], then embed all of the words using the GloVe algorithm [17] trained on all of the descriptions in the training data. We use the CNN architecture of Kim [10], which consists of convolutional layers, max-pooling, and a fully-connected layer, to combine these pretrained embeddings into a single dense vector t for the description. Embeddings of the images I_i are also produced using convolutional neural networks. Specifically, we obtain intermediate image representations using the output of the fc7 layer (after applying the ReLU non-linearity) of a pretrained 16-layer VGG model [21]. We then feed the output through a fully-connected layer to obtain a d-dimensional embedding for each image. The final image embedding m is produced by performing max-pooling over the per-image embeddings.


To fuse the attribute embedding e_a with the text and image embeddings t and m, we experiment with two different techniques. The first, called Concat, concatenates the three and feeds them through a fully-connected layer to produce the fused encoding e. The second approach, called GMU for gated multimodal unit [3], first fuses the attribute vector with t and m independently using fully-connected layers, resulting in h_t and h_m. We combine these by first creating a gating vector z = σ(W_z [e_a; t; m]), followed by e = z ⊙ h_t + (1 − z) ⊙ h_m. For unimodal baselines, the fusion module is replaced by a fully-connected layer.
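The GMU gating step can be sketched as below. The gate parameterization (a single linear map W_z over the concatenated inputs) is our reading of the gated multimodal unit of [3]; the fully-connected layers producing h_t and h_m are assumed to be applied beforehand:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gmu_fuse(a, h_t, h_m, W_z):
    """Gated multimodal unit sketch: a gate z, computed from the attribute
    embedding and both modality encodings, convexly mixes the text vector
    h_t and the image vector h_m elementwise."""
    z = sigmoid(W_z @ np.concatenate([a, h_t, h_m]))
    return z * h_t + (1.0 - z) * h_m
```

When the gate saturates at 1 the unit passes the text encoding through unchanged, and at 0 it passes the image encoding; in between it interpolates per dimension, which is what lets the model weight modalities differently per attribute.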

Loss Function

We use a variant of the contrastive loss function introduced by Chopra et al. [5]. Let e denote the embedding produced by the fusion layer. Our goal is to produce an embedding which is close to the correct value embedding e_v (i.e. the one from the training example), and distant from other value embeddings e_v′. To measure closeness we use cosine similarity, denoted by s(·, ·), followed by a variant of squared hinge loss:

    ℓ = max(0, 1 − s(e, e_v) + s(e, e_v′))²

where the negative value v′ is sampled for each training example from the empirical distribution of value counts displayed in Figure 2. To obtain a value prediction given context, we identify the value v whose embedding e_v is closest to the context embedding e, according to the cosine similarity s(e, e_v).
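A minimal sketch of the similarity, the margin-based squared hinge loss, and the decoding step, assuming a unit margin (the paper's exact margin and sampling code are not given here, so treat the constants as assumptions):

```python
import numpy as np

def cos(u, v):
    """Cosine similarity s(u, v)."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def hinge_loss(e, v_pos, v_neg, margin=1.0):
    """Squared-hinge contrastive variant: push the similarity to the
    correct value embedding above the similarity to a sampled negative
    by at least `margin`; zero loss once the margin is satisfied."""
    return max(0.0, margin - cos(e, v_pos) + cos(e, v_neg)) ** 2

def predict(e, value_embeddings):
    """Similarity-based decoding: return the value whose embedding is
    closest to the fused context embedding under cosine similarity."""
    return max(value_embeddings, key=lambda v: cos(e, value_embeddings[v]))
```

Sampling negatives from the empirical value distribution (rather than uniformly) means frequent values are contrasted against more often, which matches how often they compete at prediction time.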

4 Experiments

We evaluate on a subset of the MAE dataset consisting of the 100 most common attributes, covering roughly 50% of the examples in the overall MAE dataset. To determine the relative effectiveness of the different modes of information, we train image-only and text-only versions of the model described above. Following the suggestions in Zhang and Wallace [25], we use a single 600-unit layer in our text convolutions, with a 5-word window size. We apply dropout to the output of both the image and text CNNs before feeding the output through fully-connected layers to obtain the image and text embeddings. Employing a coarse grid search, we found models performed best using a large embedding dimension. Lastly, we explore multimodal models using both the Concat and the GMU strategies. To evaluate models we use the hits@k metric on the values.
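The hits@k metric used throughout the experiments is straightforward: an example counts as a hit if the observed value appears among the model's top-k ranked predictions, and the metric averages this over examples (the helper below is our own, but matches the standard definition):

```python
def mean_hits_at_k(ranked_predictions, gold_values, k):
    """hits@k averaged over examples: for each example, score 1.0 if the
    gold value appears in the top-k ranked predictions, else 0.0."""
    hits = [1.0 if gold in preds[:k] else 0.0
            for preds, gold in zip(ranked_predictions, gold_values)]
    return sum(hits) / len(hits)
```

hits@1 is ordinary accuracy; the larger k values in Table 2 give partial credit when a correct value is ranked highly but not first, which matters because attributes often admit multiple valid values.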

The results of our experiments are summarized in Table 2. We include a simple most-common value model that always predicts the most common value for a given attribute. Observe that the performance of the image baseline model is almost identical to the most-common value model. Similarly, the performance of the multimodal models is similar to the text baseline model. Thus our models so far have been unable to effectively incorporate information from the image data. These results show that the task is sufficiently challenging that even a complex neural model cannot solve it, and thus it is a ripe area for future research.

Model predictions for the example shown in Figure 1 are given in Table 3, along with their similarity scores. Observe that the predictions made by the current image baseline model are almost identical to the most-common value model. This suggests that our current image baseline model is essentially ignoring all of the image related information and instead learning to predict common values.

Model                          Hits@1  Hits@5  Hits@10  Hits@20
Most-Common Value              38.81   77.26   87.96    95.96
Image Baseline                 38.07   76.11   86.99    95.00
Text Baseline                  58.41   87.49   93.94    98.00
Multimodal Baseline - Concat   59.48   87.33   93.23    97.07
Multimodal Baseline - GMU      52.92   85.07   92.23    97.26

Table 2: Baseline model results.
Most-Common Value             White         Black          Stainless Steel  Chrome        Gray
Text Baseline                 Gray (0.84)   Silver (0.63)  Grey (0.60)      White (0.60)  Beige (0.58)
Image Baseline                White (0.81)  Black (0.70)   Blue (0.63)      Gray (0.62)   Brown (0.59)
Multimodal Baseline - Concat  Gray (0.84)   Red (0.71)     Green (0.71)     Grey (0.71)   Blue (0.70)
Multimodal Baseline - GMU     Gray (0.85)   Blue (0.71)    Brown (0.69)     Green (0.68)  Red (0.67)

Table 3: Top-5 predictions (with similarity scores) on the data in Figure 1 when querying for color finish.

5 Related Work

Our work is related to, and builds upon, a number of existing approaches.

The introduction of large curated datasets has driven progress in many fields of machine learning. Notable examples include: The Penn Treebank [14] for syntactic parsing models, Imagenet [6] for object recognition, Flickr30k [24] and MS COCO [12] for image captioning, SQuAD [19] for question answering and VQA [2] for visual question answering. Despite the interest in related tasks, there is currently no publicly available dataset for attribute extraction, let alone multimodal attribute extraction. This creates a high barrier to entry as anyone interested in attribute extraction must go through the expensive and time-consuming process of acquiring a dataset. Furthermore, there is no way to compare the effectiveness of different techniques. Our dataset aims to address this concern.

Recently, there has been renewed interest in multimodal machine learning problems. Vinyals et al. [23] demonstrate an effective image captioning system that uses a CNN to encode an image, which is used as the input to an LSTM [8] decoder that produces the output caption. This encoder-decoder architecture forms the basis for successful approaches to other multimodal problems such as visual question answering [1]. Another body of work focuses on the problem of unifying information from different modes of information. Kiela and Bottou [9] propose to concatenate the output of a text-based distributional model (such as word2vec [15]) with an encoding produced from a CNN applied to images of the word. Lazaridou et al. [11] demonstrate an alternative to concatenation, where instead a word embedding is learned that minimizes a joint loss function involving context-prediction and image-reconstruction losses. Another alternative to concatenation is the gated multimodal unit (GMU) proposed in [3]. We investigate the performance of different techniques for combining image and text data for product attribute extraction in section 4.

To our knowledge, we are the first to study the problem of attribute extraction from multimodal data. However, the problem of attribute extraction from text is well studied. Ghani et al. [7] treat attribute extraction for retail products as a form of named entity recognition. They predefine a list of attributes to extract and train a Naïve Bayes model on a manually labeled seed dataset to extract the corresponding values. Putthividhya and Hu [18] build on this work by bootstrapping to expand the seed list, and evaluate more complicated models such as HMMs, MaxEnt, SVMs, and CRFs. To mitigate the introduction of noisy labels when using semi-supervised techniques, More [16] incorporates crowdsourcing to manually accept or reject the newly introduced labels. One major drawback of these approaches is that they require manually labeled seed data to construct the knowledge base of attribute-value pairs, which can be quite expensive for a large number of attributes. Bing et al. [4] address this problem by using an unsupervised, LDA-based approach to generate word classes from reviews, followed by aligning them to the product description. Shinzato and Sekine [20] propose to extract attribute-value pairs from structured data on product pages, such as HTML tables and lists, to construct the KB. This is essentially the approach used to construct the knowledge base of attribute-value pairs in our work, which is performed automatically by Diffbot's Product API.

6 Conclusions and Future Work

In order to kick start research on multimodal information extraction problems, we introduce the multimodal attribute extraction dataset, an attribute extraction dataset derived from a large number of e-commerce websites. MAE features images, textual descriptions, and attribute-value pairs for a diverse set of products. Preliminary data from an Amazon Mechanical Turk study demonstrates that both modes of information are beneficial to attribute extraction. We measure the performance of a collection of baseline models, and observe that reasonably high accuracy can be obtained using only text. However, we are unable to train off-the-shelf methods to effectively leverage image data.

There are a number of exciting avenues for future research. We are interested in performing a more comprehensive crowdsourcing study to identify the ways in which different evidence forms are useful, and in order to create clean evaluation data. As this dataset brings up interesting challenges in multimodal machine learning, we will explore a variety of novel architectures that are able to combine the different forms of evidence effectively to accurately extract the attribute values. Finally, we are also interested in exploring other aspects of knowledge base construction that may benefit from multimodal reasoning, such as relational prediction, entity linking, and disambiguation.


The authors are grateful to Diffbot for generously providing API access for the MAE dataset, as well as support for this research.