AI Challenger : A Large-scale Dataset for Going Deeper in Image Understanding

Significant progress has been achieved in Computer Vision by leveraging large-scale image datasets. However, large-scale datasets for complex Computer Vision tasks beyond classification are still limited. This paper proposes a large-scale dataset named AIC (AI Challenger) with three sub-datasets: human keypoint detection (HKD), large-scale attribute dataset (LAD) and image Chinese captioning (ICC). In this dataset, we annotate class labels (LAD), keypoint coordinates (HKD), bounding boxes (HKD and LAD), attributes (LAD) and captions (ICC). These rich annotations bridge the semantic gap between low-level images and high-level concepts. The proposed dataset is an effective benchmark to evaluate and improve different computational methods. In addition, for related tasks, others can also use our dataset as a new resource to pre-train their models.




1 Introduction

The recent progress achieved in Computer Vision tasks largely relies on deep neural networks [17, 43] and big data, such as ImageNet[23], MSCOCO[29], Scene Understanding (SUN)[54] and Flickr30k[59]. Most existing datasets focus on traditional (object or scene) classification and recognition tasks, and many images are annotated with only labels and bounding boxes. However, predicting the labels and bounding boxes of objects is far from a deep understanding of images. Datasets with rich annotations, such as human keypoints, attributes and captions, are a small fraction of existing datasets and are small in scale. For the human keypoint detection task, MSCOCO[29] and MPII[1] provide no more than 200k labelled images. The frequently used attribute datasets (CUB, SUN Attribute, aP/aY and AwA) contain only 72k images in total. Currently, Flickr8k-cn[27] provides 8k Chinese captions of images; however, because its annotation does not specify any rules on wording, some captions may not contain all salient objects and some may not express the relationships between objects, which results in a lack of information for training or evaluating methods.

The goal of this paper is to go deeper in image understanding by providing a dataset for three more comprehensive tasks, namely human keypoint detection, attribute based zero-shot recognition and image Chinese captioning (see Fig.1). These three tasks focus on concepts from the daily life of ordinary people. In the human keypoint detection task, we annotate and predict the keypoints of people, which is a fundamental step for capturing and understanding human activities. In the attribute based zero-shot recognition task, we are inspired by the human ability to learn new concepts from descriptions, and we annotate attributes of objects to enable zero-shot recognition. In the image Chinese captioning task, we try to understand the relations between objects in an image through captioning, and we annotate Chinese captions for scenes of people's daily life.

Figure 1: Examples of the proposed datasets.

To build such a dataset, we first design the scene and object categories. The raw images are then crawled from the Internet by querying label names in search engines. Next, these images are divided into three classes, namely "useless", "single-object" and "multi-object". We use only single-object images to build the attribute dataset for zero-shot recognition, while we use both single-object and multi-object images for keypoint detection and Chinese captioning. The whole dataset contains 300,000 images (annotated with keypoints for the main characters) for keypoint detection, 81,658 images (annotated with labels, bounding boxes and, partially, attributes) for zero-shot recognition, and 300,000 images, annotated with 5 Chinese captions per image, for Chinese captioning. We should emphasize that there is more than 95% overlap between the keypoint images and the captioning images. Hence, this is a good resource for investigating how to jointly deal with two different visual tasks.

There are three main contributions in this paper. 1) Our dataset provides a new benchmark to evaluate methods in the three tasks. 2) The large dataset is a new resource for pre-training models. 3) To the best of our knowledge, this is the first large-scale image Chinese captioning dataset.

2 Human Skeletal System Keypoint Detection

2.1 Overview

Human Skeletal System Keypoint Detection plays an important role in several computer vision tasks, such as pose estimation, activity recognition and abnormal action detection. Unfortunately, because the number, position and scale of human figures in an image are unknown, and because interactions and occlusions may occur between people, human keypoint detection is a challenging task.

Recent human keypoint detection approaches can be roughly divided into two categories[3]: top-down[19, 15, 38, 42] and bottom-up[3, 2, 7, 44]. The main idea of a top-down scenario is to divide and conquer, which treats the multi-person keypoint detection problem as a human detection followed by a single person keypoint detection. On the other hand, a bottom-up method directly extracts human keypoints from the image and clusters the results into different humans.

In the last few years, deep neural networks, especially Convolutional Neural Networks (CNNs), have been widely used to detect and localize human keypoints [33, 35, 45, 50]. To avoid over-fitting, such approaches require massive labelled data to train the deep neural networks. Existing datasets with human keypoint annotations, such as MSCOCO[29] and MPII[1], provide no more than two hundred thousand labelled images. Here we introduce the Human skeletal system Keypoint Detection Dataset (HKD), which contains 300,000 high-resolution images with multiple persons and various poses; each person is labeled with a bounding box and 14 human skeletal keypoints. The comparison between datasets is shown in Tab.1.

| Datasets | Images | Humans | Keypoints |
|---|---|---|---|
| MSCOCO[29] | 200k | 250k | 17 |
| MPII[1] | 25k | 40k | 13 |
| HKD(Ours) | 300k | 700k | 14 |

Table 1: The comparison of human keypoint datasets.

The rest of this section is organized as follows. We first describe how we collected and annotated the images. We then present some dataset statistics, followed by the evaluation metrics we designed for the task. Finally, we introduce the baseline model and conduct some experiments.

2.2 Data Annotation

The annotation pipeline for HKD data set can be divided into three major parts, which are image filtering, human bounding box labeling and human skeletal keypoints labeling.

Similar to the SCH and the ICC dataset, images in the HKD dataset are collected from Internet search engines, so the first step is to remove inappropriate images from the HKD dataset. These include, but are not limited to, images containing famous politicians, domestic police forces, sexual content, violence or other inappropriate actions. In addition, we eliminate images in which all human figures are too small (e.g. football players on the field photographed from the top of the stadium stand) and images that contain too many human figures (e.g. the crowd in the stadium stand).

The next step is to label human figures with bounding boxes. The bounding box should stay as close to the subject as possible while still containing all visible parts of the human figure. Note that not all humans in an image are labelled with a bounding box. We skip small human figures whose body parts are hard to distinguish, and vague ones whose body contours are hard to recognize, because we want the algorithm to focus on detecting the most significant human bodies instead of all the humans in the image.

The final and most important step is to label the locations and types of human skeletal keypoints for each human that received a bounding box in the previous annotation stage. For each human, we label 14 human skeletal keypoints, numbered as follows: 1-right shoulder, 2-right elbow, 3-right wrist, 4-left shoulder, 5-left elbow, 6-left wrist, 7-right hip, 8-right knee, 9-right ankle, 10-left hip, 11-left knee, 12-left ankle, 13-top of the head, and 14-neck. Each keypoint has one of three visibility flags: labeled and visible, labeled but not visible, or not labeled.
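The keypoint numbering and the three visibility states described above can be held in a small lookup structure. The following is only an illustrative sketch: the names `KEYPOINT_NAMES`, `Visibility` and `keypoint_name` are hypothetical, and the integer codes chosen for the visibility flags are an assumption, since the text does not state how the flags are encoded in the released annotation files.

```python
from enum import IntEnum

# Keypoint ids 1..14 in the order given in the text (list index = id - 1).
KEYPOINT_NAMES = [
    "right shoulder", "right elbow", "right wrist",
    "left shoulder", "left elbow", "left wrist",
    "right hip", "right knee", "right ankle",
    "left hip", "left knee", "left ankle",
    "top of the head", "neck",
]


class Visibility(IntEnum):
    """Three visibility states; the integer codes are assumed, not from the text."""
    VISIBLE = 1       # labeled and visible
    NOT_VISIBLE = 2   # labeled but not visible
    NOT_LABELED = 3   # not labeled


def keypoint_name(keypoint_id):
    """Return the name of a 1-based keypoint id."""
    if not 1 <= keypoint_id <= len(KEYPOINT_NAMES):
        raise ValueError("keypoint id must be in 1..14")
    return KEYPOINT_NAMES[keypoint_id - 1]
```

For example, `keypoint_name(14)` returns the neck keypoint, the last entry of the numbering.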

2.3 Data Statistics

Figure 2: The distribution of different type of keypoints.

We split the HKD dataset into training, validation, test A and test B with 70%, 10%, 10% and 10% ratio, which contain 210 000, 30 000, 30 000 and 30 000 images respectively. We only provide statistics on 210 000 training data.

For the 210 000 images in the training set, there are 378 374 human figures with almost 5 million keypoints. Among all the human keypoints we have labeled, 78.4% are labeled as visible and the rest are labeled as not visible. The distribution of the different types of keypoints is shown in Fig.2.

Inconsistency in human-annotated keypoint locations is inevitable. To measure the noise introduced by human annotators, we had 33 people label the same batch of 100 images. Specifically, we calculate the second central moment, which is the maximum likelihood estimate of the standard deviation of the Euclidean distance between each type of keypoint and its center. The human label deviation is shown in Fig.3(a), where the radius of each bright circle is the human label deviation of the corresponding keypoint type. We can see that the upper body is labeled more accurately and the hips are generally more difficult to annotate. The human label deviations of the different keypoint types are used in the evaluation metric to measure prediction difficulty, as introduced in the next subsection.
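One possible reading of the label deviation described above is the root-mean-square distance of the annotators' clicks from their centroid, computed for a single keypoint of a single person. The sketch below follows that reading; the function name `label_deviation` is hypothetical, and how the per-keypoint values are pooled over the 100 images to obtain one value per keypoint type is not specified in the text.

```python
import math


def label_deviation(points):
    """RMS distance of annotated (x, y) points from their centroid.

    `points` holds the positions that different annotators gave to the
    same keypoint. The centroid is the first moment; the returned value
    is the square root of the second central moment.
    """
    n = len(points)
    cx = sum(x for x, _ in points) / n
    cy = sum(y for _, y in points) / n
    mean_sq = sum((x - cx) ** 2 + (y - cy) ** 2 for x, y in points) / n
    return math.sqrt(mean_sq)
```

A larger value means the annotators disagreed more about where the keypoint is, which is why the text treats it as a measure of prediction difficulty.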

Figure 3: Human Label Deviation and Pose Diversity. (a) The radius of each bright circle is the human label deviation of the corresponding keypoint type, which represents the difficulty of prediction. (b) To demonstrate the diversity of human poses in the HKD dataset, 100 humans are randomly chosen and their limbs are drawn after alignment.

To demonstrate the diversity of human poses in the HKD dataset, 100 human annotations are randomly chosen from the training set. We align the keypoints by a linear transformation whose parameters are set so that 5 keypoints, right shoulder(1), left shoulder(4), right hip(7), left hip(10) and neck(14), have the same first moment (center) and second central moment (standard deviation). As shown in Fig.3(b), the most common poses are standing and sitting, while there are also quite a few other poses.
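The alignment described above can be sketched as a translation plus a uniform scaling that normalises the five torso keypoints. This is a minimal sketch under assumptions: the text only says "linear transformation", so restricting it to translation and isotropic scale (no rotation) is a simplification, and the names `ANCHOR_IDS` and `align_pose` are hypothetical.

```python
import math

# Ids of the five keypoints used for alignment, as listed in the text.
ANCHOR_IDS = (1, 4, 7, 10, 14)


def align_pose(keypoints):
    """Translate and scale a pose so its anchors have zero mean and unit spread.

    `keypoints` maps keypoint id -> (x, y) and must contain all anchor ids.
    Assumes the anchors do not all coincide (spread > 0).
    """
    anchors = [keypoints[i] for i in ANCHOR_IDS]
    n = len(anchors)
    cx = sum(x for x, _ in anchors) / n
    cy = sum(y for _, y in anchors) / n
    spread = math.sqrt(
        sum((x - cx) ** 2 + (y - cy) ** 2 for x, y in anchors) / n
    )
    return {
        k: ((x - cx) / spread, (y - cy) / spread)
        for k, (x, y) in keypoints.items()
    }
```

After this normalisation, poses from people of different sizes and image positions can be overlaid in one plot.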

2.4 Evaluation Metrics

The evaluation metric for human skeleton keypoint detection is similar to that of common object detection tasks: a submission is scored by mean Average Precision (mAP). In common object detection tasks, Intersection over Union (IoU) is used to evaluate the similarity between a predicted bounding box and a ground truth bounding box. In the human skeletal system keypoint detection task, we instead use the Object Keypoint Similarity (OKS) proposed in [29] to measure the similarity between the predicted keypoints and the ground truth keypoints.

The mAP score is the mean value of the Average Precision (AP) scores under different OKS thresholds (0.50:0.05:0.95). The AP score is calculated in the same way as in common object detection, but with OKS instead of IoU as the similarity metric. Given an OKS threshold s, the AP at that threshold (AP@s) is computed from the results predicted by the participants over the entire test set.
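The averaging over OKS thresholds can be sketched as follows. The threshold grid 0.50:0.05:0.95 is taken from the text; the function names and the callable `ap_at`, which stands in for whatever routine computes AP at a single threshold, are hypothetical.

```python
def oks_thresholds():
    """The ten OKS thresholds 0.50, 0.55, ..., 0.95."""
    return [round(0.50 + 0.05 * i, 2) for i in range(10)]


def mean_average_precision(ap_at):
    """Average AP over all OKS thresholds.

    `ap_at` is a placeholder callable mapping a threshold s to AP@s.
    """
    thresholds = oks_thresholds()
    return sum(ap_at(s) for s in thresholds) / len(thresholds)
```

Averaging over increasingly strict thresholds rewards methods that localise keypoints precisely rather than only roughly.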

The OKS score plays the same role as the IoU score in common object detection: it measures the similarity between the prediction and the ground truth. The main idea of OKS is a weighted Euclidean distance between the predicted keypoints and the ground truth keypoints. For each human figure $p$, the OKS score is defined as follows:

$$\mathrm{OKS}_p = \frac{\sum_i \exp\left(-d_{pi}^2 / (2 s_p^2 \sigma_i^2)\right)\,\delta(v_{pi}=1)}{\sum_i \delta(v_{pi}=1)}$$

where $p$ is the index of human annotations; $i$ is the id number of the given human skeleton keypoint; $d_{pi}$ is the Euclidean distance between the predicted keypoint position and the ground truth; $s_p$ is the scale factor of human figure $p$, which is defined as the square root of the bounding box area of human figure $p$; $\sigma_i$ is the normalization factor of the $i$-th human skeletal keypoint, which is calculated from the standard deviation of the human annotation results; $v_{pi}$ is the visibility flag of the $i$-th keypoint of human figure $p$; and $\delta$ is the Kronecker function, which means that only visible human skeletal keypoints ($v_{pi}=1$) are considered during evaluation.
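The OKS computation for one human figure can be sketched directly from the definitions above, assuming the COCO-style form in which each visible keypoint contributes exp(-d^2 / (2 s^2 sigma^2)) and the contributions are averaged over visible keypoints. The function name `oks` and its argument layout are hypothetical; since the scale factor is the square root of the bounding box area, its square is the area itself.

```python
import math


def oks(pred, gt, visible, bbox_area, sigmas):
    """Object Keypoint Similarity for a single human figure.

    pred, gt  : lists of (x, y) positions, one per keypoint type
    visible   : list of booleans, True where the keypoint is visible
    bbox_area : area of the human bounding box (equals the squared scale)
    sigmas    : per-keypoint-type normalization factors
    """
    total = 0.0
    count = 0
    for (px, py), (gx, gy), vis, sigma in zip(pred, gt, visible, sigmas):
        if not vis:
            continue  # only visible keypoints are evaluated
        d2 = (px - gx) ** 2 + (py - gy) ** 2
        total += math.exp(-d2 / (2.0 * bbox_area * sigma ** 2))
        count += 1
    return total / count if count else 0.0
```

A perfect prediction gives an OKS of 1, and the score decays towards 0 as predicted keypoints move away from the ground truth, more slowly for large figures and for keypoint types with a large normalization factor.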

An evaluation script will be released soon to facilitate offline evaluation.

2.5 Baseline Model and Experiments

We provide a basic approach to detect human skeletal keypoints in natural images as the baseline model of the HKD dataset. The most straightforward way is to adopt a top-down method, in which we first detect the humans in the image and then apply a single person keypoint detection method. The baseline model consists of three major parts: a human detector, a keypoint detector and a post-processing procedure.

For the human detector, we choose the Single Shot MultiBox Detector (SSD)[30] pre-trained on Pascal VOC[10] and Mask R-CNN[16] pre-trained on MSCOCO[29]. Since person is one of the defined classes in both datasets, we are able to apply the pre-trained models to our images without retraining them. For SSD we use the output human bounding boxes, and for Mask R-CNN we use the output human masks.

We treat single person keypoint detection as a semantic segmentation problem by generating the ground truth masks where pixels in a small region near the keypoints are set as the corresponding keypoint classes and others are set as background class. Then we trained a DeepLab v2[5] model to learn this semantic segmentation representation.
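The ground truth masks described above can be sketched as follows: every pixel within a small disc around a keypoint receives that keypoint's class id, and all other pixels stay background. This is only an illustration; the function name `keypoint_mask`, the disc shape and the default `radius` are assumptions, since the text does not state the size or shape of the region.

```python
def keypoint_mask(height, width, keypoints, radius=2):
    """Build a per-pixel class mask from keypoint annotations.

    keypoints : list of (class_id, x, y) with integer pixel coordinates;
                class id 0 is reserved for the background.
    radius    : assumed size of the region around each keypoint.
    """
    mask = [[0] * width for _ in range(height)]
    for class_id, kx, ky in keypoints:
        for y in range(max(0, ky - radius), min(height, ky + radius + 1)):
            for x in range(max(0, kx - radius), min(width, kx + radius + 1)):
                if (x - kx) ** 2 + (y - ky) ** 2 <= radius ** 2:
                    mask[y][x] = class_id
    return mask
```

With this encoding, a standard semantic segmentation network can be trained on the masks without any keypoint-specific output layer.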

During inference, we crop the human bounding box generated by the detector and apply the DeepLab model to generate a pixel-wise saliency map of keypoints. If there is more than one region of the same keypoint type in the saliency map, we keep only the region with the largest area and eliminate the rest. Finally, we take the centroid of each remaining keypoint region in the saliency map as the detected keypoint.
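The post-processing step above (keep the largest region of each keypoint type, then take its centroid) can be sketched with a breadth-first search over a per-pixel label map. The function name `largest_region_centroid` is hypothetical, and the use of 4-connectivity to define a region is an assumption; the text does not say how regions are delimited.

```python
from collections import deque


def largest_region_centroid(label_map, class_id):
    """Centroid (x, y) of the largest connected region of `class_id`.

    label_map : 2D list of per-pixel class ids (rows of equal length).
    Returns None when the class does not occur in the map.
    """
    height, width = len(label_map), len(label_map[0])
    seen = [[False] * width for _ in range(height)]
    best = []
    for sy in range(height):
        for sx in range(width):
            if seen[sy][sx] or label_map[sy][sx] != class_id:
                continue
            # Flood-fill one region starting from this unvisited pixel.
            region = []
            queue = deque([(sx, sy)])
            seen[sy][sx] = True
            while queue:
                x, y = queue.popleft()
                region.append((x, y))
                for nx, ny in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
                    if (0 <= nx < width and 0 <= ny < height
                            and not seen[ny][nx]
                            and label_map[ny][nx] == class_id):
                        seen[ny][nx] = True
                        queue.append((nx, ny))
            if len(region) > len(best):
                best = region
    if not best:
        return None
    cx = sum(x for x, _ in best) / len(best)
    cy = sum(y for _, y in best) / len(best)
    return (cx, cy)
```

Calling this once per keypoint class on the cropped prediction yields at most one keypoint of each type per person.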

We conducted the experiments by training the baseline model using only images from the HKD training set. The quantitative results on the HKD validation set are shown in Tab.2.

| Algorithms | mAP-12 | mAP-14 |
|---|---|---|
| Baseline(bbox) | 0.228 | 0.234 |
| Baseline(mask) | 0.226 | 0.233 |
| OpenPose[3] | 0.296 | - |

Table 2: The mAP score on the HKD validation set. The mAP-12 score is evaluated on the 12 keypoints identical to MSCOCO, and the mAP-14 score is evaluated on all 14 keypoints in the HKD dataset.

As we can see, OpenPose[3], the winner of the MSCOCO 2016 keypoint competition[29], scores only 0.296 mAP-12 on the HKD dataset. Our basic top-down baseline scores 0.228 mAP-12 and 0.234 mAP-14.

3 Attribute based Zero-shot Recognition

3.1 Overview

Human beings can learn a new concept from descriptions without seeing it. Zero-shot recognition (ZSR), which aims to recognize objects from novel unseen classes, is a promising approach to realize large-scale object recognition. Significant progress[53, 14] has been achieved in zero-shot recognition. In most practices, ZSR is implemented by transferring knowledge from seen to unseen classes via auxiliary knowledge, e.g. attributes[26], word vectors[32] and gaze embeddings[20]. Compared to other types of auxiliary knowledge, attributes have good discrimination and interpretability. Many state-of-the-art ZSR results[22, 57, 62, 63, 52] have been achieved based on attributes.

| | LAD | CUB | SUN | aP/aY | AwA | ImageNet_A |
|---|---|---|---|---|---|---|
| Images | 81,658 | 11,788 | 14,340 | 15,339 | 30,475 | 384,000* |
| Classes | 240 | 200 | 717 | 32 | 50 | 384 |
| Bounding Box | Yes | Yes | No | Yes | No | Yes |
| Attributes | 359 | 312 | 102 | 64 | 85 | 25 |
| Annotation Level | 20 ins./class | instance | instance | instance | class | 25 ins./class |

Table 3: Statistics and comparison of different datasets. * means the estimated number.

However, only a small number of image datasets are annotated with attributes. The frequently used ones include Caltech-UCSD Birds-200-2011 (CUB)[49], SUN Attributes (SUN)[54], aPascal/aYahoo (aP/aY)[12], Animals with Attributes (AwA)[12] and ImageNet_A[41] (see Tab.3). The authors of [41] provide attributes for 384 popular synsets in ImageNet; in this section, we use "ImageNet_A" to refer to this subset of ImageNet. Existing attribute datasets have four major limitations: 1) Small image numbers. The sum of images in the CUB, aP/aY, SUN and AwA datasets is only 72k. This is a small number compared to many object recognition datasets, e.g. ImageNet[23], MSCOCO[29] and LSUN[60]. 2) Lack of semantic attributes. Only low-level visual attributes (e.g. color, size, shape, texture) are annotated in CUB and ImageNet_A. 3) Close to ImageNet. The categories in some datasets, e.g. AwA and aP/aY, have a large overlap with ImageNet. 4) Serious distribution bias. For instance, 30% of the classes in AwA have more than 10% of their images showing the object together with a "person". Such distribution bias may cause inaccurate learning of some objects. These limitations hinder the evaluation and improvement of ZSR methods.

Figure 4: Examples in our dataset. We annotate both visual attributes (the upper two) and semantic attributes (the bottom two).

We present a Large-scale Attribute Dataset (LAD) with rich semantic attributes (shown in Fig.4) to promote the development of zero-shot recognition and other attribute-based tasks [46, 31, 56, 51, 61]. Our dataset contains 81,658 images, 240 classes and 359 attributes. Beyond low-level visual attributes (e.g. colors, sizes, shapes), we also provide many semantic attributes. For example, we annotate attributes of diets and habits for animals, edibility and medicinal property for fruits, safety and usage scenarios for vehicles, functions and usage mode for electronics.

3.2 Data Annotation & Statistics

To construct the attribute based zero-shot recognition dataset, we first define the label list of all classes. Specifically, our dataset includes 240 classes. These classes can be divided into 5 subsets, namely animals, fruits, vehicles, electronics and hairstyles. Each of the first four coarse-grained subsets contains 50 classes, while the last, fine-grained hairstyle subset contains 40 classes.

We crawl images for each class based on the search of the label and synonyms. Then, we filter these images and keep those images with only one foreground object matching the label. We also annotate the bounding box for every foreground object.

As our dataset includes 240 classes, a single attribute list shared by all classes would be impractical. Hence, we design a separate attribute list for each subset. Specifically, we design 123, 58, 81, 75 and 22 attributes (359 in total) for animals, fruits, vehicles, electronics and hairstyles respectively. Beyond low-level visual attributes (e.g. colors, shapes, sizes), we provide many semantic attributes (e.g. habits of animals, functions of electronics, feelings about hairstyles). These semantic attributes matter to humans but have not been well investigated in previous vision tasks.

Tab.3 shows the statistics of image and annotation numbers of our dataset and others. Clearly, our dataset has the largest number of attributes. Our dataset has 81,658 images which is greater than the sum of CUB, SUN, aP/aY and AwA. Fig.5 illustrates the distribution of image numbers per class. Most classes in our dataset have around 350 images, which is greater than aP/aY dataset (around 250 images).

(a) Statistics of image numbers per class.
(b) Statistics of class and attribute numbers.
Figure 5: Statistics of image, class and attribute numbers.

3.3 Data Split

We present a set of splits of seen/unseen classes for zero-shot recognition. We follow the traditional 80%/20% split ratio of seen/unseen classes. We shuffle the classes in each subset and then divide all 240 classes into 5 folds. Each fold, i.e. 20% of the classes, is chosen in turn as the unseen classes, and the rest are seen classes. In this way, we obtain 5 random splits. For each subset, the ratio of seen/unseen classes is the same. We advocate evaluating methods on all 5 splits and reporting the mean accuracy.
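The split procedure above can be sketched as a per-subset 5-fold partition, so that every split keeps the same seen/unseen ratio in every subset. The function name `zero_shot_splits`, the class names in the usage note and the fixed random seed are hypothetical, and the sketch assumes each subset size is divisible by the number of folds (true for subsets of 50 and 40 classes).

```python
import random


def zero_shot_splits(classes_by_subset, n_folds=5, seed=0):
    """Return a list of (seen, unseen) class lists, one pair per fold.

    classes_by_subset : dict mapping subset name -> list of class names.
    """
    rng = random.Random(seed)
    shuffled = {}
    for name, classes in classes_by_subset.items():
        classes = list(classes)
        rng.shuffle(classes)  # shuffle within each subset
        shuffled[name] = classes

    splits = []
    for fold in range(n_folds):
        seen, unseen = [], []
        for classes in shuffled.values():
            size = len(classes) // n_folds
            start = fold * size
            unseen.extend(classes[start:start + size])
            seen.extend(classes[:start] + classes[start + size:])
        splits.append((seen, unseen))
    return splits
```

With four subsets of 50 classes and one of 40, each split holds out 10 classes per coarse-grained subset and 8 hairstyle classes, giving 48 unseen and 192 seen classes.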

For supervised learning of attributes and labels, we provide a split of training/testing data. We randomly select 70% of the data from each class as training data, and the remaining 30% are testing data. Validation data can be extracted from the training data in experiments.

3.4 Experiments

Baseline Methods. We run zero-shot recognition experiments on our dataset using three basic methods, namely SOC[36], ESZSL[40] and MDP[64], which belong to three popular frameworks. First, images and labels are embedded into the image feature space (using ResNet pre-trained on ImageNet) and the semantic embedding space (using annotated attributes). SOC learns a linear mapping function from the image feature space to the semantic embedding space using seen class data. Unseen instances are then mapped to the semantic embedding space using the learned mapping function, and are classified by a nearest-neighbour classifier based on their distances to the ground-truth unseen semantic embeddings. ESZSL learns a mapping that measures the compatibility between image features and semantic embeddings. MDP aims to learn the local structure of semantic embeddings. The structure is then transferred to the image feature space to synthesize unseen image data, and labels of unseen test images are predicted according to their distances to these synthesized data.
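The prediction step of the SOC-style pipeline described above (map an image feature into the attribute space, then pick the nearest unseen class) can be sketched as follows. This is not the authors' implementation: the function names are hypothetical, the mapping matrix is assumed to have been learned beforehand on seen classes, and Euclidean distance is an assumed choice of metric.

```python
import math


def project(weights, feature):
    """Apply a linear map: one output value per row of `weights`."""
    return [sum(w * f for w, f in zip(row, feature)) for row in weights]


def nearest_class(embedding, class_attributes):
    """Name of the class whose attribute vector is closest to `embedding`.

    class_attributes : dict mapping unseen class name -> attribute vector.
    """
    return min(
        class_attributes,
        key=lambda name: math.dist(embedding, class_attributes[name]),
    )
```

For instance, with illustrative attribute vectors `{"zebra": [1, 0, 1], "whale": [0, 1, 1]}`, an image whose projected embedding is close to `[1, 0, 1]` is labelled as the first class even though no image of that class was seen during training.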

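| Split | SOC | ESZSL | MDP |
|---|---|---|---|
| 1 | 31.05 | 42.17 | 46.96 |
| 2 | 31.27 | 46.82 | 46.83 |
| 3 | 34.64 | 42.49 | 51.41 |
| 4 | 34.21 | 41.96 | 48.61 |
| 5 | 36.57 | 43.72 | 49.08 |
| Ave | 33.55 | 43.43 | 48.58 |

Table 4: Comparison of zero-shot recognition methods on our dataset.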

Experimental Results. Experimental results are shown in Tab.4. The zero-shot recognition accuracies on the five splits are balanced. MDP achieves the best performance, with an average accuracy of 48.58%. The runner-up is ESZSL, whose average recognition accuracy is 43.43%. The average recognition accuracy of SOC is 33.55%, which is around 15 percentage points lower than MDP.

4 Image Captioning for Chinese

4.1 Overview

Image captioning has long been a challenging problem in computer vision and natural language processing. A good image model must capture not only the primary objects contained in an image, but also the relationships between objects, their attributes, and the activities they are involved in. Moreover, the image captioning task requires this semantic knowledge to be organized and conveyed in a textual description, and therefore a language model is also needed.

Early approaches to this problem can be roughly divided into two types: template-based methods[11, 24, 25, 55] and retrieval-based approaches[9, 13]. Template-based methods start by detecting objects, actions, scenes and attributes in images and then combine them using language models. Retrieval-based approaches retrieve visually similar images from a large database and then transfer the captions of the retrieved images to fit the query image.

Recently, the encoder-decoder framework[21, 48, 58, 4] and the reinforcement learning framework have been introduced to image captioning. Researchers adopted the encoder-decoder framework because translating an image into a sentence is analogous to machine translation. Approaches following this framework generally encode an image as a single feature vector using convolutional neural networks, and then feed this vector into recurrent neural networks to generate captions. The reinforcement learning framework is based on decision-making and utilizes a "policy network" and a "value network" to collaboratively generate captions.

Although much of this progress has been made possible by the availability of image caption datasets such as Pascal VOC 2008[13], Flickr8k[18], Flickr30k[59], MSCOCO[29] and SBU[34], the captions in existing datasets are all in English. Flickr8k, Flickr30k and MSCOCO contain 8,000, 31,000 and 300,000 images respectively, and each image is annotated with 5 English sentences. To promote progress in this area, we created the image Chinese captioning (ICC) dataset (see Fig.6). To our knowledge, the ICC dataset is the largest image captioning dataset whose sentences are labeled in Chinese.

Figure 6: Examples of ICC training dataset.

The rest of this section is organized as follows. Firstly, we describe the process of collecting Chinese captions for the ICC dataset. Secondly, we analyze the properties of the ICC dataset. Thirdly, we introduce a baseline for the ICC dataset. Finally, we perform experiments to assess the effectiveness of the baseline model using several metrics.

4.2 Dataset Statistics

We analyze the properties of the ICC dataset in comparison to several other popular datasets. The statistics of the datasets are shown in Tab.5.

Figure 7: Examples of different scenes.

4.3 Data Annotation

The pipeline to gather data for the ICC dataset can be divided into two major parts: image selection (similar to HKD) and Chinese caption labeling.

In addition to the HKD image selection principles, we add two more rules. Firstly, the image should be easy to describe using only one sentence. Secondly, the image should contain multiple objects in a complex scene. For example, an image that contains only one person standing with no other significant pose is less likely to be selected than one with two people hugging.

The ICC dataset contains five reference captions for every image, which were written in Chinese by 5 different native speakers in China. Each of our captions is generated by human subjects, following a procedure similar to the one in [6].

Three principles guide image caption annotation. Firstly, the annotations should include, but are not limited to, key objects/attributes, locations and human actions. Secondly, the sentences should be fluent. Thirdly, the use of Chinese idioms or descriptive adjectives is encouraged.

| dataset name | train | valid | test-1 | test-2 | language |
|---|---|---|---|---|---|
| Pascal VOC[13] | - | - | 1K | - | English |
| Flickr8k[18] | 6K | 1K | 1K | - | English |
| Flickr30k[59] | 28K | 1K | 1K | - | English |
| MSCOCO[29] | 82K | 40K | 40K | - | English |
| SBU[34] | 1M | - | - | - | English |
| ICC | 210K | 30K | 30K | 30K | Chinese |

Table 5: Statistics and comparison of different datasets.

The number of captions is 1,050,000 captions for 210,000 images in training, 150,000 captions for 30,000 images in validation, 150,000 captions for 30,000 images in testing-1 and 150,000 captions for 30,000 images in testing-2. ICC is the largest dataset whose captions are in Chinese and it is the first to provide two different test datasets which can better evaluate if the algorithm is overfitting.

The ICC training data covers more than 200 scenes and places, such as "football field" and "grassland" (see Fig.7), and 150 actions, such as "sing" and "run". The ICC dataset contains most of the common daily scenes in which a person usually appears.

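| Model | BLEU-1 | BLEU-2 | BLEU-3 | BLEU-4 | CIDEr | METEOR | ROUGE-L |
|---|---|---|---|---|---|---|---|
| Baseline | 0.765 | 0.648 | 0.547 | 0.461 | 1.425 | 0.370 | 0.633 |

Table 6: Scores of caption baseline for ICC testing-1.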

4.4 Baseline Model

We adapt ”show and tell”, a popular encoder-decoder model for image captioning[48], as our base model. One difference is that we use the ”Jieba” Chinese word segmentation module during preprocessing, instead of the English tokenization module used in ”show and tell”.

This model directly maximizes the probability of the correct description given the image by using the following formulation:

$$\theta^{*} = \arg\max_{\theta} \sum_{(I, S)} \log p(S \mid I; \theta)$$

where $\theta$ are the parameters of our model, $I$ is an image, and $S$ is a correct transcription.
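In the "show and tell" formulation[48], the probability of a caption is factorised over its words with the chain rule, so the log-probability of a whole caption is the sum of the log-probabilities of each word given the image and the preceding words. The sketch below illustrates only that summation; the function name `caption_log_likelihood` is hypothetical, and the per-step probabilities are assumed to come from the decoder's output distribution at each step.

```python
import math


def caption_log_likelihood(step_probs):
    """Log-probability of a caption under the chain rule.

    step_probs : probabilities the model assigned to each correct word,
                 each conditioned on the image and the previous words.
    """
    return sum(math.log(p) for p in step_probs)
```

Training maximises the sum of this quantity over all image-caption pairs, which is equivalent to minimising the summed per-word negative log-likelihood.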

4.5 Experimental Results

To quantitatively evaluate how well the base model learns to generate Chinese captions, experiments were conducted on the ICC testing-1 dataset which contains 30,000 images. All the reported results are computed on the metrics BLEU[37], METEOR[8], ROUGE[28] and CIDEr[47] respectively, which are commonly used together for fair and thorough performance measurement.

In Tab.6, we provide a result summary of our baseline model. We achieve reasonable performance on ICC in most evaluation metrics.

In Fig.8, captions from the MSCOCO and ICC datasets are shown. For both datasets, the first 5 captions are written by humans. The sixth caption is generated by the baseline model trained on the MSCOCO dataset, and the seventh caption is generated by the same model trained on the ICC dataset. The model trained on the ICC dataset performs better than the one trained on the MSCOCO dataset in most cases. For example, the seventh caption of the first image, which translates as "Beside two people next to a car on the road, there is a man wearing a white shirt getting off the car", and the seventh caption of the second image, which translates as "In the room there are a man holding a guitar with two hands and a woman holding a microphone with her right hand", both provide many more descriptive details than the captions generated by the model trained on MSCOCO. The results suggest that the captions in the ICC dataset provide more context information.

Figure 8: Results of baseline model for ICC and MSCOCO.

5 Conclusion

In this paper, we propose a new dataset with rich annotations for training and evaluating methods. Utilizing over 185,000 worker hours, a vast collection of images was collected, annotated and organized to provide three new, large-volume datasets for human keypoint detection, attribute based zero-shot recognition and image Chinese captioning. The overlap among these three datasets is significant, and people can cross-reference low-level image annotations, such as class labels, with high-level annotations, such as captions. This provides a good benchmark for evaluating and improving methods for these three tasks, and for other possible tasks that correlate different levels of information. We also provide basic statistics and baseline models on our dataset to demonstrate its validity and offer first insights.

There are several promising directions for future annotations on our dataset. For example, currently the human keypoint dataset only includes skeletal keypoints of human figures, but annotating ”expression” or ”action” may provide more information that can be useful for even higher-level visual tasks, such as pose estimation. Moreover, we currently only collect images containing human beings for image Chinese captioning dataset, but collecting other classes may provide better relationship between objects.

To download and learn more about the AIC dataset, please refer to the project website. Some code is released online.


  • [1] M. Andriluka, L. Pishchulin, P. Gehler, and B. Schiele. 2d human pose estimation: New benchmark and state of the art analysis. In Computer Vision and Pattern Recognition, pages 3686–3693, 2014.
  • [2] A. Bulat and G. Tzimiropoulos. Human Pose Estimation via Convolutional Part Heatmap Regression. Springer International Publishing, 2016.
  • [3] Z. Cao, T. Simon, S. Wei, and Y. Sheikh. Realtime multi-person 2d pose estimation using part affinity fields. CoRR, abs/1611.08050, 2016.
  • [4] L. Chen, H. Zhang, J. Xiao, L. Nie, J. Shao, W. Liu, and T.-S. Chua. Sca-cnn: Spatial and channel-wise attention in convolutional networks for image captioning. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017.
  • [5] L. C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE Transactions on Pattern Analysis and Machine Intelligence, PP(99):1–1, 2016.
  • [6] X. Chen, H. Fang, T.-Y. Lin, R. Vedantam, S. Gupta, P. Dollár, and C. L. Zitnick. Microsoft coco captions: Data collection and evaluation server. arXiv preprint arXiv:1504.00325, 2015.
  • [7] X. Chen and A. L. Yuille. Articulated pose estimation by a graphical model with image dependent pairwise relations. CoRR, abs/1407.3399, 2014.
  • [8] M. Denkowski and A. Lavie. Meteor universal: Language specific translation evaluation for any target language. In Proceedings of the ninth workshop on statistical machine translation, pages 376–380, 2014.
  • [9] J. Devlin, H. Cheng, H. Fang, S. Gupta, L. Deng, X. He, G. Zweig, and M. Mitchell. Language models for image captioning: The quirks and what works. arXiv preprint arXiv:1505.01809, 2015.
  • [10] M. Everingham, L. Gool, C. K. Williams, J. Winn, and A. Zisserman. The pascal visual object classes (voc) challenge. International Journal of Computer Vision, 88(2):303–338, 2010.
  • [11] H. Fang, S. Gupta, F. Iandola, R. K. Srivastava, L. Deng, P. Dollar, J. Gao, X. He, M. Mitchell, J. C. Platt, C. Lawrence Zitnick, and G. Zweig. From captions to visual concepts and back. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2015.
  • [12] A. Farhadi, I. Endres, D. Hoiem, and D. Forsyth. Describing objects by their attributes. In CVPR, pages 1778–1785, 2009.
  • [13] A. Farhadi, M. Hejrati, M. A. Sadeghi, P. Young, C. Rashtchian, J. Hockenmaier, and D. Forsyth. Every picture tells a story: Generating sentences from images. In European conference on computer vision, pages 15–29. Springer, 2010.
  • [14] Y. Fu, T. Xiang, Y.-G. Jiang, X. Xue, L. Sigal, and S. Gong. Recent advances in zero-shot recognition. arXiv preprint arXiv:1710.04837, 2017.
  • [15] G. Gkioxari, B. Hariharan, R. Girshick, and J. Malik. Using k-poselets for detecting people and localizing their keypoints. In IEEE Conference on Computer Vision and Pattern Recognition, pages 3582–3589, 2014.
  • [16] K. He, G. Gkioxari, P. Dollár, and R. B. Girshick. Mask R-CNN. CoRR, abs/1703.06870, 2017.
  • [17] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
  • [18] M. Hodosh, P. Young, and J. Hockenmaier. Framing image description as a ranking task: Data, models and evaluation metrics. In Proceedings of the 24th International Conference on Artificial Intelligence, IJCAI’15, pages 4188–4192. AAAI Press, 2015.
  • [19] U. Iqbal and J. Gall. Multi-person pose estimation with local joint-to-person associations. In European Conference on Computer Vision, pages 627–642, 2016.
  • [20] N. Karessli, Z. Akata, B. Schiele, and A. Bulling. Gaze embeddings for zero-shot image classification. In 30th IEEE Conference on Computer Vision and Pattern Recognition, 2017.
  • [21] A. Karpathy and L. Fei-Fei. Deep visual-semantic alignments for generating image descriptions. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2015.
  • [22] E. Kodirov, T. Xiang, and S. Gong. Semantic autoencoder for zero-shot learning. CVPR, 2017.
  • [23] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 25, pages 1097–1105. Curran Associates, Inc., 2012.
  • [24] G. Kulkarni, V. Premraj, V. Ordonez, S. Dhar, S. Li, Y. Choi, A. C. Berg, and T. L. Berg. Babytalk: Understanding and generating simple image descriptions. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(12):2891–2903, 2013.
  • [25] P. Kuznetsova, V. Ordonez, A. C. Berg, T. L. Berg, and Y. Choi. Collective generation of natural image descriptions. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Long Papers-Volume 1, pages 359–368. Association for Computational Linguistics, 2012.
  • [26] C. H. Lampert, H. Nickisch, and S. Harmeling. Attribute-based classification for zero-shot visual object categorization. TPAMI, 36(3):453–465, 2014.
  • [27] X. Li, W. Lan, J. Dong, and H. Liu. Adding chinese captions to images. In Proceedings of the 2016 ACM on International Conference on Multimedia Retrieval, pages 271–275. ACM, 2016.
  • [28] C.-Y. Lin. Rouge: A package for automatic evaluation of summaries. In Text summarization branches out: Proceedings of the ACL-04 workshop, volume 8. Barcelona, Spain, 2004.
  • [29] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft coco: Common objects in context. In European conference on computer vision, pages 740–755. Springer, 2014.
  • [30] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. E. Reed, C. Fu, and A. C. Berg. SSD: single shot multibox detector. CoRR, abs/1512.02325, 2015.
  • [31] W.-Y. Ma and B. S. Manjunath. Edgeflow: a technique for boundary detection and image segmentation. IEEE transactions on image processing, 9(8):1375–1388, 2000.
  • [32] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean. Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems, pages 3111–3119, 2013.
  • [33] A. Newell, K. Yang, and J. Deng. Stacked Hourglass Networks for Human Pose Estimation. Springer International Publishing, 2016.
  • [34] V. Ordonez, G. Kulkarni, and T. L. Berg. Im2text: Describing images using 1 million captioned photographs. In Advances in Neural Information Processing Systems, pages 1143–1151, 2011.
  • [35] W. Ouyang, X. Chu, and X. Wang. Multi-source deep learning for human pose estimation. In Computer Vision and Pattern Recognition, pages 2337–2344, 2014.
  • [36] M. Palatucci, D. Pomerleau, G. E. Hinton, and T. M. Mitchell. Zero-shot learning with semantic output codes. In NIPS, pages 1410–1418, 2009.
  • [37] K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting on association for computational linguistics, pages 311–318. Association for Computational Linguistics, 2002.
  • [38] L. Pishchulin, A. Jain, M. Andriluka, and T. Thormahlen. Articulated people detection and pose estimation: Reshaping the future. In Computer Vision and Pattern Recognition, pages 3178–3185, 2012.
  • [39] Z. Ren, X. Wang, N. Zhang, X. Lv, and L.-J. Li. Deep reinforcement learning-based image captioning with embedding reward. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017.
  • [40] B. Romera-Paredes and P. Torr. An embarrassingly simple approach to zero-shot learning. In ICML, 2015.
  • [41] O. Russakovsky and F.-F. Li. Attribute learning in large-scale datasets. In ECCV Workshops (1), volume 6553, pages 1–14, 2010.
  • [42] M. Sun and S. Savarese. Articulated part-based model for joint object detection and pose estimation. In IEEE International Conference on Computer Vision, ICCV 2011, Barcelona, Spain, November, pages 723–730, 2011.
  • [43] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In CVPR, pages 1–9, 2015.
  • [44] J. Tompson, A. Jain, Y. Lecun, and C. Bregler. Joint training of a convolutional network and a graphical model for human pose estimation. In NIPS, 2014.
  • [45] A. Toshev and C. Szegedy. Deeppose: Human pose estimation via deep neural networks. In Computer Vision and Pattern Recognition, pages 1653–1660, 2014.
  • [46] D. A. Vaquero, R. S. Feris, D. Tran, L. Brown, A. Hampapur, and M. Turk. Attribute-based people search in surveillance environments. In Applications of Computer Vision (WACV), 2009 Workshop on, pages 1–8. IEEE, 2009.
  • [47] R. Vedantam, C. Lawrence Zitnick, and D. Parikh. Cider: Consensus-based image description evaluation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4566–4575, 2015.
  • [48] O. Vinyals, A. Toshev, S. Bengio, and D. Erhan. Show and tell: A neural image caption generator. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2015.
  • [49] C. Wah, S. Branson, P. Welinder, P. Perona, and S. Belongie. The Caltech-UCSD Birds-200-2011 Dataset. Technical report, California Institute of Technology, 2011.
  • [50] S. E. Wei, V. Ramakrishna, T. Kanade, and Y. Sheikh. Convolutional pose machines. In Computer Vision and Pattern Recognition, pages 4724–4732, 2016.
  • [51] Q. Wu, C. Shen, P. Wang, A. Dick, and A. van den Hengel. Image captioning and visual question answering based on attributes and external knowledge. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017.
  • [52] Y. Xian, Z. Akata, G. Sharma, Q. Nguyen, M. Hein, and B. Schiele. Latent embeddings for zero-shot classification. In CVPR, pages 69–77, 2016.
  • [53] Y. Xian, C. H. Lampert, B. Schiele, and Z. Akata. Zero-shot learning-a comprehensive evaluation of the good, the bad and the ugly. arXiv preprint arXiv:1707.00600, 2017.
  • [54] J. Xiao, J. Hays, K. A. Ehinger, A. Oliva, and A. Torralba. Sun database: Large-scale scene recognition from abbey to zoo. In Computer vision and pattern recognition (CVPR), 2010 IEEE conference on, pages 3485–3492. IEEE, 2010.
  • [55] Y. Yang, C. L. Teo, H. Daumé III, and Y. Aloimonos. Corpus-guided sentence generation of natural images. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 444–454. Association for Computational Linguistics, 2011.
  • [56] T. Yao, Y. Pan, Y. Li, Z. Qiu, and T. Mei. Boosting image captioning with attributes. arXiv preprint arXiv:1611.01646, 2016.
  • [57] M. Ye and Y. Guo. Zero-shot classification with discriminative semantic representation learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7140–7148, 2017.
  • [58] Q. You, H. Jin, Z. Wang, C. Fang, and J. Luo. Image captioning with semantic attention. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.
  • [59] P. Young, A. Lai, M. Hodosh, and J. Hockenmaier. From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions. 2014.
  • [60] F. Yu, A. Seff, Y. Zhang, S. Song, T. Funkhouser, and J. Xiao. Lsun: Construction of a large-scale image dataset using deep learning with humans in the loop. arXiv preprint arXiv:1506.03365, 2015.
  • [61] X. Yu and Y. Aloimonos. Attribute-based transfer learning for object categorization with zero/one training example. Computer Vision–ECCV 2010, pages 127–140, 2010.
  • [62] L. Zhang, T. Xiang, and S. Gong. Learning a deep embedding model for zero-shot learning. CVPR, 2017.
  • [63] Z. Zhang and V. Saligrama. Zero-shot recognition via structured prediction. In ECCV, 2016.
  • [64] B. Zhao, B. Wu, T. Wu, and Y. Wang. Zero-shot learning posed as a missing data problem. In Proceedings of ICCV Workshop, pages 2616–2622, 2017.